Pure IDs (0-Core)
Pros of 0-Core: Comprehensive data, Maximum diversity, Rich context, Potential for generalization.
Cons of 0-Core: Data noise, Large size and high resource demands, Quality variability, Imbalanced distribution.
Before Data Splitting
We provide the pure ID files after “0-core” and “de-duplication” processing, one per category, as listed below.
import pandas as pd
file = "All_Beauty.csv"  # e.g., path to a downloaded pure ID file
df = pd.read_csv(file)
user_id,parent_asin,rating,timestamp
0AGKHLEW2SOWHNMFQIJGBECAF7INQ,B081TJ8YS3,4.0,1588615855070
Statistics
Category | #User | #Item | #Rating | Download |
---|---|---|---|---|
All_Beauty | 632.0K | 112.6K | 693.9K | |
Amazon_Fashion | 2.0M | 825.9K | 2.5M | |
Appliances | 1.8M | 94.3K | 2.1M | |
Arts_Crafts_and_Sewing | 4.6M | 801.3K | 8.8M | |
Automotive | 8.0M | 2.0M | 19.7M | |
Baby_Products | 3.4M | 217.7K | 6.0M | |
Beauty_and_Personal_Care | 11.3M | 1.0M | 23.6M | |
Books | 10.3M | 4.4M | 29.1M | |
CDs_and_Vinyl | 1.8M | 701.7K | 4.8M | |
Cell_Phones_and_Accessories | 11.6M | 1.3M | 20.6M | |
Clothing_Shoes_and_Jewelry | 22.6M | 7.2M | 65.2M | |
Digital_Music | 101.0K | 70.5K | 128.8K | |
Electronics | 18.3M | 1.6M | 43.4M | |
Gift_Cards | 132.7K | 1.1K | 149.9K | |
Grocery_and_Gourmet_Food | 7.0M | 603.2K | 14.1M | |
Handmade_Products | 586.6K | 164.7K | 656.1K | |
Health_and_Household | 12.5M | 797.4K | 25.3M | |
Health_and_Personal_Care | 461.7K | 60.3K | 488.2K | |
Home_and_Kitchen | 23.2M | 3.7M | 66.6M | |
Industrial_and_Scientific | 3.4M | 427.5K | 5.1M | |
Kindle_Store | 5.6M | 1.6M | 25.3M | |
Magazine_Subscriptions | 60.1K | 3.4K | 70.9K | |
Movies_and_TV | 6.5M | 747.8K | 17.2M | |
Musical_Instruments | 1.8M | 213.6K | 3.0M | |
Office_Products | 7.6M | 710.4K | 12.7M | |
Patio_Lawn_and_Garden | 8.6M | 851.7K | 16.3M | |
Pet_Supplies | 7.8M | 492.7K | 16.6M | |
Software | 2.6M | 89.2K | 4.8M | |
Sports_and_Outdoors | 10.3M | 1.6M | 19.3M | |
Subscription_Boxes | 15.2K | 641 | 16.0K | |
Tools_and_Home_Improvement | 12.2M | 1.5M | 26.6M | |
Toys_and_Games | 8.1M | 890.7K | 16.1M | |
Video_Games | 2.8M | 137.2K | 4.6M | |
Unknown | 23.1M | 13.2M | 63.1M | |
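The three counts in the table above (#User, #Item, #Rating) could be recomputed from a downloaded pure ID file roughly as follows. This is only a sketch: the sample rows and the helper name `count_stats` are made up for illustration, and it uses the standard library so that it runs without pandas.

```python
import csv
import io

# Hypothetical three-row sample in the pure ID column layout (IDs are made up).
SAMPLE_CSV = """user_id,parent_asin,rating,timestamp
USER_A,ITEM_1,4.0,1588615855070
USER_A,ITEM_2,5.0,1588615999000
USER_B,ITEM_1,3.0,1600000000000
"""


def count_stats(rows):
    """Count distinct users, distinct items, and total ratings."""
    users, items, n_ratings = set(), set(), 0
    for row in rows:
        users.add(row["user_id"])
        items.add(row["parent_asin"])
        n_ratings += 1
    return {"#User": len(users), "#Item": len(items), "#Rating": n_ratings}


stats = count_stats(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(stats)  # {'#User': 2, '#Item': 2, '#Rating': 3}
```

With the DataFrame loaded earlier, the same numbers should be available as `df["user_id"].nunique()`, `df["parent_asin"].nunique()`, and `len(df)`.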
After Data Splitting
We provide some commonly used data splitting strategies on the pure ID files to encourage benchmarking. The history field in the processed files lists the items the user interacted with before the current timestamp, as space-separated item IDs.
import pandas as pd
file = "All_Beauty.test.csv"  # e.g., path to a downloaded split file
df = pd.read_csv(file)
user_id,parent_asin,rating,timestamp,history
AFZUK3MTBIBEDQOPAK3OATUOUKLA,B0BFR5WF1R,1.0,1675826333052,B0020MKBNW B082FLP15V
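The history column in the example row above is a single string, so it usually needs to be split into item IDs before use. A minimal sketch follows; the helper name `parse_history` is hypothetical, and treating a blank or missing cell as an empty history is an assumption rather than documented behavior.

```python
import csv
import io

# Header and example row as shown above for a processed split file.
SAMPLE_CSV = """user_id,parent_asin,rating,timestamp,history
AFZUK3MTBIBEDQOPAK3OATUOUKLA,B0BFR5WF1R,1.0,1675826333052,B0020MKBNW B082FLP15V
"""


def parse_history(field):
    """Turn the space-separated history string into a list of item IDs.

    Non-string values (e.g. NaN, which pandas produces for a blank cell)
    and empty strings yield an empty list; that handling is an assumption.
    """
    if not isinstance(field, str):
        return []
    return field.split()


row = next(csv.DictReader(io.StringIO(SAMPLE_CSV)))
history = parse_history(row["history"])
print(history)  # ['B0020MKBNW', 'B082FLP15V']
```

On a DataFrame, the same helper could be applied with `df["history"].apply(parse_history)`.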
Leave-Last-Out Splitting
Note
The test and validation splits may be larger than the training split. A user with only one interaction contributes that interaction to the test split, and a user with only two interactions contributes one to the validation split and one to the test split. Neither kind of user appears in the training split.
This strategy holds out each user's latest two item interactions for evaluation and is widely used in recommendation papers.
Specifically, given a chronological user interaction sequence of length N:
Training part: the first N-2 items;
Validation part: the (N-1)-th item;
Testing part: the N-th item.
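The three parts listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used to produce the released files: the function name `leave_last_out` and the toy rows are hypothetical, and the order of interactions that share a timestamp is not specified by this page.

```python
from collections import defaultdict


def leave_last_out(interactions):
    """Split (user_id, parent_asin, rating, timestamp) tuples per user.

    The last item goes to test, the second-to-last to validation,
    and everything before that to training.
    """
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row[0]].append(row)

    train, valid, test = [], [], []
    for rows in by_user.values():
        rows = sorted(rows, key=lambda r: r[3])  # chronological order
        test.append(rows[-1])
        if len(rows) >= 2:
            valid.append(rows[-2])
        train.extend(rows[:-2])  # empty for users with 1 or 2 interactions
    return train, valid, test


# Toy data (made up): u1 has 3 interactions, u2 has 1, u3 has 2.
toy = [
    ("u1", "i1", 5.0, 30), ("u1", "i2", 4.0, 10), ("u1", "i3", 3.0, 20),
    ("u2", "i1", 2.0, 5),
    ("u3", "i2", 1.0, 7), ("u3", "i4", 5.0, 9),
]
train, valid, test = leave_last_out(toy)
print(len(train), len(valid), len(test))  # 1 2 3
```

The toy output also shows how the test split ends up larger than the training split when many users have short sequences: every user contributes exactly one test row. With a DataFrame, `df.itertuples(index=False)` should yield rows in the tuple layout this sketch expects.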
Category | #User | #Item | #Rating (train / valid / test) | Download |
---|---|---|---|---|
All_Beauty | 632.0K | 112.6K | 19.1K / 42.8K / 632.0K | |
Amazon_Fashion | 2.0M | 825.9K | 157.5K / 281.4K / 2.0M | |
Appliances | 1.8M | 94.3K | 104.8K / 243.4K / 1.8M | |
Arts_Crafts_and_Sewing | 4.6M | 801.3K | 2.8M / 1.4M / 4.6M | |
Automotive | 8.0M | 2.0M | 8.4M / 3.3M / 8.0M | |
Baby_Products | 3.4M | 217.7K | 1.6M / 1.0M / 3.4M | |
Beauty_and_Personal_Care | 11.3M | 1.0M | 8.0M / 4.2M / 11.3M | |
Books | 10.3M | 4.4M | 14.5M / 4.3M / 10.3M | |
CDs_and_Vinyl | 1.8M | 701.7K | 2.4M / 648.8K / 1.8M | |
Cell_Phones_and_Accessories | 11.6M | 1.3M | 4.9M / 4.1M / 11.6M | |
Clothing_Shoes_and_Jewelry | 22.6M | 7.2M | 31.1M / 11.5M / 22.6M | |
Digital_Music | 101.0K | 70.5K | 15.8K / 12.0K / 101.0K | |
Electronics | 18.3M | 1.6M | 17.2M / 7.8M / 18.3M | |
Gift_Cards | 132.7K | 1.1K | 5.7K / 11.4K / 132.7K | |
Grocery_and_Gourmet_Food | 7.0M | 603.2K | 4.7M / 2.4M / 7.0M | |
Handmade_Products | 586.6K | 164.7K | 19.1K / 50.4K / 586.6K | |
Health_and_Household | 12.5M | 797.4K | 8.2M / 4.5M / 12.5M | |
Health_and_Personal_Care | 461.7K | 60.3K | 8.1K / 18.4K / 461.7K | |
Home_and_Kitchen | 23.2M | 3.7M | 31.7M / 11.7M / 23.2M | |
Industrial_and_Scientific | 3.4M | 427.5K | 921.3K / 759.3K / 3.4M | |
Kindle_Store | 5.6M | 1.6M | 16.9M / 2.7M / 5.6M | |
Magazine_Subscriptions | 60.1K | 3.4K | 4.0K / 6.8K / 60.1K | |
Movies_and_TV | 6.5M | 747.8K | 8.0M / 2.7M / 6.5M | |
Musical_Instruments | 1.8M | 213.6K | 769.6K / 443.2K / 1.8M | |
Office_Products | 7.6M | 710.4K | 2.9M / 2.2M / 7.6M | |
Patio_Lawn_and_Garden | 8.6M | 851.7K | 4.7M / 3.0M / 8.6M | |
Pet_Supplies | 7.8M | 492.7K | 5.9M / 3.0M / 7.8M | |
Software | 2.6M | 89.2K | 1.4M / 826.8K / 2.6M | |
Sports_and_Outdoors | 10.3M | 1.6M | 5.5M / 3.5M / 10.3M | |
Subscription_Boxes | 15.2K | 641 | 144 / 572 / 15.2K | |
Tools_and_Home_Improvement | 12.2M | 1.5M | 9.7M / 4.7M / 12.2M | |
Toys_and_Games | 8.1M | 890.7K | 5.1M / 2.8M / 8.1M | |
Video_Games | 2.8M | 137.2K | 1.0M / 743.8K / 2.8M | |
Unknown | 23.1M | 13.2M | 28.5M / 11.5M / 23.1M | |
Absolute-Timestamp Splitting
This strategy cuts each user's item sequence at fixed absolute timestamps, so that all training interactions precede all evaluation interactions in time. It aligns with real-world scenarios but is not widely used in research, and researchers are encouraged to experiment with it.
Specifically, given a chronological user interaction sequence:
Training part: item interactions with timestamp range (-∞, t_1);
Validation part: item interactions with timestamp range [t_1, t_2);
Testing part: item interactions with timestamp range [t_2, +∞).
Here, we set t_1 = 1628643414042, and t_2 = 1658002729837.
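The half-open ranges above translate directly into two comparisons per interaction. The sketch below is illustrative only: the function names and toy rows are hypothetical, and the conversion to calendar dates assumes the timestamps are Unix epoch milliseconds, which the 13-digit values suggest but this page does not state.

```python
from datetime import datetime, timezone

T1 = 1628643414042  # t_1 from the text above
T2 = 1658002729837  # t_2 from the text above


def split_by_timestamp(interactions, t1=T1, t2=T2):
    """Assign (user_id, parent_asin, rating, timestamp) tuples to splits.

    train: timestamp < t1; valid: t1 <= timestamp < t2; test: timestamp >= t2.
    """
    train, valid, test = [], [], []
    for row in interactions:
        ts = row[3]
        if ts < t1:
            train.append(row)
        elif ts < t2:
            valid.append(row)
        else:
            test.append(row)
    return train, valid, test


def to_utc(ms):
    """Convert epoch milliseconds to a UTC datetime (unit is an assumption)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Toy rows (made up) placed right at the boundaries to show the closed/open ends.
toy = [
    ("u1", "i1", 5.0, T1 - 1),
    ("u1", "i2", 4.0, T1),
    ("u2", "i3", 3.0, T2 - 1),
    ("u2", "i4", 2.0, T2),
]
train, valid, test = split_by_timestamp(toy)
print(len(train), len(valid), len(test))  # 1 2 1
print(to_utc(T1).isoformat(), to_utc(T2).isoformat())
```

Unlike leave-last-out, a single user's interactions may land in only one or two of the splits here, because the cut points are global rather than per user.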
Category | #User | #Item | #Rating (train / valid / test) | Download |
---|---|---|---|---|
All_Beauty | 632.0K | 112.6K | 583.2K / 71.8K / 39.0K | |
Amazon_Fashion | 2.0M | 825.9K | 2.2M / 194.8K / 108.4K | |
Appliances | 1.8M | 94.3K | 1.6M / 260.5K / 268.8K | |
Arts_Crafts_and_Sewing | 4.6M | 801.3K | 6.7M / 1.1M / 1.1M | |
Automotive | 8.0M | 2.0M | 14.9M / 2.4M / 2.4M | |
Baby_Products | 3.4M | 217.7K | 4.9M / 517.4K / 550.4K | |
Beauty_and_Personal_Care | 11.3M | 1.0M | 17.2M / 3.0M / 3.5M | |
Books | 10.3M | 4.4M | 26.2M / 1.6M / 1.4M | |
CDs_and_Vinyl | 1.8M | 701.7K | 4.5M / 138.8K / 105.9K | |
Cell_Phones_and_Accessories | 11.6M | 1.3M | 16.3M / 2.1M / 2.1M | |
Clothing_Shoes_and_Jewelry | 22.6M | 7.2M | 48.8M / 8.1M / 8.3M | |
Digital_Music | 101.0K | 70.5K | 112.8K / 8.2K / 7.8K | |
Electronics | 18.3M | 1.6M | 35.3M / 4.0M / 4.1M | |
Gift_Cards | 132.7K | 1.1K | 123.6K / 14.0K / 12.4K | |
Grocery_and_Gourmet_Food | 7.0M | 603.2K | 10.4M / 1.9M / 1.7M | |
Handmade_Products | 586.6K | 164.7K | 457.1K / 94.1K / 104.9K | |
Health_and_Household | 12.5M | 797.4K | 18.7M / 3.2M / 3.3M | |
Health_and_Personal_Care | 461.7K | 60.3K | 413.3K / 45.4K / 29.4K | |
Home_and_Kitchen | 23.2M | 3.7M | 49.5M / 8.4M / 8.7M | |
Industrial_and_Scientific | 3.4M | 427.5K | 3.8M / 690.9K / 639.4K | |
Kindle_Store | 5.6M | 1.6M | 22.0M / 1.7M / 1.5M | |
Magazine_Subscriptions | 60.1K | 3.4K | 67.7K / 2.1K / 1.1K | |
Movies_and_TV | 6.5M | 747.8K | 16.1M / 590.9K / 494.7K | |
Musical_Instruments | 1.8M | 213.6K | 2.4M / 266.8K / 274.2K | |
Office_Products | 7.6M | 710.4K | 9.9M / 1.4M / 1.4M | |
Patio_Lawn_and_Garden | 8.6M | 851.7K | 11.6M / 2.4M / 2.3M | |
Pet_Supplies | 7.8M | 492.7K | 12.4M / 2.1M / 2.1M | |
Software | 2.6M | 89.2K | 4.6M / 140.7K / 94.1K | |
Sports_and_Outdoors | 10.3M | 1.6M | 15.7M / 1.8M / 1.8M | |
Subscription_Boxes | 15.2K | 641 | 12.6K / 2.1K / 1.2K | |
Tools_and_Home_Improvement | 12.2M | 1.5M | 19.9M / 3.3M / 3.4M | |
Toys_and_Games | 8.1M | 890.7K | 12.4M / 1.7M / 1.9M | |
Video_Games | 2.8M | 137.2K | 3.8M / 344.6K / 363.9K | |
Unknown | 23.1M | 13.2M | 58.0M / 3.0M / 2.1M | |