Pure IDs (5-Core)
Pros of 5-core: higher-quality reviews, reduced noise, a more balanced distribution, and better computational efficiency.
Cons of 5-core: limited diversity, misalignment with the original data distribution, loss of context, weaker generalizability, and limited data size for scaling up.
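The page does not spell out how the 5-core subset is built. The sketch below assumes one common definition of k-core filtering: repeatedly drop users and items with fewer than k interactions until nothing more is removed. The function name `k_core` and the toy data are hypothetical, and the released files may have been produced differently.

```python
from collections import Counter

def k_core(interactions, k=5):
    """Keep only (user, item) pairs whose user and item both have at
    least k interactions, repeating until no more rows are dropped.
    Assumed definition, not necessarily the one used for the released files."""
    rows = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in rows)
        item_counts = Counter(i for _, i in rows)
        kept = [(u, i) for u, i in rows
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(rows):  # fixed point: nothing left to drop
            return kept
        rows = kept

# Toy data: users u1/u2 and items a/b form a 2-core; u3 has one interaction.
toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a")]
core = k_core(toy, k=2)
```

With a loaded DataFrame, the input pairs could be built with `zip(df["user_id"], df["parent_asin"])`. The loop is needed because removing a sparse user can push an item below the threshold, and vice versa.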
Before Data Splitting
We provide the pure ID files after “5-core” and “de-duplication” processing. Each file is a CSV that can be loaded as shown below.
import pandas as pd
file = "All_Beauty.csv"  # e.g., the 5-core file for All_Beauty
df = pd.read_csv(file)
user_id,parent_asin,rating,timestamp
0AGKHLEW2SOWHNMFQIJGBECAF7INQ,B081TJ8YS3,4.0,1588615855070
Statistics
Note: Only 28 of the 33 categories appear in the 5-core data, because the others become empty after 5-core processing.
Category | #User | #Item | #Rating
---|---|---|---
All_Beauty | 253 | 356 | 2.5K
Arts_Crafts_and_Sewing | 197.3K | 90.0K | 1.8M
Automotive | 632.1K | 267.3K | 6.1M
Baby_Products | 150.8K | 36.0K | 1.2M
Beauty_and_Personal_Care | 729.6K | 207.6K | 6.6M
Books | 776.4K | 495.1K | 9.5M
CDs_and_Vinyl | 123.9K | 89.4K | 1.6M
Cell_Phones_and_Accessories | 381.0K | 111.5K | 2.8M
Clothing_Shoes_and_Jewelry | 2.5M | 715.0K | 23.1M
Electronics | 1.6M | 368.2K | 15.5M
Gift_Cards | 377 | 129 | 2.4K
Grocery_and_Gourmet_Food | 404.8K | 132.9K | 3.9M
Health_and_Household | 796.1K | 184.3K | 7.2M
Home_and_Kitchen | 2.9M | 763.6K | 28.2M
Industrial_and_Scientific | 51.0K | 25.8K | 412.9K
Kindle_Store | 892.2K | 466.6K | 16.1M
Magazine_Subscriptions | 30 | 21 | 176
Movies_and_TV | 657.2K | 197.9K | 7.4M
Musical_Instruments | 57.4K | 24.6K | 511.8K
Office_Products | 223.3K | 77.6K | 1.8M
Patio_Lawn_and_Garden | 416.6K | 133.4K | 3.4M
Pet_Supplies | 594.8K | 114.6K | 5.3M
Software | 146.4K | 17.6K | 1.3M
Sports_and_Outdoors | 409.8K | 156.2K | 3.5M
Tools_and_Home_Improvement | 842.0K | 274.7K | 7.9M
Toys_and_Games | 432.3K | 162.0K | 3.9M
Video_Games | 94.8K | 25.6K | 814.6K
Unknown | 1.6M | 694.3K | 14.6M
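The #User, #Item, and #Rating figures in the table above can be recomputed from a downloaded file. This is a minimal sketch, assuming the CSV layout shown earlier (`user_id,parent_asin,rating,timestamp`); the helper name `id_stats` is hypothetical and the in-memory rows are made up to stand in for a real file.

```python
import csv
import io

def id_stats(csv_file):
    """Count distinct users, distinct items, and rating rows in a pure ID CSV."""
    users, items, n_ratings = set(), set(), 0
    for row in csv.DictReader(csv_file):
        users.add(row["user_id"])
        items.add(row["parent_asin"])
        n_ratings += 1
    return {"#User": len(users), "#Item": len(items), "#Rating": n_ratings}

# In-memory stand-in for a file such as "All_Beauty.csv" (made-up rows).
sample = io.StringIO(
    "user_id,parent_asin,rating,timestamp\n"
    "U1,A1,4.0,1588615855070\n"
    "U1,A2,5.0,1588615855071\n"
    "U2,A1,3.0,1588615855072\n"
)
stats = id_stats(sample)
```

For a real file, pass an open file handle, for example `with open("All_Beauty.csv", newline="") as f: id_stats(f)`. With pandas, `df["user_id"].nunique()`, `df["parent_asin"].nunique()`, and `len(df)` give the same three numbers.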
After Data Splitting
We provide several commonly used data-splitting strategies on the pure ID files to encourage benchmarking. The history field in the processed files lists the items the user interacted with before the current timestamp.
import pandas as pd
file = "All_Beauty.test.csv"  # e.g., the test split of All_Beauty
df = pd.read_csv(file)
user_id,parent_asin,rating,timestamp,history
AFZUK3MTBIBEDQOPAK3OATUOUKLA,B0BFR5WF1R,1.0,1675826333052,B0020MKBNW B082FLP15V
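In the sample row above, `history` appears to be a space-separated list of `parent_asin` values. The sketch below turns it into a Python list under that assumption, and also assumes that a row may have an empty history; `parse_history` is a hypothetical helper, not part of the dataset tooling.

```python
import csv
import io

def parse_history(value):
    """Split a space-separated history string into a list of item IDs.
    Empty or missing values yield an empty list."""
    return value.split() if value else []

# The sample row from above, as an in-memory stand-in for a split file.
sample = io.StringIO(
    "user_id,parent_asin,rating,timestamp,history\n"
    "AFZUK3MTBIBEDQOPAK3OATUOUKLA,B0BFR5WF1R,1.0,1675826333052,"
    "B0020MKBNW B082FLP15V\n"
)
rows = list(csv.DictReader(sample))
history = parse_history(rows[0]["history"])
```

With a pandas DataFrame, `df["history"].fillna("").str.split()` should produce the same lists column-wise.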
Leave-Last-Out Splitting
A data-splitting strategy that holds out each user's latest two item interactions for evaluation. This strategy is widely used in recommendation papers.
Specifically, given a chronological user interaction sequence of length N:
Training part: the first N-2 items;
Validation part: the (N-1)-th item;
Testing part: the N-th item.
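The three rules above can be sketched as follows. This is an illustration, not the script used to produce the released files: the function name `leave_last_out` is hypothetical, ties in timestamps are broken by item ID, and users with fewer than three interactions are kept entirely in training (neither case is specified on this page).

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Split (user, item, timestamp) triples per user:
    last item -> test, second-to-last -> valid, the rest -> train."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    train, valid, test = [], [], []
    for user, events in by_user.items():
        events.sort()  # chronological order; ties broken by item ID
        items = [item for _, item in events]
        if len(items) < 3:  # assumption: too short to split, keep in train
            train.extend((user, it) for it in items)
            continue
        train.extend((user, it) for it in items[:-2])
        valid.append((user, items[-2]))
        test.append((user, items[-1]))
    return train, valid, test

# Toy data with made-up IDs; u2's rows are deliberately out of order.
toy = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3), ("u1", "d", 4),
    ("u2", "x", 10), ("u2", "y", 30), ("u2", "z", 20),
]
train, valid, test = leave_last_out(toy)
```

With a loaded DataFrame, the triples could come from `df[["user_id", "parent_asin", "timestamp"]].itertuples(index=False, name=None)`. Because every user contributes exactly one validation and one test item, both counts equal the number of users.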
Category | #User | #Item | #Rating (train / valid / test)
---|---|---|---
All_Beauty | 253 | 356 | 2.0K / 253 / 253
Arts_Crafts_and_Sewing | 197.3K | 90.0K | 1.4M / 197.3K / 197.3K
Automotive | 632.1K | 267.3K | 4.8M / 632.1K / 632.1K
Baby_Products | 150.8K | 36.0K | 939.5K / 150.8K / 150.8K
Beauty_and_Personal_Care | 729.6K | 207.6K | 5.2M / 729.6K / 729.6K
Books | 776.4K | 495.1K | 7.9M / 776.4K / 776.4K
CDs_and_Vinyl | 123.9K | 89.4K | 1.3M / 123.9K / 123.9K
Cell_Phones_and_Accessories | 381.0K | 111.5K | 2.0M / 381.0K / 381.0K
Clothing_Shoes_and_Jewelry | 2.5M | 715.0K | 18.1M / 2.5M / 2.5M
Electronics | 1.6M | 368.2K | 12.2M / 1.6M / 1.6M
Gift_Cards | 377 | 129 | 1.7K / 377 / 377
Grocery_and_Gourmet_Food | 404.8K | 132.9K | 3.1M / 404.8K / 404.8K
Health_and_Household | 796.1K | 184.3K | 5.6M / 796.1K / 796.1K
Home_and_Kitchen | 2.9M | 763.6K | 22.4M / 2.9M / 2.9M
Industrial_and_Scientific | 51.0K | 25.8K | 311.0K / 51.0K / 51.0K
Kindle_Store | 892.2K | 466.6K | 14.3M / 892.2K / 892.2K
Magazine_Subscriptions | 30 | 21 | 116 / 30 / 30
Movies_and_TV | 657.2K | 197.9K | 6.1M / 657.2K / 657.2K
Musical_Instruments | 57.4K | 24.6K | 397.0K / 57.4K / 57.4K
Office_Products | 223.3K | 77.6K | 1.4M / 223.3K / 223.3K
Patio_Lawn_and_Garden | 416.6K | 133.4K | 2.6M / 416.6K / 416.6K
Pet_Supplies | 594.8K | 114.6K | 4.1M / 594.8K / 594.8K
Software | 146.4K | 17.6K | 984.0K / 146.4K / 146.4K
Sports_and_Outdoors | 409.8K | 156.2K | 2.7M / 409.8K / 409.8K
Tools_and_Home_Improvement | 842.0K | 274.7K | 6.2M / 842.0K / 842.0K
Toys_and_Games | 432.3K | 162.0K | 3.0M / 432.3K / 432.3K
Video_Games | 94.8K | 25.6K | 625.1K / 94.8K / 94.8K
Unknown | 1.6M | 694.3K | 11.4M / 1.6M / 1.6M
Absolute-Timestamp Splitting
A data-splitting strategy that cuts each item sequence at fixed absolute timestamps to form the training and evaluation sets. This strategy aligns with real-world scenarios but is not widely used in research. Researchers are encouraged to experiment with this splitting strategy.
Specifically, given a chronological user interaction sequence:
Training part: item interactions with timestamp range (-∞, t_1);
Validation part: item interactions with timestamp range [t_1, t_2);
Testing part: item interactions with timestamp range [t_2, +∞).
Here, we set t_1 = 1628643414042, and t_2 = 1658002729837.
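The half-open ranges above can be sketched as a simple comparison against the two cut points. This is a minimal illustration with a hypothetical function name `timestamp_split`; it assumes the timestamps are Unix times in milliseconds, which is suggested by their 13-digit form but not stated on this page.

```python
T1 = 1628643414042  # t_1 from the text (assumed to be Unix milliseconds)
T2 = 1658002729837  # t_2 from the text

def timestamp_split(interactions, t1=T1, t2=T2):
    """Assign each (user, item, timestamp) triple to train/valid/test
    using the ranges (-inf, t1), [t1, t2), and [t2, +inf)."""
    train, valid, test = [], [], []
    for row in interactions:
        ts = row[2]
        if ts < t1:
            train.append(row)
        elif ts < t2:
            valid.append(row)
        else:
            test.append(row)
    return train, valid, test

# Toy rows placed exactly on and next to the two boundaries.
toy = [
    ("u1", "a", T1 - 1),  # just before t_1 -> train
    ("u1", "b", T1),      # exactly t_1     -> valid
    ("u1", "c", T2 - 1),  # just before t_2 -> valid
    ("u1", "d", T2),      # exactly t_2     -> test
]
train, valid, test = timestamp_split(toy)
```

Unlike leave-last-out, a user can contribute any number of interactions, including none, to each part, so the validation and test counts are not tied to the number of users.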
Category | #User | #Item | #Rating (train / valid / test)
---|---|---|---
All_Beauty | 253 | 356 | 2.2K / 276 / 22
Arts_Crafts_and_Sewing | 197.3K | 90.0K | 1.4M / 192.8K / 210.8K
Automotive | 632.1K | 267.3K | 4.8M / 649.2K / 604.7K
Baby_Products | 150.8K | 36.0K | 1.1M / 74.2K / 99.5K
Beauty_and_Personal_Care | 729.6K | 207.6K | 4.8M / 773.6K / 1.0M
Books | 776.4K | 495.1K | 8.7M / 426.2K / 328.2K
CDs_and_Vinyl | 123.9K | 89.4K | 1.5M / 31.1K / 21.2K
Cell_Phones_and_Accessories | 381.0K | 111.5K | 2.2M / 239.4K / 304.5K
Clothing_Shoes_and_Jewelry | 2.5M | 715.0K | 16.8M / 3.0M / 3.3M
Electronics | 1.6M | 368.2K | 13.1M / 1.2M / 1.2M
Gift_Cards | 377 | 129 | 2.2K / 155 / 107
Grocery_and_Gourmet_Food | 404.8K | 132.9K | 3.0M / 516.5K / 437.7K
Health_and_Household | 796.1K | 184.3K | 5.4M / 821.5K / 915.7K
Home_and_Kitchen | 2.9M | 763.6K | 21.4M / 3.3M / 3.6M
Industrial_and_Scientific | 51.0K | 25.8K | 297.9K / 44.1K / 70.9K
Kindle_Store | 892.2K | 466.6K | 13.8M / 1.3M / 1.0M
Magazine_Subscriptions | 30 | 21 | 166 / 9 / 1
Movies_and_TV | 657.2K | 197.9K | 7.1M / 204.8K / 146.0K
Musical_Instruments | 57.4K | 24.6K | 428.0K / 41.3K / 42.6K
Office_Products | 223.3K | 77.6K | 1.4M / 140.7K / 218.8K
Patio_Lawn_and_Garden | 416.6K | 133.4K | 2.5M / 432.0K / 461.8K
Pet_Supplies | 594.8K | 114.6K | 4.1M / 551.7K / 588.1K
Software | 146.4K | 17.6K | 1.2M / 35.4K / 19.4K
Sports_and_Outdoors | 409.8K | 156.2K | 2.9M / 260.3K / 344.0K
Tools_and_Home_Improvement | 842.0K | 274.7K | 6.0M / 833.9K / 1.0M
Toys_and_Games | 432.3K | 162.0K | 3.1M / 306.9K / 440.0K
Video_Games | 94.8K | 25.6K | 736.8K / 34.5K / 43.2K
Unknown | 1.6M | 694.3K | 14.0M / 357.3K / 278.7K