Pure IDs (5-Core)

  1. Pros of 5-Core: higher-quality reviews, reduced noise, a more balanced distribution, and better computational efficiency.

  2. Cons of 5-Core: limited diversity, misalignment with the original data distribution, loss of context, reduced generalizability, and limited data size for scaling up.
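
As a rough illustration of what “5-core” processing usually means in recommendation datasets, the sketch below repeatedly removes users and items with fewer than k interactions until nothing more is removed. This is the common definition and is an assumption here; the exact procedure used to build these files is not described in this section. The function name k_core and the toy data are hypothetical.

```python
from collections import Counter


def k_core(interactions, k=5):
    """Iteratively drop users and items that have fewer than k interactions.

    `interactions` is a list of (user_id, item_id) pairs, assumed to be
    de-duplicated already. Hypothetical helper, not the dataset's own code.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):  # nothing removed: fixed point reached
            return kept
        current = kept


# Toy data with k=2 so the effect is visible on a handful of rows.
pairs = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"), ("u3", "c"),
]
core = k_core(pairs, k=2)
```

In the toy data, item c has a single interaction and is dropped first; that leaves u3 with one interaction, so u3 is dropped in the next pass. This cascading removal is why filtering has to be repeated rather than applied once.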

Before Data Splitting

We provide pure ID files that have been processed with “5-core” filtering and “de-duplication”. They can be loaded as follows.

import pandas as pd

file = "All_Beauty.csv"  # path to a downloaded 5-core file
df = pd.read_csv(file)

Each file is a CSV with the following header and row format:

user_id,parent_asin,rating,timestamp
0AGKHLEW2SOWHNMFQIJGBECAF7INQ,B081TJ8YS3,4.0,1588615855070
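
To make the file layout concrete, here is a minimal sketch that reads rows in this format and derives the three counts reported in the Statistics section. It assumes that #User, #Item, and #Rating mean distinct user_id values, distinct parent_asin values, and the number of rows, and that the 13-digit timestamp is Unix time in milliseconds. The first data row is the sample shown above; the other two rows (USER_A, ITEM_X) are made up for illustration. The sketch uses only the standard library so it runs without a downloaded file.

```python
import csv
import io
from datetime import datetime, timezone

# First data row is the sample from the text; the last two rows are made up.
sample = """user_id,parent_asin,rating,timestamp
0AGKHLEW2SOWHNMFQIJGBECAF7INQ,B081TJ8YS3,4.0,1588615855070
USER_A,ITEM_X,5.0,1588615855071
USER_A,B081TJ8YS3,3.0,1588615855072
"""

rows = list(csv.DictReader(io.StringIO(sample)))

n_users = len({row["user_id"] for row in rows})      # distinct users
n_items = len({row["parent_asin"] for row in rows})  # distinct items
n_ratings = len(rows)                                # one rating per row

# Assumption: timestamps are Unix time in milliseconds.
first_ts_ms = int(rows[0]["timestamp"])
first_dt = datetime.fromtimestamp(first_ts_ms / 1000, tz=timezone.utc)
```

With the pandas DataFrame loaded above, the same counts would be df["user_id"].nunique(), df["parent_asin"].nunique(), and len(df).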

Statistics

Note

The 5-core data covers 28 subcategories, fewer than the original 33, because some categories become empty after 5-core processing.

| Category | #User | #Item | #Rating | Download |
| --- | --- | --- | --- | --- |
| All_Beauty | 253 | 356 | 2.5K | link |
| Arts_Crafts_and_Sewing | 197.3K | 90.0K | 1.8M | link |
| Automotive | 632.1K | 267.3K | 6.1M | link |
| Baby_Products | 150.8K | 36.0K | 1.2M | link |
| Beauty_and_Personal_Care | 729.6K | 207.6K | 6.6M | link |
| Books | 776.4K | 495.1K | 9.5M | link |
| CDs_and_Vinyl | 123.9K | 89.4K | 1.6M | link |
| Cell_Phones_and_Accessories | 381.0K | 111.5K | 2.8M | link |
| Clothing_Shoes_and_Jewelry | 2.5M | 715.0K | 23.1M | link |
| Electronics | 1.6M | 368.2K | 15.5M | link |
| Gift_Cards | 377 | 129 | 2.4K | link |
| Grocery_and_Gourmet_Food | 404.8K | 132.9K | 3.9M | link |
| Health_and_Household | 796.1K | 184.3K | 7.2M | link |
| Home_and_Kitchen | 2.9M | 763.6K | 28.2M | link |
| Industrial_and_Scientific | 51.0K | 25.8K | 412.9K | link |
| Kindle_Store | 892.2K | 466.6K | 16.1M | link |
| Magazine_Subscriptions | 30 | 21 | 176 | link |
| Movies_and_TV | 657.2K | 197.9K | 7.4M | link |
| Musical_Instruments | 57.4K | 24.6K | 511.8K | link |
| Office_Products | 223.3K | 77.6K | 1.8M | link |
| Patio_Lawn_and_Garden | 416.6K | 133.4K | 3.4M | link |
| Pet_Supplies | 594.8K | 114.6K | 5.3M | link |
| Software | 146.4K | 17.6K | 1.3M | link |
| Sports_and_Outdoors | 409.8K | 156.2K | 3.5M | link |
| Tools_and_Home_Improvement | 842.0K | 274.7K | 7.9M | link |
| Toys_and_Games | 432.3K | 162.0K | 3.9M | link |
| Video_Games | 94.8K | 25.6K | 814.6K | link |
| Unknown | 1.6M | 694.3K | 14.6M | link |

After Data Splitting

We provide several commonly used data-splitting strategies on the pure ID files to encourage benchmarking. The history field in the processed files lists the user's interactions that occurred before the current row's timestamp.

import pandas as pd

file = "All_Beauty.test.csv"  # path to a downloaded split file
df = pd.read_csv(file)

Each split file is a CSV with the following header and row format:

user_id,parent_asin,rating,timestamp,history
AFZUK3MTBIBEDQOPAK3OATUOUKLA,B0BFR5WF1R,1.0,1675826333052,B0020MKBNW B082FLP15V
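
Judging from the sample row, the history field appears to be a space-separated list of parent_asin values. A minimal sketch for turning it into a Python list follows. The helper parse_history is hypothetical, and the second row (an interaction with an empty history) is made up to show how an empty field could be handled; whether the released files contain such rows is not stated here.

```python
import csv
import io

# First data row is the sample from the text; the second row is made up.
sample = """user_id,parent_asin,rating,timestamp,history
AFZUK3MTBIBEDQOPAK3OATUOUKLA,B0BFR5WF1R,1.0,1675826333052,B0020MKBNW B082FLP15V
USER_WITH_NO_HISTORY,ITEM_X,5.0,1675826333053,
"""


def parse_history(field):
    """Split a space-separated history string into a list of item IDs."""
    return field.split() if field else []


rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    row["history"] = parse_history(row["history"])
```

When loading with pandas instead, note that read_csv turns empty fields into NaN by default, so an equivalent helper would need to check that the value is a string before splitting it.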

Leave-Last-Out Splitting

A data-splitting strategy that holds out each user's latest two item interactions for evaluation. This strategy is widely used in recommendation papers.

Specifically, given a chronological user interaction sequence of length N:

  1. Training part: the first N-2 items;

  2. Validation part: the (N-1)-th item;

  3. Testing part: the N-th item.
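
The three rules above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used to produce the released files: the function name leave_last_out, the tuple layout, and the way history is attached to each record are all hypothetical, and every user is assumed to have at least three interactions.

```python
from collections import defaultdict


def leave_last_out(interactions):
    """Split (user_id, item_id, timestamp) triples per user.

    Returns (train, valid, test), each a list of
    (user_id, item_id, timestamp, history) tuples, where history holds the
    user's earlier item IDs in chronological order.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, valid, test = [], [], []
    for user, events in by_user.items():
        events.sort()  # chronological order (ties broken by item ID)
        items = [item for _, item in events]
        n = len(events)
        for pos, (ts, item) in enumerate(events):
            record = (user, item, ts, items[:pos])
            if pos == n - 1:      # the N-th item
                test.append(record)
            elif pos == n - 2:    # the (N-1)-th item
                valid.append(record)
            else:                 # the first N-2 items
                train.append(record)
    return train, valid, test


# Toy sequence of length N = 5 for a single user.
toy = [("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3),
       ("u1", "d", 4), ("u1", "e", 5)]
train, valid, test = leave_last_out(toy)
```

Under this scheme each user contributes exactly one validation row and one test row, which matches the table below, where the valid and test counts equal #User for every category.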

| Category | #User | #Item | #Rating (train / valid / test) | Download |
| --- | --- | --- | --- | --- |
| All_Beauty | 253 | 356 | 2.0K / 253 / 253 | train / valid / test |
| Arts_Crafts_and_Sewing | 197.3K | 90.0K | 1.4M / 197.3K / 197.3K | train / valid / test |
| Automotive | 632.1K | 267.3K | 4.8M / 632.1K / 632.1K | train / valid / test |
| Baby_Products | 150.8K | 36.0K | 939.5K / 150.8K / 150.8K | train / valid / test |
| Beauty_and_Personal_Care | 729.6K | 207.6K | 5.2M / 729.6K / 729.6K | train / valid / test |
| Books | 776.4K | 495.1K | 7.9M / 776.4K / 776.4K | train / valid / test |
| CDs_and_Vinyl | 123.9K | 89.4K | 1.3M / 123.9K / 123.9K | train / valid / test |
| Cell_Phones_and_Accessories | 381.0K | 111.5K | 2.0M / 381.0K / 381.0K | train / valid / test |
| Clothing_Shoes_and_Jewelry | 2.5M | 715.0K | 18.1M / 2.5M / 2.5M | train / valid / test |
| Electronics | 1.6M | 368.2K | 12.2M / 1.6M / 1.6M | train / valid / test |
| Gift_Cards | 377 | 129 | 1.7K / 377 / 377 | train / valid / test |
| Grocery_and_Gourmet_Food | 404.8K | 132.9K | 3.1M / 404.8K / 404.8K | train / valid / test |
| Health_and_Household | 796.1K | 184.3K | 5.6M / 796.1K / 796.1K | train / valid / test |
| Home_and_Kitchen | 2.9M | 763.6K | 22.4M / 2.9M / 2.9M | train / valid / test |
| Industrial_and_Scientific | 51.0K | 25.8K | 311.0K / 51.0K / 51.0K | train / valid / test |
| Kindle_Store | 892.2K | 466.6K | 14.3M / 892.2K / 892.2K | train / valid / test |
| Magazine_Subscriptions | 30 | 21 | 116 / 30 / 30 | train / valid / test |
| Movies_and_TV | 657.2K | 197.9K | 6.1M / 657.2K / 657.2K | train / valid / test |
| Musical_Instruments | 57.4K | 24.6K | 397.0K / 57.4K / 57.4K | train / valid / test |
| Office_Products | 223.3K | 77.6K | 1.4M / 223.3K / 223.3K | train / valid / test |
| Patio_Lawn_and_Garden | 416.6K | 133.4K | 2.6M / 416.6K / 416.6K | train / valid / test |
| Pet_Supplies | 594.8K | 114.6K | 4.1M / 594.8K / 594.8K | train / valid / test |
| Software | 146.4K | 17.6K | 984.0K / 146.4K / 146.4K | train / valid / test |
| Sports_and_Outdoors | 409.8K | 156.2K | 2.7M / 409.8K / 409.8K | train / valid / test |
| Tools_and_Home_Improvement | 842.0K | 274.7K | 6.2M / 842.0K / 842.0K | train / valid / test |
| Toys_and_Games | 432.3K | 162.0K | 3.0M / 432.3K / 432.3K | train / valid / test |
| Video_Games | 94.8K | 25.6K | 625.1K / 94.8K / 94.8K | train / valid / test |
| Unknown | 1.6M | 694.3K | 11.4M / 1.6M / 1.6M | train / valid / test |

Absolute-Timestamp Splitting

A data-splitting strategy that uses fixed absolute timestamps to cut item sequences into training and evaluation parts. This strategy aligns with real-world scenarios but is not widely used in research. Researchers are encouraged to experiment with it.

Specifically, given a chronological user interaction sequence of length N:

  1. Training part: item interactions with timestamp range (-∞, t_1);

  2. Validation part: item interactions with timestamp range [t_1, t_2);

  3. Testing part: item interactions with timestamp range [t_2, +∞).

Here, we set t_1 = 1628643414042, and t_2 = 1658002729837.
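
The three timestamp ranges can be sketched as a simple bucketing function. This is an illustration under stated assumptions rather than the code behind the released files: the names split_by_timestamp and to_utc and the toy records are hypothetical, and the timestamps are assumed to be Unix time in milliseconds. Under that assumption, t_1 falls on 2021-08-11 and t_2 on 2022-07-16 (UTC).

```python
from datetime import datetime, timezone

T1 = 1628643414042  # t_1 from the text
T2 = 1658002729837  # t_2 from the text


def split_by_timestamp(interactions, t1=T1, t2=T2):
    """Bucket (user_id, item_id, timestamp) triples by absolute time.

    train: timestamp in (-inf, t1); valid: [t1, t2); test: [t2, +inf).
    """
    train, valid, test = [], [], []
    for record in interactions:
        ts = record[2]
        if ts < t1:
            train.append(record)
        elif ts < t2:
            valid.append(record)
        else:
            test.append(record)
    return train, valid, test


def to_utc(ms):
    """Convert Unix milliseconds (an assumption) to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Toy records placed right at the boundaries to show the half-open ranges.
toy = [("u1", "a", T1 - 1), ("u1", "b", T1),
       ("u1", "c", T2 - 1), ("u1", "d", T2)]
train, valid, test = split_by_timestamp(toy)
```

Unlike leave-last-out, this split does not guarantee that every user appears in the validation or test part, which is why the valid and test counts in the table below differ from #User.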

| Category | #User | #Item | #Rating (train / valid / test) | Download |
| --- | --- | --- | --- | --- |
| All_Beauty | 253 | 356 | 2.2K / 276 / 22 | train / valid / test |
| Arts_Crafts_and_Sewing | 197.3K | 90.0K | 1.4M / 192.8K / 210.8K | train / valid / test |
| Automotive | 632.1K | 267.3K | 4.8M / 649.2K / 604.7K | train / valid / test |
| Baby_Products | 150.8K | 36.0K | 1.1M / 74.2K / 99.5K | train / valid / test |
| Beauty_and_Personal_Care | 729.6K | 207.6K | 4.8M / 773.6K / 1.0M | train / valid / test |
| Books | 776.4K | 495.1K | 8.7M / 426.2K / 328.2K | train / valid / test |
| CDs_and_Vinyl | 123.9K | 89.4K | 1.5M / 31.1K / 21.2K | train / valid / test |
| Cell_Phones_and_Accessories | 381.0K | 111.5K | 2.2M / 239.4K / 304.5K | train / valid / test |
| Clothing_Shoes_and_Jewelry | 2.5M | 715.0K | 16.8M / 3.0M / 3.3M | train / valid / test |
| Electronics | 1.6M | 368.2K | 13.1M / 1.2M / 1.2M | train / valid / test |
| Gift_Cards | 377 | 129 | 2.2K / 155 / 107 | train / valid / test |
| Grocery_and_Gourmet_Food | 404.8K | 132.9K | 3.0M / 516.5K / 437.7K | train / valid / test |
| Health_and_Household | 796.1K | 184.3K | 5.4M / 821.5K / 915.7K | train / valid / test |
| Home_and_Kitchen | 2.9M | 763.6K | 21.4M / 3.3M / 3.6M | train / valid / test |
| Industrial_and_Scientific | 51.0K | 25.8K | 297.9K / 44.1K / 70.9K | train / valid / test |
| Kindle_Store | 892.2K | 466.6K | 13.8M / 1.3M / 1.0M | train / valid / test |
| Magazine_Subscriptions | 30 | 21 | 166 / 9 / 1 | train / valid / test |
| Movies_and_TV | 657.2K | 197.9K | 7.1M / 204.8K / 146.0K | train / valid / test |
| Musical_Instruments | 57.4K | 24.6K | 428.0K / 41.3K / 42.6K | train / valid / test |
| Office_Products | 223.3K | 77.6K | 1.4M / 140.7K / 218.8K | train / valid / test |
| Patio_Lawn_and_Garden | 416.6K | 133.4K | 2.5M / 432.0K / 461.8K | train / valid / test |
| Pet_Supplies | 594.8K | 114.6K | 4.1M / 551.7K / 588.1K | train / valid / test |
| Software | 146.4K | 17.6K | 1.2M / 35.4K / 19.4K | train / valid / test |
| Sports_and_Outdoors | 409.8K | 156.2K | 2.9M / 260.3K / 344.0K | train / valid / test |
| Tools_and_Home_Improvement | 842.0K | 274.7K | 6.0M / 833.9K / 1.0M | train / valid / test |
| Toys_and_Games | 432.3K | 162.0K | 3.1M / 306.9K / 440.0K | train / valid / test |
| Video_Games | 94.8K | 25.6K | 736.8K / 34.5K / 43.2K | train / valid / test |
| Unknown | 1.6M | 694.3K | 14.0M / 357.3K / 278.7K | train / valid / test |