Pure IDs (0-Core)

  1. Pros of 0-Core: comprehensive data, maximum diversity, rich context, and potential for generalization.

  2. Cons of 0-Core: data noise, large size and high resource demands, variable quality, and imbalanced distribution.

Before Data Splitting

We provide the pure ID files below. They have been processed with “0-core” filtering and “de-duplication”.

import pandas as pd

file = "All_Beauty.csv"  # path to a downloaded pure ID file
df = pd.read_csv(file)

Each file is a CSV with a header row, for example:

user_id,parent_asin,rating,timestamp
0AGKHLEW2SOWHNMFQIJGBECAF7INQ,B081TJ8YS3,4.0,1588615855070

Statistics

| Category | #User | #Item | #Rating | Download |
|---|---|---|---|---|
| All_Beauty | 632.0K | 112.6K | 693.9K | link |
| Amazon_Fashion | 2.0M | 825.9K | 2.5M | link |
| Appliances | 1.8M | 94.3K | 2.1M | link |
| Arts_Crafts_and_Sewing | 4.6M | 801.3K | 8.8M | link |
| Automotive | 8.0M | 2.0M | 19.7M | link |
| Baby_Products | 3.4M | 217.7K | 6.0M | link |
| Beauty_and_Personal_Care | 11.3M | 1.0M | 23.6M | link |
| Books | 10.3M | 4.4M | 29.1M | link |
| CDs_and_Vinyl | 1.8M | 701.7K | 4.8M | link |
| Cell_Phones_and_Accessories | 11.6M | 1.3M | 20.6M | link |
| Clothing_Shoes_and_Jewelry | 22.6M | 7.2M | 65.2M | link |
| Digital_Music | 101.0K | 70.5K | 128.8K | link |
| Electronics | 18.3M | 1.6M | 43.4M | link |
| Gift_Cards | 132.7K | 1.1K | 149.9K | link |
| Grocery_and_Gourmet_Food | 7.0M | 603.2K | 14.1M | link |
| Handmade_Products | 586.6K | 164.7K | 656.1K | link |
| Health_and_Household | 12.5M | 797.4K | 25.3M | link |
| Health_and_Personal_Care | 461.7K | 60.3K | 488.2K | link |
| Home_and_Kitchen | 23.2M | 3.7M | 66.6M | link |
| Industrial_and_Scientific | 3.4M | 427.5K | 5.1M | link |
| Kindle_Store | 5.6M | 1.6M | 25.3M | link |
| Magazine_Subscriptions | 60.1K | 3.4K | 70.9K | link |
| Movies_and_TV | 6.5M | 747.8K | 17.2M | link |
| Musical_Instruments | 1.8M | 213.6K | 3.0M | link |
| Office_Products | 7.6M | 710.4K | 12.7M | link |
| Patio_Lawn_and_Garden | 8.6M | 851.7K | 16.3M | link |
| Pet_Supplies | 7.8M | 492.7K | 16.6M | link |
| Software | 2.6M | 89.2K | 4.8M | link |
| Sports_and_Outdoors | 10.3M | 1.6M | 19.3M | link |
| Subscription_Boxes | 15.2K | 641 | 16.0K | link |
| Tools_and_Home_Improvement | 12.2M | 1.5M | 26.6M | link |
| Toys_and_Games | 8.1M | 890.7K | 16.1M | link |
| Video_Games | 2.8M | 137.2K | 4.6M | link |
| Unknown | 23.1M | 13.2M | 63.1M | link |
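The three statistics columns can be recomputed from any pure ID file. The sketch below is a minimal illustration and assumes #User, #Item, and #Rating correspond to distinct `user_id` values, distinct `parent_asin` values, and the number of rows. The helper name `dataset_stats` and the two extra sample rows are hypothetical.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a pure ID file such as "All_Beauty.csv".
# Only the first data row comes from the documentation; the other two
# rows (USER_B, ITEM_2) are made up for illustration.
sample_csv = io.StringIO(
    "user_id,parent_asin,rating,timestamp\n"
    "0AGKHLEW2SOWHNMFQIJGBECAF7INQ,B081TJ8YS3,4.0,1588615855070\n"
    "USER_B,B081TJ8YS3,5.0,1588615855071\n"
    "USER_B,ITEM_2,3.0,1588615855072\n"
)
df = pd.read_csv(sample_csv)


def dataset_stats(df: pd.DataFrame) -> dict:
    """Count distinct users, distinct items, and rating rows (hypothetical helper)."""
    return {
        "n_users": int(df["user_id"].nunique()),
        "n_items": int(df["parent_asin"].nunique()),
        "n_ratings": len(df),
    }


stats = dataset_stats(df)
print(stats)
```

For a real category, replace `sample_csv` with the path to the downloaded file.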

After Data Splitting

We provide pure ID files processed with some commonly used data-splitting strategies to encourage benchmarking. The history field in the processed files holds the user's historical interactions before the current timestamp, as a space-separated sequence of parent_asin values.

import pandas as pd

file = "All_Beauty.test.csv"  # path to a downloaded split file
df = pd.read_csv(file)

Each split file is a CSV with a header row, for example:

user_id,parent_asin,rating,timestamp,history
AFZUK3MTBIBEDQOPAK3OATUOUKLA,B0BFR5WF1R,1.0,1675826333052,B0020MKBNW B082FLP15V
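Because the history field in the example above is a single space-separated string, it usually has to be split into a list before use. The following is a minimal sketch. It assumes a user with no earlier interactions has an empty history field, which the documentation does not state. The column name `history_items` and the second sample row are hypothetical.

```python
import io

import pandas as pd

# In-memory stand-in for a split file such as "All_Beauty.test.csv".
# The first data row is the documented example; the second row is a
# made-up user whose history is assumed to be empty.
sample_csv = io.StringIO(
    "user_id,parent_asin,rating,timestamp,history\n"
    "AFZUK3MTBIBEDQOPAK3OATUOUKLA,B0BFR5WF1R,1.0,1675826333052,B0020MKBNW B082FLP15V\n"
    "HYPOTHETICAL_USER,B0BFR5WF1R,5.0,1675826333053,\n"
)
df = pd.read_csv(sample_csv)

# An empty field is read as NaN, so replace it with "" before splitting
# on whitespace; "" then splits into an empty list.
df["history_items"] = df["history"].fillna("").str.split()

print(df[["user_id", "history_items"]])
```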

Leave-Last-Out Splitting

Note

The test and validation splits may be larger than the training split. This is because a user with only 1 interaction appears only in the test split, and a user with only 2 interactions contributes one interaction to the validation split and one to the test split. Such users are not included in the training split.

A data-splitting strategy that holds out each user's latest two item interactions for evaluation. This strategy is widely used in recommendation papers.

Specifically, given a chronological user interaction sequence of length N:

  1. Training part: the first N-2 items;

  2. Validation part: the (N-1)-th item;

  3. Testing part: the N-th item.
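The three rules above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the script used to produce the released files: the function name `leave_last_out` and the toy data are hypothetical, and the sketch does not build the history field.

```python
import pandas as pd


def leave_last_out(df: pd.DataFrame):
    """Split interactions per user: last item -> test, second to last -> valid,
    everything earlier -> train (hypothetical helper)."""
    # Order each user's interactions chronologically.
    df = df.sort_values(["user_id", "timestamp"], kind="stable")
    # Position counted from the end of each user's sequence: 0 is the N-th item.
    rank_from_end = df.groupby("user_id").cumcount(ascending=False)
    train = df[rank_from_end >= 2]   # first N-2 items
    valid = df[rank_from_end == 1]   # (N-1)-th item
    test = df[rank_from_end == 0]    # N-th item
    return train, valid, test


# Toy data: u1 has 4 interactions, u2 has 2, u3 has 1.
interactions = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u1", "u2", "u2", "u3"],
        "parent_asin": ["a", "b", "c", "d", "a", "e", "b"],
        "rating": [5.0, 4.0, 3.0, 5.0, 2.0, 4.0, 1.0],
        "timestamp": [1, 2, 3, 4, 5, 6, 7],
    }
)

train, valid, test = leave_last_out(interactions)
print(len(train), len(valid), len(test))
```

In this toy example the test split is the largest, since every user contributes exactly one test interaction while only u1 has enough interactions to appear in the training split.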

| Category | #User | #Item | #Rating (train / valid / test) | Download |
|---|---|---|---|---|
| All_Beauty | 632.0K | 112.6K | 19.1K / 42.8K / 632.0K | train / valid / test |
| Amazon_Fashion | 2.0M | 825.9K | 157.5K / 281.4K / 2.0M | train / valid / test |
| Appliances | 1.8M | 94.3K | 104.8K / 243.4K / 1.8M | train / valid / test |
| Arts_Crafts_and_Sewing | 4.6M | 801.3K | 2.8M / 1.4M / 4.6M | train / valid / test |
| Automotive | 8.0M | 2.0M | 8.4M / 3.3M / 8.0M | train / valid / test |
| Baby_Products | 3.4M | 217.7K | 1.6M / 1.0M / 3.4M | train / valid / test |
| Beauty_and_Personal_Care | 11.3M | 1.0M | 8.0M / 4.2M / 11.3M | train / valid / test |
| Books | 10.3M | 4.4M | 14.5M / 4.3M / 10.3M | train / valid / test |
| CDs_and_Vinyl | 1.8M | 701.7K | 2.4M / 648.8K / 1.8M | train / valid / test |
| Cell_Phones_and_Accessories | 11.6M | 1.3M | 4.9M / 4.1M / 11.6M | train / valid / test |
| Clothing_Shoes_and_Jewelry | 22.6M | 7.2M | 31.1M / 11.5M / 22.6M | train / valid / test |
| Digital_Music | 101.0K | 70.5K | 15.8K / 12.0K / 101.0K | train / valid / test |
| Electronics | 18.3M | 1.6M | 17.2M / 7.8M / 18.3M | train / valid / test |
| Gift_Cards | 132.7K | 1.1K | 5.7K / 11.4K / 132.7K | train / valid / test |
| Grocery_and_Gourmet_Food | 7.0M | 603.2K | 4.7M / 2.4M / 7.0M | train / valid / test |
| Handmade_Products | 586.6K | 164.7K | 19.1K / 50.4K / 586.6K | train / valid / test |
| Health_and_Household | 12.5M | 797.4K | 8.2M / 4.5M / 12.5M | train / valid / test |
| Health_and_Personal_Care | 461.7K | 60.3K | 8.1K / 18.4K / 461.7K | train / valid / test |
| Home_and_Kitchen | 23.2M | 3.7M | 31.7M / 11.7M / 23.2M | train / valid / test |
| Industrial_and_Scientific | 3.4M | 427.5K | 921.3K / 759.3K / 3.4M | train / valid / test |
| Kindle_Store | 5.6M | 1.6M | 16.9M / 2.7M / 5.6M | train / valid / test |
| Magazine_Subscriptions | 60.1K | 3.4K | 4.0K / 6.8K / 60.1K | train / valid / test |
| Movies_and_TV | 6.5M | 747.8K | 8.0M / 2.7M / 6.5M | train / valid / test |
| Musical_Instruments | 1.8M | 213.6K | 769.6K / 443.2K / 1.8M | train / valid / test |
| Office_Products | 7.6M | 710.4K | 2.9M / 2.2M / 7.6M | train / valid / test |
| Patio_Lawn_and_Garden | 8.6M | 851.7K | 4.7M / 3.0M / 8.6M | train / valid / test |
| Pet_Supplies | 7.8M | 492.7K | 5.9M / 3.0M / 7.8M | train / valid / test |
| Software | 2.6M | 89.2K | 1.4M / 826.8K / 2.6M | train / valid / test |
| Sports_and_Outdoors | 10.3M | 1.6M | 5.5M / 3.5M / 10.3M | train / valid / test |
| Subscription_Boxes | 15.2K | 641 | 144 / 572 / 15.2K | train / valid / test |
| Tools_and_Home_Improvement | 12.2M | 1.5M | 9.7M / 4.7M / 12.2M | train / valid / test |
| Toys_and_Games | 8.1M | 890.7K | 5.1M / 2.8M / 8.1M | train / valid / test |
| Video_Games | 2.8M | 137.2K | 1.0M / 743.8K / 2.8M | train / valid / test |
| Unknown | 23.1M | 13.2M | 28.5M / 11.5M / 23.1M | train / valid / test |

Absolute-Timestamp Splitting

A data-splitting strategy that cuts each item sequence at fixed absolute timestamps into training and evaluation parts. This strategy aligns with real-world scenarios but is not widely used in research. Researchers are encouraged to experiment with it.

Specifically, given a chronological user interaction sequence of length N:

  1. Training part: item interactions with timestamp range (-∞, t_1);

  2. Validation part: item interactions with timestamp range [t_1, t_2);

  3. Testing part: item interactions with timestamp range [t_2, +∞).

Here, we set t_1 = 1628643414042, and t_2 = 1658002729837.
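The half-open ranges above translate directly into boolean masks on the timestamp column. The sketch below is illustrative only: the function name `absolute_timestamp_split` and the toy rows are hypothetical, it does not build the history field, and the date conversion assumes the timestamps are Unix times in milliseconds, which their 13-digit magnitude suggests.

```python
import pandas as pd

T1 = 1628643414042  # t_1 from the documentation
T2 = 1658002729837  # t_2 from the documentation


def absolute_timestamp_split(df: pd.DataFrame, t1: int = T1, t2: int = T2):
    """Split interactions by absolute timestamp (hypothetical helper)."""
    ts = df["timestamp"]
    train = df[ts < t1]                  # (-inf, t_1)
    valid = df[(ts >= t1) & (ts < t2)]   # [t_1, t_2)
    test = df[ts >= t2]                  # [t_2, +inf)
    return train, valid, test


# Toy rows placed exactly around the two boundaries.
interactions = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u2"],
        "parent_asin": ["a", "b", "c", "d"],
        "rating": [5.0, 4.0, 3.0, 2.0],
        "timestamp": [T1 - 1, T1, T2 - 1, T2],
    }
)

train, valid, test = absolute_timestamp_split(interactions)
print(len(train), len(valid), len(test))

# Thresholds as UTC datetimes, assuming millisecond Unix timestamps.
t1_utc = pd.to_datetime(T1, unit="ms", utc=True)
t2_utc = pd.to_datetime(T2, unit="ms", utc=True)
print(t1_utc, t2_utc)
```

Unlike leave-last-out splitting, a user may contribute any number of interactions to each part here, including none.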

| Category | #User | #Item | #Rating (train / valid / test) | Download |
|---|---|---|---|---|
| All_Beauty | 632.0K | 112.6K | 583.2K / 71.8K / 39.0K | train / valid / test |
| Amazon_Fashion | 2.0M | 825.9K | 2.2M / 194.8K / 108.4K | train / valid / test |
| Appliances | 1.8M | 94.3K | 1.6M / 260.5K / 268.8K | train / valid / test |
| Arts_Crafts_and_Sewing | 4.6M | 801.3K | 6.7M / 1.1M / 1.1M | train / valid / test |
| Automotive | 8.0M | 2.0M | 14.9M / 2.4M / 2.4M | train / valid / test |
| Baby_Products | 3.4M | 217.7K | 4.9M / 517.4K / 550.4K | train / valid / test |
| Beauty_and_Personal_Care | 11.3M | 1.0M | 17.2M / 3.0M / 3.5M | train / valid / test |
| Books | 10.3M | 4.4M | 26.2M / 1.6M / 1.4M | train / valid / test |
| CDs_and_Vinyl | 1.8M | 701.7K | 4.5M / 138.8K / 105.9K | train / valid / test |
| Cell_Phones_and_Accessories | 11.6M | 1.3M | 16.3M / 2.1M / 2.1M | train / valid / test |
| Clothing_Shoes_and_Jewelry | 22.6M | 7.2M | 48.8M / 8.1M / 8.3M | train / valid / test |
| Digital_Music | 101.0K | 70.5K | 112.8K / 8.2K / 7.8K | train / valid / test |
| Electronics | 18.3M | 1.6M | 35.3M / 4.0M / 4.1M | train / valid / test |
| Gift_Cards | 132.7K | 1.1K | 123.6K / 14.0K / 12.4K | train / valid / test |
| Grocery_and_Gourmet_Food | 7.0M | 603.2K | 10.4M / 1.9M / 1.7M | train / valid / test |
| Handmade_Products | 586.6K | 164.7K | 457.1K / 94.1K / 104.9K | train / valid / test |
| Health_and_Household | 12.5M | 797.4K | 18.7M / 3.2M / 3.3M | train / valid / test |
| Health_and_Personal_Care | 461.7K | 60.3K | 413.3K / 45.4K / 29.4K | train / valid / test |
| Home_and_Kitchen | 23.2M | 3.7M | 49.5M / 8.4M / 8.7M | train / valid / test |
| Industrial_and_Scientific | 3.4M | 427.5K | 3.8M / 690.9K / 639.4K | train / valid / test |
| Kindle_Store | 5.6M | 1.6M | 22.0M / 1.7M / 1.5M | train / valid / test |
| Magazine_Subscriptions | 60.1K | 3.4K | 67.7K / 2.1K / 1.1K | train / valid / test |
| Movies_and_TV | 6.5M | 747.8K | 16.1M / 590.9K / 494.7K | train / valid / test |
| Musical_Instruments | 1.8M | 213.6K | 2.4M / 266.8K / 274.2K | train / valid / test |
| Office_Products | 7.6M | 710.4K | 9.9M / 1.4M / 1.4M | train / valid / test |
| Patio_Lawn_and_Garden | 8.6M | 851.7K | 11.6M / 2.4M / 2.3M | train / valid / test |
| Pet_Supplies | 7.8M | 492.7K | 12.4M / 2.1M / 2.1M | train / valid / test |
| Software | 2.6M | 89.2K | 4.6M / 140.7K / 94.1K | train / valid / test |
| Sports_and_Outdoors | 10.3M | 1.6M | 15.7M / 1.8M / 1.8M | train / valid / test |
| Subscription_Boxes | 15.2K | 641 | 12.6K / 2.1K / 1.2K | train / valid / test |
| Tools_and_Home_Improvement | 12.2M | 1.5M | 19.9M / 3.3M / 3.4M | train / valid / test |
| Toys_and_Games | 8.1M | 890.7K | 12.4M / 1.7M / 1.9M | train / valid / test |
| Video_Games | 2.8M | 137.2K | 3.8M / 344.6K / 363.9K | train / valid / test |
| Unknown | 23.1M | 13.2M | 58.0M / 3.0M / 2.1M | train / valid / test |