[🐛BUG] filepath join when setting benchmark_files #2190

@kickerzz

Describe the bug
When loading a custom dataset that has already been split into three parts (not by RecBole), the data can be loaded via the config by setting the benchmark_filename parameter (see https://recbole.io/docs/user_guide/config/data_settings.html#benchmark-file). When doing so (on Windows at least), the files are not found.

I traced the issue to the configurator file (see lib\site-packages\recbole\config\configurator.py), lines 335-338.

The current code checks whether a custom dataset is configured (yes, that is the case here), then continues into the else branch. There it joins data_path with the dataset name and stores the result as the final "data_path" in the config dict.

else:
    self.final_config_dict["data_path"] = os.path.join(
        self.final_config_dict["data_path"], self.dataset
    )

Later, in dataset.py (around line 310), it checks whether the benchmark files exist in that directory, but the directory is now data_path joined with the name of the dataset, which of course doesn't exist.

for filename in self.benchmark_filename_list:
    file_path = os.path.join(dataset_path, f"{token}.{filename}.inter")
    # print(f"WRONG FILEPATH = {file_path}")
    if os.path.isfile(file_path):  # doesn't pass this line....!
        temp = self._load_feat(file_path, FeatureSource.INTERACTION)

A similar issue arises with a normal dataset. I think lines 335-338 of configurator.py need to set a directory path instead of directory + dataset name.
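
To make the path composition concrete, here is a minimal sketch that reproduces the two joins quoted above using only os.path. It does not call RecBole. It assumes that `token` in dataset.py is the dataset name, which is not confirmed in this report, and `expected_benchmark_paths` is a hypothetical helper, not part of the library.

```python
import os


def expected_benchmark_paths(data_path, dataset, benchmark_filenames, token=None):
    """Mimic the two joins quoted above (hypothetical helper, not RecBole API)."""
    # Assumption: the token used in dataset.py equals the dataset name.
    token = dataset if token is None else token
    # configurator.py step: data_path becomes <data_path>/<dataset>
    dataset_path = os.path.join(data_path, dataset)
    # dataset.py step: each split is looked up as <token>.<filename>.inter
    return [
        os.path.join(dataset_path, f"{token}.{name}.inter")
        for name in benchmark_filenames
    ]


paths = expected_benchmark_paths(
    "load_data", "CustomDataSet", ["train", "valid", "test"]
)
for p in paths:
    print(p)
```

Under that assumption, the split files are only found if they sit in a subdirectory named after the dataset, not directly in data_path.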

To Reproduce
Steps to reproduce the behavior:

from recbole.quick_start import run_recbole

config_dict = {
    'model': 'UserKNN',
    'dataset': 'CustomDataSet',
    'data_path': 'load_data',  # tried multiple variants, e.g. full path 'C:/<location of python env>/load_data'
    'benchmark_filename': ['train', 'valid', 'test'],
    'USER_ID_FIELD': 'user_id',
    'ITEM_ID_FIELD': 'item_id',
    'RATING_FIELD': 'rating',
    'TIME_FIELD': 'time_field',
    'load_col': {'inter': ['user_id', 'item_id', 'rating']},
    'eval_setting': 'RO_RS,split',
    'split_ratio': [0.8, 0.1, 0.1],
    'metrics': ['mrr', 'precision', 'recall', 'ndcg', 'map'],
    'valid_metrics': 'MRR@10',
    'n_neighbors': 2,
    'similarity_type': 'cosine',
    'normalization': 'z-score',
    'train_batch_size': 4096,
    'epochs': 1,
    'seed': 42,
}

# Note: the model passed here ('BPR') differs from 'model' in config_dict ('UserKNN').
run_recbole(model='BPR', config_dict=config_dict)
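
Before calling run_recbole, a quick pre-flight check can show which files the loader would fail to find. This is a sketch only: it assumes the lookup pattern <data_path>/<dataset>/<dataset>.<split>.inter inferred from the snippets above, and `missing_benchmark_files` is a hypothetical helper, not a RecBole function.

```python
import os


def missing_benchmark_files(data_path, dataset, benchmark_filenames):
    """Return the assumed benchmark file paths that do not exist on disk."""
    # Assumed layout: <data_path>/<dataset>/<dataset>.<split>.inter
    dataset_dir = os.path.join(data_path, dataset)
    candidates = [
        os.path.join(dataset_dir, f"{dataset}.{name}.inter")
        for name in benchmark_filenames
    ]
    return [path for path in candidates if not os.path.isfile(path)]


missing = missing_benchmark_files(
    "load_data", "CustomDataSet", ["train", "valid", "test"]
)
for path in missing:
    print(f"not found: {path}")
```

If every split is reported as not found while the files exist directly under data_path, that would be consistent with the extra join described above.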

Desktop (please complete the following information):

  • OS: Windows 11
  • RecBole Version 1.2.1
  • Python Version 9.9.23
  • PyTorch Version 2.8.0
