The library’s development was driven by an extensive survey of 55 papers published between 2020 and 2024, covering areas such as graph neural networks, contrastive learning, and reinforcement learning. The survey identified inconsistencies in how datasets are referenced, filtered, and split, all of which DataRec explicitly seeks to correct.
Dataset referencing, for example, was found to be unreliable: only 35% of the surveyed papers referenced original sources, while the rest pointed to modified versions or to links that no longer work. DataRec counters this with built-in dataset access and public checksums, so a downloaded file can be checked against its published hash. It also supports transforming raw data with filtering methods that mirror common practice, and provides traceable exports to major frameworks.
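To make the checksum idea concrete, here is a minimal sketch of how a downloaded dataset file could be verified against a published hash. It uses only Python's standard library. The function names `sha256_of_file` and `verify_dataset` are hypothetical and are not taken from DataRec's API, and the choice of SHA-256 is an assumption, since the text does not say which hash the library publishes.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large dataset files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path, expected_sha256):
    """Hypothetical helper: raise ValueError if the file does not match the published checksum."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return True
```

A caller would pass the local file path and the checksum published alongside the dataset. A mismatch signals that the file is a modified version or a corrupted download rather than the original source.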
In contrast to monolithic recommendation frameworks, which are often non-interoperable, DataRec is modular and library-focused. This enables it to act as a shared layer for dataset handling, without duplicating model training or evaluation logic. The architecture is centred on a primary DataRec class backed by modules for I/O, processing, and splitting. Version control, detailed logging, and exportable configurations ensure that results can be reliably reproduced across different environments and research groups.
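The link between exportable configurations and reproducibility can be illustrated with a short sketch. It is not DataRec's implementation: `random_holdout_split` and `export_config` are hypothetical names, and the random holdout strategy, the default ratio, and the JSON format are all assumptions chosen for illustration.

```python
import json
import random


def random_holdout_split(interactions, test_ratio=0.2, seed=42):
    """Hypothetical sketch: seeded random holdout split that also returns its configuration."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    rows = list(interactions)
    # A dedicated Random instance keeps the shuffle independent of global random state.
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_ratio))
    test, train = rows[:n_test], rows[n_test:]
    # Everything needed to repeat the split is recorded next to the result.
    config = {
        "strategy": "random_holdout",
        "test_ratio": test_ratio,
        "seed": seed,
        "n_rows": len(rows),
    }
    return train, test, config


def export_config(config):
    """Serialize the split configuration to a stable JSON string."""
    return json.dumps(config, sort_keys=True, indent=2)
```

Given the same input order, ratio, and seed, the function produces the same partition on every run. Sharing the exported JSON alongside the dataset reference is therefore enough for another group to rebuild an identical train/test split.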
You can learn more by reading the entire paper or accessing the library on GitHub.