Benchmarks

Explore the benchmarks used in NodeRAG experiments. This section details the benchmark datasets and how they were processed.

We observe that many existing benchmarks no longer align with modern RAG settings. Traditional RAG benchmarks operate on paragraphs: relevant paragraphs are selected from a limited candidate set before an LLM generates an answer. Current RAG settings more closely resemble real-world scenarios, where a raw corpus is processed directly for retrieval and answering. We therefore modified existing multi-hop datasets by merging all paragraphs into a single corpus and evaluating only the final answers. Because RAG primarily concerns the quality of the retrieval system, we kept the question-answering settings consistent across evaluations to ensure fair comparisons.

raw corpus

You should merge all corpora into a single corpus, then use the indexing functionality of each RAG system to index it into its own database.
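
As an illustration, here is a minimal sketch of the merging step (the input filename, JSON layout, and the "paragraphs" key are assumptions; adapt them to the dataset you are processing):

import json
from pathlib import Path

def merge_to_corpus(dataset_path: str, output_path: str) -> None:
    # Load one processed multi-hop dataset; each record is assumed to carry
    # its supporting paragraphs under a "paragraphs" key.
    records = json.loads(Path(dataset_path).read_text(encoding="utf-8"))
    paragraphs = []
    for record in records:
        paragraphs.extend(record.get("paragraphs", []))
    # Concatenate every paragraph into a single raw corpus ready for indexing.
    Path(output_path).write_text("\n\n".join(paragraphs), encoding="utf-8")

merge_to_corpus("multihop_dataset.json", "raw_corpus.txt")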

answer and evaluation

Save your questions and answers as keys in a Parquet file. You can then directly use our provided “LLM as judge” script for testing.
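
For example, with pandas (the column names "question" and "answer" are assumptions; use whatever keys the evaluation script expects):

import pandas as pd

# Sketch only: store each question together with the answer your RAG system generated.
results = pd.DataFrame(
    {
        "question": ["Who founded the company described in the corpus?"],
        "answer": ["It was founded by ..."],
    }
)
results.to_parquet("answers.parquet", index=False)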

benchmarks

We provide most of the datasets we used, which have been processed into an easy-to-use format. However, due to copyright requirements for some datasets, please contact any of the authors to obtain our processed datasets and evaluation parquet files.

1 - RAG-QA-ARENA

RAG-QA-ARENA is a preference-based comparison dataset, for which we provide a detailed tutorial below.

Request for data

You can obtain the dataset by emailing any of the authors.

Process

You will find a RAG Arena folder in Google Drive. Place the data files from this folder into the rag-qa-arena folder in your GitHub repository.

Index and Answer

Add the -a flag to the original command to skip evaluation and obtain the raw parquet file.

For example,

python -m eval.eval_node -f path/to/main_folder -q path/to/question_parquet -a

Use change.ipynb in the rag-qa-arena folder to convert the parquet to the evaluation JSON format. Place the processed JSON files in the data/pairwise_eval folder, following this structure:

πŸ“ rag-qa-arena
└── πŸ“ data
    └── πŸ“ pairwise_eval
        └── πŸ“ GraphRAG
            β”œβ”€β”€ πŸ“„ fiqa.json
            β”œβ”€β”€ πŸ“„ lifestyle.json
            β”œβ”€β”€ πŸ“„ recreation.json
            β”œβ”€β”€ πŸ“„ science.json
            β”œβ”€β”€ πŸ“„ technology.json
            └── πŸ“„ writing.json
        └── πŸ“ NodeRAG
            β”œβ”€β”€ πŸ“„ fiqa.json
            β”œβ”€β”€ πŸ“„ lifestyle.json
            β”œβ”€β”€ πŸ“„ recreation.json
            β”œβ”€β”€ πŸ“„ science.json
            β”œβ”€β”€ πŸ“„ technology.json
            └── πŸ“„ writing.json
        └── πŸ“ NaiveRAG
            β”œβ”€β”€ πŸ“„ fiqa.json
            β”œβ”€β”€ πŸ“„ lifestyle.json
            β”œβ”€β”€ πŸ“„ recreation.json
            β”œβ”€β”€ πŸ“„ science.json
            β”œβ”€β”€ πŸ“„ technology.json
            └── πŸ“„ writing.json
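
If you want a rough picture of what change.ipynb does, the sketch below converts a raw answer parquet into a JSON list of records. This record layout is an assumption; consult the notebook for the exact schema the pairwise evaluation scripts expect.

import json
import pandas as pd

# Assumption: the evaluation JSON is a list of per-question records taken
# directly from the answer parquet; change.ipynb defines the real format.
df = pd.read_parquet("answers.parquet")
records = df.to_dict(orient="records")
with open("data/pairwise_eval/NodeRAG/fiqa.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)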

Compare with LFRQA directly.

Modify the script by adding your openai_key.

For macOS and Linux,

bash run_pairwise_eval_lfrqa.sh

For Windows,

run_pairwise_eval_lfrqa.bat

Compare a pair of LLM generations.

Modify the script by adding your openai_key.

For macOS and Linux,

bash run_pairwise_eval_llms.sh

For Windows,

run_pairwise_eval_llm.bat

You should modify model1 and model2 to ensure each model is compared against every other model. For example, you can compare NaiveRAG against the other four models, then compare Hyde against the remaining three models (excluding NaiveRAG), and so on until all pairwise comparisons are complete.
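
To make sure no pair is missed, the comparisons you need are simply all unordered combinations of your models. A small sketch (the model list is illustrative; substitute the systems you actually evaluated):

from itertools import combinations

# List the systems you evaluated; the names below are only examples.
models = ["NaiveRAG", "Hyde", "GraphRAG", "NodeRAG"]

# Every unordered pair corresponds to one model1/model2 run of the script.
for model1, model2 in combinations(models, 2):
    print(f"model1={model1}  model2={model2}")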

3.2 Complete Pairs

python code/report_results.py --use_complete_pairs

This script reports the win and win+tie rates for all comparisons and outputs an all_battles.json file.