RAG-QA-ARENA
RAG-QA-ARENA is a preference-based comparison dataset for which we provide detailed tutorials.
We observe that many existing benchmarks no longer align with modern RAG settings. Traditional RAG benchmarks operate at the paragraph level: relevant paragraphs are selected from a limited set before an LLM generates an answer. Current RAG settings more closely resemble real-world scenarios, where a raw corpus is processed directly for retrieval and answering. We therefore modified existing multi-hop datasets by merging all paragraphs into a single corpus and evaluating the final answers. Because RAG evaluation primarily targets the quality of the retrieval system, we kept the question-answering settings identical across systems to ensure fair comparisons.
You should merge all corpora into a single corpus, then use each RAG system's indexing functionality to index it into its own database.
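A minimal sketch of the merging step, assuming each dataset ships its passages as a JSON list of `{"id": ..., "text": ...}` records under a `corpora/` directory (the file layout, field names, and the indexing call at the end are assumptions, not part of this repository):

```python
import json
from pathlib import Path

# Assumed layout: one JSON file per dataset, each a list of passage records.
corpus_dir = Path("corpora")
merged_path = Path("merged_corpus.jsonl")

# Merge every per-dataset corpus into one JSONL file so that each RAG system
# indexes exactly the same collection of passages.
with merged_path.open("w", encoding="utf-8") as out:
    for corpus_file in sorted(corpus_dir.glob("*.json")):
        for record in json.loads(corpus_file.read_text(encoding="utf-8")):
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

# Indexing then goes through each system's own API; the class and method
# names below are placeholders for whatever the system under test exposes:
# rag = SomeRAGSystem(working_dir="./rag_db")
# with merged_path.open(encoding="utf-8") as f:
#     rag.insert([json.loads(line)["text"] for line in f])
```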
Save your questions and answers under the corresponding keys (columns) of a Parquet file. You can then run our provided "LLM as judge" script directly on that file for evaluation.
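A small sketch of writing such a file with pandas; the column names `"question"` and `"answer"` are assumptions here, so use whatever keys the provided LLM-as-judge script actually expects:

```python
import pandas as pd

# Placeholder rows: in practice, pair each benchmark question with the
# answer generated by the RAG system under evaluation.
qa = pd.DataFrame(
    {
        "question": ["Example question 1?", "Example question 2?"],
        "answer": ["Generated answer 1.", "Generated answer 2."],
    }
)

# Writing Parquet via pandas requires pyarrow (or fastparquet) to be installed.
qa.to_parquet("rag_answers.parquet", index=False)
```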
We provide most of the datasets we used, processed into an easy-to-use format. However, because some datasets carry copyright restrictions, please contact any of the authors to obtain the processed datasets and evaluation Parquet files.