NodeRAG Structure

Learn about NodeRAG's core data structures and components. This guide explains the basic structures used in NodeRAG.

File Structure Overview

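The following structure is generated once the indexing process completes. Each folder and file serves a specific purpose in efficient retrieval and graph-based reasoning.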

main_folder/
├── cache/
│   ├── attributes.parquet
│   ├── documents.parquet
│   ├── entities.parquet
│   ├── graph.pkl
│   ├── high_level_elements.parquet
│   ├── high_level_elements_titles.parquet
│   ├── hnsw_graph.pkl
│   ├── HNSW.bin
│   ├── id_map.parquet
│   ├── relationship.parquet
│   ├── semantic_units.parquet
│   ├── text_decomposition.jsonl
│   └── text.parquet
│
├── info/
│   ├── document_hash.json
│   ├── indices.json
│   ├── info.log
│   └── state.json
│
├── input/
│   └── J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt
│
└── Node_config.yaml
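
As a quick sanity check after indexing, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of NodeRAG: the function name `missing_index_files` is hypothetical, and the file names are copied from the tree above.

```python
from pathlib import Path

# File names copied from the tree above. The dictionary and the function
# below are illustrative helpers, not part of NodeRAG itself.
EXPECTED_FILES = {
    "cache": [
        "attributes.parquet", "documents.parquet", "entities.parquet",
        "graph.pkl", "high_level_elements.parquet",
        "high_level_elements_titles.parquet", "hnsw_graph.pkl", "HNSW.bin",
        "id_map.parquet", "relationship.parquet", "semantic_units.parquet",
        "text_decomposition.jsonl", "text.parquet",
    ],
    "info": ["document_hash.json", "indices.json", "info.log", "state.json"],
}


def missing_index_files(main_folder):
    """Return the relative paths of expected files that are absent."""
    root = Path(main_folder)
    missing = []
    for subdir, names in EXPECTED_FILES.items():
        for name in names:
            if not (root / subdir / name).is_file():
                missing.append(f"{subdir}/{name}")
    # The configuration file sits at the top level of the folder.
    if not (root / "Node_config.yaml").is_file():
        missing.append("Node_config.yaml")
    return missing
```

An empty list from `missing_index_files("main_folder")` suggests that every file shown in the tree is present.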

Directory and File Descriptions

`cache/`

Stores all processed data, including semantic structures, embedding vectors, and graph data, optimized for fast retrieval and reasoning.

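attributes.parquet: metadata attributes extracted from the corpus.
documents.parquet: processed document-level data entries.
entities.parquet: extracted named entities, used for linking and graph construction.
graph.pkl: the serialized heterogeneous graph, keyed by hash IDs.
high_level_elements.parquet: aggregated high-level units (for example, high-level concepts).
high_level_elements_titles.parquet: titles of the high-level elements, used for structured navigation.
hnsw_graph.pkl / HNSW.bin: the HNSW (Hierarchical Navigable Small World) index, used for fast similarity search.
id_map.parquet: maps internal IDs to nodes.
relationship.parquet: relationship data between entities or semantic units.
semantic_units.parquet: core semantic content units, used for fine-grained queries.
text_decomposition.jsonl: decomposed text data stored in JSON Lines format, used for indexing.
text.parquet: raw or lightly processed text segments in an efficient storage format.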
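
To inspect these files by hand, the JSON Lines file can be read with the Python standard library alone. The sketch below assumes only that each non-empty line of `text_decomposition.jsonl` is one JSON value; it makes no assumption about the fields inside each record, and `read_jsonl` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `records = list(read_jsonl("main_folder/cache/text_decomposition.jsonl"))` loads the whole file into memory. The `.parquet` tables can be opened in a similar spirit with `pandas.read_parquet`, provided pandas and a Parquet engine such as pyarrow are installed.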

`info/`

Contains indexing state, logs, and metadata for tracking and reproducibility.

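document_hash.json: hashes of the input documents, used for change detection and incremental updates.
indices.json: information about the number of nodes of each type.
info.log: log file capturing processing steps and timings.
state.json: snapshot of the workflow state, used to resume or audit the indexing process.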
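
The idea behind hash-based change detection can be illustrated as follows. This is a sketch under stated assumptions, not NodeRAG's actual implementation: the choice of SHA-256, the helper names `hash_file` and `changed_documents`, and the representation of known hashes as a plain set are all hypothetical, and the real layout of `document_hash.json` is not assumed here.

```python
import hashlib
from pathlib import Path


def hash_file(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_documents(input_dir, known_hashes):
    """List .txt/.md files whose content hash is not in known_hashes.

    known_hashes is a set of hex digests recorded during a previous run
    (an assumed representation, chosen for illustration).
    """
    changed = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in {".txt", ".md"}:
            if hash_file(path) not in known_hashes:
                changed.append(path.name)
    return changed
```

Only the files returned by `changed_documents` would need to be processed again, which is what makes an incremental update cheaper than a full re-index.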

`input/`

Input files and configuration for the indexing process.

*.(txt, md): example input corpus.

`Configuration`

Node_config.yaml: configuration file that specifies indexing parameters, model settings, and paths.

Last modified April 5, 2025: update reproduce (f23a25c)