NodeRAG Structures

This guide explains NodeRAG's core data structures and components.

File Structure Overview

The following structure is generated after the indexing process is completed. Each folder and file serves a specific purpose for efficient retrieval and graph-based reasoning.

main_folder/
├── cache/
│   ├── attributes.parquet
│   ├── documents.parquet
│   ├── entities.parquet
│   ├── graph.pkl
│   ├── high_level_elements.parquet
│   ├── high_level_elements_titles.parquet
│   ├── hnsw_graph.pkl
│   ├── HNSW.bin
│   ├── id_map.parquet
│   ├── relationship.parquet
│   ├── semantic_units.parquet
│   ├── text_decomposition.jsonl
│   └── text.parquet
│
├── info/
│   ├── document_hash.json
│   ├── indices.json
│   ├── info.log
│   └── state.json
│
├── input/
│   └── J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt
│
└── Node_config.yaml
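
One way to confirm that an indexing run produced every file in the tree above is to compare the folder against that list. The sketch below is not part of NodeRAG: `EXPECTED` and `missing_outputs` are hypothetical names, and the file names are copied from the tree.

```python
from pathlib import Path

# File names copied from the tree above. The dict grouping is illustrative only.
EXPECTED = {
    "cache": [
        "attributes.parquet", "documents.parquet", "entities.parquet",
        "graph.pkl", "high_level_elements.parquet",
        "high_level_elements_titles.parquet", "hnsw_graph.pkl", "HNSW.bin",
        "id_map.parquet", "relationship.parquet", "semantic_units.parquet",
        "text_decomposition.jsonl", "text.parquet",
    ],
    "info": ["document_hash.json", "indices.json", "info.log", "state.json"],
}


def missing_outputs(root):
    """Return relative paths of expected files that are absent under root."""
    root = Path(root)
    missing = []
    for folder, names in EXPECTED.items():
        for name in names:
            if not (root / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    # The configuration file sits at the top level of the main folder.
    if not (root / "Node_config.yaml").is_file():
        missing.append("Node_config.yaml")
    return missing
```

An empty list from `missing_outputs("main_folder")` means every listed file is present. It does not check the contents of any file.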

Directory and File Descriptions

cache/

Stores all processed data, including semantic structures, embeddings, and graph data, optimized for fast retrieval and reasoning.

  • attributes.parquet: Stores metadata attributes extracted from the corpus.
  • documents.parquet: Contains processed document-level data entries.
  • entities.parquet: Extracted named entities for linking and graph construction.
  • graph.pkl: Serialized (pickled) heterogeneous graph, keyed by hash IDs.
  • high_level_elements.parquet: Aggregated high-level elements.
  • high_level_elements_titles.parquet: Titles of the high-level elements for structured navigation.
  • hnsw_graph.pkl / HNSW.bin: HNSW (Hierarchical Navigable Small World) index for fast similarity search.
  • id_map.parquet: Maps internal IDs to nodes.
  • relationship.parquet: Relationship data between entities or semantic units.
  • semantic_units.parquet: Core semantic content units for fine-grained querying.
  • text_decomposition.jsonl: Decomposed text data in JSON Lines format, used for indexing.
  • text.parquet: Raw or lightly processed text segments stored efficiently.
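
Because text_decomposition.jsonl is in JSON Lines format, each non-empty line can be parsed as an independent JSON value. The following is a minimal sketch using only the standard library. `read_jsonl` is a hypothetical helper, and it makes no assumption about which fields NodeRAG writes into each record.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list, one entry per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records
```

The .parquet files can be opened in a similar spirit with `pandas.read_parquet`, provided pandas and a Parquet engine such as pyarrow are installed. The .pkl files should only be loaded from a source you trust, since unpickling can execute arbitrary code.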

info/

Contains indexing status, logs, and metadata for tracking and reproducibility.

  • document_hash.json: Hashes of input documents for change detection and incremental updates.
  • indices.json: Counts of each node type.
  • info.log: Log file capturing processing steps and times.
  • state.json: Workflow state snapshot used for resuming or auditing the indexing process.
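
Hash-based change detection of the kind document_hash.json supports can be sketched as follows. This is an illustration of the general technique, not NodeRAG's implementation. The SHA-256 choice, the `{file name: digest}` mapping, and the names `file_digest` and `changed_documents` are all assumptions, and the real layout of document_hash.json may differ.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of a file's bytes (the hash choice is an assumption)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_documents(input_dir, stored_hashes):
    """Return sorted names of files that are new or whose digest changed.

    stored_hashes is a hypothetical {file name: hex digest} mapping from a
    previous run. It is not the documented format of document_hash.json.
    """
    changed = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file() and stored_hashes.get(path.name) != file_digest(path):
            changed.append(path.name)
    return changed
```

Only the files this function returns would need to be reprocessed in an incremental update. Unchanged files keep their stored digest and can be skipped.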

input/

Input files for the indexing process.

  • *.txt, *.md: Input corpus files in plain text or Markdown, such as the example shown in the tree above.

Config

  • Node_config.yaml: Configuration file specifying indexing parameters, model settings, and paths.

Last modified April 5, 2025: update reproduce (f23a25c)