CONTEXT ARCHITECTURE
500k Lines. Local LLM. Bulletproof.
A 4-layer system that makes a 7B model navigate a massive codebase
like a senior developer — without ever exceeding context limits.
6.25M — Total tokens in 500k lines
8K — Model context window
<400 — Tokens used per query
💀 The Problem — By The Numbers
500,000 lines of code. Even a model with a 256K context window can't hold it — and even if it could, you'd blow your RAM trying to process it locally. We need a system smarter than brute force.
500k lines × ~12.5 tokens/line = 6.25 million tokens  |  Your model's context: 8K–32K tokens  |  Gap: 195x–780x too large
6.25M — Total tokens in codebase
~15K — After semantic search (top chunks)
~2K — After graph traversal (traced functions)
<400 — Final context sent to LLM
THE 4-LAYER FUNNEL
LAYER 1: Hierarchical Summarisation Index
Pre-computed, stored in DB — the codebase's "table of contents"
Every function, class, and file gets a pre-computed plain-English summary stored in the DB. These summaries are tiny (2–4 sentences each) but capture the intent — not the implementation. Built in a pyramid: function summaries → file summaries → module summaries → system overview.

When a query comes in, the agent reads summaries first — never raw code. This lets it navigate 500k lines with the same ease as reading a table of contents.
SUMMARY PYRAMID
🏛 System: "A real-time 3D renderer with shader pipeline, framebuffer management, and asset loader"
📁 Module: "renderer/ handles framebuffer lifecycle, shader compilation and draw calls"
📄 File: "renderer.go — manages framebuffer creation, binding, and resize events"
⚡ Fn: "setFramebuffer(id int) — binds a framebuffer by ID, validates bounds, emits resize event"
SQLite storage | Pre-computed once | Incremental updates | Qwen 7B generates summaries
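As a sketch, the pyramid can live in a single SQLite table keyed by node, with each summary pointing at its parent one level up. The schema and names below are illustrative assumptions, not the project's actual implementation:

```python
import sqlite3

# Illustrative schema for the summary pyramid (table and column names
# are assumptions for this sketch).
SCHEMA = """
CREATE TABLE IF NOT EXISTS summaries (
    node_id   TEXT PRIMARY KEY,  -- e.g. 'renderer/renderer.go::setFramebuffer'
    level     TEXT NOT NULL,     -- 'function' | 'file' | 'module' | 'system'
    parent_id TEXT,              -- link one level up the pyramid
    summary   TEXT NOT NULL      -- 2-4 sentence plain-English summary
);
"""

def walk_pyramid(db, root_id, depth=2):
    """Yield (node_id, summary) top-down, like reading a table of
    contents. A real agent would descend only into nodes whose parent
    summary looks relevant; here we simply descend `depth` levels."""
    stack = [(root_id, 0)]
    while stack:
        node_id, d = stack.pop()
        row = db.execute(
            "SELECT summary FROM summaries WHERE node_id = ?", (node_id,)
        ).fetchone()
        if row is None:
            continue
        yield node_id, row[0]
        if d < depth:
            for (child,) in db.execute(
                "SELECT node_id FROM summaries WHERE parent_id = ?", (node_id,)
            ):
                stack.append((child, d + 1))
```

Because only summaries are read, a full walk of even a huge tree costs a few hundred tokens, never megabytes of raw code.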
LAYER 2: Code Knowledge Graph
Structural relationships — not similarity, but connections
RAG finds similar code. But bugs don't care about similarity — they care about relationships: Function A breaks because Function B changed, B is called from File C, and File C imports from Package D.

Tree-sitter parses every file and builds a graph: every function, class, and variable is a node. Every call, import, and inheritance is an edge. Stored in SQLite as an adjacency list. The agent traverses this graph to find what's truly connected to the issue — not just what sounds similar.
GRAPH TRAVERSAL EXAMPLE
Issue: "target is wrong"
→ Semantic hit: setFramebuffer() in renderer.go
→ Graph: setFramebuffer() ← called by drawFrame() ← called by RenderLoop.tick()
→ Graph: setFramebuffer() → writes to FramebufferRegistry (shared state!)
→ Agent knows: changing setFramebuffer affects ALL callers + the registry
Tree-sitter AST | SQLite graph (adjacency list) | Python / Go / Rust / JS | Blast radius detection
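The blast-radius traversal above is a breadth-first search over the adjacency list, walking call edges backwards. A minimal sketch, with an assumed `edges` table (the table name and edge kinds are illustrative):

```python
import sqlite3
from collections import deque

# Illustrative adjacency list: one row per edge extracted from the AST.
EDGES = """
CREATE TABLE IF NOT EXISTS edges (
    src  TEXT NOT NULL,  -- caller / importer / subclass
    dst  TEXT NOT NULL,  -- callee / imported symbol / superclass
    kind TEXT NOT NULL   -- 'calls' | 'imports' | 'inherits' | 'writes'
);
"""

def blast_radius(db, node, max_hops=3):
    """BFS backwards along 'calls' edges: everything that transitively
    calls `node`, i.e. everything a change to `node` could break."""
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        current, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for (caller,) in db.execute(
            "SELECT src FROM edges WHERE dst = ? AND kind = 'calls'",
            (current,),
        ):
            if caller not in seen:
                seen.add(caller)
                frontier.append((caller, hops + 1))
    seen.discard(node)
    return seen
```

Capping `max_hops` keeps the traversal cheap and mirrors how far a change realistically propagates before summaries take over.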
LAYER 3: Semantic Vector Search (RAG)
Meaning-based retrieval — finds code that talks about the same thing
Every function is chunked and embedded using nomic-embed-text (runs locally, ~270MB). Chunks are stored in ChromaDB with metadata: file path, function name, language, last modified.

RAG is Layer 3 — not Layer 1 — because alone it's not enough. But combined with the graph and summaries, it becomes extremely precise. It finds semantically related code that the graph might miss (e.g. a comment in a different file that explains why the framebuffer ID scheme was designed this way).
WHAT RAG FINDS THAT GRAPH MISSES
Query embedding: "framebuffer target binding wrong"
→ Finds: a comment in ARCHITECTURE.md explaining the framebuffer ID convention
→ Finds: a similar bug fix from 6 months ago in a different file
→ Finds: the test that was written for this exact scenario
ChromaDB | nomic-embed-text (local) | Chunk size: ~40 lines | Top-K retrieval
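At its core, top-K retrieval is nearest-neighbour search over chunk embeddings. A dependency-free sketch of the scoring step — in the real pipeline ChromaDB does this and the vectors come from nomic-embed-text; the toy two-dimensional vectors below are made up:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, chunks, k=3):
    """chunks: list of (metadata_dict, embedding). Returns the k chunks
    whose embeddings are most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return ranked[:k]
```

The metadata dict travels with each hit, so the budget manager in Layer 4 can later trade a chunk's full text for its summary without a second lookup.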
LAYER 4: Dynamic Context Budget Manager
The enforcer — ensures the LLM never gets more than it can handle
Layers 1–3 might return more context than the model can handle. Layer 4 is the gatekeeper. It has a fixed token budget (e.g. 3,000 tokens for code context) and fills it intelligently — highest-relevance chunks first, then graph-connected context, then summaries for anything that didn't fit.

If something important is too large to fit fully, it substitutes its pre-computed summary instead. The LLM always gets a complete, coherent picture — even if parts of it are compressed summaries. This is the key insight: summaries are lossless for reasoning, lossy only for exact syntax.
BUDGET ALLOCATION EXAMPLE (3000 token budget)
✓ setFramebuffer() full code — 180 tokens (high relevance, fits)
✓ drawFrame() full code — 240 tokens (caller, fits)
✓ FramebufferRegistry summary — 60 tokens (too large to fit fully, summary used)
✓ Related bug fix context from RAG — 150 tokens
✓ Web docs snippet — 200 tokens
✓ Issue + plan — 400 tokens
Total: 1,230 tokens. Well within budget. LLM has everything it needs.
tiktoken token counting | Relevance scoring | Summary substitution | Configurable budget
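The fill-then-substitute loop can be sketched as a greedy packer. A whitespace split stands in for tiktoken here, and the item fields are illustrative assumptions:

```python
def count_tokens(text):
    # Whitespace split as a stand-in; the real pipeline uses tiktoken.
    return len(text.split())

def fill_budget(items, budget):
    """Greedy packer: highest-relevance items first, full code when it
    fits, the pre-computed summary otherwise. Returns (context, used)."""
    context, used = [], 0
    for item in sorted(items, key=lambda i: i["relevance"], reverse=True):
        for text in (item["full"], item["summary"]):
            cost = count_tokens(text)
            if used + cost <= budget:
                context.append(text)
                used += cost
                break  # took the full code (or its summary); next item
    return context, used
```

Because every item carries its Layer 1 summary as a fallback, the packer can always include *something* for each relevant node — the picture stays complete even when it is compressed.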