CONTEXT ARCHITECTURE
500k Lines. Local LLM. Bulletproof.
A 4-layer system that makes a 7B model navigate a massive codebase
like a senior developer — without ever exceeding context limits.
6.25M — Total tokens in 500k lines
8K — Model context window
<400 — Tokens used per query
💀 The Problem — By The Numbers
500,000 lines of code. Even a model with a 256K context window can't hold it — and even if it could, you'd blow your RAM trying to process it locally. We need a system smarter than brute force.
500k lines × ~12.5 tokens/line = 6.25 million tokens  |  Your model's context: 8K–32K tokens  |  Gap: 195x–780x too large
6.25M — Total tokens in codebase
~15K — After semantic search (top chunks)
~2K — After graph traversal (traced functions)
<400 — Final context sent to LLM
THE 4-LAYER FUNNEL
LAYER 1: Hierarchical Summarisation Index
Pre-computed, stored in DB — the codebase's "table of contents"
Every function, class, and file gets a pre-computed plain-English summary stored in the DB. These summaries are tiny (2–4 sentences each) but capture the intent — not the implementation. Built in a pyramid: function summaries → file summaries → module summaries → system overview.

When a query comes in, the agent reads summaries first — never raw code. This lets it navigate 500k lines with the same ease as reading a table of contents.
SUMMARY PYRAMID
🏛 System: "A real-time 3D renderer with shader pipeline, framebuffer management, and asset loader"
📁 Module: "renderer/ handles framebuffer lifecycle, shader compilation and draw calls"
📄 File: "renderer.go — manages framebuffer creation, binding, and resize events"
⚡ Fn: "setFramebuffer(id int) — binds a framebuffer by ID, validates bounds, emits resize event"
SQLite storage | Pre-computed once | Incremental updates | Qwen 7B generates summaries
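As a sketch, the pyramid can live in a single SQLite table keyed by node, with each summary pointing at its parent one level up. The schema and names below are illustrative assumptions, not the project's actual implementation:

```python
import sqlite3

# Illustrative schema for the summary pyramid (table and column names
# are assumptions for this sketch).
SCHEMA = """
CREATE TABLE IF NOT EXISTS summaries (
    node_id   TEXT PRIMARY KEY,  -- e.g. 'renderer/renderer.go::setFramebuffer'
    level     TEXT NOT NULL,     -- 'function' | 'file' | 'module' | 'system'
    parent_id TEXT,              -- link one level up the pyramid
    summary   TEXT NOT NULL      -- 2-4 sentence plain-English summary
);
"""

def walk_pyramid(db, root_id, depth=2):
    """Yield (node_id, summary) top-down, like reading a table of
    contents. A real agent would descend only into nodes whose parent
    summary looks relevant; here we simply descend `depth` levels."""
    stack = [(root_id, 0)]
    while stack:
        node_id, d = stack.pop()
        row = db.execute(
            "SELECT summary FROM summaries WHERE node_id = ?", (node_id,)
        ).fetchone()
        if row is None:
            continue
        yield node_id, row[0]
        if d < depth:
            for (child,) in db.execute(
                "SELECT node_id FROM summaries WHERE parent_id = ?", (node_id,)
            ):
                stack.append((child, d + 1))
```

Because only summaries are read, a full walk of even a huge tree costs a few hundred tokens, never megabytes of raw code.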
LAYER 2: Code Knowledge Graph
Structural relationships — not similarity, but connections
RAG finds similar code. But bugs don't care about similarity — they care about relationships: Function A breaks because Function B changed, B is called from File C, and File C imports from Package D.

Tree-sitter parses every file and builds a graph: every function, class, and variable is a node. Every call, import, and inheritance is an edge. Stored in SQLite as an adjacency list. The agent traverses this graph to find what's truly connected to the issue — not just what sounds similar.
GRAPH TRAVERSAL EXAMPLE
Issue: "target is wrong"
→ Semantic hit: setFramebuffer() in renderer.go
→ Graph: setFramebuffer() ← called by drawFrame() ← called by RenderLoop.tick()
→ Graph: setFramebuffer() → writes to FramebufferRegistry (shared state!)
→ Agent knows: changing setFramebuffer affects ALL callers + the registry
Tree-sitter AST | SQLite graph (adjacency list) | Python / Go / Rust / JS | Blast radius detection
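The blast-radius traversal above is a breadth-first search over the adjacency list, walking call edges backwards. A minimal sketch, with an assumed `edges` table (the table name and edge kinds are illustrative):

```python
import sqlite3
from collections import deque

# Illustrative adjacency list: one row per edge extracted from the AST.
EDGES = """
CREATE TABLE IF NOT EXISTS edges (
    src  TEXT NOT NULL,  -- caller / importer / subclass
    dst  TEXT NOT NULL,  -- callee / imported symbol / superclass
    kind TEXT NOT NULL   -- 'calls' | 'imports' | 'inherits' | 'writes'
);
"""

def blast_radius(db, node, max_hops=3):
    """BFS backwards along 'calls' edges: everything that transitively
    calls `node`, i.e. everything a change to `node` could break."""
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        current, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for (caller,) in db.execute(
            "SELECT src FROM edges WHERE dst = ? AND kind = 'calls'",
            (current,),
        ):
            if caller not in seen:
                seen.add(caller)
                frontier.append((caller, hops + 1))
    seen.discard(node)
    return seen
```

Capping `max_hops` keeps the traversal cheap and mirrors how far a change realistically propagates before summaries take over.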
LAYER 3: Semantic Vector Search (RAG)
Meaning-based retrieval — finds code that talks about the same thing
Every function is chunked and embedded using nomic-embed-text (runs locally, ~270MB). Chunks are stored in ChromaDB with metadata: file path, function name, language, last modified.

RAG is Layer 3 — not Layer 1 — because alone it's not enough. But combined with the graph and summaries, it becomes extremely precise. It finds semantically related code that the graph might miss (e.g. a comment in a different file that explains why the framebuffer ID scheme was designed this way).
WHAT RAG FINDS THAT GRAPH MISSES
Query embedding: "framebuffer target binding wrong"
→ Finds: a comment in ARCHITECTURE.md explaining the framebuffer ID convention
→ Finds: a similar bug fix from 6 months ago in a different file
→ Finds: the test that was written for this exact scenario
ChromaDB | nomic-embed-text (local) | Chunk size: ~40 lines | Top-K retrieval
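At its core, top-K retrieval is nearest-neighbour search over chunk embeddings. A dependency-free sketch of the scoring step — in the real pipeline ChromaDB does this and the vectors come from nomic-embed-text; the toy two-dimensional vectors below are made up:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, chunks, k=3):
    """chunks: list of (metadata_dict, embedding). Returns the k chunks
    whose embeddings are most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return ranked[:k]
```

The metadata dict travels with each hit, so the budget manager in Layer 4 can later trade a chunk's full text for its summary without a second lookup.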
LAYER 4: Dynamic Context Budget Manager
The enforcer — ensures the LLM never gets more than it can handle
Layers 1–3 might return more context than the model can handle. Layer 4 is the gatekeeper. It has a fixed token budget (e.g. 3,000 tokens for code context) and fills it intelligently — highest-relevance chunks first, then graph-connected context, then summaries for anything that didn't fit.

If something important is too large to fit fully, it substitutes its pre-computed summary instead. The LLM always gets a complete, coherent picture — even if parts of it are compressed summaries. This is the key insight: summaries are lossless for reasoning, lossy only for exact syntax.
BUDGET ALLOCATION EXAMPLE (3000 token budget)
✓ setFramebuffer() full code — 180 tokens (high relevance, fits)
✓ drawFrame() full code — 240 tokens (caller, fits)
✓ FramebufferRegistry summary — 60 tokens (too large to fit fully, summary used)
✓ Related bug fix context from RAG — 150 tokens
✓ Web docs snippet — 200 tokens
✓ Issue + plan — 400 tokens
Total: 1,230 tokens. Well within budget. LLM has everything it needs.
tiktoken token counting | Relevance scoring | Summary substitution | Configurable budget
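The fill-then-substitute loop can be sketched as a greedy packer. A whitespace split stands in for tiktoken here, and the item fields are illustrative assumptions:

```python
def count_tokens(text):
    # Whitespace split as a stand-in; the real pipeline uses tiktoken.
    return len(text.split())

def fill_budget(items, budget):
    """Greedy packer: highest-relevance items first, full code when it
    fits, the pre-computed summary otherwise. Returns (context, used)."""
    context, used = [], 0
    for item in sorted(items, key=lambda i: i["relevance"], reverse=True):
        for text in (item["full"], item["summary"]):
            cost = count_tokens(text)
            if used + cost <= budget:
                context.append(text)
                used += cost
                break  # took the full code (or its summary); next item
    return context, used
```

Because every item carries its Layer 1 summary as a fallback, the packer can always include *something* for each relevant node — the picture stays complete even when it is compressed.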