Two-node compute setup. Your local machine handles fast, lightweight agents.
Your server handles the heavy thinking — Orchestrator and Planner.
Both nodes run Ollama and communicate over your local network.
⚡ PERFORMANCE NOTE
Qwen2.5-Coder:7B fits entirely in your 6GB VRAM in Q4_K_M quantization — that's ~4.5GB loaded.
Throughput is ~2–4 tokens/sec for code generation. Fast enough for an agent loop.
The server's 100GB RAM can run Qwen2.5-Coder:32B comfortably in CPU inference (~8–10 tokens/sec). We hit the server only for Orchestrator + Planner — heavy reasoning tasks that happen once per job, not in the tight loop.
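The split above can be sketched as a small routing layer: heavy roles go to the server's 32B model, everything else stays on the local 7B. This is a minimal sketch, not the actual implementation — the hostnames, ports, and role names are assumptions; only the Ollama `/api/generate` endpoint shape is standard.

```python
import json
import urllib.request

# Assumed endpoints — replace with your actual LAN addresses.
# Both nodes run Ollama on its default port (11434).
NODES = {
    "local":  {"url": "http://127.0.0.1:11434",    "model": "qwen2.5-coder:7b"},
    "server": {"url": "http://192.168.1.50:11434", "model": "qwen2.5-coder:32b"},
}

# Heavy reasoning roles hit the server once per job, not in the tight loop.
HEAVY_ROLES = {"orchestrator", "planner"}

def route(role: str) -> dict:
    """Pick the node/model pair for a given agent role."""
    node = "server" if role.lower() in HEAVY_ROLES else "local"
    return NODES[node]

def generate(role: str, prompt: str) -> str:
    """Call Ollama's /api/generate on the routed node (non-streaming)."""
    target = route(role)
    payload = json.dumps({
        "model": target["model"],
        "prompt": prompt,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{target['url']}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Any fast-loop agent (coder, reviewer, etc.) resolves to the local 7B endpoint; only `orchestrator` and `planner` cross the network, which keeps LAN latency out of the inner loop.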