open source · local-first

Local code context for AI agents, priced honestly.

A local MCP server and CLI that indexes a repo, stores team memory locally, and gives coding agents a bounded set of files, symbols, and notes before they edit.

Claude Code · one command

claude mcp add codebase-context -- npx -y codebase-context


20 frozen tasks

100 retrieval rows

99 officially scored

66× indexing cost delta

narrow case study · not a leaderboard

problem

Agents can drown in context

Agents often start by opening half the repo before they know what matters.
Memory turns into noise when every question gets the same notes.
Search tools can sound confident even when they did not retrieve enough code to edit safely.
Most benchmark writeups bury the install pain, indexing time, hardware assumptions, token bill, and failed setup paths.

product

What the tool gives the agent

Local search that combines text matches with semantic retrieval
AST-aligned chunks tied to real source structure
Symbol references, imports, and project metadata
Team memory and conventions stored locally
CLI workflows for indexing, maps, and memory
MCP server interface for coding agents
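One common way to combine text matches with semantic retrieval is reciprocal rank fusion. The sketch below is an illustration of that general technique, not a description of how codebase-context merges results. The function name, the chunk ids, and the constant `k=60` are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each list contributes 1 / (k + rank) for every id it contains, so an id
    that ranks well in both the text and the semantic list rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by id so output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical hits from a keyword search and from an embedding search.
text_hits = ["auth.py:login", "auth.py:logout", "db.py:connect"]
semantic_hits = ["session.py:refresh", "auth.py:login", "db.py:connect"]

fused = reciprocal_rank_fusion([text_hits, semantic_hits])
print(fused)
```

Ids found by both searches end up ahead of ids found by only one, which is the property a combined search usually wants before handing a bounded result set to an agent.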

engineering

The tool had to fit normal dev machines

Local-first default

Code stays on the machine unless the user explicitly opts into a cloud provider.

CPU by default

The tool has to run on a normal dev machine before any optional cloud profile is considered.

Framework-neutral core

Specialized analyzers self-register instead of leaking framework rules into shared code.
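A minimal sketch of what self-registration can look like, assuming a simple decorator-based registry. Every name here (`ANALYZERS`, `register_analyzer`, the two example analyzers and their file-suffix rules) is hypothetical and chosen only to show the pattern; it is not the project's actual code.

```python
from typing import Callable, Dict, List

# Hypothetical registry: analyzer name -> function that tags a file path.
ANALYZERS: Dict[str, Callable[[str], List[str]]] = {}


def register_analyzer(name: str):
    """Decorator that adds an analyzer to the registry when its module loads."""
    def decorator(fn):
        ANALYZERS[name] = fn
        return fn
    return decorator


@register_analyzer("react")
def react_analyzer(path: str) -> List[str]:
    # Framework rule lives here, not in the shared core.
    return ["component"] if path.endswith((".jsx", ".tsx")) else []


@register_analyzer("angular")
def angular_analyzer(path: str) -> List[str]:
    return ["component"] if path.endswith(".component.ts") else []


def analyze(path: str) -> Dict[str, List[str]]:
    """Shared core: runs every registered analyzer, knows no framework names."""
    return {name: tags for name, fn in ANALYZERS.items() if (tags := fn(path))}


print(analyze("src/App.tsx"))
print(analyze("README.md"))
```

The shared `analyze` function never mentions a framework, so adding a new analyzer means adding one decorated function rather than editing the core.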

Claim discipline

The write-up says what the run showed and blocks claims the evidence cannot support.

cost

Context is not free

A higher retrieval score is not enough if the tool takes forever to install, needs a hosted service, burns the token budget, or fails before the evaluator can judge the result.

How long does install and first setup take?
How long does the first index take?
How fast is a normal query after indexing?
How much disk and cache footprint remains?
How many tokens reach the model?
What did the benchmark orchestration itself cost?
What failed before quality scoring?
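The questions above map naturally onto one record per tool. The sketch below assumes a plain dataclass; the field names and units are illustrative choices, and the values in the usage example are placeholders, not measurements from the run.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class CostRecord:
    """One tool's cost profile; each field answers one question in the list."""
    tool: str
    install_seconds: float        # install and first setup
    first_index_seconds: float    # first (cold) index
    median_query_ms: float        # normal query after indexing
    disk_bytes: int               # disk and cache footprint left behind
    tokens_to_model: int          # tokens that actually reach the model
    orchestration_seconds: float  # cost of running the benchmark itself
    pre_scoring_failures: List[str] = field(default_factory=list)

    def reached_scoring(self) -> bool:
        # A tool that failed during setup never gets a quality score.
        return not self.pre_scoring_failures


# Placeholder values only, to show the shape of a record.
example = CostRecord(
    tool="example-tool",
    install_seconds=0.0,
    first_index_seconds=0.0,
    median_query_ms=0.0,
    disk_bytes=0,
    tokens_to_model=0,
    orchestration_seconds=0.0,
    pre_scoring_failures=["index step timed out"],
)
print(asdict(example))
print(example.reached_scoring())
```

Keeping setup failures in the same record as the timings makes it harder to report a retrieval score without also reporting what broke on the way there.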

evidence

One honest run

what the run showed

Close retrieval. Very different indexing cost.

Full benchmark

jCodeMunch

27.1% recall

11 s median index

Codebase Context

25.7% recall

750 s median index

1.4 pp recall gap. 66× indexing cost difference.

100 rows · 20 frozen open-source tasks · narrow case study, not a winner claim
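The two headline numbers can be rechecked from the figures shown above. This is a small worked calculation using only the displayed, rounded values. From those, the recall gap comes out at 1.4 pp and the indexing ratio at roughly 68×; the 66× headline presumably reflects unrounded medians, which is an assumption since they are not shown here.

```python
# Figures as displayed in the comparison (rounded).
jcodemunch = {"recall_pct": 27.1, "median_index_s": 11}
codebase_context = {"recall_pct": 25.7, "median_index_s": 750}

# Recall gap in percentage points (a difference of two percentages).
recall_gap_pp = round(
    jcodemunch["recall_pct"] - codebase_context["recall_pct"], 1
)

# Indexing cost as a ratio of median index times.
index_ratio = codebase_context["median_index_s"] / jcodemunch["median_index_s"]

print(f"recall gap: {recall_gap_pp} pp")
print(f"index ratio from rounded medians: {index_ratio:.1f}x")
```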

limits

What I will not claim

It does not show that codebase-context writes better patches.
It does not show that developers ship faster.
It does not prove codebase-context is generally superior.
It does not prove repeat stability because this run has one repeat per task and lane.
It does not prove provider-dollar or normalized clean-install cost.
The current evidence supports a narrow retrieval-overlap engineering note, not a winner table.
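The repeat-stability limit is easy to see in code: with one repeat per task and lane there is no spread to compute. The sketch below uses Python's standard `statistics.stdev`; the helper name `recall_spread` and the sample values are hypothetical placeholders, not data from the run.

```python
import statistics
from typing import List, Optional


def recall_spread(recalls: List[float]) -> Optional[float]:
    """Sample standard deviation across repeats, or None if it is undefined.

    statistics.stdev needs at least two data points, so a single repeat
    cannot say anything about run-to-run stability.
    """
    if len(recalls) < 2:
        return None
    return statistics.stdev(recalls)


# One repeat per task and lane: no stability estimate is possible.
print(recall_spread([0.25]))

# Placeholder values showing what three repeats would allow.
print(recall_spread([0.2, 0.3, 0.4]))
```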

roadmap

Before stronger claims

1. Normalize cold install, dependency cache, and index reuse before making cost-efficiency claims.
2. Decide whether provider-dollar spend is out of scope or explicitly measured.
3. Collect enough repeated judged runs per tool/task before making stability or confidence claims.
4. Improve Codebase Context cold indexing and exact code-range matching, then rerun the same frozen tasks.
5. Keep GrepAI in the appendix unless the owner reopens a capped readiness run.

local-first · open source

MCP server and CLI · runs fully offline
