Tools compared
- raw search (no index tool)
- Codebase Context
- jCodeMunch full MCP
- Repowise self-hosted MCP
- context-mode
Local benchmark case study
A local retrieval-overlap case study: 100 rows across 20 frozen open-source tasks and 5 tools. It is not a winner claim; it is a transparent read of cost, quality, and failures in one place.
jCodeMunch and Codebase Context were within 1.4 percentage points on file recall (27.1% vs 25.7%), but their median indexing times were 11.3s and 750.8s respectively. That gap is the useful signal.
Snapshot
| Tool | File recall | Index | Scored rows | What it means |
|---|---|---|---|---|
| jCodeMunch | 27.1% | 11.3s | 20 | Strongest retrieval-overlap row average in this fixed adapter run, with much cheaper indexing than Codebase Context. |
| Codebase Context | 25.7% | 750.8s | 20 | Reliable in this run and close to jCodeMunch on file/span retrieval, but expensive to index with local Transformers embeddings. |
| Repowise | 20.1% | 294.3s | 19 | Indexed and returned rows, but was heavy. Every index exited nonzero with a documented benign failure mode, and one row had no extracted context. |
| context-mode | 16.8% | 18.6s | 20 | Lightweight and stable, with lower recall in this run. |
| raw search | 4.3% | 0 ms | 20 | Deterministic lexical baseline. Useful sanity check, but not a free-form raw-agent ceiling. |
Retrieval chart (scale: 0–30%). jCodeMunch and Codebase Context are within 1.4 pp of each other. Raw search is the lexical baseline.
Setup
GrepAI is appendix-only: two bounded readiness attempts failed to produce usable indexed content. context-mode is the lightweight local lane — present in all tables but not a primary comparison target.
The raw run files are still local. Every public number should trace back to a result file before any public winner claim. Counting rule: I only count a retrieval when the task finished and the evaluator could score the returned context.
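The counting rule above can be sketched as a simple filter. This is a minimal illustration, not the benchmark's actual code: the `Row` type, its `finished` and `scoreable` fields, and the `tool-a` / `task-0N` values are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One (tool, task) attempt. All field names are assumptions for this sketch."""
    tool: str
    task_id: str
    finished: bool   # the task ran to completion
    scoreable: bool  # the evaluator could score the returned context


def counts_as_retrieval(row: Row) -> bool:
    # A row counts only when both conditions of the counting rule hold.
    return row.finished and row.scoreable


def split_rows(rows):
    """Separate counted rows from rows that belong in the failure table."""
    scored = [r for r in rows if counts_as_retrieval(r)]
    failed = [r for r in rows if not counts_as_retrieval(r)]
    return scored, failed


# Made-up rows, not benchmark data.
rows = [
    Row("tool-a", "task-01", finished=True, scoreable=True),
    Row("tool-a", "task-02", finished=True, scoreable=False),
    Row("tool-a", "task-03", finished=False, scoreable=False),
]
scored, failed = split_rows(rows)
print(len(scored), len(failed))
```

Keeping the rule in one predicate means the scored table and the failure table are always derived from the same decision.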
Results
Coverage columns show what share of the benchmark's known relevant files, symbols, code ranges, and lines the agent returned; higher is better. Failed or unscored rows count as zero in the failure-inclusive view.
| Tool | Scored | Failures | File recall | Symbols | Ranges | Lines | Index | Notes |
|---|---|---|---|---|---|---|---|---|
| jCodeMunch | 20 | — | 27.1% | 19.0% | 8.5% | 8.7% | 11.3s | Strongest retrieval-overlap row average in this fixed adapter run, with much cheaper indexing than Codebase Context. |
| Codebase Context | 20 | — | 25.7% | 14.0% | 8.6% | 8.8% | 750.8s | Reliable in this run and close to jCodeMunch on file/span retrieval, but expensive to index with local Transformers embeddings. |
| Repowise | 19 | 1 | 20.1% | 4.6% | 3.0% | 2.5% | 294.3s | Indexed and returned rows, but was heavy. Every index exited nonzero with a documented benign failure mode, and one row had no extracted context. |
| context-mode | 20 | — | 16.8% | 2.7% | 1.2% | 1.1% | 18.6s | Lightweight and stable, with lower recall in this run. |
| raw search | 20 | — | 4.3% | 0.7% | 0.5% | 0.4% | 0 ms | Deterministic lexical baseline. Useful sanity check, but not a free-form raw-agent ceiling. |
If a run crashed, timed out, or could not be judged, it goes in the failure table — not into a retroactive score.
jCodeMunch and Codebase Context are close on file and span retrieval in this fixed adapter run. jCodeMunch indexed much faster; Codebase Context stayed reliable but paid a heavy local embedding cost.
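To make the failure-inclusive view concrete, here is a small sketch of how a row average changes depending on whether failed rows stay in the denominator. The function name `mean_recall` and the sample values are assumptions for illustration; they are not taken from the run files.

```python
def mean_recall(per_row_recall, failure_inclusive=True):
    """Average per-row recall.

    per_row_recall: floats in [0, 1], with None marking a failed or unscored row.
    failure_inclusive=True keeps failed rows in the denominator as zero.
    failure_inclusive=False averages over scored rows only.
    """
    if failure_inclusive:
        values = [0.0 if r is None else r for r in per_row_recall]
    else:
        values = [r for r in per_row_recall if r is not None]
    if not values:
        return 0.0
    return sum(values) / len(values)


# Made-up example: three scored rows and one failed row.
example = [0.5, 0.25, None, 0.25]
inclusive = mean_recall(example)                             # 1.0 / 4
scored_only = mean_recall(example, failure_inclusive=False)  # 1.0 / 3
print(f"{inclusive:.3f} {scored_only:.3f}")
```

The failure-inclusive number is never higher than the scored-only number, which is why a failed row cannot quietly improve a tool's average.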
Failed attempts
| Tool | Affected rows | Failure type | What happened |
|---|---|---|---|
| Repowise | 1 row | official evaluator reported no_context_extracted | The failed row is kept in the denominator and counted as zero in the failure-inclusive metric view. |
| Repowise | 20 rows | index exited nonzero | Rows were scoreable because durable artifacts existed, but the editor-config registration failure has to be disclosed. |
| GrepAI | 2 bounded readiness attempts | appendix only | GrepAI installed/versioned cleanly, then failed to become ready after 631s and 934s with 0 files/chunks. It is not a headline quality result. |
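The source does not describe how the readiness attempts were bounded. One plausible shape is a polling loop with a hard deadline, sketched below. `wait_until_ready`, the `is_ready` callable, and the timeout values are all hypothetical; nothing here is GrepAI's API.

```python
import time


def wait_until_ready(is_ready, timeout_s, poll_s=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns (ready, seconds_waited). clock and sleep are injectable so the
    loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        if is_ready():
            return True, clock() - start
        elapsed = clock() - start
        if elapsed >= timeout_s:
            return False, elapsed
        # Never sleep past the deadline.
        sleep(min(poll_s, timeout_s - elapsed))


class FakeClock:
    """Deterministic clock for the demo: sleeping just advances the time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# Demo: a tool that never becomes ready is abandoned at the deadline.
fake = FakeClock()
ready, waited = wait_until_ready(lambda: False, timeout_s=30.0, poll_s=10.0,
                                 clock=fake, sleep=fake.sleep)
print(ready, waited)
```

Recording the waited time alongside the outcome is what allows a failed attempt to be reported with its duration rather than silently dropped.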
Cost accounting
| Metric | raw search | Codebase Context | jCodeMunch | Repowise | context-mode |
|---|---|---|---|---|---|
| Median index | 0 ms | 750.8s | 11.3s | 294.3s | 18.6s |
| Total task time | 6.8 min | 231.5 min | 10.9 min | 200.7 min | 16.0 min |
| Peak memory | 0.06 GB | 2.90 GB | 0.33 GB | 4.82 GB | 0.15 GB |
| Infrastructure | none | local Transformers embeddings | self-hosted MCP target | self-hosted MCP target | local npx/MCP path |
- raw search: No index step. This is a deterministic baseline, not the full cost of a raw coding agent.
- Codebase Context: Reliability was good in this run, but cold semantic indexing is the obvious cost problem.
- jCodeMunch: Strong value signal in this adapter run, but do not call the index cost universal until clean install/cache reuse is normalized.
- Repowise: Every index exited nonzero with a documented editor-config registration failure, despite durable artifacts and scored rows.
- context-mode: Cheap and reliable, but lower recall in this run.
Benchmark build cost: owner-reported 2.29B+ input tokens and 7.48M+ output tokens before this corrected scoring pass.
This is the cost of building, debugging, documenting, and auditing the benchmark in Codex goal mode. It is not a Codebase Context runtime cost, not a lane score, and not a provider-dollar ledger. The final Gate 5 scoring pass itself took about 72 seconds.
Scope
This is enough for a transparent engineering note. It is not enough for a leaderboard, academic-style benchmark, or product win claim.
Why it matters
The useful signal is not a trophy. Codebase Context has to justify itself on reliability, local privacy, and agent ergonomics while making its cold indexing cost clear.
The graph-vs-hybrid question is less clean than it sounds. A graph index only helps the agent if the normal tool surface can retrieve useful context from the task language. In this run, Repowise indexed locally and returned 19 scored rows out of 20, but it was heavy, every index exited nonzero, and one row still came back with no context the evaluator could extract.
The useful question is what you get for the install, first index, runtime, token spend, memory use, and failure rate.
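One way to read cost and quality side by side is to put the published numbers into a single structure. The sketch below copies its values from the Results and Cost accounting tables in this note; the `Lane` type and the helper names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    """One tool's published numbers. The type itself is hypothetical."""
    name: str
    file_recall_pct: float
    median_index_s: float
    total_task_min: float
    peak_mem_gb: float
    scored: int
    rows: int = 20


# Values copied from the Results and Cost accounting tables.
LANES = [
    Lane("jCodeMunch", 27.1, 11.3, 10.9, 0.33, 20),
    Lane("Codebase Context", 25.7, 750.8, 231.5, 2.90, 20),
    Lane("Repowise", 20.1, 294.3, 200.7, 4.82, 19),
    Lane("context-mode", 16.8, 18.6, 16.0, 0.15, 20),
    Lane("raw search", 4.3, 0.0, 6.8, 0.06, 20),
]


def failure_rate(lane: Lane) -> float:
    # Counts only unscored rows. Nonzero index exits are tracked separately.
    return (lane.rows - lane.scored) / lane.rows


by_name = {lane.name: lane for lane in LANES}
gap_pp = (by_name["jCodeMunch"].file_recall_pct
          - by_name["Codebase Context"].file_recall_pct)
index_ratio = (by_name["Codebase Context"].median_index_s
               / by_name["jCodeMunch"].median_index_s)

for lane in LANES:
    print(f"{lane.name:<17} recall={lane.file_recall_pct:5.1f}% "
          f"index={lane.median_index_s:7.1f}s "
          f"mem={lane.peak_mem_gb:4.2f}GB "
          f"unscored={failure_rate(lane):.0%}")
print(f"recall gap: {gap_pp:.1f} pp, index-time ratio: {index_ratio:.1f}x")
```

On these numbers the two leading tools differ by 1.4 pp of file recall while their median index times differ by roughly 66x. Install cost and token spend are not in the tables, so they are left out of the sketch.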
narrow case study · not a leaderboard
100 rows · 20 frozen tasks · 5 tools