
Local benchmark case study

Retrieval was close. Index cost was the story.

A local retrieval-overlap case study across 100 rows and 20 frozen open-source tasks. Not a winner claim — a transparent read of cost, quality, and failures in one place.

jCodeMunch and Codebase Context were within 1.4 percentage points on file recall. Their median indexing times were 11s vs 750s. That is the useful signal.

Snapshot

All five tools summarized

Tool | File recall | Median index time | Scored rows | What it means
jCodeMunch | 27.1% | 11.3s | 20 | Strongest retrieval-overlap row average in this fixed adapter run, with much cheaper indexing than Codebase Context.
Codebase Context | 25.7% | 750.8s | 20 | Reliable in this run and close to jCodeMunch on file/span retrieval, but expensive to index with local Transformers embeddings.
Repowise | 20.1% | 294.3s | 19 | Indexed and returned rows, but was heavy, every index exited nonzero with a documented benign failure mode, and one row had no extracted context.
context-mode | 16.8% | 18.6s | 20 | Lightweight and stable, with lower recall in this run.
raw search | 4.3% | 0 ms | 20 | Deterministic lexical baseline. Useful sanity check, but not a free-form raw-agent ceiling.

Retrieval

File recall by tool

jCodeMunch: 27.1%
Codebase Context: 25.7%
Repowise: 20.1%
context-mode: 16.8%
raw search: 4.3%

Scale: 0–30%. jCodeMunch and Codebase Context are within 1.4 pp of each other. Raw search is the lexical baseline.

Setup

How the run worked

Tools compared

  • raw search (no index tool)
  • Codebase Context
  • jCodeMunch full MCP
  • Repowise self-hosted MCP
  • context-mode

Tool placement

GrepAI is appendix-only: two bounded readiness attempts failed to produce usable indexed content. context-mode is the lightweight local lane — present in all tables but not a primary comparison target.

Evidence and caveats

The raw run files are still local. Every public number should trace back to a result file before any public winner claim. Counting rule: I only count a retrieval when the task finished and the evaluator could score the returned context.

Results

Full quality metrics across all tools

Coverage columns show how much of the benchmark's known relevant files, symbols, code ranges, and lines the agent returned — higher is better. Failed or unscored rows count as zero in the failure-inclusive view.
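As a minimal sketch of what a coverage number like file recall measures, assuming it is plain set overlap between returned and known relevant items. The official evaluator may match symbols, ranges, and lines differently; the function name and file paths below are hypothetical.

```python
def coverage(returned: set, relevant: set) -> float:
    """Fraction of the known relevant items that appear in the returned context."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(returned & relevant) / len(relevant)


# Hypothetical task: three relevant files; the tool returned one of them plus noise.
relevant_files = {"src/auth.py", "src/session.py", "tests/test_auth.py"}
returned_files = {"src/auth.py", "README.md"}

file_recall = coverage(returned_files, relevant_files)
print(f"{file_recall:.1%}")  # 33.3%
```

Extra returned files do not lower this number. It is recall only, which is one reason the report stays away from precision or patch-quality claims.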

Tool | Scored | Failures | File recall | Symbols | Ranges | Lines | Median index time | Notes
jCodeMunch | 20 | 0 | 27.1% | 19.0% | 8.5% | 8.7% | 11.3s | Strongest retrieval-overlap row average in this fixed adapter run, with much cheaper indexing than Codebase Context.
Codebase Context | 20 | 0 | 25.7% | 14.0% | 8.6% | 8.8% | 750.8s | Reliable in this run and close to jCodeMunch on file/span retrieval, but expensive to index with local Transformers embeddings.
Repowise | 19 | 1 | 20.1% | 4.6% | 3.0% | 2.5% | 294.3s | Indexed and returned rows, but was heavy, every index exited nonzero with a documented benign failure mode, and one row had no extracted context.
context-mode | 20 | 0 | 16.8% | 2.7% | 1.2% | 1.1% | 18.6s | Lightweight and stable, with lower recall in this run.
raw search | 20 | 0 | 4.3% | 0.7% | 0.5% | 0.4% | 0 ms | Deterministic lexical baseline. Useful sanity check, but not a free-form raw-agent ceiling.

If a run crashed, timed out, or could not be judged, it goes in the failure table — not into a retroactive score.
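The difference between a scored-only average and the failure-inclusive view can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's scoring code: the per-row scores are hypothetical, and `None` stands in for a row that failed or could not be judged.

```python
from typing import List, Optional


def scored_only_mean(scores: List[Optional[float]]) -> float:
    """Average over rows the evaluator could score; failed rows are dropped."""
    scored = [s for s in scores if s is not None]
    return sum(scored) / len(scored)


def failure_inclusive_mean(scores: List[Optional[float]]) -> float:
    """Average over all rows; failed or unscored rows stay in the denominator as zero."""
    return sum(0.0 if s is None else s for s in scores) / len(scores)


# Hypothetical lane: 19 scored rows at 0.25 recall and one unscored row.
rows = [0.25] * 19 + [None]

print(f"{scored_only_mean(rows):.4f}")        # 0.2500
print(f"{failure_inclusive_mean(rows):.4f}")  # 0.2375
```

Keeping the failed row in the denominator means a tool cannot improve its average by failing on hard tasks.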

jCodeMunch and Codebase Context are close on file and span retrieval in this fixed adapter run. jCodeMunch indexed much faster; Codebase Context stayed reliable but paid a heavy local embedding cost.
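To put numbers on "close" and "much faster", here is the arithmetic using only the file recall and median index figures reported in this note. The variable names are mine.

```python
# Values copied from the results reported in this note.
file_recall_pct = {"jCodeMunch": 27.1, "Codebase Context": 25.7}
median_index_s = {"jCodeMunch": 11.3, "Codebase Context": 750.8}

# Recall gap in percentage points (rounded to hide float noise).
gap_pp = round(file_recall_pct["jCodeMunch"] - file_recall_pct["Codebase Context"], 1)

# How many times longer the slower index took.
ratio = median_index_s["Codebase Context"] / median_index_s["jCodeMunch"]

print(gap_pp)           # 1.4
print(round(ratio, 1))  # 66.4
```

A 1.4 pp recall gap next to a roughly 66x index-time gap is the whole point of the headline. With one repeat per tool/task, neither figure carries a confidence interval.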

Failed attempts

Disclosed, not discarded

Tool | Affected rows | Failure type | What happened
Repowise | 1 row | official evaluator reported no_context_extracted | The failed row is kept in the denominator and counted as zero in the failure-inclusive metric view.
Repowise | 20 rows | index exited nonzero | Rows were scoreable because durable artifacts existed, but the editor-config registration failure has to be disclosed.
GrepAI | 2 bounded readiness attempts | appendix only | GrepAI installed/versioned cleanly, then failed to become ready after 631s and 934s with 0 files/chunks. It is not a headline quality result.
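A "bounded readiness attempt" can be sketched as a poll loop with a hard timeout. This is a generic illustration, not the harness used in this run: `wait_until_ready`, the `is_ready` probe, and the timeout values are all hypothetical, and the actual bounds used for GrepAI are not stated here.

```python
import time
from typing import Callable, Tuple


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, float]:
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns (ready, elapsed_seconds). The clock and sleep functions are
    injectable so the loop can be tested without real waiting.
    """
    start = clock()
    while True:
        if is_ready():
            return True, clock() - start
        elapsed = clock() - start
        if elapsed >= timeout_s:
            return False, elapsed
        # Never sleep past the deadline.
        sleep(min(poll_s, timeout_s - elapsed))


# Trivial demo: a probe that is ready immediately.
ready, elapsed = wait_until_ready(lambda: True, timeout_s=1.0)
print(ready)  # True
```

A lane that returns `(False, elapsed)` goes into the failure table with its elapsed time, rather than being retried until it produces a score.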

Cost accounting

What each tool cost

Metric | raw search | Codebase Context | jCodeMunch | Repowise | context-mode
Median index | 0 ms | 750.8s | 11.3s | 294.3s | 18.6s
Total task time | 6.8 min | 231.5 min | 10.9 min | 200.7 min | 16.0 min
Peak memory | 0.06 GB | 2.90 GB | 0.33 GB | 4.82 GB | 0.15 GB
Infrastructure | none | local Transformers embeddings | self-hosted MCP target | self-hosted MCP target | local npx/MCP path

raw search

No index step. This is a deterministic baseline, not the full cost of a raw coding agent.

Codebase Context

Reliability was good in this run, but cold semantic indexing is the obvious cost problem.

jCodeMunch

Strong value signal in this adapter run, but do not call the index cost universal until clean install/cache reuse is normalized.

Repowise

Every index exited nonzero with a documented editor-config registration failure, despite durable artifacts and scored rows.

context-mode

Cheap and reliable, but lower recall in this run.

Benchmark orchestration

Owner-reported 2.29B+ input tokens and 7.48M+ output tokens before this corrected scoring pass

This is the cost of building, debugging, documenting, and auditing the benchmark in Codex goal mode. It is not a Codebase Context runtime cost, not a lane score, and not a provider-dollar ledger. The final Gate 5 scoring pass itself took about 72 seconds.

Missing measurements

  • The run has one repeat per tool/task, so stability and confidence claims are blocked.
  • Patch correctness and downstream agent edit quality were not measured.
  • Provider-dollar spend was not measured because no paid provider path was approved.
  • Clean-machine install, package-cache reuse, and cold-vs-warm dependency state still need normalization.
  • Query latency is not uniformly separated from index/setup time for every lane.
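One way to keep index/setup time apart from query latency is to time the two phases separately. The sketch below assumes a toy in-memory inverted index; `build_index` and `run_query` are hypothetical stand-ins, not any tool's real API.

```python
import time
from statistics import median


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def build_index(docs):
    """Toy inverted index: word -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def run_query(index, query):
    """Return ids of documents containing any query word."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    return set().union(*hits)


docs = {"a.py": "auth login", "b.py": "token refresh"}

# Phase 1: one-time index cost.
index, index_seconds = timed(build_index, docs)

# Phase 2: per-query latency, reported separately from the index cost.
query_seconds = []
for query in ["auth token", "refresh", "missing"]:
    hits, seconds = timed(run_query, index, query)
    query_seconds.append(seconds)

print(f"index: {index_seconds:.6f}s, median query: {median(query_seconds):.6f}s")
```

Reporting the two numbers side by side lets a reader see whether a tool is slow once (cold index) or slow every time (per query).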

Scope

What is not in this run

  • The public winner-claim gate is still closed: the current report does not allow a public winner claim.
  • The blockers are one repeat per task/lane, retrieval-overlap only, no provider-dollar ledger, and one Repowise no-context row.

This is enough for a transparent engineering note. It is not enough for a leaderboard, academic-style benchmark, or product win claim.

Why it matters

The question isn't who won

The useful signal is not a trophy. Codebase Context has to justify itself on reliability, local privacy, and agent ergonomics while making its cold indexing cost clear.

The graph-vs-hybrid question is less clean than it sounds. A graph index only helps the agent if the normal tool surface can retrieve useful context from the task language. In this run, Repowise indexed locally and returned many scored rows, but it was heavy, brittle, and still produced one no-context evaluator row.

The useful question is what you get for the install, first index, runtime, token spend, memory use, and failure rate.

Limits

What this does not prove

  • It does not prove Codebase Context beats raw-native.
  • It does not prove patch correctness.
  • It does not prove productivity gains.
  • It does not prove repeat stability or confidence intervals.
  • It does not turn setup failures into competitor losses.
  • It does not support resource-use or generalized coding-quality claims.

Next run

Before making stronger claims

  1. Rerun enough repeated judged runs per tool/task to talk about stability.
  2. Normalize install, dependency cache, cold index, and warm query measurement.
  3. Decide whether provider-dollar spend will be measured or explicitly excluded.
  4. Keep setup failures, tool failures, and evaluator failures visible instead of hiding them from the result.
  5. Improve Codebase Context indexing speed and exact code-range matching, then rerun the same frozen tasks.
  6. Publish only the narrow engineering note until a fresh evidence audit passes.
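For the first step, a minimal sketch of why one repeat blocks stability claims and what a repeated run would let me report. The scores are hypothetical placeholders, not results from this run, and `summarize_repeats` is an illustrative helper.

```python
from statistics import mean, stdev


def summarize_repeats(scores):
    """Summarize repeated scores for one tool/task pair.

    Sample standard deviation needs at least two repeats; with one repeat
    the spread is reported as None instead of a misleading zero.
    """
    spread = stdev(scores) if len(scores) >= 2 else None
    return {"n": len(scores), "mean": mean(scores), "stdev": spread}


# Hypothetical: the current situation (one repeat) vs. a future rerun (three repeats).
single = summarize_repeats([0.27])
repeated = summarize_repeats([0.25, 0.30, 0.20])

print(single["stdev"])                # None
print(round(repeated["mean"], 3))     # 0.25
print(round(repeated["stdev"], 3))    # 0.05
```

Only the repeated case gives a spread that could be compared against a 1.4 pp gap between two tools.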

narrow case study · not a leaderboard

100 rows · 20 frozen tasks · 5 tools