Local benchmark case study

Retrieval was close. Index cost was the story.

A local retrieval-overlap case study across 100 rows and 20 frozen open-source tasks. Not a winner claim — a transparent read of cost, quality, and failures in one place.

jCodeMunch and Codebase Context were within 1.4 percentage points on file recall. Their median indexing times were 11s vs 750s. That is the useful signal.

Scoring and limits GitHub Product page

snapshot

All five tools summarized

Tool	File recall	Index	Scored rows	What it means
jCodeMunch	27.1%	11.3s	20	Strongest retrieval-overlap row average in this fixed adapter run, with much cheaper indexing than Codebase Context.
Codebase Context	25.7%	750.8s	20	Reliable in this run and close to jCodeMunch on file/span retrieval, but expensive to index with local Transformers embeddings.
Repowise	20.1%	294.3s	19	Indexed and returned rows, but was heavy; every index exited nonzero with a documented benign failure mode, and one row had no extracted context.
context-mode	16.8%	18.6s	20	Lightweight and stable, with lower recall in this run.
raw search	4.3%	0 ms	20	Deterministic lexical baseline. Useful sanity check, but not a free-form raw-agent ceiling.

Retrieval

File recall by tool

jCodeMunch: 27.1%
Codebase Context: 25.7%
Repowise: 20.1%
context-mode: 16.8%
raw search: 4.3%

Scale: 0–30%. jCodeMunch and Codebase Context are within 1.4 pp of each other. Raw search is the lexical baseline.
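The two headline gaps can be checked directly from the summary table. A minimal sketch, using only the published file-recall and median-index figures (the variable names are illustrative, not part of the benchmark harness):

```python
# Figures copied from the summary table; names are illustrative only.
file_recall_pct = {"jCodeMunch": 27.1, "Codebase Context": 25.7}
median_index_s = {"jCodeMunch": 11.3, "Codebase Context": 750.8}

# The difference between two percentages is expressed in percentage points (pp).
recall_gap_pp = round(
    file_recall_pct["jCodeMunch"] - file_recall_pct["Codebase Context"], 1
)

# How many times longer the slower median index took.
index_ratio = median_index_s["Codebase Context"] / median_index_s["jCodeMunch"]

print(f"recall gap: {recall_gap_pp} pp")          # recall gap: 1.4 pp
print(f"index time ratio: {index_ratio:.1f}x")    # index time ratio: 66.4x
```

On these numbers the recall gap is small, while the median index time differs by roughly a factor of 66.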

setup

How the run worked

Tools compared

raw search (no index tool)
Codebase Context
jCodeMunch full MCP
Repowise self-hosted MCP
context-mode

Tool placement

GrepAI is appendix-only: two bounded readiness attempts failed to produce usable indexed content. context-mode is the lightweight local lane — present in all tables but not a primary comparison target.

Evidence and caveats

The raw run files are still local. Every public number should trace back to a result file before any public winner claim. Counting rule: I only count a retrieval when the task finished and the evaluator could score the returned context.

Results

Full quality metrics across all tools

Coverage columns show how much of the benchmark's known relevant files, symbols, code ranges, and lines the agent returned — higher is better. Failed or unscored rows count as zero in the failure-inclusive view.

Tool	Scored	Failures	File recall	Symbols	Ranges	Lines	Index	Notes
jCodeMunch	20	—	27.1%	19.0%	8.5%	8.7%	11.3s	Strongest retrieval-overlap row average in this fixed adapter run, with much cheaper indexing than Codebase Context.
Codebase Context	20	—	25.7%	14.0%	8.6%	8.8%	750.8s	Reliable in this run and close to jCodeMunch on file/span retrieval, but expensive to index with local Transformers embeddings.
Repowise	19	1	20.1%	4.6%	3.0%	2.5%	294.3s	Indexed and returned rows, but was heavy; every index exited nonzero with a documented benign failure mode, and one row had no extracted context.
context-mode	20	—	16.8%	2.7%	1.2%	1.1%	18.6s	Lightweight and stable, with lower recall in this run.
raw search	20	—	4.3%	0.7%	0.5%	0.4%	0 ms	Deterministic lexical baseline. Useful sanity check, but not a free-form raw-agent ceiling.
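As a rough illustration of what a coverage column measures, the sketch below computes set-overlap recall for a single row. It is an assumption about the general shape of the metric, not the official evaluator's code; the function name and the file paths are hypothetical.

```python
def recall(returned, relevant):
    """Fraction of known relevant items that appear in the returned context."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("a scored row needs at least one known relevant item")
    return len(set(returned) & relevant) / len(relevant)


# Hypothetical row: four known relevant files, two files returned by the tool.
relevant_files = ["src/auth.py", "src/session.py", "tests/test_auth.py", "docs/auth.md"]
returned_files = ["src/auth.py", "src/utils.py"]

# Only src/auth.py overlaps, so one of four relevant files was recovered.
print(recall(returned_files, relevant_files))  # 0.25
```

The same function would apply to symbols, ranges, or lines if each is represented as a hashable item, though the real evaluator may match ranges and lines differently.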

If a run crashed, timed out, or could not be judged, it goes in the failure table — not into a retroactive score.

jCodeMunch and Codebase Context are close on file and span retrieval in this fixed adapter run. jCodeMunch indexed much faster; Codebase Context stayed reliable but paid a heavy local embedding cost.

Failed attempts

Disclosed, not discarded

Tool	Affected rows	Failure type	What happened
Repowise	1 row	official evaluator reported no_context_extracted	The failed row is kept in the denominator and counted as zero in the failure-inclusive metric view.
Repowise	20 rows	index exited nonzero	Rows were scoreable because durable artifacts existed, but the editor-config registration failure has to be disclosed.
GrepAI	2 bounded readiness attempts	appendix only	GrepAI installed/versioned cleanly, then failed to become ready after 631s and 934s with 0 files/chunks. It is not a headline quality result.
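The difference between a scored-only average and the failure-inclusive view can be sketched as follows. The row records, field names, and values are hypothetical and chosen only to make the arithmetic visible; they are not taken from the run files.

```python
from statistics import mean


def scored_only_mean(rows):
    """Average recall over rows the evaluator could score."""
    scored = [r["recall"] for r in rows if r["status"] == "scored"]
    return mean(scored) if scored else 0.0


def failure_inclusive_mean(rows):
    """Average recall with failed or unscored rows kept in the denominator as zero."""
    if not rows:
        return 0.0
    return mean(r["recall"] if r["status"] == "scored" else 0.0 for r in rows)


# Hypothetical rows: three scored, one with no extracted context.
rows = [
    {"task": "t1", "status": "scored", "recall": 0.5},
    {"task": "t2", "status": "scored", "recall": 0.25},
    {"task": "t3", "status": "scored", "recall": 0.25},
    {"task": "t4", "status": "no_context_extracted", "recall": None},
]

print(scored_only_mean(rows))        # 1.0 / 3 scored rows
print(failure_inclusive_mean(rows))  # 1.0 / 4 total rows = 0.25
```

Keeping the failed row in the denominator lowers the average, which is the intended effect: a failure cannot improve a tool's score by disappearing from the count.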

Cost accounting

What each tool cost

Metric	raw search	Codebase Context	jCodeMunch	Repowise	context-mode
Median index	0 ms	750.8s	11.3s	294.3s	18.6s
Total task time	6.8 min	231.5 min	10.9 min	200.7 min	16.0 min
Peak memory	0.06 GB	2.90 GB	0.33 GB	4.82 GB	0.15 GB
Infrastructure	none	local Transformers embeddings	self-hosted MCP target	self-hosted MCP target	local npx/MCP path
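For a per-row view of the same table, total task time can be divided by the number of rows per tool. This sketch assumes each tool's total covers all 20 attempted rows (100 rows across 5 tools) and makes no claim about whether index or setup time is included in the totals.

```python
# Total task time in minutes, copied from the cost table.
total_task_min = {
    "raw search": 6.8,
    "Codebase Context": 231.5,
    "jCodeMunch": 10.9,
    "Repowise": 200.7,
    "context-mode": 16.0,
}

# Assumption: every tool attempted the same 20 frozen tasks.
ROWS_PER_TOOL = 20

per_row_min = {tool: total / ROWS_PER_TOOL for tool, total in total_task_min.items()}

for tool, minutes in sorted(per_row_min.items(), key=lambda item: item[1]):
    print(f"{tool}: {minutes:.2f} min per row")
```

These are averages over one repeat per task, so they describe this run only and say nothing about variance between tasks.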

raw search

No index step. This is a deterministic baseline, not the full cost of a raw coding agent.

Codebase Context

Reliability was good in this run, but cold semantic indexing is the obvious cost problem.

jCodeMunch

Strong value signal in this adapter run, but do not call the index cost universal until clean install/cache reuse is normalized.

Repowise

Every index exited nonzero with a documented editor-config registration failure, despite durable artifacts and scored rows.

context-mode

Cheap and reliable, but lower recall in this run.

Benchmark orchestration

Owner-reported 2.29B+ input tokens and 7.48M+ output tokens before this corrected scoring pass

This is the cost of building, debugging, documenting, and auditing the benchmark in Codex goal mode. It is not a Codebase Context runtime cost, not a lane score, and not a provider-dollar ledger. The final Gate 5 scoring pass itself took about 72 seconds.

Missing measurements

The run has one repeat per tool/task, so stability and confidence claims are blocked.
Patch correctness and downstream agent edit quality were not measured.
Provider-dollar spend was not measured because no paid provider path was approved.
Clean-machine install, package-cache reuse, and cold-vs-warm dependency state still need normalization.
Query latency is not uniformly separated from index/setup time for every lane.
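The first item is why stability claims are blocked: a spread cannot be computed from a single observation. A minimal sketch with hypothetical recall values (the helper name is illustrative):

```python
from statistics import mean, stdev


def summarize_repeats(recalls):
    """Return (mean, sample standard deviation or None) for one tool/task cell."""
    if not recalls:
        raise ValueError("no repeats recorded")
    # A sample standard deviation needs at least two observations.
    spread = stdev(recalls) if len(recalls) >= 2 else None
    return mean(recalls), spread


# One repeat, as in this run: a point estimate with no spread.
print(summarize_repeats([0.25]))  # (0.25, None)

# Three hypothetical repeats of the same tool/task: now a spread exists.
print(summarize_repeats([0.25, 0.5, 0.0]))
```

With one repeat per tool/task, every cell in the results table is a point estimate of the first kind.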

Scope

What is not in this run

The public winner-claim gate is still closed: the current report does not allow a public winner claim.
The blockers are one repeat per task/lane, retrieval-overlap-only scoring, no provider-dollar ledger, and one Repowise no-context row.

This is enough for a transparent engineering note. It is not enough for a leaderboard, academic-style benchmark, or product win claim.

Why it matters

The question isn't who won

The useful signal is not a trophy. Codebase Context has to justify itself on reliability, local privacy, and agent ergonomics while making its cold indexing cost clear.

The graph-vs-hybrid question is less clean than it sounds. A graph index only helps the agent if the normal tool surface can retrieve useful context from the task language. In this run, Repowise indexed locally and returned 19 scored rows out of 20, but it was heavy (294.3s median index, 4.82 GB peak memory), exited nonzero on every index, and still produced one no-context evaluator row.

The useful question is what you get for the install, first index, runtime, token spend, memory use, and failure rate.

Limits

What this does not prove

It does not prove Codebase Context beats raw-native.
It does not prove patch correctness.
It does not prove productivity gains.
It does not prove repeat stability or confidence intervals.
It does not turn setup failures into competitor losses.
It does not support resource-use or generalized coding-quality claims.

Next run

Before making stronger claims

1. Rerun enough repeated judged runs per tool/task to talk about stability.
2. Normalize install, dependency cache, cold index, and warm query measurement.
3. Decide whether provider-dollar spend will be measured or explicitly excluded.
4. Keep setup failures, tool failures, and evaluator failures visible instead of hiding them from the result.
5. Improve Codebase Context indexing speed and exact code-range matching, then rerun the same frozen tasks.
6. Publish only the narrow engineering note until a fresh evidence audit passes.
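One way to approach the measurement normalization in the list above is to time the cold index, the first query, and repeated warm queries as separate numbers. The sketch below uses toy stand-ins for indexing and querying; build_index, query, and the sample files are hypothetical and do not represent any tool's API.

```python
import time
from statistics import median


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for a real tool's index and query steps.
def build_index(files):
    return {name: text.lower() for name, text in files.items()}


def query(index, term):
    return sorted(name for name, text in index.items() if term in text)


files = {"auth.py": "def check_token(): ...", "util.py": "def slugify(): ..."}

# Cold index: measured once, reported on its own line.
index, cold_index_s = timed(build_index, files)

# First query after indexing, kept separate from later warm queries.
hits, first_query_s = timed(query, index, "token")

# Warm queries: repeat and report the median rather than a single sample.
warm_query_s = median(timed(query, index, "token")[1] for _ in range(5))

print(hits)  # ['auth.py']
print(f"cold index {cold_index_s:.6f}s, first query {first_query_s:.6f}s, "
      f"warm median {warm_query_s:.6f}s")
```

Reporting the three numbers separately keeps a slow first index from being folded into query latency, or the reverse.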

narrow case study · not a leaderboard

100 rows · 20 frozen tasks · 5 tools
