Comparing investigation efficiency and accuracy across retrieval modes.
The benchmark indicates that advanced retrieval materially improves investigation speed in CoursePro. Synapse provides the biggest acceleration effect, while the combined documentation-plus-semantic-retrieval mode offers the best balance of speed and answer quality.
- **Baseline:** Plain codebase access only: file search, grep, git, and standard repository navigation tools.
- **Docs Only:** Adds the coursepro-docs MCP server for structured documentation lookup on top of the baseline tooling.
- **Synapse Only:** An HTTP semantic code-intelligence layer (symbols, definitions, callers, routes, and pattern lookups) in addition to the baseline tools.
- **Combined:** Both the documentation and semantic-retrieval layers active simultaneously; the richest available setup.
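Read together, the four modes form a simple tool-layer hierarchy. A minimal sketch (illustrative only: the mode labels match the results table and the tool names come from the mode descriptions above; this is not an actual configuration file):

```python
# Illustrative summary only, not a real CoursePro config.
BASE_TOOLS = ["file search", "grep", "git"]

MODES = {
    "Baseline": BASE_TOOLS,
    "Docs Only": BASE_TOOLS + ["coursepro-docs MCP"],
    "Synapse Only": BASE_TOOLS + ["Synapse semantic layer"],
    "Combined": BASE_TOOLS + ["coursepro-docs MCP", "Synapse semantic layer"],
}
```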
- **T1:** The first point at which the agent identifies the correct subsystem, module, or file cluster.
- **T2:** The point at which the agent reaches the main implementation file, route handler, or core runtime location relevant to the task.
- **T3:** The point at which the agent has enough information to produce a developer-usable answer.
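Under these definitions, each benchmark run yields three milestone timings, and the reported figures are per-mode averages. A minimal sketch of that aggregation, using invented run data (the two runs below are chosen purely for illustration so the averages happen to match the Synapse row of the results table):

```python
from statistics import mean

# Invented example runs; milestone times are seconds to T1, T2, T3.
runs = [
    {"T1": 2.0, "T2": 4.5, "T3": 7.5},
    {"T1": 2.2, "T2": 4.9, "T3": 7.9},
]

# Average each milestone across runs, as in the results table.
averages = {m: mean(run[m] for run in runs) for m in ("T1", "T2", "T3")}
print(averages)  # {'T1': 2.1, 'T2': 4.7, 'T3': 7.7}
```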
- **Weighted Accuracy:** Percentage alignment with the hidden benchmark answer key; higher is better.
- **Avg T1/T2/T3:** Seconds to reach milestones T1, T2, and T3; lower is better.
| Metric | Baseline | Docs Only | Synapse Only | Combined |
|---|---|---|---|---|
| Weighted Accuracy | 60% | 70% | 65% | 70% |
| Avg T1 | 28.0s | 28.0s | 2.1s | 5.9s |
| Avg T2 | 106.5s | 64.0s | 4.7s | 22.0s |
| Avg T3 | 255.0s | 123.0s | 7.7s | 59.0s |
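The headline speed comparisons can be derived directly from the Avg T3 row of this table (seconds to a developer-usable answer). A small sketch of that arithmetic:

```python
# Avg T3 values taken from the table above, in seconds.
avg_t3 = {"Baseline": 255.0, "Docs Only": 123.0, "Synapse Only": 7.7, "Combined": 59.0}

baseline = avg_t3["Baseline"]
for mode, seconds in avg_t3.items():
    # Relative reduction in time-to-answer versus Baseline.
    reduction = (baseline - seconds) / baseline * 100
    print(f"{mode}: {reduction:.0f}% reduction vs Baseline")
# Combined prints: "Combined: 77% reduction vs Baseline"
```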
- **Baseline:** Useful as a control; operationally the least attractive mode.
- **Docs Only:** Strong for stable, grounded answers, especially for intended-behaviour understanding. Slower than Combined, but high quality.
- **Synapse Only:** The best pure acceleration mode: ideal for fast technical orientation, and strongest when paired with code confirmation.
- **Combined:** The most robust and defensible operating mode; recommended default for real-world CoursePro investigation.
The benchmark repeatedly showed the importance of distinguishing documented behaviour from runtime behaviour. In HomePortal member flows, Swagger and supporting documentation implied broader support, whereas runtime applied stricter conditions — specifically booking-strategy and bridge capability checks referenced by files such as PhotoVideoConsent.php and related handlers.
Retrieval quality is therefore not only about finding code quickly — it is also about knowing whether the code confirms or contradicts the documented model.
Where docs imply unconditional support for behaviour that is actually conditional at runtime, the docs should be updated to make those conditions explicit.
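To make the gap concrete, here is a deliberately hypothetical sketch (the Member type, booking-strategy values, and capability name are all invented, not CoursePro source): documentation that presents consent management as unconditionally supported can correspond to runtime code that gates the feature behind checks of the kind described above.

```python
from dataclasses import dataclass

# Hypothetical sketch only: every name below is invented to illustrate
# the documented-vs-runtime gap, not taken from the CoursePro codebase.
@dataclass
class Member:
    booking_strategy: str
    bridge_capabilities: frozenset

    def bridge_supports(self, capability: str) -> bool:
        return capability in self.bridge_capabilities

def consent_feature_available(member: Member) -> bool:
    # Docs-level claim: members can manage photo/video consent (unconditional).
    # Runtime-level reality: gated by booking strategy and bridge capability.
    if member.booking_strategy not in {"standard", "flexible"}:
        return False
    return member.bridge_supports("photo_video_consent")
```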
**Scenario 1:** Synapse and Combined located PhotoVideoConsent.php and its associated handlers rapidly via symbol and route lookup. Baseline and Docs Only required broader filename and grep-style scanning to reach the same endpoint cluster. Accuracy was comparable across modes once the code was confirmed; speed was the dominant differentiator.

**Scenario 2:** Docs Only and Combined had a meaningful edge in describing the intended send behaviour. Synapse was fastest at identifying the runtime dispatch path but missed one architectural nuance captured by the documentation, a clear example of documentation acting as a stabiliser.

**Scenario 3:** Synapse excelled at tracing the checkout handler chain and provider adapters. Combined mode reached the same answer and additionally aligned the explanation with the documented provider contracts. Baseline took the longest, repeatedly revisiting unrelated payment helpers.

**Scenario 4:** Rule discovery favoured Combined mode: the docs supplied business context, while Synapse pinpointed the conditional branches in the allocation service. Docs Only reached the correct answer but needed extra code-verification steps. Baseline struggled to bridge business terminology to code identifiers.

**Scenario 5:** A tightly scoped regex and validator task. All modes converged on the correct validator, but Synapse was dramatically faster at locating it. Documentation was thin for this scenario, which reduced any Combined-mode docs advantage.

**Scenario 6:** The IS_BETWEEN operator required understanding both the configured operator set and the runtime evaluator. Combined mode gave the clearest end-to-end explanation. Synapse alone missed the documented semantic wrapper; Docs Only alone missed the runtime evaluator edge cases.
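As an illustration of why both halves matter, here is a hypothetical evaluator sketch (the function name, signature, and inclusive-bounds behaviour are assumptions for demonstration, not CoursePro's implementation):

```python
# Hypothetical sketch only: CoursePro's actual operator registry and
# runtime evaluator are not shown here; inclusive bounds are assumed.
def evaluate_operator(op: str, value, low=None, high=None) -> bool:
    if op == "IS_BETWEEN":
        # Detail of the kind docs alone can miss: are bounds inclusive?
        return low <= value <= high
    raise ValueError(f"unsupported operator: {op!r}")
```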
**Scenario 7:** A UI-to-backend tracing task. Synapse and Combined both traced the selection flow quickly via route and symbol lookups. Docs Only was accurate but slower, and Baseline needed the most scanning to connect front-end components to backend selection services.

**Scenario 8:** This scenario was more open-ended than intended and would benefit from a tighter definition in future rounds. Synapse was strongest on the route-to-handler hop; Combined added value at the handler-to-tests hop, where documentation referenced test fixtures.

**Scenario 9:** Form-to-model mapping was handled well by Combined mode, which paired documented form-field contracts with Synapse-located mappers. Baseline and Docs Only arrived at the answer more slowly and with more dead ends on misleadingly named helper files.

**Scenario 10:** The clearest demonstration of the case study in Section 8: the documented behaviour implied a simpler model than the runtime enforced, and only Combined mode reliably surfaced and reconciled the difference. Like Scenario 8, this scenario is open-ended and would benefit from a more prescriptive answer key.
The benchmark is directionally strong but not a fully instrumented evaluation. Key limitations, as noted above, include:

- Scenarios 8 and 10 were more open-ended than intended and would benefit from tighter definitions and more prescriptive answer keys in future rounds.
- Accuracy is measured as percentage alignment with a hidden answer key rather than through instrumented scoring.
Baseline is workable but slow. Docs Only materially improves quality. Synapse is the dominant speed accelerator. Combined is the strongest overall profile.
The fastest single mode is Synapse; the recommended default is Combined, which reduces Baseline investigation time (Avg T3) by 77% while maintaining peak accuracy.