When Code Moves Faster Than the Team
Code generation got cheap. Alignment did not.
Every few months, another agent benchmark lands. SWE-bench measures whether an agent can produce a correct patch from repo-local evidence. CooperBench measures coordination costs when agents work on overlapping tasks. Both are useful. Neither measures what happens when the relevant project truth is not fully present in the repo, but does exist in prior team reasoning.
The question behind this benchmark is simple: when a fresh session needs alignment, can it recover the prior team reasoning that current code no longer fully expresses?
We think that gap defines a benchmark category that needs to exist.
The problem: existing benchmarks don't measure this
SWE-bench is good at what it tests: can an agent read a repo, understand a bug report, and produce a working patch? Real skill, but also a setting where all the evidence needed is local to the repository.
CooperBench showed that shared visibility changes outcomes when agents work on overlapping files. That matters for coordination. It still does not capture continuity: whether a session can recover a prior decision, distinguish current truth from superseded truth, or reconstruct the reasoning behind a choice that never fully collapsed into code.
Teams and agents already use memory strategies: CLAUDE.md files, session summaries, expanded context windows, retrieval-augmented generation over docs. The gap is not "no memory." Most memory strategies just don't recover which prior understanding applies to the current decision, which path was abandoned and why, or what changed between the early framing and the current one.
A session fails not because it cannot read a file, but because it does not know which past decision matters, which earlier path was abandoned, or what the current state of understanding actually is.
We set out to measure that class of work.
The category we built: what "project memory" tasks look like
These are not generic engineering puzzles. They are tasks where the answer depends on prior team reasoning:
- a decision already made
- an investigation from an earlier session
- a conclusion later corrected
- the distinction between what used to be true and what is true now
- an answer that only becomes clear by connecting multiple threads
By anchored context rehydration, we mean something specific: the anchor is a concept, artifact, or decision point. Rehydration means reconstructing the frame around that anchor: the investigation that led to it, what was tried and abandoned, and what understanding was current when the decision was made. Multi-hop reconstruction of team project thinking, not session replay.
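To make that concrete, here is a minimal sketch of what a rehydrated frame might contain. The field names are hypothetical illustrations, not a schema from our system:

```python
from dataclasses import dataclass, field

@dataclass
class RehydratedFrame:
    """Hypothetical shape of an anchored-context-rehydration result.

    The anchor is a concept, artifact, or decision point; the rest is the
    reconstructed frame around it.
    """
    anchor: str                                                # e.g. "bot-proxy-pattern decision"
    investigation: list[str] = field(default_factory=list)    # threads and entries that led to it
    abandoned_paths: list[str] = field(default_factory=list)  # what was tried and dropped, and why
    understanding_then: str = ""                               # what was believed true when the call was made
    understanding_now: str = ""                                # what is believed true today, if it differs
```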
Task families with concrete examples
We organized 27 tasks into 8 families. Each family tests a different kind of project memory recovery. Here are concrete examples of what each one asks.
Decision recall — can the agent recover a specific prior decision?
"The team has an established policy about testing in production. Write a file stating the policy, the branch-to-environment mapping, and which deployment methods are allowed. Do NOT make up a policy — find the one that was already established."
Context rehydration — can the agent reconstruct the state of an earlier investigation?
"There was a major database transfer spike incident. Write a file summarizing the root cause, the data volume involved, what optimization phases were applied, and the before-and-after transfer rates. Do NOT speculate — find the actual incident details."
Temporal reasoning — can the agent distinguish current truth from superseded truth?
"There was a Prisma migration failure. The initial analysis was later CORRECTED. Write a file with the CORRECT root cause, what the incorrect earlier analysis said, and why it was wrong. IMPORTANT: there are two analyses. You MUST use the corrected one."
Cross-thread synthesis — can the agent reconstruct an answer from multiple threads?
"When Vercel deployments trigger Prisma migrations, failures can occur from multiple causes. Write a file covering both P1002 advisory lock timeout and P3009 failed migration scenarios, with specific details from THIS project."
Supersession — can the agent identify what replaced an earlier approach?
"The Slack integration architecture changed over time. An early design was superseded. Write a file stating the CURRENT architecture, the OLD one, and why it changed. IMPORTANT: You MUST identify the CURRENT one, not the earlier one."
Episodic recall — can the agent recover what was true at a specific point in time?
"Before January 19, 2026 at 08:30 UTC, the sync system used a different approach. Write a file describing the state BEFORE the Phase 3 optimization — the older, slower approach."
Causal-chain reconstruction — can the agent trace a multi-step cause-and-effect chain?
"There is a multi-step causal chain connecting a graph segmentation RFC to a Phase 3 performance optimization. Write a file tracing the FULL chain: RFC → implementation → optimization discovery."
Decision-trace quality — this is a single task, not a full category. Can the agent reconstruct not just a decision but the full rationale?
"Reconstruct the bot-proxy-pattern decision with full rationale: why it was chosen, what Slack API limitation drove it, what alternatives were rejected, and downstream consequences."
These are drawn from a real project's actual development threads. The answers exist in the team's recorded reasoning. An agent that cannot access that reasoning has to guess.
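To show the shape of a task, here is a minimal sketch of how one could be represented. The field names, file path, and expected strings are illustrative only; the real tasks pin their expected strings to the project's recorded history:

```python
from dataclasses import dataclass

@dataclass
class MemoryTask:
    """One project-memory task: a family, a prompt, an output file, and grep targets."""
    family: str               # e.g. "temporal_reasoning"
    prompt: str               # the instruction handed to the agent
    output_file: str          # the file the agent is asked to write
    must_contain: list[str]   # substrings required for a pass; binary, no partial credit

# Illustrative values only (hypothetical path and placeholder grep targets).
example = MemoryTask(
    family="temporal_reasoning",
    prompt="There was a Prisma migration failure. The initial analysis was later CORRECTED. ...",
    output_file="answers/prisma-migration-root-cause.md",
    must_contain=["<corrected root cause phrase>", "<specific error code>"],
)
```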
How we test it
The agent gets a fresh Docker container with the project codebase (a Next.js application with Prisma, Vercel deployment, Slack integration). No prior session history. No warm context.
Three conditions (a configuration sketch follows the list):
- Baseline: terminal and file editor only. The agent can read code, run commands, inspect the repo. No project memory.
- T1: adds retrieval over structured project threads and entries, including semantic search over metadata-rich team records.
- T2: adds temporal, episodic, and relational organization on top of T1, giving graph structure over the same evidence surface. T2 was in active development during this run.
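Here is how the conditions differ in tooling, as a configuration sketch. The names are placeholders, not the harness's real API:

```python
# Hypothetical condition -> toolset mapping; tool names are placeholders.
CONDITIONS = {
    "baseline": ["terminal", "file_editor"],                    # repo access only, no project memory
    "t1": ["terminal", "file_editor", "thread_search"],         # + semantic retrieval over project threads
    "t2": ["terminal", "file_editor", "thread_search",
           "temporal_index", "relation_graph"],                  # + episodic, temporal, and graph structure
}
```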
The model is DeepSeek Chat with a 25-step budget per task. Each task asks the agent to write a file. Scoring is binary grep: did the output contain the right endpoint path, the right error code, the right data volume, the right root cause? No LLM judge. No partial credit.
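Concretely, grading can be as simple as the following sketch of the binary-grep idea (not the harness's exact code); it pairs with the hypothetical MemoryTask fields above:

```python
from pathlib import Path

def grade(output_file: str, must_contain: list[str]) -> bool:
    """Binary pass/fail: every expected substring must appear in the file the agent wrote."""
    path = Path(output_file)
    if not path.exists():
        return False
    text = path.read_text(encoding="utf-8", errors="ignore").lower()
    return all(needle.lower() in text for needle in must_contain)
```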
The test data is real. These are actual development threads from the team's project history, pinned at a fixed tag so the evidence surface is stable across runs.
Early evidence
These are results from one model on one run. They are directional, not final.
| Category | Tasks | Baseline | T1 | T2 |
|---|---|---|---|---|
| Decision recall | 5 | 1/5 | 5/5 | 5/5 |
| Context rehydration | 4 | 0/4 | 4/4 | 3/4 |
| Temporal reasoning | 3 | 0/3 | 3/3 | 3/3 |
| Cross-thread synthesis | 4 | 1/4 | 3/4 | 4/4 |
| Supersession | 4 | 1/4 | 3/4 | 3/4 |
| Episodic recall | 3 | 1/3 | 0/3 | 1/3 |
| Causal-chain reconstruction | 3 | 0/3 | 3/3 | 3/3 |
| Decision-trace quality | 1 | 0/1 | 1/1 | 1/1 |
| Total | 27 | 4/27 | 22/27 | 23/27 |
A caveat on T2: it was in active development with known bugs during this run. The +1 task over T1 is modest, but the pattern is informative: selective wins on cross-thread synthesis, reduced retrieval overhead on supersession. We plan to rerun after fixes land.
What the pattern shows
The headline is not "memory helps." The interesting findings are more specific.
The most practically important result is that T1 captures most of the gain. Thread-level semantic retrieval alone takes the agent from 4/27 to 22/27. Decision recall, temporal reasoning, context rehydration, and causal-chain reconstruction all improve sharply. T1 is already structured, semantic, and meaningful, not just keyword search over flat notes.
The comparison between T1 and T2 is not "notes vs. memory." It is thread-level semantic retrieval vs. richer episodic and graph structure over the same evidence surface. T2 adds value selectively. Cross-thread synthesis is the best example: the agent has to reconstruct an answer from multiple threads, and T2 gets 4/4 where T1 gets 3/4. On supersession, T1 and T2 tie on accuracy but T2 uses 20% fewer Watercooler calls. Richer structure reduces wasted retrieval.
T1 also occasionally outperforms T2. Context rehydration: T1 gets 4/4, T2 gets 3/4. On simple recovery tasks, thread-based search is already the right abstraction. More structure does not automatically help.
Episodic recall remains hard: scores stay low in every condition (1/3 baseline, 0/3 T1, 1/3 T2). Recovering what the system state was before a specific fix requires a temporal boundary that neither tier makes easy to reconstruct. This is the hardest family in the benchmark and the one most useful to us internally.
What the benchmark is teaching us
The benchmark has turned into more than a scorecard. It is showing us where the next bottlenecks are.
Episodic recall is the most obvious. Asking what was true before a fix landed is harder than recalling a decision or summarizing a thread. It requires a cleaner temporal boundary than we currently make easy to recover.
Supersession needs more explicit traces of what changed, what replaced it, and why. Agents can often recover the current answer, but the benchmark shows the gap in surfacing correction rather than just final state.
Decision-trace reconstruction shows a related gap: thread retrieval helps, but rationale, rejected alternatives, and downstream consequences are still not captured explicitly enough.
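One way to read these three gaps together: the records an agent retrieves could carry explicit fields for what it currently has to infer. A hypothetical sketch, not an existing schema; every field here is a design hypothesis:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ThreadEntry:
    """Hypothetical entry schema carrying the fields the weak families seem to need."""
    id: str
    written_at: datetime                      # temporal boundary: lets "what was true before X" filter cleanly
    supersedes: str | None = None             # supersession: explicit link to the entry this one corrects or replaces
    rationale: str = ""                       # decision trace: why this choice was made
    rejected_alternatives: list[str] = field(default_factory=list)  # what was considered and dropped
    consequences: list[str] = field(default_factory=list)           # downstream effects worth recalling later
```

Under that sketch, episodic recall becomes a filter on written_at and supersession becomes a walk along supersedes links, rather than something the agent has to reconstruct from prose.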
The first run is useful mainly because it sharpens the next set of questions.
What comes next
The collaboration patterns we observe qualitatively in development suggest further benchmark families around handoff recovery, closure recovery, and correction tracing.
The next round of testing will ask whether stronger temporal boundary markers, clearer supersession links, richer rationale fields, and post-fix T2 runs materially improve performance on the families that remain weak.
These are not conclusions about the final architecture. They are design hypotheses for the next iteration.
As coding agents improve, the bottleneck shifts. The problem is less about generation now. It is about continuity: how a fresh session inherits real project context, how an agent distinguishes current truth from superseded truth, how a team reuses prior reasoning instead of reconstructing it every time.
Git preserves what changed. What teams increasingly need is enough of the why, when, and how understanding changed that work can resume from more than residue.
We think this is a benchmark category that needs to exist.