<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Ship's Log]]></title><description><![CDATA[Looking into how things work.]]></description><link>https://rocketvish.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!nzzl!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70914df0-f527-4d65-b046-7baa9e4af2fd_512x512.png</url><title>Ship&apos;s Log</title><link>https://rocketvish.substack.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 13 May 2026 06:11:43 GMT</lastBuildDate><atom:link href="https://rocketvish.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Vishnu Kalugotla]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[rocketvish@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[rocketvish@substack.com]]></itunes:email><itunes:name><![CDATA[Vishnu Kalugotla]]></itunes:name></itunes:owner><itunes:author><![CDATA[Vishnu Kalugotla]]></itunes:author><googleplay:owner><![CDATA[rocketvish@substack.com]]></googleplay:owner><googleplay:email><![CDATA[rocketvish@substack.com]]></googleplay:email><googleplay:author><![CDATA[Vishnu Kalugotla]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Compressing Coding Agent Context at Models' Native Representation Layer]]></title><description><![CDATA[Explores whether compressing model context at the representation level, rather than the text level, will help preserve important details more efficiently.]]></description><link>https://rocketvish.substack.com/p/compressing-coding-agent-context</link><guid isPermaLink="false">https://rocketvish.substack.com/p/compressing-coding-agent-context</guid><dc:creator><![CDATA[Vishnu Kalugotla]]></dc:creator><pubDate>Mon, 11 May 2026 17:29:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RntL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397af75e-bf92-4576-bce3-be703127c920_1781x1260.png" length="0" type="image/png"/><content:encoded><![CDATA[<h2><strong>Introduction &amp; problem</strong></h2><p>Modern agent/model workflows rely on context (essentially conversation history) to be effective. 
As context grows<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, models &#8220;forget&#8221; details, slow down, and otherwise display eroded performance, especially as this growing context is compacted<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>Every major coding agent today handles context compression through text summarization (according to public documentation).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> This post builds on past work and explores compressing at the representation level, rather than the text level, to see whether it&#8217;s possible to preserve important details more efficiently.</p><p>Below, we get deep into the details&#8212;skip to the <a href="https://rocketvish.substack.com/i/197145273/practical-implications-and-final-thoughts">practical implications section</a> for conclusions and takeaways.</p><h2><strong>What we&#8217;re tackling</strong></h2><p>Every time you send a message to an LLM, the entire conversation history gets re-sent as context in the form of English text. Coding agents feel this most acutely because their conversations are long (full of tool calls) and the details matter (a wrong file name or path results in broken code). In <a href="https://substack.com/home/post/p-194259777">Part 3 of this series</a>, I discussed how prompt caching (widely used today) reduces compute costs by ~70% by avoiding redundant computation. However, caching doesn&#8217;t mitigate the fact that infinitely growing context is difficult to manage and impacts output quality.</p><p>Today&#8217;s coding agents handle this by truncating old turns, summarizing them into shorter text, or increasing context window size.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> Claude Code, for example, auto-compacts at 95% capacity, while Cursor truncates old history. 
Note that these are text-level interventions&#8212;they modify the English conversation before sending it back to the model, even though the model internally represents this information as numerical vectors.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!NpKt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe9f3b-4fc0-4238-8d49-1bcbefda26d0_1799x470.png" alt=""><figcaption class="image-caption">Each call re-sends the full conversation as text. The model processes it, generates a response, and that gets added back for the next trip.</figcaption></figure></div><p>Seeing this made me ask: <em>what if we interacted with internal representations directly?</em></p><p>When a model processes your conversation, it builds representations in the form of numerical vectors<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> that encode its understanding of each token in context. When the model generates a response, that understanding gets turned back into English text, and the internal state is discarded<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. In the next turn, the model rebuilds everything all over again.</p><p>Below, we investigate compressing conversation context into the model&#8217;s own representation format. Our experiments compressed a 10-turn coding agent conversation to ~50% of its token count while correctly answering 7 of 8 factual questions about the conversation's content.</p><h2><strong>What we&#8217;re actually doing</strong></h2><p>When a model reads the text &#8220;the project uses <code>bcryptjs</code> for password hashing<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>,&#8221; it creates internal vectors at each of its layers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. 
These vectors encode important context<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> in a sense a human would recognize&#8212;that <code>bcryptjs</code> is a specific npm package and that it&#8217;s being used for some security function.</p><p>The way we compress is as follows: we create &#8220;<strong>virtual</strong>&#8221; <strong>tokens</strong>, which are new vectors initialized from mean-pooled embeddings<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>, and we tune them using gradient descent<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> until the model&#8217;s internal response to them matches its response to the original text. After optimization, these virtual tokens become representations that produce approximately the same internal state when the model processes them as if it had read the full text.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!nWeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361bfc35-2930-4fbc-8ca9-b8cb17865ce7_1751x1316.png" alt=""><figcaption class="image-caption">How the optimization works: we run the original tokens through the frozen model to get &#8220;target&#8221; V-vectors. Then, we iteratively adjust virtual token embeddings until they match the target.</figcaption></figure></div><p>The model itself is frozen, meaning that we&#8217;re not training or fine-tuning it. In other words, <strong>we&#8217;re finding inputs smaller than the originals that, when fed through the frozen model, approximately reproduce the internal states representing the content we want to keep</strong>.</p>
<h2><strong>How we tested this</strong></h2><h3><strong>The setup</strong></h3><p>We ran all of the experiments on a local RTX 5090<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> with 32 GB of VRAM, using Qwen 2.5 Coder 7B-Instruct at 8-bit quantization. This particular Qwen model satisfied a few requirements:</p><ul><li><p>We needed a local open-weight model because our technique requires gradient access, which API-based models don&#8217;t expose.</p></li><li><p>We couldn&#8217;t fill the entire 32 GB of VRAM with Qwen parameters because we needed room for gradient computation alongside the loaded model.</p></li></ul><h3><strong>The data</strong></h3><p>We used a (very) generic coding agent transcript recorded from our benchmark in <a href="https://substack.com/home/post/p-194259777">Part 3</a>: a 57-turn conversation in which an AI agent builds a task management API with Express and SQLite, as an example of a coding workflow. The transcript includes (brief) system prompts, tool calls, file reads, code generation, and debugging.</p><p>For single-turn experiments, we worked with a 4,096-token window from this transcript (12 turns). For sequential experiments, we split it into:</p><ol><li><p>Prefix: system prompt</p></li><li><p>Middle: X turns</p></li><li><p>Suffix: most recent exchange</p></li></ol><h3><strong>The evaluation</strong></h3><p>We test whether the model can answer factual questions about the conversation using compressed context, measured against full context.</p><ul><li><p>&#8220;What library does the project use for password hashing?&#8221; (answer: <code>bcryptjs</code>, not just <code>bcrypt</code>)</p></li><li><p>&#8220;What is its <code>Node.js</code> test runner?&#8221; (answer: <code>node:test</code>, not <code>Jest</code>)</p></li><li><p>&#8220;Does the markdown converter use a library or <code>regex</code>?&#8221; (answer: <code>regex</code>-based, no external library)</p></li></ul><p>The scoring is keyword-based to keep complexity low and the scores unambiguous and deterministic. This tests whether specific details survive compression, and it won&#8217;t get fooled by plausible-sounding text.</p><h3><strong>Why this evaluation is useful</strong></h3><p>Each question has three properties that make it a reasonable test of compression quality:</p><p><strong>The answer exists only in the compressed turns</strong>: We verify that the system prompt and what I call the suffix (latest input) don&#8217;t contain the answer. The model can only get it right by extracting information from the compressed representation.</p><p><strong>The model has a wrong &#8220;default&#8221;:</strong> When given no context, the model guesses <code>bcrypt</code> (not <code>bcryptjs</code>), <code>Jest</code> (not <code>node:test</code>), and <code>marked</code> (not <code>regex</code>). These are what I expect are the most popular options from training data. Correct answers require overriding these &#8220;defaults&#8221; with specific context, which is what our compression methodology needs to preserve.</p><p><strong>There are specific ways to be wrong:</strong> <code>bcryptjs</code> vs <code>bcrypt</code> is a tiny change, and <code>node:test</code> vs <code>Jest</code> is a completely different tool. These are the kind of specific details that text-level summarization (or context compression in general) could easily lose.</p>
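<p>For concreteness, here&#8217;s a minimal sketch of the keyword-based scorer described above; the question/keyword pairs shown are illustrative.</p><pre><code class="language-python"># Minimal sketch of the keyword-based eval scoring: deterministic substring
# checks for the exact identifiers. Question/keyword pairs are illustrative.
EVAL = [
    ("What library does the project use for password hashing?", "bcryptjs"),
    ("What is its Node.js test runner?", "node:test"),
    ("Does the markdown converter use a library or regex?", "regex"),
]

def score(answer: str, required: str) -> bool:
    # The full identifier must appear: a degraded "bcrypt" answer does not
    # contain "bcryptjs", so it scores as wrong.
    return required.lower() in answer.lower()
</code></pre>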
<h2><strong>The compression recipe that worked</strong></h2><p>In subsequent sections, I&#8217;ve gone into depth on the several paths (and dead ends) that led me to a working &#8220;recipe&#8221; for efficiently compressing context. That recipe is as follows.</p><h4><strong>Use Value-only optimization targets (vs. Query and Key)</strong></h4><p>Optimize virtual tokens so their Value vectors (at all layers) approximately match the Value vectors from the full text. This means excluding Keys entirely from the loss (more on this later).</p><p>V-vectors carry the actual content that gets processed through attention, the data payload. K-vectors serve as an index&#8212;they help the model decide which tokens to attend to. During inference, the model applies RoPE to the virtual tokens&#8217; K vectors at their positions automatically. Since we only optimize V (which is position-independent), we don&#8217;t need to worry about positional encoding during the optimization step.</p><p>This finding was critical because it circumvents the RoPE problem I hit in Failure 2 below&#8212;pooling K vectors from different positions averages together incompatible rotational encodings. There&#8217;s been significant work done on KV optimization, and I&#8217;ll continue to explore this in future studies. Optimizing V-vectors alone is simpler and more architecture-general.</p><h4><strong>Use perplexity weighting when pooling V</strong></h4><p>As in any robust compression, when pooling (i.e., &#8220;averaging&#8221; or compressing) the V-vectors above into blocks, <strong>not all tokens should get equal weight</strong>.</p><p>We computed per-token perplexity (also known as surprisal, or &#8220;cross-entropy&#8221; when averaged across tokens), which measures how &#8220;surprised&#8221; the model was by each token. Boring or &#8220;unsurprising&#8221; tokens (articles, whitespace) get low weight. Surprising tokens (specific library names, unique identifiers) get high weight.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!RntL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397af75e-bf92-4576-bce3-be703127c920_1781x1260.png" alt=""><figcaption class="image-caption">Uniform pooling averages all tokens equally, potentially diluting specific details. Perplexity-weighted pooling down-weights predictable tokens, better preserving the signal carried by more informative ones.</figcaption></figure></div><p>This weighting mechanism doubled the evaluated compression threshold from 1.6x to 3.2x.</p>
<p>This mechanism was a bit unintuitive<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>&#8212;it works by <em>down-weighting &#8220;boring&#8221; tokens</em> rather than boosting important ones. When predictable tokens carry less weight in the average, the resulting pooled representation is naturally closer to the informative tokens. The optimization step then has an easier path to matching these improved targets. For example, in our experiment the loss at 100 virtual tokens dropped from 8.03 to 4.85 compared to uniform pooling.</p><h4><strong>Interpolate positions</strong></h4><p>Virtual tokens need positions for RoPE on their K vectors during inference. By default, virtual tokens would get consecutive positions (e.g., 144, 145, ..., 218), while the original tokens might occupy a different range (e.g., 146, ..., 385). We spread the virtual tokens&#8217; positions across the original range: <code>torch.linspace(144, 385, 75)</code>.</p><p>The critical property is that this preserves the positional distribution that downstream attention expects<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>. When the suffix (i.e., most important) tokens compute attention, the relative distances to each virtual token approximate the distances to the original tokens. The suffix tokens start at position 386 in our example, regardless of how many virtual tokens there are, acting as if the full middle were present.</p><p>The effect is small at low compression (positions are already dense) and more meaningful at high compression, where position clustering would otherwise distort attention patterns.</p>
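<p>A small sketch of the position interpolation, using the example numbers above; the <code>position_ids</code> usage line assumes a HuggingFace-style interface.</p><pre><code class="language-python">import torch

# Spread virtual-token positions across the span the original tokens occupied,
# so relative distances seen by downstream attention stay roughly correct.
def interpolated_positions(start, end, num_virtual):
    return torch.linspace(start, end, num_virtual).round().long()

pos = interpolated_positions(144, 385, 75)   # tensor([144, 147, ..., 385])
# At inference, these position ids drive RoPE for the virtual tokens' K vectors:
# out = model(inputs_embeds=virtual, position_ids=pos.unsqueeze(0))
</code></pre>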
<h4><strong>An important note on practicality</strong></h4><p>This is a proof of concept&#8212;per-instance gradient optimization the way we&#8217;ve done it here is too computationally intensive and slow for production use. The path to getting it production-ready likely requires training a neural net to predict the compressed virtual token embeddings.</p><h2><strong>What I tried, what failed, and lessons</strong></h2><p>To get to the working recipe above, I tried numerous approaches and hit several hurdles. Below is a map of that path and how each failed test narrowed the design space that led to the results.</p><p>Note that when I discuss &#8220;loss&#8221; below, I&#8217;m referring to the mean-squared error between two sets of vectors that gradient descent is minimizing: here, the virtual tokens&#8217; representations and those of the real tokens they attempt to stand in for.</p><h4><strong>Failure 1: Optimizing for the final layer is the wrong objective</strong></h4><p>My first approach was to create virtual tokens and optimize them so that the model&#8217;s output at the final layer matches what the real tokens produce. Using this method, it looked like the virtual tokens were producing &#8220;correct&#8221; final representations.</p><p>However, in testing, our eval answers looked as if there were no context at all, and the model was &#8220;guessing&#8221; its training defaults.</p><p><strong>The lesson:</strong> When the model generates text, it isn&#8217;t solely reliant on its final layer<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>. At every layer independently, each generated token attends to the KV cache entries from previous tokens. Optimizing only for the final layer therefore leaves large mismatches at every earlier layer.</p><h4><strong>Failure 2: Including Keys in the optimization is difficult due to RoPE</strong></h4><p>In my next attempt, I switched to optimizing the KV entries at all 28 layers. I noticed that loss declined (and later plateaued), but the eval answers were still often wrong.</p><p>The problem: modern models apply position-dependent rotations to Key vectors, called Rotary Position Embeddings, or RoPE. RoPE encodes token positions so that when the model calculates attention, it naturally takes the relative positions between tokens into account. When we pool K vectors from various positions, however, we average vectors with different rotational encodings. The result is a rotational mess that no single-position virtual token can match.</p><p>I also confirmed that RoPE isn&#8217;t unique to Qwen models; it is used by LLaMA, Gemma, and virtually every modern open-weight model. Any compression approach that targets K vectors will hit this wall.</p><p><strong>The lesson:</strong> RoPE makes K vectors position-dependent, but Value vectors are position-independent. We verified in Qwen&#8217;s source: <code>query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)</code>. This means we can modify and optimize V without worrying about positional encoding.</p><p>There&#8217;s room for improvement here: since RoPE rotations are deterministic and reversible, a future approach could undo the rotation, optimize K pre-RoPE, then re-apply it for inference.</p><h4><strong>Failure 3: Attention-weighted pooling is question-specific and regresses results</strong></h4><p>Once V-only optimization was working, I tried improving the pooling targets (something we&#8217;ll have to come back to in future work). Instead of giving all V vectors equal weight when averaging them into blocks, I tried weighting them by how much the evaluation question&#8217;s tokens attended to each position. The thought was: high-attention tokens should be preserved better, right?</p><p>It&#8217;s important to note, though, that this was a rudimentary approach chosen for simplicity given the risk of &#8220;overfitting.&#8221; The results actually got <em>worse</em>: at 150 tokens, we went from 3/3 to 2/3 correct answers.</p><p><strong>The lesson:</strong> Attention-weighted pooling is inherently question-specific. Weighting toward one question&#8217;s attention pattern down-weights tokens that other questions need. There may be better weighting mechanisms for query-agnostic compression that we can explore in the future.</p><h3><strong>An aside on the helpfulness of AI</strong></h3><p>It goes without saying that AI coding agents were integral to running these experiments. Of course, AI was vital in writing the necessary Python code to create and run the tests. What I didn&#8217;t expect, though, was how much I used it to run feasibility tests and avoid unnecessary obstacles. For example:</p>
<div class="callout-block" data-callout="true"><p>Mr. AI, please help me figure out which Qwen model to use so I don&#8217;t overflow to system DRAM if I have to run gradient descent on my loaded model and therefore have it take days to run.</p></div><h3><strong>Results Summary</strong></h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!8SyG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e7eada2-b9d2-4f5e-a815-f176dc72a090_1560x400.png" alt=""><figcaption class="image-caption">Single-turn compression across three different content types: 3.2x compression preserves all tested facts regardless of content type.</figcaption></figure></div><p><strong>3.2x is the lossless (in this eval) threshold for all three turns.</strong> Larger turns compress equally well. Turn 24 has the lowest optimization loss at every ratio, suggesting that more tokens give the optimizer more material to work with.</p><p><strong>At 4.8x, failures follow a similar pattern of degraded names or strings:</strong></p><ul><li><p>&#8220;<code>bcryptjs</code>&#8221; &#8594; &#8220;<code>bcrypt</code>&#8221; (dropped the JS suffix)</p></li><li><p>&#8220;<code>dev-secret</code>&#8221; &#8594; &#8220;<code>dev-dev-dev-dev-dev...</code>&#8221; (strange repetition of the prefix)</p></li><li><p>&#8220;<code>done</code>&#8221; &#8594; &#8220;<code>completed</code>&#8221; (fell back to a training synonym)</p></li></ul><p>Structural facts seem to survive much longer than exact names. For instance, the model knows which library is used for password hashing well past the point where it &#8220;forgets&#8221; the exact name of that library.</p><h3><strong>Sequential multi-turn: 2x preserves most facts across 10 turns</strong></h3><p>Single-turn compression gives us a good signal on the compression mechanism we tested, but coding agents rely on long, multi-turn conversations. The next step was testing sequential compression: compressing each turn independently and building up a running KV cache, as in the sketch below.</p>
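<p>A sketch of that sequential loop, reusing the <code>compress_turn</code> sketch from earlier; the ratio handling and names are illustrative.</p><pre><code class="language-python">import torch

# Sketch of the sequential pipeline: compress each finished turn independently,
# appending its virtual tokens to a running compressed history.
def compress_conversation(model, turn_ids_list, ratio=2.0):
    compressed = []
    for turn_ids in turn_ids_list:                      # each [1, T_i]
        n_virtual = max(1, int(turn_ids.shape[1] / ratio))
        compressed.append(compress_turn(model, turn_ids, n_virtual))
    # Final context = system prompt (verbatim) + compressed middle turns
    # + most recent exchange (verbatim), concatenated along the time axis.
    return torch.cat(compressed, dim=1)                 # [1, sum(n_i), d]
</code></pre>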
<h5>Sequential Compression in practice</h5><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Wt0j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02517bc7-58b0-4748-b248-caa58a5a2394_1715x1338.png" alt=""><figcaption class="image-caption">Each turn is compressed independently against the running cache. The compressed representation grows incrementally as the conversation progresses.</figcaption></figure></div><h5><strong>Sequential Compression Results</strong></h5><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!qM3b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8831ebb-76b7-477e-815a-89e0824fad0a_1400x400.png" alt=""><figcaption class="image-caption">Multi-turn compression across 10 turns: 2.0x compression preserves 7 of 8 factual answers, which is the same quality as 1.5x at lower token cost.</figcaption></figure></div><p>The main finding was that there is <strong>minimal error accumulation across turns</strong>, meaning the optimization process works equally well at each step. 
The remaining question, whether the targets themselves drift as compressed context accumulates, is harder to measure directly in this set of experiments. That said, the fact that 2x compression still preserves 7 of 8 answers suggests this drift (if it exists) remains bounded.</p><h2><strong>Practical Implications and final thoughts</strong></h2><p>Getting deep into manipulating internal representations to optimize coding agents, as we&#8217;ve done here, builds an intuitive, applicable understanding of several concepts and model behaviors. Applying that understanding to real user problems reveals insights that can shape how we build the coding agents of tomorrow.</p><p>In these experiments, we&#8217;ve learned that categorical facts like &#8216;the project uses JWT for authentication&#8217; survive high compression, while exact names like <code>bcryptjs</code> degenerate to the model&#8217;s &#8220;default,&#8221; <code>bcrypt</code>. Despite the challenge, we&#8217;ve also learned that there&#8217;s <strong>significant headroom for more context efficiency in coding agents</strong>. Making such improvements would solve several key issues with coding agents as they exist today, and might make them more broadly useful. Applying the above, I would combine several approaches to create a personalized and efficient context strategy.</p><p><strong>Compression</strong>: Compress context at the representation level with V-vector optimization, using perplexity-weighted V-pooling and interpolated virtual token positions. Our proof of concept shows that it may be possible to compress context ~3x using this method. This is also complementary to caching: combining compression with prompt caching could reduce context costs by 80&#8211;90%. Of course, the next step would be building a model to productionize what we&#8217;ve learned.</p><p><strong>Systematic Preservation of Vital Information</strong>: Preserve vital information and keep it verbatim (more to follow on how best to do this in future work). Simple file systems and knowledge graphs built from markdown files seem to be the most popular approach. It may also make sense to preserve important information from these file reads in the context window verbatim. This is where personalization comes into play&#8212;what type of &#8220;memory&#8221; system someone uses, and what counts as useful information outside of interacting with an agent, differs from person to person. This is also complementary to compression.</p><p><strong>Intelligent Compaction</strong>: We&#8217;ve discussed before that Anthropic <a href="https://github.com/anthropics/claude-code/issues/43989">moved away</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> from exposing very long (1M+ token) context windows by default, given the intrinsic issues. With the above techniques at our disposal, we could be more intelligent about <em>when</em> we compress instead of waiting for a certain context window threshold.</p><p>I would love to explore using signals to automatically trigger compaction and disposal of context, or encourage users to provide those signals directly. Examples include: forking conversations, discarding the last turn if it had a poor result, and having a meta conversation that preserves specific context. 
This same principle also applies to sub-agents&#8212;deciding explicitly whether sub-agents should have context (and what context), and where they should write their outputs.</p><div class="callout-block" data-callout="true"><p><strong>In today&#8217;s landscape, much of the burden of context management (and even understanding the problem) is put on the user. We could abstract this into the model harness itself, powered by new techniques.</strong></p></div><h2>Future work</h2><p>There are several directions I want to explore from here.</p><ol><li><p><strong>Optimizing K vectors pre-RoPE:</strong> We circumvented the RoPE problem entirely by only optimizing V, but since RoPE rotations are reversible, we could undo them, optimize K directly, then re-apply them. This could push the compression threshold beyond 3.2x by jointly optimizing both.</p></li><li><p><strong>Re-testing the hybrid approach</strong> by preserving specific pieces of context verbatim and compressing the rest, at the representation level.</p></li><li><p><strong>Combining embedding compression with KV cache eviction:</strong> Our technique preserves historical context in approximate form. Published eviction methods like H2O and SnapKV manage the growing current context by dropping low-attention entries entirely, and these efforts would be complementary.</p></li><li><p><strong>Testing the sub-agent context architecture ideas above:</strong> The compression research tells us what&#8217;s possible with context efficiency, and it would be interesting to apply that systematically to sub-agents from a product perspective. For example: what context sub-agents receive, whether it should be compressed or verbatim, and when to discard context entirely.</p></li></ol><h2>Appendix</h2><h3><strong>Limitations</strong></h3><p>There are several limitations to note.</p><p><strong>Model choice and scale:</strong> All of the experiments were conducted using Qwen 2.5 Coder 7B. Larger models with more attention heads and parameters might behave and perform differently. On one hand, they might have higher representational capacity in each virtual token. On the other, there are more complex internal representations for a virtual token to approximate. We were hardware-bound, but I&#8217;d love to test the same mechanisms in more capable models.</p><p><strong>Eval / benchmark simplicity:</strong> We used a single transcript with a set of binary evaluation questions for simplicity. In previous experiments, broken pipelines (and things as simple as computer sleep) resulted in numerous restarts, leading me to focus on the simplest option possible for this phase of research. The compression threshold and degradation patterns need validation across different codebases, programming languages, and types of conversations.</p><p><strong>Sequential test oddities:</strong> Even 1.5x compression scores ~7/8 on the sequential experiment we ran, suggesting the sequential pipeline itself introduces some quality loss beyond what the compression ratio alone explains. This needs further exploration.</p><p><strong>The hybrid approach is difficult to validate on short contexts:</strong> On short turns (e.g., 240 tokens), preserving ~15% of them consumes too much of the &#8220;compression budget&#8221; and results in a compression ratio that&#8217;s too aggressive.</p><h2><strong>An initial look at a hybrid compression approach</strong></h2><p>In the above experiments, failures at 4.8x compression are always specific tokens whose signal gets diluted during pooling. 
<h2>Appendix</h2><h3><strong>Limitations</strong></h3><p>There are several limitations to note.</p><p><strong>Model choice and scale:</strong> All of the experiments were conducted using Qwen 2.5 Coder 7B. Larger models with more attention heads and parameters might behave and perform differently. On one hand, they might have higher representational capacity in each virtual token. On the other, there are more complex internal representations for a virtual token to approximate. We were hardware bound, but I&#8217;d love to test the same mechanisms in more capable models.</p><p><strong>Eval / benchmark simplicity:</strong> We used a single transcript with a set of binary evaluation questions for simplicity. In previous experiments, broken pipelines (and things as simple as computer sleep) resulted in numerous restarts, leading me to focus on the simplest option possible for this phase of research. The compression threshold and degradation patterns need validation across different codebases, programming languages, and types of conversations.</p><p><strong>Sequential test oddities:</strong> Even 1.5x compression scores ~7/8 on the sequential experiment we ran, suggesting the sequential pipeline introduces some quality loss beyond what the compression ratio alone explains. It&#8217;s something to explore further.</p><p><strong>Hybrid approach is difficult to validate on short contexts:</strong> On short turns (e.g., 240 tokens), preserving ~15% of them consumes too much of the &#8220;compression budget&#8221; and forces the rest into an overly aggressive compression ratio.</p><h2><strong>An initial look at a hybrid compression approach</strong></h2><p>In the above experiments, failures at 4.8x compression are always specific tokens whose signal gets diluted during pooling. Applying a text-level compression technique I&#8217;ve attempted before, <strong>what if we kept those tokens verbatim and only compressed the rest?</strong></p><p>The biggest challenge here was identifying which tokens to exempt. We tested two signals:</p><p><strong>Perplexity:</strong> Represents how &#8220;surprising&#8221; the token was. Unfortunately, this selects formatting markers and sub-word fragments alongside genuinely useful tokens. Sometimes, tokens that are surprising carry no useful content (hence we down-weighted low-perplexity tokens in our pooling methodology). More broadly, perplexity wasn&#8217;t a useful signal (at least when applied this way) at the macro-content level, even though it was useful for pooling.</p><p><strong>&#8220;Uniqueness&#8221; of V:</strong> Represents how different a token&#8217;s internal representation is from its neighbors. This selects content words like <code>Task, converter, reports, SQLite, authenticate, userId</code>. These are exactly the tokens that lose the most from being averaged with their neighbors during pooling.</p><p>Using this technique, we kept the top 15% of tokens by V-uniqueness; their KV entries passed through the pipeline unmodified at their original positions (with correct RoPE), while we compressed the remaining 85%.</p>
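<p>The selection step itself is only a few lines. A sketch, assuming per-token V vectors from one layer (the function and window size are illustrative; the 15% threshold matches the experiment above):</p><pre><code>import torch
import torch.nn.functional as F

def select_verbatim_tokens(v, keep_frac=0.15, window=2):
    """Score tokens by how different their V vector is from neighbors;
    the most distinctive tokens lose the most when pooled.

    v: [seq_len, head_dim] V vectors for one layer/head
    Returns sorted indices of the top keep_frac tokens to keep verbatim.
    """
    seq_len = v.shape[0]
    v_norm = F.normalize(v, dim=-1)

    scores = torch.zeros(seq_len, device=v.device)
    for offset in range(1, window + 1):
        # Cosine distance between each token and its neighbor
        # `offset` positions away, credited to both tokens.
        sim = (v_norm[:-offset] * v_norm[offset:]).sum(-1)
        scores[:-offset] += 1 - sim
        scores[offset:] += 1 - sim

    k = max(1, int(seq_len * keep_frac))
    return torch.topk(scores, k).indices.sort().values</code></pre><p>Cosine distance to immediate neighbors is a deliberately local signal: it flags exactly the tokens that a windowed average would wash out.</p>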
<h3><strong>Bibliography &amp; Related Work</strong></h3><p>Our main technique, embedding optimization (without pre-training), builds directly on Kuratov et al.&#8217;s work demonstrating that hundreds of tokens can be compressed into optimized embedding vectors (<a href="https://arxiv.org/abs/2502.13063">Cramming 1568 Tokens into a Single Vector</a>, ACL 2025). We apply their technique to multi-turn coding agent conversations and add the V-only optimization target, perplexity-weighted pooling, and sequential turn-by-turn compression.</p><p>The KV cache compression research field is quite active. Most approaches compress the cache after computation through quantization or eviction. Our approach is different in that we optimize input embeddings beforehand. Notable work in this space includes <a href="https://arxiv.org/abs/2503.01586">EliteKV</a> (RoPE frequency selection) and <a href="https://arxiv.org/abs/2604.10235">CodeComp</a> (structural compression for coding agents).</p><p>KV cache eviction methods like <a href="https://arxiv.org/abs/2306.14048">H2O</a> and <a href="https://arxiv.org/abs/2404.14469">SnapKV</a> manage growing context by dropping low-attention entries entirely. I would expect these to be complementary to our approach, since compression preserves important historical context, and eviction manages the current context&#8217;s memory footprint.</p><p><a href="https://arxiv.org/abs/2310.05736">LLMLingua</a> (Jiang et al., 2023) uses perplexity to discard tokens at the text level. We use perplexity to weight V-vector pooling at the embedding level: the same signal, applied at a different level of the model.</p><p>Anthropic&#8217;s guidance on <a href="https://claude.com/blog/using-claude-code-session-management-and-1m-context">session management and context rot</a> informed our experimental motivation and illustrates how widespread this problem is.</p><p><strong>Bibliography</strong></p><ul><li><p>Kuratov et al., &#8220;Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity.&#8221; ACL 2025.<a href="https://arxiv.org/abs/2502.13063"> arxiv.org/abs/2502.13063</a></p></li><li><p>Jiang et al., &#8220;LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models.&#8221; 2023.<a href="https://arxiv.org/abs/2310.05736"> arxiv.org/abs/2310.05736</a></p></li><li><p>Zhang et al., &#8220;H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.&#8221; 2023.<a href="https://arxiv.org/abs/2306.14048"> arxiv.org/abs/2306.14048</a></p></li><li><p>Li et al., &#8220;SnapKV: LLM Knows What You Are Looking For Before Generation.&#8221; 2024.<a href="https://arxiv.org/abs/2404.14469"> arxiv.org/abs/2404.14469</a></p></li><li><p>EliteKV: RoPE-aware KV cache compression. 2025.<a href="https://arxiv.org/abs/2503.01586"> arxiv.org/abs/2503.01586</a></p></li><li><p>CodeComp: Structural KV cache compression for coding agents. 2026.<a href="https://arxiv.org/abs/2604.10235"> arxiv.org/abs/2604.10235</a></p></li><li><p>HybridKV: Hybrid per-head compression for multimodal models. 2026.<a href="https://arxiv.org/abs/2604.05887"> arxiv.org/abs/2604.05887</a></p></li><li><p>Anthropic, &#8220;Using Claude Code: Session Management and the 1M Token Context Window.&#8221; April 2026. <a href="http://claude.com/blog/using-claude-code-session-management-and-1m-context">claude.com/blog/using-claude-code-session-management-and-1m-context</a></p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Compounded further by increased use of agents.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This means the model uses some summarization technique to &#8220;compact&#8221; context into something more manageable to keep working.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Codex calls an API that is opaque to the public.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Anthropic published guidance on context rot <a href="https://claude.com/blog/using-claude-code-session-management-and-1m-context">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Essentially using coordinates in a high-dimensional space (e.g., [1, 3.2, 4, 7] = cat) to represent pretty much anything.
</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>It&#8217;s not <em>quite</em> discarded in the case of caching, but for purposes of this particular point, the message holds.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>We&#8217;re focused on coding tasks, since that&#8217;s the use case where a lot of this pain is felt, though it&#8217;s likely broadly applicable outside of coding.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Qwen 2.5 Coder 7B has 28 model layers.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>A while ago, it took me a long time to figure out how attention works&#8212;it&#8217;s the mechanism at the core of modern LLM architecture. To put it in a few words, attention allows a model to understand whether &#8220;go&#8221; seen in the input is a command, a turn (in the British sense), an ancient board game, who is going, and so on. This <a href="https://www.youtube.com/watch?v=eMlx5fFNoYc">video</a> captures it far better than I could explain in text.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>It&#8217;s pretty much a fancy way of saying we average together a bunch of vectors.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>A <a href="https://www.youtube.com/watch?v=IHZwWFHWa-w">video</a> is worth a thousand of my words explaining gradient descent.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>I&#8217;m very proud to note that I&#8217;m now getting non-gaming use out of this expensive graphics card that I spent multiple months trying to acquire at MSRP.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>This is because the important tokens (to us) don&#8217;t necessarily have high perplexity. Yet, we&#8217;re able to reduce the dilutive impact of unimportant ones.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>If all the virtual tokens were packed into consecutive positions, the model would lose valuable information on relative positions of tokens that it expects (that we&#8217;ve now messed with). 
The interpolation we&#8217;ve done across the original range gets us to &#8220;good enough.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>To be honest, I excitedly rushed through this step and didn&#8217;t fully think through my optimization. Nevertheless, it was a neat way to empirically show how wrong I was.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>At least as it was noted several weeks ago.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Where AI coding costs actually come from (and how I cut them by 75%)]]></title><description><![CDATA[Anthropic published guidance last week on managing &#8220;context rot&#8221; in Claude Code. I've been measuring how much that "rot" costs and testing techniques to fix it]]></description><link>https://rocketvish.substack.com/p/where-ai-coding-costs-actually-come</link><guid isPermaLink="false">https://rocketvish.substack.com/p/where-ai-coding-costs-actually-come</guid><dc:creator><![CDATA[Vishnu Kalugotla]]></dc:creator><pubDate>Mon, 20 Apr 2026 20:18:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gSmF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0732f-2a28-4866-914d-66b01997f13a_2379x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Anthropic <a href="https://claude.com/blog/using-claude-code-session-management-and-1m-context">published guidance</a> last week on managing &#8220;context rot&#8221; in Claude Code, the degradation of model performance as conversations grow longer. In every conversation turn, the model re-reads the entire conversation history, including old files, potentially irrelevant errors, and abandoned directions. This accumulation makes models slower, less focused, and more expensive.</p><p>I&#8217;ve been measuring how much that &#8220;rot&#8221; costs and testing techniques to fix it. This post adds prompt caching to the compression work from <a href="https://rocketvish.substack.com/p/how-i-made-coding-agents-44-cheaper">Part 2</a>, pushing savings to over 70% against our uncompressed baseline.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>The most interesting findings from our series of experiments were where caching works, where it doesn&#8217;t, and why compression matters more than our benchmark suggests.</p><h2><strong>The Experiment: Seven Strategies</strong></h2><p>For this series, I built a minimal three-tool coding agent using the Anthropic API that includes <code>read_file</code>, <code>write_file</code>, and <code>run_command</code> in a loop. This removes the overhead of Claude Code&#8217;s full infrastructure and lets me measure context management techniques in isolation.</p><p>In this follow-up, I added prompt caching to our custom agent and ran seven conditions on the same benchmark from <a href="https://substack.com/home/post/p-194145994">Part 2</a> (an 8-task web application benchmark covering database setup, authentication, and testing<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>). We measured cost, tokens, turns, and code quality (1&#8211;5 scale with an LLM judge). I also tested a new technique, embedding-based selective retrieval, as an alternative to lossy summarization.</p><p>Here are the results:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gSmF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0732f-2a28-4866-914d-66b01997f13a_2379x1280.png"><img src="https://substackcdn.com/image/fetch/$s_!gSmF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0732f-2a28-4866-914d-66b01997f13a_2379x1280.png" width="1456" height="783" alt=""></a><figcaption class="image-caption">Cost varies ~9x across strategies with consistent quality.</figcaption></figure></div><p>Two things quickly became apparent:</p><ul><li><p>Quality is flat across every condition (4.0&#8211;4.2, within what I&#8217;d call the margin of error for this exercise).</p></li><li><p>Caching alone, with no changes to how the agent thinks or works, cut token costs by 75%. The savings come from paying significantly less for tokens the model has already seen.</p></li></ul><p><em>A note on this comparison: Our agent is intentionally minimal. The Claude Code bar in the graph provides a reference point for what a full-featured tool costs.</em></p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xKAx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F952530c8-e503-4644-96d9-4cd9754e0ea2_2457x738.png"><img src="https://substackcdn.com/image/fetch/$s_!xKAx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F952530c8-e503-4644-96d9-4cd9754e0ea2_2457x738.png" width="2457" height="738" alt=""></a><figcaption class="image-caption">Full results across seven conditions &#8212; ranges shown from multiple runs.</figcaption></figure></div><h2><strong>Caching is a reliable win</strong></h2><p>Adding caching markers to our API calls<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> dropped costs from $0.62 to $0.18. This means that the agent sends the same tokens, does the same work, and produces very similar code. The only difference here is that on turn 10, the ~8,000 tokens of conversation history from turns 1&#8211;9 cost 1/10th what they cost without caching.</p><p>Prompt caching, of course, has been available for years, and Claude Code already uses it with a 90%+ hit rate. In our own experiment, the stability was remarkable&#8212;the two caching runs cost $0.17 and $0.18. It&#8217;s clearly the optimization to implement first (which is why most people seem to have done it already). It&#8217;s predictable, consistent, and critically, isn&#8217;t lossy.</p><h2><strong>Compression is still helpful</strong></h2><p>Adding compression on top of caching produced our single best result across all conditions: $0.14, a 77% reduction from our uncompressed baseline. However, it also produced a $0.35 run. The variance comes from the model making different choices after reading a summary instead of reading the full history.</p>
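<p>For reference, both conditions are small additions at the API level. A minimal sketch of the cache breakpoint plus threshold-triggered summarization, using the Anthropic Python SDK (the model id, prompts, and token estimate are placeholders; this is the shape of the conditions, not the exact experiment code):</p><pre><code>import anthropic

client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY
MODEL = "claude-sonnet-4-5"             # placeholder model id
SYSTEM = [{
    "type": "text",
    "text": "You are a minimal coding agent...",  # stable prefix
    # Cache breakpoint: identical prefixes on later calls are
    # served from cache at a fraction of the normal input price.
    "cache_control": {"type": "ephemeral"},
}]

def maybe_compress(history, threshold_tokens=12000):
    # Rough size estimate (chars / 4); a real agent counts tokens.
    approx_tokens = sum(len(str(m)) for m in history) / 4
    if approx_tokens >= threshold_tokens:
        summary = client.messages.create(
            model=MODEL, max_tokens=1024, system=SYSTEM,
            messages=history + [{"role": "user", "content":
                "Summarize files read/written, commands run, and "
                "errors so far. Be specific about names and paths."}],
        )
        # Replace the full history with the structured summary.
        return [{"role": "user", "content":
                 "Summary of prior work: " + summary.content[0].text}]
    return history</code></pre>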
<p>While the summary is normally sufficient context, sometimes the model re-reads files it wouldn&#8217;t have needed to re-read had it seen the original conversation.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!75HQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51341b9d-a990-4a58-b1a7-7de95d348684_2379x1180.png"><img src="https://substackcdn.com/image/fetch/$s_!75HQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51341b9d-a990-4a58-b1a7-7de95d348684_2379x1180.png" width="1456" height="722" alt=""></a><figcaption class="image-caption">Per-turn token accumulation across four strategies. Compression creates the sawtooth pattern (as opposed to linear accumulation). Retrieval cycles repeatedly.</figcaption></figure></div><p>The initial result was surprising, but it made a bit more sense in the context of this experiment. The test benchmark is quite small, comprising eight tasks and files (mainly to save me the enormous headache of debugging a potential multi-hour testing run, not to mention that Anthropic API calls aren&#8217;t free). The model reads the same files repeatedly, and caching handles the repetition. In this testing environment, compression was solving a problem that barely existed.</p><h2><strong>Why these numbers understate the real gains</strong></h2><p>Real codebases are different from my benchmark, and the problem is more acute than the above result might suggest, especially because Claude Code reads entire files by default.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> In a large codebase, each new file the model explores is a cache miss of perhaps tens of thousands of tokens. Context switching has the potential to dramatically compound this as well.</p><p>Cache time-to-live (TTL)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> makes it worse. The default 5-minute TTL means any break longer than that causes a full cache expiry. Claude Code&#8217;s creator, Boris Cherny, acknowledged this directly:</p><blockquote><p>Prompt cache misses when using 1M token context window are expensive... if you leave your computer for over an hour then continue a stale session, it&#8217;s often a full cache miss.</p></blockquote><p>Developers are already building workarounds: MCP servers for code indexing, hooks that pre-process data before Claude sees it, and skills that provide architecture overviews so the model doesn&#8217;t have to search. 
I&#8217;ve yet to see a standardized solution that&#8217;s broadly applicable, though.</p><p>The 75% token cost savings would likely be even more impactful in a large codebase, where the model accumulates 100K+ tokens of history with lower cache hit rates. Our benchmark is the best case for caching and the worst case for compression. In a real codebase, the balance flips &#8212; cache hit rates drop as the model reads novel files, and the accumulated context that compression eliminates becomes the most significant cost. <strong>The recommendation, then, from this study is to use both: cache everything by default, and compress proactively before the context window fills.</strong></p><p>This also explains why Anthropic <a href="https://github.com/anthropics/claude-code/issues/43989">reduced</a> the auto-compact threshold from ~1M to ~400K tokens in Claude Code v2.1.92. Even with aggressive caching, maintaining the full conversation history at scale was apparently unsustainable. The aim of this project is effective proactive compression, not just to save on token costs, but to keep agents focused and performant in long sessions.</p><h2><strong>Embedding retrieval: completeness beat fidelity</strong></h2><p>Next, I tested an alternative to lossy compression: instead of summarizing all history, I used embedding similarity to score each conversation turn by relevance, selected the 5 most relevant turns, and kept them verbatim. My hypothesis was that verbatim turns preserve more useful detail than a full summary.</p><p>While I&#8217;m sure there&#8217;s room for more optimization (I just picked 5 to start), it was our worst result of the test runs: 58 turns, costing $1.66, which was more expensive than Claude Code.</p><p>It was reassuring to see that embeddings actually do differentiate at the per-turn level. Scores ranged from 0.55 to 0.87, noticeably wider than the inter-task embedding scoring that failed in Part 2. This means that the model can tell that &#8220;read package.json&#8221; is less relevant than &#8220;wrote auth routes.&#8221;</p><p>However, keeping only 5 turns for the small tasks in this test seemed to drop too much context. This is likely because it&#8217;s helpful for the model to have a complete map of our codebase. For example, a lossy summary that says &#8220;created db.js with three tables, wrote auth routes with JWT, added validation middleware&#8221; gives the model enough to continue, while five verbatim turns might not include the file the model needs to reference when writing tests.</p><p>In the context of this experiment, completeness beats fidelity. I&#8217;ve seen other work use file-structure maps to mitigate this further&#8212;we may investigate those strategies in future work.</p><h2><strong>What this means for the tools we use</strong></h2><p>Now let&#8217;s look at a broader question that goes beyond this specific research.</p><p>An interesting data point in our comparison is that Claude Code costs $0.72 for our benchmark, while our lean agent costs $0.18 with caching. The gap illustrates where token costs live: roughly 70% of the cost in our measurements comes from infrastructure that gets re-sent every turn, not from the actual code generation.</p><p>I&#8217;m not arguing that everyone should replace Claude Code with minimal agents. Claude Code&#8217;s system prompt, tools, and safety features exist for good reasons and make it flexible and capable. 
However, as sessions get longer and codebases get larger, that overhead compounds. The question is how we can intelligently compress that information to make the entire process more efficient. A 4,000-token system prompt that gets re-sent every turn could be a 400-token summary after the first turn&#8212;the model likely doesn&#8217;t need to re-read tool specs it saw ten turns ago.</p><p>The pattern we&#8217;ve established keeps repeating: there is significant headroom for improvement in both context management and AI tool use broadly.</p><h2><strong>Next: beyond the text layer</strong></h2><p>Thus far, I&#8217;ve been investigating techniques that manipulate text, like summarization, caching (a little different, but close enough), and embedding retrieval. Summarization works but risks losing some detail; caching is quite helpful but can&#8217;t help with novel content; retrieval couldn&#8217;t replace completeness in our study.</p><p>The next step will be looking into the layer below text that these techniques don&#8217;t reach: the model&#8217;s internal representations. When an LLM processes a conversation, it builds vectors that serve as internal representations of the text, and it uses those vectors in the computations that produce the output. With open-weight models, we can access these vectors directly<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>Instead of summarizing a turn into text and re-processing the summary, I&#8217;ll try a set of techniques to manipulate these representations more directly. For example, it&#8217;s possible to evict low-value KV cache entries and keep the rest intact.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> Applying these techniques to multi-turn coding agent sessions, where the &#8220;importance&#8221; signal comes from which past turns the model actually attends to, is less explored. We&#8217;ll dive deeper into that next.</p><div><hr></div><p><em>The code for the evaluation framework, benchmark, agent, and other work described here is open source at <a href="https://github.com/rocketvish/bearing">github.com/rocketvish/bearing</a>.</em></p><h2>Sources</h2><ul><li><p><strong>Session management and context rot</strong> Anthropic blog, April 15, 2026. Official guidance on compaction and context rot in Claude Code. <a href="https://claude.com/blog/using-claude-code-session-management-and-1m-context">https://claude.com/blog/using-claude-code-session-management-and-1m-context</a></p></li><li><p><strong>Context management for coding agents</strong> Lindenbauer et al., TUM / JetBrains. Presented at NeurIPS 2025 (DL4Code workshop). <a href="https://blog.jetbrains.com/research/2025/12/efficient-context-management/">https://blog.jetbrains.com/research/2025/12/efficient-context-management/</a></p></li><li><p><strong>Claude Code autocompact reduction</strong> GitHub Issue #43989. Threshold reduced from ~1M to ~400K tokens in v2.1.92. <a href="https://github.com/anthropics/claude-code/issues/43989">https://github.com/anthropics/claude-code/issues/43989</a></p></li><li><p><strong>Prompt caching mechanics</strong> Anthropic API docs. Cache reads at 0.1x price, writes at 1.25x, 5-minute TTL. 
<a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching">https://platform.claude.com/docs/en/build-with-claude/prompt-caching</a></p></li><li><p><strong>Cache read token dominance</strong> GitHub Issue #24147. 99.93% of tokens are cache reads over 30 days of real usage. <a href="https://github.com/anthropics/claude-code/issues/24147">https://github.com/anthropics/claude-code/issues/24147</a></p></li><li><p><strong>Cache TTL regression</strong> GitHub Issue #46829. Documents silent TTL reversion from 1h to 5m, with cost impact analysis. <a href="https://github.com/anthropics/claude-code/issues/46829">https://github.com/anthropics/claude-code/issues/46829</a></p></li><li><p><strong>Claude Code reads entire files by default</strong> BSWEN blog. MCP server for symbol-level indexing reduced tokens from 84K to 2.7K per file read. <a href="https://docs.bswen.com/blog/2026-03-20-reduce-token-usage-claude-code/">https://docs.bswen.com/blog/2026-03-20-reduce-token-usage-claude-code/</a></p></li><li><p><strong>Boris Cherny on cache miss costs</strong> The Register, April 2026. &#8220;Prompt cache misses when using 1M token context window are expensive.&#8221; <a href="https://www.theregister.com/2026/04/13/claude_code_cache_confusion/">https://www.theregister.com/2026/04/13/claude_code_cache_confusion/</a></p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is Part 3 of a series on context management for AI coding agents. <a href="https://rocketvish.substack.com/p/why-i-stopped-copy-pasting-between">Part 1</a> introduced a lightweight coding agent orchestrator and the case for separating planning and execution context. <a href="https://rocketvish.substack.com/p/how-i-made-coding-agents-44-cheaper">Part 2</a> measured mid-conversation compression (44% savings).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This is a controlled 8-task web application benchmark (a backend API) covering database setup, authentication, data operations, validation, and tests.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>A cache control breakpoint is a marker you place in your API request telling Anthropic &#8220;cache everything up to this point.&#8221; On the next request, if the content before that marker is identical, Anthropic serves it from cache at 1/10th the price instead of reprocessing it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://docs.bswen.com/blog/2026-03-20-reduce-token-usage-claude-code/">One developer</a> found that reading a single 22,000-line module to understand one function consumed 84,000 tokens. 
They built an MCP server for symbol-level indexing that reduced it to 2,700 tokens &#8212; a 97% reduction.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Cache TTL (time-to-live) is how long Anthropic keeps your cached tokens available before discarding them. The default is 5 minutes.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Frontier models served via API don&#8217;t expose their internal representations. Open-weight models like Qwen and Llama do, which is why the next phase of this research uses local models as a test bed.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Research frameworks like H2O (NeurIPS 2023) and SnapKV have already demonstrated this for general language tasks.</p></div></div>]]></content:encoded></item><item><title><![CDATA[How I made coding agents 44% cheaper]]></title><description><![CDATA[There&#8217;s the headline &#8211; smart context management cut my AI coding costs by 44%, with room to go further.]]></description><link>https://rocketvish.substack.com/p/how-i-made-coding-agents-44-cheaper</link><guid isPermaLink="false">https://rocketvish.substack.com/p/how-i-made-coding-agents-44-cheaper</guid><dc:creator><![CDATA[Vishnu Kalugotla]]></dc:creator><pubDate>Tue, 14 Apr 2026 03:51:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sYDG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Smart context management cut my AI coding costs by 44%, with room to go further.</p><p>What started off as a lightweight app to make my workflow more efficient has turned into a veritable rabbit hole on effective context management in LLMs.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Model context, especially within the environment of a model harness like Claude Code, tends to be inefficient and can result in higher token burn with potentially worse quality outputs.</p><p>This is because, as conversations get longer, all previous queries and outputs get swept in as context until hitting the context window limit. 
At this point, conversations are compacted, losing valuable information.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>I built a rough evaluation framework and ran five experiments to find out where those tokens go, which ones matter, and which ones can be cut. There were, of course, a few dead ends, but the results indicate that there&#8217;s significant untapped headroom.</p><h4>Findings</h4><ul><li><p>Compressing mid-conversation context to eliminate unnecessary inputs saved 44% on token costs.</p></li><li><p>Claude Code stores full session transcripts, so you can inspect what happened after the fact. However, during execution, you can&#8217;t intervene between turns, making it very difficult to compress the history, drop stale results, etc.</p></li></ul><h2><strong>Experiment Context</strong></h2><p>In my last post, &#8220;<strong><a href="https://rocketvish.substack.com/p/why-i-stopped-copy-pasting-between">Why I stopped copy-pasting between Claude windows</a>,</strong>&#8221; I introduced <a href="https://github.com/rocketvish/bearing">Bearing</a> &#8212; an open-source orchestrator that separates AI planning from execution and optimizes for human input in the planning loop, while keeping tasks model agnostic.</p><p><strong>Here&#8217;s how it works</strong>: you have a planning conversation with your favorite model about what to build. Bearing then runs each task in a fresh <code>claude -p</code> session.</p><p>Soon after my first post, Anthropic released the <a href="https://claude.com/blog/the-advisor-strategy">advisor strategy</a>&#8212;a server-side feature that pairs Sonnet as an executor with Opus as an on-demand advisor. Sonnet drives the session and escalates to Opus when it hits a decision it can&#8217;t solve alone. It&#8217;s a similar intuition to Bearing&#8217;s planning-with-Opus, execution-with-Sonnet split, but automated and without a human in the planning loop.</p><p>While the original <em>why</em> behind Bearing revolved around AI workflow efficiency and model flexibility, I noticed two things in my daily use:</p><ul><li><p>I became increasingly <strong>tempted to manually control context to increase the signal-to-noise ratio</strong> of inputs.</p></li><li><p>I was painfully aware of both <strong>speed and quality dropping</strong> as a conversation (or even just a task) <strong>ran on</strong>.</p></li></ul><p>Based on my experience manipulating context, I saw significant headroom to make models (and by extension, harnesses) more efficient. This is where we dive into something a bit more research-oriented: I built a rough evaluation framework and ran several tests that I&#8217;ve laid out below. I even had to iterate on the tests to make sure I didn&#8217;t constantly hit API rate limits or have my computer fall asleep and force a re-run of a 3-hour job (I learned that one the hard way).</p><p>The evaluation consists of a reproducible benchmark, automated quality scoring with an LLM as a judge, and a framework that runs the same tasks under different conditions and compares cost, token consumption, and output quality.</p><h2><strong>Building the Measurement Tool</strong></h2><p>I wanted to run my experiments using Claude Code directly, but as far as I understand, I can&#8217;t because <code>claude -p</code> cannot be interacted with between turns.</p><p>To get around this, I built a (very) minimal code agent using the Anthropic API directly.</p>
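<p>In sketch form, the loop looks roughly like this (a simplified illustration with the Anthropic Python SDK, not the actual code from this project; the model id and schemas are placeholders):</p><pre><code>import subprocess
import anthropic

client = anthropic.Anthropic()

def tool(name, desc, props):
    # Minimal JSON schema for a tool whose params are all strings.
    return {"name": name, "description": desc,
            "input_schema": {"type": "object",
                             "properties": {p: {"type": "string"} for p in props},
                             "required": props}}

TOOLS = [tool("read_file", "Read a file", ["path"]),
         tool("write_file", "Write a file", ["path", "content"]),
         tool("run_command", "Run a shell command", ["command"])]

def run_tool(name, args):
    if name == "read_file":
        return open(args["path"]).read()
    if name == "write_file":
        open(args["path"], "w").write(args["content"])
        return "ok"
    return subprocess.run(args["command"], shell=True,
                          capture_output=True, text=True).stdout

def agent_loop(task):
    history = [{"role": "user", "content": task}]
    while True:
        msg = client.messages.create(model="claude-sonnet-4-5",
                                     max_tokens=2048, tools=TOOLS,
                                     messages=history)
        history.append({"role": "assistant", "content": msg.content})
        if msg.stop_reason != "tool_use":
            return msg  # no more tool calls; the task is done
        results = [{"type": "tool_result", "tool_use_id": b.id,
                    "content": run_tool(b.name, b.input)}
                   for b in msg.content if b.type == "tool_use"]
        # Every turn re-sends this entire history, which is the
        # linear token growth measured below.
        history.append({"role": "user", "content": results})</code></pre>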
<p>It includes three tools (<code>read_file</code>, <code>write_file</code>, and <code>run_command</code>) that execute in a straightforward loop: send message, receive tool calls, execute them, and then return results.</p><p>The first measurement was instructive:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!euwB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04786320-225f-4cc8-acc5-465ddff81ddf_1300x391.png"><img src="https://substackcdn.com/image/fetch/$s_!euwB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04786320-225f-4cc8-acc5-465ddff81ddf_1300x391.png" width="1300" height="391" alt=""></a></figure></div><p>With the same tasks, the main difference is Claude Code&#8217;s system prompt, project scaffolding, and tool infrastructure, all of which get re-sent every turn.</p><p>Per-turn accumulation showed linear growth:</p><ul><li><p>Turn 1:   2,444 input tokens</p></li><li><p>Turn 5:   4,952</p></li><li><p>Turn 10:  8,303</p></li><li><p>Turn 15: 10,507</p></li><li><p>Turn 20: 14,082</p></li></ul><p>Every turn re-sends the full conversation history (as we expected). Turn 1 sends the initial prompt, while turn 20 re-sends the prompt plus every file it read, every command output it processed from previous turns, and so on.</p><p>The linear accumulation of conversation history, most of which decays in relevance within a few turns of being generated, ends up becoming a tax on token use.</p><h2><strong>What Worked</strong></h2><h3><strong>Mid-Conversation Compression</strong></h3><p>This intervention is straightforward. 
When the conversation history exceeds a threshold (12,000 tokens in these experiments), the agent pauses, generates a structured summary of everything accomplished so far, including files read, files written, commands run, errors encountered, and replaces the full history with that summary.</p><p>The token consumption pattern looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sYDG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sYDG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png 424w, https://substackcdn.com/image/fetch/$s_!sYDG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png 848w, https://substackcdn.com/image/fetch/$s_!sYDG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!sYDG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sYDG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png" width="1456" height="790" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:790,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108633,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rocketvish.substack.com/i/194145994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!sYDG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png 424w, https://substackcdn.com/image/fetch/$s_!sYDG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png 848w, https://substackcdn.com/image/fetch/$s_!sYDG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff62deb9c-33c4-4ac4-bcac-13794d502916_1968x1068.png 1272w, 
<p>The compression at turn 14 reduced the conversation from 12,733 to 2,255 tokens. The model continued from the summary and completed all remaining tasks without losing track of what it had already built.</p><p>Results across all eight tasks:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!8-B3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02327e21-ae13-4643-93c9-679427de87dd_1532x391.png" alt="Results table across all eight tasks"><figcaption class="image-caption">*Claude Code quality from the session isolation experiment; agent quality from the compression experiment. Both use Sonnet on the same benchmark.</figcaption></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02327e21-ae13-4643-93c9-679427de87dd_1532x391.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40901,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rocketvish.substack.com/i/194145994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02327e21-ae13-4643-93c9-679427de87dd_1532x391.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8-B3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02327e21-ae13-4643-93c9-679427de87dd_1532x391.png 424w, https://substackcdn.com/image/fetch/$s_!8-B3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02327e21-ae13-4643-93c9-679427de87dd_1532x391.png 848w, https://substackcdn.com/image/fetch/$s_!8-B3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02327e21-ae13-4643-93c9-679427de87dd_1532x391.png 1272w, https://substackcdn.com/image/fetch/$s_!8-B3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02327e21-ae13-4643-93c9-679427de87dd_1532x391.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">*Claude Code quality from the session isolation experiment; agent quality from the compression experiment. Both use Sonnet on the same benchmark.</figcaption></figure></div><h2><strong>Implications</strong></h2><p>Mid-conversation compression reduced input tokens by 25.6% versus the uncompressed agent, with no quality degradation (4.2 vs. 
<p>These findings align with a growing body of research on agent context management. <a href="https://blog.jetbrains.com/research/2025/12/efficient-context-management/">Lindenbauer et al. (NeurIPS 2025)</a> found that observation masking and LLM summarization both reduce costs in coding agents, but noted that the field still treats context management as an engineering detail rather than a core research problem. Our 25% reduction through simple threshold-based summarization suggests there&#8217;s significant room for more sophisticated techniques.</p><h2><strong>Limitations</strong></h2><p>These results are preliminary. The experiments use one benchmark project, one model (Sonnet), and one run per condition. The findings need replication across project types, models, and multiple runs to establish reliability. The quality evaluation uses LLM-as-judge (Sonnet scoring Sonnet&#8217;s output), which carries a risk of self-evaluation bias.</p><p>The compression threshold (12,000 tokens) and summarization strategy (full-history summary) were chosen pragmatically based on turn outputs, and they&#8217;re not optimized. Different thresholds and compression techniques would produce different tradeoff curves.</p><h2><strong>What&#8217;s Next</strong></h2><p>This post establishes that mid-conversation history compression works &#8212; it reduces tokens significantly without degrading output quality. The next order of business is investigating how far the technique extends and what the optimal compression strategy looks like.</p><p>I&#8217;m exploring a few approaches that go beyond lossy summarization. There are techniques that preserve relevant parts of the original conversation verbatim while discarding the parts the model no longer needs. There are also interesting interactions between compression and prompt caching, since compression resets the cache and the two techniques need to be balanced. More on those in Part 3.</p><p>The code for Bearing, along with the evaluation framework, the benchmark, the custom agent, and the compression module, is open source and available at <a href="https://github.com/rocketvish/bearing">github.com/rocketvish/bearing</a>.</p><div><hr></div><p><em>This is the second post in a series on context management for AI coding agents. The first, &#8220;<a href="https://rocketvish.substack.com/p/why-i-stopped-copy-pasting-between">Why I stopped copy-pasting between Claude windows</a>&#8221;, introduced <a href="https://github.com/rocketvish/bearing">Bearing</a> and the case for separating planning from execution.</em></p>
<div><hr></div><h2><strong>Appendix</strong></h2><h3><strong>What Didn&#8217;t Work</strong></h3><h4><strong>Experiment 1: Compressing Inter-Task Context</strong></h4><p>When Bearing completes a task, it passes a short summary to the next one &#8212; something like &#8220;[task-001: Database schema | files: src/db.js] Created SQLite tables for users, tasks, and reports.&#8221;</p><p>I built four strategies for compressing this handoff context: raw prose, structured JSON, embedding-based relevance scoring (using nomic-embed-text locally via Ollama to drop portions that score as irrelevant to the next task), and LLM-guided compression (using Gemma 4 26B locally to rewrite mid-relevance chunks with only the details relevant to the downstream task).</p>
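<p>For concreteness, the embedding strategy looked roughly like this &#8211; a sketch rather than the actual module, with an arbitrary illustrative threshold:</p><pre class="shiki"><code class="language-python">import math
import ollama  # assumes a local Ollama server with nomic-embed-text pulled

def embed(text):
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filter_handoff(chunks, next_task_description, threshold=0.3):
    # Keep only handoff chunks that score as relevant to the next task.
    task_vec = embed(next_task_description)
    return [c for c in chunks if cosine(embed(c), task_vec) >= threshold]</code></pre>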
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7YRr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53b0ce91-ec79-483e-b790-43d5bd0abc08_1300x460.png 424w, https://substackcdn.com/image/fetch/$s_!7YRr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53b0ce91-ec79-483e-b790-43d5bd0abc08_1300x460.png 848w, https://substackcdn.com/image/fetch/$s_!7YRr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53b0ce91-ec79-483e-b790-43d5bd0abc08_1300x460.png 1272w, https://substackcdn.com/image/fetch/$s_!7YRr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53b0ce91-ec79-483e-b790-43d5bd0abc08_1300x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There wasn&#8217;t a meaningful difference in cost or quality. The embedding scorer never dropped a single portion across three full evaluation runs.</p><p>There were two issues.</p><ol><li><p>In a cohesive project, where tasks build on each other within the same codebase, everything scores as semantically related to everything else, which makes dropping portions difficult. Moreover, in a small project like our test benchmark, filtering out irrelevant files will result in minimal savings.</p></li><li><p>The inter-task handoff is approximately 200 tokens, while each task consumes 100,000&#8211;300,000 tokens during execution.</p></li></ol><p>This experiment turned out to test something different than I intended &#8212; I didn&#8217;t realize you can&#8217;t modify what happens inside <code>claude -p</code> between turns</p><h4><strong>Experiment 2: Do fresh windows save tokens?</strong></h4><p>I tested eight separate <code>claude -p</code> sessions versus one single <code>claude -p</code> call with all eight tasks consolidated into a mega-prompt. 
<p>The benchmark was an 8-task Express API project covering database setup, authentication, CRUD routes, validation, middleware, reports, and integration tests.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!dqk9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3d8bdf-c892-4474-ae66-233ccef2f33a_1532x321.png" alt="Token totals: eight fresh sessions versus one consolidated session"></figure></div>
loading="lazy"></picture><div></div></div></a></figure></div><p>The mechanism seems to be cold-start overhead. Each fresh Bearing session reads CLAUDE.md, scans the directory, reads package.json, reads existing source files before actually writing any code. Meanwhile, the single session only has to do that once vs. eight times.</p><p>Prompt caching compounds the disadvantage. Claude Code caches the conversation prefix at roughly 1/10th cost. In a single session, most tokens on turn 10 are cache hits from turns 1&#8211;9, while fresh sessions have nothing to cache.</p><h3><strong>Sources</strong></h3><ul><li><p><strong>Context degradation in long contexts</strong> Liu et al., &#8220;Lost in the Middle: How Language Models Use Long Contexts.&#8221; TACL 2024.<a href="https://arxiv.org/abs/2307.03172"> https://arxiv.org/abs/2307.03172</a></p></li><li><p><strong>Context management for coding agents</strong> Lindenbauer et al., TUM / JetBrains. Presented at NeurIPS 2025 (DL4Code workshop). Tests observation masking and LLM summarization on SE agents.<a href="https://blog.jetbrains.com/research/2025/12/efficient-context-management/"> https://blog.jetbrains.com/research/2025/12/efficient-context-management/</a></p></li><li><p><strong>Agent context compression framework</strong> Kang et al., &#8220;ACON: Optimizing Context Compression for Long-Horizon LLM Agents.&#8221; arXiv 2025. 26&#8211;54% token reduction on multi-step benchmarks.<a href="https://arxiv.org/abs/2510.00615"> https://arxiv.org/abs/2510.00615</a></p></li><li><p><strong>Claude Code autocompact reduction</strong> GitHub Issue #43989. Autocompact threshold reduced from ~1M to ~400K tokens in v2.1.92. <a href="https://github.com/anthropics/claude-code/issues/43989">https://github.com/anthropics/claude-code/issues/43989</a></p></li><li><p><strong>Prompt caching mechanics</strong> Anthropic API docs. Cache reads at 0.1x price, writes at 1.25x, 5-minute TTL. <a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching">https://platform.claude.com/docs/en/build-with-claude/prompt-caching</a></p></li><li><p><strong>Cache read tokens consume 99.93% of usage quota </strong><a href="https://github.com/anthropics/claude-code/issues/24147">https://github.com/anthropics/claude-code/issues/24147</a></p></li><li><p><strong>Advisor tool</strong> Anthropic blog, April 2026. Sonnet executor + Opus advisor. 2.7pp SWE-bench improvement, 11.9% cost reduction. <a href="https://claude.com/blog/the-advisor-strategy">https://claude.com/blog/the-advisor-strategy</a></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://rocketvish.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is the second post in a series on context management for AI coding agents. The first, &#8220;<strong><a href="https://rocketvish.substack.com/p/why-i-stopped-copy-pasting-between">Why I stopped copy-pasting between Claude windows</a></strong>&#8221;, introduced <a href="https://github.com/rocketvish/bearing">Bearing</a> and the case for separating planning from execution.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>It&#8217;s interesting to note that Claude Code likely reduced the autocompact window to ~400K tokens on Opus 4.6, down from the full ~1M available in v2.1.91 and earlier (Source: <a href="https://github.com/anthropics/claude-code/issues/43989">GitHub</a>). This is an undocumented change that might be an effort to address context inefficiency in large context windows.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Why I stopped copy-pasting between Claude windows (and what I did instead)]]></title><description><![CDATA[AI is better at writing prompts for AI than humans are (for the purpose of writing code).]]></description><link>https://rocketvish.substack.com/p/why-i-stopped-copy-pasting-between</link><guid isPermaLink="false">https://rocketvish.substack.com/p/why-i-stopped-copy-pasting-between</guid><dc:creator><![CDATA[Vishnu Kalugotla]]></dc:creator><pubDate>Wed, 08 Apr 2026 04:11:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!H0a1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f04d9-28ab-4652-b69a-6d30da4a6cd8_1713x1249.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI is better at writing prompts for AI than humans are (for the purpose of writing code).</p><p>Or at least that&#8217;s what it has felt like as of late. I&#8217;ve recently fallen into a pattern of copying and pasting generated prompts between multiple Claude Code and Claude windows (embarrassing &#8211; I know), and getting much better results than I would have otherwise.</p>
<p>What started as an attempt to optimize my workflow led me down a rabbit hole of improving model + harness performance and addressing longstanding issues with context window length: namely, that compaction is lossy and token-inefficient.</p><p>Here&#8217;s the long and short of what I learned:</p><p><strong>The Research</strong></p><ul><li><p>Models &#8220;attend&#8221; (pay attention) worse to mid-context information (it&#8217;s improved but not solved)</p></li><li><p>AI-generated prompts (might) outperform human ones on complex tasks</p></li></ul><p>The latter should be taken with a grain of salt, but at the very least might help reduce the amount of time it takes to translate <em><strong>human intent</strong></em> into an output or outcome.</p><p><strong>The Approach</strong></p><ul><li><p>Separate planning from execution context windows to keep both clean</p></li><li><p>Assist AI compaction to preserve relevant context, especially for long-running tasks</p></li><li><p>Keep the human in the planning loop to better achieve what you intend</p></li></ul><p>These are all implemented in <a href="https://github.com/rocketvish/bearing">Bearing</a>, an open-source tool that separates planning from execution across AI coding agents.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!H0a1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f04d9-28ab-4652-b69a-6d30da4a6cd8_1713x1249.jpeg" alt=""></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/809f04d9-28ab-4652-b69a-6d30da4a6cd8_1713x1249.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1062,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:133809,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rocketvish.substack.com/i/193216598?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f04d9-28ab-4652-b69a-6d30da4a6cd8_1713x1249.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H0a1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f04d9-28ab-4652-b69a-6d30da4a6cd8_1713x1249.jpeg 424w, https://substackcdn.com/image/fetch/$s_!H0a1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f04d9-28ab-4652-b69a-6d30da4a6cd8_1713x1249.jpeg 848w, https://substackcdn.com/image/fetch/$s_!H0a1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f04d9-28ab-4652-b69a-6d30da4a6cd8_1713x1249.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!H0a1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f04d9-28ab-4652-b69a-6d30da4a6cd8_1713x1249.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Early Results</strong></p><p>In a long session, compaction can discard the vast majority of your conversation history. In one of my sessions, 99.5% was dropped, keeping decisions but losing the reasoning behind them. 
<h2><strong>The Attention Problem</strong></h2><p>In 2023, Stanford and Meta researchers (<a href="https://arxiv.org/abs/2307.03172">Liu et al., 2024</a>) documented the &#8220;Lost in the Middle&#8221; phenomenon: model performance follows a U-shaped curve, highest when relevant information is at the beginning or end of the context, degrading when it&#8217;s buried in the middle. GPT-3.5&#8217;s accuracy (yes, that&#8217;s an old model, as are some of the ones below) dropped over 20% when key information sat mid-context.</p><p>Newer models, of course, have improved dramatically on this. As early as 2023, Anthropic showed that a simple prompt adjustment raised Claude 2.1&#8217;s retrieval accuracy from 27% to 98% across its 200K context window (<a href="https://www.anthropic.com/news/claude-2-1-prompting">Anthropic</a>).</p><p>Despite these improvements, 2026 best-practice guides still recommend placing critical information at the beginning and end of context windows. The practical problem persists in a different form: even if the model can technically attend to everything, a Claude Code session that&#8217;s accumulated five tasks&#8217; worth of test output and error traces has a much noisier context than a fresh session with just the task prompt. The signal-to-noise ratio degrades regardless of whether the model&#8217;s &#8220;positional attention&#8221; is uniform.</p><p>In other words: your intent from attempt 1 isn&#8217;t lost because the model can&#8217;t read the middle of its context. It&#8217;s lost because it&#8217;s competing for attention with dozens of other signals that have accumulated since then.</p><h2><strong>Why AI should (maybe) write prompts for AI</strong></h2><p>When you tell Claude Code &#8220;add dark mode,&#8221; your brain fills in 20 things you don&#8217;t say: where the toggle goes and what it looks like, how to persist the preference, how to contrast with the existing color scheme, and more. It&#8217;s difficult to figure out which details to fill in, and which an agent might fill in better than you would plan. Worse, it&#8217;s tough to know what might anchor an agent on a suboptimal path.</p><p>This can be especially tough because an executing coding agent starts without much shared context (except what you painstakingly provide in markdown files, of course).</p><p>An AI planner writing a prompt for the executor doesn&#8217;t assume shared context &#8212; it knows the target session is clean. It specifies which files to read, what approach to take, and what to verify when done.</p><p>There may be reasons rooted in training for why this works. Google DeepMind (<a href="https://arxiv.org/abs/2309.03409">Yang et al., 2023</a>) and Microsoft Research have both shown AI-optimized prompts outperforming human-written ones on benchmarks (OPRO, PromptAgent), though how directly this applies to coding agent prompts is still an open question. Models were optimized during RLHF against structured, complete instructions. An AI writing a prompt for another AI naturally produces text in that format.</p><p>There&#8217;s an added layer to all of this: as the world stands today, I believe that human + AI interaction can serve as a more effective planner in translating intent into outcomes than either alone. Many people writing about vibe coding have played with markdown files and prompt generation to make a coding agent one-shot most (easy) tasks quite effectively.</p>
<p>That said, as tasks grow in complexity, this approach has its limits. Together, the human brings taste, judgment, and domain knowledge, while the AI brings completeness, structure, and the format the executor is optimized for.</p><h2><strong>Doesn&#8217;t Claude Code Already Do This?</strong></h2><p>Kind of. Claude Code already separates planning from execution. Plan mode creates a plan before writing code, subagents create workers, and agent teams have a lead that coordinates agents.</p><p>As far as I&#8217;ve seen, though, the planner is the AI talking to itself. Plan mode plans, then executes with the same context window and without human input during the planning phase. Agent Teams have a lead, but that lead makes architectural decisions autonomously.</p><p>Claude Code&#8217;s plan mode likely won&#8217;t say &#8220;wait &#8211; we&#8217;re not thinking about [insert vital topic here].&#8221; It won&#8217;t say &#8220;actually, build the database layer first because everything depends on it&#8221; unless you thought to ask. It can&#8217;t bring your product instinct into the architectural conversation because it doesn&#8217;t have access to it.</p><p>The issue isn&#8217;t planning versus execution, then. Given the way typical workflows are currently structured, it&#8217;s whether the human is in the planning loop or downstream of it.</p><h2><strong>What does Bearing do then?</strong></h2><p>I built a tool called Bearing to test these ideas. Here&#8217;s how it works:</p><ol><li><p>You open an interactive Claude Code session; that&#8217;s your planner (running Opus or another performant model) that&#8217;s focused on strategy. You have a real conversation to do anything from debating architecture to pushing back on complexity, and making decisions.</p></li><li><p>When you converge on a direction, the planner writes a structured task file. Each task specifies the CLI tool, model, budget, turn limit, which files are relevant, and a detailed prompt drawn from your conversation (a hypothetical sketch of the shape follows this list).</p></li><li><p>Then you run <code>bearing run .</code> and each task executes in a new, isolated <code>claude -p</code> session.</p></li><li><p>Results write back to a status file that the planner reads.
You discuss what to adjust, and the cycle continues.</p></li></ol>
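<p>For illustration, a task file might have a shape like the following. The field names here are assumptions based on the description above, not Bearing&#8217;s actual schema &#8211; see the repo for the real format:</p><pre class="shiki"><code class="language-python"># Hypothetical task-file shape, rendered as a Python dict.
task = {
    "id": "task-002",
    "cli": "claude",                 # which CLI tool executes this task
    "model": "sonnet",
    "budget_usd": 2.00,
    "max_turns": 25,
    "depends_on": ["task-001"],      # completed tasks' summaries get injected
    "focus": ["src/db.js", "src/routes/tasks.js"],  # read these first
    "skip": ["tests/fixtures/"],     # do not read or modify
    "prompt": (
        "Add CRUD routes for tasks. task-001 created SQLite tables for "
        "users, tasks, and reports in src/db.js. Verify with npm test."
    ),
}</code></pre>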
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The wrong kind of bearing &#8212; very helpful though.</figcaption></figure></div><p>A couple benefits of this approach:</p><p><strong>Model-agnostic execution:</strong> Each task specifies which CLI runs it &#8212; Claude Code, Codex, or any custom agent. You can also mix them in the same task queue.</p><p><strong>Context focusing: </strong>Instead of the executor reading your entire codebase and hoping attention lands in the right place, each task specifies which files matter and which to skip. The executor sees <code>FOCUS: Read these files first</code> and <code>SKIP: Do not read or modify these before the task prompt</code>. This reduces token consumption and concentrates the model&#8217;s attention on relevant code. Quick note here &#8211; I&#8217;m working on a better approach that I&#8217;ll write a bit about next time.</p><p><strong>Auto-context propagation.</strong> When task-001 completes, its summary and file list automatically inject into every dependent task. That means task-002 knows what task-001 created without the human manually copying context between sessions.</p><p>This also means no more copying and pasting between chats, which is how this whole thing started, and AI handles the translation from intent to implementation.</p><h2><strong>The Deeper Problem: Context Compression</strong></h2><p>There&#8217;s a larger issue beneath all of this that hasn&#8217;t been solved (or even addressed, really) yet.</p><p>When a context window compacts, whether through Claude&#8217;s built-in summarization or simply through the model attending less to older content, information loss is unstructured &#8211; the model isn&#8217;t great at knowing what you consider important. Some big architecture decision from prompt 3 could get the same treatment as a debug trace from prompt 17 (most of you have probably seen a typical Claude memory reference to something entirely irrelevant &#8211; this is similar).</p><p>The planner layer offers a structural answer to this. Because the planner curates what context flows into each task &#8212; which files matter, what previous tasks accomplished, what the human decided and why &#8212; it&#8217;s performing a form of (slightly) intelligent compression that the model doesn&#8217;t do on its own.</p><p>It&#8217;s important to note that this attempt is quite early. The current implementation uses text summaries and file lists. 
<p><strong>Auto-context propagation:</strong> When task-001 completes, its summary and file list automatically inject into every dependent task. That means task-002 knows what task-001 created without the human manually copying context between sessions.</p><p>This also means no more copying and pasting between chats, which is how this whole thing started, and AI handles the translation from intent to implementation.</p><h2><strong>The Deeper Problem: Context Compression</strong></h2><p>There&#8217;s a larger issue beneath all of this that hasn&#8217;t been solved (or even addressed, really) yet.</p><p>When a context window compacts, whether through Claude&#8217;s built-in summarization or simply through the model attending less to older content, information loss is unstructured &#8211; the model isn&#8217;t great at knowing what you consider important. Some big architecture decision from prompt 3 could get the same treatment as a debug trace from prompt 17 (most of you have probably seen a typical Claude memory reference to something entirely irrelevant &#8211; this is similar).</p><p>The planner layer offers a structural answer to this. Because the planner curates what context flows into each task &#8212; which files matter, what previous tasks accomplished, what the human decided and why &#8212; it&#8217;s performing a form of (slightly) intelligent compression that the model doesn&#8217;t do on its own.</p><p>It&#8217;s important to note that this attempt is quite early. The current implementation uses text summaries and file lists. A more ambitious version might generate structured relevance maps, prioritize context by impact, and compress differently for different task types &#8211; and that&#8217;s only one potential approach. It&#8217;s also possible or even likely that Anthropic and OpenAI will build some of this into their tools natively.</p><p>Regardless, I&#8217;d contend that a human-in-the-loop planning layer can make better context decisions than a model alone, at least in the current era of LLM-based agents.</p><h2><strong>Try It</strong></h2><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;37ee7cc4-dd62-4342-b37c-238cf57a2161&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">git clone https://github.com/rocketvish/bearing.git
cd bearing
uv tool install -e .
cd &lt;your-project-directory&gt;
bearing start .</code></pre></div><p>This is open source and has no dependencies. The repo includes examples and a planner prompt that teaches your AI session how to write task files.</p><p>The code is at <a href="https://github.com/rocketvish/bearing">github.com/rocketvish/bearing</a>. I&#8217;d love to hear what works and what doesn&#8217;t.</p><h2><strong>Final Thoughts</strong></h2><p>The aforementioned rabbit hole turns out to be deeper than I imagined. Context compression has always been lossy, and as far as I know, no one is solving it structurally yet. Bearing is a rudimentary attempt. If you&#8217;ve hit the same wall &#8212; your agent forgets what matters three tasks in &#8212; I&#8217;d love to hear how you&#8217;re solving it.</p><div><hr></div><h2><strong>Appendix</strong></h2><h3><strong>How this differs from the Landscape</strong></h3><p>The current ecosystem is focused on parallelism, running more agents simultaneously:</p><p><strong>Garry Tan&#8217;s gstack</strong> (<a href="https://github.com/garrytan/gstack">github</a>) assigns different personas to a single Claude Code session via 23 slash commands, but it&#8217;s still one context window accumulating everything.</p><p><strong>Steve Yegge&#8217;s Gas Town</strong> (<a href="https://github.com/steveyegge/gastown">github</a>) runs 20-30 parallel agents coordinated by a Mayor agent, with Polecats for execution, a Refinery for merge queues, and persistent state in Git via Beads. Architecturally ambitious &#8212; it&#8217;s designed for massive parallelization across large codebases.</p><p><strong>Anthropic&#8217;s Agent Teams</strong> (<a href="https://code.claude.com/docs/en/agent-teams">docs</a>) is Claude Code&#8217;s built-in multi-agent feature. One session acts as team lead, coordinating teammates that each get their own context window.</p><p><strong>Conductor</strong> (<a href="https://www.conductor.build/">Melty Labs</a>) is a YC-backed Mac app that gives you a visual UI for parallel Claude Code agents. If you want to run 5 agents on different features simultaneously, Conductor is excellent.</p><p><strong>What all of these share:</strong> they solve the problem of running more agents in parallel, coordinating their work, and managing conflicts, generally scaling throughput.</p><p><strong>What none of them address:</strong> the planning conversation. The back-and-forth between human and AI where you debate architecture, challenge complexity, make strategic decisions, and translate vague intent into precise tasks. In every tool above, the planning either happens in the same polluted context as execution (gstack), or is done autonomously by an AI lead without human input (Agent Teams, Gas Town), or isn&#8217;t part of the tool&#8217;s scope at all (Conductor).</p><p>Bearing doesn&#8217;t compete with these tools. You could use Bearing for planning and Conductor for parallel execution. 
The question Bearing answers is &#8220;how do I think clearly about what to build while AI builds it?&#8221;</p><h3><em>Sources:</em></h3><ul><li><p><em>Liu et al.,<a href="https://arxiv.org/abs/2307.03172"> &#8220;Lost in the Middle: How Language Models Use Long Contexts&#8221;</a> (TACL, 2024) &#8212; tested on GPT-3.5 and Claude 1.3; the positional attention effect has been reduced in newer models but the signal-to-noise principle still holds</em></p></li><li><p><em>He et al.,<a href="https://openreview.net/forum?id=fPmScVB1Td"> &#8220;Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding&#8221;</a> (2024) &#8212; proposes Ms-PoE to mitigate the lost-in-the-middle effect</em></p></li><li><p><em>Yang et al.,<a href="https://arxiv.org/abs/2309.03409"> &#8220;Large Language Models as Optimizers&#8221; (OPRO)</a> (Google DeepMind, 2023)</em></p></li><li><p><em>Wang et al.,<a href="https://arxiv.org/abs/2310.16427"> &#8220;PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization&#8221;</a> (2024)</em></p></li><li><p><em>Microsoft Research,<a href="https://www.microsoft.com/en-us/research/blog/promptwizard-the-future-of-prompt-optimization-through-feedback-driven-self-evolving-prompts/"> &#8220;PromptWizard&#8221;</a> (2025)</em></p></li><li><p><em>Ouyang et al.,<a href="https://arxiv.org/abs/2203.02155"> &#8220;Training language models to follow instructions with human feedback&#8221; (InstructGPT)</a> (OpenAI, 2022)</em></p></li><li><p><em>Bai et al.,<a href="https://arxiv.org/abs/2204.05862"> &#8220;Training a Helpful and Harmless Assistant with RLHF&#8221;</a> (Anthropic, 2022)</em></p></li><li><p><em>Addy Osmani,<a href="https://addyosmani.com/blog/code-agent-orchestra/"> &#8220;The Code Agent Orchestra&#8221;</a> (2026) &#8212; comprehensive overview of multi-agent coding patterns</em></p></li><li><p><em>Maggie Appleton,<a href="https://maggieappleton.com/gastown"> &#8220;Gas Town&#8217;s Agent Patterns, Design Bottlenecks, and Vibecoding at Scale&#8221;</a> (2026)</em></p></li></ul>]]></content:encoded></item></channel></rss>