Compressing Coding Agent Context at Models' Native Representation Layer
Introduction & problem
Modern agent/model workflows rely on context (essentially conversation history) to be effective. As context grows[1], models “forget” details, slow down, and otherwise display eroded performance, especially as this growing context is compacted[2].
Every major coding agent today handles context compression through text summarization (according to public documentation).[3] This post builds on past work and explores compressing at the representation level rather than the text level, to see whether it’s possible to preserve important details more efficiently.
Below, we get deep into the details—skip to the practical implications section for conclusions and takeaways.
What we’re tackling
Every time you send a message to an LLM, the entire conversation history gets re-sent as context in the form of English. Coding agents feel this most acutely because their conversations are long (full of tool calls) and the details matter (a wrong file name or path results in broken code). In Part 3 of this series, I discussed how prompt caching (widely used today) reduces compute costs by ~70% by avoiding redundant computation. However, caching doesn’t mitigate the fact that infinitely growing context is difficult to manage and impacts output quality.
Today’s coding agents handle this by truncating old turns, summarizing them into shorter text, or increasing the context window size.[4] Claude Code, for example, auto-compacts at 95% capacity, while Cursor truncates old history. Note that these are text-level interventions—they modify the English conversation before sending it back to the model, even though the model internally represents this information as numerical vectors.

Seeing this made me ask: what if we interacted with internal representations directly?
When a model processes your conversation, it builds representations in the form of numerical vectors[5] that encode its understanding of each token in context. When the model generates a response, that understanding gets turned back into English text, and the internal state is discarded[6]. In the next turn, the model rebuilds everything all over again.
Below, we investigate compressing conversation context into the model’s own representation format. Our experiments compressed a 10-turn coding agent conversation to ~50% of its token count while correctly answering 7 of 8 factual questions about the conversation's content.
What we’re actually doing
When a model reads the text “the project uses bcryptjs for password hashing,”[7] it creates internal vectors at each of its layers[8]. These vectors encode important context[9] in the human-literal sense—that bcryptjs is a specific npm package and that it’s being used for some security function.
The way we compress is as follows: we create “virtual” tokens, which are new vectors initialized from mean-pooled embeddings[10], and we tune them using gradient descent[11] until the model’s internal response to them matches its response to the original text. After optimization, these virtual tokens become representations that produce approximately the same internal state when the model processes them as if it had read the full text.

The model itself is frozen, meaning that we’re not training or fine-tuning it. In other words, we’re finding smaller inputs that, when fed through the frozen model, reproduce the internal states representing the content we want to keep.
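To make this concrete, here’s a minimal sketch of that optimization loop, assuming a Hugging Face causal LM (e.g., Qwen 2.5 Coder 7B-Instruct loaded with transformers). The function name, block-mean initialization, and hyperparameters are illustrative rather than the exact values used in the experiments, and this version matches full hidden states for simplicity; the working recipe later in the post targets per-layer Value vectors specifically.

```python
import torch

def pool_targets(states, num_virtual):
    # Average real-token states into num_virtual blocks so shapes line up with the
    # virtual tokens (uniform pooling for this sketch; assumes seq_len >= num_virtual).
    chunks = states.squeeze(0).chunk(num_virtual, dim=0)
    return torch.stack([c.mean(dim=0) for c in chunks]).unsqueeze(0)

def compress_turn(model, input_ids, num_virtual=100, steps=500, lr=1e-2):
    model.requires_grad_(False)                  # the model itself stays frozen
    embed = model.get_input_embeddings()

    with torch.no_grad():
        full_embeds = embed(input_ids)           # [1, seq_len, hidden]
        # Targets: the model's per-layer states when it reads the full text.
        targets = model(inputs_embeds=full_embeds, output_hidden_states=True).hidden_states
        targets = [pool_targets(t, num_virtual) for t in targets]

    # Virtual tokens start as mean-pooled blocks of the real token embeddings.
    virtual = pool_targets(full_embeds, num_virtual).clone().requires_grad_(True)

    optimizer = torch.optim.Adam([virtual], lr=lr)
    for _ in range(steps):
        out = model(inputs_embeds=virtual, output_hidden_states=True)
        loss = sum(torch.nn.functional.mse_loss(h, t)
                   for h, t in zip(out.hidden_states, targets))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return virtual.detach()                      # stands in for the original turn
```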
How we tested this
The setup
We ran all of the experiments on a local RTX 5090[12] with 32 GB of VRAM using Qwen 2.5 Coder 7B-Instruct at 8-bit quantization. This particular Qwen model satisfied a few requirements:
We needed a local open-weight model because our technique required gradient access, which API-based models don’t expose.
We couldn’t fill the entire 32 GB VRAM capacity with Qwen parameters because we needed room for gradient computation alongside the loaded model.
The data
We used a (very) generic coding agent transcript recorded from our benchmark in Part 3: a 57-turn conversation where an AI agent builds a task management API with Express and SQLite as an example of a coding workflow. The transcript includes (brief) system prompts, tool calls, file reads, code generation, and debugging.
For single-turn experiments, we worked with a 4,096-token window from this transcript (12 turns). For sequential experiments, we split it into:
Prefix: system prompt
Middle: X turns
Suffix: most recent exchange
The evaluation
We test whether the model can answer factual questions about the conversation using compressed context, measured against full context.
“What library does the project use for password hashing?” (answer: bcryptjs, not just bcrypt)
“What is its Node.js test runner?” (answer: node:test, not Jest)
“Does the markdown converter use a library or regex?” (answer: regex-based, no external library)
The scoring is keyword-based to keep complexity low and the scores unambiguous and deterministic. This tests whether specific details survive compression and can’t be fooled by plausible-sounding text.
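As an illustration of that scoring, here’s a minimal sketch. The question/keyword pairs mirror the three questions above, and generate is a placeholder for however the model is queried with the compressed or full context.

```python
# Minimal sketch of the keyword-based scoring. `generate(context, question)` is a
# stand-in for querying the model; each answer must contain the required keyword.
EVAL_QUESTIONS = [
    ("What library does the project use for password hashing?", "bcryptjs"),
    ("What is its Node.js test runner?", "node:test"),
    ("Does the markdown converter use a library or regex?", "regex"),
]

def score(generate, context):
    correct = 0
    for question, keyword in EVAL_QUESTIONS:
        answer = generate(context, question).lower()
        correct += int(keyword in answer)   # deterministic: no judge model, no partial credit
    return correct, len(EVAL_QUESTIONS)
```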
Why this evaluation is useful
Each question has three properties that make it a reasonable test of compression quality:
The answer exists only in the compressed turns: We verify that the system prompt and what I call the suffix (latest input) don’t contain the answer. The model can only get it right by extracting information from the compressed representation.
The model has a wrong “default”: When given no context, the model guesses bcrypt (not bcryptjs), Jest (not node:test), and marked (not regex). These are what I expect are the most popular options from training data. Correct answers require overriding these “defaults” with specific context, which is what our compression methodology needs to preserve.
There are specific ways to be wrong: bcryptjs vs bcrypt is a tiny change and node:test vs Jest is a completely different tool. These are the kind of specific details that text-level summarization (or context compression in general) could easily lose.
The compression recipe that worked
In subsequent sections, I’ve gone into depth on several paths (and dead ends) that led me to a working “recipe” for efficiently compressing context. That recipe is as follows.
Use Value-only optimization targets (vs. Query and Key)
Optimize virtual tokens so their Value vectors (at all layers) approximately match the Value vectors from the full text. This means excluding Keys entirely from the loss (more on this later).
V-vectors carry the actual content that gets processed through attention, the data payload. K-vectors serve as an index—they help the model decide which tokens to attend to. During inference, the model applies RoPE to the virtual tokens’ K vectors at their positions automatically. Since we only optimize V (which is position-independent), we don’t need to worry about positional encoding during the optimization step.
This finding was critical because it circumvents the RoPE problem I hit in Failure 2 below—pooling K vectors from different positions averages together incompatible rotational encodings. There’s been significant work done on KV optimization, and I’ll continue to explore this in future studies. Optimizing V-vectors is simple and more architecture-general.
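A minimal sketch of what V-only targeting can look like is below, assuming the Hugging Face Qwen2 module layout (model.model.layers[i].self_attn.v_proj); the hook placement and names come from that implementation, and the pooled targets are produced separately (see the next section).

```python
import torch

def collect_v(model, inputs_embeds):
    # Capture each layer's Value projections via forward hooks (HF Qwen2 naming).
    values = []
    hooks = [layer.self_attn.v_proj.register_forward_hook(
                 lambda module, inputs, output: values.append(output))
             for layer in model.model.layers]
    model(inputs_embeds=inputs_embeds)
    for h in hooks:
        h.remove()
    return values   # one [1, seq_len, num_kv_heads * head_dim] tensor per layer

def v_only_loss(model, virtual_embeds, pooled_v_targets):
    # Queries and Keys are deliberately excluded: only V enters the objective.
    v_virtual = collect_v(model, virtual_embeds)
    return sum(torch.nn.functional.mse_loss(v, t)
               for v, t in zip(v_virtual, pooled_v_targets))
```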
Use perplexity weighting when pooling V
As in any robust compression, when pooling (i.e., “averaging” or compressing) the V-vectors above into blocks, not all tokens should get equal weight.
We computed per-token surprisal (negative log-likelihood, often loosely called per-token perplexity; averaged across tokens it becomes cross-entropy), which measures how “surprised” the model was by each token. Boring or “unsurprising” tokens (articles, whitespace) get low weight. Surprising tokens (specific library names, unique identifiers) get high weight.

This weighting mechanism doubled the evaluated compression threshold from 1.6x to 3.2x. The mechanism is a bit unintuitive[13]—it works by down-weighting “boring” tokens rather than boosting important ones. When predictable tokens carry less weight in the average, the resulting pooled representation is naturally closer to the informative tokens. The optimization step then has an easier path to matching these improved targets. For example, the loss at 100 virtual tokens dropped from 8.03 to 4.85 compared to uniform pooling in our experiment.
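A sketch of the weighting is below, under the assumption that per-token surprisal is computed from the same model’s logits; the normalization and block-chunking details are illustrative.

```python
import torch

def surprisal_weights(model, input_ids):
    # Per-token surprisal: how "surprised" the model is by each token given its prefix.
    with torch.no_grad():
        logits = model(input_ids).logits                                   # [1, seq_len, vocab]
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)     # [1, seq_len - 1]
    nll = torch.cat([nll.mean(dim=-1, keepdim=True), nll], dim=-1)         # pad the first token
    return (nll / nll.sum(dim=-1, keepdim=True)).squeeze(0)                # [seq_len]

def weighted_pool(v_states, weights, num_virtual):
    # v_states: [seq_len, dim]. Boring (low-surprisal) tokens dilute each block less.
    blocks = zip(v_states.chunk(num_virtual, dim=0), weights.chunk(num_virtual, dim=0))
    return torch.stack([(w.unsqueeze(-1) * v).sum(dim=0) / w.sum() for v, w in blocks])
```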
Interpolate positions
Virtual tokens need positions for RoPE on their K vectors during inference. By default, virtual tokens would get consecutive positions (e.g., 144, 145, ..., 218), while the original tokens occupied a wider range (e.g., 144, ..., 385). We spread the virtual tokens’ positions across the original range: torch.linspace(144, 385, 75).
The critical change here is that it preserves the positional distribution that downstream attention expects[14]. When the suffix (i.e., most important) tokens compute attention, the relative distances to each virtual token approximate the distances to the original tokens. The suffix tokens start at position 386 in our example, regardless of how many virtual tokens there are, acting as if the full middle were present.
The effect is small at low compression (positions are already dense) and more meaningful at high compression, where position clustering would otherwise distort attention patterns.
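A small sketch of the position assignment, using the numbers from the example above (middle originally spanning positions 144-385, 75 virtual tokens, suffix starting at 386); the suffix length here is a made-up placeholder.

```python
import torch

# Spread virtual-token positions across the span the original middle occupied,
# then let the suffix continue from its original position.
middle_start, middle_end, num_virtual = 144, 385, 75
virtual_positions = torch.linspace(middle_start, middle_end, num_virtual).round().long()

suffix_len = 64   # illustrative
suffix_positions = torch.arange(middle_end + 1, middle_end + 1 + suffix_len)

# position_ids fed to the model at answer time: [prefix 0..143] + virtual_positions
# + suffix_positions, so the relative distances the suffix sees stay close to the
# uncompressed case even though only 75 tokens stand in for the middle.
```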
An important note on practicality
This is a proof of concept—per-instance gradient optimization the way we’ve done it here is too computationally intensive and slow for production use. The path to get it production-ready likely requires training a neural net to predict the compressed virtual token embeddings.
What I tried, what failed, and lessons
To get to the working recipe above, I tried numerous approaches and hit several hurdles. Below is a map of that path and how each failed test narrowed the design space that led to the results.
Note that when I’m discussing “loss” below, I’m referring to the mean-squared error that gradient descent is minimizing between two sets of vectors: the internal vectors produced by the virtual tokens and the (pooled) vectors produced by the real tokens they were attempting to represent.
Failure 1: Optimizing for final layer is the wrong objective
My first approach was to create virtual tokens and optimize them so that the model’s output at the final layer matches what the real tokens produce. Using this method, it looked like the virtual tokens were producing “correct” final representations.
However, in testing, our eval answers looked like there was no context at all, and the model was “guessing” its training defaults.
The lesson: When the model generates text, it isn’t solely reliant on its final layer[15]. At every layer independently, each generated token attends to the KV cache entries from previous tokens. This means that optimizing only for the final layer leaves significant loss at every earlier layer.
Failure 2: Including Keys in the optimization is difficult due to RoPE
In my next attempt, I switched to optimizing the KV entries at all 28 layers. I noticed that loss declined (and later significantly plateaued), but the eval answers were still often wrong.
The problem: Modern models apply position-dependent rotations to Key vectors, known as Rotary Position Embeddings, or RoPE. RoPE encodes token positions in such a way that, when the model calculates attention, it naturally takes the relative positions of tokens into account. When we start pooling K vectors from various positions, however, we start averaging vectors with different rotational encodings. The result is a rotational mess that no single-position virtual token can match.
I also confirmed that RoPE isn’t unique to Qwen models, and is used by LLaMA, Gemma, and virtually every modern open-weight model. Any compression approach that targets K vectors will hit this wall.
The lesson: RoPE makes K vectors position-dependent, but Value vectors are position-independent. We verified in Qwen’s source: query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin). This means we can modify and optimize V without worrying about positional encoding.
There’s room for improvement here: since RoPE rotations are deterministic and reversible, a future approach could undo the rotation, optimize K pre-RoPE, then re-apply it for inference.
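As a sketch of what “undoing” RoPE could look like: the Hugging Face implementation rotates with q * cos + rotate_half(q) * sin, so rotating by the negative angle inverts it. This is a hedged sketch of the future-work idea, not something validated in these experiments.

```python
import torch

def rotate_half(x):
    # Same helper as in the HF implementation: pairs dimension i with i + d/2.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, cos, sin):
    return x * cos + rotate_half(x) * sin

def unapply_rope(x_rotated, cos, sin):
    # Rotation by the negative angle: cos(-t) = cos(t), sin(-t) = -sin(t).
    return x_rotated * cos - rotate_half(x_rotated) * sin

# Idea: unapply_rope(K, cos, sin) -> pool/optimize position-free K -> apply_rope
# again at the virtual tokens' interpolated positions before building the cache.
```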
Failure 3: Attention-weighted pooling is both question-specific and regresses results
Once V-only optimization was working, I tried improving the pooling targets (it’s something we’ll have to come back to in future work). Instead of giving all V vectors equal weight when averaging them into blocks, I tried weighting them by how much the evaluation question’s tokens attended to each position. The thought was: high-attention tokens should be preserved better, right?
It’s important to note, though, that this was a rudimentary approach chosen for simplicity given the risk of “overfitting.” The results actually got worse: at 150 virtual tokens, accuracy dropped from 3/3 to 2/3 correct answers.
The lesson: Attention-weighted pooling is inherently question-specific. Weighting toward one question’s attention pattern down-weights tokens that other questions need. There may be better weighting mechanisms for query-agnostic compression that we can explore in the future.
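For reference, the question-specific weighting I tried looked roughly like the sketch below (averaging the question tokens’ attention to each context position over layers and heads); note that in some transformers versions, output_attentions may require the eager attention implementation.

```python
import torch

def question_attention_weights(model, context_ids, question_ids):
    # How much the eval question's tokens attend to each context position,
    # averaged over layers and heads. This is inherently tied to one question.
    ids = torch.cat([context_ids, question_ids], dim=-1)
    with torch.no_grad():
        attentions = model(ids, output_attentions=True).attentions  # per layer: [1, heads, seq, seq]
    ctx_len = context_ids.shape[-1]
    stacked = torch.stack(attentions).mean(dim=(0, 2))              # [1, seq, seq]
    weights = stacked[0, ctx_len:, :ctx_len].mean(dim=0)            # [ctx_len]
    return weights / weights.sum()
```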
An aside on the helpfulness of AI
It goes without saying that AI coding agents were integral in running these experiments. Of course, AI was vital in writing the necessary Python code to create and run the tests. What I didn’t expect, though, was how I used it to help me run feasibility tests to avoid hitting unnecessary obstacles. For example:
Mr. AI, please help me figure out which Qwen model to use so I don’t overflow to system DRAM if I have to run gradient descent on my loaded model and therefore have it take days to run.
Results Summary

3.2x is the lossless (in this eval) threshold for all three turns. Larger turns compress equally well. Turn 24 has the lowest optimization loss at every ratio, suggesting more tokens give the optimizer more material to work with.
At 4.8x, failures follow a similar pattern of degrading names or strings:
“bcryptjs” → “bcrypt” (dropped the JS suffix)
“dev-secret” → “dev-dev-dev-dev-dev...” (strange repetition of the prefix)
“done” → “completed” (fell back to a training synonym)
Structural facts seem to survive much longer than exact names. For instance, the model knows which library is used for password hashing well past the point where it “forgets” the exact name of that library.
Sequential multi-turn: 2x preserves most facts across 10 turns
Single-turn compression gives us a great signal on the compression mechanism we tested, but coding agents rely on long, multi-turn conversations. The next step was testing sequential compression: compressing each turn independently and building up a running KV cache.
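In rough pseudocode, building on the compress_turn sketch from earlier (the ratio handling and splicing details here are illustrative):

```python
def compress_conversation(model, prefix_ids, middle_turn_ids, suffix_ids, ratio=2.0):
    # Compress each middle turn independently; the system prompt (prefix) and the
    # most recent exchange (suffix) stay verbatim. compress_turn is the per-turn
    # optimization loop sketched earlier in this post.
    compressed_turns = []
    for turn_ids in middle_turn_ids:
        num_virtual = max(1, int(turn_ids.shape[-1] / ratio))
        compressed_turns.append(compress_turn(model, turn_ids, num_virtual=num_virtual))

    # At answer time, the prefix and suffix are embedded normally and the virtual
    # tokens are spliced in between, with interpolated position ids as described above.
    return compressed_turns
```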
Sequential Compression in practice

Sequential Compression Results

The main finding was that there is minimal error accumulation across turns, meaning the optimization process works equally well at each step. The remaining question, whether the targets themselves drift as compressed context accumulates, is harder to measure directly in this set of experiments. That said, the fact that 2x compression still preserves 7 of 8 answers suggests this drift (if it exists) remains bounded.
Practical Implications and final thoughts
Getting deep into manipulating internal representations to optimize coding agents, as we’ve done here, can help us gain an intuitive and applicable understanding of several concepts and model behaviors. Taking that understanding and applying it to real user problems reveals insights that can help us shape how we might build the coding agents of tomorrow.
In these experiments, we’ve learned that categorical facts like “the project uses JWT for authentication” survive high compression, while exact names like bcryptjs degenerate to the model’s “default” (bcrypt). Despite the challenge, we’ve also learned that there’s significant headroom for more context efficiency in coding agents. Making such improvements would solve several key issues with coding agents as they exist today and might make them more broadly useful. Applying the above, I would combine several approaches to create a personalized and efficient context strategy.
Compression: Compress context at the representation level with V-vector optimization, using perplexity to weight V-pooling and interpolating virtual-token positions. Our proof of concept shows that it may be possible to compress context ~3x using this method. This is also complementary to caching–combining compression with prompt caching could reduce context costs by 80-90%. Of course, the next step would be building a model to productionize what we’ve learned.
Systematic Preservation of Vital Information: Preserve vital information and keep it verbatim (more to follow on how best to do this in future work). Simple file systems and knowledge graphs using markdown files seem to be the most popular approach. It may also make sense to preserve important information from these file reads in a context window verbatim. This is where personalization comes into play—what type of “memory” system someone uses and what serves as useful information outside of interacting with an agent differs from person to person. This is also complementary to compression.
Intelligent Compaction: We’ve discussed before that Anthropic moved away[16] from exposing very long (1M+ token) context windows by default, given the intrinsic issues. With the above techniques at our disposal, we could be more intelligent about when we compress instead of waiting for a certain context-window threshold.
I would love to explore using signals to automatically trigger compaction and disposal of context, or encourage users to provide those signals directly. Examples include: forking conversations, discarding the last turn if it had a poor result, and having a meta conversation that preserves specific context. This same principle also applies to sub-agents—deciding explicitly whether sub-agents should have context (and what context), and where they should write outputs.
In today’s landscape, a lot of the burden of context management (and even understanding the problem) is put on the user. We could abstract this into the model harness itself, powered by new techniques.
Future work
There are several directions I want to explore from here.
Optimizing K vectors pre-RoPE: We circumvented the RoPE problem entirely by only optimizing V, but since RoPE rotations are reversible, we could undo them, optimize K directly, then re-apply them. This could push the compression threshold beyond 3.2x by jointly optimizing both.
Re-testing the hybrid approach by preserving specific pieces of context verbatim and compressing the rest, at the representation level.
Combining embedding compression with KV cache eviction: Our technique preserves historical context in approximate form. Published eviction methods like H2O and SnapKV manage the growing current context by dropping low-attention entries entirely, and these efforts would be complementary.
Testing the sub-agent context architecture ideas above: The compression research tells us what’s possible with context efficiency, and it would be interesting to apply that systematically to sub-agents from a product perspective. For example, what context sub-agents receive, whether it should be compressed or verbatim, and when to discard context entirely.
Appendix
Limitations
There are several limitations to note.
Model choice and scale: All of the experiments were conducted using Qwen 2.5 Coder 7B. Larger models with more attention heads and parameters might behave and perform differently. On one hand they might have higher representational capacity in each virtual token. On the other, there are more complex internal representations for a virtual token to approximate. We were hardware bound, but I’d love to test the same mechanisms in more capable models.
Eval / benchmark simplicity: We used a single transcript with a set of binary evaluation questions for simplicity. In previous experiments, broken pipelines (and things as simple as the computer going to sleep) resulted in numerous restarts, leading me to focus on the simplest option possible for this phase of research. The compression threshold and degradation patterns need validation across different codebases, programming languages, and types of conversations.
Sequential test oddities: Even 1.5x compression scores ~7/8 on the sequential experiment we ran, suggesting that the sequential pipeline has some quality loss likely beyond what compression ratio alone explains. It’s just something to be explored further.
Hybrid approach is difficult to validate on short contexts: On short turns (e.g., 240 tokens), preserving ~15% of them consumes too much of the “compression budget” and results in a compression ratio that’s too aggressive.
An initial look at a hybrid compression approach
In the above experiments, failures at 4.8x compression are always specific tokens whose signal gets diluted during pooling. Applying a text-level compression technique I’ve attempted before, what if we kept those tokens verbatim and only compressed the rest?
The biggest challenge here was identifying which tokens to exempt. We tested two signals:
Perplexity: Represents how “surprising” the token was. Unfortunately, this selects formatting markers and sub-word fragments as well as useful tokens. Tokens that are surprising sometimes carry no useful content (which is why our pooling methodology down-weights unsurprising tokens rather than boosting surprising ones). More broadly, perplexity wasn’t a useful signal at the macro-content level (at least when applied this way), even though it was useful for pooling.
“Uniqueness” of V: Represents how different a token’s internal representation is from its neighbors. This selects content words like Task, converter, reports, SQLite, authenticate, userId. These are exactly the tokens that lose the most from being averaged with their neighbors during pooling.
Using this technique, we kept the top 15% of tokens by V-uniqueness, and their KV entries passed through the pipeline unmodified at their original positions (with correct RoPE). We then compressed the remaining 85%.
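A sketch of the V-uniqueness selection is below, assuming v holds one layer’s Value vectors for the turn; the neighbor-distance definition and the 15% cutoff are the illustrative choices described above.

```python
import torch

def uniqueness_scores(v):
    # v: [seq_len, dim]. Score each token by how far its V vector is from its
    # immediate neighbors; content words stand out, filler blends in.
    neighbor_dist = torch.norm(v[1:] - v[:-1], dim=-1)       # [seq_len - 1]
    scores = torch.zeros(v.shape[0], device=v.device)
    scores[1:] += neighbor_dist
    scores[:-1] += neighbor_dist
    scores[1:-1] /= 2                                         # interior tokens: mean of both sides
    return scores

def split_verbatim_vs_compress(v, keep_frac=0.15):
    # Keep the top keep_frac tokens verbatim (original positions, correct RoPE);
    # everything else goes through the compression pipeline.
    k = max(1, int(keep_frac * v.shape[0]))
    keep_idx = uniqueness_scores(v).topk(k).indices.sort().values
    compress_mask = torch.ones(v.shape[0], dtype=torch.bool, device=v.device)
    compress_mask[keep_idx] = False
    return keep_idx, compress_mask
```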
Bibliography & Related Work
Our main technique, embedding optimization (without pre-training), builds directly on Kuratov et al.’s work demonstrating that hundreds of tokens can be compressed into optimized embedding vectors (Cramming 1568 Tokens into a Single Vector, ACL 2025). We apply their technique to multi-turn coding agent conversations and add the V-only optimization target, perplexity-weighted pooling, and sequential turn-by-turn compression.
The KV cache compression research field is quite active. Most approaches compress the cache after computation, through quantization or eviction. Our approach is different in that we optimize input embeddings beforehand. Notable work in this space includes EliteKV (RoPE frequency selection) and CodeComp (structural compression for coding agents).
KV cache eviction methods like H2O and SnapKV manage growing context by dropping low-attention entries entirely. I would expect these to be complementary to our approach, since compression preserves important historical context, and eviction manages the current context’s memory footprint.
LLMLingua (Jiang et al., 2023) uses perplexity to discard tokens at the text level. We use perplexity to weight V-vector pooling at the embedding level, which is the same signal applied at a different level of the model.
Anthropic’s guidance on session management and context rot informed our experimental motivation and how widespread this problem is.
Bibliography
Kuratov et al., “Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity.” ACL 2025. arxiv.org/abs/2502.13063
Jiang et al., “LLMLingua: Compressing Prompts for Accelerated Inference.” 2023. arxiv.org/abs/2310.05736
Zhang et al., “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.” 2023. arxiv.org/abs/2306.14048
Li et al., “SnapKV: LLM Knows What You Are Looking For Before Generation.” 2024. arxiv.org/abs/2404.14469
EliteKV: RoPE-aware KV cache compression. 2025. arxiv.org/abs/2503.01586
CodeComp: Structural KV cache compression for coding agents. 2026. arxiv.org/abs/2604.10235
HybridKV: Hybrid per-head compression for multimodal models. 2026. arxiv.org/abs/2604.05887
Anthropic, “Using Claude Code: Session Management and the 1M Token Context Window.” April 2026. claude.com/blog/using-claude-code-session-management-and-1m-context
Notes
This means the model uses some summarization technique to “compact” context into something more manageable to keep working.
Codex calls an API that is opaque to the public.
Compounded further by increased use of agents.
Essentially using coordinates in a high-dimensional space (e.g., [1, 3.2, 4, 7] = cat) to represent pretty much anything.
It’s not quite discarded in the case of caching, but for purposes of this particular point, the message holds.
We’re focused on coding tasks, since that’s the use case where a lot of this pain is felt, though it’s likely broadly applicable outside of coding.
Qwen 2.5 coder 7B has 28 model layers.
A while ago, it took me a long time to figure out how attention works—it’s the mechanism that’s the core of modern LLM architecture. To put it in a few words, attention allows a model to understand whether “go” seen in input is a command, a turn (in the British sense), an ancient board game, who is going, and so on. This video captures it far better than I could explain in text.
It’s pretty much a fancy way of saying averaging together a bunch of vectors.
I’m very proud to note that I’m now getting non-gaming use out of this expensive graphics card that I spent multiple months trying to acquire at MSRP.
This is because the important tokens (to us) don’t necessarily have high perplexity. Yet, we’re able to reduce the dilutive impact of unimportant ones.
If all the virtual tokens were packed into consecutive positions, the model would lose valuable information on relative positions of tokens that it expects (that we’ve now messed with). The interpolation we’ve done across the original range gets us to “good enough.”
To be honest, I excitedly rushed through this step and didn’t fully think through my optimization. Nevertheless, it was a neat way to empirically show how wrong I was.
At least as it was noted several weeks ago.

