
The New Direction in 2026 LLM Architecture

A practical read on Gemma 4, Laguna, ZAYA1, and DeepSeek V4: why new open-weight LLMs are redesigning attention, KV cache, and residual pathways for long context.

2026-5-20
2026-5-19

The New Direction in 2026 LLM Architecture: Long-Context Cost Is Being Redesigned

Sebastian Raschka's May 16, 2026 article, “Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention,” is worth reading closely. On the surface, it walks through new architectural details in Gemma 4, Laguna XS.2, ZAYA1-8B, and DeepSeek V4. Underneath, the larger story is clearer: LLM architecture competition is moving from “who has more parameters” toward “who can process longer context more cheaply.”
That is not a minor optimization. Reasoning models, coding agents, browser agents, and long-document analysis systems all keep more tokens alive for longer. Once context grows, attention cost, KV cache size, memory bandwidth, and cache movement become the real bottlenecks. The value of Raschka's article is that it connects several seemingly separate architectural choices into one line of progress: Transformer models are not being replaced, but many parts inside the Transformer block are being redesigned.

The Short Version

If I had to summarize the article in one sentence: many 2026 open-weight LLM architecture innovations focus on reducing long-context cost, not inventing a completely new base paradigm.
The representative directions are:
  • Gemma 4 reduces KV cache through cross-layer KV sharing and adds capacity to small models through per-layer embeddings.
  • Laguna XS.2 stops giving every layer the same attention budget and varies query-head allocation by layer.
  • ZAYA1-8B moves attention into a compressed latent space through Compressed Convolutional Attention, reducing both cache and computation.
  • DeepSeek V4 combines mHC for richer residual pathways with CSA/HCA for aggressive long-context attention compression.
My read is that LLM architecture is not getting simpler. It is getting more engineered. The basic skeleton is still a decoder-only Transformer, but every expensive component is being opened up and redesigned.

Why Long Context Forces These Designs

From the product side, many AI features look like simple upgrades: read longer files, keep longer conversations, let agents operate across more steps. Inside the model, those capabilities amplify two costs.
The first cost is attention computation. Standard attention relates the current token to every previous token it can see, so the work per generated token grows linearly with context length, and the work to process a whole prompt grows roughly quadratically.
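A back-of-the-envelope sketch of that growth, counting only query-key dot products under plain causal attention. The function names are mine, and constant factors such as head count, head dimension, and layer count are deliberately dropped.

```python
def decode_step_pairs(context_len: int) -> int:
    """Query-key pairs scored when generating one new token.

    The new token attends to every cached token plus itself.
    """
    return context_len + 1


def prefill_pairs(prompt_len: int) -> int:
    """Query-key pairs scored when a whole prompt is processed causally.

    Token i (1-indexed) attends to i positions, so the total is 1 + 2 + ... + n.
    """
    return prompt_len * (prompt_len + 1) // 2


for n in (8_192, 131_072, 1_048_576):
    print(f"{n:>9,} tokens: decode step {decode_step_pairs(n):>9,} pairs, "
          f"prefill {prefill_pairs(n):.3e} pairs")
```

Doubling the context doubles the decode-step count but roughly quadruples the prefill count, which is why very long prompts hurt before generation even starts.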
The second cost is the KV cache. During autoregressive generation, the model caches past key/value representations so it does not recompute the entire history for each new token. This cache grows with context length, number of layers, number of KV heads, and head dimension. In long-context inference, the expensive part is often not the parameters themselves, but the amount of cache that must be stored, moved, and read for each request.
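To make the growth concrete, here is a minimal size estimate for a plain per-layer KV cache. The configuration below (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not any model discussed in this post, and the formula ignores batching, paging overhead, and any of the compression tricks covered later.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 covers one K tensor and one V tensor per layer.
    bytes_per_value=2 corresponds to bfloat16 or float16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model config, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(seq_len, **cfg) / 2**30
    print(f"{seq_len:>9,} tokens -> {gib:,.0f} GiB of KV cache")
```

With these made-up numbers the cache costs 128 KiB per token, so it scales from 1 GiB at 8K tokens to 128 GiB at 1M tokens for a single request.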
That is why the mechanisms in Raschka's article all answer the same practical question: how do we preserve long-context ability while storing less, computing less, and moving less data?

Gemma 4: Reusing KV Across Layers

In the smaller Gemma 4 E2B/E4B variants, one design is very direct: not every layer computes its own K and V. Some later layers reuse KV tensors from earlier attention layers of the same type.
This points in the same direction as GQA/MQA, but the reuse happens along a different dimension. GQA lets multiple query heads share fewer KV heads. Cross-layer KV sharing lets multiple layers reuse earlier KV tensors. Both reduce KV cache, but one saves along the head dimension while the other saves along the layer dimension.
Raschka highlights a useful detail: Gemma 4 E2B has 35 layers, but only the first 15 compute their own KV projections; the last 20 reuse KV from earlier compatible layers. E4B has 42 layers, with 24 computing their own KV and 18 reusing it. In long context, this saves a substantial amount of memory. The article gives a 128K context example at bfloat16 precision: about 2.7GB saved for E2B and about 6GB for E4B.
A practical mental model is this: each layer still has its own query projection, so it can form its own attention pattern, but it no longer creates a fully independent key/value representation. The cost is a reduction in capacity, because some layer-level independence is removed. The benefit is much lower cache pressure during long-context inference.
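A small sketch of the bookkeeping, using the layer counts quoted above. The mapping rule is a simplifying assumption of mine: every sharing layer reads from the last KV-producing layer, and every layer's cache is treated as the same size. The real models pair layers by attention type, so the actual savings depend on details this sketch ignores.

```python
def kv_source_layers(n_layers: int, n_kv_layers: int) -> list[int]:
    """For each layer, return the index of the layer whose K/V it reads.

    Simplifying assumption: layers past the first `n_kv_layers` all reuse
    the last KV-producing layer.
    """
    return [min(i, n_kv_layers - 1) for i in range(n_layers)]


def cache_fraction_kept(n_layers: int, n_kv_layers: int) -> float:
    """Share of the unshared KV cache that still has to be stored,
    assuming every layer's cache would otherwise be the same size."""
    distinct_sources = set(kv_source_layers(n_layers, n_kv_layers))
    return len(distinct_sources) / n_layers


for name, n_layers, n_kv_layers in [("E2B", 35, 15), ("E4B", 42, 24)]:
    kept = cache_fraction_kept(n_layers, n_kv_layers)
    print(f"{name}: {n_kv_layers}/{n_layers} layers store KV "
          f"-> about {kept:.0%} of the unshared cache")
```

Every layer still runs its own query projection against the shared tensors; only the stored K/V entries are deduplicated.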
This is especially relevant for small and edge-oriented models. For those models, the bottleneck is often not only whether the parameters fit, but whether long-context cache and memory bandwidth can be handled efficiently.

Per-Layer Embeddings: Small Models Do Not Only Scale by Widening the Transformer

Gemma 4 E2B/E4B also use per-layer embeddings, or PLE. This solves a different problem from KV sharing.
KV sharing saves runtime cache. PLE is about parameter efficiency: keep the main Transformer stack computationally smaller, but add token-specific embedding slices that each layer can use.
The “E” in Gemma 4 E2B/E4B means “effective.” For example, E2B is listed as 2.3B effective parameters, or 5.1B parameters when embeddings are counted. E4B is similarly listed as 4.5B effective parameters, or 8B including embeddings. The point is not that the model secretly becomes a dense 5.1B or 8B compute path. The main compute path is closer to the smaller number, while extra capacity is stored in lookup-style embedding parameters.
A simple way to understand it: if you shrink a dense model by reducing layers, hidden size, or FFN width, you directly remove capacity from the core computation. PLE keeps a cheaper backbone and gives each layer a small token-related side input. It is not free, and it adds implementation complexity, but it can be cheaper than scaling the whole Transformer stack.
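Here is a toy sketch of the idea in NumPy. How Gemma 4 actually injects the per-layer slice is not described in this post, so the additive up-projection, the stand-in backbone layer, and all sizes below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_ple, n_layers = 1000, 64, 8, 4  # toy sizes, not Gemma's

tok_emb = rng.normal(size=(vocab, d_model)) * 0.02             # usual input embedding
ple_tables = rng.normal(size=(n_layers, vocab, d_ple)) * 0.02  # one small table per layer
ple_up = rng.normal(size=(n_layers, d_ple, d_model)) * 0.02    # per-layer up-projection


def backbone_layer(h):
    # Stand-in for attention + MLP; only here so the loop has something to run.
    return h + 0.1 * np.tanh(h)


def forward(token_ids):
    h = tok_emb[token_ids]
    for layer in range(n_layers):
        h = backbone_layer(h)
        side = ple_tables[layer, token_ids]  # (seq, d_ple): a pure table lookup
        h = h + side @ ple_up[layer]         # assumed injection: project and add
    return h


token_ids = np.array([3, 17, 256])
out = forward(token_ids)
print("output shape:", out.shape)
print("lookup-only parameters:", ple_tables.size)
```

The point of the pattern is visible in the parameter split: `ple_tables` holds most of the added parameters, but each token only touches one small row per layer, so the extra compute per token stays small.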
Raschka is appropriately cautious here: it would be useful to see direct comparisons against a regular 2.3B model and a regular 5.1B dense model. Still, as an engineering pattern, PLE shows that small-model capacity does not have to come only from making the entire Transformer wider or deeper.

Laguna XS.2: Attention Budgets Are No Longer Uniform

Laguna XS.2 looks fairly standard at first, but its attention budgeting is representative of a broader trend. It has 40 layers: 30 sliding-window attention layers and 10 global/full attention layers.
Mixing local sliding-window layers with global layers is not new. The more interesting detail is that Laguna XS.2 uses different numbers of query heads in different layers. Raschka points to the num_attention_heads_per_layer field in the Hugging Face config, which allows query-head counts to vary by layer while keeping the KV cache shape compatible.
In plain terms: global attention layers are expensive because they see the entire context; sliding-window layers are cheaper because they only see a local window. Laguna gives fewer query heads to the expensive global layers and more query heads to the cheaper sliding-window layers. Instead of distributing attention capacity evenly, it allocates capacity according to each layer's cost and role.
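A rough cost comparison shows why this helps. The 40-layer, 10-global/30-sliding split comes from the article; the interleaving pattern, the head counts, and the window size below are made-up values, and the `budgeted` list only plays the role of a per-layer head-count field like `num_attention_heads_per_layer`.

```python
def decode_score_cost(heads_per_layer, layer_types, context_len, window):
    """Relative attention-score work for one decode step.

    Each query head scores every token its layer can see, so the cost is
    proportional to sum(query_heads * visible_tokens) over layers.
    """
    total = 0
    for heads, kind in zip(heads_per_layer, layer_types):
        visible = context_len if kind == "global" else min(context_len, window)
        total += heads * visible
    return total


# Hypothetical layout: every fourth layer is global (10 global, 30 sliding).
layer_types = ["global" if (i + 1) % 4 == 0 else "sliding" for i in range(40)]
uniform = [28] * 40                                                  # same heads everywhere
budgeted = [16 if kind == "global" else 32 for kind in layer_types]  # fewer on global layers

context_len, window = 131_072, 4_096  # hypothetical context and window sizes
for name, heads in [("uniform", uniform), ("budgeted", budgeted)]:
    cost = decode_score_cost(heads, layer_types, context_len, window)
    print(f"{name:>8}: {sum(heads)} query heads in total, relative cost {cost:,}")
```

Both layouts use the same total number of query heads, but the budgeted one spends them where each head sees only a short window, so its long-context score cost is noticeably lower.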
This looks a lot like resource scheduling in real systems. Not every module deserves the same budget. Not every layer needs the same width. Model architecture is starting to look like a profiled system: compress expensive paths, and spend capacity where it is cheaper and useful.

ZAYA1-8B: Doing Attention Directly in a Compressed Space

ZAYA1-8B's key mechanism is Compressed Convolutional Attention, or CCA. It is related in spirit to DeepSeek's earlier MLA because both introduce compressed representations, but they use those representations differently.
MLA mainly stores the KV cache in a compact latent representation and projects it into attention-head space when needed. CCA is more aggressive: it compresses Q, K, and V, performs attention directly in the compressed latent space, and then projects the result back.
This creates two potential benefits:
  • The KV cache can be smaller.
  • Attention FLOPs during prefill and training can also fall, because the attention operation itself happens in a narrower space.
Why add the convolutional part? Because compression reduces expressiveness. CCA applies convolutional mixing to the compressed Q and K tensors so they carry more local context before attention scores are computed. This is mainly applied to Q/K rather than V because Q/K decide where to attend, while V carries what content is retrieved.
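The shape of the computation can be sketched in a few lines of NumPy. This is a single-head toy under my own assumptions (a depthwise causal convolution with kernel size 3, random weights, tiny dimensions); it leaves out everything else the real CCA layer does and is only meant to show where the compression, the convolution, and the attention sit relative to each other.

```python
import numpy as np


def causal_conv(x, kernel):
    """Depthwise causal convolution along the sequence axis.

    x: (seq, d), kernel: (k, d). Output position t mixes x[t-k+1 .. t] only.
    """
    k = kernel.shape[0]
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    return sum(padded[i:i + x.shape[0]] * kernel[i] for i in range(k))


def compressed_attention(x, w_q, w_k, w_v, w_o, conv_q, conv_k):
    # 1) Compress Q, K, V into the narrow latent space.
    q = causal_conv(x @ w_q, conv_q)  # 2) mix local context into Q ...
    k = causal_conv(x @ w_k, conv_k)  #    ... and K, but not V
    v = x @ w_v
    # 3) Causal softmax attention computed entirely in the latent space.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 4) Project the result back to model width; (k, v) is what gets cached.
    return (weights @ v) @ w_o, (k, v)


rng = np.random.default_rng(0)
seq, d_model, d_latent, kernel = 6, 32, 8, 3  # toy sizes
w_q, w_k, w_v = (rng.normal(size=(d_model, d_latent)) * 0.1 for _ in range(3))
w_o = rng.normal(size=(d_latent, d_model)) * 0.1
conv_q, conv_k = (rng.normal(size=(kernel, d_latent)) for _ in range(2))

x = rng.normal(size=(seq, d_model))
y, (k_cache, v_cache) = compressed_attention(x, w_q, w_k, w_v, w_o, conv_q, conv_k)
print("output:", y.shape, "cached K:", k_cache.shape, "cached V:", v_cache.shape)
```

Both the cached tensors and the score computation have width `d_latent` rather than `d_model`, which is where the cache and FLOP savings in the two bullets come from.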
The important part of ZAYA1-8B is that it does not only store cache more compactly. It rewrites the attention computation path. If this type of method scales reliably, it matters a lot because long-context cost is not only a decode-time cache problem; it is also a prefill problem when very large prompts are processed.

DeepSeek V4: Wider Residual Paths and Compressed Long-Context Attention

DeepSeek V4 is the most complex part of Raschka's article. To keep it understandable, split it into two lines.
The first line is mHC, or manifold-constrained hyper-connections. It changes the residual pathway.
In a conventional Transformer block, we usually think in terms of a main residual stream. Hyper-connections expand that into several parallel residual streams and learn mappings between them. mHC adds constraints on top of that, such as projecting the residual mapping onto the manifold of doubly stochastic matrices, to make the mixing more stable and avoid uncontrolled amplification or cancellation across many layers.
This is interesting because many recent architecture changes focus on attention, MoE, normalization, and routing. mHC changes the residual connections instead. The goal is not to make attention itself wider, but to make the information flow between layers more expressive without widening the attention or MoE hidden size.
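To see why a doubly stochastic constraint stabilizes the mixing, here is a minimal sketch that uses Sinkhorn normalization (alternating row and column normalization) as the projection. Treat the choice of Sinkhorn, the iteration count, and the sizes as my assumptions; the sketch also covers only the stream-to-stream mixing, not how a layer reads from or writes to the streams.

```python
import numpy as np


def sinkhorn(logits, n_iters=50):
    """Approximately map a square matrix to a doubly stochastic one.

    Exponentiate to get positive entries, then alternately normalize
    rows and columns until both sum to (nearly) one.
    """
    m = np.exp(logits - logits.max())
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m


rng = np.random.default_rng(0)
n_streams, d_model = 4, 16  # toy sizes

mix = sinkhorn(rng.normal(size=(n_streams, n_streams)))  # learned in a real model
streams = rng.normal(size=(n_streams, d_model))          # parallel residual streams
mixed = mix @ streams                                    # each output is a weighted average

print("row sums:   ", mix.sum(axis=1).round(4))
print("column sums:", mix.sum(axis=0).round(4))
print("total preserved:", np.allclose(mixed.sum(axis=0), streams.sum(axis=0)))
```

Because every row is a set of non-negative weights summing to one, no output stream can exceed the largest input magnitude, and because every column sums to one, the total across streams is preserved. Stacking many such mixes cannot blow up or cancel the signal the way an unconstrained matrix could.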
The second line is CSA/HCA: Compressed Sparse Attention and Heavily Compressed Attention. This changes long-context attention and cache.
It is important to distinguish this from MLA-style compression. MLA mainly compresses the representation stored for each token while still keeping an entry for each token. CSA/HCA compress along the sequence dimension. Instead of keeping one historical entry per token, they compress groups of tokens into fewer KV entries. In other words, they make the cache shorter.
CSA is milder and combines compression with sparse top-k selection. HCA is more aggressive, for example compressing 128 tokens into one entry and then using dense attention over the much shorter compressed history. Both keep a local sliding-window branch for recent uncompressed tokens.
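A toy version of sequence-dimension compression, in the spirit of the more aggressive HCA branch. Mean pooling stands in for whatever learned compressor the real model uses, the window size is made up, and CSA's top-k selection is omitted, so read this as a sketch of the cache shape rather than of DeepSeek's method.

```python
import numpy as np


def compress_history(kv, group, window):
    """Split cached entries into a compressed prefix and an uncompressed tail.

    kv: (seq, d). Older tokens are pooled `group` at a time into one entry;
    at least the most recent `window` tokens are kept as-is.
    """
    seq, d = kv.shape
    n_old = max(seq - window, 0)
    n_old -= n_old % group  # only compress complete groups
    compressed = kv[:n_old].reshape(n_old // group, group, d).mean(axis=1)
    recent = kv[n_old:]
    return compressed, recent


rng = np.random.default_rng(0)
kv = rng.normal(size=(1000, 16))  # toy cache: 1000 tokens, width 16
compressed, recent = compress_history(kv, group=128, window=100)

print("compressed entries:", compressed.shape[0])
print("recent uncompressed entries:", recent.shape[0])
print("total entries attended over:", compressed.shape[0] + recent.shape[0], "of", kv.shape[0])
```

At a 1,048,576-token context, a 128-to-1 ratio leaves 8,192 compressed entries plus the local window, which is short enough that dense attention over the compressed history becomes affordable again.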
Raschka cites numbers from the DeepSeek V4 paper: at 1M-token context length, DeepSeek V4-Pro uses 27% of the single-token inference FLOPs and 10% of the KV cache size compared with DeepSeek V3.2, which uses MLA and DSA. V4-Flash is smaller still, at 10% FLOPs and 7% KV cache. These numbers are aggressive, but they come from the full model recipe rather than a clean ablation of one component.
I would describe CSA/HCA as a long-context engineering route rather than a universally superior replacement for MLA. But once the target is million-token context, compression along the sequence dimension becomes very hard to avoid.

What These Designs Have in Common

Put these models next to each other and the trend is clear: the Transformer is still here, but it increasingly differs from the clean textbook block.
The old mental model was embedding, multi-head attention, MLP, residual, and normalization. In current frontier-style models, we also have GQA/MQA, MLA, sliding-window attention, sparse attention, cross-layer KV sharing, per-layer embeddings, layer-wise head budgeting, compressed latent attention, sequence-level cache compression, MoE routing, and residual-stream mixing.
This has two consequences.
First, inference costs should continue to fall, especially for long-context and agentic workloads. Users see cheaper, longer, faster models. Engineers see more complex kernels, cache layouts, and runtime integrations.
Second, understanding LLM architecture can no longer stop at parameter count and benchmark score. Two models with similar parameter counts can have very different deployment costs because their KV cache design, attention pattern, active-parameter ratio, routing, and residual pathway differ.
That is the most useful point to take from Raschka's article: a lot of important competition is invisible in the chat interface. It happens in the model's internal data flow and cache structure.

Practical Takeaways for Builders

If you build AI products, especially long-document, codebase, knowledge-base, or agent systems, I would take three practical lessons from this architecture trend.
First, long context is not just a “maximum token count” metric. You also need to care about throughput, latency, KV cache footprint, prefill cost, and quality degradation at long context lengths.
Second, open-weight models will become more specialized. One model may be great for local short-context inference. Another may be better for long-context retrieval. Another may be better for high-concurrency agent workloads. Architecture details will increasingly shape model selection.
Third, inference runtimes matter more as architectures become more complex. Supporting attention variants, cache layouts, and MoE routing efficiently can decide whether a model is actually cheap to run. Choosing a model is no longer only about downloading weights; it is also about whether the runtime ecosystem can support the architecture well.

My Read on the Original Article

Raschka's article is strong because it is structured and visual. It does not focus on leaderboard claims. It focuses on what is changing inside the Transformer block. That is exactly the right lens if you care about model cost.
I would add one product-oriented point: these architecture innovations may turn long-context ability from a premium feature in a few expensive models into a default ability across more models. But the cost is rising system complexity. For application developers, the best strategy is not to memorize every new acronym. It is to understand the major cost sources: KV cache, attention FLOPs, memory traffic, prefill, and decode.

Conclusion

The article is about LLM architecture, but the deeper topic is the cost structure of AI systems.
Gemma 4 shares KV across layers. Laguna allocates attention budget by layer. ZAYA1 computes attention in a compressed latent space. DeepSeek V4 pushes both residual pathways and long-context compression further. The mechanisms differ, but the goal is the same: make models usable and affordable in the long-context era.
My one-line takeaway: the Transformer has not been replaced, but it is being systematically engineered. In the next few years, LLM competition will not only be about which model is smarter. It will also be about which model can handle million-token context, agent workflows, and high-concurrency serving with less cache and less computation.