Do Transformers Need Three Projections? New Research Explores QKV Variants for Massive KV Cache Reduction

A systematic study titled 'Do Transformers Need Three Projections?' challenges the traditional Query, Key, and Value (QKV) architecture in Transformer models. Researchers Ali Kayyam, Anusha Madan Gopal, and M Anthony Lewis evaluated three projection-sharing constraints: shared Key-Value (Q-K=V), shared Query-Key (Q=K-V), and a single projection (Q=K=V). The study, which included experiments on language models of up to 1.2B parameters, found that these variants often perform on par with standard Transformers. Most notably, the Q-K=V configuration halves the KV cache at the cost of a 3.1% increase in perplexity. When combined with Multi-Query Attention (MQA), the approach reduces cache requirements by up to 96.9%, which could make on-device AI inference considerably more practical.

Hacker News

Key Takeaways

  • Redefining Transformer Architecture: Research shows that Transformers do not strictly require three separate projections (Q, K, and V) to maintain high performance.
  • Significant Memory Efficiency: The Q-K=V (shared Key-Value) variant reduces KV cache by 50% with a minimal 3.1% perplexity degradation in language modeling.
  • Synergy with Head Sharing: Projection sharing is complementary to Grouped-Query Attention (GQA) and Multi-Query Attention (MQA), enabling a total KV cache reduction of up to 96.9%.
  • Preservation of Quality: The Q-K=V model maintains quality because keys and values occupy similar representational spaces and attention operates in a low-rank regime.
  • On-Device Viability: These optimizations drastically lower the memory footprint, making large-scale models more practical for on-device inference.

In-Depth Analysis

Systematic Evaluation of QKV Variants

The standard Transformer architecture relies on three distinct linear projections: Query (Q), Key (K), and Value (V). This research systematically investigates whether this tripartite division is necessary by testing three specific sharing constraints. The first, Q-K=V, involves sharing the projections for keys and values. The second, Q=K-V, shares the query and key projections. The third, Q=K=V, utilizes a single projection for all three components.
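
The three constraints can be pictured as different ways of tying the projection matrices together. The following is a minimal single-head sketch in plain Python, not the authors' implementation: the helper names (`make_projections`, `attention`), the random weights, and the tiny dimensions are all assumptions made for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def make_projections(variant, d_model, d_head, rng):
    """Return (W_q, W_k, W_v); tied projections are the same object."""
    w_a = random_matrix(d_model, d_head, rng)
    w_b = random_matrix(d_model, d_head, rng)
    w_c = random_matrix(d_model, d_head, rng)
    if variant == "QKV":      # standard: three separate projections
        return w_a, w_b, w_c
    if variant == "Q-K=V":    # keys and values share one projection
        return w_a, w_b, w_b
    if variant == "Q=K-V":    # queries and keys share one projection
        return w_a, w_a, w_c
    if variant == "Q=K=V":    # a single projection for all three roles
        return w_a, w_a, w_a
    raise ValueError(f"unknown variant: {variant}")

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention without a mask."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

rng = random.Random(0)
x = random_matrix(4, 8, rng)  # 4 tokens, d_model = 8 (illustrative sizes)
for variant in ("QKV", "Q-K=V", "Q=K-V", "Q=K=V"):
    w_q, w_k, w_v = make_projections(variant, 8, 4, rng)
    out = attention(x, w_q, w_k, w_v)
    distinct = len({id(w) for w in (w_q, w_k, w_v)})
    print(variant, "distinct projections:", distinct)
```

Returning the same matrix object for tied roles keeps the attention function itself unchanged, which mirrors the idea that projection sharing is a form of weight tying and not a new attention mechanism.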

A technical challenge identified with the Q=K-V and Q=K=V variants is that they naturally produce symmetric attention maps, which can limit the model's ability to represent directional relationships. To counteract this, the researchers explored the use of 2D positional encodings to introduce asymmetric attention. The study's scope was comprehensive, covering synthetic tasks, vision benchmarks (including MNIST, CIFAR, TinyImageNet, and anomaly detection), and large-scale language modeling using 300M and 1.2B parameter models trained on 10B tokens.
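
The symmetry issue can be checked numerically. The sketch below is an illustration under assumed random weights and toy sizes, not code from the study. It compares the pre-softmax score matrix when queries and keys share a projection against the case where they do not. Note that it inspects the raw scores: a row-wise softmax applied afterwards does not in general keep the matrix exactly symmetric.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def score_matrix(x, w_q, w_k):
    """Pre-softmax attention scores (X W_q)(X W_k)^T."""
    return matmul(matmul(x, w_q), transpose(matmul(x, w_k)))

def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

rng = random.Random(0)
x = random_matrix(5, 6, rng)         # 5 tokens, d_model = 6 (illustrative)
w_shared = random_matrix(6, 3, rng)  # used for both queries and keys
w_other = random_matrix(6, 3, rng)   # a separate key projection

shared_scores = score_matrix(x, w_shared, w_shared)    # Q = K
separate_scores = score_matrix(x, w_shared, w_other)   # Q != K

print("Q=K scores symmetric:", is_symmetric(shared_scores))
print("separate Q, K scores symmetric:", is_symmetric(separate_scores))
```

With a shared projection, the score for token i attending to token j equals the score for j attending to i, so the model cannot distinguish the two directions from the scores alone. That is the limitation the positional-encoding remedy described above is meant to address.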

Performance Benchmarks and Efficiency Gains

The experimental results show that these modified Transformers perform on par with, and occasionally better than, the standard QKV Transformer. In language modeling, the Q-K=V projection-sharing constraint emerged as a particularly effective configuration. It achieved a 50% reduction in the Key-Value (KV) cache, a critical bottleneck in Transformer inference, while incurring only a 3.1% increase in perplexity.

The researchers attribute the success of Q-K=V to keys and values often occupying similar representational spaces within the model. Furthermore, because attention typically operates in a low-rank regime, removing a projection does not cause a proportional loss in model capability. By contrast, the study found that the Q=K-V variant tends to break the directionality of attention, making it less effective than the shared Key-Value approach.
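
To see where the 50% figure comes from, it helps to look at what a decoder has to store per token. The sketch below is a hypothetical illustration, not the paper's code: the class names and sizes are invented, and it assumes no position-dependent transform is applied to keys before they are cached. Under Q-K=V the key and value vectors are the same tensor, so one stored copy can serve both roles.

```python
class StandardKVCache:
    """Baseline cache: one key vector and one value vector per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(list(key))
        self.values.append(list(value))

    def num_floats(self):
        return sum(len(k) for k in self.keys) + sum(len(v) for v in self.values)


class SharedKVCache:
    """Q-K=V cache: a single stored vector per token acts as key and value."""

    def __init__(self):
        self.kv = []

    def append(self, kv):
        self.kv.append(list(kv))

    @property
    def keys(self):
        return self.kv

    @property
    def values(self):
        return self.kv  # same storage as keys, no second copy

    def num_floats(self):
        return sum(len(v) for v in self.kv)


D_HEAD = 4       # hypothetical head dimension
NUM_TOKENS = 10  # hypothetical number of decoded tokens

standard, shared = StandardKVCache(), SharedKVCache()
for t in range(NUM_TOKENS):
    vec = [float(t)] * D_HEAD   # stand-in for a projected token vector
    standard.append(vec, vec)   # baseline stores a key copy and a value copy
    shared.append(vec)          # shared cache stores the vector once

print(shared.num_floats() / standard.num_floats())  # prints 0.5
```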

Compounding Benefits with Head Sharing

One of the study's most significant findings is that projection sharing is complementary to existing head-sharing techniques such as Grouped-Query Attention (GQA) and Multi-Query Attention (MQA). When the Q-K=V constraint is combined with GQA-4, the KV cache reduction reaches 87.5%. When paired with MQA, it reaches 96.9%.
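
The reported percentages are consistent with the two savings multiplying. The arithmetic below is a back-of-the-envelope sketch resting on two assumptions not stated in this article: a baseline of 16 attention heads (inferred from 96.9% being roughly 1 - 1/32), and GQA-4 leaving 4 key-value heads. The function names are hypothetical.

```python
def kv_cache_fraction(num_query_heads, num_kv_heads, shared_kv):
    """Fraction of the baseline multi-head KV cache that remains."""
    head_fraction = num_kv_heads / num_query_heads  # saving from head sharing
    tensor_fraction = 0.5 if shared_kv else 1.0     # saving from storing K=V once
    return head_fraction * tensor_fraction

def reduction_percent(num_query_heads, num_kv_heads, shared_kv):
    return 100.0 * (1.0 - kv_cache_fraction(num_query_heads, num_kv_heads, shared_kv))

NUM_HEADS = 16  # assumption: inferred from the reported figures, not stated here

configs = {
    "MHA + Q-K=V": (NUM_HEADS, True),  # every head keeps its own K=V
    "GQA-4 + Q-K=V": (4, True),        # assumption: 4 key-value heads
    "MQA + Q-K=V": (1, True),          # one key-value head shared by all
}

for name, (kv_heads, shared) in configs.items():
    print(f"{name}: {reduction_percent(NUM_HEADS, kv_heads, shared):.1f}% reduction")
```

Under these assumptions the three configurations come out at 50.0%, 87.5%, and 96.9% (96.875% before rounding), matching the figures quoted above.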

This compounding effect addresses one of the primary hardware limitations for deploying Large Language Models (LLMs). By drastically reducing the memory required to store the KV cache during inference, these variants enable high-performance AI to run on resource-constrained hardware, facilitating the transition of complex models from massive data centers to practical on-device applications.

Industry Impact

The implications of this study for the AI industry center on efficiency and deployment. As demand for on-device AI grows, driven by privacy concerns and the need for low-latency responses, architectural optimizations that reduce memory overhead with little loss in model quality become increasingly important. By characterizing projection sharing as an underexplored form of weight tying, this research provides a quantifiable path toward leaner, more deployable Transformer models. The ability to achieve nearly 97% cache reduction through the combination of projection and head sharing could redefine the hardware requirements for the next generation of mobile and edge AI devices.

Frequently Asked Questions

Question: Why does sharing the Key and Value projections (Q-K=V) work so well?

According to the research, Q-K=V preserves model quality because the keys and values in a Transformer often occupy similar representational spaces. Additionally, the attention mechanism operates in a low-rank regime, allowing the model to function effectively even with a reduced number of projections.

Question: How does this research help with running AI on smartphones or laptops?

A major bottleneck for running large models on-device is the memory required for the KV cache. By reducing this cache by up to 96.9% (when combined with MQA), these architectural variants make it easier to fit larger, more capable models into the limited RAM available on consumer devices.

Question: Does reducing the number of projections significantly hurt model accuracy?

The study found that performance was often on par with standard Transformers. In language modeling specifically, the 50% cache reduction from Q-K=V came with only a 3.1% increase in perplexity, a small trade-off for the gain in memory efficiency.
