Large Language Models (LLMs) are powerful but have inherent limitations. They can hallucinate, lack up-to-date knowledge, and struggle with domain-specific expertise. To mitigate these issues, two popular approaches have emerged: CAG (Cache-Augmented Generation) and RAG (Retrieval-Augmented Generation). While both enhance LLM performance, they serve different use cases and have distinct advantages and trade-offs.
https://www.youtube.com/watch?v=HdafI0t3sEY
Cache-Augmented Generation (CAG)
Cache-Augmented Generation (CAG) leverages the expanded context windows of modern LLMs to preload all relevant knowledge into the model's context before inference. It then precomputes and stores the model's internal attention states (the key-value, or KV, cache) for this knowledge. When a query arrives, the model processes it on top of this preloaded cache, eliminating the need for real-time document retrieval.

How CAG Works:
- Preload Knowledge: Curate and preprocess a static dataset or document set, then inject it into the model’s context window.
- Precompute KV Cache: The model runs a single forward pass over this knowledge and saves the resulting key-value states for reuse.
- Store Cache: The KV cache is saved in memory or on disk for efficient reuse.
- Inference: When a user query arrives, it is processed alongside the cached context, allowing for rapid, consistent responses.
- Cache Reset: Optionally, the cache can be refreshed to manage memory or update knowledge.
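
The steps above can be sketched as a toy simulation. This is not a real LLM: the `ToyCAG` class, its method names, and the hash-based `_encode` stand-in are all hypothetical and exist only to illustrate the workflow (encode the knowledge once, persist it, reuse it for each query, then reset). In a real system the cached objects would be the model's key-value tensors.

```python
import hashlib
import pickle
from pathlib import Path


class ToyCAG:
    """Toy stand-in for a model with a reusable KV cache (no real LLM involved)."""

    def __init__(self):
        self.encode_calls = 0   # how many times the "expensive" encoding ran
        self.cache = []         # stand-in for the KV cache
        self.knowledge_len = 0  # number of cached entries that belong to the knowledge

    def _encode(self, text):
        # Hypothetical stand-in for the forward pass that produces KV states.
        self.encode_calls += 1
        return [hashlib.sha256(tok.encode()).hexdigest()[:8] for tok in text.split()]

    def preload(self, documents):
        # Preload Knowledge + Precompute KV Cache: encode the documents once.
        self.cache = self._encode("\n".join(documents))
        self.knowledge_len = len(self.cache)

    def save(self, path):
        # Store Cache: persist the precomputed states to disk.
        Path(path).write_bytes(pickle.dumps(self.cache))

    def load(self, path):
        # Restore a previously stored cache without re-encoding the knowledge.
        self.cache = pickle.loads(Path(path).read_bytes())
        self.knowledge_len = len(self.cache)

    def answer(self, query):
        # Inference: only the query is encoded; the knowledge states are reused.
        self.cache.extend(self._encode(query))
        return len(self.cache)  # a real model would generate text here

    def reset(self):
        # Cache Reset: drop the query entries, keeping only the knowledge.
        del self.cache[self.knowledge_len:]


if __name__ == "__main__":
    cag = ToyCAG()
    cag.preload(["Returns are accepted within 30 days.", "Shipping takes 5 business days."])
    cag.answer("How long does shipping take?")
    cag.reset()
    cag.answer("What is the return window?")
    print("encode calls:", cag.encode_calls)
```

The point of the sketch is the call count: the knowledge is encoded once in `preload`, and each later query adds only its own encoding work.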

Benefits of CAG:
- Low Latency: No real-time retrieval step and no re-encoding of the knowledge on each query, so responses start faster.
- Consistency: Every query draws from the same preloaded knowledge, so answers do not vary with retrieval quality.
- Simplicity: No retrieval pipeline means fewer moving parts and easier maintenance.
- Efficiency: Ideal for small, stable knowledge bases that fit within the model’s context window.

Limitations of CAG:
- Context Size: Cannot handle knowledge bases larger than the model’s context window.
- Static Data: Not suitable for rapidly changing information unless the cache is frequently refreshed.
- Initial Setup: Requires up-front compute to build and store the cache.
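
The context-size and setup-cost limits can be checked with some back-of-the-envelope arithmetic. The sketch below assumes a rough 4-characters-per-token heuristic (real counts depend on the tokenizer) and uses the standard size estimate for a KV cache: two tensors (keys and values) per layer, per KV head, per token. The function names, the reserved-token budget, and the model dimensions in the example are hypothetical.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    # Rough heuristic only (assumption); use the model's tokenizer for real counts.
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(knowledge_tokens, context_window, reserved_tokens=2048):
    # Leave room (assumed budget) for the user query and the generated answer.
    return knowledge_tokens + reserved_tokens <= context_window


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # Factor 2 covers keys and values; bytes_per_value=2 corresponds to 16-bit floats.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 128k context.
    knowledge_tokens = 100_000
    print("fits:", fits_in_context(knowledge_tokens, context_window=128_000))
    size_gib = kv_cache_bytes(knowledge_tokens, 32, 8, 128) / 2**30
    print(f"approx. KV cache size: {size_gib:.1f} GiB")
```

Even when the knowledge fits in the context window, the stored cache can run to many gigabytes, which is worth estimating before committing to CAG.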
Retrieval-Augmented Generation (RAG)