Conversation Memory is a pattern that maintains context across multiple turns of a dialogue by storing and retrieving previous messages. It solves the statelessness of LLM APIs by explicitly managing conversation history, summarization, and context window limits.
What problem does Conversation Memory solve?
LLM APIs are stateless. This is the kind of fact that seems obvious once you know it but catches many developers off guard. When you send a message to a model, it has no idea what you said five minutes ago. Each API call is a completely independent event. The model receives a block of text, generates a response, and forgets everything.
The conversational experience that chatbots provide is an illusion maintained entirely by the client. The application keeps track of the message history and sends the entire conversation, every previous turn, as part of each new request. The model reads through all of it and generates a response as if it had been participating all along.
This works fine for short conversations. But conversations grow. A 20-turn conversation might consume 8,000 tokens of context just for the history. A technical support session that goes back and forth for an hour could easily exceed 30,000 tokens. At some point, the conversation history no longer fits in the context window. And long before it stops fitting, it starts causing problems. The model takes longer to respond because it is processing more tokens. Costs go up because you are paying per token. And the model's attention can degrade over very long contexts, missing details buried in the middle of the history.
You need a strategy for managing what conversation history goes into the context window. Not a one-size-fits-all approach, but a deliberate choice based on what your application actually needs to remember.
How does Conversation Memory work?
Conversation memory is the set of techniques for deciding which parts of the conversation history to include in each API call. There are several strategies, each with distinct characteristics.
Full history is the simplest approach. Include every message from the conversation in every request. This preserves complete context and is the right choice for short conversations where you will never approach the context limit. Most prototypes start here. The ceiling is obvious: eventually the history exceeds the context window and you need to truncate anyway. But for conversations where total history stays under roughly 20% of your model's context window (about 15-20 turns for a typical 128K-token model), full history is perfectly adequate and the simplest thing that works.
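The full-history approach is simple enough to sketch in a few lines. This is an illustrative sketch using the OpenAI-style message format; the class name and method names are ours, not from any library:

```python
# Minimal full-history memory: append every message, send all of it each turn.
class FullHistoryMemory:
    def __init__(self, system_prompt: str = "You are a helpful assistant."):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list[dict]:
        # Everything, every time -- no trimming, no summarization.
        return list(self.messages)
```

Because nothing is ever dropped, correctness is trivial; the only failure mode is outgrowing the context window.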
Sliding window keeps only the most recent N turns. Older messages are dropped. This guarantees a fixed context size and works well for applications where recent context matters more than distant history. A coding assistant that helps you debug step by step mostly needs the last few exchanges. What you discussed 50 turns ago is rarely relevant to the current error message. The downside is abrupt information loss. If the user stated a critical constraint in turn 3 and you are now on turn 25 with a window of 10, that constraint is gone.
Summary memory periodically compresses older turns into a condensed summary. You keep the recent turns in full and prepend a summary of everything that came before. When the history grows beyond a threshold, you ask the model to summarize the oldest unsummarized turns and replace them with that summary. The summary preserves the gist of earlier conversation while using far fewer tokens. This is more sophisticated than a sliding window because important information from early turns can survive in compressed form.
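The compression policy can be separated from the model call. In the sketch below, `summarize` is an injected callable: in production it would be an LLM call that condenses the previous summary plus the oldest turns into a new summary; here it is left abstract so the trimming logic is visible on its own. All names and thresholds are illustrative assumptions:

```python
# Sketch of summary memory: keep recent messages verbatim, compress the rest.
class SummaryMemory:
    def __init__(self, summarize, keep_recent: int = 6, max_messages: int = 12):
        self.summarize = summarize        # (prev_summary, old_msgs) -> str
        self.keep_recent = keep_recent    # messages kept verbatim
        self.max_messages = max_messages  # compress once history exceeds this
        self.summary = ""                 # rolling summary of older turns
        self.messages: list[dict] = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            # Fold everything except the most recent messages into the summary.
            old = self.messages[:-self.keep_recent]
            self.summary = self.summarize(self.summary, old)
            self.messages = self.messages[-self.keep_recent:]

    def get_context(self) -> list[dict]:
        ctx = []
        if self.summary:
            ctx.append({"role": "system",
                        "content": f"Summary of earlier conversation: {self.summary}"})
        return ctx + self.messages
```

Passing the previous summary back into `summarize` is what makes the record cumulative: each compression step folds the old summary and the newly expired turns into one.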
Entity memory takes a structured approach. Instead of summarizing the conversation as prose, you extract and maintain a running record of entities mentioned, their attributes, and relationships. The user mentioned they work at Acme Corp on the billing team and their manager is Sarah. These facts get stored as structured entries and injected into the context as a factsheet, independent of the conversation turns. This works well when the conversation revolves around specific entities (people, projects, products) and you need to track evolving facts about them.
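The storage side of entity memory is a straightforward structured store. In this sketch the extraction step (which would be an LLM call in production) is assumed to have already produced a dict of facts; only the merge-and-inject mechanics are shown, and all names are hypothetical:

```python
# Sketch of an entity memory store: entity name -> {attribute: value}.
class EntityMemory:
    def __init__(self):
        self.entities: dict[str, dict[str, str]] = {}

    def update(self, facts: dict[str, dict[str, str]]):
        # Merge newly extracted facts; newer values overwrite older ones.
        for name, attrs in facts.items():
            self.entities.setdefault(name, {}).update(attrs)

    def as_factsheet(self) -> str:
        # Render the store as a factsheet to inject into the context.
        if not self.entities:
            return ""
        lines = [
            f"- {name}: " + ", ".join(f"{k}={v}" for k, v in attrs.items())
            for name, attrs in self.entities.items()
        ]
        return "Known facts:\n" + "\n".join(lines)
```

Because `update` overwrites attribute values, evolving facts (the user changes teams, gets a new manager) replace stale ones rather than accumulating alongside them.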
Hybrid approaches combine multiple strategies. A common production setup uses summary memory for older turns, full history for the most recent turns, and entity memory for key facts. The summary provides general context. The recent turns provide conversational flow. The entity facts ensure critical details are not lost.
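Assembling the hybrid context is mostly string concatenation in the right order. A minimal sketch, assuming the summary, factsheet, and recent turns are produced elsewhere:

```python
# Hypothetical context assembly for a hybrid setup:
# system prompt, then summary, then entity factsheet, then recent turns.
def build_context(system_prompt: str, summary: str, factsheet: str,
                  recent: list[dict]) -> list[dict]:
    preamble = system_prompt
    if summary:
        preamble += f"\n\nConversation so far (summarized): {summary}"
    if factsheet:
        preamble += f"\n\n{factsheet}"
    return [{"role": "system", "content": preamble}] + recent
```

Keeping the summary and factsheet in the system message, ahead of the recent turns, also keeps the stable part of the prompt at the front, which matters later for prompt caching.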
When should you use Conversation Memory?
Every conversational LLM application needs some form of conversation memory. The question is which strategy fits your use case.
Start with full history if your conversations are typically short (total history stays under roughly 20% of your model's context window) and your context window is large. Do not over-engineer memory management for conversations that will never hit the limit.
Use a sliding window when recency is what matters. Customer support chats where each question is relatively self-contained, coding assistants where the current task is all that matters, and casual chatbots all work well with a window of the last 5-10 turns.
Use summary memory when conversations are long and the early context stays relevant. Consulting sessions, tutoring interactions, and project planning conversations all benefit from maintaining a compressed record of what was discussed earlier.
Use entity memory when your application revolves around tracking specific objects or people. CRM assistants, project management tools, and relationship management applications need to know facts about entities rather than the flow of conversation.
Implementation
```python
# Using OpenAI SDK for illustration -- swap client for any provider
from openai import OpenAI

client = OpenAI()


class SlidingWindowMemory:
    """Keep only the last N turns in the conversation history."""

    def __init__(self, window_size: int = 10,
                 system_prompt: str = "You are a helpful assistant."):
        self.window_size = window_size
        self.system_message = {"role": "system", "content": system_prompt}
        self.messages: list[dict] = []

    def add_user_message(self, content: str):
        self.messages.append({"role": "user", "content": content})
        self._trim()

    def add_assistant_message(self, content: str):
        self.messages.append({"role": "assistant", "content": content})
        self._trim()

    def _trim(self):
        # Keep last window_size * 2 messages (each turn = user + assistant)
        max_messages = self.window_size * 2
        if len(self.messages) > max_messages:
            self.messages = self.messages[-max_messages:]

    def get_context(self) -> list[dict]:
        return [self.system_message] + self.messages

    def chat(self, user_input: str) -> str:
        self.add_user_message(user_input)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=self.get_context(),
        )
        reply = response.choices[0].message.content
        self.add_assistant_message(reply)
        return reply


# Usage
memory = SlidingWindowMemory(window_size=5)
print(memory.chat("My name is Alice."))
print(memory.chat("What's my name?"))  # Should remember: "Alice"
```
What are the common pitfalls?
Summary compression loses information. The model decides what is "important" when generating the summary, and it does not always decide correctly. A detail that seemed minor at the time might become critical later. Once it is summarized away, it is gone. You can mitigate this by using conservative summarization prompts that err on the side of including more detail, but there is always a trade-off between compression ratio and information preservation.
Sliding windows create a jarring experience when users reference things from before the window. "Remember when I said I needed the blue version?" The model has no memory of that exchange and either confesses ignorance or, worse, confabulates an answer. If your application uses a sliding window, consider informing users about the memory horizon or providing a way to pin important messages.
Entity extraction is imperfect. The model may miss entities, extract incorrect attributes, or fail to update an entity when new information is provided. Entity memory requires validation and correction mechanisms to stay accurate over time.
Context window cost is often underestimated. If you are using summary memory with a 2,000-token summary plus the last 10 turns plus a system prompt, you might be using 5,000 tokens of context before the model generates a single token. At API pricing, that adds up across millions of conversations.
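The arithmetic from the paragraph above is worth making explicit. The per-turn token count and the price per million input tokens below are illustrative assumptions, not any provider's actual rates:

```python
# Back-of-envelope per-request context cost for summary memory.
summary_tokens = 2_000
recent_turns_tokens = 10 * 250   # assume ~250 tokens per turn
system_tokens = 500
context_tokens = summary_tokens + recent_turns_tokens + system_tokens

price_per_million_input = 2.50   # assumed $/1M input tokens
cost_per_request = context_tokens / 1_000_000 * price_per_million_input
```

At these assumed rates, each request pays for 5,000 input tokens before the model generates anything; multiplied across millions of conversations, the context itself becomes a dominant line item.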
What are the trade-offs?
Simpler strategies (full history, sliding window) are easier to implement but less effective at preserving important information over long conversations. Complex strategies (summary memory, entity memory, hybrids) preserve more information but require more engineering effort, more model calls for summarization and extraction, and more careful testing.
Summary memory costs extra because each summarization step is an additional model call. If you summarize every 10 turns, you add roughly 10% overhead in model calls. That is usually worth it for the context savings, but it is a real cost.
There is a latency component to any strategy that pre-processes history. Summarizing older turns or extracting entities before generating a response adds time. For applications where every millisecond of response time matters, you may need to run these processes asynchronously between turns rather than synchronously during generation.
No strategy perfectly solves the fundamental problem. Context windows are finite. Conversations can be infinite. Some information will always be lost or degraded. The goal is to lose the least important information first and preserve what matters most for the user's current needs.
Goes Well With
Long-Term Memory extends conversation memory across sessions. Conversation memory manages the current session. Long-term memory persists important facts and preferences to an external store that survives between sessions. Most applications that care about memory need both layers.
Prompt Caching reduces the cost of conversation memory. If the beginning of your prompt (system message, summary, entity facts) stays stable across turns, prefix caching avoids reprocessing those tokens. This is especially valuable for summary memory where the compressed history changes infrequently.
Basic RAG can supplement conversation memory with external knowledge. When the conversation references information that is neither in the history nor in the model's training data, RAG retrieves it from a knowledge base. The combination of conversation memory (what was discussed) and RAG (what the system knows) gives the model comprehensive context for generating useful responses.