Agent Memory Architecture¶

Patterns for managing agent conversation history and long-term memory. This guide explains why manual prompt concatenation fails, how to use state-managed memory correctly, and how to share context between agents.

This document is framework-independent in principles but includes concrete examples for LangGraph and OpenAI Agents SDK.

The Anti-Pattern: Manual Prompt Concatenation¶

When building agents, developers (and AI coding assistants) often default to manually concatenating conversation history into prompts. This is the most common mistake in agent development.

What It Looks Like¶

Anti-pattern: Naive history concatenation

# DON'T DO THIS
class NaiveAgent:
    def __init__(self, model):
        self.model = model
        self.history = []  # Manual history list

    def chat(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})

        # Stuffing full history into every call
        response = self.model.chat(
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                *self.history  # Growing unboundedly
            ]
        )

        self.history.append({"role": "assistant", "content": response})
        return response

Problems:

No persistence: History lost on restart
Unbounded growth: Eventually exceeds context window
No thread isolation: Can't run multiple conversations
Attention degradation: Middle content gets ignored
Token waste: Paying for stale context every call

Anti-pattern: String concatenation

# DON'T DO THIS
def build_prompt(history: list[dict], new_message: str) -> str:
    history_text = "\n".join([
        f"{msg['role']}: {msg['content']}"
        for msg in history
    ])

    return f"""Previous conversation:
{history_text}

User: {new_message}
Assistant:"""

Problems:

Format fragility: Role formatting can confuse the model
No structure: Loses message boundaries
Injection risk: History content can break prompt structure
No tool call preservation: Loses function call context

Why AI Coding Assistants Default to This¶

Training data contains many examples of this pattern because:

It's the simplest implementation
It works for demos and tutorials
Framework-specific patterns require API knowledge
Most code examples don't show production patterns

This is why you have to repeatedly explain you want proper memory management.

Why It Fails: The Evidence¶

"Lost in the Middle" Research (Liu et al., 2023)

LLMs exhibit a U-shaped attention curve—content at the start and end of context receives attention, middle content is systematically ignored. Stuffing history into the middle of a prompt means important context gets lost.

The 75% Rule (Claude Code, Anthropic)

When Claude Code operated above 90% context utilization, output quality degraded significantly. Implementing auto-compaction at 75% produced dramatic quality improvements. The lesson: capacity ≠ capability. Empty headroom enables reasoning, not just retrieval.

Context Rot

Old, irrelevant details don't just waste tokens—they actively confuse the model. A discussion about error handling from 50 turns ago can distract from the current task, even if technically within the context window.

The Correct Model: State-Managed Memory¶

Memory should be first-class state, not prompt injection. The framework handles storage, retrieval, trimming, and injection—your code focuses on logic.

Core Principles¶

1. Separation of Concerns

Concern	Responsibility	Your Code
Storage	Persist messages to durable store	Configure checkpointer
Retrieval	Load relevant history for thread	Provide thread_id
Trimming	Keep context within limits	Set thresholds
Injection	Add history to model calls	Automatic

2. Thread Isolation

Each conversation gets a unique thread_id. The framework maintains separate history per thread, enabling concurrent conversations without interference.

3. Resumability

Conversations can be paused and resumed—even across process restarts. The checkpointer persists state to durable storage.

4. Automatic Management

You don't manually append messages or manage context length. The framework handles this based on configuration.

LangGraph: Checkpointer Pattern¶

from langgraph.checkpoint.memory import InMemorySaver
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.graph import StateGraph, MessagesState

# Development: in-memory
checkpointer = InMemorySaver()

# Production: persistent storage
# checkpointer = SqliteSaver.from_conn_string("conversations.db")

# Define your graph
builder = StateGraph(MessagesState)
builder.add_node("agent", call_model)
builder.add_edge("__start__", "agent")

# Compile WITH checkpointer
graph = builder.compile(checkpointer=checkpointer)

# Each conversation gets a thread_id
config = {"configurable": {"thread_id": "user-123-session-1"}}

# Framework handles history automatically
response = graph.invoke(
    {"messages": [{"role": "user", "content": "Hello!"}]},
    config
)

# Same thread_id = conversation continues
response = graph.invoke(
    {"messages": [{"role": "user", "content": "What did I just say?"}]},
    config  # Same config = same thread
)

What the framework does:

Before invoke: Loads existing messages for thread_id
Prepends history to new messages
Calls model with full context
After invoke: Persists new messages to checkpointer
Handles context limits based on configuration

OpenAI Agents SDK: Session Pattern¶

from agents import Agent, Runner
from agents.sessions import SQLiteSession

# Create persistent session storage
session = SQLiteSession("conversations.db")

agent = Agent(
    name="assistant",
    instructions="You are a helpful assistant.",
    model="gpt-4o"
)

runner = Runner()

# Session handles history automatically
response = await runner.run(
    agent,
    "Hello!",
    session=session,
    session_id="user-123-session-1"
)

# Same session_id = conversation continues
response = await runner.run(
    agent,
    "What did I just say?",
    session=session,
    session_id="user-123-session-1"
)

What the session does:

Before run: Retrieves conversation history for session_id
Prepends history to input items
Executes agent with full context
After run: Stores new items (user input, responses, tool calls)
Handles continuity across runs

Memory Types¶

Agent memory isn't monolithic. Different types serve different purposes and have different scopes.

Short-Term Memory (Thread-Scoped)¶

Scope: Single conversation thread Purpose: Maintain context within an ongoing session Lifetime: Duration of conversation (or until explicitly cleared)

Framework	Implementation
LangGraph	Checkpointer with `thread_id`
OpenAI SDK	Session with `session_id`
General	Thread-isolated message store

What belongs in short-term memory:

User messages and assistant responses
Tool calls and results
Reasoning traces (if using chain-of-thought)
Current task state

Long-Term Memory (Cross-Session)¶

Scope: Across multiple conversations Purpose: Persist facts, preferences, learned patterns Lifetime: Indefinite (or until explicitly deleted)

Structured Long-Term Memory¶

Facts, relationships, and decisions stored in queryable format.

# LangGraph Store pattern
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()

# Store user preference (persists across threads)
store.put(
    namespace=("users", "user-123", "preferences"),
    key="timezone",
    value={"timezone": "America/New_York", "updated": "2025-01-17"}
)

# Retrieve in any thread
prefs = store.get(("users", "user-123", "preferences"), "timezone")

Semantic Long-Term Memory¶

Embedding-based retrieval for finding relevant past context.

# Conceptual pattern (framework-independent)
from your_vector_store import VectorStore

memory_store = VectorStore()

# Store interaction summary with embedding
memory_store.add(
    text="User prefers concise responses without code comments",
    metadata={"user_id": "user-123", "type": "preference"},
    embedding=embed("User prefers concise responses...")
)

# Retrieve relevant memories for new context
relevant = memory_store.search(
    query="How should I format code for this user?",
    filter={"user_id": "user-123"}
)

Episodic Memory¶

Scope: Cross-session, timestamped Purpose: Record past interactions for learning and audit Lifetime: Configurable retention

# Record interaction outcome
episodic_store.add({
    "timestamp": "2025-01-17T10:30:00Z",
    "user_id": "user-123",
    "thread_id": "session-456",
    "task": "debug authentication error",
    "outcome": "resolved",
    "approach": "checked token expiration, found clock skew",
    "user_feedback": "positive"
})

# Query past approaches for similar tasks
past_successes = episodic_store.query(
    task_type="debug authentication",
    outcome="resolved",
    user_id="user-123"
)

Memory Layers Summary¶

Layer	Scope	Storage	Retrieval	Example Use
Short-term	Thread	Checkpointer/Session	By thread_id	Conversation context
Long-term (Structured)	User/Global	Key-value store	By namespace + key	User preferences
Long-term (Semantic)	User/Global	Vector store	By similarity	Relevant past context
Episodic	User/Global	Event log	By query + time	Past task outcomes

State-Over-History Principle¶

A key insight for efficient memory management: prefer passing current state over full history.

The Problem with Full History¶

# Anti-pattern: Passing full transcript to sub-agent
sub_agent_prompt = f"""
Here's the full conversation so far:
{format_messages(all_300_messages)}

Now help with: {current_task}
"""

Problems:

Token explosion
Attention dilution
Irrelevant context pollution
Latency increase

State-Over-History Pattern¶

# Better: Pass current state, not history
current_state = {
    "user_goal": "Build a REST API for user management",
    "completed_steps": ["schema design", "database setup"],
    "current_step": "implement CRUD endpoints",
    "decisions_made": {
        "database": "PostgreSQL",
        "framework": "FastAPI",
        "auth": "JWT tokens"
    },
    "open_questions": [],
    "artifacts": ["schema.sql", "models.py"]
}

sub_agent_prompt = f"""
Current project state:
{json.dumps(current_state, indent=2)}

Task: {current_task}
"""

Benefits:

Minimal tokens
Focused attention
No stale context
Faster inference

What Belongs in State vs History¶

State (Pass Forward)	History (Store, Don't Pass)
Current goal	How goal was established
Decisions made	Discussion leading to decisions
Artifacts created	Iterations and revisions
Open questions	Resolved questions
Error context (if debugging)	Successful operations

Implementing State Extraction¶

# LangGraph: Custom state schema
from typing import TypedDict, Annotated
from langgraph.graph import add_messages

class ProjectState(TypedDict):
    messages: Annotated[list, add_messages]  # Short-term (auto-managed)

    # Extracted state (you manage)
    current_goal: str
    decisions: dict
    artifacts: list[str]
    phase: str

# Update state after significant events
def extract_state(messages: list, current_state: ProjectState) -> ProjectState:
    """Extract/update state from recent messages."""
    # Use LLM or rules to identify:
    # - New decisions made
    # - Artifacts created
    # - Phase transitions
    return updated_state

Managing History Growth¶

Even with proper memory architecture, history grows. You need strategies to keep it bounded.

Strategy 1: Trimming¶

Keep only the last N turns, drop the rest.

LangGraph: trim_messages

from langgraph.prebuilt import create_react_agent
from langchain_core.messages import trim_messages

def trim_to_recent(messages: list) -> list:
    """Keep system message + last 10 messages."""
    return trim_messages(
        messages,
        max_tokens=4000,
        strategy="last",
        token_counter=len,  # Or use tiktoken
        include_system=True,
        allow_partial=False
    )

# Apply before model call
agent = create_react_agent(
    model,
    tools,
    state_modifier=trim_to_recent
)

When to use trimming:

Short, transactional conversations
Tasks where old context is truly irrelevant
When latency is critical

Anti-patterns with trimming:

Losing critical decisions from early in conversation
Trimming mid-tool-call (orphaned tool results)
Using for planning tasks that need long-range context

Strategy 2: Summarization¶

Compress older messages into a synthetic summary.

LangGraph: SummarizationMiddleware

from langchain.agents import create_agent, SummarizationMiddleware

agent = create_agent(
    model="gpt-4o",
    tools=tools,
    middleware=[
        SummarizationMiddleware(
            model="gpt-4o-mini",  # Cheaper model for summarization
            trigger={"tokens": 4000},  # Trigger when context exceeds
            keep={"messages": 10}  # Keep last 10 verbatim
        )
    ]
)

What summarization produces:

[Summary of turns 1-50]:
- User requested help building a REST API
- Decided on FastAPI + PostgreSQL
- Completed: schema design, database models
- Current focus: authentication implementation
- User prefers concise code without excessive comments

[Recent messages 51-60 kept verbatim]

When to use summarization:

Long-running planning conversations
Support threads spanning multiple issues
Tasks requiring long-range continuity

Anti-patterns with summarization:

Summary drift: Facts get reinterpreted incorrectly
Context poisoning: Errors in summary propagate indefinitely
Over-compression: Losing critical details
Summarizing too frequently: Latency overhead

Strategy 3: Hybrid (Recommended)¶

Combine summarization for old context + trimming for recent.

class HybridMemoryConfig:
    # Summarize when total exceeds this
    summarize_threshold_tokens: int = 8000

    # Keep this many recent messages verbatim
    keep_recent_messages: int = 20

    # Maximum summary length
    max_summary_tokens: int = 500

    # Model for summarization (use cheaper model)
    summary_model: str = "gpt-4o-mini"

Flow:

Check total token count
If under threshold: no action
If over threshold:
Keep last N messages verbatim
Summarize older messages
Replace older messages with summary
Continue with bounded context

Tool Call Preservation¶

A critical mistake when trimming or formatting history: stripping tool calls and their results.

The problem:

# WRONG: Converting to simple text format
def format_history(messages):
    formatted = []
    for msg in messages:
        if msg.role == "user":
            formatted.append(f"User: {msg.content}")
        elif msg.role == "assistant":
            formatted.append(f"Assistant: {msg.content}")
        # Tool calls and results are silently dropped!
    return formatted

Why this fails:

Model loses context about what actions were taken
Can't reason about results of previous operations
May re-execute tools unnecessarily
Breaks continuity for multi-step tool sequences

Correct approach:

# RIGHT: Preserve full message structure
def format_history(messages):
    """Preserve tool calls and results in their native format."""
    return [
        {
            "role": msg.role,
            "content": msg.content,
            # Preserve tool call metadata
            **({"tool_calls": msg.tool_calls} if hasattr(msg, 'tool_calls') and msg.tool_calls else {}),
            **({"tool_call_id": msg.tool_call_id} if hasattr(msg, 'tool_call_id') and msg.tool_call_id else {}),
        }
        for msg in messages
    ]

Best practice: Let frameworks handle history formatting. When you must custom-format:

Message Type	Preserve
User messages	Content
Assistant (text)	Content
Assistant (tool call)	Content + tool_calls array
Tool result	role="tool", content, tool_call_id

When multiple agents collaborate, memory sharing becomes critical.

Pattern 1: Shared State Object¶

Agents read from and write to a common state.

# LangGraph: Shared state across nodes
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, add_messages

class SharedState(TypedDict):
    messages: Annotated[list, add_messages]

    # Shared across all agents
    research_findings: list[str]
    draft_content: str
    review_feedback: list[str]
    final_output: str

def researcher(state: SharedState) -> SharedState:
    """Research agent adds findings to shared state."""
    findings = do_research(state["messages"][-1])
    return {"research_findings": state["research_findings"] + findings}

def writer(state: SharedState) -> SharedState:
    """Writer agent reads research, produces draft."""
    draft = write_draft(state["research_findings"])
    return {"draft_content": draft}

def reviewer(state: SharedState) -> SharedState:
    """Reviewer reads draft, adds feedback."""
    feedback = review(state["draft_content"])
    return {"review_feedback": feedback}

# Wire agents together
graph = StateGraph(SharedState)
graph.add_node("researcher", researcher)
graph.add_node("writer", writer)
graph.add_node("reviewer", reviewer)

Pattern 2: Artifact Passing (Not Transcript Passing)¶

Anti-pattern: Context telephone

# DON'T DO THIS
def orchestrator_delegates_to_specialist(conversation_history):
    # Passing full history degrades information
    specialist_result = specialist.run(
        f"Here's the conversation:\n{conversation_history}\n\nDo task X"
    )
    return specialist_result

Problems:

Information degrades through each handoff
Irrelevant context pollutes specialist focus
Token waste compounds at each level

Better: Pass artifacts and state

# DO THIS
def orchestrator_delegates_to_specialist(task_state):
    # Pass only what specialist needs
    specialist_result = specialist.run(
        task_description=task_state["current_task"],
        input_artifacts=task_state["relevant_artifacts"],
        constraints=task_state["constraints"],
        # NOT the full conversation history
    )
    return specialist_result

Scenario	Memory Strategy
Agents working on same task	Shared state object
Agents with different domains	Isolated memory, share artifacts
Parallel independent tasks	Fully isolated threads
Validator reviewing creator's work	Read-only access to creator's output

LangGraph: Isolated sub-agents

# Each specialist gets its own thread
def delegate_to_specialist(state, specialist_graph, task):
    # Create isolated thread for specialist
    specialist_thread_id = f"{state['thread_id']}-{specialist_graph.name}-{uuid4()}"

    result = specialist_graph.invoke(
        {"messages": [{"role": "user", "content": task}]},
        {"configurable": {"thread_id": specialist_thread_id}}
    )

    # Return only the result, not specialist's internal history
    return result["final_output"]

For long-term memory that should be shared across agents:

# Shared user preferences (all agents can read)
user_namespace = ("users", user_id, "preferences")

# Agent-specific learned patterns (isolated)
agent_namespace = ("agents", agent_id, "patterns")

# Project-specific context (shared within project)
project_namespace = ("projects", project_id, "context")

The 75% Rule¶

Never fill context to capacity. Reserve headroom for reasoning.

Why Headroom Matters¶

Context Usage	Effect
< 50%	Optimal reasoning space
50-75%	Good balance
75-90%	Degraded quality, trigger compaction
> 90%	Significant quality loss

Implementation¶

def should_compact(messages: list, model_context_limit: int) -> bool:
    """Check if context needs compaction."""
    current_tokens = count_tokens(messages)
    threshold = model_context_limit * 0.75
    return current_tokens > threshold

def auto_compact_middleware(state: AgentState) -> AgentState:
    """Middleware that triggers compaction at 75%."""
    if should_compact(state["messages"], MODEL_CONTEXT_LIMIT):
        state["messages"] = summarize_and_trim(state["messages"])
    return state

Implementation Checklist¶

When building agents, verify:

[ ] No manual history concatenation in prompt building
[ ] Checkpointer/Session configured for conversation persistence
[ ] Thread IDs assigned for conversation isolation
[ ] Trimming or summarization configured for long conversations
[ ] State-over-history for sub-agent delegation
[ ] Artifacts passed, not transcripts, between agents
[ ] 75% threshold for context compaction
[ ] Long-term memory separated from short-term (if needed)

Quick Reference¶

Pattern Selection¶

Situation	Pattern	Framework Feature
Basic conversation persistence	Checkpointer/Session	LangGraph: `InMemorySaver`, OpenAI: `SQLiteSession`
Long conversations	Summarization middleware	LangGraph: `SummarizationMiddleware`
Multi-agent shared context	Shared state schema	LangGraph: `StateGraph` with shared `TypedDict`
Cross-session user data	Long-term store	LangGraph: `InMemoryStore`, MongoDB Store
Semantic memory retrieval	Vector store integration	External: Pinecone, Chroma, pgvector

Anti-Pattern Recognition¶

If you see...	It's wrong because...	Replace with...
`history.append(msg)`	Manual management	Checkpointer
`prompt += history`	String concatenation	Session with auto-injection
Full transcript to sub-agent	Context telephone	Artifact/state passing
No thread_id	No isolation	Explicit thread management
No trimming/summarization	Unbounded growth	Memory middleware

Research Basis¶

Source	Key Finding
"Lost in the Middle" (Liu et al., 2023)	U-shaped attention; middle content ignored
Claude Code 75% Rule (Anthropic)	Quality degrades above 75% context usage
LangChain Short-Term Memory Guide	Checkpointer + summarization patterns
OpenAI Agents SDK Session Docs	Session-based auto-persistence
AWS Memory-Augmented Agents	Memory layer architecture patterns
A-Mem (2025)	Dynamic vs predefined memory access

Agent Memory Architecture¶

The Anti-Pattern: Manual Prompt Concatenation¶

What It Looks Like¶

Why AI Coding Assistants Default to This¶

Why It Fails: The Evidence¶

The Correct Model: State-Managed Memory¶

Core Principles¶

LangGraph: Checkpointer Pattern¶

OpenAI Agents SDK: Session Pattern¶

Memory Types¶

Short-Term Memory (Thread-Scoped)¶

Long-Term Memory (Cross-Session)¶

Structured Long-Term Memory¶

Semantic Long-Term Memory¶

Episodic Memory¶

Memory Layers Summary¶

State-Over-History Principle¶

The Problem with Full History¶

State-Over-History Pattern¶

What Belongs in State vs History¶

Implementing State Extraction¶

Managing History Growth¶

Strategy 1: Trimming¶

Strategy 2: Summarization¶

Strategy 3: Hybrid (Recommended)¶

Tool Call Preservation¶

Multi-Agent Memory Sharing¶

Pattern 1: Shared State Object¶

Pattern 2: Artifact Passing (Not Transcript Passing)¶

Pattern 3: Memory Isolation vs Sharing¶

Pattern 4: Namespace-Based Sharing¶

The 75% Rule¶

Why Headroom Matters¶

Implementation¶

Implementation Checklist¶

Quick Reference¶

Pattern Selection¶

Anti-Pattern Recognition¶

Research Basis¶

See Also¶