Context Engineering, Part 2: Memory, Agentic RAG, and the Patterns Part 1 Didn't Cover
Persistent Memory, Agentic RAG, and the Production Patterns That Make AI Systems Actually Scale
Senior Developer

Part 1 covered the core model: the five layers, why prompt engineering stopped being enough, and the four techniques — Write, Select, Compress, Isolate. This is the second half: how persistent memory actually works under the hood, what changes when retrieval becomes a decision instead of a fixed step, and two production patterns that got cut from Part 1 for length.
The memory layer, properly
Part 1's fifth layer — user state and memory — got one paragraph. That undersells it. Conversation history (layer 3) and persistent memory (layer 5) look similar but solve different problems: history is what happened in this session; memory is what's still true across sessions. Confuse them and you end up either re-explaining yourself to the AI every time, or stuffing the entire chat log into every prompt and paying for tokens that mostly say nothing new.
Two tools dominate this space right now, and they're not interchangeable.
Mem0 treats memory as a set of discrete, evolving facts. When you call add(), it doesn't just append an embedding: it extracts the facts from what was said, checks them against what's already stored, and decides whether to add a new fact, update an existing one, or discard a duplicate. A user who says "I prefer Python" in session one and "actually, use Python, not JavaScript" in session five ends up with one canonical preference, not two overlapping entries.
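from mem0 import Memory

# Default configuration; it still needs an LLM and embedding provider set up in your environment.
m = Memory()

# A short exchange; facts are extracted from it rather than stored verbatim.
messages = [
    {"role": "user", "content": "I'm a vegetarian and allergic to nuts."},
    {"role": "assistant", "content": "Got it. I'll keep that in mind."},
]
m.add(messages, user_id="alex")

# Later, in a different session:
m.search("What are this user's dietary restrictions?", user_id="alex")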
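To make the add / update / discard decision concrete, here is a minimal sketch of that reconciliation step. It is not Mem0's implementation: the FactStore class, the reconcile method, and the (user, key) slot layout are hypothetical names invented for illustration, and the facts arrive pre-extracted, whereas a real memory layer would use a model to pull them out of the conversation.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Toy store (hypothetical): one canonical value per (user, key) slot."""
    facts: dict = field(default_factory=dict)

    def reconcile(self, user_id: str, key: str, value: str) -> str:
        """Store the fact if needed and return the action taken."""
        slot = (user_id, key)
        current = self.facts.get(slot)
        if current is None:
            self.facts[slot] = value
            return "ADD"
        if current.lower() == value.lower():
            return "NOOP"  # duplicate: nothing new to store
        self.facts[slot] = value  # the newer statement replaces the older one
        return "UPDATE"


store = FactStore()
actions = [
    store.reconcile("alex", "preferred_language", "JavaScript"),
    store.reconcile("alex", "preferred_language", "Python"),
    store.reconcile("alex", "preferred_language", "python"),
]
print(actions)  # ['ADD', 'UPDATE', 'NOOP']
print(store.facts[("alex", "preferred_language")])  # Python
```

The point of the sketch is the shape of the decision, not the matching rule: a case-insensitive string comparison stands in for whatever semantic comparison a production system would actually run.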
Zep, by contrast, is built around a temporal knowledge graph (its engine, Graphiti, runs underneath). Instead of just storing the current state of a fact, it tracks when each fact was true and for how long — so the system can distinguish "the user's job title is Senior Engineer" from "the user's job title was Engineer until March, then became Senior Engineer." That distinction matters enormously for enterprise agents reasoning over things that change: account status, project ownership, pricing tiers, org charts.
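The idea of tracking when a fact was true can be sketched with plain validity intervals. This is an illustration of the concept only, not Zep's or Graphiti's API: TemporalFact, value_as_of, and the dates below are all assumptions made up for the job-title example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TemporalFact:
    subject: str
    attribute: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still true"


def value_as_of(facts, subject, attribute, when):
    """Return the value that held on `when`, or None if nothing is recorded."""
    for f in facts:
        if f.subject != subject or f.attribute != attribute:
            continue
        started = f.valid_from <= when
        not_ended = f.valid_to is None or when < f.valid_to
        if started and not_ended:
            return f.value
    return None


# Hypothetical dates, chosen only to illustrate the job-title example.
facts = [
    TemporalFact("alex", "job_title", "Engineer", date(2023, 1, 1), date(2024, 3, 1)),
    TemporalFact("alex", "job_title", "Senior Engineer", date(2024, 3, 1)),
]

print(value_as_of(facts, "alex", "job_title", date(2024, 1, 15)))  # Engineer
print(value_as_of(facts, "alex", "job_title", date(2024, 6, 1)))   # Senior Engineer
```

Note that the older fact is closed out rather than overwritten, which is what lets "what was true then" stay answerable after the value changes.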
The practical split: reach for Mem0-style memory when you need an agent to remember stable facts and preferences about a user. Reach for a temporal graph like Zep when your agent needs to reason about change over time — when "what's true now" and "what was true then" are both questions someone will actually ask.
Either way, the rule from Part 1 still applies: memory is a layer you curate, not a log you accumulate. A memory store that never forgets anything is just conversation history with worse compression.
Picture the support agent from Part 1's before/after example. The first conversation ends, the refund gets processed, the session closes. Three weeks later the same customer messages again about something unrelated. Without memory, the agent starts cold: it has to re-fetch the account, and it learns nothing from the fact that this exact double-charge issue happened once before. With Mem0-style memory, it already knows the customer is on Pro Annual, prefers email follow-ups over chat, and has had this billing issue before. With Zep-style temporal memory, it additionally knows when their plan changed, so if the new question is about a charge from before the upgrade, the agent can reason about which pricing rules actually applied at the time.
Agentic RAG: when retrieval is a decision, not a step
Part 1's Select technique described contextual retrieval — enriching chunks before embedding them. That's still a fixed pipeline: embed query, retrieve top-k, generate. It works, but it has one blind spot: it can't tell when it's retrieved the wrong thing, and it can't try again.
Agentic RAG removes that blind spot by putting an agent in the retrieval loop instead of a fixed function. Instead of always retrieving once and generating, the agent:
Decides whether retrieval is even needed for this query
Grades the relevance of what comes back
If the results are weak, rewrites the query and tries again
Only generates an answer once it's satisfied the context actually supports one
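from typing import TypedDict

from langgraph.graph import StateGraph, START, END

# vector_store, grader_llm, rewriter_llm, and generator_llm are placeholders
# for your own retriever and model calls; they are not defined here.

class RAGState(TypedDict, total=False):
    query: str
    docs: list
    sufficient: bool
    attempts: int
    answer: str

def retrieve(state: RAGState) -> RAGState:
    state["docs"] = vector_store.search(state["query"])
    return state

def grade(state: RAGState) -> RAGState:
    # The grader decides whether the retrieved docs can support an answer.
    state["sufficient"] = grader_llm(state["query"], state["docs"])
    return state

def rewrite_query(state: RAGState) -> RAGState:
    state["query"] = rewriter_llm(state["query"], state["docs"])
    state["attempts"] += 1
    return state

def generate(state: RAGState) -> RAGState:
    state["answer"] = generator_llm(state["query"], state["docs"])
    return state

def route_after_grade(state: RAGState) -> str:
    # Stop retrying after three rewrites and answer with what we have.
    if state["sufficient"] or state["attempts"] >= 3:
        return "generate"
    return "rewrite"

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("grade", grade)
graph.add_node("rewrite", rewrite_query)
graph.add_node("generate", generate)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "grade")
graph.add_conditional_edges("grade", route_after_grade)
graph.add_edge("rewrite", "retrieve")
graph.add_edge("generate", END)

app = graph.compile()
app.invoke({"query": "what's our refund policy?", "attempts": 0})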
LangGraph is the framework you'll see most often in production write-ups of this pattern — its state-graph model maps naturally onto a retrieve-grade-rewrite loop with retry limits. LlamaIndex's workflow system and a few other frameworks support the same pattern with a different programming model, so it's worth knowing it's not the only option, just the most common one.
The honest tradeoff: agentic RAG is more accurate on complex, multi-part questions and noticeably more expensive and slower per query, because every grading and rewriting step is its own LLM call. Don't reach for it by default. Reach for it when you've measured that your fixed pipeline is retrieving the wrong thing often enough to matter — and not before, because the added latency and cost are real, even when the accuracy gain is small.
A quick gut check for which one you actually need: if most of your queries are simple lookups against a stable knowledge base — "what's our refund policy" — a fixed retrieval step is faster, cheaper, and just as accurate. Agentic RAG earns its cost on the harder slice: multi-hop questions, ambiguous phrasing, or a knowledge base broad enough that the first guess at what to search for is often wrong.
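One way to get the measurement that decision depends on is a small labeled set of queries, each paired with the document that should come back, scored against your existing fixed retriever. The sketch below is a rough illustration under stated assumptions: hit_rate_at_k, keyword_retrieve, the three-document corpus, and the evaluation queries are all hypothetical, and the keyword-overlap retriever is only a stand-in for a real vector search.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of queries whose expected doc id appears in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = 0
    for query, expected_doc_id in eval_set:
        top_k = retrieve(query)[:k]
        if expected_doc_id in top_k:
            hits += 1
    return hits / len(eval_set)


# Stand-in corpus (hypothetical); a real system would query its vector store.
CORPUS = {
    "refund-policy": "refunds are issued within 14 days of purchase",
    "plan-upgrade": "upgrading a plan prorates the remaining billing period",
    "sso-setup": "configure single sign-on through the admin console",
}


def keyword_retrieve(query):
    """Rank doc ids by how many query words appear in the document text."""
    words = set(query.lower().split())
    return sorted(
        CORPUS,
        key=lambda doc_id: len(words & set(CORPUS[doc_id].split())),
        reverse=True,
    )


eval_set = [
    ("how are refunds issued", "refund-policy"),
    ("what happens when upgrading a plan", "plan-upgrade"),
    ("how do I log in with my company account", "sso-setup"),
]

# Two of the three queries hit at k=1; the third shares no words with its target.
print(hit_rate_at_k(eval_set, keyword_retrieve, k=1))
```

The third query is the kind of miss that matters here: the user's phrasing and the document's wording don't overlap, which is exactly the case a rewrite step is meant to recover. If misses like that are rare in your own evaluation set, the fixed pipeline is probably enough.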
Two patterns Part 1 cut for length
Scoping tools by phase, with an actual structure. Part 1 mentioned this in passing; here's the shape of it. Don't hand an agent every tool for the entire task — split tools by what each phase of the work actually requires:
PLANNING_TOOLS = [web_search, read_file]
EXECUTION_TOOLS = [write_file, run_code, query_db]
REVIEW_TOOLS = [read_file, query_db] # no write access during review
NOTIFY_TOOLS = [send_email]
This isn't just tidiness. A smaller tool list is a smaller decision surface — the model has fewer ways to pick the wrong action — and it caps how much damage a single wrong call can do, since a review-phase agent literally can't call write_file even if it tries.
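That guarantee only holds if the restriction is enforced where tools are dispatched, not just in the list shown to the model. Here is a minimal sketch of such a dispatcher, assuming stub tools and covering only two of the phases; TOOLS_BY_PHASE, call_tool, and ToolNotAllowed are hypothetical names, and the tool bodies just report what they would do.

```python
# Hypothetical stand-ins for real tools; each just reports what it would do.
def read_file(path):
    return f"read {path}"


def write_file(path, text):
    return f"wrote {len(text)} chars to {path}"


def query_db(sql):
    return f"ran {sql}"


TOOLS_BY_PHASE = {
    "execution": {"write_file": write_file, "query_db": query_db},
    "review": {"read_file": read_file, "query_db": query_db},  # no write access
}


class ToolNotAllowed(Exception):
    """Raised when a tool outside the current phase is requested."""


def call_tool(phase, name, *args):
    """Dispatch a tool call, refusing anything not allowed in this phase."""
    allowed = TOOLS_BY_PHASE[phase]
    if name not in allowed:
        raise ToolNotAllowed(f"{name!r} is not available in the {phase!r} phase")
    return allowed[name](*args)


print(call_tool("execution", "write_file", "notes.txt", "hello"))
try:
    call_tool("review", "write_file", "notes.txt", "hello")
except ToolNotAllowed as err:
    print(err)
```

Checking at dispatch time means a hallucinated or injected tool name fails loudly instead of quietly succeeding.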
Consistency enforcement after compression. Part 1's Compress section explained the sliding-window-plus-summarization pattern, but left out what happens to your invariants when that compression runs. A system prompt's hard constraints, the agent's core identity, its actual objective — these are supposed to be permanent. But after several rounds of summarization, the summary an LLM writes of "everything that happened so far" doesn't reliably re-state "and by the way, never do X." It just describes what occurred. The fix is mechanical, not clever: after any compression or trimming pass, re-inject the invariant elements verbatim, separately from the summary. Don't trust them to survive being paraphrased.
INVARIANTS = """
[CORE CONSTRAINTS — do not summarize or paraphrase this block]
- Never issue a refund above $200 without escalation.
- Never speculate about account status; always look it up first.
- Objective: resolve the customer's billing question, not just acknowledge it.
"""
def rebuild_context(summary: str, recent_turns: list) -> list:
    # Invariants go first, verbatim, in their own message, separate from the summary.
    return [
        {"role": "system", "content": INVARIANTS},
        {"role": "system", "content": f"Summary of earlier conversation: {summary}"},
        *recent_turns,
    ]
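To see the re-injection survive repeated passes, here is a sketch of a full compression step run twice. It is an illustration under assumptions rather than a reference implementation: compress and summarize are hypothetical helpers, the summarizer is a stub standing in for an LLM call, and the invariant text is shortened for the example.

```python
INVARIANTS = "[CORE CONSTRAINTS] Never issue a refund above $200 without escalation."


def summarize(turns):
    """Stand-in for an LLM summarizer; a real one would paraphrase the turns."""
    return f"Summary of earlier conversation: {len(turns)} earlier messages."


def compress(history, keep_last=2):
    """Summarize all but the last `keep_last` messages, then re-inject invariants."""
    # Drop any existing copy of the invariants so they are never summarized.
    conversation = [m for m in history if m["content"] != INVARIANTS]
    split = max(len(conversation) - keep_last, 0)
    older, recent = conversation[:split], conversation[split:]
    context = [{"role": "system", "content": INVARIANTS}]
    if older:
        context.append({"role": "system", "content": summarize(older)})
    return context + recent


history = [
    {"role": "system", "content": INVARIANTS},
    {"role": "user", "content": "I was charged twice."},
    {"role": "assistant", "content": "Let me look up the account."},
    {"role": "user", "content": "It's the Pro Annual plan."},
    {"role": "assistant", "content": "I see the duplicate charge."},
    {"role": "user", "content": "Can you refund it?"},
]

first_pass = compress(history)
second_pass = compress(first_pass + [
    {"role": "assistant", "content": "Refund issued."},
    {"role": "user", "content": "Thanks. One more question."},
])

# The invariants come through both passes verbatim, exactly once, in first position.
print(second_pass[0]["content"] == INVARIANTS)
print(sum(m["content"] == INVARIANTS for m in second_pass))
```

The earlier summary is deliberately treated as ordinary history on the second pass, so it gets folded into the new summary, while the invariants are pulled out beforehand and put back untouched.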
Between the two posts, that's the full stack: what context engineering is, why it matters now, the five layers, the four core techniques, the security model, and — here — the memory and retrieval patterns that only show up once a system has real users and real history. If you've read both and you're still getting inconsistent results from a specific system, the fix is almost always to find which of those pieces is weak, not to go looking for a technique neither post mentioned.