ZyVOP Logo
Content That Connects

Context Engineering, Part 2: Memory, Agentic RAG, and the Patterns Part 1 Didn't Cover

Persistent Memory, Agentic RAG, and the Production Patterns That Make AI Systems Actually Scale

Tags: AI Engineering, context engineering, Retrieval Augmented Generation, LangGraph, Mem0, Zep
AIScrapper

Senior Developer

June 22, 2026

Part 1 covered the core model: the five layers, why prompt engineering stopped being enough, and the four techniques — Write, Select, Compress, Isolate. This is the second half: how persistent memory actually works under the hood, what changes when retrieval becomes a decision instead of a fixed step, and two production patterns that got cut from Part 1 for length.


The memory layer, properly

Part 1's fifth layer — user state and memory — got one paragraph. That undersells it. Conversation history (layer 3) and persistent memory (layer 5) look similar but solve different problems: history is what happened in this session; memory is what's still true across sessions. Confuse them and you end up either re-explaining yourself to the AI every time, or stuffing the entire chat log into every prompt and paying for tokens that mostly say nothing new.

Two tools dominate this space right now, and they're not interchangeable.

Mem0 treats memory as a set of discrete, evolving facts. When you call add(), it doesn't just append an embedding: it extracts the facts from what was said, checks them against what's already stored, and decides whether to add a new fact, update an existing one, or discard a duplicate. A user who says "I prefer JavaScript" in session one and "actually, use Python, not JavaScript" in session five ends up with one canonical preference, not two contradictory ones.

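from mem0 import Memory

# The default Memory() configuration needs an LLM and an embedding provider
# to be set up (out of the box, an OpenAI API key in the environment).
m = Memory()

messages = [
    {"role": "user", "content": "I'm a vegetarian and allergic to nuts."},
    {"role": "assistant", "content": "Got it, I'll keep that in mind."},
]
# add() extracts facts from the exchange and reconciles them with what is stored.
m.add(messages, user_id="alex")

# Later, in a different session:
results = m.search("What are this user's dietary restrictions?", user_id="alex")
print(results)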
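
The add/update/discard decision described above can be sketched without any library. This is a toy illustration, not Mem0's implementation: Mem0 relies on an LLM to extract and compare facts, whereas here the facts arrive pre-extracted as attribute/value pairs. The `FactStore` class, its `add` method, and the action labels are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Toy per-user fact store keyed by attribute (hypothetical, not Mem0's API)."""

    facts: dict = field(default_factory=dict)  # {user_id: {attribute: value}}

    def add(self, user_id: str, attribute: str, value: str) -> str:
        """Reconcile one extracted fact and report which action was taken."""
        user_facts = self.facts.setdefault(user_id, {})
        if attribute not in user_facts:
            user_facts[attribute] = value
            return "ADD"
        if user_facts[attribute] == value:
            return "NOOP"  # duplicate: nothing new to store
        user_facts[attribute] = value  # the newer statement replaces the old one
        return "UPDATE"


store = FactStore()
actions = [
    store.add("alex", "preferred_language", "JavaScript"),  # first mention
    store.add("alex", "preferred_language", "Python"),      # later correction
    store.add("alex", "preferred_language", "Python"),      # repeated again
]
print(actions)              # ['ADD', 'UPDATE', 'NOOP']
print(store.facts["alex"])  # {'preferred_language': 'Python'}
```

The point of the sketch is the shape of the decision: the store ends up with one value per attribute, however many times the user restates it.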

Zep, by contrast, is built around a temporal knowledge graph (its engine, Graphiti, runs underneath). Instead of just storing the current state of a fact, it tracks when each fact was true and for how long — so the system can distinguish "the user's job title is Senior Engineer" from "the user's job title was Engineer until March, then became Senior Engineer." That distinction matters enormously for enterprise agents reasoning over things that change: account status, project ownership, pricing tiers, org charts.
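
To make "tracks when each fact was true" concrete, here is a minimal sketch of validity intervals using a plain list rather than a graph. It is not Zep's or Graphiti's API: the `TemporalFact` class, the `value_as_of` helper, and the dates are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TemporalFact:
    subject: str
    attribute: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still true"


def value_as_of(facts, subject, attribute, when):
    """Return the value that held on `when`, or None if nothing was recorded."""
    for f in facts:
        if f.subject != subject or f.attribute != attribute:
            continue
        # A fact holds from valid_from (inclusive) up to valid_to (exclusive).
        if f.valid_from <= when and (f.valid_to is None or when < f.valid_to):
            return f.value
    return None


# Illustrative data only: the years are assumptions, not taken from the article.
facts = [
    TemporalFact("alex", "job_title", "Engineer", date(2024, 1, 1), date(2025, 3, 1)),
    TemporalFact("alex", "job_title", "Senior Engineer", date(2025, 3, 1)),
]

print(value_as_of(facts, "alex", "job_title", date(2025, 2, 1)))  # Engineer
print(value_as_of(facts, "alex", "job_title", date(2025, 6, 1)))  # Senior Engineer
```

A store that only keeps the current value can answer the second query but not the first, which is the whole difference between the two approaches.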

The practical split: reach for Mem0-style memory when you need an agent to remember stable facts and preferences about a user. Reach for a temporal graph like Zep when your agent needs to reason about change over time — when "what's true now" and "what was true then" are both questions someone will actually ask.

Either way, the rule from Part 1 still applies: memory is a layer you curate, not a log you accumulate. A memory store that never forgets anything is just conversation history with worse compression.

Picture the support agent from Part 1's before/after example. The first conversation ends, the refund gets processed, the session closes. Three weeks later the same customer messages again about something unrelated. Without memory, the agent starts cold: it has to re-fetch the account, and it learns nothing from the fact that this customer was double-charged once before. With Mem0-style memory, it already knows the customer is on Pro Annual, prefers email follow-ups over chat, and has had a double-charge issue before. With Zep-style temporal memory, it additionally knows when their plan changed, so if the new question is about a charge from before the upgrade, the agent can reason about which pricing rules actually applied at the time.


Agentic RAG: when retrieval is a decision, not a step

Part 1's Select technique described contextual retrieval — enriching chunks before embedding them. That's still a fixed pipeline: embed query, retrieve top-k, generate. It works, but it has one blind spot: it can't tell when it's retrieved the wrong thing, and it can't try again.

Agentic RAG removes that blind spot by putting an agent in the retrieval loop instead of a fixed function. Instead of always retrieving once and generating, the agent:

  1. Decides whether retrieval is even needed for this query

  2. Grades the relevance of what comes back

  3. If the results are weak, rewrites the query and tries again

  4. Only generates an answer once it's satisfied the context actually supports one

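from typing import TypedDict

from langgraph.graph import StateGraph, START, END

# vector_store, grader_llm, rewriter_llm and generator_llm are placeholders
# for your own retriever and LLM wrappers; they are not LangGraph APIs.
# This graph covers steps 2-4. Step 1 (deciding whether to retrieve at all)
# would be one more conditional edge leaving START.

MAX_ATTEMPTS = 3

class RAGState(TypedDict, total=False):
    query: str
    docs: list
    sufficient: bool
    attempts: int
    answer: str

def retrieve(state: RAGState) -> dict:
    # Nodes return only the keys they change; LangGraph merges them into the state.
    return {"docs": vector_store.search(state["query"])}

def grade(state: RAGState) -> dict:
    # grader_llm is expected to return True when the docs can support an answer.
    return {"sufficient": grader_llm(state["query"], state["docs"])}

def rewrite_query(state: RAGState) -> dict:
    return {
        "query": rewriter_llm(state["query"], state["docs"]),
        "attempts": state["attempts"] + 1,
    }

def generate(state: RAGState) -> dict:
    return {"answer": generator_llm(state["query"], state["docs"])}

def route_after_grade(state: RAGState) -> str:
    # Stop retrying once the context is good enough or the retry budget is spent.
    if state["sufficient"] or state["attempts"] >= MAX_ATTEMPTS:
        return "generate"
    return "rewrite"

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("grade", grade)
graph.add_node("rewrite", rewrite_query)
graph.add_node("generate", generate)

graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "grade")
graph.add_conditional_edges(
    "grade",
    route_after_grade,
    {"generate": "generate", "rewrite": "rewrite"},
)
graph.add_edge("rewrite", "retrieve")
graph.add_edge("generate", END)

app = graph.compile()
app.invoke({"query": "what's our refund policy?", "attempts": 0})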

LangGraph is the framework you'll see most often in production write-ups of this pattern — its state-graph model maps naturally onto a retrieve-grade-rewrite loop with retry limits. LlamaIndex's workflow system and a few other frameworks support the same pattern with a different programming model, so it's worth knowing it's not the only option, just the most common one.
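
Since the pattern isn't tied to any one framework, it can also be written as an ordinary loop. The sketch below is framework-free and includes the "is retrieval even needed?" decision from step 1. Every callable is supplied by the caller, and the stub components and corpus text are invented so the loop can run without a model or a vector store.

```python
from typing import Callable


def agentic_answer(
    query: str,
    needs_retrieval: Callable[[str], bool],
    search: Callable[[str], list],
    is_sufficient: Callable[[str, list], bool],
    rewrite: Callable[[str, list], str],
    generate: Callable[[str, list], str],
    max_attempts: int = 3,
) -> str:
    """Retrieve-grade-rewrite loop with a cap on the number of rewrites."""
    if not needs_retrieval(query):  # step 1: skip retrieval entirely
        return generate(query, [])
    docs = search(query)
    attempts = 0
    # step 2: grade; step 3: rewrite and retry until good enough or out of budget
    while not is_sufficient(query, docs) and attempts < max_attempts:
        query = rewrite(query, docs)
        docs = search(query)
        attempts += 1
    # step 4: generate (also reached when the retry budget runs out)
    return generate(query, docs)


# Stub components (hypothetical): a real system would call an LLM or a vector store.
CORPUS = {"refund policy": ["Refunds are available within 30 days of purchase."]}

answer = agentic_answer(
    "can I get my money back?",
    needs_retrieval=lambda q: True,
    search=lambda q: CORPUS.get(q, []),
    is_sufficient=lambda q, docs: len(docs) > 0,
    rewrite=lambda q, docs: "refund policy",  # stand-in for a rewriter LLM
    generate=lambda q, docs: docs[0] if docs else "I don't know.",
)
print(answer)  # Refunds are available within 30 days of purchase.
```

The first search misses, the rewrite produces a better query, and the second search succeeds. With `max_attempts=3` the worst case is four searches and three rewrites before the loop gives up and generates anyway.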

The honest tradeoff: agentic RAG is more accurate on complex, multi-part questions, but it is noticeably slower and more expensive per query, because every grading and rewriting step is its own LLM call. Don't reach for it by default. Reach for it once you've measured that your fixed pipeline retrieves the wrong thing often enough to matter, and not before: the added latency and cost are real even when the accuracy gain is small.
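
One way to do that measuring is a hit rate over a small labeled set: for each query, check whether any known-relevant document shows up in the top k results. The sketch below assumes a `retrieve` callable that returns ranked document ids. The `hit_rate_at_k` helper, the fake index, and the labels are all illustrative, not taken from a real system.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of queries whose top-k results include at least one relevant doc id."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = 0
    for query, relevant_ids in eval_set:
        retrieved_ids = retrieve(query)[:k]
        if any(doc_id in relevant_ids for doc_id in retrieved_ids):
            hits += 1
    return hits / len(eval_set)


# Made-up stand-in for a fixed retrieval pipeline: query -> ranked doc ids.
FAKE_INDEX = {
    "refund policy": ["doc-refunds", "doc-billing"],
    "reset password": ["doc-auth"],
    "upgrade plan": ["doc-billing"],
    "export data": ["doc-api"],
}

# Each entry pairs a query with the doc ids a human judged relevant.
eval_set = [
    ("refund policy", {"doc-refunds"}),
    ("reset password", {"doc-auth"}),
    ("upgrade plan", {"doc-pricing"}),  # the pipeline misses this one
    ("export data", {"doc-api"}),
]

rate = hit_rate_at_k(eval_set, lambda q: FAKE_INDEX.get(q, []), k=2)
print(rate)  # 0.75
```

If a number like this is already high on your real traffic, the grading and rewriting calls are mostly paying to confirm what the first retrieval got right.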

A quick gut check for which one you actually need: if most of your queries are simple lookups against a stable knowledge base — "what's our refund policy" — a fixed retrieval step is faster, cheaper, and just as accurate. Agentic RAG earns its cost on the harder slice: multi-hop questions, ambiguous phrasing, or a knowledge base broad enough that the first guess at what to search for is often wrong.


Two patterns Part 1 cut for length

Scoping tools by phase, with an actual structure. Part 1 mentioned this in passing; here's the shape of it. Don't hand an agent every tool for the entire task — split tools by what each phase of the work actually requires:

PLANNING_TOOLS = [web_search, read_file]
EXECUTION_TOOLS = [write_file, run_code, query_db]
REVIEW_TOOLS = [read_file, query_db]       # no write access during review
NOTIFY_TOOLS = [send_email]

This isn't just tidiness. A smaller tool list is a smaller decision surface — the model has fewer ways to pick the wrong action — and it caps how much damage a single wrong call can do, since a review-phase agent literally can't call write_file even if it tries.
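
The "literally can't call it" guarantee only holds if something outside the model enforces it. A minimal sketch of that enforcement, assuming tool calls are dispatched through a single function: the `TOOLS_BY_PHASE` registry, the `call_tool` dispatcher, and the stub tools are hypothetical names, and the stubs don't touch the disk.

```python
from typing import Callable


def read_file(path: str) -> str:
    return f"<contents of {path}>"  # stub: a real tool would read from disk


def write_file(path: str, text: str) -> str:
    return f"wrote {len(text)} chars to {path}"  # stub: no actual write


# Each phase sees only the tools registered for it.
TOOLS_BY_PHASE: dict[str, dict[str, Callable]] = {
    "execution": {"write_file": write_file},
    "review": {"read_file": read_file},  # no write access during review
}


def call_tool(phase: str, name: str, *args):
    """Dispatch a tool call, refusing tools that are not scoped to this phase."""
    allowed = TOOLS_BY_PHASE[phase]
    if name not in allowed:
        raise PermissionError(f"tool {name!r} is not available in phase {phase!r}")
    return allowed[name](*args)


print(call_tool("review", "read_file", "notes.md"))
try:
    call_tool("review", "write_file", "notes.md", "oops")
except PermissionError as exc:
    print(exc)
```

Because the check lives in the dispatcher, it holds even if the model produces a call to a tool that was never included in its prompt.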

Consistency enforcement after compression. Part 1's Compress section explained the sliding-window-plus-summarization pattern, but left out what happens to your invariants when that compression runs. A system prompt's hard constraints, the agent's core identity, its actual objective — these are supposed to be permanent. But after several rounds of summarization, the summary an LLM writes of "everything that happened so far" doesn't reliably re-state "and by the way, never do X." It just describes what occurred. The fix is mechanical, not clever: after any compression or trimming pass, re-inject the invariant elements verbatim, separately from the summary. Don't trust them to survive being paraphrased.

INVARIANTS = """
[CORE CONSTRAINTS — do not summarize or paraphrase this block]
- Never issue a refund above $200 without escalation.
- Never speculate about account status; always look it up first.
- Objective: resolve the customer's billing question, not just acknowledge it.
"""

def rebuild_context(summary: str, recent_turns: list) -> list:
    return [
        {"role": "system", "content": INVARIANTS},
        {"role": "system", "content": f"Summary of earlier conversation: {summary}"},
        *recent_turns,
    ]
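
For completeness, here is a sketch of the compression pass that would sit in front of a rebuild like this: keep the last few turns verbatim, summarize the rest, and put the invariants back at the top untouched. The `compress_history` name, the `keep_last` parameter, and the stub summarizer are assumptions for illustration; a real `summarize` would be an LLM call.

```python
INVARIANTS = "[CORE CONSTRAINTS] Never issue a refund above $200 without escalation."


def compress_history(turns, summarize, keep_last=4):
    """Summarize all but the last `keep_last` turns, then re-inject invariants verbatim."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    older, recent = turns[:-keep_last], turns[-keep_last:]
    context = [{"role": "system", "content": INVARIANTS}]  # never paraphrased
    if older:  # only add a summary when something was actually trimmed
        context.append(
            {"role": "system", "content": f"Summary of earlier conversation: {summarize(older)}"}
        )
    return context + recent


# Six alternating turns; the stub summarizer just counts what it was given.
turns = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(6)
]
context = compress_history(
    turns,
    summarize=lambda older: f"{len(older)} earlier turns about a billing question",
    keep_last=2,
)
for message in context:
    print(message["role"], "|", message["content"])
```

The invariant block is built from the constant on every pass, so it comes out identical no matter how many times the history has been summarized.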

Between the two posts, that's the full stack: what context engineering is, why it matters now, the five layers, the four core techniques, the security model, and — here — the memory and retrieval patterns that only show up once a system has real users and real history. If you've read both and you're still getting inconsistent results from a specific system, the fix is almost always to find which of those pieces is weak, not to go looking for a technique neither post mentioned.
