I Built a Tiny AI Agent From Scratch — Every Line Tested Before It Touched a Real API
No frameworks, no magic. Just two Python functions, a loop, and Claude's tool-use API — plus the offline test suite that proves it actually works.
Senior Developer

What an "agent" actually is, stripped of the hype
Every few weeks there's a new framework promising to make "agentic AI" easy. Most of them are wrappers around one core idea: the model doesn't just generate text — it can pause, say "I need to call this function with these arguments," wait for the result, and then keep going with that information in hand.
That's it. That's the whole trick. Anthropic calls this tool use, and it's the same mechanism powering everything from "let Claude check the weather" to multi-step coding agents.
This tutorial builds a working version of that loop from the ground up — no LangChain, no agent framework, just the Claude API and plain Python. By the end you'll have a small agent that can do arithmetic and count words by actually calling real Python functions, decide on its own when to use them, and chain them together when a question needs both.
What you'll need
Python 3.9 or newer
An Anthropic API key (from the Claude Console)
The official SDK:
pip install anthropic
That's the whole list. No vector databases, no Docker, nothing else.
Step 1: Write the actual tools (just functions)
This is the part people often overcomplicate. A "tool" is just a regular Python function, plus a small JSON description telling Claude what it does and what arguments it takes.
We'll build two: a calculator and a word counter. Save this as tools.py:
"""
The actual Python functions our agent can call, plus the JSON-schema
descriptions of those tools that we hand to the Claude API.
"""
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def calculate(expression: str) -> str:
    """Safely evaluate a basic arithmetic expression like '12 * (3 + 4)'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")

    tree = ast.parse(expression, mode="eval")
    result = _eval(tree.body)
    return str(result)


def count_words(text: str) -> str:
    """Count the words in a piece of text."""
    return str(len(text.split()))


TOOLS = [
    {
        "name": "calculate",
        "description": (
            "Evaluate a basic arithmetic expression and return the numeric "
            "result as a string. Supports +, -, *, /, **, parentheses, and "
            "negative numbers. Use this any time the user asks for a "
            "calculation, even a simple one -- do not do math in your head."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "A valid arithmetic expression, e.g. '127 * 38' or '(12 + 4) / 2'.",
                }
            },
            "required": ["expression"],
        },
    },
    {
        "name": "count_words",
        "description": (
            "Count how many words are in a given piece of text and return "
            "the count as a string. Use this when the user asks for a word "
            "count of something rather than estimating it yourself."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "text": {
                    "type": "string",
                    "description": "The text to count words in.",
                }
            },
            "required": ["text"],
        },
    },
]

TOOL_FUNCTIONS = {
    "calculate": calculate,
    "count_words": count_words,
}
A couple of deliberate choices worth calling out. The calculate function uses Python's ast module to parse the expression into a syntax tree and walk it manually, rather than calling eval() directly. A bare eval() would happily run something like eval("__import__('os').system(...)"), which is exactly the kind of thing you don't want an AI-controlled function anywhere near. (A literal import statement is not the risk: both eval() and ast.parse(mode="eval") reject statements with a SyntaxError. Function calls and attribute access are expressions, though, and the walker turns those away because it only accepts numbers and the whitelisted operators.) The description fields are also longer than feels natural at first. That's intentional: Claude's tool selection quality depends heavily on how clearly each tool explains what it does and when to use it.
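To make the safety argument concrete, here is a minimal, self-contained sketch. It is a trimmed copy of the walker from tools.py; the names safe_eval and outcome are chosen for this illustration and are not part of the tutorial's files. It shows where three different inputs get stopped.

```python
import ast
import operator

# Trimmed copy of the operator whitelist from tools.py, for illustration only.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def safe_eval(expression: str):
    """Evaluate arithmetic by walking the AST; reject every other node type."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval").body)


def outcome(expression: str) -> str:
    """Return the result as text, or the name of the exception that stopped it."""
    try:
        return str(safe_eval(expression))
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__


print(outcome("12 * (3 + 4)"))               # 84
print(outcome("__import__('os').getcwd()"))  # ValueError: Call nodes are not whitelisted
print(outcome("import os"))                  # SyntaxError: statements never parse in eval mode
```

The second case is the one that matters: it is a valid expression, so the parser accepts it, and only the whitelist keeps it from running.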
Step 2: The agent loop
This is the part that actually makes it "agentic." Save this as agent.py:
"""
The agent loop: send a message, check whether Claude wants to use a tool,
run that tool locally, send the result back, and repeat until Claude
gives a final text answer.
"""
from tools import TOOLS, TOOL_FUNCTIONS


def run_agent(client, user_message, model="claude-sonnet-4-6", max_iterations=5, verbose=True):
    messages = [{"role": "user", "content": user_message}]
    for step in range(max_iterations):
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return "".join(
                block.text for block in response.content if block.type == "text"
            )
        tool_results = []
        for block in response.content:
            if block.type == "text" and verbose and block.text.strip():
                print(f" [Claude says]: {block.text.strip()}")
            if block.type == "tool_use":
                func = TOOL_FUNCTIONS.get(block.name)
                if verbose:
                    print(f" [tool call]: {block.name}({block.input})")
                if func is None:
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": f"Unknown tool: {block.name}",
                        "is_error": True,
                    })
                    continue
                try:
                    result = func(**block.input)
                    if verbose:
                        print(f" [tool result]: {result}")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
                except Exception as exc:
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(exc),
                        "is_error": True,
                    })
        messages.append({"role": "user", "content": tool_results})
    return "Reached max_iterations without a final answer -- something is looping."

Three things in here are easy to get wrong, and the API will reject your request with a 400 error if you do:
The tool_result blocks have to go in a new user message, not appended to the assistant's message.
The tool_result blocks must come first in that message's content array; any text from your side has to come after them.
Every tool_use block in the assistant's response needs a matching tool_result with the same tool_use_id, including when a tool errors out. That is why even the error case still appends a tool_result, just with is_error: true.
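Those rules can also be checked locally before a request goes out. The sketch below is a hypothetical helper (check_tool_turn is not part of the SDK or of this tutorial's files) that works on plain dict blocks and reports missing results, stray results, and misordered content. It is a rough local sanity check under those assumptions, not a reproduction of the API's own validation.

```python
def check_tool_turn(assistant_content, user_content):
    """Return a list of problems with one tool_use / tool_result exchange.

    Both arguments are lists of plain dict blocks, as they would appear in the
    `content` field of an assistant message and of the user message after it.
    """
    problems = []
    wanted = {b["id"] for b in assistant_content if b["type"] == "tool_use"}
    answered = {b["tool_use_id"] for b in user_content if b["type"] == "tool_result"}

    for missing in sorted(wanted - answered):
        problems.append(f"no tool_result for {missing}")
    for extra in sorted(answered - wanted):
        problems.append(f"tool_result for unknown id {extra}")

    # Once any other block type appears, no tool_result may follow it.
    seen_other = False
    for b in user_content:
        if b["type"] != "tool_result":
            seen_other = True
        elif seen_other:
            problems.append("tool_result appears after other content")
            break
    return problems


assistant = [
    {"type": "text", "text": "I'll do both."},
    {"type": "tool_use", "id": "toolu_010", "name": "calculate",
     "input": {"expression": "2 + 2"}},
    {"type": "tool_use", "id": "toolu_011", "name": "count_words",
     "input": {"text": "a b c"}},
]
good = [
    {"type": "tool_result", "tool_use_id": "toolu_010", "content": "4"},
    {"type": "tool_result", "tool_use_id": "toolu_011",
     "content": "bad input", "is_error": True},
]
bad = [
    {"type": "text", "text": "Here you go"},
    {"type": "tool_result", "tool_use_id": "toolu_010", "content": "4"},
]

print(check_tool_turn(assistant, good))  # []
print(check_tool_turn(assistant, bad))
# ['no tool_result for toolu_011', 'tool_result appears after other content']
```

Note that the errored tool in the good example still counts as answered: an is_error result satisfies the pairing rule.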
Step 3: Test the loop before it ever calls the real API
Here's the part most tutorials skip, and it's the most useful part for actually trusting your code. The Claude API's tool-use responses have a documented, predictable shape — a stop_reason, and a content list of blocks that are either text or tool_use. So we can fake that shape, feed it to run_agent, and verify the loop, the tool dispatch, and the actual math/word-counting logic all work — without an API key, without spending a token.
Save this as test_agent_offline.py:
from types import SimpleNamespace

from agent import run_agent
from tools import calculate, count_words


def block(**kwargs):
    return SimpleNamespace(**kwargs)


class FakeMessages:
    def __init__(self, script):
        self.script = script
        self.calls = 0

    def create(self, **kwargs):
        response = self.script[self.calls]
        self.calls += 1
        return response


class FakeClient:
    def __init__(self, script):
        self.messages = FakeMessages(script)


def test_parallel_tool_calls():
    turn1 = SimpleNamespace(
        stop_reason="tool_use",
        content=[
            block(type="text", text="I'll do both of those."),
            block(type="tool_use", id="toolu_010", name="calculate",
                  input={"expression": "(12 + 4) / 2"}),
            block(type="tool_use", id="toolu_011", name="count_words",
                  input={"text": "the quick brown fox jumps over the lazy dog"}),
        ],
    )
    turn2 = SimpleNamespace(
        stop_reason="end_turn",
        content=[block(type="text", text="(12 + 4) / 2 is 8.0, and that sentence has 9 words.")],
    )
    client = FakeClient([turn1, turn2])
    answer = run_agent(client, "Two things for you...", verbose=True)
    assert "8.0" in answer
    assert "9 words" in answer
    print("test_parallel_tool_calls passed\n")


def test_underlying_functions_directly():
    assert calculate("127 * 38") == "4826"
    assert calculate("(12 + 4) / 2") == "8.0"
    assert calculate("-3 + 7 ** 2") == "46"
    assert count_words("the quick brown fox jumps over the lazy dog") == "9"
    try:
        calculate("import os")
        raise AssertionError("should have raised")
    except (ValueError, SyntaxError):
        pass
    print("test_underlying_functions_directly passed\n")


if __name__ == "__main__":
    test_underlying_functions_directly()
    test_parallel_tool_calls()
    print("All offline tests passed.")

Running this with python3 test_agent_offline.py produces:
test_underlying_functions_directly passed

 [Claude says]: I'll do both of those.
 [tool call]: calculate({'expression': '(12 + 4) / 2'})
 [tool result]: 8.0
 [tool call]: count_words({'text': 'the quick brown fox jumps over the lazy dog'})
 [tool result]: 9
test_parallel_tool_calls passed

All offline tests passed.

That output is from actually running the code above, not a transcript I wrote by hand. It confirms three things at once: the calculator handles operator precedence and negative numbers correctly, the agent loop processes multiple tool calls in a single turn (Claude often issues both tool calls in parallel rather than one at a time), and the loop carries a tool-use turn through to a second request and returns the final text. The assertions inspect only the final answer, so the shape of the message history is exercised here rather than asserted.
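The test above asserts only on the returned answer. To check the message history itself, one option is a fake client that records what each request contained. The sketch below is self-contained, so it carries a condensed copy of the loop (no verbose output, tools passed in as a dict) and uses stub tools; RecordingMessages and the stubs are names invented for this illustration.

```python
from types import SimpleNamespace


def run_agent(client, user_message, tools, max_iterations=5):
    """Condensed copy of the article's loop, for illustration only."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):
        response = client.messages.create(messages=messages)
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        results = []
        for b in response.content:
            if b.type != "tool_use":
                continue
            try:
                results.append({"type": "tool_result", "tool_use_id": b.id,
                                "content": tools[b.name](**b.input)})
            except Exception as exc:
                results.append({"type": "tool_result", "tool_use_id": b.id,
                                "content": str(exc), "is_error": True})
        messages.append({"role": "user", "content": results})
    return "Reached max_iterations"


class RecordingMessages:
    """Fake `client.messages` that replays a script and snapshots each request."""

    def __init__(self, script):
        self.script = list(script)
        self.requests = []

    def create(self, **kwargs):
        # Copy the list: the loop keeps appending to the same object afterwards.
        self.requests.append(list(kwargs["messages"]))
        return self.script[len(self.requests) - 1]


turn1 = SimpleNamespace(stop_reason="tool_use", content=[
    SimpleNamespace(type="tool_use", id="toolu_010", name="calculate",
                    input={"expression": "2 + 2"}),
    SimpleNamespace(type="tool_use", id="toolu_011", name="count_words",
                    input={"text": "a b c"}),
])
turn2 = SimpleNamespace(stop_reason="end_turn",
                        content=[SimpleNamespace(type="text", text="4 and 3 words.")])

fake = RecordingMessages([turn1, turn2])
client = SimpleNamespace(messages=fake)
stub_tools = {  # stand-ins for the real functions in tools.py
    "calculate": lambda expression: "4",
    "count_words": lambda text: str(len(text.split())),
}

answer = run_agent(client, "Two things for you...", stub_tools)

# The second request should be: user question, assistant tool calls, user tool results.
second = fake.requests[1]
assert [m["role"] for m in second] == ["user", "assistant", "user"]
results = second[2]["content"]
assert [r["type"] for r in results] == ["tool_result", "tool_result"]
assert [r["tool_use_id"] for r in results] == ["toolu_010", "toolu_011"]
print("message history has the expected shape")
```

The same recording idea should drop into the FakeMessages class from test_agent_offline.py with a one-line change to create(), which would let the real run_agent be checked the same way.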
If you change anything — add a tool, change a schema, rewrite the loop — rerun this file first. It catches the majority of "why did my agent just 400" problems in seconds, with zero API cost.
Step 4: Run it for real
Once the offline tests pass, swap in the real client. Save this as run.py:
"""
Run with a real API key:
    export ANTHROPIC_API_KEY="sk-ant-..."
    pip install anthropic
    python3 run.py
"""
from anthropic import Anthropic

from agent import run_agent

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

if __name__ == "__main__":
    question = (
        "What's 127 * 38, and how many words are in the sentence "
        "'the quick brown fox jumps over the lazy dog'?"
    )
    answer = run_agent(client, question)
    print("\nFinal answer:", answer)

Set your API key as an environment variable, install the SDK, and run it:
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
pip install anthropic
python3 run.py

Because the offline test already exercised the exact same run_agent function with a response shaped the way the real API responds, what you're really testing here is just "does my API key work and does the real model behave the way the documented shape says it will", which is a much smaller, cheaper thing to debug.
What's actually happening, step by step
For the question above, here's the real sequence:
Claude receives the question along with the two tool definitions. It decides this needs both tools, and — because Claude 4-generation models default to parallel tool calling — it can return both tool_use blocks in a single response, often with a short sentence of context first ("I'll calculate that and count the words for you").
Our loop sees stop_reason == "tool_use", runs calculate("127 * 38") and count_words(...) locally, and sends both results back in a single new user message, with the tool_result blocks first.
Claude receives those results and, now having everything it needs, responds with stop_reason == "end_turn" and a plain text answer. Our loop sees that and returns the text. Done — two API calls total, with real computation happening in real Python in between.
Things to watch out for as you extend this
Pick the right model for the job. Anthropic's own guidance is to use a larger model like Opus for tools with ambiguous inputs or many options, and a smaller model like Haiku for simple, well-defined tools — smaller models are more likely to guess at missing parameters rather than asking.
Don't skip max_iterations. If a tool's result regularly causes Claude to call the same tool again, you can end up in a loop. The cap in run_agent is a blunt but effective safety net while you're developing.
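If the blunt cap is not enough, one possible refinement is to notice when the same tool is called with identical arguments over and over. The sketch below is a hypothetical helper (RepeatGuard is not part of the SDK or of agent.py), and the threshold of two repeats is an arbitrary choice for illustration.

```python
import json


class RepeatGuard:
    """Flag a tool call that repeats an earlier one with identical arguments."""

    def __init__(self, max_repeats=2):
        self.max_repeats = max_repeats
        self.counts = {}

    def should_stop(self, name, tool_input):
        # Serialize with sorted keys so argument order does not matter.
        key = (name, json.dumps(tool_input, sort_keys=True))
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] > self.max_repeats


guard = RepeatGuard(max_repeats=2)
calls = [("calculate", {"expression": "1 + 1"})] * 3
print([guard.should_stop(name, args) for name, args in calls])
# [False, False, True]
```

Inside the loop, a check like this could sit just before the tool is dispatched and return an is_error tool_result, or end the run, once it fires.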
Tool descriptions are most of the work. If Claude picks the wrong tool, or the right tool with weird arguments, the fix is almost always a clearer description — what the tool does, when to use it, when not to, and what each parameter means — rather than a change to your loop logic.
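A cheap way to keep descriptions honest as the tool list grows is a small lint pass over the definitions. This is a rough sketch, not an Anthropic rule: lint_tools is a hypothetical helper, and the 12-word minimum is an arbitrary starting point that assumes the same dict layout as the TOOLS list above.

```python
def lint_tools(tools, min_words=12):
    """Return warnings for tool definitions that look under-described."""
    warnings = []
    for tool in tools:
        name = tool.get("name", "<unnamed>")
        if len(tool.get("description", "").split()) < min_words:
            warnings.append(f"{name}: description is shorter than {min_words} words")
        schema = tool.get("input_schema", {})
        props = schema.get("properties", {})
        for param, spec in props.items():
            if not spec.get("description"):
                warnings.append(f"{name}.{param}: parameter has no description")
        for param in schema.get("required", []):
            if param not in props:
                warnings.append(f"{name}: required parameter {param!r} is not defined")
    return warnings


# A deliberately weak definition, made up for this example.
sketchy = [{
    "name": "lookup",
    "description": "Looks things up.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query", "limit"],
    },
}]

for warning in lint_tools(sketchy):
    print(warning)
# lookup: description is shorter than 12 words
# lookup.query: parameter has no description
# lookup: required parameter 'limit' is not defined
```

A length check cannot tell a clear description from a padded one, so this only catches the obvious gaps; reading the descriptions as if you were the model is still the real test.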
For anything beyond a toy, look at the SDK's tool runner. Once you're comfortable with the manual loop above (and understand why it's shaped the way it is), Anthropic's Python, TypeScript, and Ruby SDKs include a beta "tool runner" that handles the request/response cycle and conversation state for you. It's worth learning the manual version first — it's what the tool runner is doing under the hood, and it's much easier to debug when something goes wrong.
Related tutorials on this blog
A couple of places to go from here:
Getting Claude to Actually Talk to Your Files: A Real-World MCP Setup Guide — the loop you just built by hand is conceptually what MCP standardizes; this shows the same idea via a config file instead of code.
Your Laptop Can Run Its Own AI Now — Here's How to Actually Do It — for experimenting with the agent loop above using a free local model instead of API calls while you're still debugging.