Anatomy of a Sane AI Agent: Memory, Tools, and 'Stop Doing Stupid Stuff' Rules



Your AI agent just called the same API 47 times in a row because it forgot it already had the answer. The user waited 3 minutes for a response that should have taken 5 seconds. 

Somewhere in those 3 minutes, you started questioning whether agentic AI was ever a good idea.

Most agent tutorials show you how to wire up tools and call an LLM. They skip the part where your agent enters infinite loops, forgets critical information mid-conversation, calls expensive APIs unnecessarily, or confidently does the exact opposite of what users asked for. 

Production agents need memory systems that actually remember, tool-calling logic that knows when to stop, and guardrails that prevent disasters before they happen.

The gap between "demo that impresses investors" and "system that survives real users" is filled with unsexy engineering decisions about state management, error handling, and constraint enforcement. Those decisions separate working products from expensive lessons in why you should have read this post first.


Why most agents fail (and it has nothing to do with the LLM)

The interesting failure mode for production agents happens after the LLM generates a reasonable response. The system fails because it lacks the scaffolding around the LLM that handles memory, manages tool execution, and enforces boundaries.

Research into production agent failures identifies six primary causes: hallucinations, prompt injection vulnerabilities, latency from poor orchestration, incorrect tool selection, memory degradation from context window limitations, and distribution shift when production data differs from training data. 

The LLM contributes to some of these failures, but system architecture causes most of them.

A common complaint from developers building agents: the agent works perfectly for the first 3-5 turns of conversation, then starts forgetting critical details, making contradictory statements, or repeating information the user already provided. 

The issue is not model quality. The issue is that someone stuffed 20 conversation turns into a context window, exceeded token limits, and now the agent only sees the most recent messages while earlier context silently disappeared.

Another recurring problem: agents that enter infinite tool-calling loops. The agent calls a search tool, processes results, decides it needs more information, calls the same search tool with the same query, processes the same results, decides it still needs more information, and loops until hitting max iteration limits. 

The fix requires explicit loop detection and stopping conditions, but most implementations assume the LLM will "figure it out naturally."

Agents need an operating system. The LLM is the kernel. Without memory management, I/O coordination, and permission systems, the kernel runs but nothing useful happens at scale.

Memory systems that don't collapse under real usage

Agent memory divides into two categories: short-term memory for the current conversation and long-term memory for information that persists across sessions. 

Most agents implement short-term memory by dumping conversation history into the prompt. This works until it doesn't, which in practice is almost immediately.

Short-term memory without context window disasters

The naive approach: append every message to an array, join the array into a string, stuff it into the prompt. This breaks around turn 15-20 when the concatenated history exceeds the model's context window. The system either truncates early messages (losing context) or errors out (losing the user).

A working approach uses structured state management instead of raw message accumulation. Store conversation state in a database or state manager that tracks:

  • User's stated goals and constraints

  • Key facts extracted from the conversation

  • Completed tasks and their outcomes

  • Open questions requiring clarification

When constructing the prompt, retrieve only relevant state instead of the full message history. If the current turn involves scheduling, load scheduling-related facts. 

If it involves data analysis, load data context. This selective retrieval keeps prompt sizes manageable while maintaining useful context.

Frameworks like LangGraph provide checkpoint mechanisms for managing conversation state across turns. 

The checkpoint stores the full state externally and loads only what the current turn needs. This decouples state persistence from prompt construction, preventing context window overflow.

A practical implementation for small teams: use a simple JSON document per conversation stored in Redis or a similar key-value store. 

Each turn updates the JSON with new facts, task completions, and state changes. The prompt construction step reads this JSON and formats only relevant fields for the LLM.
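A minimal sketch of that pattern, using an in-memory dict as a stand-in for Redis (swap `_store` for `redis.Redis().get`/`.set` in production; the field names here are illustrative, not a standard schema):

```python
import json

# In-memory stand-in for Redis; replace with a real key-value client in production.
_store: dict[str, str] = {}

def load_state(conversation_id: str) -> dict:
    """Load conversation state, or start fresh if none exists."""
    raw = _store.get(conversation_id)
    return json.loads(raw) if raw else {"facts": [], "completed_tasks": [], "open_questions": []}

def update_state(conversation_id: str, *, facts=None, completed=None) -> dict:
    """Merge new facts and completed tasks into stored state after each turn."""
    state = load_state(conversation_id)
    state["facts"].extend(facts or [])
    state["completed_tasks"].extend(completed or [])
    _store[conversation_id] = json.dumps(state)
    return state

def relevant_context(state: dict, topic: str) -> str:
    """Format only the facts mentioning the current topic for prompt construction."""
    hits = [f for f in state["facts"] if topic.lower() in f.lower()]
    return "\n".join(hits)
```

The keyword match in `relevant_context` is the crudest possible selective retrieval; real systems usually upgrade this to semantic search without changing the surrounding structure.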

Long-term memory that learns instead of just storing

Long-term memory handles information that spans multiple conversations or sessions. Users hate repeating themselves. "I told you last week I prefer concise responses" should not need repeating every session.

The storage approach is straightforward: write facts to a database indexed by user ID. The retrieval approach is where implementations diverge between simple and sophisticated.

Simple retrieval: dump all stored facts about the user into every prompt. This suffers from the same context window problems as naive short-term memory and often includes irrelevant information that confuses the agent.

Sophisticated retrieval: extract relevant facts based on the current query using semantic search over stored memories. 

If the user asks about scheduling, retrieve memories related to schedule preferences, time zones, and calendar constraints. If they ask about data analysis, retrieve memories about preferred visualization styles and past analysis topics.

Amazon Bedrock's AgentCore Memory implements a three-stage process: extraction (identify memorable facts from conversations), consolidation (merge related facts across time), and retrieval (surface relevant memories for current context). This mirrors human memory more closely than simple storage and retrieval.

A production-ready memory system needs additional components:

  • Time-to-live policies that expire stale information automatically. A user's preference from 6 months ago might no longer be valid.

  • Access control that restricts which agents can read which memories, particularly in multi-tenant systems.

  • Audit logging for every memory read and write, essential in regulated industries.

Memory systems also need consolidation logic that identifies contradictory facts and resolves them. If stored memory says the user prefers brief responses but recent conversations show they ask for detailed explanations, the system should flag this conflict and resolve it based on recency or explicit user confirmation.
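A recency-based resolution rule can be sketched in a few lines. The memory record schema below is an assumption for illustration, not taken from any specific framework:

```python
def consolidate(memories: list[dict]) -> dict:
    """Resolve contradictory facts by recency: for each attribute, keep the most
    recently observed value. Items look like
    {"key": "response_style", "value": "brief", "observed": "2025-01-10"}
    (illustrative schema; ISO dates sort correctly as strings)."""
    resolved: dict[str, dict] = {}
    for m in sorted(memories, key=lambda m: m["observed"]):
        resolved[m["key"]] = m  # later observations overwrite earlier ones
    return {k: v["value"] for k, v in resolved.items()}
```

A stricter variant would surface the conflict to the user instead of silently preferring recency; which behavior is right depends on how costly a wrong preference is.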


Tool calling that doesn't spiral into recursive hell

Tools are functions the agent can call to perform actions: search databases, call APIs, execute code, retrieve documents. 

Tool calling is where agents deliver value. Tool calling is also where agents consume budgets and test your patience.

An agent has access to a search tool and a summarization tool. 

User asks a question. The agent searches, gets results, summarizes them, then decides the summary needs more context, searches again, summarizes the new results, merges summaries, decides the merge needs verification, searches a third time, and continues until you implement max iteration limits and the agent stops mid-task with a "maximum iterations exceeded" error.

This failure stems from unclear stopping conditions. The agent doesn't know when "enough information" has been gathered or when a task is "complete enough" to return results.

Explicit stopping conditions fix this. After each tool call, the agent must answer: "Did this tool call satisfy the user's request, or is additional work needed?" This decision should be deterministic when possible and LLM-based only when necessary.

For deterministic stopping:

  • Search tool returns zero results: stop and report no findings

  • API call returns an error: stop and report the error

  • API call returns success: stop and return the result

  • Result count exceeds expected range: stop and ask for clarification

For LLM-based stopping:

  • Present the accumulated results to the LLM and ask: "Is this sufficient to answer the user's question completely?"

  • If yes, format the final response

  • If no, specify what additional information is needed and execute ONE more tool call

The key is making the stopping decision explicit rather than assuming the agent will naturally terminate.
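The deterministic rules above can be collapsed into one function that runs after every tool call. The tool names and result fields here are assumptions for illustration:

```python
def should_stop(tool_name: str, result: dict, iteration: int, max_iters: int = 5):
    """Deterministic stopping rules applied after every tool call.
    Returns (stop, reason). Fall back to an LLM judgment only when
    none of these rules fire."""
    if iteration >= max_iters:
        return True, "max iterations reached"
    if result.get("error"):
        return True, f"tool error: {result['error']}"
    if tool_name == "search" and not result.get("hits"):
        return True, "no results found"
    if tool_name == "api_call" and result.get("status") == "success":
        return True, "action completed"
    return False, "continue"
```

Because the checks are ordinary code, they are testable and auditable in a way that "the LLM will figure it out" never is.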

Tool selection without guessing games

When an agent has access to 10+ tools, tool selection becomes a mini decision problem. 

The LLM reviews tool descriptions, matches them to the user query, and picks one. This works when tool descriptions are clear and distinct. This fails when descriptions are vague or overlapping.

Poor tool descriptions lead to wrong tool selection. An agent with both "search_documents" and "search_web" tools might pick the wrong one because both descriptions say "searches for information." The agent guesses based on subtle prompt variations, and LLM guesses are often wrong.

Good tool descriptions include:

  • What the tool does

  • When to use it (specific scenarios)

  • When NOT to use it (common mistakes)

  • Example inputs and outputs

  • Cost and latency implications if significant

Keep the full description concise, ideally under a few hundred tokens, for best performance.

A search_documents tool description might read: "Searches internal company documents for specific information. Use this when the question requires company-specific knowledge, policies, or internal data. 

Do NOT use this for general knowledge questions or web information. Average latency: 200ms. 

Example: 'What is our vacation policy?' → use this tool. Example: 'What is the capital of France?' → do NOT use this tool, answer from general knowledge."

Longer descriptions increase prompt size but dramatically improve tool selection accuracy. The trade-off usually favors accuracy in production systems, though very long descriptions can degrade overall reasoning on the task itself. Expect to experiment to find the sweet spot.
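Expressed as a tool schema in the OpenAI-style function-calling format (the parameter names and defaults are illustrative assumptions), the description above might look like this:

```python
# A tool definition in the OpenAI-style function-calling format.
# Parameter names, defaults, and the latency figure are illustrative.
search_documents_tool = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": (
            "Searches internal company documents for specific information. "
            "Use for company-specific knowledge, policies, or internal data. "
            "Do NOT use for general knowledge or web information. "
            "Average latency: 200ms. "
            "Example: 'What is our vacation policy?' -> use this tool. "
            "Example: 'What is the capital of France?' -> answer directly."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search phrase"},
                "max_results": {"type": "integer", "description": "Cap on hits"},
            },
            "required": ["query"],
        },
    },
}
```

Note that the when-to-use and when-not-to-use guidance lives in the description field, where the model actually reads it during tool selection.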

Some implementations add a tool-selection agent separate from the main agent. The tool-selection agent receives the user query and available tools, then returns the best tool choice with reasoning. 

The main agent executes using that tool. This adds latency but reduces wrong tool calls that waste time and money, and keeps the primary task at high quality.

Preventing tools from breaking things

Tools that read data are relatively safe. Tools that write data or trigger external actions are dangerous.

An agent with database access can generate valid SQL that deletes important records. An agent with email access can send messages to incorrect recipients. An agent with billing access can process refunds inappropriately.

Input validation before tool execution catches many problems:

  • Type checking: ensure parameters match expected types

  • Range validation: check that numeric values fall within reasonable bounds

  • Permission checking: verify the current user has rights to perform the requested action

  • Dry-run simulation: preview the action's impact before executing

Output validation after tool execution catches others:

  • Result sanity checking: does the output match expected format and content

  • Side-effect verification: did the action produce unintended consequences

  • Error rate monitoring: track tool failure rates and alert on anomalies

Guardrails implement these validations as reusable enforcement mechanisms. A guardrail can be a simple function that checks conditions or an LLM-based agent that evaluates request safety. If a guardrail detects a violation, it raises an exception and prevents tool execution.

Production systems implement guardrails as first-class components that run concurrently with the primary agent using optimistic execution. 

The agent generates tool calls while guardrails validate them in parallel. Violations trigger immediate stops before execution.


The 'stop doing stupid stuff' rules that actually work

Guardrails enforce constraints that prevent agent misbehavior. The difficulty is anticipating which stupid behaviors need prevention before users encounter them in production.

Preventing common agent stupidity through constraints

Repetition loops: Agent repeats the same information across multiple turns. Constraint: track message similarity, prevent sending messages with >80% overlap with recent messages.

Ignoring user corrections: User says "no, I meant X" and the agent continues with Y. Constraint: explicit correction detection that overwrites previous context when correction phrases appear.

Unauthorized actions: Agent attempts actions the current user lacks permissions for. Constraint: role-based access control that filters available tools based on user identity.

Hallucinated data: Agent invents plausible-sounding but false information. Constraint: citation requirements that force the agent to reference sources for factual claims, plus validation checks against known data.

Excessive tool usage: Agent calls expensive tools unnecessarily. Constraint: tool budget limits per conversation, plus warning messages when approaching limits.

Unsafe operations: Agent performs destructive actions without confirmation. Constraint: require explicit user confirmation for any DELETE, DROP, or modifying operation.

These constraints get implemented as middleware that sits between the agent and tool execution. The middleware inspects planned actions, applies rules, and either allows execution, blocks it, or requires additional user input.
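Two of those constraints, sketched as middleware functions (the similarity threshold, the `run_sql` tool name, and the keyword list are assumptions for illustration):

```python
from difflib import SequenceMatcher

def too_repetitive(candidate: str, recent: list[str], threshold: float = 0.8) -> bool:
    """Repetition guardrail: block replies with >80% overlap with a recent message."""
    return any(SequenceMatcher(None, candidate, prev).ratio() > threshold
               for prev in recent)

DESTRUCTIVE = ("delete", "drop", "truncate", "update")

def check_tool_call(tool: str, args: dict, user_confirmed: bool) -> str:
    """Middleware decision before execution: 'allow', or 'confirm' for
    destructive SQL that the user has not yet explicitly approved."""
    sql = args.get("sql", "").lower()
    if tool == "run_sql" and any(kw in sql for kw in DESTRUCTIVE):
        return "allow" if user_confirmed else "confirm"
    return "allow"
```

Both checks are pure functions, so they run in microseconds and can sit in front of every tool call without adding meaningful latency.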

Guardrails as code vs guardrails as agents

Simple guardrails work as pure functions: input validation, regex matching, keyword filtering. These execute fast, cost nothing, and catch predictable violations. Use function-based guardrails for:

  • Blocklist enforcement (banned words or phrases)

  • Format validation (email addresses, phone numbers)

  • Numeric range checking

  • Permission verification

Complex guardrails require LLM reasoning: intent classification, jailbreak detection, relevance checking, safety evaluation. These execute slower, cost tokens, but catch sophisticated violations. Use LLM-based guardrails for:

  • Detecting prompt injection attempts

  • Evaluating whether a request is within scope

  • Assessing output appropriateness

  • Identifying subtle policy violations

A production system uses both. Fast guardrails run first to catch obvious problems cheaply. LLM guardrails run second for edge cases requiring reasoning. This layered approach balances cost and effectiveness.

Building evaluation that catches failures before users do

Guardrails prevent specific violations. Evaluation checks whether the agent accomplishes its intended purpose correctly. Evaluation happens at two stages: development and production.

Development evaluation uses test sets with known correct answers. For a customer support agent, test questions might include common inquiries with verified correct responses. The agent processes each test case, and evaluation metrics compare generated responses to reference answers. Metrics track:

  • Accuracy: percentage of correct responses

  • Latency: time from query to response

  • Tool usage: number and type of tool calls

  • Token consumption: input and output token counts

Production evaluation monitors live traffic using different techniques since correct answers are not known beforehand. Instead of checking correctness, production evaluation checks for warning signs:

  • Conversation abandonment rates (users give up mid-conversation)

  • Explicit negative feedback (users mark responses as unhelpful)

  • Retry patterns (users rephrase the same question multiple times)

  • Anomalous latency or cost spikes

Martin Fowler's work on GenAI products emphasizes that evaluation is not optional—it becomes the primary method for ensuring LLM-based systems behave as intended. Unlike deterministic systems where unit tests provide certainty, probabilistic systems require statistical evaluation across scenarios.

A practical evaluation workflow for small teams:

  1. Build a test set of 50-100 representative queries with reference answers

  2. Run the agent against this test set after each change

  3. Track metrics over time to detect regressions

  4. Add failing production cases to the test set to prevent recurrence

  5. Review logs weekly to identify new failure modes

This continuous evaluation cycle catches problems early and accumulates institutional knowledge about agent behavior.
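The core of that workflow fits in one small harness. The containment check below is a deliberately crude stand-in for real answer grading (exact match, rubric scoring, or LLM-as-judge):

```python
def run_eval(agent, test_set: list[dict]) -> dict:
    """Run the agent over a test set of {'query', 'expected'} cases and
    report aggregate accuracy. `agent` is any callable query -> answer."""
    passed = 0
    failures = []
    for case in test_set:
        answer = agent(case["query"])
        if case["expected"].lower() in answer.lower():  # crude containment check
            passed += 1
        else:
            failures.append(case["query"])
    return {"accuracy": passed / len(test_set), "failures": failures}
```

Run this after every prompt or tool change and track the accuracy number over time; the `failures` list tells you exactly which cases to investigate when it drops.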


The actual architecture decisions that matter

Textbook agent architectures show clean diagrams with the LLM in the center, tools in a circle around it, and memory as a separate box. Reality is messier. Production agents need decisions about state persistence, error recovery, logging, and cost management.

State management without distributed systems headaches

Agent state includes conversation history, tool call results, user preferences, task progress, and temporary computations. This state needs persistence across requests, especially for long-running tasks that span multiple user interactions.

The simplest state management: store everything in a database keyed by conversation ID. Each agent invocation loads state, processes the turn, saves updated state. This works for low-traffic applications where loading and saving state adds acceptable latency.

Higher-traffic applications need caching layers. Store frequently accessed state in Redis or similar in-memory stores, with periodic flushes to persistent storage. This reduces database load and improves response times.

Stateless agent architectures push state management to external systems entirely. 

The agent itself maintains no state between invocations; all state lives in a state service that the agent queries. This enables horizontal scaling where any agent instance can handle any request without session affinity.

Choosing between stateful and stateless depends on consistency requirements and scaling needs. Customer support agents handling hundreds of concurrent conversations benefit from stateless designs. Personal assistant agents with single-user workloads run fine with stateful designs.

Error recovery that doesn't lose user trust

Agents fail. LLMs refuse to generate output. APIs timeout. Tools return errors. Users submit malformed requests. The difference between a working agent and a frustrating one is how failures get handled.

Silent failures destroy trust. The agent encounters an error, generates a bland response like "I couldn't complete that request," and moves on. The user has no idea what went wrong or how to fix it.

Transparent failures maintain trust. The agent encounters an error, explains specifically what failed ("The database query timed out after 30 seconds"), suggests recovery actions ("Would you like me to retry with a smaller date range?"), and logs the failure for debugging.

Automatic retry with exponential backoff handles transient failures. If an API call fails, wait 1 second and retry. If it fails again, wait 2 seconds. After 3 failures, report the error to the user rather than continuing retries indefinitely.
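A minimal sketch of that retry policy, wrapping any flaky callable:

```python
import time

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff: wait 1s, then 2s,
    then give up and re-raise so the agent can report the error transparently."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # surface the final failure instead of retrying forever
            time.sleep(base_delay * 2 ** attempt)
```

Letting the final exception propagate is deliberate: the caller turns it into the specific, transparent error message described above rather than a bland "I couldn't complete that request."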

Graceful degradation provides partial results when complete results are unavailable. If 8 out of 10 data sources return results successfully, show those 8 with a note that 2 sources are temporarily unavailable. This beats showing nothing while waiting for all 10 sources.

Logging and observability without drowning in data

Production agents generate massive log volumes. Every LLM call, tool execution, state change, and error contributes to logs. Without structure, these logs become noise rather than signal.

Structured logging with consistent schemas makes logs searchable and analyzable. Each log entry includes:

  • Timestamp

  • Conversation ID

  • User ID (anonymized if necessary)

  • Event type (LLM call, tool execution, error)

  • Input and output data

  • Latency

  • Token counts and costs
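The schema above translates directly into a one-line-per-event JSON logger. Field names here are an illustrative convention, not a standard:

```python
import json
import time

def log_event(conversation_id: str, user_id: str, event_type: str,
              payload: dict, latency_ms: float, tokens: int) -> str:
    """Emit one structured log line with a consistent schema."""
    entry = {
        "ts": time.time(),
        "conversation_id": conversation_id,
        "user_id": user_id,          # anonymize upstream if required
        "event": event_type,         # e.g. "llm_call", "tool_execution", "error"
        "payload": payload,
        "latency_ms": latency_ms,
        "tokens": tokens,
    }
    line = json.dumps(entry)
    print(line)  # replace with a file or log-service handler in production
    return line
```

Because every line is valid JSON with the same keys, the aggregate metrics below become simple queries instead of regex archaeology.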

Aggregate metrics from logs enable monitoring:

  • Requests per minute

  • Average latency by conversation stage

  • Tool usage frequency and success rates

  • Token consumption trends

  • Error rates by type

Alerting on metric anomalies catches problems early. 

A sudden spike in tool call failures might indicate an API outage. A jump in average latency could signal context window bloat. A drop in conversation completion rates suggests users are abandoning the agent due to poor responses.

Sampling reduces log volume for high-traffic systems. Log every conversation fully during development. In production, log every conversation that encounters errors, plus a random sample of successful conversations for baseline analysis.

Building agents cheap enough to ship

Most agent tutorials assume unlimited budgets for API calls. Reality imposes constraints. A conversational agent that costs $5 per user session cannot support free-tier users or high-volume applications.

Cost control through smart prompting and caching

Prompt size directly impacts cost. Every token in the prompt costs money. Long system prompts, verbose tool descriptions, and full conversation histories accumulate fast.

Prompt compression techniques reduce costs without losing capability:

  • Use abbreviations in tool descriptions while maintaining clarity

  • Implement prompt caching for static content that repeats across requests

  • Truncate old conversation history intelligently rather than including everything

  • Use smaller models (GPT-4o-mini vs GPT-4) for simple classification and routing tasks

OpenAI and Anthropic offer prompt caching that stores frequently used prompt sections server-side. Subsequent requests pay reduced rates for cached content. This provides 50-90% cost reduction for prompts with substantial static content like system instructions and tool descriptions.

Tool choice impacts cost even more than prompt size. Calling external APIs, performing web searches, or executing complex database queries adds charges beyond the LLM. Unnecessary tool calls multiply costs without adding value.

A cost-monitoring system tracks spending per conversation and per user. Set budget limits that trigger warnings or halt execution when exceeded. This prevents runaway costs from misconfigured agents or adversarial users attempting to drain your budget.
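A per-conversation budget guard can be this small. The limit and warning ratio are illustrative, not vendor pricing:

```python
class BudgetGuard:
    """Per-conversation spend tracker: warn near the limit, halt at it.
    Limits here are illustrative defaults, not recommendations."""

    def __init__(self, limit_usd: float = 0.50, warn_ratio: float = 0.8):
        self.limit = limit_usd
        self.warn_at = limit_usd * warn_ratio
        self.spent = 0.0

    def record(self, cost_usd: float) -> str:
        """Record a cost and return 'ok', 'warn', or 'halt'."""
        self.spent += cost_usd
        if self.spent >= self.limit:
            return "halt"
        if self.spent >= self.warn_at:
            return "warn"
        return "ok"
```

Call `record` after every LLM and tool invocation; a "halt" return is the point where the agent stops and tells the user it has hit its budget, rather than silently continuing to spend.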

The minimal stack that actually ships

You do not need enterprise MLOps platforms to build working agents. A functional minimal stack for small teams:

Components:

  • An LLM API (OpenAI, Anthropic, or open source models via Replicate/Together)

  • A database for state storage (PostgreSQL, MongoDB)

  • A cache for fast state access (Redis optional but recommended)

  • A web framework for the API layer (FastAPI, Express)

  • Basic logging infrastructure (structured logs to files or log service)

Workflow:

  1. User sends message via API

  2. Load conversation state from database/cache

  3. Construct prompt with state and user message

  4. Call LLM API, include available tools

  5. Process tool calls if requested, execute with validation

  6. Save updated state

  7. Return response to user
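The seven steps above can be sketched as one turn-handling function. The `llm` callable, the decision dict shape, and the tool registry are stand-ins for a real API client:

```python
def handle_turn(conversation_id: str, message: str, llm, tools: dict, store: dict) -> str:
    """One pass through the seven-step loop. `llm` is a callable returning
    either {"tool": name, "args": {...}} or {"tool": None, "text": reply}."""
    state = store.setdefault(conversation_id, {"history": []})     # 2. load state
    prompt = f"Context: {state['history'][-3:]}\nUser: {message}"  # 3. construct prompt
    decision = llm(prompt)                                         # 4. call LLM
    if decision.get("tool"):                                       # 5. tool call
        tool_fn = tools.get(decision["tool"])
        result = tool_fn(**decision["args"]) if tool_fn else "unknown tool"
        reply = f"Tool result: {result}"
    else:
        reply = decision["text"]
    state["history"].append((message, reply))                      # 6. save state
    return reply                                                   # 7. respond
```

The real version adds the validation, stopping conditions, and budget checks discussed earlier, but the control flow stays this simple.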

Cost breakdown for 1,000 daily active users:

  • LLM API calls: $100-300/month depending on model and usage patterns

  • Database: $20-50/month for managed PostgreSQL

  • Redis cache: $10-30/month or free tier for low volume

  • Compute for web server: $20-50/month for basic cloud VM

  • Monitoring and logs: $10-20/month for basic observability

Total: $160-450/month, covering 30,000 monthly conversations at moderate complexity.

This stack scales to 10x volume by upgrading database and compute. It scales to 100x volume by adding horizontal scaling and distributed caching. Start simple, scale when needed.


Framework choices that don't trap you later

Agent frameworks promise faster development by abstracting infrastructure. They deliver on that promise initially, then impose constraints and complexity as your requirements grow.

When frameworks help (and when they hurt)

LangChain, LlamaIndex, AutoGen, CrewAI, and similar frameworks provide pre-built components for memory, tools, and orchestration. They excel at rapid prototyping where you need a working agent in hours.

Frameworks hurt when you need customization beyond their supported use cases. Debugging framework internals when things break becomes painful because you are navigating abstraction layers rather than your own code. 

Performance optimization requires understanding framework implementation details you did not want to learn.

A pragmatic approach: use frameworks for prototyping and learning, then selectively extract what works for production. 

Take the memory management logic you like, reimplement it directly in your code without the framework wrapper. Keep the tool-calling abstractions if they fit your needs. Drop the orchestration layer if it adds complexity without value.

For teams building production agents as core products, custom implementations often win long-term despite higher initial cost. You control every component, debug more easily, and optimize without framework constraints.

For teams building agents as features within larger products, frameworks can be fine if they remain maintainable and the framework's direction aligns with your needs.

The "just use the API directly" approach

Calling LLM APIs directly without frameworks requires more code initially. You write your own prompt construction, tool-calling logic, and state management. In return, you get simplicity and full control.

A direct API implementation for a simple agent fits in 200-300 lines of Python:

  • 50 lines for state management

  • 50 lines for prompt construction

  • 50 lines for tool-calling logic

  • 50 lines for API communication and error handling

  • 50 lines for validation and guardrails

This simplicity makes debugging straightforward and onboarding new developers fast. The code does exactly what it says with no hidden framework behaviors.

The trade-off is that you implement features frameworks provide for free: memory persistence, conversation management, multi-agent coordination. 

For single-agent use cases, this overhead is minimal. For complex multi-agent systems, frameworks provide more value.

What production-ready agents actually need

Production readiness means the agent works reliably for real users over weeks and months, not just in demos. The requirements:

Deterministic behavior for critical paths: When users perform important actions (payments, data modifications), the agent should execute deterministic code paths, not LLM-generated ones.

Comprehensive logging: Every decision, tool call, and error gets logged with enough context to reconstruct what happened and why.

Graceful degradation: When components fail, the system provides partial functionality rather than complete failure.

Cost monitoring and budget enforcement: Track spending in real-time, alert on anomalies, halt execution at budget limits.

Security boundaries: Validate inputs to prevent prompt injection, limit tool access based on user permissions, sanitize outputs before display.

Evaluation and testing: Continuous testing against known scenarios, monitoring for regressions, user feedback collection and analysis.

Documentation: Clear operational runbooks for common issues, system architecture diagrams that match reality, known limitations documented for users.

These requirements feel boring compared to the excitement of building your first working agent. They separate systems that survive production from systems that generate expensive incident post-mortems.

Looking forward: Agents that learn from mistakes

The current generation of agents forgets everything after each session unless you explicitly implement memory. The next generation will need to learn from mistakes and improve over time without constant human intervention.

This learning happens through evaluation loops that detect patterns in failures, automatically adjust system prompts or guardrails, and validate that changes improve behavior. 

The technical pieces exist today—we have evaluation frameworks, automated prompt optimization, and behavioral analysis tools. Connecting them into reliable self-improvement loops remains an active area of development.

For now, build agents with good memory, clear stopping conditions, and strong guardrails. Let them execute tools safely. 

Track their behavior carefully. When they mess up, fix the specific failure and add it to your test set. This manual improvement loop works and scales to teams handling thousands of conversations.

The autonomous improvement loop can wait until you have the foundation solid.

I hope this guide gives you the practical pieces for building agents that work reliably instead of spectacularly failing in production. Come back later for more posts on building AI systems that respect both your users and your budget.

