Most agent demos work because someone curated the happy path. Shipping one means accepting that the model will hallucinate a tool argument, the retrieval step will return three contradictory memories, and an API will time out mid-conversation. I have built an LLM "operating system" with persistent memory and tool calling, and the parts that mattered were almost never the prompt. They were the plumbing around it. Here is what I keep coming back to.
Memory Is a Retrieval Problem, Not a Storage Problem
Persisting everything is easy. The hard part is deciding what to surface into a finite context window on each turn. I treat memory as a small retrieval pipeline with three stages: candidate fetch, scoring, and deduplication.
Scoring is where the quality lives. A pure vector similarity search will happily return five paraphrases of the same fact. I combine semantic relevance with recency and a usage count so that things the agent actually relied on before float up:
- Relevance: cosine similarity between the query embedding and the memory embedding.
- Recency: an exponential decay on last_accessed, so stale memories fade without being deleted.
- Reinforcement: a small boost each time a memory is retrieved and used in a successful turn.
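As a minimal sketch, the three signals might combine into one score like this. The weights, the one-week half-life, the log-damped usage boost, and every field name other than last_accessed are illustrative assumptions rather than values from the system described here.

```typescript
// Hypothetical memory record; only last_accessed is named in the text above.
interface Memory {
  id: string;
  text: string;
  embedding: number[];
  last_accessed: number; // epoch milliseconds
  useCount: number;      // times retrieved and used in a successful turn
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

const HALF_LIFE_MS = 7 * 24 * 60 * 60 * 1000; // assumed: one week

function scoreMemory(query: number[], m: Memory, now: number): number {
  const relevance = cosine(query, m.embedding);
  const age = Math.max(0, now - m.last_accessed);
  const recency = Math.pow(0.5, age / HALF_LIFE_MS); // 1 when fresh, 0.5 after one half-life
  const reinforcement = Math.log1p(m.useCount);      // damped so repeated use has diminishing returns
  return 0.7 * relevance + 0.2 * recency + 0.1 * reinforcement; // assumed weights
}

// Score every candidate and return the top k, highest score first.
function rankMemories(query: number[], candidates: Memory[], now: number, k: number): Memory[] {
  return candidates
    .map((m) => ({ m, s: scoreMemory(query, m, now) }))
    .sort((x, y) => y.s - x.s)
    .slice(0, k)
    .map((x) => x.m);
}
```

The weights are the part worth tuning against real traces; a linear blend is only the simplest thing that lets recency and usage break ties between near-equal relevance scores.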
Deduplication then collapses near-identical entries above a similarity threshold, keeping the most recent and dropping the rest. This is unglamorous and it is the single biggest lever on perceived "intelligence."
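The deduplication pass can be sketched as a greedy filter. The Entry shape, the 0.92 threshold, and the pluggable similarity function are assumptions for illustration; in practice the similarity would be the same cosine measure used for relevance.

```typescript
// Hypothetical minimal shape: dedup only needs an embedding and a timestamp.
interface Entry {
  id: string;
  embedding: number[];
  last_accessed: number; // epoch milliseconds
}

type Similarity = (a: number[], b: number[]) => number;

// Greedy dedup: visit entries newest first and keep one only if it is not
// too similar to anything already kept. O(n^2), which is fine for the few
// dozen candidates that survive scoring.
function dedupe<T extends Entry>(entries: T[], similarity: Similarity, threshold = 0.92): T[] {
  const newestFirst = [...entries].sort((a, b) => b.last_accessed - a.last_accessed);
  const kept: T[] = [];
  for (const e of newestFirst) {
    const isDuplicate = kept.some((k) => similarity(k.embedding, e.embedding) > threshold);
    if (!isDuplicate) kept.push(e);
  }
  return kept;
}
```

Note that the result comes back in recency order, so re-sort by score afterwards if the ranking matters for prompt placement.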
A model with mediocre reasoning and excellent memory retrieval beats a strong model fed irrelevant context. Spend your time on retrieval ranking before you reach for a bigger model.
Tool Calling Needs Strict Contracts
The model proposes, your code disposes. I never let a tool call execute against the raw model output. Every tool has a schema, and arguments are validated before anything runs. Define the contract once and let the runtime enforce it:

    import { z } from "zod";

    // Runtime contract: what the handler will actually accept.
    const lookupOrderArgs = z.object({
      orderId: z.string().regex(/^ORD-\d{6}$/),
      includeLineItems: z.boolean().default(false),
    });

    export const lookupOrder = {
      name: "lookup_order",
      description: "Fetch an order by its ID. Use only when the user gives an explicit order number.",
      // JSON Schema shown to the model; keep it in sync with the zod schema above.
      input_schema: {
        type: "object",
        properties: {
          orderId: { type: "string", description: "Format ORD-123456" },
          includeLineItems: { type: "boolean" },
        },
        required: ["orderId"],
      },
      handler: async (raw) => {
        const args = lookupOrderArgs.parse(raw); // throws on bad input
        // db is the application's data-access layer, assumed to be in scope.
        return db.orders.findById(args.orderId, args.includeLineItems);
      },
    };

Two things matter here. The description is prompt engineering: it tells the model when not to call the tool, which prevents most spurious invocations. And the parse step converts a malformed call into a typed error you can feed back to the model as a tool result, letting it self-correct on the next turn instead of crashing the run.
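One way to turn a failed parse into a tool result is a small wrapper around whatever parser the tool uses. This is a dependency-free sketch: the ToolCall and ToolResult shapes and the stand-in parser are assumptions, so adapt the field names to whatever your model API expects.

```typescript
// Hypothetical shapes; adapt field names to your provider's tool-result format.
interface ToolCall { id: string; name: string; input: unknown; }
interface ToolResult { tool_use_id: string; content: string; is_error: boolean; }

type Validated<T> = { ok: true; args: T } | { ok: false; result: ToolResult };

// Run a parser that throws on bad input (for example a schema's parse method)
// and turn any failure into a tool result the model can read.
function validateCall<T>(call: ToolCall, parse: (raw: unknown) => T): Validated<T> {
  try {
    return { ok: true, args: parse(call.input) };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return {
      ok: false,
      result: {
        tool_use_id: call.id,
        content: `Invalid arguments for ${call.name}: ${reason}`,
        is_error: true,
      },
    };
  }
}

// Stand-in parser so the sketch runs without dependencies.
function parseLookupOrderArgs(raw: unknown): { orderId: string } {
  const orderId = (raw as { orderId?: unknown } | null)?.orderId;
  if (typeof orderId !== "string" || !/^ORD-\d{6}$/.test(orderId)) {
    throw new Error("orderId must match ORD- followed by six digits");
  }
  return { orderId };
}
```

The runtime executes the handler only when ok is true; otherwise it appends result to the conversation and lets the model try again with the error message in view.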
Trace Every Step or You Are Debugging Blind
An agent run is a tree of decisions, and when it goes wrong in production you cannot reproduce it from a chat log. I record a structured trace for every turn: the messages sent, the tool calls proposed, the validated arguments, the raw tool results, latency, and token counts. Each run gets an ID that ties together every step.
This pays for itself the first time a customer reports a bad answer. You pull the trace and see the agent called lookup_order with an ID it invented because a memory entry was stale. Without the trace you would be guessing. With it, you fix the retrieval scoring and move on.
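The per-turn trace described above could be as simple as the following sketch. The field names and the JSON Lines output are assumptions chosen for illustration; the point is only that every step carries the run ID and enough raw data to replay the decision.

```typescript
// Hypothetical trace schema covering the fields listed above.
interface TraceStep {
  runId: string;
  turn: number;
  messages: unknown[];      // messages sent to the model
  proposedCalls: unknown[]; // tool calls exactly as the model emitted them
  validatedArgs: unknown[]; // arguments after schema validation
  toolResults: unknown[];   // raw tool results
  latencyMs: number;
  tokens: { input: number; output: number };
}

class TraceRecorder {
  readonly runId: string;
  private steps: TraceStep[] = [];

  constructor(runId: string) {
    this.runId = runId;
  }

  // Stamps the step with the run ID and the next turn index.
  record(step: Omit<TraceStep, "runId" | "turn">): TraceStep {
    const full: TraceStep = { ...step, runId: this.runId, turn: this.steps.length };
    this.steps.push(full);
    return full;
  }

  // One JSON object per line, easy to ship to a log store and filter by runId.
  toJsonLines(): string {
    return this.steps.map((s) => JSON.stringify(s)).join("\n");
  }
}
```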
Guardrails and Failure Handling
Agents fail in predictable ways, so handle them predictably:
- Loop detection: cap tool-call iterations per turn and break if the same tool is called with the same arguments twice.
- Tool errors as data: never throw out of a handler into the void. Return the error as a tool result so the model can react.
- Output validation: if a turn must produce structured output, validate it and re-prompt once on failure before giving up.
- Timeouts and fallbacks: every external call gets a timeout, and a failed tool returns a graceful "I could not reach that system" rather than hanging.
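Two of these guardrails fit in a few lines each. The sketch below assumes a cap of eight calls per turn and compares arguments by their JSON serialization; both are illustrative choices, and the class and function names are hypothetical.

```typescript
// Loop guard: caps tool calls per turn and flags an exact repeat of tool + arguments.
class LoopGuard {
  private seen = new Set<string>();
  private count = 0;
  private maxCalls: number;

  constructor(maxCalls = 8) { // assumed cap
    this.maxCalls = maxCalls;
  }

  // Returns null if the call may proceed, otherwise the reason to stop.
  check(tool: string, args: unknown): string | null {
    this.count += 1;
    if (this.count > this.maxCalls) return `exceeded ${this.maxCalls} tool calls in one turn`;
    // JSON.stringify is sensitive to key order; canonicalize if your model reorders keys.
    const key = `${tool}:${JSON.stringify(args)}`;
    if (this.seen.has(key)) return `repeated call to ${tool} with identical arguments`;
    this.seen.add(key);
    return null;
  }
}

// Timeout wrapper: resolves with the fallback instead of hanging or rejecting.
// It does not cancel the underlying work; pass an abort signal to the call for that.
async function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } catch {
    return fallback; // a failed call degrades the same way as a slow one
  } finally {
    clearTimeout(timer);
  }
}
```

Create one LoopGuard per turn, and pass the graceful message as the fallback, for example withTimeout(callTool(), 5000, "I could not reach that system"), where callTool stands in for your own tool invocation.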
The discipline that ties this together is treating the model as one untrusted component in a deterministic system, not as the system itself. The model is creative; your runtime is boring on purpose. When you draw that line clearly, agents stop being impressive demos and start being software you can put a customer in front of.