If you’ve ever built an AI agent that executes multiple steps, you’ve probably encountered the dreaded “mid-loop crash.”

You give the AI a complex task. It thinks. It runs a tool to read a file. It thinks again. It runs a tool to query a database. It thinks again. And then—right before the final step—a network timeout occurs. Or the API limit is hit. Or the system restarts.

And just like that, the entire context, all the expensive LLM tokens, and all the progress are lost. The agent has amnesia. You have to start over.

In the hide-ide project, our primary voice-assistant orchestrator (`agent_loop.py`) suffered from exactly this problem: a monolithic `while` loop holding the entire conversational state in memory.

Here is what the traditional, fragile architecture looks like:

```mermaid
sequenceDiagram
    participant P as Speech Processor
    participant A as AgentLoop (Memory)
    participant L as LLM
    participant T as Tools

    P->>A: User input
    loop until complete
        A->>L: Predict next step
        L-->>A: Call Tool (e.g. bash_execute)
        A->>T: Execute Tool
        T-->>A: Tool result
        note over A: Crash happens here!
        note over A,P: 💥 Fatal Error: Process Restarted
        note over A,P: All previous LLM context & tool results are lost
    end
```

The Durable Solution: Restate + Pydantic AI

To solve this, we looked at Durable Execution. A durable execution engine intercepts every non-deterministic action (like an API call or a tool execution) and records it in a persistent journal. If the process crashes, the engine simply restarts the function and replays the journal. It fast-forwards through the parts that already succeeded and resumes exactly where it left off.
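The journal-and-replay idea can be sketched in a few lines of plain Python. This is a toy illustration of the mechanism, not Restate's actual implementation (which persists the journal remotely and handles retries, timeouts, and concurrency):

```python
class Journal:
    """Toy durable-execution journal: records each side effect the first
    time it runs, and replays the recorded result on re-execution."""

    def __init__(self):
        self.entries = []   # results of completed steps, in order
        self.cursor = 0     # replay position for the current run

    def run(self, name, fn):
        if self.cursor < len(self.entries):
            # Step already succeeded in a previous run: replay, don't re-run.
            result = self.entries[self.cursor]
        else:
            result = fn()                 # execute the side effect...
            self.entries.append(result)   # ...and persist its result
        self.cursor += 1
        return result

    def restart(self):
        # Simulate a process crash + restart: journal survives, cursor resets.
        self.cursor = 0


calls = []  # tracks how many times the "tools" actually execute

def agent_run(journal):
    a = journal.run("read_file", lambda: calls.append("read") or "file-data")
    b = journal.run("query_db", lambda: calls.append("query") or "db-rows")
    return a, b

j = Journal()
agent_run(j)           # first run: both tools actually execute
j.restart()            # 💥 crash! the process restarts
result = agent_run(j)  # replay: journaled results reused, tools not re-run
```

After the simulated crash, `calls` still contains only two entries: the replayed run fast-forwards through the journal instead of executing the tools a second time.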

We decided to build an experimental alternative loop (`pydantic_loop.py`) using two powerful tools:

  1. Pydantic AI: A modern framework for building type-safe agents with simple `@agent.tool()` decorators.
  2. Restate: A lightweight durable execution runtime.

Here is the durable architecture we implemented:

```mermaid
sequenceDiagram
    participant P as Speech Processor
    participant R as Restate Server (Journal)
    participant A as PydanticLoop (Stateless)
    participant L as LLM
    participant T as Tools

    P->>R: User input (POST /PydanticLoop/run)
    R->>A: Invoke run handler
    loop until complete
        A->>L: Predict next step
        L-->>A: Call Tool
        A->>R: run_typed(tool_call)
        R->>A: Execute Tool (if not journaled)
        A->>T: Execute Tool
        T-->>A: Tool result
        A->>R: Save result to Journal
        R-->>A: Continue
        note over A: Crash happens here! 💥
    end
    note over R,A: Restate detects crash, replays journal
    R->>A: Re-invoke run handler
    note over A,R: Agent skips executed tools, uses journaled results!
```

The Implementation Details

With Pydantic AI’s Restate integration, wrapping an agent in a durable context is remarkably simple.

First, we define our standard agent:
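A minimal sketch of the agent definition (the model name and system prompt here are illustrative placeholders, not the project’s actual configuration):

```python
from pydantic_ai import Agent

agent = Agent(
    "openai:gpt-4o",  # placeholder: any model supported by Pydantic AI
    system_prompt=(
        "You are a voice-driven coding assistant. "
        "Use the available tools to inspect and modify the project."
    ),
)
```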

Then, we define our tools. The magic happens inside the tool execution: instead of just running the bash command, we wrap it in `restate_context().run_typed()`. This ensures that the execution of that specific bash command is journaled. If the agent crashes after the tool runs, Restate supplies the journaled output on the next retry instead of running the tool twice!
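A sketch of what such a durable tool looks like. `restate_context()` is the integration helper mentioned above; `run_bash` is a hypothetical stateless helper that runs the command in a fresh subprocess (the exact signatures in `pydantic_loop.py` may differ):

```python
from pydantic_ai import RunContext

@agent.tool()
async def bash_execute(ctx: RunContext[None], command: str) -> str:
    """Run a shell command and return its output."""
    restate_ctx = restate_context()  # the ambient Restate context
    # run_typed journals the result under a stable name: on replay after
    # a crash, the recorded output is returned and run_bash is skipped.
    return await restate_ctx.run_typed("bash_execute", run_bash, command=command)
```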

Finally, we expose the agent as a Restate Service:
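In outline, this looks roughly like the following (handler and service names mirror the `POST /PydanticLoop/run` endpoint shown in the diagram; treat the details as a sketch of the Restate Python SDK usage, not a verbatim excerpt):

```python
import restate

pydantic_loop = restate.Service("PydanticLoop")

@pydantic_loop.handler()
async def run(ctx: restate.Context, user_input: str) -> str:
    # The Pydantic AI agent runs inside the durable handler, so every
    # model call and tool execution is journaled by Restate.
    result = await agent.run(user_input)
    return result.output

# The ASGI app served by Hypercorn and registered with the Restate server.
app = restate.app(services=[pydantic_loop])
```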

The Trade-offs

This new architecture brings massive benefits for long-running, multi-step agent tasks, but it isn’t without its challenges:

  1. Stateless Operations Required: Our original `agent_loop` used a `pexpect` background process to maintain a stateful bash shell (so `cd /dir` and `export VAR=1` would persist across tool calls). Restate replays expect deterministic, stateless closures. A dead `pexpect` process cannot be rehydrated from a journal, so our new tools had to be rewritten to be completely stateless.
  2. Infrastructure Complexity: It requires running the Restate Server alongside our application. However, deploying the Restate Docker container (`ghcr.io/restatedev/restate`) and linking it to our Hypercorn ASGI app via the Restate admin API (`/deployments`) was a relatively smooth process.
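The deployment step amounts to roughly this (ports are Restate’s defaults; the app URL is an example, not our actual address):

```shell
# Run the Restate server: 8080 is the ingress port, 9070 the admin API.
docker run -d --name restate \
  -p 8080:8080 -p 9070:9070 \
  ghcr.io/restatedev/restate:latest

# Register our Hypercorn-served app with the Restate admin API.
curl -X POST http://localhost:9070/deployments \
  -H 'content-type: application/json' \
  -d '{"uri": "http://host.docker.internal:9080"}'
```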

Conclusion

In-memory loops are fine for simple, one-shot chatbot interactions. But when your AI assistant is operating as an autonomous agent—navigating file systems, modifying code, and orchestrating multiple tools over several minutes—fragility is unacceptable.

By combining the schema-enforced power of Pydantic AI with the crash-proof journaling of Restate, we can finally build AI agents that are as resilient as the distributed systems they help us manage.

If you’re interested in discussing AI architecture or want to see more experiments like this, let’s connect!

Best Regards,
Heikki Kupiainen / Metamatic Systems