3. Building a Distributed Agent
This chapter covers
- From the terminal to the cloud
- Distributed computation
- Distributed Async Await
- The Research Agent
When you build an agentic application that runs on your machine, you enjoy the simplicity of building a local application. When you build a distributed agentic application, you face the challenges of building a distributed application. This gap separates proof of concept from production.
In Chapter 2, we built the Desktop Warrior, a local agent living in a single process. That process boundary quietly solved some of the hardest problems in systems design: identity, oneness, continuity in time and space. The process was the agent. We never had to ask "Which Desktop Warrior?" because the agent instance was identified by the process. We never had to wonder "What does Desktop Warrior remember?" because the agent instance's memory was contained in the process's memory.
When we move from local to global, that boundary disappears. One logical agent instance may span many physical processes across time and space. What felt continuous becomes fragmented. What felt durable becomes ephemeral.
In this chapter, we will build The Research Agent—a distributed, recursive agent that breaks a research topic into subtopics, researches each subtopic recursively, and synthesizes the results. Recursion is a key property: one agent instance spawns additional instances, creating a sprawling multi-agent system. From a simple code base, we’ll uncover the core challenges of distributed computation, coordination, and recovery (see Figure 3.1).
Figure 3.1: The distributed, recursive Research Agent
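Before we dive in, the recursive shape of the Research Agent can be sketched in a few lines. This is a minimal sketch, not the implementation we will build: `decompose`, `summarize`, and `synthesize` are hypothetical stand-ins for LLM calls, stubbed deterministically so the recursion itself is visible.

```python
def decompose(topic: str, depth: int) -> list[str]:
    # Stub: split a topic into two subtopics until max depth is reached.
    if depth == 0:
        return []
    return [f"{topic} / sub-{i}" for i in (1, 2)]

def summarize(topic: str) -> str:
    # Stub: a leaf-level "research" result.
    return f"summary({topic})"

def synthesize(topic: str, parts: list[str]) -> str:
    # Stub: combine child results into one result.
    return f"{topic}: " + " + ".join(parts)

def research(topic: str, depth: int = 2) -> str:
    # One agent instance spawns an instance per subtopic, recursively,
    # then synthesizes the children's results.
    subtopics = decompose(topic, depth)
    if not subtopics:
        return summarize(topic)
    results = [research(sub, depth - 1) for sub in subtopics]
    return synthesize(topic, results)

print(research("What are distributed systems", depth=1))
```

Locally, this recursion is trivial. The rest of the chapter is about what happens when each recursive call may run in a different process, on a different machine, at a different time.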
The defining characteristic of concurrency is partial order. In a concurrent system, we do not know what will happen next. To mitigate concurrency, we employ coordination. Coordination refers to constraining possible schedules to desirable schedules, while synchronization refers to enforcing that constraint. The fundamental operation of synchronization is to wait.
E₁ | E₂ ≡ (E₁ ; E₂) ∨ (E₂ ; E₁)
Ideally, two executions that run concurrently should produce the same result as running them one after the other, in either order.
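This equivalence is easy to demonstrate with a classic lost-update example. The sketch below uses a lock to constrain the possible schedules: each execution's read-modify-write runs atomically, so the concurrent run is equivalent to one of the two sequential orders (E₁ ; E₂ or E₂ ; E₁). The deposit scenario is illustrative, not from the Research Agent.

```python
import threading

balance = 0
lock = threading.Lock()

def deposit(amount: int) -> None:
    global balance
    with lock:                      # synchronization: wait for the lock
        current = balance           # read
        balance = current + amount  # modify and write, no lost update

# Two concurrent executions, E1 and E2.
t1 = threading.Thread(target=deposit, args=(10,))
t2 = threading.Thread(target=deposit, args=(20,))
t1.start(); t2.start()
t1.join(); t2.join()

print(balance)  # 30, the same result as running the deposits sequentially
```

Without the lock, schedules where both threads read `balance` before either writes would be possible, and one deposit could be lost. The lock is the synchronization; ruling out those schedules is the coordination.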
The defining characteristic of distribution is partial failure. In a distributed system, we do not know what will fail next. To mitigate distribution, we employ recovery. Recovery refers to extending partial schedules to complete schedules, while supervision refers to enforcing that extension. The fundamental operation of supervision is to restart.
E | ⚡️ ≡ E
Ideally, an execution that may fail and restart should produce the same result as an execution that runs to completion without failure.
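The restart loop at the heart of supervision can be sketched directly. In this illustrative example (a simulated crash, not a real process failure), the execution is deterministic, so the supervised run that crashes three times produces the same result as an uninterrupted run: E | ⚡️ ≡ E.

```python
def flaky_execution(fail_times: int, state: dict) -> int:
    # Simulate a crash on the first `fail_times` attempts.
    state["attempts"] = state.get("attempts", 0) + 1
    if state["attempts"] <= fail_times:
        raise RuntimeError("process crashed")
    return sum(range(10))  # the deterministic result

def supervise(fail_times: int) -> int:
    # Supervision: restart the execution until it runs to completion.
    state: dict = {}
    while True:
        try:
            return flaky_execution(fail_times, state)
        except RuntimeError:
            continue  # the fundamental operation: restart

# A run that crashes three times equals a run that never crashes.
assert supervise(fail_times=3) == supervise(fail_times=0)
print(supervise(fail_times=3))  # 45
```

Real supervision is harder than this sketch: the restarted execution must not repeat effects the crashed one already performed, which is why determinism and idempotence matter so much in later chapters.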
3.1 Moving from the Terminal to the Cloud
Running applications locally is a joy. My laptop rarely restarts, my terminal window never crashes, and my terminal tabs stay open for days, sometimes weeks. The process running in that tab is still there, faithfully executing or patiently waiting for the next input, ready to continue as if no time has passed. Sure, in theory, the process could crash at any moment, but in practice, we hardly think about that possibility. Failure is an edge case, not the norm (see Figure 3.2).
Figure 3.2: A logical execution mapping one-to-one to a physical execution. One-to-one relationships simplify reasoning about objects: each uniquely identifies the other, effectively collapsing both into one.
Distributed systems shatter this simplicity. Your application no longer lives in one process but spans many processes across machines, data centers, and continents, executing over minutes, hours, days, or weeks. Failure is no longer an edge case but becomes the regular case when processes crash and networks partition (see Figure 3.3).
Figure 3.3: A logical execution mapping one-to-many to multiple physical executions. One-to-many relationships complicate reasoning about objects: one no longer identifies the other, effectively exploding one into many.
Before building the Research Agent, we need a model of computations that spans space and time.
When we think of distribution, we think of distribution in space: one execution invokes another execution at another location (another process or machine). The invoke creates a cut in space.
But we must also think of distribution in time: one execution awaits another execution, suspending while waiting and subsequently resuming. The await creates a cut in time.
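The two cuts are visible even in a single process. In this asyncio sketch (a local stand-in; in a distributed system the invoked execution would run in another process), `create_task` is the invoke that starts a separate execution, and `await` is the point where the caller suspends and later resumes.

```python
import asyncio

async def child() -> str:
    await asyncio.sleep(0)  # simulate work happening elsewhere
    return "result"

async def parent() -> str:
    handle = asyncio.create_task(child())  # invoke: the cut in space
    # ... the parent could do other work here, concurrently ...
    return await handle                    # await: the cut in time

print(asyncio.run(parent()))
```

Separating invoke from await is what makes the cuts explicit: between the two, the parent and child proceed independently, and the parent's suspension is a first-class point in its execution.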
3.1.1 Model of Computation
Computation is a graph with nodes representing executions and edges representing relationships or dependencies between executions.
For example, the root execution in Figure 3.1, representing the topic "What are distributed systems", spawns two executions representing the topics "What is scalability" and "What is reliability". Here, an edge between executions represents invoking another execution and awaiting its result.
The root of the graph has a dual nature: The root represents both the initial execution and the entire collection of executions spawned during its lifetime. This collection, the complete graph of executions, is often referred to as a workflow.
This pattern is recursive: every execution represents itself and the collection of executions spawned during its lifetime. Depending on context, we might refer to the complete graph as "a workflow", a subgraph as "a workflow", or, to emphasize the hierarchy, as "a sub-workflow" or "a child workflow".
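The graph model and its recursive dual nature are simple enough to encode directly. In this sketch (illustrative data structure, not a library API), an edge records that a parent invoked a child, and the "workflow" of any execution is the subgraph reachable from it, so every sub-workflow is itself a workflow.

```python
from collections import defaultdict

# Nodes are executions; an edge means the parent invoked (and awaits) the child.
edges: dict[str, list[str]] = defaultdict(list)

def spawn(parent: str, child: str) -> None:
    edges[parent].append(child)

def workflow(root: str) -> list[str]:
    # The workflow of an execution: itself plus everything it spawned,
    # recursively. Applied to the root, this is the complete graph.
    nodes = [root]
    for child in edges[root]:
        nodes.extend(workflow(child))
    return nodes

spawn("What are distributed systems", "What is scalability")
spawn("What are distributed systems", "What is reliability")
spawn("What is scalability", "What is vertical scaling")

print(workflow("What are distributed systems"))  # the complete workflow
print(workflow("What is scalability"))           # a sub-workflow
```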
Logical vs Physical View
Consider what happens when the Research Agent spawns an agent to research a subtopic. The logical execution—"What are distributed systems"—exists independently of any physical executions. Yet, the logical execution must map to a physical execution (a function running in a process running on a machine) to make progress. This mapping is not permanent. The logical execution may map to different physical executions over time. For example, when a crash occurs, recovery remaps the logical execution to a new physical execution.
One logical execution maps to one or more physical executions. We say that the logical execution is composed of its physical executions and the physical executions contribute to the logical execution.
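The split can be captured as a one-to-many mapping from a stable logical identity to a growing list of physical attempts. This is an illustrative sketch (the identifier scheme is invented for the example): on a crash, recovery appends a new physical execution, while the logical identity never changes.

```python
# Map each logical execution id to its physical executions (attempts).
attempts: dict[str, list[str]] = {}

def start(logical_id: str) -> str:
    # Start a new physical execution contributing to the logical execution.
    history = attempts.setdefault(logical_id, [])
    physical = f"{logical_id}#attempt-{len(history) + 1}"
    history.append(physical)
    return physical

first = start("research:distributed-systems")   # initial physical execution
# ... the process crashes ...
second = start("research:distributed-systems")  # recovery remaps the logical
                                                # execution to a new physical one

print(attempts["research:distributed-systems"])
```

Everything downstream keys off the logical id, not the physical one: results, memory, and coordination belong to the logical execution, which is how the agent keeps its identity across crashes.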
This logical–physical split underpins how we reason about identity, coordination, and recovery.