
Building a Multi-Agent PM Framework on Claude Code

May 20, 2026 · 7 min read

I have four projects in flight at once. A blog, a portfolio site, an OS prototype, and a personal-agent harness that’s supposed to coordinate the other three. Until Phase 15 of the harness, every cross-project decision came back to me: which one needs attention now, what’s the next phase, is this scope creep or actually the work. The harness was supposed to be the thing that absorbed that load. Instead it was a context-thrashing single agent that couldn’t hold more than one project in working memory at a time.

This post is about what we built to fix that, what it looks like in code, and what happened when we used it to plan the article you’re reading.

The Topology

Phase 15 of the Talon harness introduced a project-manager layer. Talon’s top-level Claude Code session — call it the Director — no longer plans projects itself. Instead it spawns a PM agent per active project. Each PM owns its project’s .planning/ directory, decomposes phases into plans, dispatches executor subagents, and reports back. The Director keeps a portfolio view: which PMs are alive, which are blocked, who’s burning budget, who needs human input.

Here is the actual topology of the four-project portfolio as of the day this post shipped. The bracketed annotations show the gates each Executor must pass before launching.

Director (Claude Code — talon session)
├── PM: brainbytes (THIS session — writing this post)
│   ├── Executor: gsd-roadmapper      [budget ✓  loop ✓  sibling ✓]
│   ├── Executor: gsd-planner          [budget ✓  loop ✓  sibling ✓]
│   ├── Executor: gsd-plan-checker     [budget ✓  loop ✓  sibling ✓]
│   ├── Executor: gsd-executor (01-01) [budget ✓  loop ✓  sibling ✓]
│   └── Executor: gsd-executor (01-02) [budget ✓  loop ✓  sibling ✓]
├── PM: evanmusick-dev
├── PM: vulcanos
└── (Talon harness)

Each PM is a standalone Claude Code session with an overlay file (~/Talon/shared/talon-pm-project.md) that defines its responsibilities, escalation triggers, and post-dispatch hooks. The PM cannot escape its overlay; the Director cannot directly write to a project’s files. The boundary is the contract.

The Mechanisms

The framework’s value isn’t the topology; that’s just a tree. The value is in four small shell scripts: three gates (budget, sibling, loop) and an escalation path. Every Executor dispatch from a PM must pass all three gates before launching. Read them as Unix tools: each does one thing, exits 1 on failure, and uses the filesystem as its database.
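The dispatch rule above can be sketched as a small wrapper. This is a minimal sketch, not the overlay's actual code: run_gates is a hypothetical name, and since the real scripts' arguments are not shown in this post, each gate is passed in as an opaque command line.

```bash
# Hypothetical sketch: run each gate command in order and stop at the
# first one that exits non-zero. Each argument is one gate command line.
run_gates() {
  local gate
  for gate in "$@"; do
    # A gate is just a command; a non-zero exit means "do not dispatch".
    if ! sh -c "$gate"; then
      echo "run_gates: refused by: $gate" >&2
      return 1
    fi
  done
  return 0
}

# Usage with stand-in commands (the real gate arguments are assumptions):
#   run_gates "./budget-check.sh brainbytes" \
#             "./sibling-guard.sh brainbytes phase-01-01-execute" \
#     && echo "all gates passed, safe to dispatch"
```

Because the loop returns on the first failure, later gates never run after a refusal, which keeps a tripped gate from having side effects downstream.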

budget-check.sh

Per-PM budget, stored as cents in var/pm/<pm>/budget.txt. The Director sets a cap when spawning the PM; the PM decrements after every Executor dispatch.

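# Excerpt: $f (the budget file), $pm, $mode, and $amount are set before this point.
# Refuse to run when the PM has no budget file.
[[ ! -f "$f" ]] && { echo "budget-check: no budget file at $f" >&2; exit 1; }

cur=$(cat "$f")
# --decrement: subtract the dispatch cost and persist the new balance.
if [[ "$mode" == "--decrement" ]]; then
  cur=$((cur - amount))
  echo "$cur" > "$f"
fi
# Trip the gate once the balance is at or below zero.
[[ "$cur" -le 0 ]] && { echo "budget-check: budget exhausted ($cur cents) for $pm" >&2; exit 1; }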

This PM was given 1500¢ ($15) for the whole phase. After planning, plan-checking, and the first two execution waves, the file reads 1320 — about a buck-eighty spent. The gate has not tripped. If it did, the PM would halt and escalate.
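The arithmetic is easy to check by hand, but storing cents as a bare integer means any report has to format it. A minimal sketch, assuming a non-negative integer balance (cents_to_dollars is a name I made up for illustration, not part of budget-check.sh):

```bash
# Format an integer number of cents as dollars, e.g. 1320 -> $13.20.
# Assumes a non-negative integer; an overdrawn balance is not handled.
cents_to_dollars() {
  local cents=$1
  printf '$%d.%02d\n' "$((cents / 100))" "$((cents % 100))"
}

cap=1500        # cap set by the Director, in cents
remaining=1320  # value in budget.txt at the time of writing
spent=$((cap - remaining))

cents_to_dollars "$spent"      # prints $1.80
cents_to_dollars "$remaining"  # prints $13.20
```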

sibling-guard.sh

A file-as-mutex. Before dispatching an Executor on task_id, the PM creates var/pm/<pm>/active-tasks/<task_id>. The guard refuses any sibling dispatch on the same task_id while the lock exists. After the Executor returns, the PM removes the lock.

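# Excerpt: $root, $pm, and $task are set before this point.
lock="$root/$pm/active-tasks/$task"
# Read-only check: the script never creates or removes the lock itself.
[[ -e "$lock" ]] && {
  echo "sibling-guard: sibling executor active on $task (lock: $lock)" >&2
  exit 1
}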

The mutex is decoupled from the gate that reads it. The overlay (not the script) is responsible for touching the lock at dispatch and rm-ing it at completion. This is a deliberate split: the script stays pure read-only, so it can run inside any pre-commit-like environment. The state-mutation logic lives in the orchestration layer where it can be audited as part of the PM’s behavior contract.
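The overlay side of that split can be sketched as a wrapper that takes the lock, runs the Executor, and releases the lock. This is an illustration under assumptions, not the overlay's real logic: with_task_lock is a hypothetical name, and I have swapped the plain touch for a noclobber redirect so that "check" and "create" happen in one step.

```bash
# Hypothetical sketch of the overlay-side lock lifecycle.
# Usage: with_task_lock LOCKFILE CMD [ARGS...]
with_task_lock() {
  local lock=$1 status
  shift
  mkdir -p "$(dirname "$lock")"
  # With noclobber set, the redirect fails if the file already exists,
  # so two dispatchers cannot both believe they created the lock.
  if ! ( set -o noclobber; : > "$lock" ) 2>/dev/null; then
    echo "with_task_lock: lock already held: $lock" >&2
    return 1
  fi
  "$@"
  status=$?
  rm -f "$lock"   # release even when the command failed
  return "$status"
}
```

If the session is killed between the create and the rm, the lock file stays behind. The wrapper does nothing about that case.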

loop-check.sh

The PM keeps a running tally of (task_id, executor, failure_category) tuples in var/pm/<pm>/loops.txt. Each Executor dispatch increments the count for its tuple. At ≥3 retries of the same tuple, the gate trips.

v1.5-roadmap         gsd-roadmapper      init  1
phase-01-plan        gsd-planner         init  1
phase-01-plan-check  gsd-plan-checker    init  1
phase-01-01-execute  gsd-executor        init  1
phase-01-02-execute  gsd-executor        init  1

That is the actual loops.txt for this session at the time of writing. Five rows, all count=1. Nothing retried; nothing looped. If wave 1 had failed and we’d retried, the executor row would say count=2. A third retry on the same failure category and the gate would refuse and escalate — saving the budget and forcing a human-in-loop decision.
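For a sense of how little machinery the tally needs, here is a minimal sketch of the increment and the check. It is not the real loop-check.sh: the function names are hypothetical, and I am assuming the gate trips once the count column reaches 3.

```bash
# Hypothetical sketch of a loops.txt-style tally.
# Columns: task_id  executor  failure_category  count

# Print the current count for a tuple (0 if the tuple is absent).
loop_count() {
  awk -v t="$2" -v e="$3" -v c="$4" \
    '$1 == t && $2 == e && $3 == c { n = $4 } END { print n + 0 }' "$1"
}

# Increment the tuple's count, rewrite the file, and print the new count.
loop_bump() {
  local file=$1 task=$2 executor=$3 category=$4 count tmp
  [ -f "$file" ] || : > "$file"
  count=$(( $(loop_count "$file" "$task" "$executor" "$category") + 1 ))
  tmp=$(mktemp) || return 1
  # Copy every other row through unchanged, then append the updated row.
  awk -v t="$task" -v e="$executor" -v c="$category" \
    '!($1 == t && $2 == e && $3 == c)' "$file" > "$tmp"
  printf '%s  %s  %s  %s\n' "$task" "$executor" "$category" "$count" >> "$tmp"
  mv "$tmp" "$file"
  echo "$count"
}

# Succeed while the tuple is under the limit; fail once it reaches it.
# LOOP_LIMIT defaults to 3 (an assumption about the threshold).
loop_check() {
  local limit=${LOOP_LIMIT:-3}
  [ "$(loop_count "$@")" -lt "$limit" ]
}
```

One side effect of this version is that an updated row moves to the end of the file, so row order reflects the most recent dispatch rather than the first.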

escalate.sh

When something needs the Director’s attention — scope creep, budget exhaustion, a blocking decision the PM can’t make alone — the PM calls escalate.sh. The script writes to three channels.

# Channel 1: file (hard-fail).
cat > "$f" <<EOF
# Escalation: $pm @ $ts
**Reason:** $reason
**PM:** $pm
EOF

# Channel 2: Telegram marker (best-effort).
# Channel 3: agent-comm send-file to Director session (best-effort).

Channel 1 is the source of truth — a markdown file on disk under var/pm/<pm>/escalations/. The other two channels are notifications. The Director might be in another session, asleep, or busy with a different PM. The file persists. The PM has done its job once the file lands.
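The hard-fail versus best-effort split is the part worth copying, so here is a minimal sketch of just that policy. Everything named here is an assumption for illustration: notify_telegram and notify_director are placeholder stubs rather than the real integrations, and the file naming is invented.

```bash
# Placeholder stubs (hypothetical): swap in real notifiers.
notify_telegram() { return 1; }   # pretend Telegram is unreachable
notify_director() { return 0; }   # pretend the send succeeded

# Usage: escalate ESCALATION_DIR PM REASON
# Prints the path of the escalation file on success.
escalate() {
  local dir=$1 pm=$2 reason=$3 ts f
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  f="$dir/$pm-$ts.md"
  # Channel 1: hard-fail. No file on disk means the escalation failed.
  mkdir -p "$dir" || return 1
  printf '# Escalation: %s @ %s\n**Reason:** %s\n**PM:** %s\n' \
    "$pm" "$ts" "$reason" "$pm" > "$f" || return 1
  # Channels 2 and 3: best-effort. Failures are logged, never fatal.
  notify_telegram "$f" || echo "escalate: telegram notify failed" >&2
  notify_director "$f" || echo "escalate: director notify failed" >&2
  echo "$f"
}
```

With the stubs above, the Telegram call fails and the function still returns 0, because the file landed.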

There is a principle hiding in this design: state lives in the filesystem, behavior lives in scripts, decisions live in agents. Each layer is independently inspectable. Want to see what a PM has been doing? ls var/pm/brainbytes/. Want to know if a gate is working? Read the bash. Want to know why a particular escalation happened? Read the markdown file escalate.sh wrote.
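Because the state is just files, a portfolio view is a loop over directories. A minimal sketch, assuming the layout described in this post (budget.txt, active-tasks/, escalations/ under each PM); pm_summary and its output format are invented for illustration.

```bash
# Hypothetical sketch: print one summary line per PM under a var/pm-style root.
pm_summary() {
  local root=$1 dir pm budget locks escalations
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    pm=$(basename "$dir")
    # A missing budget file is shown as "?" rather than treated as an error.
    budget=$(cat "$dir/budget.txt" 2>/dev/null || echo "?")
    locks=$(find "$dir/active-tasks" -type f 2>/dev/null | wc -l | tr -d ' ')
    escalations=$(find "$dir/escalations" -type f 2>/dev/null | wc -l | tr -d ' ')
    printf '%s budget=%s locks=%s escalations=%s\n' \
      "$pm" "$budget" "$locks" "$escalations"
  done
}

# Usage: pm_summary var/pm
```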

Premortem

Before any of this shipped, we ran a structured premortem. The goal: identify the failure modes that would erode trust in the framework, then design them out before the first real run.

Five failure modes came up. Three got designed out, two got accepted as known limitations:

  • Context accumulation in long PM sessions. Mitigated: re-anchoring cadence built into the overlay. Every 4 hours OR every 10 Executor dispatches, the PM re-reads its charter and STATE.md and posts a brief status if drift is detected. Silent otherwise.
  • Executor diverging from plan without PM noticing. Mitigated: gsd-plan-checker runs after the planner, before any Executor launches. A plan that doesn’t honor the requirement IDs gets sent back for revision (up to 3 iterations).
  • Stale lock file after a crashed session. Accepted: the overlay treats sibling-guard as advisory. If the lock is stale, the PM can detect this by reading the mtime and the running-process state; for now the Director clears stale locks manually. Not worth the engineering to auto-recover yet.
  • Budget cap too low for complex phases. Mitigated: the Director can raise the cap via cap-raise.sh (already tested). The gate is a circuit breaker, not a ceiling.
  • Tier-3 destructive operations from a PM session. Accepted with hard block: git reset --hard, git push --force, and similar are blocked by a global hook regardless of mode. A PM that needs to do one of these escalates to the Director, who runs it in their terminal.
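Two of the items above reduce to small checks, sketched here under assumptions. Neither function exists in the framework as far as this post shows: should_reanchor encodes the "4 hours OR 10 dispatches" cadence, and lock_is_stale covers only the mtime half of stale-lock detection, with a 240-minute default that I picked arbitrarily.

```bash
# Hypothetical sketch: re-anchor when 4 hours have passed OR 10 dispatches
# have happened since the last re-anchor. Times are epoch seconds.
should_reanchor() {
  local last=$1 now=$2 dispatches=$3
  [ $((now - last)) -ge $((4 * 3600)) ] || [ "$dispatches" -ge 10 ]
}

# Hypothetical sketch: a lock is stale when its mtime is older than
# MAX_MIN minutes (default 240, an arbitrary choice). Checking whether
# the owning process is still running is left out.
lock_is_stale() {
  local lock=$1 max_min=${2:-240}
  [ -e "$lock" ] || return 1
  [ -n "$(find "$lock" -mmin +"$max_min" 2>/dev/null)" ]
}

# Usage:
#   should_reanchor "$last_anchor_epoch" "$(date +%s)" "$dispatch_count" && echo "re-anchor"
#   lock_is_stale var/pm/brainbytes/active-tasks/phase-01-01-execute && echo "stale"
```

The mtime test alone can misfire on a long-running Executor, which is one reason to keep a human clearing locks for now.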

That last one fired during this session. A planning commit landed on the wrong branch. The cleanup wanted a git reset --hard — and the hook refused. The PM escalated, the Director gave a one-paragraph approval, the human ran the reset. Working as designed: the framework is willing to lose a few minutes of wall-clock to keep the destructive operations under human eyes.
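The hook itself isn't shown in this post, so what follows is only a sketch of the classification step under my own assumptions: is_destructive_git is a hypothetical name, and it matches just the two commands named above (plus the -f short form of --force), not whatever the real pattern list contains.

```bash
# Hypothetical sketch: succeed (exit 0) when a command line looks like a
# blocked destructive git operation, fail (exit 1) otherwise.
is_destructive_git() {
  # Pad with spaces so patterns can match on word boundaries.
  case " $1 " in
    *" git reset"*" --hard"*) return 0 ;;
    *" git push"*" --force"*) return 0 ;;  # also catches --force-with-lease
    *" git push"*" -f "*)     return 0 ;;
    *)                        return 1 ;;
  esac
}

# Usage:
#   is_destructive_git "$cmd" && { echo "blocked: escalate to the Director" >&2; exit 1; }
```

Substring matching like this is deliberately over-broad. For a block on destructive operations, a false positive costs an escalation, while a false negative costs a branch.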

The Live Test

The first real run of this framework is the one writing these words.

When the Director spawned this PM session, the brief was: “Open the BBL /lab section, write the inaugural lab post about Phase 15.” The PM ran /gsd:progress, found the project between milestones, recommended completing v1.4 and opening v1.5, and got the Director’s approval. Then it ran the full workflow: complete-milestone, new-milestone, plan-phase, execute-phase. Five Executor dispatches across two waves. One real escalation. One real branch-hygiene problem the PM caught before the Director did. About $13 of the $15 phase budget still in the bank at the point this paragraph was being written.

Honest report on what worked: the gate scripts are unglamorous and fast. Every dispatch was ≤2 seconds of overhead before the Executor launched. The var/pm/brainbytes/ directory makes the PM’s history grep-able. After two days I can reconstruct exactly what the PM decided and when, without re-reading the chat log.

Honest report on what was rough: the GSD tooling’s phase-directory discovery hit stale state from the previous milestone and tried to dispatch the planner against the wrong directory. The PM had to detect this manually (init plan-phase returned a path that didn’t match the new phase) and archive the old phase dirs before re-init. This is a tooling issue, not a framework issue, but it cost about ten minutes and is the kind of thing a less attentive PM would have shipped past. Future work: have the milestone-complete CLI archive phase directories by default, or at least offer a flag to do so.

Honest report on what surprised me: the PM session, working from a tightly-scoped charter, made better local decisions than I expected. Faced with two routing paths for a new milestone (add-phase vs new-milestone), it picked the higher-ceremony option that matched the charter framing, documented the trade-off, and asked the Director to confirm. That’s the behavior I wanted but didn’t think I’d get on day one.

Lessons

If you are building anything like this, here are the four patterns I would carry to the next system.

State in files. Logic in scripts. Decisions in agents. Each layer is independently inspectable, independently testable, and independently replaceable. An ls var/pm/ tells you the state of the system in two seconds.

Gates are exit codes, not advice. A PM that “considers” the budget is a PM that ignores the budget. The script exits 1, the dispatch refuses to launch, and the PM has to actually deal with the failure. This is the difference between policy and enforcement.

Three-channel escalation, hard-fail on the first. File on disk is the source of truth. Telegram and inter-session messaging are notifications. The PM has done its job the moment the markdown file exists. Anything else is the Director’s responsibility.

Make the framework live in its own dogfooding. Phase 15 doesn’t get marked shipped until the first PM session runs end-to-end against a real project. This post is the artifact of that test. If you can’t dogfood it, you don’t actually know what you’ve built.

For more on AI agent behavior, see the articles section.

Written by

Evan Musick

Computer Science & Data Science student at Missouri State University. Building at the intersection of AI, software development, and human cognition.