SOP-010: Agent Management (Spawn, Monitor, Circuit Breaker)

Purpose

Manage ADLC subagents: spawning, monitoring, tracking results, handling failures, and applying the circuit breaker pattern. Ensures efficient resource usage and prevents runaway failures.

Scope

Applies to the ADLC orchestrator managing all subagent types: dev, ba, architect, security, devops, pr, deploy, e2e-test, scenario, auditor, resolver, provisioner.

Prerequisites

Orchestrator running in ods-claude tmux session
Available memory > 2000MB for agent spawning
Agent definitions in ~/.claude/agents/
CLI tools at ~/dev/ops/adlc-v2/scripts/cli/
Slack credentials for alerting

Procedure

1. Pre-spawn resource check

Always check memory before spawning:

MEM=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
echo "Available memory: ${MEM}MB"

Available memory	Action
> 2000MB	Spawn freely; launch all pending work in parallel
1000-2000MB	Queue new spawns; wait for running agents to finish
< 1000MB	Do NOT spawn; post CRITICAL to Slack DM
< 512MB	Do NOT kill running agents, but queue ALL new spawns

Critical rule: Do NOT artificially limit to 1-2 agents when RAM is available. Do NOT sleep between spawns. Check memory once, then launch everything.
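
The thresholds above can be sketched as a small shell helper. This is a minimal sketch, not part of the ADLC CLI: the function names spawn_decision and available_mb are hypothetical, and everything below 1000MB is collapsed into a single "hold" result.

```shell
# Map available memory (MB) to a spawn decision, following the table above.
# spawn_decision and available_mb are hypothetical helper names.
spawn_decision() {
  local mem_mb="$1"
  if [ "$mem_mb" -gt 2000 ]; then
    echo "spawn"   # launch all pending work in parallel
  elif [ "$mem_mb" -ge 1000 ]; then
    echo "queue"   # wait for running agents to finish
  else
    echo "hold"    # do not spawn; post CRITICAL to Slack DM
  fi
}

# Read MemAvailable from /proc/meminfo (Linux only) and convert kB to MB.
available_mb() {
  awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo
}
```

Example: DECISION=$(spawn_decision "$(available_mb)"), evaluated once before launching the whole pending batch.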

Exception: Max 1 Rust cargo build at a time (lesson from 2026-03-20 OOM crash). Each Rust compilation uses 500MB-1.5GB.
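
One way to enforce the single-build rule is an atomic lock directory. The sketch below is an assumption about how this could be done, not the orchestrator's actual mechanism; LOCK_DIR, acquire_rust_lock, and release_rust_lock are hypothetical names.

```shell
# Serialize Rust builds with an atomic mkdir lock.
# LOCK_DIR and both function names are assumptions for illustration.
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/adlc-rust-build.lock}"

acquire_rust_lock() {
  # mkdir fails if the directory already exists, so only one caller wins.
  mkdir "$LOCK_DIR" 2>/dev/null
}

release_rust_lock() {
  # Remove the lock so the next queued build can start.
  rmdir "$LOCK_DIR" 2>/dev/null
}
```

A dev agent would run cargo build only after acquire_rust_lock succeeds and call release_rust_lock afterwards. A lock left behind by a killed build has to be removed manually.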

2. Spawn agents by type

Dev agents (one task per agent)

/agent dev "SERVICE: {service}. PROJECT: {project}. TASK: {task_id} -- {description}. Spec: ~/dev/specs/{project}/specs/{service}/spec.md. Focus ONLY on this task."

Review agents (spawn in parallel after tests pass)

/agent ba "SERVICE: {service}. PROJECT: {project}. Review against spec. Write JSON to ~/dev/ops/reviews/{service}/ba-report.json."
/agent architect "SERVICE: {service}. PROJECT: {project}. Write JSON to ~/dev/ops/reviews/{service}/architect-report.json"
/agent security "SERVICE: {service}. PROJECT: {project}. Write JSON to ~/dev/ops/reviews/{service}/security-report.json"
/agent devops "SERVICE: {service}. PROJECT: {project}. MODE: review. Write JSON to ~/dev/ops/reviews/{service}/devops-report.json"

PR agent

/agent pr "SERVICE: {service}. PROJECT: {project}. Reviews in ~/dev/ops/reviews/{service}/. Create PR and merge to staging."

Deploy agent

/agent devops "SERVICE: {service}. PROJECT: {project}. MODE: deploy. Config: ~/dev/ops/coolify/{service}.json. Verify health check."

E2E agents (sequential: scenario first, then test)

/agent scenario "SERVICE: {service}. PROJECT: {project}. Generate E2E scenarios. Write to ~/dev/projects/{service}/tests/e2e/"

Wait for completion, then:

/agent e2e-test "SERVICE: {service}. PROJECT: {project}. Execute E2E tests. Scenarios: ~/dev/projects/{service}/tests/e2e/"
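
The scenario-then-test ordering can be gated on the scenario agent's status file. This is a minimal sketch under two assumptions: the file is named {service}-scenario.status (following the status-file path in section 3), and its first line starts with the status keyword. scenario_done is a hypothetical helper name.

```shell
# Return success only if the scenario agent's status file reports DONE.
# OUTPUTS_DIR default follows the status-file path used in this SOP.
OUTPUTS_DIR="${OUTPUTS_DIR:-$HOME/dev/ops/outputs}"

scenario_done() {
  local service="$1"
  local status_file="$OUTPUTS_DIR/${service}-scenario.status"
  # No status file yet means the scenario agent has not reported.
  [ -f "$status_file" ] || return 1
  # Assumes the first line starts with the status keyword.
  case "$(head -n 1 "$status_file")" in
    DONE*) return 0 ;;
    *)     return 1 ;;
  esac
}
```

The check is non-blocking: if it fails, leave the e2e-test spawn for the next cycle.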

Auditor (every 6 hours)

/agent auditor "Audit all active projects. Check pipeline sequence, review quality, registry, test coverage. Write to ~/dev/ops/reviews/"

3. Monitor agent completion

After spawning, agents write their outputs to:

Status files: ~/dev/ops/outputs/{service}-{agent}.status
JSON reports: ~/dev/ops/reviews/{service}/{agent}-report.json
Pipeline state: ~/.claude/agent-memory/pipeline/state.md
Lessons learned: ~/dev/ops/lessons-learned.md

Check for completion:

# Check all status files for a service
for f in ~/dev/ops/outputs/{service}-*.status; do
  # Skip the unexpanded glob when no status files exist; quote paths defensively
  [ -f "$f" ] && echo "$(basename "$f"): $(head -n 1 "$f")"
done

Validate status file integrity:

bash ~/dev/ops/adlc-v2/scripts/validate-status.sh

4. Handle agent results

Status	Meaning	Next Action
DONE	Agent completed successfully	Advance pipeline
FAILED	Agent found issues	Read report, spawn fix
RUNNING	Agent still working	Wait, check again next cycle
BLOCKED	Agent cannot proceed	Analyze cause, escalate
BLOCKED_EXTERNAL	Missing external dep	Follow SOP-007
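
The table above can be expressed as a dispatch function. This is a sketch only: next_action and the action labels it prints are hypothetical, and anything outside the five listed keywords is treated as invalid.

```shell
# Map a status keyword to the next pipeline action from the table above.
# next_action and its output labels are hypothetical names.
next_action() {
  case "$1" in
    DONE)             echo "advance" ;;    # advance pipeline
    FAILED)           echo "spawn-fix" ;;  # read report, spawn fix
    RUNNING)          echo "wait" ;;       # check again next cycle
    BLOCKED)          echo "escalate" ;;   # analyze cause, escalate
    BLOCKED_EXTERNAL) echo "sop-007" ;;    # follow SOP-007
    *)                echo "invalid"; return 1 ;;
  esac
}
```

An invented keyword such as SECURITY_PASS prints "invalid" and returns non-zero, so the caller can stop instead of guessing.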

Read JSON reports to determine next steps:

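python3 -c "
import glob, json, os
# Print each review report's verdict (falls back to 'status', then '?')
for path in sorted(glob.glob('$HOME/dev/ops/reviews/{service}/*-report.json')):
    with open(path) as fh:
        report = json.load(fh)
    verdict = report.get('verdict', report.get('status', '?'))
    print(f'{os.path.basename(path)}: {verdict}')
"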

5. Circuit breaker (3-strike rule)

Track retries per service per agent type:

RETRY_FILE="$HOME/.claude/agent-memory/pipeline/retries-{service}-{agent}.txt"
RETRIES=$(cat "$RETRY_FILE" 2>/dev/null || echo "0")

On each failure:

RETRIES=$((RETRIES + 1))
echo "$RETRIES" > "$RETRY_FILE"

if [ "$RETRIES" -ge 3 ]; then
  echo "CIRCUIT BREAKER: {service}/{agent} failed 3 times"
  # Mark BLOCKED
  CLI="$HOME/dev/ops/adlc-v2/scripts/cli"
  bash "$CLI/write-status.sh" {service} {agent} BLOCKED "Circuit breaker: 3 failures"

  # Post to Slack DM immediately
  source ~/.env.adlc
  curl -sf -X POST "https://slack.com/api/chat.postMessage" \
    -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(python3 -c "
import json
print(json.dumps({
    'channel': 'D0AGRAVEC1K',
    'text': ':rotating_light: CIRCUIT BREAKER -- {service}/{agent} failed 3 times\nLast error: {error_description}\nAction needed: Manual investigation required'
}))
")"
fi

On success: Reset the retry counter:

echo "0" > "$RETRY_FILE"
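
The counter logic above can be wrapped in two helpers so that every caller increments and resets the same way. This is a minimal sketch: retry_file, record_failure, and record_success are hypothetical names, and the BLOCKED status write and Slack alert are left to the caller.

```shell
# RETRY_DIR default follows the retry-file path used above.
RETRY_DIR="${RETRY_DIR:-$HOME/.claude/agent-memory/pipeline}"
MAX_RETRIES=3

# Build the per-service, per-agent counter path.
retry_file() { echo "$RETRY_DIR/retries-$1-$2.txt"; }

# Increment the counter; return non-zero once MAX_RETRIES is reached.
record_failure() {
  local file count
  file="$(retry_file "$1" "$2")"
  count="$(cat "$file" 2>/dev/null || echo 0)"
  count=$((count + 1))
  echo "$count" > "$file"
  [ "$count" -lt "$MAX_RETRIES" ]
}

# Reset the counter after a successful run.
record_success() {
  echo 0 > "$(retry_file "$1" "$2")"
}
```

Usage: record_failure {service} {agent} || trip the breaker (write BLOCKED, post to Slack DM).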

6. Handle common agent failure patterns

Pattern	Cause	Fix
Agent writes JSON to status file	Agent captured API response and wrote it raw	Use CLI tools only (lesson from 2026-03-21)
Agent invents status keywords	Agent uses SECURITY_PASS, SCENARIOS_READY, etc.	Use validate-status.sh --fix (lesson from 2026-03-23)
Agent cannot find spec	Wrong path convention	Glob for spec: ~/dev/specs/*/spec* (lesson from 2026-03-23)
Agent OOM killed	Concurrent Rust builds	Max 1 Rust build at a time (lesson from 2026-03-20)
BA marks pending tasks as MISSING	BA doesn’t know which tasks are pending	Include completed task list in BA agent prompt
Dev agent leaves stub module	Stub not replaced when real module added	Agent prompt must include “remove stub, update all imports” (lesson from 2026-03-23)
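
The first two patterns (raw JSON in a status file, invented keywords) can be caught with a quick pre-check. The sketch below is not a replacement for validate-status.sh; it assumes the first whitespace-delimited token of the first line is the status keyword and that the five keywords from section 4 are the complete set. status_file_ok is a hypothetical name.

```shell
# Succeed only if the file's first token is a known status keyword.
# Assumes the keyword set from section 4 is complete.
status_file_ok() {
  local file="$1" first=""
  [ -f "$file" ] || return 1
  # read returns non-zero on a final line without a newline but still fills $first.
  read -r first _ < "$file" || true
  case "$first" in
    DONE|FAILED|RUNNING|BLOCKED|BLOCKED_EXTERNAL) return 0 ;;
    *) return 1 ;;
  esac
}
```

A file starting with { or with SECURITY_PASS fails the check and should be repaired with validate-status.sh --fix.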

7. Parallel execution rules

Independent work: spawn all at once. This covers dev tasks across different services and review agents for the same service.
Sequential dependencies: Scenario agent MUST complete before E2E test agent. Tests MUST pass before reviews.
Never serialize when RAM is available. Check memory once, launch everything.
Never sleep between spawns: no sleep calls between agent launches.

8. Status file format enforcement

All agents MUST use CLI tools for output:

CLI="$HOME/dev/ops/adlc-v2/scripts/cli"
bash "$CLI/write-status.sh" {service} {agent} {STATUS} "{details}"
bash "$CLI/write-review.sh" {service} {agent_type} < report.json
bash "$CLI/write-pipeline-state.sh" {project} {service} {STATE} "{details}"
bash "$CLI/write-lesson.sh" {agent} {service} "{problem}" "{root_cause}" "{fix}" "{prevention}"

NEVER write these files directly with the Write or Edit tool. The CLI validates the format and rejects invalid input.

Verification

All spawned agents have corresponding status files: ls ~/dev/ops/outputs/{service}-*.status
No stale RUNNING status (agent completed but status not updated)
Validate format: bash ~/dev/ops/adlc-v2/scripts/validate-status.sh
Retry counters are reasonable: cat ~/.claude/agent-memory/pipeline/retries-*.txt 2>/dev/null
Memory is healthy: awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo prints a value above 2000 (MB)
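
The stale RUNNING check can be sketched as a scan for old status files that still report RUNNING. The 60-minute default is an assumption, since this SOP does not define a staleness threshold, and stale_running is a hypothetical name. It relies on the -maxdepth and -mmin options of find, which GNU and BSD find provide.

```shell
# List status files that still say RUNNING but were last modified
# more than N minutes ago. The 60-minute default is an assumption.
stale_running() {
  local dir="$1" minutes="${2:-60}" f
  find "$dir" -maxdepth 1 -name '*.status' -mmin "+$minutes" 2>/dev/null |
    while IFS= read -r f; do
      # Assumes the first line starts with the status keyword.
      case "$(head -n 1 "$f")" in
        RUNNING*) echo "$f" ;;
      esac
    done
}
```

Usage: stale_running ~/dev/ops/outputs 60 prints one path per line; empty output means no stale RUNNING status was found.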

Rollback

If an agent corrupts state:

1. Validate and fix status files: bash ~/dev/ops/adlc-v2/scripts/validate-status.sh --fix
2. Reset retry counter: echo "0" > ~/.claude/agent-memory/pipeline/retries-{service}-{agent}.txt
3. Clear stale RUNNING status: re-run the agent or manually set to previous stable state
4. If agent corrupted source code: cd ~/dev/projects/{service} && git checkout dev -- .

References

CLI tools: ~/dev/ops/adlc-v2/scripts/cli/
Agent definitions: ~/.claude/agents/
Status format: ~/dev/ops/adlc-v2/scripts/validate-status.sh
Dispatcher: ~/dev/ops/adlc-v2/scripts/dispatcher-v3.sh (write_status helper)
Pipeline scanner: ~/.claude/skills/check-pipeline/SKILL.md
Lesson: OOM from concurrent Rust builds (2026-03-20) – max 1 build at a time
Lesson: Status file corruption (2026-03-21, 2026-03-22, 2026-03-23) – use CLI tools
Lesson: BA FAIL blocking dev (2026-03-21) – mark unimplemented tasks as N/A
Lesson: Spec not found (2026-03-23) – Glob for spec files
Slack channels: DM (D0AGRAVEC1K) for blockers, ADLC (C0AN0N8AUGZ) for milestones