SOP-010: Agent Management (Spawn, Monitor, Circuit Breaker)

Purpose

Manage ADLC subagents: spawning, monitoring, tracking results, handling failures, and applying the circuit breaker pattern. Ensures efficient resource usage and prevents runaway failures.

Scope

Applies to the ADLC orchestrator managing all subagent types: dev, ba, architect, security, devops, pr, deploy, e2e-test, scenario, auditor, resolver, provisioner.

Prerequisites

Procedure

1. Pre-spawn resource check

Always check memory before spawning:

MEM=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
echo "Available memory: ${MEM}MB"
| Memory | Action |
|---|---|
| > 2000MB | Spawn freely, launch all pending work in parallel |
| 1000-2000MB | Queue new spawns, wait for running agents to finish |
| < 1000MB | Do NOT spawn, post CRITICAL to Slack DM |
| < 512MB | Kill no agents but queue ALL new spawns |
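The thresholds above can be turned into a single decision function. This is a minimal sketch, not part of the ADLC scripts: `spawn_policy` and the policy names it prints are hypothetical. It assumes the "< 512MB" row takes precedence over "< 1000MB" when both match. The source does not say whether the CRITICAL alert also applies below 512MB.

```shell
# Hypothetical helper: map available memory (MB) to a spawn policy.
# Assumption: the most specific matching row of the table wins.
spawn_policy() {
  mem_mb=$1
  if [ "$mem_mb" -gt 2000 ]; then
    echo "SPAWN_ALL"        # launch all pending work in parallel
  elif [ "$mem_mb" -ge 1000 ]; then
    echo "QUEUE"            # wait for running agents to finish
  elif [ "$mem_mb" -ge 512 ]; then
    echo "NO_SPAWN_ALERT"   # do not spawn, post CRITICAL to Slack DM
  else
    echo "QUEUE_ALL"        # kill nothing, queue every new spawn
  fi
}
```

With the `MEM` variable from the check above, `spawn_policy "$MEM"` prints the policy name for the current reading.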

Critical rule: Do NOT artificially limit to 1-2 agents when RAM is available. Do NOT sleep between spawns. Check memory once, then launch everything.

Exception: Max 1 Rust cargo build at a time (lesson from 2026-03-20 OOM crash). Each Rust compilation uses 500MB-1.5GB.
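One way to enforce the single-Rust-build rule is a lock directory, since `mkdir` either creates the directory or fails atomically. The sketch below is an illustration only: the lock path and the function names are assumptions, and the source does not describe how the orchestrator enforces this rule.

```shell
# Hypothetical lock location; override RUST_LOCK_DIR to relocate it.
RUST_LOCK_DIR="${RUST_LOCK_DIR:-${TMPDIR:-/tmp}/adlc-rust-build.lock}"

# mkdir fails if the directory already exists, so only one caller wins.
acquire_rust_lock() {
  mkdir "$RUST_LOCK_DIR" 2>/dev/null
}

release_rust_lock() {
  rmdir "$RUST_LOCK_DIR" 2>/dev/null
}

# Run the given command only if no other Rust build holds the lock.
# Returns 1 without running anything when the lock is taken.
run_rust_build() {
  if ! acquire_rust_lock; then
    echo "Rust build already running; queue this one" >&2
    return 1
  fi
  rc=0
  "$@" || rc=$?
  release_rust_lock
  return "$rc"
}
```

A call might look like `run_rust_build cargo build --release`. A build killed mid-run (for example by the OOM killer) leaves the lock directory behind, so a real setup would also need stale-lock cleanup.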

2. Spawn agents by type

Dev agents (one task per agent)

/agent dev "SERVICE: {service}. PROJECT: {project}. TASK: {task_id} -- {description}. Spec: ~/dev/specs/{project}/specs/{service}/spec.md. Focus ONLY on this task."

Review agents (spawn in parallel after tests pass)

/agent ba "SERVICE: {service}. PROJECT: {project}. Review against spec. Write JSON to ~/dev/ops/reviews/{service}/ba-report.json."
/agent architect "SERVICE: {service}. PROJECT: {project}. Write JSON to ~/dev/ops/reviews/{service}/architect-report.json"
/agent security "SERVICE: {service}. PROJECT: {project}. Write JSON to ~/dev/ops/reviews/{service}/security-report.json"
/agent devops "SERVICE: {service}. PROJECT: {project}. MODE: review. Write JSON to ~/dev/ops/reviews/{service}/devops-report.json"

PR agent

/agent pr "SERVICE: {service}. PROJECT: {project}. Reviews in ~/dev/ops/reviews/{service}/. Create PR and merge to staging."

Deploy agent

/agent devops "SERVICE: {service}. PROJECT: {project}. MODE: deploy. Config: ~/dev/ops/coolify/{service}.json. Verify health check."

E2E agents (sequential: scenario first, then test)

/agent scenario "SERVICE: {service}. PROJECT: {project}. Generate E2E scenarios. Write to ~/dev/projects/{service}/tests/e2e/"

Wait for completion, then:

/agent e2e-test "SERVICE: {service}. PROJECT: {project}. Execute E2E tests. Scenarios: ~/dev/projects/{service}/tests/e2e/"
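The "wait for completion" step between the scenario and e2e-test agents could be automated by polling the scenario agent's status file. This is a sketch under two assumptions: that the first line of a status file begins with the status keyword, and that the scenario agent writes to ~/dev/ops/outputs/{service}-scenario.status. `wait_for_done` is a hypothetical helper, not an ADLC script.

```shell
# Hypothetical helper: poll a status file until the agent finishes.
# $1 = status file, $2 = max polls, $3 = seconds between polls.
# Returns 0 on DONE, 2 on FAILED or BLOCKED*, 1 if polls run out.
wait_for_done() {
  polls=0
  while [ "$polls" -lt "$2" ]; do
    if [ -f "$1" ]; then
      first_line=$(head -n 1 "$1")
      case "$first_line" in
        DONE*) return 0 ;;
        FAILED*|BLOCKED*) return 2 ;;
      esac
    fi
    polls=$((polls + 1))
    sleep "$3"
  done
  return 1
}
```

For example, `wait_for_done ~/dev/ops/outputs/{service}-scenario.status 60 30` would check every 30 seconds for up to 30 minutes. The e2e-test agent is spawned only when it returns 0.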

Auditor (every 6 hours)

/agent auditor "Audit all active projects. Check pipeline sequence, review quality, registry, test coverage. Write to ~/dev/ops/reviews/"

3. Monitor agent completion

After spawning, agents write their outputs to:

- Status files: ~/dev/ops/outputs/{service}-{agent}.status
- JSON reports: ~/dev/ops/reviews/{service}/{agent}-report.json
- Pipeline state: ~/.claude/agent-memory/pipeline/state.md
- Lessons learned: ~/dev/ops/lessons-learned.md

Check for completion:

# Check all status files for a service
for f in ~/dev/ops/outputs/{service}-*.status; do
  [ -f "$f" ] && echo "$(basename "$f"): $(head -n 1 "$f")"
done

Validate status file integrity:

bash ~/dev/ops/adlc-v2/scripts/validate-status.sh

4. Handle agent results

| Status | Meaning | Next Action |
|---|---|---|
| DONE | Agent completed successfully | Advance pipeline |
| FAILED | Agent found issues | Read report, spawn fix |
| RUNNING | Agent still working | Wait, check again next cycle |
| BLOCKED | Agent cannot proceed | Analyze cause, escalate |
| BLOCKED_EXTERNAL | Missing external dep | Follow SOP-007 |
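The mapping from status to next action can be written as a `case` dispatch. This is a minimal sketch: `next_action` and the action labels it prints are hypothetical names chosen for illustration, and any keyword outside the five listed statuses is treated as invalid.

```shell
# Hypothetical dispatcher: print the next pipeline action for a status.
# Returns 1 for any keyword that is not one of the five known statuses.
next_action() {
  case "$1" in
    DONE)             echo "advance" ;;    # advance pipeline
    FAILED)           echo "spawn-fix" ;;  # read report, spawn fix
    RUNNING)          echo "wait" ;;       # check again next cycle
    BLOCKED)          echo "escalate" ;;   # analyze cause, escalate
    BLOCKED_EXTERNAL) echo "sop-007" ;;    # follow SOP-007
    *)                echo "invalid"; return 1 ;;
  esac
}
```

The non-zero return for unknown keywords lets the orchestrator catch invented statuses such as SECURITY_PASS before acting on them.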

Read JSON reports to determine next steps:

python3 -c "
import json, glob
for f in glob.glob('$HOME/dev/ops/reviews/{service}/*-report.json'):
    with open(f) as fh:
        r = json.load(fh)
    name = f.split('/')[-1]
    verdict = r.get('verdict', r.get('status', '?'))
    print(f'{name}: {verdict}')
"

5. Circuit breaker (3-strike rule)

Track retries per service per agent type:

RETRY_FILE="$HOME/.claude/agent-memory/pipeline/retries-{service}-{agent}.txt"
RETRIES=$(cat "$RETRY_FILE" 2>/dev/null || echo "0")

On each failure:

RETRIES=$((RETRIES + 1))
echo "$RETRIES" > "$RETRY_FILE"

if [ "$RETRIES" -ge 3 ]; then
  echo "CIRCUIT BREAKER: {service}/{agent} failed 3 times"
  # Mark BLOCKED
  CLI="$HOME/dev/ops/adlc-v2/scripts/cli"
  bash "$CLI/write-status.sh" {service} {agent} BLOCKED "Circuit breaker: 3 failures"

  # Post to Slack DM immediately
  source ~/.env.adlc
  curl -sf -X POST "https://slack.com/api/chat.postMessage" \
    -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(python3 -c "
import json
print(json.dumps({
    'channel': 'D0AGRAVEC1K',
    'text': ':rotating_light: CIRCUIT BREAKER -- {service}/{agent} failed 3 times\nLast error: {error_description}\nAction needed: Manual investigation required'
}))
")"
fi

On success: Reset the retry counter:

echo "0" > "$RETRY_FILE"
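The counter logic above can be wrapped in one function that also tolerates a corrupted counter file. This is a sketch only: `record_result` is a hypothetical helper, and treating an empty or non-numeric counter as 0 is an assumption, not documented ADLC behavior.

```shell
# Hypothetical wrapper around the retry counter.
# $1 = retry file, $2 = result (DONE resets, anything else counts as a
# failure), $3 = strike limit (default 3).
# Prints the new counter; returns 1 once the breaker has tripped.
record_result() {
  limit=${3:-3}
  count=$(cat "$1" 2>/dev/null || echo 0)
  # Assumption: an empty or non-numeric counter is treated as 0.
  case "$count" in
    ''|*[!0-9]*) count=0 ;;
  esac
  if [ "$2" = "DONE" ]; then
    count=0
  else
    count=$((count + 1))
  fi
  echo "$count" > "$1"
  echo "$count"
  [ "$count" -lt "$limit" ]
}
```

A caller could run `record_result "$RETRY_FILE" FAILED >/dev/null || trip_breaker`, where `trip_breaker` stands in for the BLOCKED status write and Slack post shown above.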

6. Handle common agent failure patterns

| Pattern | Cause | Fix |
|---|---|---|
| Agent writes JSON to status file | Agent captured API response and wrote it raw | Use CLI tools only (lesson from 2026-03-21) |
| Agent invents status keywords | Agent uses SECURITY_PASS, SCENARIOS_READY, etc. | Use validate-status.sh --fix (lesson from 2026-03-23) |
| Agent cannot find spec | Wrong path convention | Glob for spec: ~/dev/specs/**/*spec* (lesson from 2026-03-23) |
| Agent OOM killed | Concurrent Rust builds | Max 1 Rust build at a time (lesson from 2026-03-20) |
| BA marks pending tasks as MISSING | BA doesn’t know which tasks are pending | Include completed task list in BA agent prompt |
| Dev agent leaves stub module | Stub not replaced when real module added | Agent prompt must include “remove stub, update all imports” (lesson from 2026-03-23) |
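The spec glob above relies on `**`, which bash expands recursively only when the `globstar` option is enabled (`shopt -s globstar`, bash 4+). A `find`-based lookup avoids that dependency. In this sketch `find_spec` is a hypothetical helper, and it assumes the service name appears as a directory component in the spec path, as in ~/dev/specs/{project}/specs/{service}/spec.md.

```shell
# Hypothetical helper: list spec files for one service under a specs root.
# $1 = specs root (for example ~/dev/specs), $2 = service name.
# Matches regular files whose name contains "spec" and whose path
# contains the service name as a directory.
find_spec() {
  find "$1" -type f -path "*/$2/*" -name '*spec*' 2>/dev/null | sort
}
```

For example, `find_spec ~/dev/specs {service}` prints every candidate spec path for that service, one per line, and prints nothing when there is no match.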

7. Parallel execution rules

8. Status file format enforcement

All agents MUST use CLI tools for output:

CLI="$HOME/dev/ops/adlc-v2/scripts/cli"
bash "$CLI/write-status.sh" {service} {agent} {STATUS} "{details}"
bash "$CLI/write-review.sh" {service} {agent_type} < report.json
bash "$CLI/write-pipeline-state.sh" {project} {service} {STATE} "{details}"
bash "$CLI/write-lesson.sh" {agent} {service} "{problem}" "{root_cause}" "{fix}" "{prevention}"

NEVER write these files directly with the Write or Edit tool: the CLI validates format and rejects invalid input.
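As an illustration of the kind of check this enforcement implies, the sketch below tests whether a status file starts with one of the five status keywords from step 4. It is not the logic of validate-status.sh or the CLI, whose actual rules are not shown here. `is_valid_status_file` is a hypothetical name, and the "keyword first on line 1" layout is an assumption.

```shell
# Hypothetical pre-check, not a replacement for validate-status.sh.
# Succeeds only if the first word on line 1 is a known status keyword.
is_valid_status_file() {
  [ -f "$1" ] || return 1
  keyword=$(awk 'NR==1 {print $1; exit}' "$1")
  case "$keyword" in
    DONE|FAILED|RUNNING|BLOCKED|BLOCKED_EXTERNAL) return 0 ;;
    *) return 1 ;;
  esac
}
```

A file containing raw JSON or an invented keyword such as SECURITY_PASS fails this check, which covers two of the failure patterns listed in step 6.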

Verification

Rollback

If an agent corrupts state:

1. Validate and fix status files: bash ~/dev/ops/adlc-v2/scripts/validate-status.sh --fix
2. Reset retry counter: echo "0" > ~/.claude/agent-memory/pipeline/retries-{service}-{agent}.txt
3. Clear stale RUNNING status: re-run the agent or manually set to previous stable state
4. If agent corrupted source code: cd ~/dev/projects/{service} && git checkout dev -- .

References