SOP-001: Pipeline Boot / Resume

Purpose

Reconstruct ADLC pipeline state after a daily restart (4 a.m. systemd timer), server reboot, or manual session restart, and confirm that all agents, services, and infrastructure are healthy before autonomous operations resume.

Scope

Applies to the ADLC orchestrator session (ods-claude tmux) on srv-agents (this machine). Covers context rebuild, interrupted work recovery, resource validation, and pipeline loop start.

Prerequisites

SSH access to srv-agents (jniox_orbusdigital_com@odsgpcicd-srv-agents)
tmux session ods-claude exists (created by systemd ods-claude.service)
PostgreSQL running on 127.0.0.1:5433 (ods-postgres container)
Environment file ~/.env.adlc with SLACK_BOT_TOKEN and COOLIFY_API_URL set
Agent memory directory: ~/.claude/agent-memory/pipeline/

Procedure

1. Verify systemd services are running

systemctl --user status ods-claude.service
systemctl --user status ods-restart.timer

If stopped:

systemctl --user start ods-claude.service
systemctl --user start ods-restart.timer
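
The check-then-start sequence above can be folded into one idempotent helper. This is a minimal sketch, not part of the boot skill: the function name ensure_unit_active is hypothetical, and it relies only on systemctl --user is-active and systemctl --user start.

```shell
# Hypothetical helper: start a user unit only if it is not already active.
ensure_unit_active() {
  if systemctl --user is-active --quiet "$1"; then
    echo "$1: active"
  elif systemctl --user start "$1"; then
    echo "$1: started"
  else
    echo "$1: FAILED to start" >&2
    return 1
  fi
}
```

Usage: call ensure_unit_active ods-claude.service, then ensure_unit_active ods-restart.timer. A non-zero return means the unit could not be started and the Rollback section applies.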

2. Attach to the orchestrator session

tmux attach -t ods-claude

3. Run the /boot skill (inside the Claude session)

The boot skill executes these steps automatically:

Step 3a – Load agent memory:

for d in ~/.claude/agent-memory/*/; do
  echo "=== $(basename "$d") ==="
  head -20 "$d/MEMORY.md" 2>/dev/null
done

Step 3b – Load project progress:

for p in ~/dev/specs/*/gestion/progress.md; do
  PROJECT=$(basename "$(dirname "$(dirname "$p")")")
  echo "=== $PROJECT ==="
  tail -20 "$p"
done

Step 3c – Check git state of all services:

for d in ~/dev/projects/*/; do
  echo "--- $(basename "$d") ---"
  # git -C leaves the shell's working directory unchanged
  git -C "$d" log --oneline -3 2>/dev/null && git -C "$d" status -s 2>/dev/null
done

Step 3d – Detect interrupted work:

grep -l "^RUNNING" ~/dev/ops/outputs/*.status 2>/dev/null

For each interrupted task: reset status to previous stable phase and re-queue.
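
To turn the grep output into a list of task names for re-queueing, something like the following may help. It is a sketch under assumptions: the function name list_interrupted is hypothetical, and it assumes one <task>.status file per task, as the glob in step 3d suggests. The reset to the previous stable phase is deliberately not shown, because the status file format is not specified here.

```shell
# Hypothetical helper: print the name of every task whose status file
# contains a line starting with RUNNING (same match as step 3d).
list_interrupted() {
  dir="${1:-$HOME/dev/ops/outputs}"
  for f in "$dir"/*.status; do
    [ -e "$f" ] || continue          # glob matched nothing
    if grep -q '^RUNNING' "$f"; then
      basename "$f" .status          # strip directory and .status suffix
    fi
  done
}
```

Usage: list_interrupted prints one task name per line; empty output means no work was interrupted.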

Step 3e – System resource check:

free -h
df -h /home
docker ps --format "table {{.Names}}\t{{.Status}}" 2>/dev/null
pg_isready -h 127.0.0.1 -p 5433 2>/dev/null && echo "PostgreSQL OK" || echo "PostgreSQL DOWN"

Step 3f – Validate status file integrity:

bash ~/dev/ops/adlc-v2/scripts/validate-status.sh

If violations found:

bash ~/dev/ops/adlc-v2/scripts/validate-status.sh --fix
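
The validate-then-fix decision can be scripted as below. This sketch assumes validate-status.sh exits non-zero when it finds violations, which is not confirmed by this SOP; the function name validate_and_fix is hypothetical.

```shell
# Hypothetical wrapper: run the validator, and run it again with --fix
# only if the first pass reports violations.
# Assumption: the script signals violations through a non-zero exit code.
validate_and_fix() {
  script="${1:-$HOME/dev/ops/adlc-v2/scripts/validate-status.sh}"
  if bash "$script"; then
    echo "status files: OK"
  else
    echo "status files: violations found, running --fix"
    bash "$script" --fix
  fi
}
```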

4. Check staging health (7 services)

for svc in oid docstore pdf-engine notification-hub workflow-engine form-engine; do
  # No -f and no fallback echo: curl already prints 000 when it gets no HTTP response
  code=$(curl -s -o /dev/null -w "%{http_code}" "https://${svc}.staging.orbusdigital.com/health" 2>/dev/null)
  echo "$svc: $code"
done
code=$(curl -s -o /dev/null -w "%{http_code}" "https://ods-dashboard.staging.orbusdigital.com/api/health" 2>/dev/null)
echo "ods-dashboard: $code"
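
For unattended runs, the seven probes can be wrapped so that a single exit code says whether staging is healthy. This is a sketch, not the boot skill's own code: the function names health_url and check_staging are hypothetical, and the 10-second --max-time is an assumed timeout.

```shell
# Build the health URL for a staging service. ods-dashboard exposes
# /api/health; the other six services expose /health.
health_url() {
  case "$1" in
    ods-dashboard) echo "https://$1.staging.orbusdigital.com/api/health" ;;
    *)             echo "https://$1.staging.orbusdigital.com/health" ;;
  esac
}

# Probe all seven services; return non-zero if any does not answer HTTP 200.
check_staging() {
  failed=0
  for svc in oid docstore pdf-engine notification-hub workflow-engine form-engine ods-dashboard; do
    # --max-time 10 is an assumption, not taken from the procedure
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$(health_url "$svc")")
    echo "$svc: $code"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}
```

Usage: check_staging || echo "staging degraded" prints one line per service and flags the run if any service is unhealthy.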

5. Start the pipeline loop

/loop 5m /check-pipeline

6. Post boot summary to Slack

Post to ADLC channel (C0AN0N8AUGZ) with:
- Number of active projects and their phases
- Any interrupted work that was re-queued
- System resource status (RAM, disk)
- Staging health status
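
If the summary ever needs to be posted from a script rather than from the Claude session, a sketch along these lines could work. The function names build_boot_summary and post_to_slack are hypothetical, the message layout is illustrative, and the post goes through Slack's chat.postMessage Web API method with SLACK_BOT_TOKEN assumed to be exported (for example, loaded from ~/.env.adlc).

```shell
# Compose the boot summary. The four arguments are free-form strings:
# projects/phases, re-queued work, resource status, staging health.
build_boot_summary() {
  printf '%s\n' "ADLC boot summary"
  printf '%s\n' "- Active projects: $1"
  printf '%s\n' "- Re-queued work: $2"
  printf '%s\n' "- Resources: $3"
  printf '%s\n' "- Staging health: $4"
}

# Post a text message to the ADLC channel.
# --data-urlencode makes curl send a form-encoded POST and escapes newlines.
post_to_slack() {
  curl -s https://slack.com/api/chat.postMessage \
    -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
    --data-urlencode "channel=C0AN0N8AUGZ" \
    --data-urlencode "text=$1"
}
```

Usage: post_to_slack "$(build_boot_summary "2 (build, review)" "none" "RAM 6000MB, disk 40%" "7/7 OK")". The argument values here are placeholders, not real pipeline data.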

Verification

tmux ls shows ods-claude session running
grep -l "^RUNNING" ~/dev/ops/outputs/*.status prints nothing (no stale entries)
All 7 staging services return HTTP 200 on their health endpoints
Pipeline state file is current: head -5 ~/.claude/agent-memory/pipeline/state.md
Available memory > 2000MB: awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo
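
The memory criterion can be turned into a pass/fail check. This is a minimal sketch built on the awk one-liner above; the function names mem_available_mb and check_min_memory are hypothetical, and the optional file argument exists only so the check can be exercised against a sample meminfo file.

```shell
# Print MemAvailable in MB from a meminfo-format file (default /proc/meminfo).
mem_available_mb() {
  awk '/^MemAvailable:/ {print int($2/1024)}' "${1:-/proc/meminfo}"
}

# Succeed only if available memory exceeds the threshold in MB (default 2000).
# Arguments: [min_mb] [meminfo_file]
check_min_memory() {
  min="${1:-2000}"
  avail=$(mem_available_mb "$2")
  if [ "${avail:-0}" -gt "$min" ]; then
    echo "memory OK: ${avail}MB available"
  else
    echo "memory LOW: ${avail:-unknown}MB available (need > ${min}MB)" >&2
    return 1
  fi
}
```

Usage: check_min_memory with no arguments applies the 2000MB threshold to the live /proc/meminfo.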

Rollback

If boot fails:
1. Check systemd journal: journalctl --user -u ods-claude.service -n 50
2. Kill orphan Claude processes: pkill -f "claude" && sleep 5
3. Restart the service: systemctl --user restart ods-claude.service
4. If PostgreSQL is down: docker restart ods-postgres
5. If Redpanda is down: docker restart redpanda

References

Boot skill: ~/.claude/skills/boot/SKILL.md
Check pipeline skill: ~/.claude/skills/check-pipeline/SKILL.md
Dispatcher: ~/dev/ops/adlc-v2/scripts/dispatcher-v3.sh
Pipeline state: ~/.claude/agent-memory/pipeline/state.md
Lesson: OOM crash from concurrent Rust builds (2026-03-20) – max 1 Rust build at a time