SOP-005: Incident Response (Staging)

Purpose

Handle staging incidents including 503 errors, health check failures, container crashes, and proxy routing issues. Restore service availability and document root cause for prevention.

Scope

Applies to all services deployed on srv-staging (35.195.54.220) via Coolify. Covers diagnosis, resolution, and communication for staging environment incidents. Production incidents follow SOP-009 with additional approval gates.

Prerequisites

Procedure

1. Detect the incident

Incidents are detected via:

- Health check polling (pipeline scanner runs every 5 minutes)
- Slack alert from monitoring
- Manual discovery

Classify severity:

| Severity | Condition | Response Time |
|----------|-----------|---------------|
| CRITICAL | All staging services down | Immediate |
| HIGH | Single service down, blocks pipeline | Within 15 minutes |
| MEDIUM | Intermittent 503, service recovers | Within 1 hour |
| LOW | Performance degradation, no downtime | Next pipeline cycle |
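
The classification above can be sketched as a small helper for scripting. This is a sketch only: the function name and its parameters (down count, total services, whether the pipeline is blocked, whether the failure is intermittent) are assumptions, not part of the scanner.

```shell
# Map a scan result to a severity per the table above (sketch; inputs are assumed).
classify_severity() {
  local down_count=$1 total=$2 blocks_pipeline=$3 intermittent=$4
  if [ "$down_count" -eq "$total" ]; then
    echo "CRITICAL"
  elif [ "$down_count" -ge 1 ] && [ "$blocks_pipeline" = "yes" ]; then
    echo "HIGH"
  elif [ "$intermittent" = "yes" ]; then
    echo "MEDIUM"
  else
    echo "LOW"
  fi
}

# Example: all 7 services down
classify_severity 7 7 no no    # CRITICAL
```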

2. Initial diagnosis

Check all services from srv-agents:

# Note: -f is omitted on purpose. With -f, curl exits nonzero on 4xx/5xx, so both
# the -w output and the || fallback would print, yielding two values per service.
for svc in oid docstore pdf-engine notification-hub workflow-engine form-engine; do
  code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "https://${svc}.staging.orbusdigital.com/health" 2>/dev/null)
  echo "$svc: ${code:-000}"
done
code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "https://ods-dashboard.staging.orbusdigital.com/api/health" 2>/dev/null)
echo "ods-dashboard: ${code:-000}"

HTTP 000 = connection refused or DNS failure. HTTP 502/503 = proxy up but backend down.
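That mapping from status code to first diagnosis step can be expressed as a small helper. A sketch following the interpretation above; the suggested actions are summaries of the sections below, not scanner output.

```shell
# Suggest a first diagnosis step for an observed HTTP status code (sketch).
diagnose_code() {
  case "$1" in
    000)     echo "connection refused or DNS failure -- check DNS and Traefik" ;;
    502|503) echo "proxy up, backend down -- check container via Coolify API" ;;
    2??)     echo "healthy" ;;
    *)       echo "unexpected status $1 -- check application logs" ;;
  esac
}

diagnose_code 503   # proxy up, backend down -- check container via Coolify API
```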

3. Diagnose by error type

HTTP 503 / 502 – Backend down

source ~/.env.adlc
UUID=$(python3 -c "import json; print(json.load(open('$HOME/dev/ops/coolify/{service}.json'))['coolify']['app_uuid'])")

# Check application status
curl -sf "$COOLIFY_API_URL/api/v1/applications/$UUID" \
  -H "Authorization: Bearer $COOLIFY_API_TOKEN" | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(f'Status: {d.get(\"status\", \"unknown\")}')
"

If SSH to srv-staging works:

ssh jniox_orbusdigital_com@35.195.54.220 "docker ps -a --filter 'name={service}' --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
ssh jniox_orbusdigital_com@35.195.54.220 "docker logs --tail 50 \$(docker ps -a --filter 'name={service}' -q | head -1)"

HTTP 000 – Connection refused

This typically means:

1. DNS not resolving: dig {service}.staging.orbusdigital.com
2. Traefik proxy down on srv-staging (known recurring issue with ods-dashboard)
3. SSH to srv-staging timing out (GCP network issue)

# Test direct IP access
curl -s -o /dev/null --max-time 10 -w "%{http_code}\n" "http://35.195.54.220:8080/health" 2>/dev/null

Container crash loop

# Via Coolify API -- check restart count
curl -sf "$COOLIFY_API_URL/api/v1/applications/$UUID" \
  -H "Authorization: Bearer $COOLIFY_API_TOKEN" | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(json.dumps(d, indent=2))" | head -30

Common crash causes:

- Missing environment variables (check .env vs .env.example)
- Database connection refused (ods-postgres not on coolify network)
- Port conflict
- OOM kill (Rust services during compilation)
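
For the first cause, missing variables can be spotted by diffing the keys of .env against .env.example. A sketch assuming simple KEY=value files; the function name and file locations are illustrative.

```shell
# List keys present in .env.example but missing from .env (sketch).
# Assumes KEY=value lines; comment lines do not match the key pattern.
missing_env_keys() {
  local example=$1 actual=$2
  comm -23 \
    <(grep -oE '^[A-Za-z_][A-Za-z0-9_]*' "$example" | sort -u) \
    <(grep -oE '^[A-Za-z_][A-Za-z0-9_]*' "$actual" | sort -u)
}

# Usage in the service checkout:
#   missing_env_keys .env.example .env
```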

4. Common resolutions

Restart the service:

curl -sf -X POST "$COOLIFY_API_URL/api/v1/applications/$UUID/restart" \
  -H "Authorization: Bearer $COOLIFY_API_TOKEN"

Fix Docker network (lesson from 2026-03-21): Ensure the app is on the coolify network. If not, update via Coolify UI > Custom Docker Options > --network=coolify.
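
Network membership can be verified before reaching for the UI. The helper below is a sketch: it parses docker inspect JSON on stdin, so it can be piped from a docker inspect run on srv-staging.

```shell
# Return 0 if the inspected container is attached to the "coolify" network (sketch).
# Reads the JSON produced by: docker inspect <container>
on_coolify_network() {
  python3 -c '
import json, sys
data = json.load(sys.stdin)
nets = data[0].get("NetworkSettings", {}).get("Networks", {})
sys.exit(0 if "coolify" in nets else 1)
'
}

# Usage on srv-staging:
#   docker inspect {service} | on_coolify_network && echo "on coolify" || echo "NOT on coolify"
```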

Traefik routing issue (known for ods-dashboard): The dashboard runs on srv-agents with a Caddy proxy on srv-staging. If Traefik loses the route:

# Restart Traefik on srv-staging (if SSH works)
ssh jniox_orbusdigital_com@35.195.54.220 "docker restart coolify-proxy"

PostgreSQL unreachable:

pg_isready -h 127.0.0.1 -p 5433 2>/dev/null && echo "OK" || echo "DOWN"
# If down:
docker restart ods-postgres
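
After a restart it is worth polling rather than checking once, since the database takes a moment to accept connections. A generic wait loop, sketched below; the 30-attempt budget is an assumption.

```shell
# Retry a command once per second until it succeeds or the attempt budget runs out (sketch).
wait_for() {
  local tries=$1; shift
  local i
  for i in $(seq 1 "$tries"); do
    "$@" >/dev/null 2>&1 && return 0
    sleep 1
  done
  return 1
}

# Usage after restarting ods-postgres:
#   wait_for 30 pg_isready -h 127.0.0.1 -p 5433 && echo "recovered" || echo "still down"
```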

5. Post to Slack

For HIGH/CRITICAL incidents, post immediately to the DM channel:

source ~/.env.adlc
curl -sf -X POST "https://slack.com/api/chat.postMessage" \
  -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(python3 -c "import json; print(json.dumps({'channel':'D0AGRAVEC1K','text':':rotating_light: INCIDENT -- {service} staging returning {status_code}. Root cause: {description}. Action: {action_taken}'}))")"

Once resolved, post to the ADLC channel:

curl -sf -X POST "https://slack.com/api/chat.postMessage" \
  -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(python3 -c "import json; print(json.dumps({'channel':'C0AN0N8AUGZ','text':':white_check_mark: {service} staging recovered -- {resolution}'}))")"

6. Document the incident

CLI="$HOME/dev/ops/adlc-v2/scripts/cli"
bash "$CLI/write-lesson.sh" orchestrator {service} \
  "{problem description}" \
  "{root cause}" \
  "{fix applied}" \
  "{prevention measure}"

Verification

Rollback

If the service cannot be restored:

1. Mark as BLOCKED in pipeline state
2. Post to Slack DM with full diagnosis
3. If the issue is in new code: revert to the last known good commit on the staging branch
4. If it is an infrastructure issue: escalate to a human for server-level intervention

References