SOP-005: Incident Response (Staging)

Purpose

Handle staging incidents including 503 errors, health check failures, container crashes, and proxy routing issues. Restore service availability and document root cause for prevention.

Scope

Applies to all services deployed on srv-staging (35.195.54.220) via Coolify. Covers diagnosis, resolution, and communication for staging environment incidents. Production incidents follow SOP-009 with additional approval gates.

Prerequisites

Procedure

1. Detect the incident

Incidents are detected via:

- Health check polling (pipeline scanner runs every 5 minutes)
- Slack alert from monitoring
- Manual discovery

Classify severity:

| Severity | Condition | Response Time |
|----------|-----------|---------------|
| CRITICAL | All staging services down | Immediate |
| HIGH | Single service down, blocks pipeline | Within 15 minutes |
| MEDIUM | Intermittent 503, service recovers | Within 1 hour |
| LOW | Performance degradation, no downtime | Next pipeline cycle |
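
The classification above can be sketched as a small helper for scripting. This is a sketch only: the function name and its parameters (down count, total services, whether the pipeline is blocked, whether the failure is intermittent) are assumptions, not part of the scanner.

```shell
# Map a scan result to a severity per the table above (sketch; inputs are assumed).
classify_severity() {
  local down_count=$1 total=$2 blocks_pipeline=$3 intermittent=$4
  if [ "$down_count" -eq "$total" ]; then
    echo "CRITICAL"
  elif [ "$down_count" -ge 1 ] && [ "$blocks_pipeline" = "yes" ]; then
    echo "HIGH"
  elif [ "$intermittent" = "yes" ]; then
    echo "MEDIUM"
  else
    echo "LOW"
  fi
}

# Example: all 7 services down
classify_severity 7 7 no no    # CRITICAL
```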

2. Initial diagnosis

Check all services from srv-agents:

# Note: -f is omitted on purpose. With -f, curl exits nonzero on 4xx/5xx, so both
# the -w output and the || fallback would print, yielding two values per service.
for svc in oid docstore pdf-engine notification-hub workflow-engine form-engine; do
  code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "https://${svc}.staging.orbusdigital.com/health" 2>/dev/null)
  echo "$svc: ${code:-000}"
done
code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "https://ods-dashboard.staging.orbusdigital.com/api/health" 2>/dev/null)
echo "ods-dashboard: ${code:-000}"

HTTP 000 = connection refused or DNS failure. HTTP 502/503 = proxy up but backend down.
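That mapping from status code to first diagnosis step can be expressed as a small helper. A sketch following the interpretation above; the suggested actions are summaries of the sections below, not scanner output.

```shell
# Suggest a first diagnosis step for an observed HTTP status code (sketch).
diagnose_code() {
  case "$1" in
    000)     echo "connection refused or DNS failure -- check DNS and Traefik" ;;
    502|503) echo "proxy up, backend down -- check container via Coolify API" ;;
    2??)     echo "healthy" ;;
    *)       echo "unexpected status $1 -- check application logs" ;;
  esac
}

diagnose_code 503   # proxy up, backend down -- check container via Coolify API
```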

3. Diagnose by error type

HTTP 503 / 502 – Backend down

source ~/.env.adlc
UUID=$(python3 -c "import json; print(json.load(open('$HOME/dev/ops/coolify/{service}.json'))['coolify']['app_uuid'])")

# Check application status
curl -sf "$COOLIFY_API_URL/api/v1/applications/$UUID" \
  -H "Authorization: Bearer $COOLIFY_API_TOKEN" | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(f'Status: {d.get(\"status\", \"unknown\")}')
"

If SSH to srv-staging works:

ssh jniox_orbusdigital_com@35.195.54.220 "docker ps -a --filter 'name={service}' --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
ssh jniox_orbusdigital_com@35.195.54.220 "docker logs --tail 50 \$(docker ps -a --filter 'name={service}' -q | head -1)"

HTTP 000 – Connection refused

This typically means:

1. DNS not resolving: dig {service}.staging.orbusdigital.com
2. Traefik proxy down on srv-staging (known recurring issue with ods-dashboard)
3. SSH to srv-staging timing out (GCP network issue)

# Test direct IP access
curl -s -o /dev/null --max-time 10 -w "%{http_code}\n" "http://35.195.54.220:8080/health" 2>/dev/null

Container crash loop

# Via Coolify API -- check restart count
curl -sf "$COOLIFY_API_URL/api/v1/applications/$UUID" \
  -H "Authorization: Bearer $COOLIFY_API_TOKEN" | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(json.dumps(d, indent=2))" | head -30

Common crash causes:

- Missing environment variables (check .env vs .env.example)
- Database connection refused (ods-postgres not on coolify network)
- Port conflict
- OOM kill (Rust services during compilation)
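
For the first cause, missing variables can be spotted by diffing the keys of .env against .env.example. A sketch assuming simple KEY=value files; the function name and file locations are illustrative.

```shell
# List keys present in .env.example but missing from .env (sketch).
# Assumes KEY=value lines; comment lines do not match the key pattern.
missing_env_keys() {
  local example=$1 actual=$2
  comm -23 \
    <(grep -oE '^[A-Za-z_][A-Za-z0-9_]*' "$example" | sort -u) \
    <(grep -oE '^[A-Za-z_][A-Za-z0-9_]*' "$actual" | sort -u)
}

# Usage in the service checkout:
#   missing_env_keys .env.example .env
```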

4. Common resolutions

Restart the service:

curl -sf -X POST "$COOLIFY_API_URL/api/v1/applications/$UUID/restart" \
  -H "Authorization: Bearer $COOLIFY_API_TOKEN"

Fix Docker network (lesson from 2026-03-21): Ensure the app is on the coolify network. If not, update via Coolify UI > Custom Docker Options > --network=coolify.
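
Network membership can be verified before reaching for the UI. The helper below is a sketch: it parses docker inspect JSON on stdin, so it can be piped from a docker inspect run on srv-staging.

```shell
# Return 0 if the inspected container is attached to the "coolify" network (sketch).
# Reads the JSON produced by: docker inspect <container>
on_coolify_network() {
  python3 -c '
import json, sys
data = json.load(sys.stdin)
nets = data[0].get("NetworkSettings", {}).get("Networks", {})
sys.exit(0 if "coolify" in nets else 1)
'
}

# Usage on srv-staging:
#   docker inspect {service} | on_coolify_network && echo "on coolify" || echo "NOT on coolify"
```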

Traefik routing issue (known for ods-dashboard): The dashboard runs on srv-agents with a Caddy proxy on srv-staging. If Traefik loses the route:

# Restart Traefik on srv-staging (if SSH works)
ssh jniox_orbusdigital_com@35.195.54.220 "docker restart coolify-proxy"

PostgreSQL unreachable:

pg_isready -h 127.0.0.1 -p 5433 2>/dev/null && echo "OK" || echo "DOWN"
# If down:
docker restart ods-postgres
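
After a restart it is worth polling rather than checking once, since the database takes a moment to accept connections. A generic wait loop, sketched below; the 30-attempt budget is an assumption.

```shell
# Retry a command once per second until it succeeds or the attempt budget runs out (sketch).
wait_for() {
  local tries=$1; shift
  local i
  for i in $(seq 1 "$tries"); do
    "$@" >/dev/null 2>&1 && return 0
    sleep 1
  done
  return 1
}

# Usage after restarting ods-postgres:
#   wait_for 30 pg_isready -h 127.0.0.1 -p 5433 && echo "recovered" || echo "still down"
```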

5. Post to Slack

For HIGH/CRITICAL incidents, post immediately to the DM channel:

source ~/.env.adlc
curl -sf -X POST "https://slack.com/api/chat.postMessage" \
  -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(python3 -c "import json; print(json.dumps({'channel':'D0AGRAVEC1K','text':':rotating_light: INCIDENT -- {service} staging returning {status_code}. Root cause: {description}. Action: {action_taken}'}))")"

Once resolved, post to the ADLC channel:

curl -sf -X POST "https://slack.com/api/chat.postMessage" \
  -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(python3 -c "import json; print(json.dumps({'channel':'C0AN0N8AUGZ','text':':white_check_mark: {service} staging recovered -- {resolution}'}))")"

6. Document the incident

CLI="$HOME/dev/ops/adlc-v2/scripts/cli"
bash "$CLI/write-lesson.sh" orchestrator {service} \
  "{problem description}" \
  "{root cause}" \
  "{fix applied}" \
  "{prevention measure}"

Verification

Rollback

If the service cannot be restored:

1. Mark as BLOCKED in pipeline state
2. Post to Slack DM with full diagnosis
3. If the issue is in new code: revert to the last known good commit on the staging branch
4. If it is an infrastructure issue: escalate to a human for server-level intervention

References