Handle staging incidents including 503 errors, health check failures, container crashes, and proxy routing issues. Restore service availability and document root cause for prevention.
Applies to all services deployed on srv-staging (35.195.54.220) via Coolify. Covers diagnosis, resolution, and communication for staging environment incidents. Production incidents follow SOP-009 with additional approval gates.
```sh
ssh jniox_orbusdigital_com@35.195.54.220   # may require GCP OS Login
source ~/.env.adlc
```

Incidents are detected via:

- Health check polling (pipeline scanner runs every 5 minutes)
- Slack alert from monitoring
- Manual discovery
Classify severity:

| Severity | Condition | Response Time |
|----------|-----------|---------------|
| CRITICAL | All staging services down | Immediate |
| HIGH | Single service down, blocks pipeline | Within 15 minutes |
| MEDIUM | Intermittent 503, service recovers | Within 1 hour |
| LOW | Performance degradation, no downtime | Next pipeline cycle |
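The CRITICAL and HIGH rows can be approximated mechanically from a single health sweep. A minimal sketch (the thresholds are an assumption from the table; MEDIUM and LOW need probe history, not one poll):

```sh
# Sketch: map the count of failing services to a severity label.
# Intermittent (MEDIUM) and performance (LOW) cases require history,
# so a single sweep can only distinguish CRITICAL / HIGH / OK.
classify_severity() {
  down=$1
  total=$2
  if [ "$down" -gt 0 ] && [ "$down" -eq "$total" ]; then
    echo "CRITICAL"       # all staging services down
  elif [ "$down" -ge 1 ]; then
    echo "HIGH"           # at least one service down
  else
    echo "OK"             # nothing down in this sweep
  fi
}

classify_severity 7 7   # prints CRITICAL
classify_severity 1 7   # prints HIGH
```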
Check all services from srv-agents:
```sh
for svc in oid docstore pdf-engine notification-hub workflow-engine form-engine; do
  # No -f here: with -f, curl exits nonzero on 4xx/5xx, so the old
  # `|| echo "000"` fallback mangled the captured status code.
  code=$(curl -s -o /dev/null -w "%{http_code}" "https://${svc}.staging.orbusdigital.com/health" 2>/dev/null)
  echo "$svc: ${code:-000}"
done

code=$(curl -s -o /dev/null -w "%{http_code}" "https://ods-dashboard.staging.orbusdigital.com/api/health" 2>/dev/null)
echo "ods-dashboard: ${code:-000}"
```

HTTP 000 means connection refused or DNS failure. HTTP 502/503 means the proxy is up but the backend is down.
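To tell a hard outage (HIGH) from an intermittent 503 (MEDIUM), probe repeatedly rather than once. A sketch of the counting logic; the curl loop is shown only as a comment since it needs live network access, and the URL and interval are examples:

```sh
# count_failures CODE...: count probes that did not return HTTP 200
count_failures() {
  fails=0
  for code in "$@"; do
    [ "$code" = "200" ] || fails=$((fails + 1))
  done
  echo "$fails"
}

# In practice, gather codes with repeated curls, e.g.:
#   codes=$(for i in 1 2 3 4 5; do
#     curl -s -o /dev/null -w '%{http_code} ' "https://oid.staging.orbusdigital.com/health"
#     sleep 2
#   done)
#   count_failures $codes   # 5/5 failed -> hard outage; 1-4 -> intermittent

count_failures 200 503 200 503 503   # prints 3
```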
```sh
source ~/.env.adlc
UUID=$(python3 -c "import json; print(json.load(open('$HOME/dev/ops/coolify/{service}.json'))['coolify']['app_uuid'])")

# Check application status
curl -sf "$COOLIFY_API_URL/api/v1/applications/$UUID" \
  -H "Authorization: Bearer $COOLIFY_API_TOKEN" | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(f'Status: {d.get(\"status\", \"unknown\")}')
"
```

If SSH to srv-staging works:
```sh
ssh jniox_orbusdigital_com@35.195.54.220 "docker ps -a --filter 'name={service}' --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
ssh jniox_orbusdigital_com@35.195.54.220 "docker logs --tail 50 \$(docker ps -a --filter 'name={service}' -q | head -1)"
```

An HTTP 000 result typically means one of:

1. DNS not resolving: `dig {service}.staging.orbusdigital.com`
2. Traefik proxy down on srv-staging (known recurring issue with ods-dashboard)
3. SSH to srv-staging timing out (GCP network issue)

```sh
# Test direct IP access (bypasses DNS)
code=$(curl -s -o /dev/null -w "%{http_code}" "http://35.195.54.220:8080/health" 2>/dev/null)
echo "direct: ${code:-000}"
```

Via the Coolify API, check status and restart count:
```sh
curl -sf "$COOLIFY_API_URL/api/v1/applications/$UUID" \
  -H "Authorization: Bearer $COOLIFY_API_TOKEN" | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(json.dumps(d, indent=2))" | head -30
```

Common crash causes:

- Missing environment variables (check `.env` vs `.env.example`)
- Database connection refused (ods-postgres not on the coolify network)
- Port conflict
- OOM kill (Rust services during compilation)
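For the first crash cause, a quick way to spot variables declared in `.env.example` but absent from the deployed `.env`. The helper name is illustrative and the paths are the conventional ones assumed above:

```sh
# missing_env_vars ACTUAL EXAMPLE: print variable names that start a line
# in EXAMPLE but have no NAME= assignment in ACTUAL.
missing_env_vars() {
  grep -o '^[A-Za-z_][A-Za-z0-9_]*' "$2" | sort -u | while read -r name; do
    grep -q "^${name}=" "$1" || echo "$name"
  done
}

# Usage: missing_env_vars .env .env.example
```

Anything it prints is a candidate for the "missing environment variable" crash cause.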
Restart the service:
```sh
curl -sf -X POST "$COOLIFY_API_URL/api/v1/applications/$UUID/restart" \
  -H "Authorization: Bearer $COOLIFY_API_TOKEN"
```

Fix the Docker network (lesson from 2026-03-21): ensure the app is on the `coolify` network. If not, update via Coolify UI > Custom Docker Options > `--network=coolify`.
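Membership can also be checked and fixed from the CLI. The `on_network` helper below is illustrative; `docker inspect` and `docker network connect` are standard Docker CLI commands, and the container name is an example:

```sh
# on_network NAME "NET1 NET2 ...": succeed if NAME appears in the
# space-separated network list (helper is illustrative)
on_network() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Typical use against a live container (container name is an example):
#   nets=$(docker inspect -f '{{range $k, $v := .NetworkSettings.Networks}}{{$k}} {{end}}' my-service)
#   on_network coolify "$nets" || docker network connect coolify my-service
```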
Traefik routing issue (known for ods-dashboard): The dashboard runs on srv-agents with a Caddy proxy on srv-staging. If Traefik loses the route:
```sh
# Restart Traefik on srv-staging (if SSH works)
ssh jniox_orbusdigital_com@35.195.54.220 "docker restart coolify-proxy"
```

PostgreSQL unreachable:

```sh
pg_isready -h 127.0.0.1 -p 5433 2>/dev/null && echo "OK" || echo "DOWN"
# If down:
docker restart ods-postgres
```

For HIGH/CRITICAL incidents, post immediately to the DM channel:
```sh
source ~/.env.adlc
curl -sf -X POST "https://slack.com/api/chat.postMessage" \
  -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(python3 -c "import json; print(json.dumps({'channel':'D0AGRAVEC1K','text':':rotating_light: INCIDENT -- {service} staging returning {status_code}. Root cause: {description}. Action: {action_taken}'}))")"
```

On resolution, post to the ADLC channel:
```sh
curl -sf -X POST "https://slack.com/api/chat.postMessage" \
  -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(python3 -c "import json; print(json.dumps({'channel':'C0AN0N8AUGZ','text':':white_check_mark: {service} staging recovered -- {resolution}'}))")"
```

Record the lesson learned:

```sh
CLI="$HOME/dev/ops/adlc-v2/scripts/cli"
bash $CLI/write-lesson.sh orchestrator {service} \
  "{problem description}" \
  "{root cause}" \
  "{fix applied}" \
  "{prevention measure}"
```

Verify recovery:

- `curl -sf https://{service}.staging.orbusdigital.com/health`
- `docker ps --filter 'name={service}'` (on srv-staging)
- `docker logs --since 5m {container_name}` (on srv-staging)

If the service cannot be restored:

1. Mark as BLOCKED in pipeline state
2. Post to Slack DM with full diagnosis
3. If the issue is in new code: revert to the last known good commit on the staging branch
4. If infrastructure issue: escalate to a human for server-level intervention
- Service configs: `~/dev/ops/coolify/*.json`
- Docker network: `coolify` (lesson from 2026-03-21)
- Health endpoints: `/health` (Rust), `/api/health` (Node/Next.js); never `/` for Next.js (returns 307)