P1 Incidents
1
4 total open
▲ 1 new this hour
Active Alerts
18
3 crit · 6 high · 9 warn
▲ +4 this hour
Services Impact
3
14 total / 11 OK
▲ +1 degraded
Hosts Up
142
147 total · 5 issues
▸ 96.6% avail
SLO Health
2
SLOs breached of 14
▲ 1 new breach
Capacity Alerts
2
stor-nas · log-elk
▲ growing
Metrics/sec
48k
99.8% drop-free
▼ -3% load
// SERVICE HEALTH — STATUS PROPAGATION
PROPAGATION: app-prod-02 (DOWN) → Customer API Gateway (DEGRADED) · 14.2k users affected
⌚ 51 min ago · INC-2847 P1
PROPAGATION: k8s-node-02 (DEGRADED) + k8s-node-03 (DOWN) → Search & Indexing (PARTIAL OUTAGE) · Search stale 15min
⌚ 2h 18m ago · INC-2846 P2
TIER 1
MISSION CRITICAL
Revenue/safety-critical — any degradation triggers immediate response
🚨 Response: 5 min
📣 Notify: SMS + Voice + Slack
📉 SLO: 99.99%
1 DEGRADED
2 OK
Payment Platform
@payments-team
UPTIME 30D
99.97%
SLO TARGET
99.99%
// RESOURCES
db-prod-01
db-prod-02
Billing API
Stripe WH
Customer API Gateway
@platform-team
UPTIME 30D
99.72%
ERROR RATE
8.4%
// RESOURCES
web-prod-01
web-prod-02
app-prod-02
fw-edge-01
Database Cluster
@db-team
UPTIME 30D
99.98%
SLO TARGET
99.99%
// RESOURCES
db-prod-01
db-prod-02
TIER 2
BUSINESS CRITICAL
Core business ops — significant user/revenue impact
⏱ Response: 15 min
📣 Notify: Slack + Email + SMS
📉 SLO: 99.9%
1 PARTIAL OUTAGE
3 OK
Customer Portal
@frontend-team
UPTIME
100.0%
MEMBERS
2/2 OK
Auth Service
@security-team
UPTIME
99.99%
SLO
✓ MET
Search & Indexing
@data-team
UPTIME
98.10%
MEMBERS
2/3 ↓
Email / Notifs
@platform-team
UPTIME
99.85%
SLO
✓ MET
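The tier definitions above (response time, notification channels, SLO target) can be read as a small policy table. The following is a minimal sketch with the values copied from the TIER 1 and TIER 2 panels; the names `TierPolicy`, `TIERS`, and `meets_tier_slo` are hypothetical and not part of the dashboard. Per-service targets in the SLA / SLO Status panel may differ from the tier default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    """Hypothetical container for one tier's response policy."""
    name: str
    response_minutes: int   # maximum time to first response
    notify: tuple           # notification channels, as listed on the panel
    slo_pct: float          # tier-level SLO target, in percent


# Values taken from the TIER 1 / TIER 2 panels above.
TIERS = {
    1: TierPolicy("MISSION CRITICAL", 5, ("SMS", "Voice", "Slack"), 99.99),
    2: TierPolicy("BUSINESS CRITICAL", 15, ("Slack", "Email", "SMS"), 99.9),
}


def meets_tier_slo(tier: int, uptime_pct: float) -> bool:
    """Return True when the measured uptime is at or above the tier SLO."""
    return uptime_pct >= TIERS[tier].slo_pct


# Example: Search & Indexing (Tier 2) at 98.10% uptime.
print(meets_tier_slo(2, 98.10))
```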
// ACTIVE ALERTS & OPEN INCIDENTS
Active Alerts
18 FIRING
CRIT 0:51
app-prod-02 unreachable — ICMP & TCP timeout
CRIT 0:23
BGP Session Down: core-rt-01 → ISP-A (Cogent AS174)
HIGH 6:00
SSL Certificate Expiring: api.prod.internal (6 days)
HIGH 1:05
K8s Node Memory Pressure: k8s-node-02 at 94%
WARN 2:13
High Disk I/O: db-prod-01 — io_wait 87%
WARN 4:22
NAS Storage Warning: stor-nas-01 pool at 91.2%
Open Incidents
4 OPEN
P1 — Customer API Degradation · Elevated 5xx
P2 — Elasticsearch Split Brain · Search Degraded
P2 — BGP Failover · ISP-A Link Down
P3 — Log Aggregation Backlog · ELK 8-12m delay
// HOST STATUS & NETWORK TOPOLOGY
Network Topology
core-rt-01
fw-edge-01
sw-dist-01
AWS us-east-1
sw-acc-02
web-cluster
app-prod-02 ✕
db-cluster
WAN IN
820 Mbps
WAN OUT
315 Mbps
LATENCY
4.2 ms
PKT LOSS
0.01%
Host Status
▲ 142 UP · ▼ 5 DOWN
| HOST | IP | STATUS | ROLE | CPU | MEM | DISK |
|---|---|---|---|---|---|---|
| web-prod-01 | 10.0.1.11 | UP | web | | | |
| db-prod-01 | 10.0.2.10 | WARN | db | | | |
| app-prod-02 | 10.0.1.22 | DOWN | app | UNREACHABLE — metrics unavailable | | |
| k8s-node-02 | 10.0.3.22 | WARN | k8s | | | |
| stor-nas-01 | 10.0.4.5 | WARN | storage | | | |
| core-rt-01 | 10.0.0.1 | WARN | router | | | |
// METRICS & BUSINESS KPI
Host Metrics — db-prod-01
CPU
78%
8-core · 6.2 used
MEM
85%
64GB · 54.4 used
DISK
40%
2TB · 800GB used
IO WAIT
88%
HIGH · ALERT FIRING
CPU 2H TREND
Business KPI
ORDERS / HR
4,280
▲ +8.3% vs 1h avg
API ERROR RATE
8.4%
⚠ above 0.5% target
P99 API LATENCY
312ms
▲ target 250ms
REVENUE / MIN
$18.4k
▲ on target
// SLA / SLO STATUS & LIVE LOGS
SLA / SLO Status
2 BREACHED
Customer API GW
Target: 99.9%
99.72%
Payment Platform
Target: 99.99%
99.97%
Search & Indexing
Target: 99.5%
98.10%
Auth Service
Target: 99.99%
99.99%
Database Cluster
Target: 99.99%
99.98%
Customer Portal
Target: 99.9%
100.0%
30-DAY WINDOW · MTTR avg: 42min · MTBF avg: 18.4d · 2 of 14 SLOs BREACHED
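The uptime figures above can also be read as error-budget consumption over the 30-day window. The sketch below assumes uptime is measured as the fraction of minutes available in a 43,200-minute window; the function name `error_budget` is hypothetical, and the dashboard may compute its breach status differently.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window, as stated in the summary line


def error_budget(target_pct: float, uptime_pct: float,
                 window_minutes: int = WINDOW_MINUTES):
    """Return (allowed downtime, actual downtime, fraction of budget used).

    Both downtime values are in minutes. Assumes uptime is the share of
    minutes in the window during which the service was available.
    """
    allowed = window_minutes * (100.0 - target_pct) / 100.0
    actual = window_minutes * (100.0 - uptime_pct) / 100.0
    used = actual / allowed if allowed > 0 else float("inf")
    return allowed, actual, used


# Example: Customer API GW, target 99.9%, measured 99.72%.
allowed, actual, used = error_budget(99.9, 99.72)
print(f"allowed {allowed:.1f} min, actual {actual:.1f} min, "
      f"budget used {used:.1f}x")
```

Under these assumptions a 99.9% target allows about 43.2 minutes of downtime per window, while 99.72% uptime corresponds to roughly 121 minutes, or about 2.8 times the budget.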
Live Log Stream
● LIVE
14:31:02 db-prod-01 mysqld Slow query: 4821ms — SELECT * FROM orders WHERE ...
14:31:05 web-prod-01 nginx GET /api/v2/items 200 12ms — 10.0.5.44
14:31:07 app-prod-02 systemd ERROR: Service app-backend.service entered failed state
14:31:09 core-rt-01 bgpd BGP peer 203.0.113.1 state: Established → Active
14:31:12 k8s-node-03 kubelet Pod flux-collector-7d4f9 started successfully
14:31:14 k8s-node-02 kubelet Memory pressure: evicting pod search-worker-7x2
14:31:16 fw-edge-01 ipsec VPN tunnel vpn-dc-02 established — 10.255.1.1
14:31:18 collector snmp Polled 142 interfaces — 0 errors — queue: 14
// CAPACITY FORECAST & INTEGRATIONS
Capacity Forecast
stor-nas-01 /data
Log ELK Storage
db-prod-01 /var
k8s-node-03 RAM
Metrics TSDB
Integrations
6 CONNECTED
Email SMTP
● connected
Slack
● connected
MS Teams
◑ no webhook
PagerDuty
● connected
ServiceNow
○ disabled
Jira Cloud
● connected
AWS SNS/SMS
● connected
Flux Notify
● bridged
FLUX NOTIFY BRIDGE · ● connected · notifications → flux-notify:4004