FLUX MONITOR v3 — Infrastructure Monitoring Platform

FLUX MONITOR INFRASTRUCTURE v3.0.0 · docker · self-hosted

// SYSTEM HEALTH

Collector

Metrics DB

Alert Engine

WS Server

Log Ingest

P1 Incidents

1

4 total open

▲ 1 new this hour

Active Alerts

18

3 crit · 6 high · 9 warn

▲ +4 this hour

Services Impact

3

14 total / 11 OK

▲ +1 degraded

Hosts Up

142

147 total · 5 issues

▸ 96.6% avail

SLO Health

2

SLOs breached of 14

▲ 1 new breach

Capacity Alerts

2

stor-nas · log-elk

▲ growing

Metrics/sec

48k

99.8% drop-free

▼ -3% load

// SERVICE HEALTH — STATUS PROPAGATION

⚡ PROPAGATION: app-prod-02 (DOWN) → Customer API Gateway (DEGRADED) · 14.2k users affected ⌚ 51 min ago · INC-2847 P1

⚡ PROPAGATION: k8s-node-02 (DEGRADED) + k8s-node-03 (DOWN) → Search & Indexing (PARTIAL OUTAGE) · Search stale 15min ⌚ 2h 18m ago · INC-2846 P2

TIER 1 MISSION CRITICAL Revenue/safety-critical — any degradation triggers immediate response

🚨 Response: 5 min 📣 Notify: SMS + Voice + Slack 📉 SLO: 99.99%

1 DEGRADED 2 OK

💳

Payment Platform

@payments-team

OPERATIONAL

UPTIME 30D

99.97%

SLO TARGET

99.99%

// RESOURCES

db-prod-01 db-prod-02 Billing API Stripe WH

30-day 99.97%

🌐

Customer API Gateway

@platform-team

DEGRADED

UPTIME 30D

99.72%

ERROR RATE

8.4%

// RESOURCES

web-prod-01 web-prod-02 app-prod-02 fw-edge-01

30-day 99.72% ⚠

🗄

Database Cluster

@db-team

OPERATIONAL

UPTIME 30D

99.98%

SLO TARGET

99.99%

// RESOURCES

db-prod-01 db-prod-02

30-day 99.98%

TIER 2 BUSINESS CRITICAL Core business ops — significant user/revenue impact

⏱ Response: 15 min 📣 Notify: Slack + Email + SMS 📉 SLO: 99.9%

1 PARTIAL OUTAGE 3 OK

🏠

Customer Portal

@frontend-team

OK

UPTIME

100.0%

MEMBERS

2/2 OK

🔑

Auth Service

@security-team

OK

UPTIME

99.99%

SLO

✓ MET

🔍

Search & Indexing

@data-team

PARTIAL

UPTIME

98.10%

MEMBERS

2/3 ↓

📧

Email / Notifs

@platform-team

OK

UPTIME

99.85%

SLO

✓ MET

// ACTIVE ALERTS & OPEN INCIDENTS

Active Alerts

18 FIRING

CRIT

app-prod-02 unreachable — ICMP & TCP timeout

HOST: app-prod-02 → INC-2847 P1

0:51

CRIT

BGP Session Down: core-rt-01 → ISP-A (Cogent AS174)

DEVICE: core-rt-01 → INC-2845 P2

0:23

HIGH

SSL Certificate Expiring: api.prod.internal (6 days)

HOST: web-prod-01 · RULE: cert-expiry-7d

6:00

HIGH

K8s Node Memory Pressure: k8s-node-02 at 94%

HOST: k8s-node-02 · RULE: memory-pressure

1:05

WARN

High Disk I/O: db-prod-01 — io_wait 87%

HOST: db-prod-01 · RULE: io-wait-high

2:13

WARN

NAS Storage Warning: stor-nas-01 pool at 91.2%

HOST: stor-nas-01 · RULE: disk-cap-90

4:22

Open Incidents

4 OPEN

P1 — Customer API Degradation · Elevated 5xx

INC-2847 · @platform-sre · OPEN 51m

14:31

P2 — Elasticsearch Split Brain · Search Degraded

INC-2846 · @data-team · OPEN 2h 18m

12:14

P2 — BGP Failover · ISP-A Link Down

INC-2845 · @netops · ACK 23m

14:09

P3 — Log Aggregation Backlog · ELK 8-12m delay

INC-2844 · @infra-team · OPEN 3h 10m

11:22

// HOST STATUS & NETWORK TOPOLOGY

Network Topology

🔌

core-rt-01

🛡

fw-edge-01

⚠️

sw-dist-01

☁

AWS us-east-1

🔀

sw-acc-02

🖥

web-cluster

🖥

app-prod-02 ✕

🗄

db-cluster

WAN IN

820 Mbps

WAN OUT

315 Mbps

LATENCY

4.2 ms

PKT LOSS

0.01%

Host Status

▲ 142 UP · ▼ 5 DOWN

HOST	STATUS	ROLE	CPU	MEM	DISK
web-prod-01 10.0.1.11	UP	web	32%	71%	45%
db-prod-01 10.0.2.10	WARN	db	78%	85%	38%
app-prod-02 10.0.1.22	DOWN	app	UNREACHABLE — metrics unavailable
k8s-node-02 10.0.3.22	WARN	k8s	62%	94%	29%
stor-nas-01 10.0.4.5	WARN	storage	12%	48%	91%
core-rt-01 10.0.0.1	WARN	router	72%	31%	18%

// METRICS & BUSINESS KPI

Host Metrics — db-prod-01

CPU

78%

8-core · 6.2 used

MEM

85%

64GB · 54.4 used

DISK

40%

2TB · 800GB used

IO WAIT

88%

HIGH · ALERT FIRING

CPU 2H TREND

Business KPI

ORDERS / HR

4,280

▲ +8.3% vs 1h avg

API ERROR RATE

8.4%

⚠ above 0.5% target

P99 API LATENCY

312ms

▲ target 250ms

REVENUE / MIN

$18.4k

▲ on target

// SLA / SLO STATUS & LIVE LOGS

SLA / SLO Status

2 BREACHED

Customer API GW

Target: 99.9%

99.72%

Payment Platform

Target: 99.99%

99.97%

Search & Indexing

Target: 99.5%

98.10%

Auth Service

Target: 99.99%

99.99%

Database Cluster

Target: 99.99%

99.98%

Customer Portal

Target: 99.9%

100.0%

30-DAY WINDOW · MTTR avg: 42min · MTBF avg: 18.4d · 2 of 14 SLOs BREACHED
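
For orientation, the availability targets listed above can be translated into a downtime allowance over the same 30-day window. The following is a minimal sketch of that arithmetic, not the platform's own calculation; the function name is hypothetical, and the dashboard does not state how it decides that an SLO counts as breached.

```python
# Illustrative arithmetic only: converts an availability percentage into
# minutes of downtime over a fixed window. Names here are hypothetical.
WINDOW_DAYS = 30  # the 30-day window stated on the SLA / SLO panel


def downtime_minutes(availability_pct: float, window_days: int = WINDOW_DAYS) -> float:
    """Return the minutes of downtime implied by an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return round(window_minutes * (100.0 - availability_pct) / 100.0, 2)


if __name__ == "__main__":
    # Targets that appear on the panel above.
    for target in (99.99, 99.9, 99.5):
        print(f"target {target}% allows {downtime_minutes(target)} min per {WINDOW_DAYS} days")
```

Read this way, a 99.99% target leaves only a few minutes of downtime per window, which is why small differences in the measured figure matter for the Tier 1 services.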
          

Live Log Stream

● LIVE

14:31:02 db-prod-01 mysqld Slow query: 4821ms — SELECT * FROM orders WHERE ...

14:31:05 web-prod-01 nginx GET /api/v2/items 200 12ms — 10.0.5.44

14:31:07 app-prod-02 systemd ERROR: Service app-backend.service entered failed state

14:31:09 core-rt-01 bgpd BGP peer 203.0.113.1 state: Established → Active

14:31:12 k8s-node-03 kubelet Pod flux-collector-7d4f9 started successfully

14:31:14 k8s-node-02 kubelet Memory pressure: evicting pod search-worker-7x2

14:31:16 fw-edge-01 ipsec VPN tunnel vpn-dc-02 established — 10.255.1.1

14:31:18 collector snmp Polled 142 interfaces — 0 errors — queue: 14
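
Each stream entry carries a timestamp, a host, a process name, and a free-text message. A minimal parsing sketch follows, assuming the four fields are separated by whitespace and that host and process names contain no spaces; the `LogEntry` type and `parse_log_line` function are hypothetical and are not part of Flux Monitor.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    """One log stream entry (hypothetical structure for illustration)."""
    time: str
    host: str
    process: str
    message: str


def parse_log_line(line: str) -> LogEntry:
    """Split a line into time, host, process, and the remaining message."""
    # maxsplit=3 keeps the message intact even though it contains spaces.
    parts = line.strip().split(None, 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return LogEntry(*parts)


if __name__ == "__main__":
    entry = parse_log_line(
        "14:31:14 k8s-node-02 kubelet Memory pressure: evicting pod search-worker-7x2"
    )
    print(entry.host, "|", entry.process, "|", entry.message)
```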

// CAPACITY FORECAST & INTEGRATIONS

Capacity Forecast

stor-nas-01 /data

~18 days

Log ELK Storage

~22 days

db-prod-01 /var

~47 days

k8s-node-03 RAM

~90 days

Metrics TSDB

>6 months
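
The dashboard does not say how these estimates are produced. One common approach is a linear projection from current usage and a recent growth rate; the sketch below assumes that model. The function name and the growth rate used in the example are hypothetical, and only the 91.2% usage figure comes from the stor-nas-01 alert above.

```python
import math


def days_until_full(used_pct: float, growth_pct_per_day: float) -> float:
    """Linear estimate of days until a resource reaches 100% usage.

    Assumes constant growth, which real capacity curves rarely follow exactly.
    """
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError("used_pct must be between 0 and 100")
    if growth_pct_per_day <= 0:
        # Flat or shrinking usage never fills up under this model.
        return math.inf
    return (100.0 - used_pct) / growth_pct_per_day


if __name__ == "__main__":
    # 91.2% is the stor-nas-01 pool usage shown above;
    # 0.5 percentage points per day is an invented growth rate.
    print(f"{days_until_full(91.2, 0.5):.1f} days")
```

A linear model is the simplest choice and tends to be too optimistic when growth accelerates, so a short forecast like the ones at the top of this list deserves attention well before the estimate runs out.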

Integrations

6 CONNECTED

📧

Email SMTP

● connected

💬

Slack

● connected

🔷

MS Teams

◑ no webhook

🚨

PagerDuty

● connected

🎟

ServiceNow

○ disabled

🔵

Jira Cloud

● connected

📱

AWS SNS/SMS

● connected

🔗

Flux Notify

● bridged

FLUX NOTIFY BRIDGE · ● connected · notifications → flux-notify:4004
          

Operational

10

all clear

Degraded

2

Customer API GW · Log Aggregation

Partial Outage

1

Search & Indexing

Major Outage

0

none

Maintenance

1

Load Test Lab

Total Services

14

across 4 tiers

⚡ PROPAGATION: 2 resource failures → 2 services degraded/outage → 14.2k users impacted

TIER 1 MISSION CRITICAL Revenue or safety-critical. Any degradation = immediate P1 response.

🚨 Response SLA: 5 min 📣 Channels: SMS + Voice + Slack + Email 📉 Availability SLO: 99.99% ⚠ Degrade trigger: >1% members down

1 DEGRADED 2 OK

💳

Payment Platform

@payments-team · Tier 1

OPERATIONAL

UPTIME 30D

99.97%

SLO TARGET

99.99%

MEMBERS

4/4

ERROR BUDGET

82%

// RESOURCES

db-prod-01 (warn) · db-prod-02 · Billing API · Stripe Webhook

30d 99.97%

🌐

Customer API Gateway

@platform-team · Tier 1

DEGRADED

UPTIME 30D

99.72%

ERROR RATE

8.4%

MEMBERS

3/4 ↓

ERROR BUDGET

28%

// RESOURCES — PROPAGATION SOURCE

web-prod-01 · web-prod-02 · app-prod-02 ← DOWN · fw-edge-01

30d 99.72% ⚠ SLO BREACH

🗄

Database Cluster

@db-team · Tier 1

OPERATIONAL

UPTIME 30D

99.98%

SLO TARGET

99.99%

MEMBERS

2/2

ERROR BUDGET

91%

// RESOURCES

db-prod-01 (high io) · db-prod-02

30d 99.98%

TIER 2 BUSINESS CRITICAL Core business operations. Significant user/revenue impact if degraded.

⏱ Response SLA: 15 min 📣 Channels: Slack + Email + SMS 📉 Availability SLO: 99.9% ⚠ Degrade trigger: >10% members down

1 PARTIAL OUTAGE 3 OK

🏠

Customer Portal

@frontend-team · Tier 2

OPERATIONAL

UPTIME

100.0%

MEMBERS

2/2

🔑

Authentication Service

@security-team · Tier 2

OPERATIONAL

UPTIME

99.99%

SLO

✓ MET

🔍

Search & Indexing

@data-team · Tier 2

PARTIAL OUTAGE

UPTIME

98.10%

MEMBERS

2/3 ↓

// PROPAGATION SOURCE

k8s-node-01 · k8s-node-02 · k8s-node-03 ← DOWN

📧

Email / Notifications

@platform-team · Tier 2

OPERATIONAL

UPTIME

99.85%

SLO

✓ MET

TIER 3 OPERATIONAL Internal tools and supporting services. Limited user/revenue impact.

⏱ Response SLA: 60 min 📣 Channels: Slack + Email 📉 Availability SLO: 99.0% ⚠ Degrade trigger: >25% members down

1 DEGRADED 3 OK

⚙

CI/CD Pipeline

@devops-team · Tier 3

OPERATIONAL

UPTIME

99.40%

MEMBERS

OK

🔒

Internal VPN

@netops-team · Tier 3

OPERATIONAL

UPTIME

99.90%

MEMBERS

OK

📊

Monitoring Stack

@infra-team · Tier 3

OPERATIONAL

UPTIME

99.98%

MEMBERS

OK

📜

Log Aggregation

@infra-team · Tier 3

DEGRADED

UPTIME

97.50%

INGEST LAG

12 min

TIER 4 DEVELOPMENT / LAB Non-production environments. Best-effort availability. No on-call escalation.

⏱ Response SLA: 240 min (best effort) 📣 Channels: Email only 📉 Availability SLO: 90.0% ⚠ Degrade trigger: >50% members down

1 MAINTENANCE 2 OK

🧪

Staging Environment

@devops-team · Tier 4

OPERATIONAL

UPTIME

98.20%

SLO

✓ MET

🛠

Dev Sandbox

@devops-team · Tier 4

OPERATIONAL

UPTIME

99.10%

NO SLA

best-effort

🏋

Load Testing Lab

@qa-team · Tier 4

MAINTENANCE

UPTIME

85.00%

STATUS

🔧 MAINT
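
Each tier header above lists a degrade trigger expressed as a percentage of members down (>1%, >10%, >25%, >50%). A minimal sketch of that threshold check follows. It assumes the trigger is a strict comparison on the share of members down; the function names are hypothetical, and how the platform then chooses between DEGRADED and PARTIAL OUTAGE is not stated in the dashboard, so the sketch only reports whether the trigger fires.

```python
# Degrade-trigger thresholds (percent of members down) per tier,
# copied from the tier headers above.
DEGRADE_TRIGGER_PCT = {1: 1.0, 2: 10.0, 3: 25.0, 4: 50.0}


def percent_down(members_up: int, members_total: int) -> float:
    """Share of a service's members that are down, as a percentage."""
    if members_total <= 0 or not 0 <= members_up <= members_total:
        raise ValueError("expected members_total > 0 and 0 <= members_up <= members_total")
    return 100.0 * (members_total - members_up) / members_total


def degrade_triggered(tier: int, members_up: int, members_total: int) -> bool:
    """True when the share of members down exceeds the tier's trigger."""
    return percent_down(members_up, members_total) > DEGRADE_TRIGGER_PCT[tier]


if __name__ == "__main__":
    # Tier 1 service with one of four resources down (app-prod-02).
    print(degrade_triggered(1, 3, 4))
    # Tier 2 service with one of three nodes down (k8s-node-03).
    print(degrade_triggered(2, 2, 3))
    # Tier 1 service with all four members up.
    print(degrade_triggered(1, 4, 4))
```

With a 1% trigger, a Tier 1 service is flagged as soon as any single member fails, while a Tier 4 service tolerates the loss of up to half its members.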