Flux AI

// OVERVIEW

Flux AI is the fifth application in the Flux Suite, providing NOC intelligence and automation across the entire platform. It operates as a standalone Docker service with its own web interface, API, and database — while maintaining live read access to the data stores of all other Flux apps.

⚡ Auto-Triage

AI assigns severity, service, and priority to every incoming alert in under 15 seconds, with natural-language summaries and confidence scores.

💬 NOC Chatbot

Natural language queries over live incident data, metric baselines, and runbooks. Ask anything about your infrastructure in plain English.

📡 Anomaly Detection

ML-based baselining over a 14-day rolling window by default. Detection methods include Z-score, Isolation Forest, and seasonal decomposition.

🔍 Root Cause Analysis

AI-guided investigation workflows that correlate events across all Flux apps and visualize causal chains with evidence scoring.

📋 Runbook Suggestions

Matches active incidents to historical resolution patterns and surfaces the most relevant runbooks with confidence percentages.

📝 Post-Mortem Drafts

Auto-generates structured post-mortems from incident thread timelines, audit logs, and metric data — ready to edit and publish.

🔇 Noise Reduction

Intelligent alert suppression using maintenance windows, flap detection, duplicate correlation, and below-threshold filtering.

📈 Capacity Forecasting

Predictive resource modeling using historical metric trends. Forecasts CPU, memory, disk, and bandwidth with configurable alert horizons.

🔧 Auto-Remediation

Safe self-healing actions with human approval gates. Restart pods, scale services, flush caches — all with full audit trail and rollback.

Architecture at a Glance

Component	Technology	Purpose
`flux-ai` web	PHP 8.2 / Apache	Dashboard, API, configuration UI
`flux-ai-worker`	Node.js 18	ML inference pipeline, alert processor, scheduler
`flux-ai-db`	PostgreSQL 16	Model data, incident history, anomaly baselines
`flux-ai-redis`	Redis 7	Inference cache, queue, session store
`ollama`	Ollama runtime	Local LLM — llama3.1 on GPU/CPU

// INSTALLATION

Quick Start

# Clone and start (Linux/macOS)
git clone https://github.com/your-org/flux-ai.git
cd flux-ai
cp .env.example .env
nano .env          # set passwords, Anthropic API key, Ollama URL
chmod +x start.sh && ./start.sh

# Windows (Command Prompt, from the cloned flux-ai directory)
copy .env.example .env
notepad .env
start.bat

Default Ports

Service	Host Port	Container Port	Access
Flux AI Web	`4007`	80	Public via NPM
WebSocket	`4008`	8080	Internal / NPM
PostgreSQL	`5438`	5432	localhost only
Redis	`6382`	6379	Internal only
Adminer	`6047`	8080	Tailscale only
Ollama API	`11434`	11434	Internal only

Environment Variables

# Required
ANTHROPIC_API_KEY=sk-ant-...           # Claude API key (for RCA, post-mortems)
OLLAMA_URL=http://ollama:11434         # Ollama endpoint (local LLM)
OLLAMA_MODEL=llama3.1:8b              # Model to use (8b fits on 8GB VRAM)

# Database
POSTGRES_PASSWORD=ChangeMe!
REDIS_PASSWORD=ChangeMe!

# Cross-app integration (read-only DB access)
NOTIFY_DB_URL=postgresql://user:pass@flux-notify-db:5432/notify
MONITOR_DB_URL=postgresql://user:pass@flux-monitor-db:5432/monitor
EVENT_DB_URL=postgresql://user:pass@flux-event-db:5432/event

# Optional
ANTHROPIC_MODEL=claude-sonnet-4-6     # Default Claude model
ANOMALY_LOOKBACK_DAYS=14              # Baseline window
NOISE_SUPPRESSION_ENABLED=true
PREDICTIVE_ALERT_HORIZON_HOURS=4
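
As an illustration of how a service might consume the optional variables above, here is a minimal Python sketch. The `OptionalSettings` class and `load_optional_settings` function are hypothetical names, and treating the values shown above as fallback defaults is an assumption rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OptionalSettings:
    anthropic_model: str
    anomaly_lookback_days: int
    noise_suppression_enabled: bool
    predictive_alert_horizon_hours: int


def load_optional_settings(env: Optional[Mapping[str, str]] = None) -> OptionalSettings:
    """Read the optional variables, falling back to the values shown above."""
    env = os.environ if env is None else env
    return OptionalSettings(
        anthropic_model=env.get("ANTHROPIC_MODEL", "claude-sonnet-4-6"),
        anomaly_lookback_days=int(env.get("ANOMALY_LOOKBACK_DAYS", "14")),
        # Assumption: only the literal string "true" (any case) enables the flag.
        noise_suppression_enabled=env.get("NOISE_SUPPRESSION_ENABLED", "true").lower() == "true",
        predictive_alert_horizon_hours=int(env.get("PREDICTIVE_ALERT_HORIZON_HOURS", "4")),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.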

⚠️ Ollama on CPU

Ollama runs on CPU if no GPU is available, but inference will be significantly slower (~10–30s per request vs. ~500ms with GPU). For production NOC use, a CUDA-capable GPU with at least 8GB VRAM is strongly recommended.

// AI PROVIDERS

Flux AI uses a dual-provider strategy to balance response speed, cost, and analytical depth. Both providers can be active simultaneously with intelligent routing based on task type.

🦙

Ollama — Local LLM

Self-hosted LLM running on your own hardware. No data leaves your network. Ideal for latency-sensitive operations like real-time triage and chatbot responses.

LOCAL · PRIVATE · ~480ms latency · llama3.1:8b default · $0 / request

🤖

Anthropic Claude — Cloud API

Claude API for complex multi-step reasoning tasks. Superior performance on root cause analysis, post-mortem drafting, and detailed incident summarization.

CLOUD · ANTHROPIC · ~2.1s latency · claude-sonnet-4-6 · pay-per-token

💡 Cost Optimization

Routing 97% of requests to local Ollama (triage, chat, capacity) and only using the Claude API for complex analysis (RCA, post-mortems) reduces API costs by ~$1,200/month on a busy NOC compared to API-only operation.

// PROVIDER ROUTING

The routing engine assigns each task to the optimal AI provider based on complexity, latency requirements, and configured rules.

Task	Default Provider	Fallback	Reason
Alert triage & severity scoring	Ollama	Claude API	Latency-critical — must complete in <15s
NOC chatbot responses	Ollama	Claude API	Interactive — users expect <1s response
Anomaly explanation	Claude API	Ollama	Nuanced reasoning over metric patterns
Root cause analysis	Claude API	—	Complex multi-source reasoning required
Post-mortem drafting	Claude API	Ollama	Long-form structured writing
Runbook matching	Ollama	Claude API	Pattern match against local runbook index
Capacity forecasting	Ollama	Claude API	Numeric trend analysis, no long-form output
Noise suppression rules	Rules engine	Ollama	Deterministic — no AI needed for most cases
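
The default-plus-fallback behavior in the table above can be sketched in a few lines of Python. This is an illustrative model only: the task keys (other than `triage`), the `ROUTES` dictionary, and the `is_healthy` callback are assumptions, and the real routing engine may weigh additional factors. Noise suppression is left out because it is handled by the rules engine rather than an AI provider.

```python
from typing import Callable, Dict, Optional, Tuple

# (default provider, fallback provider) per task, copied from the table above.
ROUTES: Dict[str, Tuple[str, Optional[str]]] = {
    "triage": ("ollama", "claude"),
    "chat": ("ollama", "claude"),
    "anomaly_explanation": ("claude", "ollama"),
    "rca": ("claude", None),  # no fallback listed for root cause analysis
    "postmortem": ("claude", "ollama"),
    "runbook_matching": ("ollama", "claude"),
    "capacity": ("ollama", "claude"),
}


def pick_provider(task: str, is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the default provider if healthy, else the fallback, else None."""
    default, fallback = ROUTES[task]
    if is_healthy(default):
        return default
    if fallback is not None and is_healthy(fallback):
        return fallback
    return None
```

For example, `pick_provider("triage", lambda name: name == "claude")` falls back to `"claude"` when Ollama is reported unhealthy, while an RCA request with Claude unavailable yields `None`.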

Custom Routing Rules

Override default routing via the Settings → AI Providers → Routing Rules panel, or via the API:

POST /api/ai/routing/rules
{
  "task": "triage",
  "provider": "claude",
  "condition": "severity == P1",
  "reason": "P1 incidents always use Claude for higher accuracy"
}
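
A minimal Python sketch of preparing that request with the standard library follows. The base URL (taken from the default web port) and the use of a JWT Bearer token are assumptions about your deployment, and `build_routing_rule_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_routing_rule_request(base_url: str, token: str, rule: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for a custom routing rule."""
    body = json.dumps(rule).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/ai/routing/rules",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


rule = {
    "task": "triage",
    "provider": "claude",
    "condition": "severity == P1",
    "reason": "P1 incidents always use Claude for higher accuracy",
}
request = build_routing_rule_request("http://localhost:4007", "YOUR_JWT", rule)
# Send with urllib.request.urlopen(request) once the token is valid.
```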

// AUTO-TRIAGE ENGINE

The triage engine processes every incoming alert from Flux Event and Flux Monitor within seconds, automatically assigning severity, identifying the affected service, and generating a human-readable summary.

Triage Pipeline

Alert ingestion — receives raw alert payload from Flux Event processor or Flux Monitor threshold breach
Context enrichment — pulls service topology, recent incidents, and metric baselines from cross-app DB connections
Severity scoring — AI assigns P1–P4 with confidence percentage based on impact assessment
Pattern matching — compares against 90-day incident history to identify recurring issues
Summary generation — produces a concise, actionable incident summary in plain English
Runbook linking — surfaces the top 3 matching runbooks from the runbook library
Notification routing — triggers Flux Notify with AI-enriched incident data
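
One way to picture the stages above is as an ordered list of functions that each enrich the alert. The Python sketch below is a simplified illustration, not the worker's implementation: the stage functions are placeholders with hard-coded values, the field names are hypothetical, and only two of the seven stages are shown.

```python
from typing import Any, Callable, Dict, List, Tuple

Alert = Dict[str, Any]
Stage = Tuple[str, Callable[[Alert], Alert]]


def run_pipeline(alert: Alert, stages: List[Stage]) -> Tuple[Alert, List[str]]:
    """Apply each stage in order and return the enriched alert plus a trace."""
    current = dict(alert)  # copy so the caller's payload is not mutated
    trace: List[str] = []
    for name, stage in stages:
        current = stage(current)
        trace.append(name)
    return current, trace


# Placeholder stages: real enrichment and scoring would query the
# cross-app databases and an AI provider instead of hard-coding values.
def enrich(alert: Alert) -> Alert:
    return {**alert, "recent_incidents": 2}


def score(alert: Alert) -> Alert:
    severity = "P2" if alert.get("error_rate", 0) > 0.10 else "P3"
    return {**alert, "severity": severity}


STAGES: List[Stage] = [("context_enrichment", enrich), ("severity_scoring", score)]
```

Keeping each stage a pure function of the alert makes it straightforward to test stages in isolation or reorder them.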

Severity	AI Criteria	Response SLA
P1	Customer-facing service down, revenue impact, SLA breach imminent	Page on-call immediately
P2	Significant degradation, partial outage, >10% error rate	Acknowledge within 15 min
P3	Non-critical degradation, single component affected	Acknowledge within 2 hours
P4	Informational, within SLA, no user impact	Review next business day

📊 Triage Performance

Average triage completion time: 11 seconds. Severity accuracy vs. manual assignment: 94.2%. False P1 rate: 1.8%. Alert sources that repeatedly trigger false P1s are reviewed and suppressed automatically.

// NOC CHATBOT

The NOC chatbot provides a natural language interface over your live incident data, metric baselines, runbooks, and historical patterns. Ask questions in plain English — no query language required.

Example Queries

Query	What Flux AI does
"What caused INC-4471?"	Retrieves incident timeline, correlates with metric data, returns AI-generated root cause summary
"Are there any P1 incidents right now?"	Queries Flux Notify for open P1 threads, returns summary with service, duration, and on-call status
"What's the MTTR trend this week?"	Queries historical incident data, calculates per-severity MTTR, compares to baseline
"Is the payments service healthy?"	Pulls current metrics from Flux Monitor, checks open incidents, returns health assessment
"Draft a post-mortem for INC-4468"	Initiates post-mortem generation — routed to Claude API for structured long-form output
"Which services have anomalies?"	Returns anomaly detection feed sorted by severity score

Context Window

The chatbot maintains a sliding context window including the last 20 conversation turns, all currently open incidents, recent anomaly signals, and your team's runbook library. This context is injected into every AI request, enabling coherent multi-turn conversations about ongoing incidents.
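
The sliding window described above can be modeled with a bounded queue. This Python sketch is illustrative: the `ChatContext` class and its field names are hypothetical, and how the context is actually serialized into a prompt is not specified here.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List

MAX_TURNS = 20  # the last 20 conversation turns are kept


class ChatContext:
    def __init__(self, max_turns: int = MAX_TURNS) -> None:
        # deque(maxlen=...) silently drops the oldest turn once full.
        self.turns: Deque[Dict[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append({"role": role, "text": text})

    def build(self, open_incidents: Iterable[str], anomalies: Iterable[str],
              runbooks: Iterable[str]) -> Dict[str, List]:
        """Assemble the context injected alongside each AI request."""
        return {
            "turns": list(self.turns),
            "open_incidents": list(open_incidents),
            "anomalies": list(anomalies),
            "runbooks": list(runbooks),
        }
```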

// ANOMALY DETECTION

Flux AI continuously monitors metrics from Flux Monitor and Flux Speed, building rolling baselines and surfacing statistically significant deviations before they become outages.

Detection Methods

Method	Best For	Baseline Window
Z-score statistical baseline	Steady-state metrics (error rates, latency)	14-day rolling
Isolation Forest (ML)	Multi-dimensional anomaly clusters	Retrained weekly
Seasonal decomposition (STL)	Metrics with daily/weekly patterns (traffic, batch jobs)	30-day rolling
Cross-metric correlation	Cascading failures across related services	Real-time
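
To make the first method in the table concrete, here is a minimal Z-score check in Python using only the standard library. The threshold of 3 standard deviations is a common convention and an assumption here, not a documented Flux AI default. The `baseline` argument stands in for the samples from the rolling window.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(baseline: Sequence[float], value: float) -> float:
    """Number of standard deviations `value` lies from the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation is infinitely unusual.
        if value == mu:
            return 0.0
        return float("inf") if value > mu else float("-inf")
    return (value - mu) / sigma


def is_anomalous(baseline: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    return abs(z_score(baseline, value)) >= threshold
```

Z-scores suit steady-state metrics because they assume a stable mean. For metrics with daily or weekly cycles, the seasonal method in the table is the better fit.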

Anomaly Score Levels

Score	Severity	Action
>300% above baseline	CRITICAL	Immediate P1/P2 triage, page on-call
100–300% above baseline	HIGH	Create P2/P3 incident, monitor closely
30–100% above baseline	MEDIUM	Log anomaly, send digest notification
10–30% above baseline	LOW	Record for trend analysis, no alert
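
The score bands above translate directly into a lookup function. In this Python sketch the handling of boundary values (for example, exactly 100% or exactly 30%) is an assumption, since the table lists them in two adjacent bands; here a boundary value is assigned to the higher band, except 300%, which stays HIGH because CRITICAL is defined as strictly above 300%.

```python
from typing import Optional


def percent_above_baseline(value: float, baseline: float) -> float:
    """Deviation of `value` from `baseline`, as a percentage of the baseline."""
    return (value - baseline) / baseline * 100.0


def anomaly_level(pct_above: float) -> Optional[str]:
    """Map a percentage above baseline to the severity bands in the table."""
    if pct_above > 300:
        return "CRITICAL"
    if pct_above >= 100:
        return "HIGH"
    if pct_above >= 30:
        return "MEDIUM"
    if pct_above >= 10:
        return "LOW"
    return None  # below 10%: not recorded as an anomaly
```

A baseline of zero raises `ZeroDivisionError` in `percent_above_baseline`, so callers would need to handle metrics that are normally zero separately.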

// ROOT CAUSE ANALYSIS

Flux AI's RCA engine guides engineers through structured investigation workflows, automatically correlating events across all Flux Suite applications to identify the origin of cascading failures.

RCA Process

When an RCA is initiated for an incident, Flux AI:

Pulls the full event timeline from Flux Event for the affected service and its dependencies
Correlates metric anomalies from Flux Monitor in a configurable time window before the incident
Cross-references recent deploys, config changes, and maintenance windows from the audit log
Runs causal chain analysis using the Claude API to identify the most likely root cause with evidence scoring
Presents findings with a confidence score and supporting evidence for human review
Feeds accepted root causes back into the pattern library for future triage improvement
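
The "configurable time window before the incident" used in the correlation steps can be sketched as a simple filter. The event shape (a dict with an `at` timestamp) and the 30-minute default are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def events_before_incident(events: List[Dict], incident_start: datetime,
                           window: timedelta = timedelta(minutes=30)) -> List[Dict]:
    """Return events inside the lookback window, oldest first."""
    lower = incident_start - window
    selected = [e for e in events if lower <= e["at"] <= incident_start]
    return sorted(selected, key=lambda e: e["at"])
```

Ordering candidates by time gives the causal chain analysis a timeline to reason over, with the earliest change in the window listed first.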

📊 RCA Accuracy

Root cause correctly identified on first analysis: 86% of incidents. When engineers flag a result as "wrong root cause", the model is retrained on the corrected data automatically overnight.

// RUNBOOK SUGGESTIONS

The runbook engine maintains an indexed library of resolution procedures and automatically surfaces the most relevant runbooks for each active incident based on pattern matching against incident history.

Managing Runbooks

Runbooks are created and managed at Settings → Runbooks. Each runbook contains a title, description, service tags, symptom patterns, and numbered resolution steps. When a new incident is triaged, Flux AI scores all runbooks against the incident context and presents the top 3 matches with a confidence percentage.
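
As a rough illustration of scoring runbooks against incident context, the Python sketch below counts overlapping service tags and symptom phrases. This simple overlap heuristic is an assumption for demonstration. The actual matching is AI-driven and draws on incident history, so real confidence percentages will differ.

```python
from typing import Dict, Iterable, List, Tuple


def score_runbook(runbook: Dict, incident_text: str, incident_tags: Iterable[str]) -> float:
    """Percentage of the runbook's tags and symptom patterns found in the incident."""
    text = incident_text.lower()
    tags = {t.lower() for t in incident_tags}
    tag_hits = sum(1 for t in runbook["service_tags"] if t.lower() in tags)
    symptom_hits = sum(1 for s in runbook["symptom_patterns"] if s.lower() in text)
    total = len(runbook["service_tags"]) + len(runbook["symptom_patterns"])
    return 0.0 if total == 0 else 100.0 * (tag_hits + symptom_hits) / total


def top_matches(runbooks: List[Dict], incident_text: str, incident_tags: Iterable[str],
                limit: int = 3) -> List[Tuple[str, float]]:
    """Return up to `limit` (title, score) pairs, best match first."""
    tags = list(incident_tags)
    scored = [(rb["title"], score_runbook(rb, incident_text, tags)) for rb in runbooks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```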

Runbook API

# List all runbooks
GET /api/ai/runbooks

# Get suggestions for an incident
GET /api/ai/runbooks/suggest?incident_id=INC-4471

# Create a new runbook
POST /api/ai/runbooks
{
  "title": "PostgreSQL Replication Recovery",
  "service_tags": ["postgresql", "payments", "database"],
  "symptom_patterns": ["replication lag", "write amplification"],
  "steps": ["Check replica lag", "Identify blocking queries", "..."]
}

// POST-MORTEM DRAFTING

Flux AI automatically generates structured post-mortem documents from the incident thread timeline, metric data, and audit log entries. Drafts are created when an incident is marked Resolved and are available immediately for review and editing.

Draft Structure

Summary — what happened, duration, and overall impact
Impact assessment — affected services, user impact, estimated revenue impact
Timeline — chronological event log pulled from Flux Notify thread and Flux Event audit trail
Root cause — AI-identified root cause with evidence chain
Contributing factors — secondary factors that amplified the incident
What went well — detection time, response quality, tooling effectiveness
Action items — AI-suggested preventive measures with assignee fields

📝 Editing & Publishing

All post-mortem drafts are fully editable before publication. Engineers review and refine the AI draft, then publish to the internal wiki or export as PDF. Human review is always required — Flux AI does not auto-publish.

// NOISE REDUCTION

Alert fatigue is one of the leading causes of NOC burnout. Flux AI's noise reduction engine intelligently suppresses low-signal alerts so on-call teams only receive actionable notifications.

Suppression Types

Type	Description	Configuration
Maintenance window	All alerts from services in a maintenance window are suppressed	Scheduled via Flux Notify + Flux Event
Flap detection	Alerts that fire and clear more than N times per hour are suppressed once the threshold is exceeded	`FLAP_THRESHOLD=3` (default)
Duplicate correlation	Child alerts caused by a known P1 parent are suppressed and linked	Automatic via service topology
Below threshold	Alerts from metrics below a configured significance level	Per-metric in Flux Monitor thresholds
Transient suppression	Alerts that resolve within N seconds are not paged	`TRANSIENT_TTL=30` (default)
Staging environment	Alerts from non-production environments filtered from production NOC view	Environment tag on service
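
Flap detection from the table above lends itself to a short sketch: count fire-and-clear cycles per alert in a sliding one-hour window and suppress once the count exceeds `FLAP_THRESHOLD`. The `FlapDetector` class and its in-memory storage are hypothetical; a production version would likely keep this state in Redis.

```python
from collections import deque
from typing import Deque, Dict

FLAP_THRESHOLD = 3      # default from the table above
WINDOW_SECONDS = 3600   # "per hour"


class FlapDetector:
    def __init__(self, threshold: int = FLAP_THRESHOLD,
                 window_seconds: int = WINDOW_SECONDS) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._cycles: Dict[str, Deque[float]] = {}

    def record_cycle(self, alert_key: str, timestamp: float) -> bool:
        """Record one fire-and-clear cycle; return True if the alert is now flapping."""
        cycles = self._cycles.setdefault(alert_key, deque())
        cycles.append(timestamp)
        # Drop cycles that have aged out of the sliding window.
        while cycles and timestamp - cycles[0] > self.window_seconds:
            cycles.popleft()
        return len(cycles) > self.threshold
```

With the default threshold, the fourth cycle inside an hour is the first one suppressed, matching "more than N times per hour".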

// AUTO-REMEDIATION

Flux AI can execute a library of safe self-healing actions, always with configurable human approval gates. Every action is logged in the audit trail with full rollback capability.

⚠️ Human Approval Gates

All remediation actions require explicit approval by default. Actions can be configured for auto-execute only after 30+ successful manual approvals with no incidents for a given runbook. Never enable auto-execute on destructive or irreversible actions.
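
The approval-gate policy above can be expressed as a small eligibility check. The `RemediationStats` record and its field names are hypothetical; the 30-approval minimum and the ban on destructive actions come from the policy described above.

```python
from dataclasses import dataclass

MIN_MANUAL_APPROVALS = 30


@dataclass
class RemediationStats:
    successful_manual_approvals: int
    incidents_caused: int
    destructive: bool  # destructive or irreversible actions never auto-execute


def may_auto_execute(stats: RemediationStats) -> bool:
    """True only when the track record and action type both allow auto-execution."""
    if stats.destructive:
        return False
    return (
        stats.successful_manual_approvals >= MIN_MANUAL_APPROVALS
        and stats.incidents_caused == 0
    )
```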

Available Actions

Kubernetes: pod restart, deployment rollout restart, horizontal scale, rollback to previous revision
Caching: Redis cache flush, CDN cache purge, application-level cache clear
Database: kill long-running queries, connection pool reset, read replica failover
Networking: DNS cache flush, routing table refresh, LB backend drain
Application: circuit breaker trip, feature flag disable, rate limit increase

// CAPACITY FORECASTING

Flux AI analyzes historical metric trends from Flux Monitor and produces resource utilization forecasts with configurable alert horizons, enabling proactive scaling decisions before capacity limits are hit.

Monitored Resources

CPU utilization per host, service, and cluster
Memory utilization with leak detection trend analysis
Disk usage with ingest rate modeling per database and volume
Network bandwidth — ingress and egress per interface and VPC
Application-layer: request rate, queue depth, active connections

Alert Horizons

Setting	Default	Description
`CAPACITY_ALERT_DAYS`	14	Days before projected breach to fire forecast alert
`CAPACITY_THRESHOLD_CPU`	80	CPU % that triggers forecast when projected to be reached
`CAPACITY_THRESHOLD_DISK`	85	Disk % that triggers forecast
`CAPACITY_THRESHOLD_MEM`	85	Memory % that triggers forecast
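
To show how these settings interact, the Python sketch below fits a straight line to daily utilization samples and estimates the days until a threshold is reached. Linear extrapolation from one sample per day is a simplifying assumption. The forecasting model Flux AI actually uses is not specified here, and the function names are hypothetical.

```python
from typing import Optional, Sequence

CAPACITY_ALERT_DAYS = 14
CAPACITY_THRESHOLD_DISK = 85.0


def daily_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of one-per-day utilization samples (percent per day)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def days_until_breach(samples: Sequence[float], threshold: float) -> Optional[float]:
    """Projected days until the threshold is hit, or None if usage is not growing."""
    current = samples[-1]
    if current >= threshold:
        return 0.0
    slope = daily_slope(samples)
    if slope <= 0:
        return None
    return (threshold - current) / slope


def should_alert(samples: Sequence[float], threshold: float,
                 horizon_days: int = CAPACITY_ALERT_DAYS) -> bool:
    days = days_until_breach(samples, threshold)
    return days is not None and days <= horizon_days
```

For disk usage growing from 60% to 68% over five days (2 points per day), the projected breach of the 85% threshold is 8.5 days away, which is inside the 14-day horizon and would fire a forecast alert.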

// PREDICTIVE ALERTING

Beyond detecting anomalies in current metrics, Flux AI surfaces leading indicators — early warning signals that historically precede specific failure modes — allowing on-call engineers to act before users are impacted.

Leading Indicator Examples

Leading Indicator	Predicted Failure	Avg Lead Time
DB connection pool >85% full	Connection exhaustion outage	8 min
Cache hit rate dropping >15%/hr	Origin overload / cache miss storm	12 min
Free disk space falling >2% per hour	Disk full / write failure	22 min
GC pause time trending up 3 consecutive samples	OOM crash or GC thrashing	15 min
Message queue depth doubling every 5 min	Consumer backlog / message drop	20 min
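
Two of the indicators above are simple enough to sketch directly. In this Python example, "trending up 3 consecutive samples" is interpreted as three consecutive increases (which needs four samples). That reading, along with the function names, is an assumption.

```python
from typing import Sequence


def rising_streak(samples: Sequence[float], length: int = 3) -> bool:
    """True if the most recent `length` changes were all increases."""
    if len(samples) < length + 1:
        return False
    tail = samples[-(length + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))


def pool_near_exhaustion(in_use: int, pool_size: int, limit: float = 0.85) -> bool:
    """True if the connection pool is more than 85% full."""
    return in_use / pool_size > limit
```

Checks like these are cheap to evaluate on every sample, which is what lets a leading indicator fire minutes ahead of the failure it predicts.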

// CROSS-APP INTELLIGENCE

Flux AI is the only Flux Suite application with read access to all other apps' databases. This cross-app view enables unified analysis that would be impossible from any single app.

Data Sources

Source App	Data Used By Flux AI
Flux Notify	Incident threads, message history, resolution times, on-call assignments, escalation chains
Flux Monitor	Service metrics, health check results, SLA status, topology maps, capacity data
Flux Event	Event stream, correlation rules, deduplication logs, pipeline stage timings
Flux Speed	Network throughput baselines, latency trends, traceroute history, circuit performance

🔒 Read-Only Access

Flux AI only reads from other apps' databases — it never writes to them directly. All actions that create or modify data in other apps (e.g., creating a Flux Notify incident) use the official REST APIs of those apps, ensuring a clean and auditable integration boundary.

// RBAC & PERMISSIONS

Role	Triage / Chat	RCA / Post-Mortem	Runbooks	Remediation	Config
Admin	✓	✓	Create / Edit / Delete	Approve & Execute	Full
Operator	✓	✓	View & Use	Approve only	Limited
Analyst	✓ (read-only)	View drafts	View only	—	—
Viewer	Dashboard only	—	—	—	—

// REST API

All Flux AI features are accessible via REST API. Authentication uses the same three methods as the rest of the Flux Suite (JWT Bearer, X-API-Token, X-API-Key).

Method	Endpoint	Description	Auth
POST	`/api/ai/triage`	Triage an alert payload — returns severity, summary, runbook matches, and confidence	Bearer / Token
GET	`/api/ai/triage/feed`	List recent triage results with AI summaries and assigned severities	Bearer / Token
POST	`/api/ai/chat`	Send a chat message — returns AI response with context from live incident data	Bearer / Token
GET	`/api/ai/anomalies`	List active anomaly signals with score, baseline delta, and linked incidents	Bearer / Token
POST	`/api/ai/rca`	Initiate root cause analysis for an incident ID — async, returns job ID	Bearer / Token
GET	`/api/ai/rca/{job_id}`	Poll RCA job status and retrieve results when complete	Bearer / Token
GET	`/api/ai/runbooks`	List all runbooks in the library with tags and usage stats	Bearer / Token
POST	`/api/ai/runbooks`	Create a new runbook entry with steps, symptom patterns, and service tags	Admin
GET	`/api/ai/runbooks/suggest`	Get runbook suggestions for an incident ID — returns top 3 with match scores	Bearer / Token
POST	`/api/ai/postmortem`	Generate post-mortem draft for an incident ID — routed to Claude API	Bearer / Token
GET	`/api/ai/postmortem/{id}`	Retrieve post-mortem draft with all sections as structured JSON	Bearer / Token
GET	`/api/ai/capacity`	Get capacity forecast — projected breach dates per resource with confidence ranges	Bearer / Token
GET	`/api/ai/providers`	List AI provider status, uptime, request counts, and latency metrics	Admin
POST	`/api/ai/providers/test`	Send a test prompt to verify both providers are responding	Admin
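
Because `/api/ai/rca` is asynchronous, clients need to poll `/api/ai/rca/{job_id}` until the job finishes. The Python sketch below shows one way to structure that loop. The response field `state` and its values (`complete`, `failed`) are hypothetical, as is the `fetch_status` callback, which stands in for an authenticated HTTP GET that returns parsed JSON.

```python
import time
from typing import Callable, Dict


def wait_for_rca(job_id: str, fetch_status: Callable[[str], Dict],
                 interval_seconds: float = 2.0, max_polls: int = 30,
                 sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll the RCA job endpoint until it reports a terminal state."""
    path = f"/api/ai/rca/{job_id}"
    for attempt in range(max_polls):
        status = fetch_status(path)
        if status.get("state") in ("complete", "failed"):
            return status
        if attempt < max_polls - 1:
            sleep(interval_seconds)  # wait before the next poll
    raise TimeoutError(f"RCA job {job_id} did not finish after {max_polls} polls")
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any HTTP library and makes it testable without a running server.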

// ARCHITECTURE

Flux AI runs as a 4-container Docker Compose stack, with an optional Ollama container for local LLM inference. All containers share a dedicated flux-ai bridge network.

🌐

flux-ai (web)

PHP 8.2 / Apache — dashboard, REST API, configuration, RBAC enforcement

:4007

⚡

flux-ai-worker

Node.js 18 — ML inference pipeline, alert processor, anomaly detector, scheduler

:4008 (WS)

🗄

flux-ai-db

PostgreSQL 16 — model baselines, incident history, anomaly scores, runbook library

:5438

⚡

flux-ai-redis

Redis 7 — inference result cache, job queue, WebSocket sessions, rate limiting

:6382

🦙

ollama (optional)

Local LLM runtime — llama3.1:8b or custom model. GPU strongly recommended for production.

:11434
