FLUX AI
The intelligent automation layer for your NOC. Flux AI connects to Flux Notify, Flux Monitor, Flux Event, and Flux Speed to provide unified cross-app intelligence — automatically triaging incidents, detecting anomalies, suggesting runbooks, and drafting post-mortems using a dual AI provider strategy with Ollama for speed and Claude API for depth.
// OVERVIEW
Flux AI is the fifth application in the Flux Suite, providing NOC intelligence and automation across the entire platform. It operates as a standalone Docker service with its own web interface, API, and database — while maintaining live read access to the data stores of all other Flux apps.
⚡ Auto-Triage
AI assigns severity, service, and priority to every incoming alert in under 15 seconds, with natural-language summaries and confidence scores.
💬 NOC Chatbot
Natural language queries over live incident data, metric baselines, and runbooks. Ask anything about your infrastructure in plain English.
📡 Anomaly Detection
Statistical and ML-based baselining with a 14-day default rolling window. Detection methods include Z-score, Isolation Forest, and seasonal decomposition.
🔍 Root Cause Analysis
AI-guided investigation workflows that correlate events across all Flux apps and visualize causal chains with evidence scoring.
📋 Runbook Suggestions
Matches active incidents to historical resolution patterns and surfaces the most relevant runbooks with confidence percentages.
📝 Post-Mortem Drafts
Auto-generates structured post-mortems from incident thread timelines, audit logs, and metric data — ready to edit and publish.
🔇 Noise Reduction
Intelligent alert suppression using maintenance windows, flap detection, duplicate correlation, and below-threshold filtering.
📈 Capacity Forecasting
Predictive resource modeling using historical metric trends. Forecasts CPU, memory, disk, and bandwidth with configurable alert horizons.
🔧 Auto-Remediation
Safe self-healing actions with human approval gates. Restart pods, scale services, flush caches — all with full audit trail and rollback.
Architecture at a Glance
| Component | Technology | Purpose |
|---|---|---|
| flux-ai web | PHP 8.2 / Apache | Dashboard, API, configuration UI |
| flux-ai-worker | Node.js 18 | ML inference pipeline, alert processor, scheduler |
| flux-ai-db | PostgreSQL 16 | Model data, incident history, anomaly baselines |
| flux-ai-redis | Redis 7 | Inference cache, queue, session store |
| ollama | Ollama runtime | Local LLM (llama3.1 on GPU/CPU) |
// INSTALLATION
Quick Start
# Clone and start (Linux/macOS)
git clone https://github.com/your-org/flux-ai.git
cd flux-ai
cp .env.example .env
nano .env # set passwords, Anthropic API key, Ollama URL
chmod +x start.sh && ./start.sh
# Windows
copy .env.example .env
notepad .env
start.bat
Default Ports
| Service | Host Port | Container Port | Access |
|---|---|---|---|
| Flux AI Web | 4007 | 80 | Public via NPM |
| WebSocket | 4008 | 8080 | Internal / NPM |
| PostgreSQL | 5438 | 5432 | localhost only |
| Redis | 6382 | 6379 | Internal only |
| Adminer | 6047 | 8080 | Tailscale only |
| Ollama API | 11434 | 11434 | Internal only |
Environment Variables
# Required
ANTHROPIC_API_KEY=sk-ant-... # Claude API key (for RCA, post-mortems)
OLLAMA_URL=http://ollama:11434 # Ollama endpoint (local LLM)
OLLAMA_MODEL=llama3.1:8b # Model to use (8b fits on 8GB VRAM)
# Database
POSTGRES_PASSWORD=ChangeMe!
REDIS_PASSWORD=ChangeMe!
# Cross-app integration (read-only DB access)
NOTIFY_DB_URL=postgresql://user:pass@flux-notify-db:5432/notify
MONITOR_DB_URL=postgresql://user:pass@flux-monitor-db:5432/monitor
EVENT_DB_URL=postgresql://user:pass@flux-event-db:5432/event
# Optional
ANTHROPIC_MODEL=claude-sonnet-4-6 # Default Claude model
ANOMALY_LOOKBACK_DAYS=14 # Baseline window
NOISE_SUPPRESSION_ENABLED=true
PREDICTIVE_ALERT_HORIZON_HOURS=4
Ollama runs on CPU if no GPU is available, but inference will be significantly slower (~10–30s per request vs. ~500ms with GPU). For production NOC use, a CUDA-capable GPU with at least 8GB VRAM is strongly recommended.
// AI PROVIDERS
Flux AI uses a dual-provider strategy to balance response speed, cost, and analytical depth. Both providers can be active simultaneously with intelligent routing based on task type.
Ollama — Local LLM
Self-hosted LLM running on your own hardware. No data leaves your network. Ideal for latency-sensitive operations like real-time triage and chatbot responses.
Anthropic Claude — Cloud API
Claude API for complex multi-step reasoning tasks. Superior performance on root cause analysis, post-mortem drafting, and detailed incident summarization.
Routing 97% of requests to local Ollama (triage, chat, capacity) and only using the Claude API for complex analysis (RCA, post-mortems) reduces API costs by ~$1,200/month on a busy NOC compared to API-only operation.
// PROVIDER ROUTING
The routing engine assigns each task to the optimal AI provider based on complexity, latency requirements, and configured rules.
| Task | Default Provider | Fallback | Reason |
|---|---|---|---|
| Alert triage & severity scoring | Ollama | Claude API | Latency-critical — must complete in <15s |
| NOC chatbot responses | Ollama | Claude API | Interactive — users expect <1s response |
| Anomaly explanation | Claude API | Ollama | Nuanced reasoning over metric patterns |
| Root cause analysis | Claude API | — | Complex multi-source reasoning required |
| Post-mortem drafting | Claude API | Ollama | Long-form structured writing |
| Runbook matching | Ollama | Claude API | Pattern match against local runbook index |
| Capacity forecasting | Ollama | Claude API | Numeric trend analysis, no long-form output |
| Noise suppression rules | Rules engine | Ollama | Deterministic — no AI needed for most cases |
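
The default and fallback columns above can be read as a lookup with a health check. The following is a minimal sketch, not the routing engine itself: the task keys, the `is_healthy` callback, and the choice to fall back only when the default provider is unhealthy are assumptions made for illustration. Noise suppression is omitted because it defaults to the rules engine rather than an AI provider.

```python
from typing import Callable, Optional

# Default provider and fallback per task, copied from the routing table.
# The task keys (e.g. "triage") are hypothetical identifiers for this sketch.
ROUTES: dict[str, tuple[str, Optional[str]]] = {
    "triage": ("ollama", "claude"),
    "chat": ("ollama", "claude"),
    "anomaly_explanation": ("claude", "ollama"),
    "rca": ("claude", None),  # the table lists no fallback for RCA
    "postmortem": ("claude", "ollama"),
    "runbook_match": ("ollama", "claude"),
    "capacity": ("ollama", "claude"),
}


def pick_provider(task: str, is_healthy: Callable[[str], bool]) -> str:
    """Return the default provider if it is healthy, otherwise the fallback."""
    primary, fallback = ROUTES[task]
    if is_healthy(primary):
        return primary
    if fallback is not None and is_healthy(fallback):
        return fallback
    raise RuntimeError(f"no healthy provider for task {task!r}")
```

With Ollama down, `pick_provider("triage", ...)` would return `"claude"`, while an RCA request with Claude unavailable raises instead of silently degrading.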
Custom Routing Rules
Override default routing via the Settings → AI Providers → Routing Rules panel, or via the API:
POST /api/ai/routing/rules
{
  "task": "triage",
  "provider": "claude",
  "condition": "severity == P1",
  "reason": "P1 incidents always use Claude for higher accuracy"
}
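
To illustrate how a rule like the one above might override the default route, here is a small sketch. The condition grammar is not documented in this section, so the parser below handles only the single `field == value` form shown in the example; `rule_matches` and `provider_for` are hypothetical helper names.

```python
def rule_matches(condition: str, context: dict[str, str]) -> bool:
    """Evaluate one 'field == value' condition against an alert context."""
    field, sep, value = condition.partition("==")
    if not sep:
        raise ValueError(f"unsupported condition: {condition!r}")
    return context.get(field.strip()) == value.strip()


def provider_for(task: str, context: dict[str, str],
                 rules: list[dict[str, str]], default: str) -> str:
    """Return the provider of the first matching rule, else the default."""
    for rule in rules:
        if rule["task"] == task and rule_matches(rule["condition"], context):
            return rule["provider"]
    return default


# The rule from the example request body above.
P1_RULE = {
    "task": "triage",
    "provider": "claude",
    "condition": "severity == P1",
    "reason": "P1 incidents always use Claude for higher accuracy",
}
```

Under these assumptions, `provider_for("triage", {"severity": "P1"}, [P1_RULE], "ollama")` selects Claude, while a P3 alert keeps the Ollama default.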
// AUTO-TRIAGE ENGINE
The triage engine processes every incoming alert from Flux Event and Flux Monitor within seconds, automatically assigning severity, identifying the affected service, and generating a human-readable summary.
Triage Pipeline
- Alert ingestion — receives raw alert payload from Flux Event processor or Flux Monitor threshold breach
- Context enrichment — pulls service topology, recent incidents, and metric baselines from cross-app DB connections
- Severity scoring — AI assigns P1–P4 with confidence percentage based on impact assessment
- Pattern matching — compares against 90-day incident history to identify recurring issues
- Summary generation — produces a concise, actionable incident summary in plain English
- Runbook linking — surfaces the top 3 matching runbooks from the runbook library
- Notification routing — triggers Flux Notify with AI-enriched incident data
| Severity | AI Criteria | Response SLA |
|---|---|---|
| P1 | Customer-facing service down, revenue impact, SLA breach imminent | Page on-call immediately |
| P2 | Significant degradation, partial outage, >10% error rate | Acknowledge within 15 min |
| P3 | Non-critical degradation, single component affected | Acknowledge within 2 hours |
| P4 | Informational, within SLA, no user impact | Review next business day |
Average triage completion time: 11 seconds. Severity accuracy vs. manual assignment: 94.2%. False P1 rate: 1.8% (reviewed and suppressed automatically for repeat offenders).
// NOC CHATBOT
The NOC chatbot provides a natural language interface over your live incident data, metric baselines, runbooks, and historical patterns. Ask questions in plain English — no query language required.
Example Queries
| Query | What Flux AI does |
|---|---|
| "What caused INC-4471?" | Retrieves incident timeline, correlates with metric data, returns AI-generated root cause summary |
| "Are there any P1 incidents right now?" | Queries Flux Notify for open P1 threads, returns summary with service, duration, and on-call status |
| "What's the MTTR trend this week?" | Queries historical incident data, calculates per-severity MTTR, compares to baseline |
| "Is the payments service healthy?" | Pulls current metrics from Flux Monitor, checks open incidents, returns health assessment |
| "Draft a post-mortem for INC-4468" | Initiates post-mortem generation — routed to Claude API for structured long-form output |
| "Which services have anomalies?" | Returns anomaly detection feed sorted by severity score |
Context Window
The chatbot maintains a sliding context window including the last 20 conversation turns, all currently open incidents, recent anomaly signals, and your team's runbook library. This context is injected into every AI request, enabling coherent multi-turn conversations about ongoing incidents.
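A sliding window of this kind can be sketched with a bounded deque. This is an illustrative sketch only: it treats each message as one turn, and the `ChatContext` class and prompt layout are assumptions rather than the chatbot's actual format. Anomaly signals and runbooks are left out for brevity.

```python
from collections import deque

MAX_TURNS = 20  # from the text: the last 20 conversation turns are kept


class ChatContext:
    """Keeps the most recent turns and renders them into a prompt."""

    def __init__(self, max_turns: int = MAX_TURNS) -> None:
        # A deque with maxlen drops the oldest turn automatically.
        self.turns: deque[tuple[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def build_prompt(self, open_incidents: list[str], question: str) -> str:
        lines = ["Open incidents:"]
        lines += [f"- {incident}" for incident in open_incidents]
        lines.append("Conversation:")
        lines += [f"{role}: {text}" for role, text in self.turns]
        lines.append(f"user: {question}")
        return "\n".join(lines)
```

Because the deque is bounded, the prompt size stays predictable no matter how long a conversation about an ongoing incident runs.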
// ANOMALY DETECTION
Flux AI continuously monitors metrics from Flux Monitor and Flux Speed, building rolling baselines and surfacing statistically significant deviations before they become outages.
Detection Methods
| Method | Best For | Baseline Window |
|---|---|---|
| Z-score statistical baseline | Steady-state metrics (error rates, latency) | 14-day rolling |
| Isolation Forest (ML) | Multi-dimensional anomaly clusters | Retrained weekly |
| Seasonal decomposition (STL) | Metrics with daily/weekly patterns (traffic, batch jobs) | 30-day rolling |
| Cross-metric correlation | Cascading failures across related services | Real-time |
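
As an illustration of the first method in the table, a Z-score compares the current value with the mean and standard deviation of the baseline window. This is a minimal sketch using only the standard library; the threshold of 3.0 standard deviations is an assumed value, not a documented Flux AI default.

```python
from statistics import mean, pstdev


def z_score(history: list[float], current: float) -> float:
    """Number of standard deviations `current` lies from the baseline mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is infinitely unusual.
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma


def is_anomalous(history: list[float], current: float,
                 threshold: float = 3.0) -> bool:
    """Flag values at least `threshold` standard deviations from the mean."""
    return abs(z_score(history, current)) >= threshold
```

In practice `history` would hold the samples from the 14-day rolling window for a single metric.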
Anomaly Score Levels
| Score | Severity | Action |
|---|---|---|
| >300% above baseline | CRITICAL | Immediate P1/P2 triage, page on-call |
| 100–300% above baseline | HIGH | Create P2/P3 incident, monitor closely |
| 30–100% above baseline | MEDIUM | Log anomaly, send digest notification |
| 10–30% above baseline | LOW | Record for trend analysis, no alert |
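
The score bands above can be expressed as a simple classifier. This sketch assumes the score is the percentage increase over the baseline value, and it assigns boundary values such as exactly 100% to the higher band, since the table's ranges share their endpoints.

```python
from typing import Optional


def percent_above_baseline(baseline: float, observed: float) -> float:
    """Percentage by which `observed` exceeds `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) / baseline * 100.0


def anomaly_level(pct: float) -> Optional[str]:
    """Map a percentage above baseline to the severity bands in the table."""
    if pct > 300:
        return "CRITICAL"
    if pct >= 100:
        return "HIGH"
    if pct >= 30:
        return "MEDIUM"
    if pct >= 10:
        return "LOW"
    return None  # below 10%: not treated as an anomaly
```

For example, a latency of 900 ms against a 200 ms baseline is 350% above baseline and would be classified as CRITICAL.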
// ROOT CAUSE ANALYSIS
Flux AI's RCA engine guides engineers through structured investigation workflows, automatically correlating events across all Flux Suite applications to identify the origin of cascading failures.
RCA Process
When an RCA is initiated for an incident, Flux AI:
- Pulls the full event timeline from Flux Event for the affected service and its dependencies
- Correlates metric anomalies from Flux Monitor in a configurable time window before the incident
- Cross-references recent deploys, config changes, and maintenance windows from the audit log
- Runs causal chain analysis using the Claude API to identify the most likely root cause with evidence scoring
- Presents findings with a confidence score and supporting evidence for human review
- Feeds accepted root causes back into the pattern library for future triage improvement
Root cause correctly identified on first analysis: 86% of incidents. When engineers provide feedback ("wrong root cause"), the model retrains on the corrected data automatically overnight.
// RUNBOOK SUGGESTIONS
The runbook engine maintains an indexed library of resolution procedures and automatically surfaces the most relevant runbooks for each active incident based on pattern matching against incident history.
Managing Runbooks
Runbooks are created and managed at Settings → Runbooks. Each runbook contains a title, description, service tags, symptom patterns, and numbered resolution steps. When a new incident is triaged, Flux AI scores all runbooks against the incident context and presents the top 3 matches with a confidence percentage.
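The scoring method itself is not described here, so the following is only a sketch of one plausible approach: it combines the fraction of symptom patterns found in the alert text with a service-tag match. The 70/30 weighting, the `Runbook` dataclass, and the function names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Runbook:
    title: str
    service_tags: list[str]
    symptom_patterns: list[str]


def score(runbook: Runbook, service: str, alert_text: str) -> float:
    """Confidence percentage from symptom and service-tag overlap."""
    text = alert_text.lower()
    patterns = runbook.symptom_patterns
    hits = sum(1 for p in patterns if p.lower() in text)
    pattern_score = hits / len(patterns) if patterns else 0.0
    tags = {t.lower() for t in runbook.service_tags}
    tag_score = 1.0 if service.lower() in tags else 0.0
    # Hypothetical weighting: symptoms count for 70%, service tag for 30%.
    return round(100 * (0.7 * pattern_score + 0.3 * tag_score), 1)


def top_matches(runbooks: list[Runbook], service: str, alert_text: str,
                limit: int = 3) -> list[tuple[float, str]]:
    """Return up to `limit` (confidence, title) pairs, best match first."""
    scored = [(score(rb, service, alert_text), rb.title) for rb in runbooks]
    scored = [item for item in scored if item[0] > 0]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:limit]
```

A production matcher would more likely use an index or embeddings; the point of the sketch is the shape of the result, a short ranked list with a confidence value per runbook.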
Runbook API
# List all runbooks
GET /api/ai/runbooks
# Get suggestions for an incident
GET /api/ai/runbooks/suggest?incident_id=INC-4471
# Create a new runbook
POST /api/ai/runbooks
{
  "title": "PostgreSQL Replication Recovery",
  "service_tags": ["postgresql", "payments", "database"],
  "symptom_patterns": ["replication lag", "write amplification"],
  "steps": ["Check replica lag", "Identify blocking queries", "..."]
}
// POST-MORTEM DRAFTING
Flux AI automatically generates structured post-mortem documents from the incident thread timeline, metric data, and audit log entries. Drafts are created when an incident is marked Resolved and are available immediately for review and editing.
Draft Structure
- Summary — what happened, duration, and overall impact
- Impact assessment — affected services, user impact, estimated revenue impact
- Timeline — chronological event log pulled from Flux Notify thread and Flux Event audit trail
- Root cause — AI-identified root cause with evidence chain
- Contributing factors — secondary factors that amplified the incident
- What went well — detection time, response quality, tooling effectiveness
- Action items — AI-suggested preventive measures with assignee fields
All post-mortem drafts are fully editable before publication. Engineers review and refine the AI draft, then publish to the internal wiki or export as PDF. Human review is always required — Flux AI does not auto-publish.
// NOISE REDUCTION
Alert fatigue is one of the leading causes of NOC burnout. Flux AI's noise reduction engine intelligently suppresses low-signal alerts so on-call teams only receive actionable notifications.
Suppression Types
| Type | Description | Configuration |
|---|---|---|
| Maintenance window | All alerts from services in a maintenance window are suppressed | Scheduled via Flux Notify + Flux Event |
| Flap detection | Alerts that fire and clear more than N times within one hour are suppressed once they cross that threshold | FLAP_THRESHOLD=3 (default) |
| Duplicate correlation | Child alerts caused by a known P1 parent are suppressed and linked | Automatic via service topology |
| Below threshold | Alerts from metrics below a configured significance level | Per-metric in Flux Monitor thresholds |
| Transient suppression | Alerts that resolve within N seconds are not paged | TRANSIENT_TTL=30 (default) |
| Staging environment | Alerts from non-production environments filtered from production NOC view | Environment tag on service |
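
Flap detection and transient suppression from the table can be sketched as follows. The defaults come from the table (`FLAP_THRESHOLD=3`, `TRANSIENT_TTL=30`); the `FlapDetector` class, the per-alert key, and the use of plain timestamps in seconds are assumptions for illustration.

```python
from collections import deque

FLAP_THRESHOLD = 3          # default from the table
FLAP_WINDOW_SECONDS = 3600  # "per hour"
TRANSIENT_TTL = 30          # default from the table, in seconds


class FlapDetector:
    """Counts fire-and-clear cycles per alert within a one-hour window."""

    def __init__(self, threshold: int = FLAP_THRESHOLD,
                 window: float = FLAP_WINDOW_SECONDS) -> None:
        self.threshold = threshold
        self.window = window
        self._cycles: dict[str, deque[float]] = {}

    def record_cycle(self, alert_key: str, cleared_at: float) -> None:
        """Record one completed fire-and-clear cycle."""
        self._cycles.setdefault(alert_key, deque()).append(cleared_at)

    def should_suppress(self, alert_key: str, now: float) -> bool:
        """True once the alert has flapped more than `threshold` times."""
        cycles = self._cycles.get(alert_key, deque())
        while cycles and now - cycles[0] > self.window:
            cycles.popleft()  # forget cycles older than the window
        return len(cycles) > self.threshold


def is_transient(fired_at: float, resolved_at: float,
                 ttl: float = TRANSIENT_TTL) -> bool:
    """True if the alert resolved within `ttl` seconds and need not page."""
    return resolved_at - fired_at <= ttl
```

With the default threshold, the fourth cycle within an hour is the first one that triggers suppression.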
// AUTO-REMEDIATION
Flux AI can execute a library of safe self-healing actions, always with configurable human approval gates. Every action is logged in the audit trail with full rollback capability.
All remediation actions require explicit approval by default. An action can be switched to auto-execute only after its runbook has accumulated 30 or more successful manual approvals with no resulting incidents. Never enable auto-execute on destructive or irreversible actions.
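The approval gate described above reduces to a small eligibility check. This is a sketch under stated assumptions: the function name, its parameters, and the idea of a `destructive` flag on each action are hypothetical, and only the threshold of 30 comes from the text.

```python
MIN_APPROVALS = 30  # from the text: 30+ successful manual approvals


def auto_execute_allowed(successful_approvals: int, incidents: int,
                         destructive: bool,
                         min_approvals: int = MIN_APPROVALS) -> bool:
    """Decide whether an action may run without a human approval step."""
    if destructive:
        # Destructive or irreversible actions always need a human.
        return False
    return successful_approvals >= min_approvals and incidents == 0
```

Keeping the destructive check first means no approval count, however large, can unlock auto-execute for an irreversible action.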
Available Actions
- Kubernetes: pod restart, deployment rollout restart, horizontal scale, rollback to previous revision
- Caching: Redis cache flush, CDN cache purge, application-level cache clear
- Database: kill long-running queries, connection pool reset, read replica failover
- Networking: DNS cache flush, routing table refresh, LB backend drain
- Application: circuit breaker trip, feature flag disable, rate limit increase
// CAPACITY FORECASTING
Flux AI analyzes historical metric trends from Flux Monitor and produces resource utilization forecasts with configurable alert horizons, enabling proactive scaling decisions before capacity limits are hit.
Monitored Resources
- CPU utilization per host, service, and cluster
- Memory utilization with leak detection trend analysis
- Disk usage with ingest rate modeling per database and volume
- Network bandwidth — ingress and egress per interface and VPC
- Application-layer: request rate, queue depth, active connections
Alert Horizons
| Setting | Default | Description |
|---|---|---|
| CAPACITY_ALERT_DAYS | 14 | Days before projected breach to fire forecast alert |
| CAPACITY_THRESHOLD_CPU | 80 | CPU % that triggers forecast when projected to be reached |
| CAPACITY_THRESHOLD_DISK | 85 | Disk % that triggers forecast |
| CAPACITY_THRESHOLD_MEM | 85 | Memory % that triggers forecast |
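
To show how a threshold and an alert horizon interact, the sketch below fits a straight line to utilization samples and projects when it crosses the threshold. The forecasting model is not specified in this section, so the least-squares linear trend is an assumption, as are the function names and the `(day, percent)` sample format.

```python
from typing import Optional


def days_until_breach(samples: list[tuple[float, float]],
                      threshold: float) -> Optional[float]:
    """Days from the last sample until the fitted trend reaches `threshold`.

    `samples` is a list of (day, utilization_percent) pairs in time order.
    Returns None when there is no upward trend to project.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # flat or falling usage never reaches the threshold
    intercept = mean_y - slope * mean_x
    breach_day = (threshold - intercept) / slope
    return max(0.0, breach_day - samples[-1][0])


def should_alert(samples: list[tuple[float, float]], threshold: float,
                 alert_days: float = 14) -> bool:
    """Fire a forecast alert when the breach is within `alert_days`."""
    remaining = days_until_breach(samples, threshold)
    return remaining is not None and remaining <= alert_days
```

For disk usage growing one point per day from 60%, an 85% threshold is about 22 days past the last of four daily samples, so no alert fires yet with the default 14-day horizon.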
// PREDICTIVE ALERTING
Beyond detecting anomalies in current metrics, Flux AI surfaces leading indicators — early warning signals that historically precede specific failure modes — allowing on-call engineers to act before users are impacted.
Leading Indicator Examples
| Leading Indicator | Predicted Failure | Avg Lead Time |
|---|---|---|
| DB connection pool >85% full | Connection exhaustion outage | 8 min |
| Cache hit rate dropping >15%/hr | Origin overload / cache miss storm | 12 min |
| Free disk space falling by >2% per hour | Disk full / write failure | 22 min |
| GC pause time trending up 3 consecutive samples | OOM crash or GC thrashing | 15 min |
| Message queue depth doubling every 5 min | Consumer backlog / message drop | 20 min |
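
Two of the indicators above are easy to express directly. The sketch below reads "trending up 3 consecutive samples" as three consecutive increases, which is one possible interpretation; the function names are hypothetical.

```python
def trending_up(samples: list[float], consecutive: int = 3) -> bool:
    """True if each of the last `consecutive` samples exceeds the one before."""
    if len(samples) < consecutive + 1:
        return False  # not enough history to see that many increases
    tail = samples[-(consecutive + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))


def pool_nearly_full(in_use: int, pool_size: int,
                     limit_percent: float = 85.0) -> bool:
    """True if the connection pool is more than `limit_percent` full."""
    return in_use / pool_size * 100.0 > limit_percent
```

A GC pause series of 40, 40, 44, 51, 63 ms would trip `trending_up`, while a series that dips once in the last four samples would not.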
// CROSS-APP INTELLIGENCE
Flux AI is the only Flux Suite application with read access to all other apps' databases. This cross-app view enables unified analysis that would be impossible from any single app.
Data Sources
| Source App | Data Used By Flux AI |
|---|---|
| Flux Notify | Incident threads, message history, resolution times, on-call assignments, escalation chains |
| Flux Monitor | Service metrics, health check results, SLA status, topology maps, capacity data |
| Flux Event | Event stream, correlation rules, deduplication logs, pipeline stage timings |
| Flux Speed | Network throughput baselines, latency trends, traceroute history, circuit performance |
Flux AI only reads from other apps' databases — it never writes to them directly. All actions that create or modify data in other apps (e.g., creating a Flux Notify incident) use the official REST APIs of those apps, ensuring a clean and auditable integration boundary.
// RBAC & PERMISSIONS
| Role | Triage / Chat | RCA / Post-Mortem | Runbooks | Remediation | Config |
|---|---|---|---|---|---|
| Admin | ✓ | ✓ | Create / Edit / Delete | Approve & Execute | Full |
| Operator | ✓ | ✓ | View & Use | Approve only | Limited |
| Analyst | ✓ (read-only) | View drafts | View only | — | — |
| Viewer | Dashboard only | — | — | — | — |
// REST API
All Flux AI features are accessible via REST API. Authentication uses the same three methods as the rest of the Flux Suite (JWT Bearer, X-API-Token, X-API-Key).
| Method | Endpoint | Description | Auth |
|---|---|---|---|
| POST | /api/ai/triage | Triage an alert payload — returns severity, summary, runbook matches, and confidence | Bearer / Token |
| GET | /api/ai/triage/feed | List recent triage results with AI summaries and assigned severities | Bearer / Token |
| POST | /api/ai/chat | Send a chat message — returns AI response with context from live incident data | Bearer / Token |
| GET | /api/ai/anomalies | List active anomaly signals with score, baseline delta, and linked incidents | Bearer / Token |
| POST | /api/ai/rca | Initiate root cause analysis for an incident ID — async, returns job ID | Bearer / Token |
| GET | /api/ai/rca/{job_id} | Poll RCA job status and retrieve results when complete | Bearer / Token |
| GET | /api/ai/runbooks | List all runbooks in the library with tags and usage stats | Bearer / Token |
| POST | /api/ai/runbooks | Create a new runbook entry with steps, symptom patterns, and service tags | Admin |
| GET | /api/ai/runbooks/suggest | Get runbook suggestions for an incident ID — returns top 3 with match scores | Bearer / Token |
| POST | /api/ai/postmortem | Generate post-mortem draft for an incident ID — routed to Claude API | Bearer / Token |
| GET | /api/ai/postmortem/{id} | Retrieve post-mortem draft with all sections as structured JSON | Bearer / Token |
| GET | /api/ai/capacity | Get capacity forecast — projected breach dates per resource with confidence ranges | Bearer / Token |
| GET | /api/ai/providers | List AI provider status, uptime, request counts, and latency metrics | Admin |
| POST | /api/ai/providers/test | Send a test prompt to verify both providers are responding | Admin |
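
Because RCA is asynchronous, a client submits a job and then polls `/api/ai/rca/{job_id}`. The sketch below shows the polling loop only. The response schema is not documented in this section, so the `status` field and its `complete` and `failed` values are assumptions, and the HTTP call is injected as `fetch_job` so the loop does not depend on any particular client library.

```python
import time
from typing import Callable


def wait_for_rca(job_id: str,
                 fetch_job: Callable[[str], dict],
                 interval: float = 2.0,
                 timeout: float = 120.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll an RCA job until it finishes or `timeout` seconds elapse.

    `fetch_job` should perform GET /api/ai/rca/{job_id} and return the
    decoded JSON body. The "status" field is a hypothetical schema detail.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        if job.get("status") in ("complete", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"RCA job {job_id} did not finish in time")
        sleep(interval)
```

Injecting `fetch_job` and `sleep` also makes the loop easy to exercise in tests without a running Flux AI instance.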
// ARCHITECTURE
Flux AI runs as a 4-container Docker Compose stack, with an optional Ollama container for local LLM inference. All containers share a dedicated flux-ai bridge network.