// FLUX AI — INTELLIGENCE & AUTOMATION ENGINE

FLUX AI

The intelligent automation layer for your NOC. Flux AI connects to Flux Notify, Flux Monitor, Flux Event, and Flux Speed to provide unified cross-app intelligence: it triages incidents, detects anomalies, suggests runbooks, and drafts post-mortems. It uses a dual AI provider strategy, with Ollama for speed and the Claude API for depth.

// OVERVIEW

Flux AI is the fifth application in the Flux Suite, providing NOC intelligence and automation across the entire platform. It operates as a standalone Docker service with its own web interface, API, and database — while maintaining live read access to the data stores of all other Flux apps.

⚡ Auto-Triage

AI assigns severity, service, and priority to every incoming alert in under 15 seconds, with natural-language summaries and confidence scores.

💬 NOC Chatbot

Natural language queries over live incident data, metric baselines, and runbooks. Ask anything about your infrastructure in plain English.

📡 Anomaly Detection

ML-based baselining over a 14-day rolling window by default. Detection methods include Z-score, Isolation Forest, and seasonal decomposition.

🔍 Root Cause Analysis

AI-guided investigation workflows that correlate events across all Flux apps and visualize causal chains with evidence scoring.

📋 Runbook Suggestions

Matches active incidents to historical resolution patterns and surfaces the most relevant runbooks with confidence percentages.

📝 Post-Mortem Drafts

Auto-generates structured post-mortems from incident thread timelines, audit logs, and metric data — ready to edit and publish.

🔇 Noise Reduction

Intelligent alert suppression using maintenance windows, flap detection, duplicate correlation, and below-threshold filtering.

📈 Capacity Forecasting

Predictive resource modeling using historical metric trends. Forecasts CPU, memory, disk, and bandwidth with configurable alert horizons.

🔧 Auto-Remediation

Safe self-healing actions with human approval gates. Restart pods, scale services, flush caches — all with full audit trail and rollback.

Architecture at a Glance

Component | Technology | Purpose
flux-ai web | PHP 8.2 / Apache | Dashboard, API, configuration UI
flux-ai-worker | Node.js 18 | ML inference pipeline, alert processor, scheduler
flux-ai-db | PostgreSQL 16 | Model data, incident history, anomaly baselines
flux-ai-redis | Redis 7 | Inference cache, queue, session store
ollama | Ollama runtime | Local LLM (llama3.1 on GPU/CPU)

// INSTALLATION

Quick Start

# Clone and start (Linux/macOS)
git clone https://github.com/your-org/flux-ai.git
cd flux-ai
cp .env.example .env
nano .env          # set passwords, Anthropic API key, Ollama URL
chmod +x start.sh && ./start.sh

# Windows (Command Prompt, run from the cloned flux-ai directory)
copy .env.example .env
notepad .env
start.bat

Default Ports

Service | Host Port | Container Port | Access
Flux AI Web | 4007 | 80 | Public via NPM
WebSocket | 4008 | 8080 | Internal / NPM
PostgreSQL | 5438 | 5432 | localhost only
Redis | 6382 | 6379 | Internal only
Adminer | 6047 | 8080 | Tailscale only
Ollama API | 11434 | 11434 | Internal only

Environment Variables

# Required
ANTHROPIC_API_KEY=sk-ant-...           # Claude API key (for RCA, post-mortems)
OLLAMA_URL=http://ollama:11434         # Ollama endpoint (local LLM)
OLLAMA_MODEL=llama3.1:8b              # Model to use (8b fits on 8GB VRAM)

# Database
POSTGRES_PASSWORD=ChangeMe!
REDIS_PASSWORD=ChangeMe!

# Cross-app integration (read-only DB access)
NOTIFY_DB_URL=postgresql://user:pass@flux-notify-db:5432/notify
MONITOR_DB_URL=postgresql://user:pass@flux-monitor-db:5432/monitor
EVENT_DB_URL=postgresql://user:pass@flux-event-db:5432/event

# Optional
ANTHROPIC_MODEL=claude-sonnet-4-6     # Default Claude model
ANOMALY_LOOKBACK_DAYS=14              # Baseline window
NOISE_SUPPRESSION_ENABLED=true
PREDICTIVE_ALERT_HORIZON_HOURS=4

⚠️ Ollama on CPU

Ollama runs on CPU if no GPU is available, but inference will be significantly slower (~10–30s per request vs. ~500ms with GPU). For production NOC use, a CUDA-capable GPU with at least 8GB VRAM is strongly recommended.
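A startup check for the variables listed under Environment Variables can catch a missing key before the stack boots. The following is a minimal sketch, not part of Flux AI: the function name `missing_settings` is hypothetical, and treating an unchanged `ChangeMe!` password as a problem is an assumption.

```python
import os

# Names taken from the "# Required" block of the .env example.
REQUIRED = ("ANTHROPIC_API_KEY", "OLLAMA_URL", "OLLAMA_MODEL")


def missing_settings(env=None):
    """Return a list of settings that are unset or still hold example values.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    problems = [name for name in REQUIRED if not env.get(name)]
    # Assumption: shipping the placeholder password is a configuration error.
    for name in ("POSTGRES_PASSWORD", "REDIS_PASSWORD"):
        if env.get(name) == "ChangeMe!":
            problems.append(f"{name} (still the example value)")
    return problems
```

An empty list means the required values are present; anything else can be printed before aborting startup.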

// AI PROVIDERS

Flux AI uses a dual-provider strategy to balance response speed, cost, and analytical depth. Both providers can be active simultaneously with intelligent routing based on task type.

🦙 Ollama: Local LLM

Self-hosted LLM running on your own hardware. No data leaves your network. Ideal for latency-sensitive operations like real-time triage and chatbot responses.

LOCAL · PRIVATE | ~480ms latency | llama3.1:8b default | $0 / request

🤖 Anthropic Claude: Cloud API

Claude API for complex multi-step reasoning tasks. Superior performance on root cause analysis, post-mortem drafting, and detailed incident summarization.

CLOUD · ANTHROPIC | ~2.1s latency | claude-sonnet-4-6 | pay-per-token

💡 Cost Optimization

Routing 97% of requests to local Ollama (triage, chat, capacity) and only using the Claude API for complex analysis (RCA, post-mortems) reduces API costs by ~$1,200/month on a busy NOC compared to API-only operation.

// PROVIDER ROUTING

The routing engine assigns each task to the optimal AI provider based on complexity, latency requirements, and configured rules.

Task | Default Provider | Fallback | Reason
Alert triage & severity scoring | Ollama | Claude API | Latency-critical: must complete in <15s
NOC chatbot responses | Ollama | Claude API | Interactive: users expect <1s response
Anomaly explanation | Claude API | Ollama | Nuanced reasoning over metric patterns
Root cause analysis | Claude API | (none) | Complex multi-source reasoning required
Post-mortem drafting | Claude API | Ollama | Long-form structured writing
Runbook matching | Ollama | Claude API | Pattern match against local runbook index
Capacity forecasting | Ollama | Claude API | Numeric trend analysis, no long-form output
Noise suppression rules | Rules engine | Ollama | Deterministic: no AI needed for most cases

Custom Routing Rules

Override default routing via the Settings → AI Providers → Routing Rules panel, or via the API:

POST /api/ai/routing/rules
{
  "task": "triage",
  "provider": "claude",
  "condition": "severity == P1",
  "reason": "P1 incidents always use Claude for higher accuracy"
}
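The interaction between default routes, fallbacks, and override rules can be sketched in a few lines. This is an illustrative model only, not the routing engine itself: the task keys other than `triage`, the `RoutingRule` class, and the reduction of a rule's `condition` to a plain severity match are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Default routes from the routing table: task -> (provider, fallback).
# Task keys other than "triage" are hypothetical names.
DEFAULT_ROUTES = {
    "triage": ("ollama", "claude"),
    "chat": ("ollama", "claude"),
    "anomaly_explanation": ("claude", "ollama"),
    "rca": ("claude", None),
    "postmortem": ("claude", "ollama"),
    "runbook_matching": ("ollama", "claude"),
    "capacity": ("ollama", "claude"),
}


@dataclass
class RoutingRule:
    task: str
    provider: str
    severity: Optional[str] = None  # simplified stand-in for "condition"


def pick_provider(task, severity=None, rules=(), available=("ollama", "claude")):
    """Apply the first matching override rule, else the default, else the fallback."""
    for rule in rules:
        if rule.task == task and rule.severity in (None, severity):
            if rule.provider in available:
                return rule.provider
    primary, fallback = DEFAULT_ROUTES[task]
    if primary in available:
        return primary
    if fallback in available:
        return fallback
    raise RuntimeError(f"no provider available for task {task!r}")
```

With the rule from the request above expressed as `RoutingRule("triage", "claude", "P1")`, a P1 triage goes to Claude while a P3 triage stays on Ollama.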

// AUTO-TRIAGE ENGINE

The triage engine processes every incoming alert from Flux Event and Flux Monitor within seconds, automatically assigning severity, identifying the affected service, and generating a human-readable summary.

Triage Pipeline

  1. Alert ingestion — receives raw alert payload from Flux Event processor or Flux Monitor threshold breach
  2. Context enrichment — pulls service topology, recent incidents, and metric baselines from cross-app DB connections
  3. Severity scoring — AI assigns P1–P4 with confidence percentage based on impact assessment
  4. Pattern matching — compares against 90-day incident history to identify recurring issues
  5. Summary generation — produces a concise, actionable incident summary in plain English
  6. Runbook linking — surfaces the top 3 matching runbooks from the runbook library
  7. Notification routing — triggers Flux Notify with AI-enriched incident data
Severity | AI Criteria | Response SLA
P1 | Customer-facing service down, revenue impact, SLA breach imminent | Page on-call immediately
P2 | Significant degradation, partial outage, >10% error rate | Acknowledge within 15 min
P3 | Non-critical degradation, single component affected | Acknowledge within 2 hours
P4 | Informational, within SLA, no user impact | Review next business day

📊 Triage Performance

Average triage completion time: 11 seconds. Severity accuracy vs. manual assignment: 94.2%. False P1 rate: 1.8% (reviewed and suppressed automatically for repeat offenders).

// NOC CHATBOT

The NOC chatbot provides a natural language interface over your live incident data, metric baselines, runbooks, and historical patterns. Ask questions in plain English — no query language required.

Example Queries

Query | What Flux AI does
"What caused INC-4471?" | Retrieves incident timeline, correlates with metric data, returns AI-generated root cause summary
"Are there any P1 incidents right now?" | Queries Flux Notify for open P1 threads, returns summary with service, duration, and on-call status
"What's the MTTR trend this week?" | Queries historical incident data, calculates per-severity MTTR, compares to baseline
"Is the payments service healthy?" | Pulls current metrics from Flux Monitor, checks open incidents, returns health assessment
"Draft a post-mortem for INC-4468" | Initiates post-mortem generation, routed to the Claude API for structured long-form output
"Which services have anomalies?" | Returns anomaly detection feed sorted by severity score

Context Window

The chatbot maintains a sliding context window including the last 20 conversation turns, all currently open incidents, recent anomaly signals, and your team's runbook library. This context is injected into every AI request, enabling coherent multi-turn conversations about ongoing incidents.
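A sliding window of this kind is commonly built on a bounded queue. The sketch below assumes each message counts as one turn and uses hypothetical names (`ChatContext`, `build`) and a hypothetical payload shape; it shows the mechanism, not Flux AI's actual data model.

```python
from collections import deque

MAX_TURNS = 20  # from the text: the last 20 conversation turns are kept


class ChatContext:
    """Keeps the most recent turns and assembles the per-request context."""

    def __init__(self, max_turns=MAX_TURNS):
        # deque(maxlen=...) silently drops the oldest turn when full.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, role, text):
        self.turns.append({"role": role, "text": text})

    def build(self, open_incidents, anomalies, runbooks):
        # Hypothetical payload layout combining the four context sources.
        return {
            "turns": list(self.turns),
            "open_incidents": open_incidents,
            "anomalies": anomalies,
            "runbooks": runbooks,
        }
```

Because the oldest turns fall out automatically, the request size stays bounded no matter how long a conversation runs.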

// ANOMALY DETECTION

Flux AI continuously monitors metrics from Flux Monitor and Flux Speed, building rolling baselines and surfacing statistically significant deviations before they become outages.

Detection Methods

Method | Best For | Baseline Window
Z-score statistical baseline | Steady-state metrics (error rates, latency) | 14-day rolling
Isolation Forest (ML) | Multi-dimensional anomaly clusters | Retrained weekly
Seasonal decomposition (STL) | Metrics with daily/weekly patterns (traffic, batch jobs) | 30-day rolling
Cross-metric correlation | Cascading failures across related services | Real-time
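Of the methods above, the Z-score baseline is the simplest to illustrate: a sample is compared with the mean and standard deviation of the baseline window. This is a minimal sketch; the threshold of 3 standard deviations and the function names are assumptions, not documented Flux AI settings.

```python
from statistics import mean, stdev


def z_score(history, current):
    """Z-score of `current` against the baseline samples in `history`."""
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat baseline: any deviation is infinitely far in sigma terms.
        if current == mu:
            return 0.0
        return float("inf") if current > mu else float("-inf")
    return (current - mu) / sigma


def is_anomalous(history, current, threshold=3.0):
    """Flag samples at least `threshold` standard deviations from the mean."""
    return abs(z_score(history, current)) >= threshold
```

In practice `history` would hold the samples from the rolling window (14 days by default, per `ANOMALY_LOOKBACK_DAYS`).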

Anomaly Score Levels

Score | Severity | Action
>300% above baseline | CRITICAL | Immediate P1/P2 triage, page on-call
100–300% above baseline | HIGH | Create P2/P3 incident, monitor closely
30–100% above baseline | MEDIUM | Log anomaly, send digest notification
10–30% above baseline | LOW | Record for trend analysis, no alert
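The score bands above map directly to a small classifier. The table's ranges share their endpoints (100% and 30% each appear in two rows), so this sketch assumes a value on a boundary belongs to the higher level; the function names are hypothetical.

```python
def percent_above_baseline(baseline, current):
    """Deviation of `current` from `baseline`, expressed as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) * 100.0 / baseline


def anomaly_level(pct_above):
    """Map a percent-above-baseline score to the level names in the table."""
    if pct_above > 300:
        return "CRITICAL"
    if pct_above >= 100:  # assumption: boundary values go to the higher level
        return "HIGH"
    if pct_above >= 30:
        return "MEDIUM"
    if pct_above >= 10:
        return "LOW"
    return None  # below 10%: not reported as an anomaly
```

For example, a metric with a baseline of 100 that reads 450 is 350% above baseline and lands in CRITICAL.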

// ROOT CAUSE ANALYSIS

Flux AI's RCA engine guides engineers through structured investigation workflows, automatically correlating events across all Flux Suite applications to identify the origin of cascading failures.

RCA Process

When an RCA is initiated for an incident, Flux AI:

  • Pulls the full event timeline from Flux Event for the affected service and its dependencies
  • Correlates metric anomalies from Flux Monitor in a configurable time window before the incident
  • Cross-references recent deploys, config changes, and maintenance windows from the audit log
  • Runs causal chain analysis using the Claude API to identify the most likely root cause with evidence scoring
  • Presents findings with a confidence score and supporting evidence for human review
  • Feeds accepted root causes back into the pattern library for future triage improvement

📊 RCA Accuracy

Root cause correctly identified on first analysis: 86% of incidents. When engineers provide feedback ("wrong root cause"), the model retrains on the corrected data automatically overnight.

// RUNBOOK SUGGESTIONS

The runbook engine maintains an indexed library of resolution procedures and automatically surfaces the most relevant runbooks for each active incident based on pattern matching against incident history.

Managing Runbooks

Runbooks are created and managed at Settings → Runbooks. Each runbook contains a title, description, service tags, symptom patterns, and numbered resolution steps. When a new incident is triaged, Flux AI scores all runbooks against the incident context and presents the top 3 matches with a confidence percentage.
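To make the "score, then take the top 3" step concrete, here is a deliberately simple stand-in that scores runbooks by keyword and tag overlap. Flux AI's real matching is AI-driven and not documented here; the 70/30 weighting, the function names, and the dictionary layout (borrowed from the runbook fields listed above) are assumptions.

```python
def score_runbook(runbook, incident_text, incident_tags):
    """Return a 0-100 match score from symptom-pattern and service-tag overlap."""
    text = incident_text.lower()
    patterns = runbook["symptom_patterns"]
    tags = runbook["service_tags"]
    pattern_hits = sum(1 for p in patterns if p.lower() in text)
    tag_hits = len(set(tags) & set(incident_tags))
    pattern_score = pattern_hits / len(patterns) if patterns else 0.0
    tag_score = tag_hits / len(tags) if tags else 0.0
    # Assumed weighting: symptoms matter more than service tags.
    return round(100 * (0.7 * pattern_score + 0.3 * tag_score), 1)


def suggest_runbooks(runbooks, incident_text, incident_tags, top_n=3):
    """Return up to `top_n` (score, title) pairs, best match first."""
    scored = [
        (score_runbook(rb, incident_text, incident_tags), rb["title"])
        for rb in runbooks
    ]
    scored = [s for s in scored if s[0] > 0]  # drop runbooks with no overlap
    return sorted(scored, key=lambda s: s[0], reverse=True)[:top_n]
```

A production matcher would likely use embeddings or an LLM rather than substring checks, but the ranking and truncation logic would look similar.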

Runbook API

# List all runbooks
GET /api/ai/runbooks

# Get suggestions for an incident
GET /api/ai/runbooks/suggest?incident_id=INC-4471

# Create a new runbook
POST /api/ai/runbooks
{
  "title": "PostgreSQL Replication Recovery",
  "service_tags": ["postgresql", "payments", "database"],
  "symptom_patterns": ["replication lag", "write amplification"],
  "steps": ["Check replica lag", "Identify blocking queries", "..."]
}

// POST-MORTEM DRAFTING

Flux AI automatically generates structured post-mortem documents from the incident thread timeline, metric data, and audit log entries. Drafts are created when an incident is marked Resolved and are available immediately for review and editing.

Draft Structure

  • Summary — what happened, duration, and overall impact
  • Impact assessment — affected services, user impact, estimated revenue impact
  • Timeline — chronological event log pulled from Flux Notify thread and Flux Event audit trail
  • Root cause — AI-identified root cause with evidence chain
  • Contributing factors — secondary factors that amplified the incident
  • What went well — detection time, response quality, tooling effectiveness
  • Action items — AI-suggested preventive measures with assignee fields
📝 Editing & Publishing

All post-mortem drafts are fully editable before publication. Engineers review and refine the AI draft, then publish to the internal wiki or export as PDF. Human review is always required — Flux AI does not auto-publish.

// NOISE REDUCTION

Alert fatigue is one of the leading causes of NOC burnout. Flux AI's noise reduction engine intelligently suppresses low-signal alerts so on-call teams only receive actionable notifications.

Suppression Types

Type | Description | Configuration
Maintenance window | All alerts from services in a maintenance window are suppressed | Scheduled via Flux Notify + Flux Event
Flap detection | Alerts that fire and clear more than N times per hour are suppressed after the threshold | FLAP_THRESHOLD=3 (default)
Duplicate correlation | Child alerts caused by a known P1 parent are suppressed and linked | Automatic via service topology
Below threshold | Alerts from metrics below a configured significance level | Per-metric in Flux Monitor thresholds
Transient suppression | Alerts that resolve within N seconds are not paged | TRANSIENT_TTL=30 (default)
Staging environment | Alerts from non-production environments filtered from production NOC view | Environment tag on service
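Two of the rules above, flap detection and transient suppression, reduce to simple time-window checks. The sketch below is one possible reading, assuming timestamps in seconds, that each firing counts as one fire-and-clear cycle, and that "within N seconds" includes the boundary; the function names are hypothetical.

```python
FLAP_THRESHOLD = 3  # default from the table
TRANSIENT_TTL = 30  # seconds, default from the table


def is_flapping(fire_times, now, threshold=FLAP_THRESHOLD, window_s=3600):
    """True when the alert fired more than `threshold` times in the last hour."""
    recent = [t for t in fire_times if 0 <= now - t <= window_s]
    return len(recent) > threshold


def is_transient(fired_at, cleared_at, ttl=TRANSIENT_TTL):
    """True when the alert cleared within `ttl` seconds of firing.

    `cleared_at` is None while the alert is still active.
    """
    return cleared_at is not None and cleared_at - fired_at <= ttl
```

An alert that is either flapping or transient would be recorded but not paged.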

// AUTO-REMEDIATION

Flux AI can execute a library of safe self-healing actions behind configurable human approval gates. Every action is logged in the audit trail with full rollback capability.

⚠️ Human Approval Gates

All remediation actions require explicit approval by default. An action can be switched to auto-execute only after 30 or more successful manual approvals for a given runbook with no resulting incidents. Never enable auto-execute on destructive or irreversible actions.
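The approval-gate policy can be expressed as a single eligibility check. This is a sketch of the stated rule only; the `RunbookActionStats` fields and the function name are hypothetical, and how Flux AI actually tracks approvals is not documented here.

```python
from dataclasses import dataclass

MIN_APPROVALS = 30  # from the text: 30+ successful manual approvals


@dataclass
class RunbookActionStats:
    manual_approvals: int  # successful, human-approved executions
    incidents_caused: int  # incidents attributed to this action
    destructive: bool      # destructive or irreversible action


def may_auto_execute(stats):
    """True only when the action meets every condition for auto-execute."""
    return (
        not stats.destructive
        and stats.incidents_caused == 0
        and stats.manual_approvals >= MIN_APPROVALS
    )
```

Anything that fails the check stays behind the manual approval gate.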

Available Actions

  • Kubernetes: pod restart, deployment rollout restart, horizontal scale, rollback to previous revision
  • Caching: Redis cache flush, CDN cache purge, application-level cache clear
  • Database: kill long-running queries, connection pool reset, read replica failover
  • Networking: DNS cache flush, routing table refresh, LB backend drain
  • Application: circuit breaker trip, feature flag disable, rate limit increase

// CAPACITY FORECASTING

Flux AI analyzes historical metric trends from Flux Monitor and produces resource utilization forecasts with configurable alert horizons, enabling proactive scaling decisions before capacity limits are hit.

Monitored Resources

  • CPU utilization per host, service, and cluster
  • Memory utilization with leak detection trend analysis
  • Disk usage with ingest rate modeling per database and volume
  • Network bandwidth — ingress and egress per interface and VPC
  • Application-layer: request rate, queue depth, active connections

Alert Horizons

Setting | Default | Description
CAPACITY_ALERT_DAYS | 14 | Days before projected breach to fire forecast alert
CAPACITY_THRESHOLD_CPU | 80 | CPU % that triggers forecast when projected to be reached
CAPACITY_THRESHOLD_DISK | 85 | Disk % that triggers forecast
CAPACITY_THRESHOLD_MEM | 85 | Memory % that triggers forecast
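How a threshold and an alert horizon combine is easiest to see with a straight-line trend. The sketch below assumes a plain least-squares fit over daily utilization samples, which is a simplification chosen for illustration; the forecasting model Flux AI actually uses is not specified, and the function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need samples from at least two distinct days")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def days_until_breach(days, usage_pct, threshold_pct):
    """Days from the latest sample until the trend reaches the threshold."""
    slope, intercept = linear_fit(days, usage_pct)
    if slope <= 0:
        return None  # flat or falling usage never breaches
    breach_day = (threshold_pct - intercept) / slope
    return max(0.0, breach_day - days[-1])


def should_alert(days, usage_pct, threshold_pct, alert_days=14):
    """Fire when the projected breach falls inside the alert horizon."""
    remaining = days_until_breach(days, usage_pct, threshold_pct)
    return remaining is not None and remaining <= alert_days
```

Disk usage growing 2 points per day from 60% reaches the 85% threshold 8.5 days after the last sample, inside the 14-day default horizon, so a forecast alert would fire.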

// PREDICTIVE ALERTING

Beyond detecting anomalies in current metrics, Flux AI surfaces leading indicators — early warning signals that historically precede specific failure modes — allowing on-call engineers to act before users are impacted.

Leading Indicator Examples

Leading Indicator | Predicted Failure | Avg Lead Time
DB connection pool >85% full | Connection exhaustion outage | 8 min
Cache hit rate dropping >15%/hr | Origin overload / cache miss storm | 12 min
Disk free rate of change >2% per hour | Disk full / write failure | 22 min
GC pause time trending up 3 consecutive samples | OOM crash or GC thrashing | 15 min
Message queue depth doubling every 5 min | Consumer backlog / message drop | 20 min
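Two of the indicators above can be written as short predicates. This sketch reads "trending up 3 consecutive samples" as three consecutive increases (so four samples are needed), which is an interpretation; the function names are hypothetical.

```python
def trending_up(samples, consecutive=3):
    """True when each of the last `consecutive` samples exceeds the one before it."""
    if len(samples) < consecutive + 1:
        return False
    tail = samples[-(consecutive + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))


def pool_near_exhaustion(in_use, pool_size, limit_pct=85):
    """True when more than `limit_pct` percent of connections are in use."""
    # Integer comparison avoids floating-point edge cases at the boundary.
    return in_use * 100 > limit_pct * pool_size
```

For instance, `trending_up(gc_pause_ms)` on a list of recent GC pause samples would flag the OOM precursor, and `pool_near_exhaustion(86, 100)` flags the connection pool.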

// CROSS-APP INTELLIGENCE

Flux AI is the only Flux Suite application with read access to all other apps' databases. This cross-app view enables unified analysis that would be impossible from any single app.

Data Sources

Source App | Data Used By Flux AI
Flux Notify | Incident threads, message history, resolution times, on-call assignments, escalation chains
Flux Monitor | Service metrics, health check results, SLA status, topology maps, capacity data
Flux Event | Event stream, correlation rules, deduplication logs, pipeline stage timings
Flux Speed | Network throughput baselines, latency trends, traceroute history, circuit performance

🔒 Read-Only Access

Flux AI only reads from other apps' databases — it never writes to them directly. All actions that create or modify data in other apps (e.g., creating a Flux Notify incident) use the official REST APIs of those apps, ensuring a clean and auditable integration boundary.

// RBAC & PERMISSIONS

Role | Triage / Chat | RCA / Post-Mortem | Runbooks | Remediation | Config
Admin | | | Create / Edit / Delete | Approve & Execute | Full
Operator | | | View & Use | Approve only | Limited
Analyst | ✓ (read-only) | View drafts | View only | |
Viewer | Dashboard only | | | |

// REST API

All Flux AI features are accessible via REST API. Authentication uses the same three methods as the rest of the Flux Suite (JWT Bearer, X-API-Token, X-API-Key).

Method | Endpoint | Description | Auth
POST | /api/ai/triage | Triage an alert payload; returns severity, summary, runbook matches, and confidence | Bearer / Token
GET | /api/ai/triage/feed | List recent triage results with AI summaries and assigned severities | Bearer / Token
POST | /api/ai/chat | Send a chat message; returns AI response with context from live incident data | Bearer / Token
GET | /api/ai/anomalies | List active anomaly signals with score, baseline delta, and linked incidents | Bearer / Token
POST | /api/ai/rca | Initiate root cause analysis for an incident ID (async, returns job ID) | Bearer / Token
GET | /api/ai/rca/{job_id} | Poll RCA job status and retrieve results when complete | Bearer / Token
GET | /api/ai/runbooks | List all runbooks in the library with tags and usage stats | Bearer / Token
POST | /api/ai/runbooks | Create a new runbook entry with steps, symptom patterns, and service tags | Admin
GET | /api/ai/runbooks/suggest | Get runbook suggestions for an incident ID; returns top 3 with match scores | Bearer / Token
POST | /api/ai/postmortem | Generate post-mortem draft for an incident ID (routed to Claude API) | Bearer / Token
GET | /api/ai/postmortem/{id} | Retrieve post-mortem draft with all sections as structured JSON | Bearer / Token
GET | /api/ai/capacity | Get capacity forecast: projected breach dates per resource with confidence ranges | Bearer / Token
GET | /api/ai/providers | List AI provider status, uptime, request counts, and latency metrics | Admin
POST | /api/ai/providers/test | Send a test prompt to verify both providers are responding | Admin
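The asynchronous RCA endpoints follow a start-then-poll pattern. The sketch below shows that flow with the HTTP layer injected as two callables, so it runs without a live server. The request and response field names (`incident_id`, `job_id`, `status`, and the values `complete` and `failed`) are assumptions, since the response schema is not documented here.

```python
import time


def run_rca(incident_id, post, get, poll_interval_s=2.0, max_polls=30,
            sleep=time.sleep):
    """Start an RCA job and poll until it completes.

    `post(path, body)` and `get(path)` are caller-supplied functions that
    perform the authenticated HTTP call and return the decoded JSON body.
    """
    job = post("/api/ai/rca", {"incident_id": incident_id})
    job_id = job["job_id"]  # assumed response field
    for _ in range(max_polls):
        result = get(f"/api/ai/rca/{job_id}")
        status = result.get("status")  # assumed response field
        if status == "complete":
            return result
        if status == "failed":
            raise RuntimeError(f"RCA job {job_id} failed")
        sleep(poll_interval_s)
    raise TimeoutError(f"RCA job {job_id} did not finish in time")
```

In real use, `post` and `get` would wrap an HTTP client that attaches one of the three supported authentication headers.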

// ARCHITECTURE

Flux AI runs as a 4-container Docker Compose stack, with an optional Ollama container for local LLM inference. All containers share a dedicated flux-ai bridge network.

Container | Stack and role | Port
flux-ai (web) | PHP 8.2 / Apache: dashboard, REST API, configuration, RBAC enforcement | :4007
flux-ai-worker | Node.js 18: ML inference pipeline, alert processor, anomaly detector, scheduler | :4008 (WS)
flux-ai-db | PostgreSQL 16: model baselines, incident history, anomaly scores, runbook library | :5438
flux-ai-redis | Redis 7: inference result cache, job queue, WebSocket sessions, rate limiting | :6382
ollama (optional) | Local LLM runtime: llama3.1:8b or custom model. GPU strongly recommended for production. | :11434