
Monitoring & Observability

The observability stack is the full LGTM pattern: Loki (logs), Grafana (visualization and alerting), Tempo (distributed traces), and Prometheus (metrics). Everything runs as Docker containers on the production host alongside the services it monitors. Configuration is fully provisioned from the repository — dashboards, datasources, alert rules, and contact points are all version-controlled.

| Container | Image | Role |
|---|---|---|
| grafana | grafana/grafana | Visualization, dashboards, alerting UI |
| loki | grafana/loki | Log aggregation and querying |
| prometheus | prom/prometheus | Metrics TSDB and scrape engine |
| tempo | grafana/tempo | Distributed tracing (OTLP receiver) |
| alloy | grafana/alloy | Log and syslog collector (replaced Promtail) |
| cadvisor | gcr.io/cadvisor/cadvisor | Per-container CPU, memory, network, OOM metrics |
| node-exporter | prom/node-exporter | Production host system metrics |
| postgres-exporter | prometheuscommunity/postgres-exporter | PostgreSQL metrics |
| blackbox-exporter | prom/blackbox-exporter | HTTP/HTTPS/TCP endpoint probing |
| otel-collector | otel/opentelemetry-collector-contrib | OTLP fan-out (Claude Code telemetry) |
| unpoller | ghcr.io/unpoller/unpoller | UniFi network metrics |
| docker-socket-proxy-grafana | tecnativa/docker-socket-proxy | Read-only Docker API proxy for Alloy/cAdvisor |

All 12 containers run in net-monitoring, which is isolated from the application and data networks. Grafana and Prometheus are also on net-frontend so Caddy can proxy their web UIs.

```
Containers (stdout/stderr)
  └─> Alloy (Docker socket discovery via proxy, log tailing)
        └─> Loki → Grafana

Containers (cgroup stats via /sys)
  └─> cAdvisor
        └─> Prometheus (scrapes every 30s)

Production host (proc/sys mounts)
  └─> node-exporter
        └─> Prometheus (scrapes every 15s)

PostgreSQL
  └─> postgres-exporter (read-only role)
        └─> Prometheus (scrapes every 15s)

UniFi Controller API
  └─> unpoller (polls every 30s)
        └─> Prometheus

Remote hosts (nightwatch GPU node, inference host)
  └─> node-exporter, AMD GPU exporter
        └─> Prometheus (scrapes every 30s)

Claude Code (OTLP over gRPC or HTTP)
  └─> otel-collector
        ├─> Prometheus (metrics)
        ├─> Tempo (traces)
        └─> Loki (logs)

UniFi syslog
  └─> Alloy (UDP 514)
        └─> Loki
```
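The Alloy side of the log flows above can be sketched as an Alloy pipeline — container log tailing through the socket proxy plus the UDP 514 syslog listener, both forwarded to Loki. The hostnames and ports here are assumptions, not the actual `grafana/alloy-config.alloy` contents:

```alloy
// Discover containers through the read-only Docker socket proxy (hostname assumed).
discovery.docker "containers" {
  host = "tcp://docker-socket-proxy-grafana:2375"
}

// Tail stdout/stderr of discovered containers.
loki.source.docker "logs" {
  host       = "tcp://docker-socket-proxy-grafana:2375"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.write.default.receiver]
}

// Receive UniFi syslog on UDP 514.
loki.source.syslog "unifi" {
  listener {
    address  = "0.0.0.0:514"
    protocol = "udp"
  }
  forward_to = [loki.write.default.receiver]
}

// Ship everything to Loki.
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```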

Prometheus storage: local TSDB, with retention capped at 730 days (2 years) or 50GB — whichever limit is reached first.
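As a sketch, the retention limits correspond to two Prometheus startup flags (compose fragment; other flags omitted):

```yaml
prometheus:
  image: prom/prometheus
  command:
    - --storage.tsdb.retention.time=730d
    - --storage.tsdb.retention.size=50GB   # whichever limit triggers first wins
```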

| Job | Target | Interval |
|---|---|---|
| prometheus | Prometheus self | 15s |
| node-exporter-pi | Production host metrics | 15s |
| node-exporter-nightwatch | GPU node host metrics | 30s |
| node-exporter-atlas | Inference host metrics | 30s |
| cadvisor | Container metrics | 30s (timeout 25s) |
| postgres-exporter | PostgreSQL metrics | 15s |
| caddy | Caddy metrics (admin API) | 15s |
| grafana | Grafana metrics | 15s |
| alloy | Alloy collector metrics | 15s |
| tempo | Tempo metrics | 15s |
| amd-gpu | GPU node AMD GPU metrics | 30s |
| unpoller | UniFi network metrics | 30s |
| blackbox-http | 5 internal service probes | 5m |
| blackbox-http-auth | Auth-gated MCP endpoint probe (accepts 401) | 5m |
| blackbox-https | External endpoint probe (Cloudflare path) | 5m |
| loki | Loki metrics | 15s |
| otel-collector | OTEL collector metrics | 15s |
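In `grafana/prometheus.yml` terms, each row is a scrape job; a minimal sketch of two of them (target addresses are assumptions, the intervals come from the table above):

```yaml
global:
  scrape_interval: 15s          # default; per-job overrides below

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: cadvisor
    scrape_interval: 30s
    scrape_timeout: 25s         # cAdvisor on ARM needs headroom
    static_configs:
      - targets: ['cadvisor:8080']
```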

The blackbox exporter probes internal services to verify reachability from within the Docker network — separate from whether the container itself reports healthy.

Probe modules:

| Module | Use case |
|---|---|
| http_2xx | Unauthenticated internal services (expects 2xx) |
| https_2xx_3xx | External Cloudflare endpoint (allows Authelia redirects) |
| http_2xx_or_401 | Auth-gated services with no public health path (accepts 401 as “reachable”) |
| https_2xx | TLS-verified probes |
| tcp_connect | Raw port connectivity check |
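As a sketch of how such a module looks in `grafana/blackbox.yml`, the `http_2xx_or_401` case would enumerate the accepted status codes explicitly (the exact code list and timeout here are assumptions):

```yaml
modules:
  http_2xx_or_401:
    prober: http
    timeout: 10s
    http:
      # Treat an auth challenge as "service reachable".
      valid_status_codes: [200, 201, 202, 204, 401]
```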

Probed targets include: dashboard API, Home Assistant, n8n, Grafana, Authelia, and the MCP postgres proxy endpoint.

| Rule | Condition | Severity |
|---|---|---|
| Disk Space Critical | Root filesystem > 90% for 10 minutes | critical |
| Container Stopped | Time since last seen > 300s for 5 minutes | critical |
| High Memory Usage | Available memory < 10% for 5 minutes | critical |
| PostgreSQL Connections High | Active connections > 80 for 5 minutes | warning |
| Sustained High CPU | CPU > 85% for 15 minutes | warning |
| Loki Log Ingestion Stopped | Ingestion rate = 0 for 10 minutes | warning |
| Memory Pressure Critical | Available memory < 500MB for 5 minutes | critical |
| Swap Usage High | Swap > 80% for 5 minutes | warning |
| Container OOM Killed | OOM event increase in last 5 minutes | critical |
| Container Restart Loop | More than 3 restarts in 10 minutes | critical |
| PostgreSQL Down | pg_up < 1 for 1 minute | critical |
| Disk Space Warning | Root filesystem > 85% for 10 minutes | warning |
| Container Docker Unhealthy | Docker health state == 0 for 5 minutes | critical |

Container scope for the stopped/OOM/restart/unhealthy rules: matches chris-os-*, homeassistant, and wyoming-* containers. Excluded from the stopped alert: Alloy (expected to run continuously) and Piston (on-demand code execution only).

| Rule | Condition | Severity |
|---|---|---|
| SSL Certificate Expiry Warning | Days remaining < 14 for 10 minutes | warning |
| SSL Certificate Expiry Critical | Days remaining < 3 for 5 minutes | critical |
| Endpoint Unreachable | Probe success < 1 for 5 minutes | critical |

Endpoint probes use noDataState: Alerting — if the blackbox exporter itself stops reporting, the rule fires rather than resolving silently.
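In Grafana's file-provisioning format, such a rule would look roughly like the sketch below. The `uid`, datasource UID, time range, and abbreviated query pipeline are assumptions — real provisioned rules carry a fuller reduce/threshold expression chain:

```yaml
apiVersion: 1
groups:
  - orgId: 1
    name: endpoint-probes
    folder: chris-os
    interval: 1m
    rules:
      - uid: endpoint-unreachable
        title: Endpoint Unreachable
        condition: A
        for: 5m
        noDataState: Alerting     # missing data fires instead of resolving silently
        execErrState: Alerting
        labels:
          severity: critical
        data:
          - refId: A
            relativeTimeRange: { from: 300, to: 0 }
            datasourceUid: prometheus-uid
            model:
              expr: probe_success < 1
```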

| Rule | Condition | Severity |
|---|---|---|
| Offsite R2 Backup Stale | Backup age > 26 hours for 30 minutes | critical |

Two contact points:

| Contact Point | Channel | Delivery |
|---|---|---|
| pushover | Pushover mobile push | All devices, normal priority |
| discord-alerts | Discord webhook | #vital-apparatus channel (“Aperture Science Monitoring”) |

Routing policy:

  • Default receiver: pushover
  • critical severity: pushover (group wait 10s) then discord-alerts (dual-path, continue=true)
  • warning severity: pushover only (group wait 1 minute)
  • Repeat intervals: critical 4h (Pushover) / 8h (Discord), warning 4h
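The routing policy above maps onto Grafana's notification-policy provisioning roughly as follows (a sketch; grouping labels are omitted and the file layout is an assumption):

```yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: pushover                # default receiver
    routes:
      - receiver: pushover            # critical: Pushover first...
        object_matchers: [["severity", "=", "critical"]]
        group_wait: 10s
        repeat_interval: 4h
        continue: true                # ...then fall through to Discord (dual-path)
      - receiver: discord-alerts
        object_matchers: [["severity", "=", "critical"]]
        repeat_interval: 8h
      - receiver: pushover            # warning: Pushover only
        object_matchers: [["severity", "=", "warning"]]
        group_wait: 1m
        repeat_interval: 4h
```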

All dashboards are provisioned from grafana/dashboards/ and load automatically on container startup. They live in the chris-os folder in Grafana.

| Dashboard | Contents |
|---|---|
| system-overview | Production host CPU, memory, disk, network |
| docker-containers | Per-container cAdvisor metrics |
| postgresql | Query latency, connections, table sizes |
| n8n-workflows | n8n execution metrics |
| pipeline-health | Data pipeline health |
| voice-pipeline | Voice pipeline metrics |
| health-wellness | Health and wellness data |
| brewery | Inkbird temperature, TP-Link power monitoring (uses HA PostgreSQL datasource) |
| claude-code | Claude Code OTLP telemetry |
| glados-telemetry | GLaDOS framework telemetry |

| Name | Type | Target |
|---|---|---|
| Prometheus | prometheus | Prometheus (default, 15s scrape) |
| Loki | loki | Loki log aggregation |
| PostgreSQL | grafana-postgresql-datasource | Primary database, read-only role |
| HomeAssistant | grafana-postgresql-datasource | Home Assistant database, HA role |
| Tempo | tempo | Tempo trace backend |
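Datasources like these are provisioned from a YAML file on startup; a minimal sketch for the first two entries (URLs are assumptions based on the container names):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s     # match the default scrape interval
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```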

Grafana authentication uses Authelia OIDC directly (client_id: grafana, PKCE S256). Members of the admins group receive the Grafana Admin role. Grafana’s own database uses PostgreSQL (not SQLite).
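In env-var form, the generic OAuth settings for an Authelia client would look roughly like this sketch — the auth hostname and the exact role-mapping expression are assumptions; `client_id`, PKCE, and the admins-group mapping come from the setup described above:

```yaml
environment:
  GF_AUTH_GENERIC_OAUTH_ENABLED: "true"
  GF_AUTH_GENERIC_OAUTH_NAME: "Authelia"
  GF_AUTH_GENERIC_OAUTH_CLIENT_ID: "grafana"
  GF_AUTH_GENERIC_OAUTH_USE_PKCE: "true"                        # S256 challenge
  GF_AUTH_GENERIC_OAUTH_SCOPES: "openid profile email groups"
  GF_AUTH_GENERIC_OAUTH_AUTH_URL: "https://auth.example.com/api/oidc/authorization"
  GF_AUTH_GENERIC_OAUTH_TOKEN_URL: "https://auth.example.com/api/oidc/token"
  GF_AUTH_GENERIC_OAUTH_API_URL: "https://auth.example.com/api/oidc/userinfo"
  # Map the admins group to Grafana Admin; everyone else is a Viewer.
  GF_AUTH_GENERIC_OAUTH_ALLOW_ASSIGN_GRAFANA_ADMIN: "true"
  GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: "contains(groups[*], 'admins') && 'GrafanaAdmin' || 'Viewer'"
```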

| File | Purpose |
|---|---|
| grafana/docker-compose.grafana.yml | Compose stack for the full monitoring group |
| grafana/alloy-config.alloy | Alloy log collection and routing config |
| grafana/prometheus.yml | Prometheus scrape config |
| grafana/alerts.yml | Alert rule definitions (provisioned) |
| grafana/blackbox.yml | Blackbox exporter probe module definitions |
| grafana/otel/collector-config.yml | OTLP collector fan-out config |
| grafana/dashboards/*.json | Dashboard definitions (provisioned) |

Container “stopped” alerting: Uses last_over_time and a threshold pipeline rather than absence detection. The rule tracks the most recent timestamp any metric was seen from a container. noDataState: Alerting ensures alerts fire even if cAdvisor stops reporting entirely.
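A sketch of the kind of expression this implies, using cAdvisor's `container_last_seen` metric — the label matchers follow the scope described above, but the exact exclusion regex is an assumption:

```promql
# Seconds since any metric was last seen per in-scope container; > 300 fires.
time()
  - last_over_time(
      container_last_seen{
        name=~"chris-os-.*|homeassistant|wyoming-.*",
        name!~".*alloy.*|.*piston.*"   # assumed exclusion pattern
      }[10m]
    ) > 300
```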

Loki WAL: A prior crash corrupted the Loki WAL. Recovery required wiping the WAL directory. min_ready_duration: 0s is set in Loki config — a value of 15s (the default) causes an infinite readiness loop on this version.
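The relevant fragment of the Loki config (placement under `ingester.lifecycler` matches current Loki versions):

```yaml
ingester:
  lifecycler:
    min_ready_duration: 0s   # the 15s default loops readiness forever on this version
```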

cAdvisor ARM overlay: cAdvisor disk scanning on ARM is slow (2-8 minutes). --docker_only=true flag limits cAdvisor to Docker containers only, avoiding host filesystem scanning.
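As a compose fragment (a sketch; any flags beyond `--docker_only` are omitted):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor
  command:
    - --docker_only=true   # skip host filesystem scanning, which takes 2-8 min on ARM
```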

OTLP from Claude Code: Claude Code emits OTLP telemetry to the collector. Metrics flow to Prometheus, traces to Tempo, logs to Loki. The claude-code and glados-telemetry dashboards visualize this data.
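The fan-out in `grafana/otel/collector-config.yml` would follow the standard collector pipeline shape; a sketch with assumed ports and exporter names:

```yaml
receivers:
  otlp:
    protocols:
      grpc:            # Claude Code can emit over gRPC...
      http:            # ...or HTTP

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889               # scraped by Prometheus
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true                     # internal network, no TLS
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [loki]
```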