Grafana — Metrics, Dashboards & Alerts
Grafana turns raw metrics from Prometheus (and other data sources) into beautiful, queryable dashboards with built-in alerting.
🏗️ How It Works
📦 Install with Docker Compose
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:
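Open http://localhost:3000 and log in with admin / admin. On first launch, add Prometheus as a data source: Configuration → Data Sources → Add → Prometheus → URL: http://prometheus:9090 (recent Grafana versions list data sources under Connections → Data sources). The URL uses the Compose service name prometheus, which resolves inside the Compose network; localhost would point at the Grafana container itself.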
⚙️ Prometheus Config
# prometheus.yml
global:
  scrape_interval: 15s      # how often to pull metrics
  evaluation_interval: 15s  # how often to evaluate alert rules

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus scrapes itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes pods with annotation prometheus.io/scrape: "true"
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Keep the pod host and swap in the port from the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # Node exporter (host metrics: CPU, disk, network)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # MySQL exporter
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
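A replace-style relabel rule that sets `__address__` from a port annotation can be hard to read at a glance, so here is a minimal Python sketch of what it does. It assumes the widely used pattern `([^:]+)(?::\d+)?;(\d+)` with replacement `$1:$2`; the function name `relabel_address` and the sample addresses are hypothetical. Prometheus joins the source label values with `;` by default and anchors the regex at both ends, which the sketch imitates with `fullmatch`.

```python
import re

# Assumed pattern: host, optional existing port, the ";" separator, then the annotated port.
ADDRESS_PORT = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address: str, port_annotation: str) -> str:
    """Imitate action: replace with source_labels [__address__, <port annotation>]."""
    joined = f"{address};{port_annotation}"  # source label values joined by ";"
    match = ADDRESS_PORT.fullmatch(joined)   # the whole joined string must match
    if match is None:
        return address  # no match: the target label keeps its old value
    return f"{match.group(1)}:{match.group(2)}"


print(relabel_address("10.0.0.7:8080", "9102"))  # 10.0.0.7:9102
print(relabel_address("10.0.0.7", "9102"))       # 10.0.0.7:9102
print(relabel_address("10.0.0.7:8080", ""))      # 10.0.0.7:8080
```

The last call shows why a pod without the port annotation keeps its discovered address: the regex does not match, so nothing is rewritten.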
🔍 PromQL Query Language
PromQL is the query language for Prometheus. Every Grafana panel backed by a Prometheus data source runs a PromQL query.
# ── Instant queries (current value) ───────────────────
# CPU usage across all CPUs (as percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory used (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Container CPU usage rate (last 5 min)
rate(container_cpu_usage_seconds_total{container!=""}[5m])
# HTTP request rate (per second, last 2 min)
rate(http_requests_total[2m])
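# HTTP error rate as % of total (aggregate both sides first; otherwise the
# division only pairs 5xx series with themselves and always yields 100)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100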
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
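# Running pods per namespace (the series is 1 for a pod's current phase and 0
# otherwise, so sum rather than count; the metric has no deployment label)
sum by (namespace) (kube_pod_status_phase{phase="Running"})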
# MySQL connections in use
mysql_global_status_threads_connected
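Any of the queries above can also be run outside Grafana through Prometheus's HTTP API, which is handy when checking an expression before putting it in a panel. The sketch below builds the URL for the instant-query endpoint (`/api/v1/query`) and parses a response of the documented shape. The base URL assumes the port published in the Compose file, the helper names are hypothetical, and the canned response values are made up for illustration.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # assumption: port published by the Compose file


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for Prometheus's instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def parse_vector(payload: dict) -> dict:
    """Map each series' sorted label pairs to its float value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    return {
        tuple(sorted(series["metric"].items())): float(series["value"][1])
        for series in data["result"]
    }


# Canned response; a live one would come from fetching instant_query_url(...).
sample = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"instance": "node-exporter:9100"},
                      "value": [1700000000, "12.5"]}]}}
""")

print(instant_query_url("up"))
print(parse_vector(sample))
```

To query a running server, pass the URL to `urllib.request.urlopen` and feed the decoded JSON body to `parse_vector`.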
| Function | What It Does | Example |
|---|---|---|
| rate() | Per-second rate of a counter over a window | rate(http_requests_total[5m]) |
| irate() | Instant rate (uses last 2 samples) | irate(cpu_seconds[1m]) |
| increase() | Total increase over a window | increase(errors[1h]) |
| sum() | Sum across series | sum(rate(req[5m])) |
| avg() | Average across series | avg by (pod) (cpu_usage) |
| max() | Maximum across series | max(memory_bytes) |
| histogram_quantile() | Percentile from histogram | histogram_quantile(0.95, ...) |
| label_replace() | Rename/rewrite a label | label_replace(m, "new", "$1", "old", "(.+)") |
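The difference between `rate()`, `irate()`, and `increase()` is easier to see on raw samples. The following is a simplified Python sketch, not Prometheus's actual implementation: it handles counter resets but skips the extrapolation to the window boundaries that Prometheus performs, so real results differ slightly. The function names and sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the window, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the delta.
        total += curr - prev if curr >= prev else curr
    return total


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


def counter_irate(samples: List[Sample]) -> float:
    """Per-second increase using only the last two samples."""
    return counter_rate(samples[-2:])


# Scraped every 15s; the counter resets between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 70)]
print(counter_increase(window))  # 130.0
print(counter_rate(window))
print(counter_irate(window))     # 4.0
```

Because `counter_irate` looks only at the final pair, it reacts quickly to spikes but is noisy, while `counter_rate` smooths over the whole window.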
📈 Building Dashboards
Create a Dashboard
Dashboards → New → New Dashboard. Click "Add visualization" to add your first panel.
Choose Panel Type
Time series for trends, Stat for single values, Gauge for percentages, Bar chart for comparisons, Table for raw data.
Write PromQL Query
In the query editor, select your Prometheus data source and type your PromQL expression. Preview updates in real-time.
Set Display Options
Add unit (bytes, percent, ms), set thresholds (green/yellow/red), choose legend mode, adjust Y-axis scale.
Add Variables
Dashboard Settings → Variables. Create a namespace or pod variable to filter the whole dashboard at once.
Provisioned Dashboard (as code)
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    editable: false

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: Default
    folder: Infra
    type: file
    options:
      path: /var/lib/grafana/dashboards
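The file provider loads dashboard JSON files from the configured path, so a dashboard can be generated by a script and kept in version control. The sketch below builds a deliberately small dashboard model with one time series panel. It uses only a trimmed-down subset of Grafana's dashboard JSON fields; the title, uid, and helper name are hypothetical, and exporting an existing dashboard from the UI is the more reliable way to see the full schema. No data source is set on the panel, on the assumption that the default data source is Prometheus.

```python
import json


def make_dashboard(title: str, uid: str, expr: str) -> dict:
    """Return a minimal dashboard model with a single time series panel."""
    return {
        "uid": uid,
        "title": title,
        "time": {"from": "now-6h", "to": "now"},
        "refresh": "30s",
        "panels": [
            {
                "id": 1,
                "type": "timeseries",
                "title": title,
                # Grid is 24 columns wide; this panel spans half of it.
                "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
                "targets": [{"refId": "A", "expr": expr}],
            }
        ],
    }


dashboard = make_dashboard(
    title="Node CPU %",
    uid="node-cpu",
    expr='100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
)
print(json.dumps(dashboard, indent=2))
```

Saving the printed JSON as a `.json` file in the directory mounted at the provider's `path` should make the dashboard appear in the configured folder after Grafana rescans it.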
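Go to grafana.com/dashboards for thousands of ready-made dashboards. Import by ID: Dashboards → Import → paste the ID (e.g., 3662 for Prometheus 2.0 Overview, 315 for Kubernetes cluster monitoring, 1860 for Node Exporter Full, 7362 for MySQL Overview). Check the dashboard name shown on the import screen before saving, since a wrong ID silently loads a different dashboard.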
🔔 Alerting
# prometheus/rules/alerts.yml
groups:
  - name: infra-alerts
    interval: 1m
    rules:
      # Alert when any pod keeps restarting
      - alert: PodCrashLooping
        # increase() yields a restart count, which is what the description reports
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
          description: "{{ $labels.pod }} in {{ $labels.namespace }} restarted {{ $value }} times in 15m"

      # Alert when node memory is over 85%
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes * 100 > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory on {{ $labels.instance }}"

      # Alert when disk is almost full
      - alert: DiskAlmostFull
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
          / node_filesystem_size_bytes) * 100 > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Disk {{ $labels.mountpoint }} is {{ $value }}% full"
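The `for:` clause in each rule means the expression must stay true across consecutive evaluations for that long before the alert fires; until then it is pending. A small Python sketch of that state logic, under the assumption of a fixed evaluation interval, looks like this. The function name and the sequence of evaluation results are hypothetical.

```python
from typing import Iterable, List, Optional


def alert_states(condition_met: Iterable[bool], interval_s: int, for_s: int) -> List[str]:
    """Return the alert state after each evaluation: inactive, pending, or firing."""
    states: List[str] = []
    active_for: Optional[int] = None  # seconds since the condition first held, None when false
    for met in condition_met:
        if not met:
            active_for = None  # one false evaluation resets the timer
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + interval_s
        states.append("firing" if active_for >= for_s else "pending")
    return states


# Evaluated every 60s with for: 2m.
print(alert_states([True, True, False, True, True, True], interval_s=60, for_s=120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

This is why a short `for:` on a flapping metric pages often, while a long one delays the notification by at least that duration.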
Grafana Alert Contact Points
Alerting → Contact Points → Add. Grafana supports Slack, PagerDuty, email, Webhook, OpsGenie, and more.
# grafana/provisioning/alerting/contact-points.yml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: Slack Alerts
    receivers:
      - uid: slack-main
        type: slack
        settings:
          url: https://hooks.slack.com/services/xxx/yyy/zzz
          title: "{{ template \"default.title\" . }}"
          text: "{{ template \"default.message\" . }}"
📋 Common Queries
| What You Want | PromQL |
|---|---|
| CPU per pod (% of one core) | sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) * 100 |
| Memory per pod (MiB) | sum by (pod) (container_memory_working_set_bytes{container!=""}) / 1024 / 1024 |
| Node CPU % | 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 |
| Node memory % | (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 |
| HTTP req/s | sum(rate(http_requests_total[5m])) by (service) |
| Error rate % | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 |
| p95 latency (ms) | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000 |
| Running pods count | sum(kube_pod_status_phase{phase="Running"}) |
| MySQL threads | mysql_global_status_threads_connected |
| MySQL slow queries/s | rate(mysql_global_status_slow_queries[5m]) |
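The p95 latency query relies on `histogram_quantile()`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below imitates that estimate for classic histograms under simplifying assumptions: buckets are already sorted, the last one is `+Inf`, and the lowest bucket starts at 0. The function mirrors the PromQL name for readability, and the bucket counts are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets sorted by bound."""
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        # Skip empty buckets so the interpolation never divides by zero.
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into +Inf; use the last finite bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Cumulative counts for le = 0.1s, 0.25s, 0.5s, 1s, +Inf.
latency_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.90, latency_buckets))
print(histogram_quantile(0.999, latency_buckets))  # 1.0
```

The second call shows a practical limit: once the rank falls in the `+Inf` bucket, the estimate is capped at the highest finite bound, so bucket boundaries should extend past the latencies you care about.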