Grafana & Prometheus Setup Guide — Monitoring, Dashboards & Alerting

🏗️ How It Works

Monitoring Stack

Your Application
  │  exposes /metrics endpoint (HTTP)
  │  format: metric_name{label="value"} 1.23
  ▼
Prometheus
  │  scrapes every 15s (pull model)
  │  stores time-series data in TSDB
  │  evaluates alert rules
  ▼
Grafana
  │  queries Prometheus with PromQL
  │  renders dashboards + graphs
  └─► sends alerts to Slack / PagerDuty / email

📦 Install with Docker Compose

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules   # alert rules loaded via rule_files
      - prom-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:
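
The Prometheus config in this guide scrapes a target named node-exporter:9100, which the Compose file above does not define. A minimal sketch of such a service follows. The rootfs mount and pid setting follow the pattern commonly used to run node_exporter in a container, and the image tag is an assumption. The MySQL exporter is left out because its connection settings depend on your database.

```yaml
# docker-compose.yml (additional service; a sketch, adjust to your host)
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    command:
      - '--path.rootfs=/host'      # read host metrics from the mounted root filesystem
    pid: host                      # see host processes rather than the container's
    volumes:
      - '/:/host:ro,rslave'        # host root, read-only
    ports:
      - "9100:9100"
    restart: unless-stopped
```

Because the service shares the default Compose network with Prometheus, the target name node-exporter:9100 resolves without extra configuration.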

🌐 Access Grafana

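Open http://localhost:3000 and log in with admin / admin (Grafana prompts you to change the password on first login). Then add Prometheus as a data source: Connections → Data sources → Add new data source → Prometheus (Configuration → Data Sources → Add in older versions), and set the URL to http://prometheus:9090. The host name is the Compose service name, which Grafana resolves over the shared Compose network; localhost would point at the Grafana container itself.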

⚙️ Prometheus Config

# prometheus.yml
global:
  scrape_interval: 15s       # how often to pull metrics
  evaluation_interval: 15s   # how often to evaluate alert rules

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus scrapes itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes pods with annotation prometheus.io/scrape: "true"
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Combine the pod IP from __address__ with the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # Node exporter (host metrics: CPU, disk, network)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # MySQL exporter
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
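
For the kubernetes-pods job to pick up a workload, the pod has to carry the annotations that the relabel rules read. The manifest below is a sketch of what that could look like. The pod name, image, and port are placeholders, not values from this guide, and it assumes the application serves metrics on the default /metrics path.

```yaml
# pod.yaml (hypothetical workload; name, image, and port are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule
    prometheus.io/port: "8080"     # port read by the __address__ relabel rule
spec:
  containers:
    - name: demo-app
      image: example/demo-app:1.0
      ports:
        - containerPort: 8080
```

Note that kubernetes_sd_configs only discovers pods when Prometheus can reach the Kubernetes API, which usually means running Prometheus inside the cluster with a service account that may list pods.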

🔍 PromQL Query Language

PromQL is the query language for Prometheus. Every Grafana panel backed by a Prometheus data source runs one or more PromQL queries.

# ── Instant queries (current value) ───────────────────

# CPU usage across all CPUs (as percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Container CPU usage rate (last 5 min)
rate(container_cpu_usage_seconds_total{container!=""}[5m])

# HTTP request rate (per second, last 2 min)
rate(http_requests_total[2m])

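# HTTP error rate as % of total
# (aggregate both sides so their label sets match; otherwise each 5xx series is divided by itself)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100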

# 95th percentile latency (seconds), aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

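# Running pods per namespace
# (kube_pod_status_phase has no deployment label; its value is 1 for the pod's current phase, so use sum)
sum by (namespace) (kube_pod_status_phase{phase="Running"})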

# MySQL connections in use
mysql_global_status_threads_connected

Function	What It Does	Example
`rate()`	Per-second rate of a counter over a window	`rate(http_requests_total[5m])`
`irate()`	Instant rate (uses last 2 samples)	`irate(cpu_seconds[1m])`
`increase()`	Total increase over a window	`increase(errors[1h])`
`sum()`	Sum across series	`sum(rate(req[5m]))`
`avg()`	Average across series	`avg by (pod) (cpu_usage)`
`max()`	Maximum across series	`max(memory_bytes)`
`histogram_quantile()`	Percentile from histogram	`histogram_quantile(0.95, ...)`
`label_replace()`	Rename/rewrite a label	`label_replace(m, "new", "$1", "old", "(.+)")`
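
Expressions built from the functions above can be precomputed with recording rules, so that dashboards query a stored series instead of re-evaluating the expression on every refresh. The sketch below assumes that http_requests_total carries service and status labels, as in the queries in this guide. The file name and rule names are illustrative.

```yaml
# prometheus/rules/recording.yml (hypothetical file; matched by the rule_files glob)
groups:
  - name: http-recording-rules
    interval: 1m
    rules:
      # Per-service request rate, stored as its own time series
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))

      # Per-service share of 5xx responses, in percent
      - record: service:http_requests_errors:percent5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m])) * 100
```

A panel can then query service:http_requests:rate5m directly. The level:metric:operations naming follows the convention recommended in the Prometheus documentation.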

📈 Building Dashboards

Create a Dashboard

Dashboards → New → New Dashboard. Click "Add visualization" to add your first panel.

Choose Panel Type

Time series for trends, Stat for single values, Gauge for percentage, Bar chart for comparisons, Table for raw data.

Write PromQL Query

In the query editor, select your Prometheus data source and type your PromQL expression. Preview updates in real-time.

Set Display Options

Add unit (bytes, percent, ms), set thresholds (green/yellow/red), choose legend mode, adjust Y-axis scale.

Add Variables

Dashboard Settings → Variables. Create a namespace or pod variable to filter the whole dashboard at once.

Provisioned Dashboard (as code)

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    editable: false

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: Default
    folder: Infra
    type: file
    options:
      path: /var/lib/grafana/dashboards
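
Grafana only reads these files if they are present inside the container. One way to wire that up with the Compose file from this guide is to bind-mount the provisioning directory and a dashboards directory into the grafana service. The sketch assumes the official image's default provisioning path (/etc/grafana/provisioning) and a host directory ./grafana/dashboards holding exported dashboard JSON files. Both host paths are assumptions.

```yaml
# docker-compose.yml (grafana service, volumes only; a sketch)
services:
  grafana:
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro   # datasources/, dashboards/, alerting/
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro   # JSON files loaded by the file provider
```

Provisioning files are read at startup, so restart the Grafana container after changing them.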

💡 Import community dashboards

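Go to grafana.com/dashboards for thousands of ready-made dashboards. Import by ID: Dashboards → Import → paste the ID (e.g., 3662 for Prometheus 2.0 Overview, 1860 for Node Exporter Full, 7362 for MySQL Overview, 315 for Kubernetes cluster monitoring). Check the ID on the dashboard's page before importing, since each ID maps to one specific dashboard.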

🔔 Alerting

# prometheus/rules/alerts.yml
groups:
  - name: infra-alerts
    interval: 1m
    rules:
      # Alert when any pod keeps restarting
      # (increase() returns the restart count over the window; rate() would return a per-second value)
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
          description: "{{ $labels.pod }} in {{ $labels.namespace }} restarted about {{ printf \"%.0f\" $value }} times in 15m"

      # Alert when node memory is over 85%
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes * 100 > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory on {{ $labels.instance }}"

      # Alert when disk is almost full
      - alert: DiskAlmostFull
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
          / node_filesystem_size_bytes) * 100 > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Disk {{ $labels.mountpoint }} is {{ $value }}% full"
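
Alert rules like these can be checked before deployment with promtool's rule unit tests. The file below is a sketch that feeds synthetic memory samples (90% used) into the HighMemoryUsage rule and expects it to fire. The file location, the instance name node1, and the sample values are made up for illustration. The test file is kept outside prometheus/rules/ so the rule_files glob does not try to load it as a rule file.

```yaml
# prometheus/tests/alerts_test.yml (hypothetical path)
rule_files:
  - ../rules/alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Constant values: 1000 bytes total, 100 bytes available, so 90% used
      - series: 'node_memory_MemTotal_bytes{instance="node1"}'
        values: '1000+0x10'
      - series: 'node_memory_MemAvailable_bytes{instance="node1"}'
        values: '100+0x10'
    alert_rule_test:
      # The rule has "for: 2m", so it should be firing by the 5 minute mark
      - eval_time: 5m
        alertname: HighMemoryUsage
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: node1
            exp_annotations:
              summary: "High memory on node1"
```

Run it with `promtool test rules prometheus/tests/alerts_test.yml`; promtool ships with the Prometheus release and is included in the prom/prometheus image.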

Grafana Alert Contact Points

Alerting → Contact Points → Add. Grafana supports Slack, PagerDuty, email, Webhook, OpsGenie, and more.

# grafana/provisioning/alerting/contact-points.yml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: Slack Alerts
    receivers:
      - uid: slack-main
        type: slack
        settings:
          url: https://hooks.slack.com/services/xxx/yyy/zzz
          title: "{{ template \"default.title\" . }}"
          text: "{{ template \"default.message\" . }}"
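
A contact point only receives alerts once a notification policy routes to it. The sketch below shows how that routing might be provisioned alongside the contact point above. The file name, grouping labels, and timing values are assumptions, so check them against the provisioning documentation for your Grafana version. Provisioning a policy replaces the organization's existing policy tree, so treat this as a starting point rather than a drop-in file.

```yaml
# grafana/provisioning/alerting/policies.yml (hypothetical file name)
apiVersion: 1
policies:
  - orgId: 1
    receiver: Slack Alerts            # default receiver; must match the contact point name
    group_by: ['alertname']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      # Re-send critical alerts more often than the default
      - receiver: Slack Alerts
        object_matchers:
          - ['severity', '=', 'critical']
        repeat_interval: 1h
```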

📋 Common Queries

What You Want	PromQL
CPU % per pod (of one core)	`sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) * 100`
Memory per pod (MB)	`sum by (pod) (container_memory_working_set_bytes{container!=""}) / 1024 / 1024`
Node CPU %	`100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100`
Node memory %	`(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100`
HTTP req/s	`sum(rate(http_requests_total[5m])) by (service)`
Error rate %	`sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100`
p95 latency (ms)	`histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000`
Running pods count	`sum(kube_pod_status_phase{phase="Running"})`
MySQL threads	`mysql_global_status_threads_connected`
MySQL slow queries/s	`rate(mysql_global_status_slow_queries[5m])`
