
🏗️ How It Works

Monitoring Stack

Your Application
  │  exposes /metrics endpoint (HTTP)
  │  format: metric_name{label="value"} 1.23
  ▼
Prometheus
  │  scrapes every 15s (pull model)
  │  stores time-series data in TSDB
  │  evaluates alert rules
  ▼
Grafana
  │  queries Prometheus with PromQL
  │  renders dashboards + graphs
  └─► sends alerts to Slack / PagerDuty / email
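
To make the first hop of the diagram concrete, here is a minimal sketch of an application exposing /metrics in the text format shown above. It uses only the Python standard library. The metric name `app_requests_total`, its label, and port 8000 are illustrative assumptions, and a real service would normally use an official Prometheus client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter, for illustration only.
REQUESTS = {"count": 0}


def render_metric(name, labels, value):
    """Render one sample as: metric_name{label="value"} 1.23

    Simplification: label values are not escaped here.
    """
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        REQUESTS["count"] += 1
        body = (
            "# TYPE app_requests_total counter\n"
            + render_metric("app_requests_total", {"path": "/metrics"}, REQUESTS["count"])
            + "\n"
        )
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Prometheus would then need a scrape target pointing at this host and port.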

📦 Install with Docker Compose

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules   # alert rules matched by rule_files
      - prom-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
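      - GF_SECURITY_ADMIN_PASSWORD=admin   # demo value; change before exposing beyond localhost
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning   # data source, dashboard, and alerting provisioning files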
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:

🌐 Access Grafana

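Open http://localhost:3000 and log in with admin / admin. On first launch, add Prometheus as a data source: Configuration → Data Sources → Add → Prometheus (the menu is Connections → Data sources in recent Grafana versions), then set the URL to http://prometheus:9090. Use the Compose service name here, not localhost: inside the Grafana container, localhost refers to Grafana itself.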
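
The same data source can also be registered without the UI. The sketch below builds (but does not send) a request for Grafana's HTTP API endpoint `POST /api/datasources`, using basic auth with the admin credentials above. The function name `datasource_request` is hypothetical, and the payload covers only the fields this setup needs.

```python
import base64
import json
import urllib.request


def datasource_request(grafana_url, user, password, prometheus_url):
    """Build a POST request that registers Prometheus as a Grafana data source."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,   # as seen from the Grafana container
        "access": "proxy",       # Grafana's backend makes the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(...)`, for example with `datasource_request("http://localhost:3000", "admin", "admin", "http://prometheus:9090")`.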

⚙️ Prometheus Config

# prometheus.yml
global:
  scrape_interval: 15s       # how often to pull metrics
  evaluation_interval: 15s   # how often to evaluate alert rules

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus scrapes itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes pods with annotation prometheus.io/scrape: "true"
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to <pod IP>:<annotated port>
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # Node exporter (host metrics: CPU, disk, network)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # MySQL exporter
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
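
Relabel rules with `action: replace` can be hard to read. Prometheus joins the values of `source_labels` with `;`, matches the fully anchored regex against that string, and writes the expanded replacement into `target_label`. The sketch below imitates that mechanism in Python to show how a pod address and a port annotation combine. The function `relabel_replace` is a hypothetical illustration, not Prometheus code, and the regex is the pattern commonly used for this annotation.

```python
import re


def _to_python_template(replacement):
    """Convert Prometheus-style $1 / ${1} references to Python's \\g<1>."""
    return re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Apply one 'replace' relabel rule and return the new label set."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # Prometheus anchors the regex at both ends
    result = dict(labels)
    if match is not None:
        result[target_label] = match.expand(_to_python_template(replacement))
    return result  # unchanged copy when the regex does not match


PORT_RULE = dict(
    source_labels=["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    regex=r"([^:]+)(?::\d+)?;(\d+)",
    replacement="$1:$2",
    target_label="__address__",
)
```

For a pod discovered at `10.0.0.5:8080` with the annotation `prometheus.io/port: "9102"`, `relabel_replace(labels, **PORT_RULE)` yields `__address__ = 10.0.0.5:9102`. A pod without the annotation keeps its original address, because the regex requires digits after the `;`.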

🔍 PromQL Query Language

PromQL is the query language for Prometheus. Every Grafana panel backed by a Prometheus data source runs a PromQL query.

# ── Instant queries (current value) ───────────────────

# CPU usage across all CPUs (as percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Container CPU usage rate (last 5 min)
rate(container_cpu_usage_seconds_total{container!=""}[5m])

# HTTP request rate (per second, last 2 min)
rate(http_requests_total[2m])

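# HTTP error rate as % of total
# (aggregate both sides first; dividing the raw series matches each
#  5xx series against itself and always returns 100)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# 95th percentile latency (buckets summed across instances, keeping le)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))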

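# Running pods per namespace
# (the metric is 1 for a pod's current phase and 0 otherwise, so sum it;
#  it carries namespace and pod labels but no deployment label)
sum by (namespace) (kube_pod_status_phase{phase="Running"})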

# MySQL connections in use
mysql_global_status_threads_connected
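
Any of the queries above can also be run outside Grafana through Prometheus's HTTP API endpoint `/api/v1/query`. The following is a minimal sketch using only the standard library. The function names are hypothetical, and the base URL assumes the Compose setup from earlier (`http://localhost:9090`).

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url, promql, time=None):
    """Build the URL for an instant query; 'time' is an optional Unix timestamp."""
    params = {"query": promql}
    if time is not None:
        params["time"] = str(time)
    return f"{base_url.rstrip('/')}/api/v1/query?{urllib.parse.urlencode(params)}"


def run_instant_query(base_url, promql):
    """Send the query and return the list of result series."""
    with urllib.request.urlopen(instant_query_url(base_url, promql), timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return body["data"]["result"]
```

For example, `run_instant_query("http://localhost:9090", "up")` returns one entry per scrape target, each with its labels and current value.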

| Function | What It Does | Example |
| --- | --- | --- |
| rate() | Per-second rate of a counter over a window | rate(http_requests_total[5m]) |
| irate() | Instant rate (uses last 2 samples) | irate(cpu_seconds[1m]) |
| increase() | Total increase over a window | increase(errors[1h]) |
| sum() | Sum across series | sum(rate(req[5m])) |
| avg() | Average across series | avg by (pod) (cpu_usage) |
| max() | Maximum across series | max(memory_bytes) |
| histogram_quantile() | Percentile from histogram | histogram_quantile(0.95, ...) |
| label_replace() | Rename/rewrite a label | label_replace(m, "new", "$1", "old", "(.+)") |
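
Since `rate()` appears in most of these queries, it helps to see roughly what it computes. The sketch below is a simplified model: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. The real `rate()` also extrapolates to the edges of the window, which this version deliberately leaves out.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), ordered by time.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counter reset: the process restarted and counted up from 0 again.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None
```

This is also why `rate()` should only be applied to counters: on a gauge, every decrease would be misread as a reset.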

📈 Building Dashboards

1. Create a Dashboard
   Dashboards → New → New Dashboard. Click "Add visualization" to add your first panel.

2. Choose Panel Type
   Time series for trends, Stat for single values, Gauge for percentage, Bar chart for comparisons, Table for raw data.

3. Write PromQL Query
   In the query editor, select your Prometheus data source and type your PromQL expression. Preview updates in real-time.

4. Set Display Options
   Add unit (bytes, percent, ms), set thresholds (green/yellow/red), choose legend mode, adjust Y-axis scale.

5. Add Variables
   Dashboard Settings → Variables. Create a namespace or pod variable to filter the whole dashboard at once.

Provisioned Dashboard (as code)

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    editable: false

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: Default
    folder: Infra
    type: file
    options:
      path: /var/lib/grafana/dashboards
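
The file provider above loads dashboard JSON files from the configured path. As a rough sketch of what such a file can contain, the following generates a small dashboard with one time series panel and a `namespace` variable. The field selection is an assumed minimal subset of Grafana's dashboard JSON model. The panel sets no data source, on the assumption that the default one is used. The function names, the variable query, and the output file name are illustrative.

```python
import json


def minimal_dashboard(title, expr):
    """Return a small dashboard definition as a plain dict."""
    return {
        "title": title,
        "refresh": "30s",
        "time": {"from": "now-1h", "to": "now"},
        "templating": {
            "list": [
                {
                    # Assumed variable: namespaces seen by kube-state-metrics.
                    "name": "namespace",
                    "type": "query",
                    "query": "label_values(kube_pod_info, namespace)",
                    "refresh": 1,
                }
            ]
        },
        "panels": [
            {
                "id": 1,
                "type": "timeseries",
                "title": title,
                "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
                "targets": [{"refId": "A", "expr": expr}],
            }
        ],
    }


def write_dashboard(path, dashboard):
    """Write the dashboard as JSON so the file provider can pick it up."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(dashboard, fh, indent=2)
```

For example, `write_dashboard("infra.json", minimal_dashboard("HTTP request rate", "sum(rate(http_requests_total[5m]))"))` produces a file that can be placed in the provider's directory.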

💡 Import community dashboards

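Go to grafana.com/dashboards for thousands of ready-made dashboards. Import by ID: Dashboards → Import → paste the ID (e.g., 3662 for Prometheus 2.0 Overview, 315 for Kubernetes cluster monitoring, 1860 for Node Exporter Full). Confirm the ID on the dashboard's page before importing, since a wrong number silently loads an unrelated dashboard.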

🔔 Alerting

# prometheus/rules/alerts.yml
groups:
  - name: infra-alerts
    interval: 1m
    rules:
      # Alert when any pod keeps restarting
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
          description: "{{ $labels.pod }} in {{ $labels.namespace }} restarted {{ $value | humanize }} times in 15m"

      # Alert when node memory is over 85%
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes * 100 > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory on {{ $labels.instance }}"

      # Alert when disk is almost full
      - alert: DiskAlmostFull
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
          / node_filesystem_size_bytes) * 100 > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Disk {{ $labels.mountpoint }} is {{ $value | humanize }}% full"
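
The `for:` field in these rules is easy to misread. An alert whose expression becomes true first enters a pending state, and it only fires once the expression has stayed true for the whole `for` duration. The sketch below models that for a single series. It is an illustration of the behaviour, not Prometheus code, and the function name is hypothetical.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (True = expr returned a result).
    eval_interval_s:   seconds between evaluations.
    for_s:             the rule's 'for' duration in seconds.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states
```

With a 1m evaluation interval and `for: 2m`, the sequence true, true, true, false, true gives pending, pending, firing, inactive, pending. A single healthy evaluation is enough to restart the wait.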

Grafana Alert Contact Points

Alerting → Contact Points → Add. Grafana supports Slack, PagerDuty, email, Webhook, OpsGenie, and more.

# grafana/provisioning/alerting/contact-points.yml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: Slack Alerts
    receivers:
      - uid: slack-main
        type: slack
        settings:
          url: https://hooks.slack.com/services/xxx/yyy/zzz
          title: "{{ template \"default.title\" . }}"
          text: "{{ template \"default.message\" . }}"

📋 Common Queries

| What You Want | PromQL |
| --- | --- |
| CPU % per pod (of one core) | sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) * 100 |
| Memory per pod (MiB) | sum by (pod) (container_memory_working_set_bytes{container!=""}) / 1024 / 1024 |
| Node CPU % | 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 |
| Node memory % | (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 |
| HTTP req/s | sum(rate(http_requests_total[5m])) by (service) |
| Error rate % | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 |
| p95 latency (ms) | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000 |
| Running pods count | sum(kube_pod_status_phase{phase="Running"}) |
| MySQL threads | mysql_global_status_threads_connected |
| MySQL slow queries/s | rate(mysql_global_status_slow_queries[5m]) |
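
The p95 latency query relies on `histogram_quantile()`, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below shows the idea for classic histograms. It assumes non-negative bucket bounds and a final `+Inf` bucket. The function name `estimate_quantile` is hypothetical, and edge cases are simplified compared with Prometheus itself.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), including (math.inf, total).
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0.0
    if count == prev_count:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
```

Because of the interpolation, the result is only as precise as the bucket layout: with bounds at 0.5s and 1s, any p95 between them is an estimate, not a measurement.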