Docker · Kubernetes · Grafana · ArgoCD · AWS

🗺️ The DevOps Landscape

DevOps Toolchain

Plan           Build           Store       Deploy
─────          ─────           ─────       ──────
Jira           Jenkins         ECR         ArgoCD
Linear         GitHub Actions  Harbor      Flux
GitHub Issues  GitLab CI       Docker Hub  Helm
               CircleCI        Nexus       Kustomize

Monitor     Log            Trace          Security
───────     ───            ─────          ────────
Prometheus  Loki           Jaeger         Vault
Grafana     Elasticsearch  Tempo          Trivy
Datadog     Fluentd        Zipkin         Falco
New Relic   Logstash       OpenTelemetry  Snyk

Infra as Code   Service Mesh  Secrets
─────────────   ────────────  ───────
Terraform       Istio         Vault
Pulumi          Linkerd       Sealed Secrets
Ansible         Consul        External Secrets
CloudFormation

⚙️ CI/CD — Build & Deploy Pipelines

🏗️

Jenkins

The OG of CI/CD. Self-hosted, 1800+ plugins, runs on any infra. Very flexible, higher ops overhead.

🐙

GitHub Actions

Native to GitHub. YAML workflows in .github/workflows/. Best choice if your code is on GitHub.

🦊

GitLab CI

Built into GitLab. Excellent for self-hosted environments. Pipelines defined in .gitlab-ci.yml.

🔵

CircleCI

Hosted CI with fast parallelism. Good Docker support, simple YAML config, great for open source.

GitHub Actions — Full Docker + k8s Pipeline

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: 123456789012.dkr.ecr.us-east-1.amazonaws.com
  IMAGE: 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: docker compose run --rm app pytest

  build-push:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to ECR
        run: aws ecr get-login-password | docker login --username AWS --password-stdin ${{ env.REGISTRY }}

      - name: Build and push
        id: meta
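        run: |
          TAG=${{ env.IMAGE }}:${{ github.sha }}
          # Pull the previous image so --cache-from has layers to reuse (ignore failure on the first build)
          docker pull ${{ env.IMAGE }}:latest || true
          # BUILDKIT_INLINE_CACHE=1 embeds cache metadata so later builds can reuse this image as a cache source
          docker build --cache-from ${{ env.IMAGE }}:latest \
            --build-arg BUILDKIT_INLINE_CACHE=1 \
            -t "$TAG" -t ${{ env.IMAGE }}:latest .
          docker push "$TAG"
          docker push ${{ env.IMAGE }}:latest
          echo "tags=$TAG" >> "$GITHUB_OUTPUT"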

  deploy:
    needs: build-push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: myorg/infra
          token: ${{ secrets.INFRA_TOKEN }}

      - name: Update image tag and push
        run: |
          sed -i "s|image: .*|image: ${{ needs.build-push.outputs.image-tag }}|" manifests/myapp/deployment.yaml
          git config user.email "ci@company.com" && git config user.name "CI"
          git commit -am "deploy: myapp ${{ github.sha }}" && git push
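The `sed` expression in the deploy step rewrites every `image:` line in the manifest, which matters once a Deployment has more than one container. The sketch below reproduces that behaviour in Python and shows a narrower alternative that only touches one repository. The manifest fragment, the `example/sidecar` image, and both function names are hypothetical and exist only for illustration.

```python
import re

def set_image(manifest: str, image_ref: str) -> str:
    """Mimic sed 's|image: .*|image: REF|': every image: line is rewritten."""
    return re.sub(r"image: .*", lambda _m: f"image: {image_ref}", manifest)

def set_image_for_repo(manifest: str, repo: str, tag: str) -> str:
    """Rewrite only the tag of images that come from `repo`."""
    pattern = rf"(image: {re.escape(repo)})(:\S+)?(?=\s|$)"
    return re.sub(pattern, lambda m: f"{m.group(1)}:{tag}", manifest)

# Hypothetical two-container manifest fragment.
REPO = "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp"
manifest = f"""\
containers:
  - name: myapp
    image: {REPO}:old
  - name: sidecar
    image: example/sidecar:1.0
"""

updated = set_image(manifest, f"{REPO}:abc123")          # both containers change
targeted = set_image_for_repo(manifest, REPO, "abc123")  # only myapp changes
print(targeted)
```

If the manifest will ever hold a sidecar or init container, the targeted form (or a tool such as `kustomize edit set image`) avoids overwriting unrelated images.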

Jenkins — Declarative Pipeline

// Jenkinsfile
pipeline {
    agent any
    environment {
        REGISTRY = '123456789012.dkr.ecr.us-east-1.amazonaws.com'
        IMAGE    = "${REGISTRY}/myapp"
    }
    stages {
        stage('Test') {
            steps {
                sh 'docker compose run --rm app pytest'
            }
        }
        stage('Build') {
            steps {
                sh """
                    aws ecr get-login-password --region us-east-1 | \
                      docker login --username AWS --password-stdin ${REGISTRY}
                    docker build -t ${IMAGE}:${BUILD_NUMBER} -t ${IMAGE}:latest .
                    docker push ${IMAGE}:${BUILD_NUMBER}
                    docker push ${IMAGE}:latest
                """
            }
        }
        stage('Deploy') {
            when { branch 'main' }
            steps {
                sh """
                    sed -i "s|image: .*|image: ${IMAGE}:${BUILD_NUMBER}|" manifests/myapp/deployment.yaml
                    git commit -am "deploy: build ${BUILD_NUMBER}" && git push
                """
            }
        }
    }
    post {
        failure { slackSend color: 'danger', message: "Build failed: ${BUILD_URL}" }
        success { slackSend color: 'good',   message: "Deployed: ${IMAGE}:${BUILD_NUMBER}" }
    }
}

🏗️ Infrastructure as Code

🌍

Terraform

Define AWS/GCP/Azure infrastructure in HCL. Plan → Apply → Destroy. The industry standard IaC tool.

🟠

Pulumi

IaC using real programming languages (TypeScript, Python, Go). Loops, functions, and unit tests come from the language itself, which helps when infrastructure logic gets complex.

📜

Ansible

Agentless config management. SSH-based, great for VM setup, package installs, and correcting configuration drift.

🔶

CloudFormation

AWS-native IaC in JSON/YAML. No extra tools needed. Deployments tend to be slower than with Terraform, but it integrates deeply with the AWS console.

Terraform — AWS ECS Service Example

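# main.tf (excerpt: aws_ecs_task_definition.myapp, aws_security_group.app,
# aws_lb_target_group.myapp, aws_lb.main, and var.private_subnet_ids are defined elsewhere)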
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" { region = "us-east-1" }

resource "aws_ecs_cluster" "main" {
  name = "production"
  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

resource "aws_ecs_service" "myapp" {
  name            = "myapp"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.myapp.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.myapp.arn
    container_name   = "myapp"
    container_port   = 8080
  }
}

output "alb_dns" {
  value = aws_lb.main.dns_name
}

# Common Terraform commands
terraform init         # download providers
terraform plan         # preview changes
terraform apply        # apply changes
terraform destroy      # tear down everything
terraform output       # show output values
terraform state list   # list managed resources
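In CI it is often useful to know whether `terraform plan` found anything to change. A minimal sketch, assuming `terraform` is on the PATH and the working directory is already initialised: the `-detailed-exitcode` flag makes `plan` exit with 0 (no changes), 1 (error), or 2 (changes pending). The function names and the `tfplan` file name are illustrative choices.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "no-changes", 1: "error", 2: "changes-pending"}

def interpret_plan_exit(code: int) -> str:
    """Map a plan exit code to a readable status."""
    return PLAN_STATUS.get(code, "unknown")

def run_plan(workdir: str) -> str:
    """Run a non-interactive plan in workdir and classify the result."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-out=tfplan"],
        cwd=workdir,
    )
    return interpret_plan_exit(proc.returncode)
```

A pipeline could call `run_plan(".")` and only request approval for `terraform apply tfplan` when the status is `changes-pending`.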

⛵ Helm — Kubernetes Package Manager

Helm packages Kubernetes manifests into reusable, versioned "charts". Install complex apps with a single command.

# Add repositories
# (the old "stable" repo at https://charts.helm.sh/stable is deprecated and archived; use vendor or community repos)
helm repo add bitnami       https://charts.bitnami.com/bitnami
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Search for charts
helm search repo postgres

# Install a chart
helm install my-postgres bitnami/postgresql \
  --namespace db --create-namespace \
  --set auth.postgresPassword=secret \
  --set primary.persistence.size=20Gi

# Install with values file
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  -f values.yaml

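# Upgrade release (same namespace as the install; --reuse-values keeps earlier --set overrides)
helm upgrade my-postgres bitnami/postgresql \
  --namespace db --reuse-values \
  --set image.tag=16.0.0

# Rollback to revision 1
helm rollback my-postgres 1 --namespace db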

# List releases
helm list -A

# Uninstall
helm uninstall my-postgres -n db

# Create your own chart
helm create myapp
helm package myapp/
helm install myapp ./myapp-0.1.0.tgz
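How `-f values.yaml` and `--set` combine with a chart's defaults can be pictured as a recursive dictionary merge: chart defaults first, then values files, then `--set` flags, with later layers winning. The sketch below is a simplified model of that precedence, not Helm's actual implementation; the default values shown are hypothetical rather than the real Bitnami chart defaults.

```python
from copy import deepcopy

def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not merged.
            merged[key] = deepcopy(value)
    return merged

# Hypothetical layers, lowest precedence first.
chart_defaults = {
    "auth": {"postgresPassword": "", "database": "postgres"},
    "primary": {"persistence": {"size": "8Gi"}},
}
values_file = {"primary": {"persistence": {"size": "20Gi"}}}
set_flags = {"auth": {"postgresPassword": "secret"}}

effective = merge_values(merge_values(chart_defaults, values_file), set_flags)
print(effective)
```

On a real release, `helm get values my-postgres -n db --all` shows the values Helm actually computed.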

📊 Monitoring & Observability

🔥

Prometheus

Time-series metrics database. Pull model — scrapes /metrics endpoints. Powers most k8s monitoring.

📊

Grafana

Dashboard and alerting. Connects to Prometheus, Loki, Tempo, CloudWatch, and 50+ other data sources.

🐕

Datadog

All-in-one SaaS monitoring — metrics, logs, traces, APM, RUM. Expensive but zero infrastructure overhead.

⚠️

AlertManager

Handles Prometheus alerts — routes to Slack/PagerDuty, deduplicates, silences. Part of Prometheus stack.
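To make the pull model concrete: Prometheus periodically sends an HTTP GET to each target's /metrics endpoint and parses a plain-text response. Below is a minimal sketch of such an endpoint using only the Python standard library. The metric name, the counts, and port 8000 are made up for illustration, and a real service would normally use an official client library instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter keyed by (method, status).
REQUESTS = {("GET", "200"): 1027, ("POST", "500"): 3}

def render_metrics(counts: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics for Prometheus to scrape."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at `host:8000` would be enough for Prometheus to start collecting `http_requests_total`.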

AlertManager Config

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'slack-critical'
  routes:
    # `matchers` replaces the deprecated `match` field in current Alertmanager releases
    - matchers: ['severity="critical"']
      receiver: pagerduty
    - matchers: ['severity="warning"']
      receiver: slack-critical

receivers:
  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'your-pagerduty-key'
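The routing tree in the config above resolves like this: child routes are checked in order, the first one whose labels match wins, and an alert that matches none falls back to the top-level receiver. A toy model of that lookup follows. It deliberately ignores `continue`, nested routes, and grouping, and `pick_receiver` is a name invented for this sketch.

```python
# Child routes in the order they appear in alertmanager.yml.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack-critical"),
]
DEFAULT_RECEIVER = "slack-critical"  # the top-level `receiver`

def pick_receiver(labels: dict) -> str:
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER

print(pick_receiver({"alertname": "HighLatency", "severity": "critical"}))
print(pick_receiver({"alertname": "DiskFilling", "severity": "info"}))
```

Against a real config, `amtool config routes test` performs the same check using Alertmanager's own matching logic.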

📋 Logging — Loki & ELK Stack

🪵

Grafana Loki

Like Prometheus but for logs. Indexes only labels (not content) — very cheap storage. Query with LogQL.

🦅

Elasticsearch

Full-text search + log storage. More powerful queries than Loki. Higher cost. The "E" in the ELK stack; OpenSearch is an open-source fork with a largely compatible API.

🐦

Fluentd / Promtail

Log collectors. Fluentd → Elasticsearch. Promtail → Loki. Runs as DaemonSet in k8s to collect all pod logs.

🎯

Kibana

Dashboard for Elasticsearch. Like Grafana but for text logs. Comes with ELK stack.

LogQL — Querying Loki

# Stream all logs from a namespace
{namespace="myapp"}

# Filter by container and search for error
{namespace="myapp", container="api"} |= "ERROR"

# Regex filter
{app="nginx"} |~ "POST .* 5[0-9][0-9]"

# Parse JSON logs and filter by field
{app="api"} | json | level="error" | response_time > 1000

# Count errors per minute, per pod
sum by (pod) (count_over_time({namespace="myapp"} |= "ERROR" [1m]))

# Show last 100 lines (LogQL has no limit stage; the client sets the limit)
logcli query --limit=100 '{namespace="myapp"}'
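Outside Grafana, the same queries can be sent to Loki's HTTP API, where the line limit is a request parameter rather than part of the LogQL expression. A minimal sketch that builds a `query_range` URL; the base URL `http://loki:3100` is an assumed in-cluster address, and `loki_query_url` is a helper name invented here.

```python
from urllib.parse import urlencode

def loki_query_url(base_url: str, logql: str, limit: int = 100,
                   direction: str = "backward") -> str:
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode({"query": logql, "limit": limit, "direction": direction})
    return f"{base_url}/loki/api/v1/query_range?{params}"

# Assumed Loki address; newest lines first, at most 100 of them.
url = loki_query_url("http://loki:3100", '{namespace="myapp"}')
print(url)
```

Omitting `start` and `end`, as this sketch does, leaves the time window to Loki's default, so a real caller would usually pass them explicitly.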

🔍 Distributed Tracing

🏹

Jaeger

Open-source distributed tracing (by Uber). Trace requests across microservices. Integrates with OpenTelemetry.

🎵

Grafana Tempo

Scalable trace storage by Grafana Labs. Pairs with Loki + Prometheus for full observability in Grafana.

📡

OpenTelemetry

Vendor-neutral standard for metrics, logs, and traces. Instrument once, export to Jaeger/Tempo/Datadog/etc.

🔭

Zipkin

Original open-source tracing (by Twitter). Simpler than Jaeger. Good for small deployments.
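What ties spans from different services into one trace is a propagated context. With the W3C Trace Context format (the default propagator in OpenTelemetry), each outgoing request carries a `traceparent` header of the form `version-traceid-spanid-flags`. The sketch below generates and forwards that header by hand to show the mechanics; in practice an OpenTelemetry SDK does this for you, and the function names are illustrative.

```python
import re
import secrets

# 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: fresh trace id and span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"

def child_traceparent(parent: str) -> str:
    """Continue an incoming trace: keep the trace id, mint a new span id."""
    match = TRACEPARENT_RE.match(parent)
    if not match:
        return new_traceparent()  # malformed header: start a new trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

Both headers share the same trace id, which is how Jaeger, Tempo, or Zipkin can stitch the two spans into a single request timeline.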

🔒 Security Tools

🔐

HashiCorp Vault

Secrets management platform. Stores, rotates, and leases secrets. Dynamic DB credentials, PKI, encryption-as-a-service.

🛡️

Trivy

Fast vulnerability scanner for Docker images, filesystems, and k8s. Run in CI to catch CVEs before deploy.

🦅

Falco

Runtime security for Kubernetes. Detects unexpected system calls inside containers (shell in pod, file write, etc.).

🔍

Snyk

SaaS security scanning — code, dependencies, Docker images, k8s manifests. Integrates with GitHub PRs.

Trivy — Scan Images in CI

# Install
curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh

# Scan a Docker image
trivy image nginx:latest

# Scan and fail build on HIGH/CRITICAL CVEs
trivy image \
  --severity HIGH,CRITICAL \
  --exit-code 1 \
  myapp:latest

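# Scan a project directory (dependency lock files such as package-lock.json, requirements.txt)
trivy fs .

# Scan Dockerfiles and Kubernetes manifests for misconfigurations
trivy config .

# Scan a running cluster (uses the current kubeconfig context)
trivy k8s --report summary cluster

# Generate an SBOM in CycloneDX format
trivy image --format cyclonedx -o sbom.cdx.json myapp:latest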
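When `--exit-code 1` is too blunt (for example, to print a summary or apply an allowlist), a CI step can read a report written with `trivy image --format json` and decide for itself. A minimal sketch, assuming the report's usual `Results[].Vulnerabilities[]` layout; the sample report and its CVE ID are placeholders, not real findings.

```python
def blocking_findings(report: dict, severities=("HIGH", "CRITICAL")) -> list:
    """Return (target, vulnerability id, package) for findings at the given severities."""
    found = []
    for result in report.get("Results", []):
        # "Vulnerabilities" can be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities:
                found.append(
                    (result.get("Target"), vuln.get("VulnerabilityID"), vuln.get("PkgName"))
                )
    return found

# Placeholder report with the same shape as Trivy's JSON output.
SAMPLE_REPORT = {
    "Results": [
        {
            "Target": "myapp:latest (debian 12.5)",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "PkgName": "openssl", "Severity": "HIGH"},
                {"VulnerabilityID": "CVE-0000-0002", "PkgName": "zlib", "Severity": "LOW"},
            ],
        },
        {"Target": "app/requirements.txt", "Vulnerabilities": None},
    ]
}

findings = blocking_findings(SAMPLE_REPORT)
for target, vuln_id, pkg in findings:
    print(f"{vuln_id} in {pkg} ({target})")
```

In a pipeline, the script would load the report with `json.load` and exit non-zero when `findings` is non-empty.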

Vault — Dynamic DB Credentials

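# Start Vault (dev mode: in-memory and auto-unsealed, never use in production)
vault server -dev -dev-root-token-id="root"

# In a second shell, point the CLI at the dev server
export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_TOKEN='root'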

# Enable database secrets engine
vault secrets enable database

# Configure PostgreSQL connection
vault write database/config/mydb \
  plugin_name=postgresql-database-plugin \
  connection_url="postgresql://{{username}}:{{password}}@db:5432/myapp" \
  allowed_roles="app-role" \
  username="vault" \
  password="vaultpass"

# Create role with TTL
vault write database/roles/app-role \
  db_name=mydb \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT,INSERT,UPDATE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  default_ttl="1h" \
  max_ttl="24h"

# Get temp credentials (your app calls this)
vault read database/creds/app-role
# Returns: username=v-app-role-xyz  password=A1b2C3d4  lease_duration=1h
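From application code, the same read is a GET to Vault's HTTP API at `/v1/database/creds/<role>` with the token in the `X-Vault-Token` header. A minimal sketch using only the standard library; the helper names are invented here, the sample payload is illustrative, and a production app would more likely use a Vault client library or Vault Agent plus lease renewal.

```python
import json
import urllib.request

def parse_creds(payload: dict) -> dict:
    """Pull the fields an app needs out of a database/creds response."""
    return {
        "username": payload["data"]["username"],
        "password": payload["data"]["password"],
        "ttl_seconds": payload["lease_duration"],
        "lease_id": payload["lease_id"],
    }

def read_db_creds(vault_addr: str, token: str, role: str = "app-role") -> dict:
    """Request short-lived database credentials for `role` from Vault."""
    request = urllib.request.Request(
        f"{vault_addr}/v1/database/creds/{role}",
        headers={"X-Vault-Token": token},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return parse_creds(json.load(response))

# Illustrative response body (values are placeholders).
sample = {
    "lease_id": "database/creds/app-role/abc123",
    "lease_duration": 3600,
    "renewable": True,
    "data": {"username": "v-app-role-xyz", "password": "A1b2C3d4"},
}
creds = parse_creds(sample)
print(creds["username"], creds["ttl_seconds"])
```

Because the credentials expire after `ttl_seconds`, the app should fetch a fresh pair (or renew the lease) before that deadline rather than caching them indefinitely.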

🕸️ Service Mesh

🌊

Istio

Most feature-rich service mesh. mTLS between all pods, traffic shaping, circuit breaking, detailed telemetry. Complex to operate.

🔗

Linkerd

Simpler, lighter mesh. Auto mTLS, traffic metrics, reliability features. Its data-plane proxy is written in Rust, which keeps overhead very low.

🏛️

Consul

Service mesh + service discovery + key-value store by HashiCorp. Multi-platform (k8s + VMs).

⚠️
Service mesh has real complexity cost

Only adopt a service mesh when you actually need mTLS between services, fine-grained traffic control, or advanced observability. For many teams, Kubernetes Network Policies plus good logging cover most day-to-day needs at a fraction of the operational cost.

🔄 GitOps Tools

🔄

ArgoCD

Most popular GitOps tool. UI + CLI, multi-cluster, app health checks, rollback. Start here.

🌊

Flux

GitOps controller (CNCF project). No built-in UI; everything is configured through Kubernetes CRDs such as GitRepository and Kustomization.

Kustomize

Overlay-based k8s manifest customization. Built into kubectl. Patch base manifests per environment without Helm.
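Underneath, ArgoCD and Flux both run a reconcile loop: read the desired state from Git, compare it with the live cluster state, and create, update, or prune resources until the two match. The toy model below shows only the comparison step. Resource names and specs are hypothetical, and real controllers diff full Kubernetes objects rather than plain dictionaries.

```python
def plan_sync(desired: dict, live: dict) -> dict:
    """Return the action needed per resource to make live match desired."""
    actions = {}
    for name, spec in desired.items():
        if name not in live:
            actions[name] = "create"
        elif live[name] != spec:
            actions[name] = "update"
    for name in live:
        if name not in desired:
            actions[name] = "prune"  # exists in the cluster but not in Git
    return actions

# Hypothetical state: Git on one side, the cluster on the other.
desired = {
    "deployment/myapp": {"image": "myapp:abc123", "replicas": 3},
    "service/myapp": {"port": 8080},
}
live = {
    "deployment/myapp": {"image": "myapp:old", "replicas": 3},
    "configmap/legacy": {"key": "value"},
}

print(plan_sync(desired, live))
```

An empty result means the cluster is in sync, and anything else is drift that the controller either reports or corrects, depending on its sync policy.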

🗄️ Container Registries

Registry                   Type         Best For                                         Cost
────────                   ────         ────────                                         ────
Amazon ECR                 Managed      AWS workloads; best AWS integration              $0.10/GB/month storage
Docker Hub                 Hosted SaaS  Public images, open source                       Free (rate limited) / $7/mo Pro
GitHub Container Registry  Hosted SaaS  Code already on GitHub                           Free for public; storage fees for private
Harbor                     Self-hosted  On-prem, air-gapped, fine-grained RBAC           Free (you host it)
JFrog Artifactory          Hybrid       Enterprise; also stores npm, Maven, etc.         $$$
Nexus                      Self-hosted  Open source enterprise registry + artifact repo  Free community edition
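As a rough sizing aid, storage cost scales linearly with the volume of images you keep. The sketch below applies the ECR storage rate from the table ($0.10 per GB per month). The image count and average size are made-up inputs, and the estimate ignores data transfer charges and layer sharing between images.

```python
ECR_STORAGE_USD_PER_GB_MONTH = 0.10  # storage rate listed for Amazon ECR

def monthly_storage_cost(image_count: int, avg_image_gb: float,
                         rate: float = ECR_STORAGE_USD_PER_GB_MONTH) -> float:
    """Estimate monthly storage cost in USD, rounded to cents."""
    return round(image_count * avg_image_gb * rate, 2)

# Hypothetical repository: 200 retained images averaging 0.5 GB each.
print(monthly_storage_cost(200, 0.5))
```

A lifecycle policy that expires old tags is usually the simplest lever for keeping `image_count` down.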

✅ Tool Choices by Team Size

Category  Small Team (1–5)      Medium (5–20)               Enterprise (20+)
────────  ────────────────      ─────────────               ────────────────
CI/CD     GitHub Actions        GitHub Actions / GitLab CI  Jenkins + GitHub Actions
IaC       Terraform             Terraform + Terragrunt      Terraform + Atlantis
Registry  ECR / GHCR            ECR                         ECR / Harbor
GitOps    ArgoCD                ArgoCD                      ArgoCD / Flux
Metrics   Prometheus + Grafana  Prometheus + Grafana        Datadog / New Relic
Logs      Loki                  Loki / OpenSearch           Elasticsearch / Datadog
Secrets   AWS Secrets Manager   Vault / Secrets Manager     Vault Enterprise
Security  Trivy in CI           Trivy + Snyk                Snyk + Falco + Wiz