Top DevOps Tools — The Complete Ecosystem
Every production DevOps environment is built from a set of proven tools. This guide covers the essential ones in each category: what they do, when to use them, and how to get started.
🗺️ The DevOps Landscape
⚙️ CI/CD — Build & Deploy Pipelines
Jenkins
The OG of CI/CD. Self-hosted, 1800+ plugins, runs on any infra. Very flexible, higher ops overhead.
GitHub Actions
Native to GitHub. YAML workflows in .github/workflows/. Best choice if your code is on GitHub.
GitLab CI
Built into GitLab. Excellent for self-hosted environments. Pipelines defined in .gitlab-ci.yml.
CircleCI
Hosted CI with fast parallelism. Good Docker support, simple YAML config, great for open source.
GitHub Actions — Full Docker + k8s Pipeline
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  REGISTRY: 123456789012.dkr.ecr.us-east-1.amazonaws.com
  IMAGE: 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: docker compose run --rm app pytest
  build-push:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Login to ECR
        run: aws ecr get-login-password | docker login --username AWS --password-stdin ${{ env.REGISTRY }}
      - name: Build and push
        id: meta
        run: |
          TAG=${{ env.IMAGE }}:${{ github.sha }}
          docker build --cache-from ${{ env.IMAGE }}:latest -t $TAG -t ${{ env.IMAGE }}:latest .
          docker push $TAG
          docker push ${{ env.IMAGE }}:latest
          echo "tags=$TAG" >> $GITHUB_OUTPUT
  deploy:
    needs: build-push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: myorg/infra
          token: ${{ secrets.INFRA_TOKEN }}
      - name: Update image tag and push
        run: |
          sed -i "s|image: .*|image: ${{ needs.build-push.outputs.image-tag }}|" manifests/myapp/deployment.yaml
          git config user.email "ci@company.com" && git config user.name "CI"
          git commit -am "deploy: myapp ${{ github.sha }}" && git push
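The deploy step's `sed` expression rewrites every `image:` line in the manifest, including sidecars or init containers if the file has any. As a rough illustration of a narrower rewrite, the Python sketch below only touches lines that reference one repository. The function name `set_image` and the sample registry are hypothetical and not part of the pipeline above.

```python
import re


def set_image(manifest: str, repo: str, tag: str) -> str:
    """Point every `image:` line that references `repo` at `repo:tag`.

    Lines that reference other images are left untouched, unlike a
    blanket `s|image: .*|...|` substitution.
    """
    pattern = re.compile(
        r"^([ \t]*(?:-[ \t]+)?image:[ \t]*)"   # indentation, optional list dash, key
        + re.escape(repo)                      # the repository we want to retag
        + r"(?::[^\s@]+)?[ \t]*$",             # optional existing tag
        re.MULTILINE,
    )
    return pattern.sub(lambda m: f"{m.group(1)}{repo}:{tag}", manifest)


# Illustrative manifest fragment (made up for this sketch).
example = (
    "containers:\n"
    "  - name: app\n"
    "    image: registry.example.com/myapp:old\n"
    "  - name: sidecar\n"
    "    image: envoyproxy/envoy:v1.30\n"
)
print(set_image(example, "registry.example.com/myapp", "abc123"))
```

Whether a script like this is worth it over `sed` depends on how many images the manifest carries; for a single-container Deployment the original one-liner is enough.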
Jenkins — Declarative Pipeline
// Jenkinsfile
pipeline {
    agent any
    environment {
        REGISTRY = '123456789012.dkr.ecr.us-east-1.amazonaws.com'
        IMAGE = "${REGISTRY}/myapp"
    }
    stages {
        stage('Test') {
            steps {
                sh 'docker compose run --rm app pytest'
            }
        }
        stage('Build') {
            steps {
                sh """
                    aws ecr get-login-password --region us-east-1 | \
                    docker login --username AWS --password-stdin ${REGISTRY}
                    docker build -t ${IMAGE}:${BUILD_NUMBER} -t ${IMAGE}:latest .
                    docker push ${IMAGE}:${BUILD_NUMBER}
                    docker push ${IMAGE}:latest
                """
            }
        }
        stage('Deploy') {
            when { branch 'main' }
            steps {
                sh """
                    sed -i "s|image: .*|image: ${IMAGE}:${BUILD_NUMBER}|" manifests/myapp/deployment.yaml
                    git commit -am "deploy: build ${BUILD_NUMBER}" && git push
                """
            }
        }
    }
    post {
        failure { slackSend color: 'danger', message: "Build failed: ${BUILD_URL}" }
        success { slackSend color: 'good', message: "Deployed: ${IMAGE}:${BUILD_NUMBER}" }
    }
}
🏗️ Infrastructure as Code
Terraform
Define AWS/GCP/Azure infrastructure in HCL. Plan → Apply → Destroy. The industry standard IaC tool.
Pulumi
IaC using general-purpose programming languages (TypeScript, Python, Go). Loops, functions, and tests come from the language itself, which helps when infrastructure logic gets complex.
Ansible
Agentless config management. SSH-based, great for VM setup, package installs, and correcting configuration drift.
CloudFormation
AWS-native IaC in JSON/YAML. No extra tools needed. Deployments can be slower than with Terraform, but it integrates deeply with AWS services and the console.
Terraform — AWS ECS Service Example
# main.tf
# Excerpt: the task definition, security group, load balancer, target group,
# and var.private_subnet_ids are assumed to be defined elsewhere.
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
  }
}
provider "aws" { region = "us-east-1" }
resource "aws_ecs_cluster" "main" {
  name = "production"
  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}
resource "aws_ecs_service" "myapp" {
  name            = "myapp"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.myapp.arn
  desired_count   = 3
  launch_type     = "FARGATE"
  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.myapp.arn
    container_name   = "myapp"
    container_port   = 8080
  }
}
output "alb_dns" {
  value = aws_lb.main.dns_name
}
terraform init # download providers
terraform plan # preview changes
terraform apply # apply changes
terraform destroy # tear down everything
terraform output # show output values
terraform state list # list managed resources
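Conceptually, `terraform plan` compares the desired configuration with the recorded state and proposes create, update, or delete actions. The toy diff below illustrates that idea only. It is not Terraform's algorithm (there is no dependency graph and no provider refresh), and the attribute values and the `aws_lb.old` address in the usage lines are made up.

```python
def plan(desired: dict, state: dict) -> dict:
    """Return {resource_address: action} for everything that would change.

    desired: resources declared in configuration, keyed by address
    state:   resources recorded in the state file, keyed by address
    Resources that are identical in both are omitted (no-op).
    """
    actions = {}
    for address, attrs in desired.items():
        if address not in state:
            actions[address] = "create"
        elif state[address] != attrs:
            actions[address] = "update"
    for address in state:
        if address not in desired:
            actions[address] = "delete"
    return actions


desired = {
    "aws_ecs_cluster.main": {"name": "production"},
    "aws_ecs_service.myapp": {"desired_count": 3},
}
state = {
    "aws_ecs_service.myapp": {"desired_count": 2},
    "aws_lb.old": {"name": "legacy"},  # hypothetical leftover resource
}
for address, action in sorted(plan(desired, state).items()):
    print(f"{action:>6}  {address}")
```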
⛵ Helm — Kubernetes Package Manager
Helm packages Kubernetes manifests into reusable, versioned "charts". Install complex apps with a single command.
# Add repositories
# (The legacy "stable" repo at https://charts.helm.sh/stable is archived and no longer updated.)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Search for charts
helm search repo postgres
# Install a chart
helm install my-postgres bitnami/postgresql \
    --namespace db --create-namespace \
    --set auth.postgresPassword=secret \
    --set primary.persistence.size=20Gi
# Install with values file
helm install monitoring prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    -f values.yaml
# Upgrade release (same namespace as the install; keep previously set values)
helm upgrade my-postgres bitnami/postgresql -n db --reuse-values --set image.tag=16.0.0
# Roll back to revision 1
helm rollback my-postgres 1 -n db
# List releases in all namespaces
helm list -A
# Uninstall
helm uninstall my-postgres -n db
# Create your own chart
helm create myapp
helm package myapp/
helm install myapp ./myapp-0.1.0.tgz
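The commands above mix chart defaults, `-f values.yaml`, and `--set` flags. Helm layers them in that order, with `--set` winning. The sketch below models that precedence in Python under simplifying assumptions: `--set` values are treated as plain strings, and list syntax, escaping, and type coercion are ignored. The function names and the default values are illustrative, not Helm's internals or the Bitnami chart's real defaults.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override wins."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def parse_set(expr: str) -> dict:
    """Turn 'a.b.c=value' into {'a': {'b': {'c': 'value'}}} (strings only)."""
    path, _, value = expr.partition("=")
    nested = value
    for key in reversed(path.split(".")):
        nested = {key: nested}
    return nested


def effective_values(chart_defaults, values_files=(), set_flags=()):
    """Apply layers in precedence order: defaults < -f files < --set."""
    merged = copy.deepcopy(chart_defaults)
    for layer in values_files:
        merged = deep_merge(merged, layer)
    for expr in set_flags:
        merged = deep_merge(merged, parse_set(expr))
    return merged


# Made-up defaults and overrides, for illustration only.
defaults = {"primary": {"persistence": {"enabled": True, "size": "8Gi"}}}
from_file = {"primary": {"persistence": {"size": "10Gi"}}}
print(effective_values(defaults, [from_file], ["primary.persistence.size=20Gi"]))
```

On a real release, `helm get values my-postgres -n db` shows the user-supplied layer that Helm actually stored.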
📊 Monitoring & Observability
Prometheus
Time-series metrics database. Pull model — scrapes /metrics endpoints. Powers most k8s monitoring.
Grafana
Dashboard and alerting. Connects to Prometheus, Loki, Tempo, CloudWatch, and 50+ other data sources.
Datadog
All-in-one SaaS monitoring — metrics, logs, traces, APM, RUM. Expensive but zero infrastructure overhead.
Alertmanager
Handles Prometheus alerts: routes them to Slack/PagerDuty, deduplicates, and silences. Part of the Prometheus stack.
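To make the pull model concrete: a target serves plain-text samples at `/metrics`, and the scraper fetches and parses them on a schedule. The sketch below renders and parses a simplified form of the Prometheus text exposition format. It skips `# HELP`/`# TYPE` handling, timestamps, and escaped label values, and the metric names are only examples.

```python
import re


def render_metrics(samples):
    """Render (name, labels, value) tuples as a /metrics response body."""
    lines = []
    for name, labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def scrape(body):
    """Parse a /metrics body into {(name, frozenset(label pairs)): float}."""
    parsed = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments carry HELP/TYPE metadata; ignored here
        match = _SAMPLE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = frozenset(_LABEL.findall(raw_labels or ""))
        parsed[(name, labels)] = float(value)
    return parsed


body = render_metrics([
    ("http_requests_total", {"method": "GET", "code": "200"}, 1027),
    ("process_open_fds", {}, 12),
])
print(body, end="")
print(scrape(body))
```

In practice an official client library generates this output for you; the point here is only that a scrape is an ordinary HTTP GET returning text.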
Alertmanager Config
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'slack-critical'
  routes:
    - match: { severity: critical }
      receiver: pagerduty
    - match: { severity: warning }
      receiver: slack-critical
receivers:
  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'your-pagerduty-key'
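The routing tree in this config can be read as: try each child route in order, take the first whose `match` labels all equal the alert's labels, and otherwise fall back to the top-level receiver. The Python sketch below models just that rule plus the `group_by` key. It ignores `continue`, regex matchers, and timing settings, and the function names are illustrative.

```python
def pick_receiver(route: dict, labels: dict) -> str:
    """Return the receiver name for an alert with the given labels.

    The first matching child route wins; a child without its own
    receiver inherits the parent's.
    """
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(k) == v for k, v in wanted.items()):
            inherited = {"receiver": route["receiver"], **child}
            return pick_receiver(inherited, labels)
    return route["receiver"]


def group_key(labels: dict, group_by: list) -> tuple:
    """Alerts sharing this key are batched into one notification."""
    return tuple((name, labels.get(name, "")) for name in group_by)


# Mirrors the `route:` block of the config above.
ROUTE = {
    "receiver": "slack-critical",
    "group_by": ["alertname", "namespace"],
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "slack-critical"},
    ],
}

alert = {"alertname": "HighLatency", "namespace": "myapp", "severity": "critical"}
print(pick_receiver(ROUTE, alert), group_key(alert, ROUTE["group_by"]))
```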
📋 Logging — Loki & ELK Stack
Grafana Loki
Like Prometheus but for logs. Indexes only labels (not content) — very cheap storage. Query with LogQL.
Elasticsearch
Full-text search + log storage. More powerful queries than Loki. Higher cost. Part of ELK/OpenSearch stack.
Fluentd / Promtail
Log collectors. Fluentd → Elasticsearch. Promtail → Loki. Runs as DaemonSet in k8s to collect all pod logs.
Kibana
Dashboard for Elasticsearch. Like Grafana but for text logs. Comes with ELK stack.
LogQL — Querying Loki
# Stream all logs from a namespace
{namespace="myapp"}
# Filter by container and search for error
{namespace="myapp", container="api"} |= "ERROR"
# Regex filter
{app="nginx"} |~ "POST .* 5[0-9][0-9]"
# Parse JSON logs and filter by field
{app="api"} | json | level="error" | response_time > 1000
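# Count errors per minute, per pod (rate() would give a per-second average instead)
sum by (pod) (count_over_time({namespace="myapp"} |= "ERROR" [1m]))
# Show last 100 lines (the limit is a query option, not a LogQL stage)
logcli query --limit=100 '{namespace="myapp"}'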
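Two ideas from this section are easy to mix up: Loki selects streams by label and only then scans line content, and `rate()` is a per-second value while `count_over_time()` is a raw count for the window. The sketch below imitates both on in-memory data. It is a simplified model with made-up timestamps and log lines, not how Loki stores or evaluates anything.

```python
def select(streams, **matchers):
    """Return entries from streams whose labels equal every matcher.

    streams: list of (labels_dict, [(timestamp_seconds, line), ...])
    Only labels are consulted here, mirroring a label-only index.
    """
    entries = []
    for labels, lines in streams:
        if all(labels.get(k) == v for k, v in matchers.items()):
            entries.extend(lines)
    return entries


def count_over_time(entries, needle, now, window_s):
    """Count lines containing `needle` within the last `window_s` seconds."""
    return sum(
        1 for ts, line in entries
        if now - window_s < ts <= now and needle in line
    )


def rate(entries, needle, now, window_s):
    """Per-second average over the window: count divided by window length."""
    return count_over_time(entries, needle, now, window_s) / window_s


streams = [
    ({"namespace": "myapp", "pod": "api-1"},
     [(110, "ERROR db timeout"), (130, "INFO ok"), (150, "ERROR db timeout")]),
    ({"namespace": "other", "pod": "web-1"},
     [(140, "ERROR unrelated")]),
]
entries = select(streams, namespace="myapp")
print(count_over_time(entries, "ERROR", now=160, window_s=60))  # errors in the last minute
print(rate(entries, "ERROR", now=160, window_s=60))             # errors per second
```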
🔍 Distributed Tracing
Jaeger
Open-source distributed tracing (by Uber). Trace requests across microservices. Integrates with OpenTelemetry.
Grafana Tempo
Scalable trace storage by Grafana Labs. Pairs with Loki + Prometheus for full observability in Grafana.
OpenTelemetry
Vendor-neutral standard for metrics, logs, and traces. Instrument once, export to Jaeger/Tempo/Datadog/etc.
Zipkin
Original open-source tracing (by Twitter). Simpler than Jaeger. Good for small deployments.
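What lets these tools stitch one request together across services is context propagation: each hop forwards a trace ID and replaces the span ID with its own. OpenTelemetry's default propagator uses the W3C `traceparent` header for this. The sketch below builds and forwards such a header by hand to show its shape; real services would let an SDK do this, and the helper names are illustrative.

```python
import re
import secrets

# version - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes  -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Header to send downstream: same trace ID, fresh span ID."""
    match = TRACEPARENT.match(incoming)
    if match is None:
        raise ValueError(f"malformed traceparent: {incoming!r}")
    version, trace_id, _parent_span_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print("received:", incoming)
print("forwards:", outgoing)
```

Because the trace ID survives every hop, a backend such as Jaeger or Tempo can reassemble the spans into a single request timeline.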
🔒 Security Tools
HashiCorp Vault
Secrets management platform. Stores, rotates, and leases secrets. Dynamic DB credentials, PKI, encryption-as-a-service.
Trivy
Fast vulnerability scanner for Docker images, filesystems, and k8s. Run in CI to catch CVEs before deploy.
Falco
Runtime security for Kubernetes. Detects unexpected system calls inside containers (shell in pod, file write, etc.).
Snyk
SaaS security scanning — code, dependencies, Docker images, k8s manifests. Integrates with GitHub PRs.
Trivy — Scan Images in CI
# Install
curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh
# Scan a Docker image
trivy image nginx:latest
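# Scan and fail build on HIGH/CRITICAL CVEs
trivy image \
    --severity HIGH,CRITICAL \
    --exit-code 1 \
    myapp:latest
# Scan a project directory (dependency manifests and lockfiles)
trivy fs .
# Scan Kubernetes manifests and other IaC files for misconfigurations
trivy config manifests/
# Scan a running Kubernetes cluster (uses the current kubeconfig context)
trivy k8s --report summary cluster
# Generate an SBOM (CycloneDX JSON)
trivy image --format cyclonedx -o sbom.json myapp:latest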
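When `--exit-code 1` is too blunt (for example, you want to list what blocked the build), the report from `trivy image --format json` can be post-processed. The sketch below assumes the report's `Results[].Vulnerabilities[]` layout with `VulnerabilityID` and `Severity` fields. The sample report is invented and uses placeholder IDs.

```python
def blocking_findings(report: dict, severities=("HIGH", "CRITICAL")) -> list:
    """Return (target, vulnerability_id, severity) for findings that should fail CI."""
    findings = []
    for result in report.get("Results") or []:
        # Targets with no findings may omit the key or carry null.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities:
                findings.append(
                    (result.get("Target"), vuln.get("VulnerabilityID"), vuln.get("Severity"))
                )
    return findings


# Invented report for illustration; in CI you would json.load() the file
# written by `trivy image --format json -o report.json myapp:latest`.
sample_report = {
    "Results": [
        {
            "Target": "myapp:latest",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
            ],
        },
        {"Target": "app/requirements.txt", "Vulnerabilities": None},
    ]
}

found = blocking_findings(sample_report)
for target, vuln_id, severity in found:
    print(f"{severity:<8} {vuln_id}  ({target})")
exit_code = 1 if found else 0  # a CI wrapper would pass this to sys.exit()
print("exit code:", exit_code)
```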
Vault — Dynamic DB Credentials
# Start Vault (dev mode: in-memory and unsealed, not for production)
vault server -dev -dev-root-token-id="root"
# In a second terminal, point the CLI at the dev server
export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_TOKEN='root'
# Enable database secrets engine
vault secrets enable database
# Configure PostgreSQL connection
vault write database/config/mydb \
    plugin_name=postgresql-database-plugin \
    connection_url="postgresql://{{username}}:{{password}}@db:5432/myapp" \
    allowed_roles="app-role" \
    username="vault" \
    password="vaultpass"
# Create role with TTL
vault write database/roles/app-role \
    db_name=mydb \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT,INSERT,UPDATE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
    default_ttl="1h" \
    max_ttl="24h"
# Get temp credentials (your app calls this)
vault read database/creds/app-role
# Returns: username=v-app-role-xyz password=A1b2C3d4 lease_duration=1h
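Because these credentials expire, the application has to track the lease rather than cache the password forever. The sketch below parses a response shaped like `vault read -format=json database/creds/app-role` (fields `lease_id`, `lease_duration`, and `data`) and derives when to renew and when the lease ends. Renewing halfway through the TTL is an arbitrary choice for illustration, and the sample values, including the lease ID, are placeholders.

```python
import json
from datetime import datetime, timedelta, timezone


def lease_schedule(creds_json: str, issued_at: datetime) -> dict:
    """Work out renewal and expiry times for a dynamic credential lease."""
    doc = json.loads(creds_json)
    ttl_seconds = doc["lease_duration"]  # seconds, per the JSON output
    return {
        "username": doc["data"]["username"],
        "lease_id": doc["lease_id"],
        # Renew at the halfway point (an assumption, tune to taste).
        "renew_at": issued_at + timedelta(seconds=ttl_seconds // 2),
        "expires_at": issued_at + timedelta(seconds=ttl_seconds),
    }


# Placeholder response for illustration only.
SAMPLE = json.dumps({
    "lease_id": "database/creds/app-role/EXAMPLE",
    "lease_duration": 3600,
    "renewable": True,
    "data": {"username": "v-app-role-xyz", "password": "A1b2C3d4"},
})

schedule = lease_schedule(SAMPLE, issued_at=datetime.now(timezone.utc))
print("renew at:  ", schedule["renew_at"].isoformat())
print("expires at:", schedule["expires_at"].isoformat())
```

Renewal itself goes through `vault lease renew <lease_id>`, and the role's `max_ttl` caps how long a lease can be extended before new credentials must be requested.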
🕸️ Service Mesh
Istio
Most feature-rich service mesh. mTLS between all pods, traffic shaping, circuit breaking, detailed telemetry. Complex to operate.
Linkerd
Simpler, lighter mesh. Auto mTLS, traffic metrics, reliability features. Its data-plane proxy is written in Rust, so per-pod overhead is very low.
Consul
Service mesh + service discovery + key-value store by HashiCorp. Multi-platform (k8s + VMs).
Only adopt a service mesh when you actually need mTLS between services, fine-grained traffic control, or advanced observability. For most teams, Kubernetes Network Policies + good logging covers 80% of the use cases at a fraction of the operational cost.
🔄 GitOps Tools
ArgoCD
Most popular GitOps tool. UI + CLI, multi-cluster, app health checks, rollback. Start here.
Flux
GitOps controller (CNCF project). More Kubernetes-native than ArgoCD. Built around GitRepository + Kustomization CRDs.
Kustomize
Overlay-based k8s manifest customization. Built into kubectl. Patch base manifests per environment without Helm.
🗄️ Container Registries
| Registry | Type | Best For | Cost |
|---|---|---|---|
| Amazon ECR | Managed | AWS workloads — best AWS integration | $0.10/GB/month storage |
| Docker Hub | Hosted SaaS | Public images, open source | Free (rate limited) / $7/mo Pro |
| GitHub Container Registry | Hosted SaaS | Code already on GitHub | Free for public; storage fees for private |
| Harbor | Self-hosted | On-prem, air-gapped, fine-grained RBAC | Free (you host it) |
| JFrog Artifactory | Hybrid | Enterprise — also stores npm, maven, etc. | $$$ |
| Nexus | Self-hosted | Open source enterprise registry + artifact repo | Free community edition |
✅ Tool Choices by Team Size
| Category | Small Team (1–5) | Medium (6–20) | Enterprise (21+) |
|---|---|---|---|
| CI/CD | GitHub Actions | GitHub Actions / GitLab CI | Jenkins + GitHub Actions |
| IaC | Terraform | Terraform + Terragrunt | Terraform + Atlantis |
| Registry | ECR / GHCR | ECR | ECR / Harbor |
| GitOps | ArgoCD | ArgoCD | ArgoCD / Flux |
| Metrics | Prometheus + Grafana | Prometheus + Grafana | Datadog / New Relic |
| Logs | Loki | Loki / OpenSearch | Elasticsearch / Datadog |
| Secrets | AWS Secrets Manager | Vault / Secrets Manager | Vault Enterprise |
| Security | Trivy in CI | Trivy + Snyk | Snyk + Falco + Wiz |