Building SRE-Grade Observability: Deploying Grafana and Prometheus on AKS
As an SRE, observability isn't just about collecting metrics. It's about building systems that help you understand user experience, detect problems before users do, and make data-driven decisions about reliability. In this guide, I'll show you how to deploy a production-ready monitoring stack on AKS that goes beyond basic infrastructure monitoring to support true SRE practices.
Prerequisites
Before we begin, ensure you have the following:
- An existing AKS cluster
- Terraform installed (version 1.0.0+)
- Azure CLI configured with appropriate permissions
- Helm installed (for manual verification if needed)
Project Structure
Organize your Terraform project as follows:
aks-monitoring/
├── main.tf
├── variables.tf
├── outputs.tf
└── terraform.tfvars
Provider Configuration
In main.tf, configure the Terraform provider for Azure:
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}
Deploying Prometheus and Grafana via Helm
Since the AKS cluster already exists, look it up with a data source, then point the Helm provider at its credentials before defining the releases.
data "azurerm_kubernetes_cluster" "aks" {
  name                = var.aks_cluster_name
  resource_group_name = var.resource_group_name
}

provider "helm" {
  kubernetes {
    host                   = data.azurerm_kubernetes_cluster.aks.kube_config[0].host
    client_certificate     = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_certificate)
    client_key             = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_key)
    cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].cluster_ca_certificate)
  }
}
resource "helm_release" "prometheus" {
name = "prometheus"
repository = "https://prometheus-community.github.io/helm-charts"
chart = "prometheus"
namespace = "monitoring"
create_namespace = true
values = [<<EOF
alertmanager:
enabled: true
server:
persistentVolume:
enabled: true
EOF
]
}
resource "helm_release" "grafana" {
name = "grafana"
repository = "https://grafana.github.io/helm-charts"
chart = "grafana"
namespace = "monitoring"
depends_on = [helm_release.prometheus]
values = [<<EOF
persistence:
enabled: true
adminPassword: "SuperSecurePassword"
service:
type: LoadBalancer
EOF
]
}
Variables Configuration
Define variables for customization in variables.tf:
variable "aks_cluster_name" {
type = string
description = "Name of the existing AKS cluster"
}
variable "resource_group_name" {
type = string
description = "Resource group where the AKS cluster is deployed"
}
Output Configuration
Expose the Grafana in-cluster service URL and admin credentials in outputs.tf. Because the service is of type LoadBalancer, the externally reachable address is only known after apply (see the verification step below):
output "grafana_dashboard_url" {
  value = "http://${helm_release.grafana.name}.monitoring.svc.cluster.local"
}

output "grafana_admin_password" {
  value     = var.grafana_admin_password
  sensitive = true
}
Deploying the Monitoring Stack
- Initialize Terraform:
terraform init
- Configure terraform.tfvars:
aks_cluster_name       = "my-aks-cluster"
resource_group_name    = "my-resource-group"
grafana_admin_password = "change-me"
- Plan and apply the configuration:
terraform plan
terraform apply
- Verify the installation:
kubectl get pods -n monitoring
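- Access Grafana. A minimal sketch, assuming the release name "grafana" and the "monitoring" namespace used above (the service name follows the chart's default naming):
kubectl get svc grafana -n monitoring -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
terraform output grafana_admin_password
Browse to http://<EXTERNAL-IP> and log in as admin with the password from the output above.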
SRE Best Practices for Your Monitoring Stack
Essential Dashboards for SRE
Create these dashboards to support SRE practices (sample PromQL for the Golden Signals panels follows the lists):
Service Level Indicators Dashboard
- Request success rate (availability SLI)
- Response time percentiles (latency SLI)
- Error budget burn rate
- Time to detection for critical alerts
Golden Signals Dashboard
- Latency, Traffic, Errors, Saturation
- Per-service breakdown
- Historical trends for capacity planning
Incident Response Dashboard
- Current system status across all services
- Recent deployments and changes
- Dependency status
- Runbook links
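As a sketch of what the Golden Signals panels above might query, here are example PromQL expressions. They assume services instrumented with the common http_requests_total and http_request_duration_seconds_bucket metrics (these names are placeholders; adjust them to your instrumentation) and node-exporter for saturation, which the prometheus chart installs by default:
# Traffic: requests per second, per service
sum(rate(http_requests_total[5m])) by (service)

# Errors: fraction of 5xx responses, per service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service)

# Latency: 95th-percentile request duration, per service
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Saturation: per-node CPU utilization from node-exporter
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)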
Alerting Strategy
Configure alerts that matter:
# SLO-based alerting
- alert: ErrorBudgetBurnRate
  expr: (1 - sli_success_rate) > (error_budget * 6)  # 6x burn rate
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate for {{ $labels.service }}"

- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.2
  for: 2m
  labels:
    severity: warning
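Note that sli_success_rate and error_budget above are not built-in metrics; Prometheus only knows about them once you define them. A minimal recording-rule sketch for the success-rate SLI, assuming http_requests_total carries a status label (the error budget itself, e.g. 0.001 for a 99.9% availability SLO, can be inlined as a constant in the alert expression):
groups:
  - name: sli-recording-rules
    rules:
      # Fraction of non-5xx requests over the last 5 minutes, per service
      - record: sli_success_rate
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)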
Retention and Storage Strategy
For production SRE monitoring, tier your retention (a sample chart-values sketch follows the list):
- Short-term (15 days): High-resolution data for incident response
- Medium-term (90 days): Aggregated data for trend analysis
- Long-term (1+ year): Summary metrics for capacity planning
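The short-term tier maps directly to chart values. A sketch assuming the prometheus-community chart used above, where server.retention controls how long the local TSDB keeps data (exact value paths can vary between chart versions, so check helm show values); the medium- and long-term tiers typically require remote_write to external storage such as Thanos, Mimir, or Azure Monitor managed Prometheus, which is outside the scope of this snippet:
server:
  retention: "15d"        # high-resolution local data for incident response
  persistentVolume:
    enabled: true
    size: 50Gi            # size for the retention window and expected ingest rate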
Operational Excellence
Monitoring the Monitoring
Your observability stack needs observability too (a sample alert sketch follows the list):
- Monitor Prometheus scrape success rates
- Alert on Grafana dashboard load times
- Track alert manager notification success
- Monitor storage usage and retention
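A couple of self-monitoring alerts as a sketch: up is a standard Prometheus metric for scrape health, and alertmanager_notifications_failed_total is exposed by Alertmanager itself; treat the thresholds and durations as starting points:
- alert: TargetScrapeFailing
  expr: up == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus cannot scrape {{ $labels.job }}/{{ $labels.instance }}"

- alert: AlertmanagerNotificationsFailing
  expr: rate(alertmanager_notifications_failed_total[5m]) > 0
  for: 10m
  labels:
    severity: critical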
Disaster Recovery
Ensure your monitoring survives outages (a sample HA values sketch follows the list):
- Deploy Prometheus in HA mode with multiple replicas
- Use persistent volumes with backup strategies
- Document recovery procedures
- Test restore processes regularly
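For the HA point, a hedged values sketch: replica counts and stateful-set settings exist in the prometheus-community chart, but exact value paths differ between chart versions, so verify against helm show values before applying:
server:
  replicaCount: 2        # independent replicas scraping the same targets
  statefulSet:
    enabled: true        # each replica gets its own persistent volume
alertmanager:
  enabled: true
  replicaCount: 2        # Alertmanager deduplicates notifications across replicas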
Conclusion
This monitoring stack provides the foundation for SRE practices like SLI/SLO management, error budget tracking, and data-driven incident response. Remember: the goal isn't just to collect metrics, but to gain insights that improve system reliability and user experience.
The real value comes from using this data to make informed decisions about where to invest engineering effort, when to slow down deployments, and how to continuously improve system reliability.