Building SRE-Grade Observability: Deploying Grafana and Prometheus on AKS

Published: Dec 28, 2024 by Joe Hernandez
Tags: SRE, Azure, Terraform, Kubernetes, AKS, Monitoring, Grafana, Prometheus, Observability

As an SRE, observability isn't just about collecting metrics. It's about building systems that help you understand user experience, detect problems before users do, and make data-driven decisions about reliability. In this guide, I'll show you how to deploy a production-ready monitoring stack on AKS that goes beyond basic infrastructure monitoring to support true SRE practices.

Prerequisites

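Before we begin, ensure you have an existing AKS cluster (and know its resource group), Terraform installed and authenticated against Azure, and kubectl configured for the cluster.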

Project Structure

Organize your Terraform project as follows:

aks-monitoring/
├── main.tf
├── variables.tf
├── outputs.tf
└── terraform.tfvars

Provider Configuration

In main.tf, configure the Terraform provider for Azure:

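terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    # The nested kubernetes { } block used below is Helm provider 2.x syntax
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.0"
    }
  }
}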

provider "azurerm" {
  features {}
}

Deploying Prometheus and Grafana via Helm

Add the Helm provider and configure the deployments.

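# Look up the existing AKS cluster named in variables.tf
data "azurerm_kubernetes_cluster" "aks" {
  name                = var.aks_cluster_name
  resource_group_name = var.resource_group_name
}

provider "helm" {
  kubernetes {
    host                   = data.azurerm_kubernetes_cluster.aks.kube_config[0].host
    client_certificate     = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_certificate)
    client_key             = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_key)
    cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].cluster_ca_certificate)
  }
}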

resource "helm_release" "prometheus" {
  name             = "prometheus"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "prometheus"
  namespace        = "monitoring"
  create_namespace = true

  # <<- strips the shared leading indentation from the YAML
  values = [<<-EOF
    alertmanager:
      enabled: true
    server:
      persistentVolume:
        enabled: true
  EOF
  ]
}

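resource "helm_release" "grafana" {
  name       = "grafana"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "grafana"
  namespace  = "monitoring"
  depends_on = [helm_release.prometheus]

  # LoadBalancer gives Grafana an externally reachable address; restrict access before production use
  values = [<<-EOF
    persistence:
      enabled: true
    service:
      type: LoadBalancer
  EOF
  ]

  # Pass the admin password from a sensitive variable instead of hardcoding it
  set_sensitive {
    name  = "adminPassword"
    value = var.grafana_admin_password
  }
}

Variables Configuration

Define variables for customization in variables.tf:

variable "aks_cluster_name" {
  type        = string
  description = "Name of the existing AKS cluster"
}

variable "resource_group_name" {
  type        = string
  description = "Resource group where the AKS cluster is deployed"
}

# Set via the TF_VAR_grafana_admin_password environment variable or a git-ignored tfvars file
variable "grafana_admin_password" {
  type        = string
  description = "Admin password for Grafana"
  sensitive   = true
}

Output Configuration

Expose the Grafana service URL and credentials in outputs.tf. Note that the URL below is a cluster-internal DNS name, so it resolves only from inside the cluster:

output "grafana_dashboard_url" {
  value = "http://${helm_release.grafana.name}.monitoring.svc.cluster.local"
}

output "grafana_admin_password" {
  value     = var.grafana_admin_password
  sensitive = true
}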
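
The Grafana release as configured has no data source, so dashboards would have nothing to query until one is added by hand. A minimal sketch of provisioning Prometheus through the Grafana chart's datasources value follows. The local name grafana_datasource_values is hypothetical, and the URL assumes the Prometheus chart's default server Service name (prometheus-server) for a release named prometheus; confirm it with kubectl get svc -n monitoring before relying on it.

```hcl
locals {
  # Hypothetical local; append it to the values list of helm_release.grafana
  grafana_datasource_values = yamlencode({
    datasources = {
      "datasources.yaml" = {
        apiVersion = 1
        datasources = [{
          name      = "Prometheus"
          type      = "prometheus"
          access    = "proxy"
          isDefault = true
          # Assumed in-cluster address of the Prometheus server Service
          url = "http://prometheus-server.monitoring.svc.cluster.local"
        }]
      }
    }
  })
}
```

To use it, add local.grafana_datasource_values as a second entry in the Grafana release's values list. Helm merges the entries in order, so the existing persistence and service settings are kept.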

Deploying the Monitoring Stack

  1. Initialize Terraform:
terraform init
  2. Configure terraform.tfvars:
aks_cluster_name    = "my-aks-cluster"
resource_group_name = "my-resource-group"
  3. Plan and apply the configuration:
terraform plan
terraform apply
  4. Verify the installation:
kubectl get pods -n monitoring

SRE Best Practices for Your Monitoring Stack

Essential Dashboards for SRE

Create these dashboards to support SRE practices:

Service Level Indicators Dashboard

Golden Signals Dashboard

Incident Response Dashboard

Alerting Strategy

Configure alerts that matter:

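# SLO-based alerting
# sli_success_rate and error_budget are placeholders: substitute your own
# recording rule and budget (for example 0.001 for a 99.9% SLO)
- alert: ErrorBudgetBurnRate
  expr: (1 - sli_success_rate) > (error_budget * 6)  # 6x burn rate
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate for {{ $labels.service }}"

# p95 latency above 200 ms, aggregated across instances per service
- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.2
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "p95 latency above 200 ms for {{ $labels.service }}"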
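
Rules like these only take effect once Prometheus loads them. One way to do that with the Helm release from earlier is the Prometheus chart's serverFiles value, sketched below for the HighLatency rule (written here with an explicit sum by (le, service) aggregation). The local name prometheus_alert_values and the group name slo-alerts are hypothetical, and the metric and label names are the same assumptions the rules above make.

```hcl
locals {
  # Hypothetical local; append it to the values list of helm_release.prometheus
  prometheus_alert_values = yamlencode({
    serverFiles = {
      "alerting_rules.yml" = {
        groups = [{
          name = "slo-alerts" # illustrative group name
          rules = [{
            alert = "HighLatency"
            expr  = "histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.2"
            # Quoted because for is also an HCL keyword
            "for"  = "2m"
            labels = { severity = "warning" }
            annotations = {
              summary = "p95 latency above 200 ms for {{ $labels.service }}"
            }
          }]
        }]
      }
    }
  })
}
```

Adding local.prometheus_alert_values to the Prometheus release's values list keeps the rules in version control alongside the rest of the stack, so an alert change goes through the same terraform plan review as any other change.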

Retention and Storage Strategy

For production SRE monitoring:
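
As one possible starting point, retention and volume size can be set through the Prometheus chart's server values. The sketch below is illustrative only: the 30d retention and 50Gi volume are assumed numbers rather than recommendations from this guide, and the local name prometheus_storage_values is hypothetical.

```hcl
locals {
  # Hypothetical local; append it to the values list of helm_release.prometheus
  prometheus_storage_values = yamlencode({
    server = {
      retention = "30d" # assumed value; pick it from how far back your SLO windows look
      persistentVolume = {
        enabled = true
        size    = "50Gi" # assumed value; size it from your own ingestion rate
      }
    }
  })
}
```

Keeping these settings in a separate values entry leaves the base release untouched and makes a later retention change easy to spot in a plan.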

Operational Excellence

Monitoring the Monitoring

Your observability stack needs observability too:

Disaster Recovery

Ensure your monitoring survives outages:

Conclusion

This monitoring stack provides the foundation for SRE practices like SLI/SLO management, error budget tracking, and data-driven incident response. Remember: the goal isn't just to collect metrics, but to gain insights that improve system reliability and user experience.

The real value comes from using this data to make informed decisions about where to invest engineering effort, when to slow down deployments, and how to continuously improve system reliability.
