Building SRE-Grade Observability: Deploying Grafana and Prometheus on AKS
As an SRE, observability isn't just about collecting metrics. It's about building systems that help you understand user experience, detect problems before users do, and make data-driven decisions about reliability. In this guide, I'll show you how to deploy a production-ready monitoring stack on AKS that goes beyond basic infrastructure monitoring to support true SRE practices.
Prerequisites
Before we begin, ensure you have the following:
- An existing AKS cluster
- Terraform installed (version 1.0.0+)
- Azure CLI configured with appropriate permissions
- Helm installed (for manual verification if needed)
Project Structure
Organize your Terraform project as follows:
aks-monitoring/
├── main.tf
├── variables.tf
├── outputs.tf
└── terraform.tfvars
Provider Configuration
In main.tf, configure the Terraform provider for Azure:
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}
Deploying Prometheus and Grafana via Helm
Since the AKS cluster already exists, look it up with a data source, then point the Helm provider at its credentials before defining the releases.
data "azurerm_kubernetes_cluster" "aks" {
  name                = var.aks_cluster_name
  resource_group_name = var.resource_group_name
}

provider "helm" {
  kubernetes {
    host                   = data.azurerm_kubernetes_cluster.aks.kube_config[0].host
    client_certificate     = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_certificate)
    client_key             = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].client_key)
    cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.aks.kube_config[0].cluster_ca_certificate)
  }
}
resource "helm_release" "prometheus" {
name = "prometheus"
repository = "https://prometheus-community.github.io/helm-charts"
chart = "prometheus"
namespace = "monitoring"
create_namespace = true
values = [<<EOF
alertmanager:
enabled: true
server:
persistentVolume:
enabled: true
EOF
]
}
resource "helm_release" "grafana" {
name = "grafana"
repository = "https://grafana.github.io/helm-charts"
chart = "grafana"
namespace = "monitoring"
depends_on = [helm_release.prometheus]
values = [<<EOF
persistence:
enabled: true
adminPassword: "SuperSecurePassword"
service:
type: LoadBalancer
EOF
]
}
Variables Configuration
Define variables for customization in variables.tf:
variable "aks_cluster_name" {
type = string
description = "Name of the existing AKS cluster"
}
variable "resource_group_name" {
type = string
description = "Resource group where the AKS cluster is deployed"
}
Output Configuration
Expose the Grafana in-cluster service URL and admin credentials in outputs.tf. Because the service is of type LoadBalancer, the externally reachable address is only known after apply (see the verification step below):
output "grafana_dashboard_url" {
  value = "http://${helm_release.grafana.name}.monitoring.svc.cluster.local"
}

output "grafana_admin_password" {
  value     = var.grafana_admin_password
  sensitive = true
}
Deploying the Monitoring Stack
- Initialize Terraform:
terraform init
- Configure terraform.tfvars:
aks_cluster_name       = "my-aks-cluster"
resource_group_name    = "my-resource-group"
grafana_admin_password = "change-me"
- Plan and apply the configuration:
terraform plan
terraform apply
- Verify the installation:
kubectl get pods -n monitoring
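- Access Grafana. A minimal sketch, assuming the release name "grafana" and the "monitoring" namespace used above (the service name follows the chart's default naming):
kubectl get svc grafana -n monitoring -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
terraform output grafana_admin_password
Browse to http://<EXTERNAL-IP> and log in as admin with the password from the output above.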
SRE Best Practices for Your Monitoring Stack
Essential Dashboards for SRE
Create these dashboards to support SRE practices (sample PromQL for the Golden Signals panels follows the lists):
Service Level Indicators Dashboard
- Request success rate (availability SLI)
- Response time percentiles (latency SLI)
- Error budget burn rate
- Time to detection for critical alerts
Golden Signals Dashboard
- Latency, Traffic, Errors, Saturation
- Per-service breakdown
- Historical trends for capacity planning
Incident Response Dashboard
- Current system status across all services
- Recent deployments and changes
- Dependency status
- Runbook links
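As a sketch of what the Golden Signals panels above might query, here are example PromQL expressions. They assume services instrumented with the common http_requests_total and http_request_duration_seconds_bucket metrics (these names are placeholders; adjust them to your instrumentation) and node-exporter for saturation, which the prometheus chart installs by default:
# Traffic: requests per second, per service
sum(rate(http_requests_total[5m])) by (service)

# Errors: fraction of 5xx responses, per service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service)

# Latency: 95th-percentile request duration, per service
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Saturation: per-node CPU utilization from node-exporter
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)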
Alerting Strategy
Configure alerts that matter:
# SLO-based alerting
- alert: ErrorBudgetBurnRate
  expr: (1 - sli_success_rate) > (error_budget * 6)  # 6x burn rate
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate for {{ $labels.service }}"

- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.2
  for: 2m
  labels:
    severity: warning
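Note that sli_success_rate and error_budget above are not built-in metrics; Prometheus only knows about them once you define them. A minimal recording-rule sketch for the success-rate SLI, assuming http_requests_total carries a status label (the error budget itself, e.g. 0.001 for a 99.9% availability SLO, can be inlined as a constant in the alert expression):
groups:
  - name: sli-recording-rules
    rules:
      # Fraction of non-5xx requests over the last 5 minutes, per service
      - record: sli_success_rate
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)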
Retention and Storage Strategy
For production SRE monitoring, tier your retention (a sample chart-values sketch follows the list):
- Short-term (15 days): High-resolution data for incident response
- Medium-term (90 days): Aggregated data for trend analysis
- Long-term (1+ year): Summary metrics for capacity planning
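The short-term tier maps directly to chart values. A sketch assuming the prometheus-community chart used above, where server.retention controls how long the local TSDB keeps data (exact value paths can vary between chart versions, so check helm show values); the medium- and long-term tiers typically require remote_write to external storage such as Thanos, Mimir, or Azure Monitor managed Prometheus, which is outside the scope of this snippet:
server:
  retention: "15d"        # high-resolution local data for incident response
  persistentVolume:
    enabled: true
    size: 50Gi            # size for the retention window and expected ingest rate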
Operational Excellence
Monitoring the Monitoring
Your observability stack needs observability too (a sample alert sketch follows the list):
- Monitor Prometheus scrape success rates
- Alert on Grafana dashboard load times
- Track alert manager notification success
- Monitor storage usage and retention
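A couple of self-monitoring alerts as a sketch: up is a standard Prometheus metric for scrape health, and alertmanager_notifications_failed_total is exposed by Alertmanager itself; treat the thresholds and durations as starting points:
- alert: TargetScrapeFailing
  expr: up == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus cannot scrape {{ $labels.job }}/{{ $labels.instance }}"

- alert: AlertmanagerNotificationsFailing
  expr: rate(alertmanager_notifications_failed_total[5m]) > 0
  for: 10m
  labels:
    severity: critical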
Disaster Recovery
Ensure your monitoring survives outages (a sample HA values sketch follows the list):
- Deploy Prometheus in HA mode with multiple replicas
- Use persistent volumes with backup strategies
- Document recovery procedures
- Test restore processes regularly
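For the HA point, a hedged values sketch: replica counts and stateful-set settings exist in the prometheus-community chart, but exact value paths differ between chart versions, so verify against helm show values before applying:
server:
  replicaCount: 2        # independent replicas scraping the same targets
  statefulSet:
    enabled: true        # each replica gets its own persistent volume
alertmanager:
  enabled: true
  replicaCount: 2        # Alertmanager deduplicates notifications across replicas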
Conclusion
This monitoring stack provides the foundation for SRE practices like SLI/SLO management, error budget tracking, and data-driven incident response. Remember: the goal isn't just to collect metrics, but to gain insights that improve system reliability and user experience.
The real value comes from using this data to make informed decisions about where to invest engineering effort, when to slow down deployments, and how to continuously improve system reliability.