We are looking for an experienced Site Reliability Engineer (SRE) to design and implement scalable monitoring frameworks, lead incident response processes, and build automation-driven solutions that reduce operational toil. The ideal candidate has hands-on experience with Grafana, Prometheus, Kubernetes (AKS), CI/CD pipelines, Infrastructure as Code (IaC), and Azure cloud services.
Design and implement end-to-end monitoring, observability, and alerting frameworks using tools such as Grafana, Prometheus, Loki, and Elasticsearch.
Build dashboards, metrics pipelines, and log aggregation systems for application and infrastructure visibility.
Ensure proactive detection of performance issues, errors, anomalies, and service health degradation.
Lead P1/P2 incident response, root cause analysis (RCA), and post-incident reviews.
Define and maintain SLOs/SLIs, error budgets, and reliability KPIs.
Reduce operational toil through automation, self-healing solutions, and proactive remediation strategies.
Build runbooks, playbooks, and incident workflows to improve response efficiency.
Manage and optimize Kubernetes clusters (AKS), including deployments, scaling, networking, and security.
Implement Infrastructure as Code (IaC) using Terraform, ARM templates, or Bicep.
Build and maintain CI/CD pipelines using tools such as Azure DevOps, GitHub Actions, or Jenkins.
Automate infrastructure provisioning, environment setup, backup, failover, and compliance workflows.
Partner with development teams to design reliable and scalable systems, embedding SRE principles from the start.
Participate in architecture discussions, reviewing system design for reliability, performance, and observability.
Influence engineering decisions regarding performance optimization, resiliency patterns, distributed tracing, and fault tolerance.
Strong working knowledge of Azure services, including:
AKS
Azure Monitor / Log Analytics
Application Gateway / Front Door
Azure Functions
Azure Storage
App Services
Azure Networking (VNet, NSG, Load Balancers)
Understanding of cloud cost optimization and capacity planning.
5–10 years of experience in SRE, DevOps, or Infrastructure Engineering.
Hands-on experience with Grafana, Prometheus, Loki, and Alertmanager.
Strong understanding of Kubernetes (AKS) and containerization.
Experience with CI/CD pipelines and deployment automation.
Proficiency in Infrastructure as Code (Terraform/ARM/Bicep).
Solid understanding of Azure Cloud Architecture.
Expertise in incident management, RCA, and SLO/SLI design.
Solid experience with shell scripting, Python, or Go for automation.
Experience with a service mesh (Istio/Linkerd)
Experience with distributed tracing tools (Jaeger, OpenTelemetry)
Knowledge of GitOps models (ArgoCD/Flux)
Background in microservices architecture
Strong analytical and troubleshooting skills
Ability to lead during critical incidents
Excellent communication and stakeholder management skills
Passion for automation and reliability engineering