Site Reliability Engineers at N Consulting Ltd

Job Description – SRE

Location: Pan India
Experience: 5–10 Years
Role Type: SRE

About the Role

We are looking for an experienced Site Reliability Engineer (SRE) to design and implement scalable monitoring frameworks, lead incident response processes, and build automation-driven solutions that reduce operational toil. The ideal candidate has hands-on experience with Grafana, Prometheus, Kubernetes (AKS), CI/CD pipelines, Infrastructure as Code (IaC), and Azure cloud services.

Key Responsibilities

Monitoring, Logging & Alerting

Design and implement end-to-end monitoring, observability, and alerting frameworks using tools such as Grafana, Prometheus, Loki, and Elasticsearch.

Build dashboards, metrics pipelines, and log aggregation systems for application and infrastructure visibility.

Ensure proactive detection of performance issues, errors, anomalies, and service health degradation.

Incident Response & Reliability Engineering

Lead P1/P2 incident response, root cause analysis (RCA), and post-incident reviews.

Define and maintain SLOs/SLIs, error budgets, and reliability KPIs.

Reduce operational toil through automation, self-healing solutions, and proactive remediation strategies.

Build runbooks, playbooks, and incident workflows to improve response efficiency.

Kubernetes, Cloud & Automation

Manage and optimize Kubernetes clusters (AKS) including deployments, scaling, networking, and security.

Implement Infrastructure as Code (IaC) using Terraform, ARM templates, or Bicep.

Build and maintain CI/CD pipelines using tools such as Azure DevOps, GitHub Actions, or Jenkins.

Automate infrastructure provisioning, environment setup, backup, failover, and compliance workflows.

Dev Collaboration & Architecture Reliability

Partner with development teams to design reliable and scalable systems, embedding SRE principles from the start.

Participate in architecture discussions, reviewing system design for reliability, performance, and observability.

Influence engineering decisions regarding performance optimization, resiliency patterns, distributed tracing, and fault tolerance.

Azure Cloud Expertise

Strong working knowledge of Azure services, including:

AKS

Azure Monitor / Log Analytics

Application Gateway / Front Door

Azure Functions

Azure Storage

App Services

Azure Networking (VNet, NSG, Load Balancers)

Understanding of cloud cost optimization and capacity planning.

Required Skills

5–10 years of experience in SRE, DevOps, or Infrastructure Engineering.

Hands-on experience with Grafana, Prometheus, Loki, and Alertmanager.

Strong understanding of Kubernetes (AKS) and containerization.

Experience with CI/CD pipelines and deployment automation.

Proficiency in Infrastructure as Code (Terraform/ARM/Bicep).

Solid understanding of Azure Cloud Architecture.

Expertise in incident management, RCA, and SLO/SLI design.

Solid scripting experience in Shell, Python, or Go for automation.

Preferred Skills

Experience with Service Mesh (Istio/Linkerd)

Distributed tracing tools: Jaeger, OpenTelemetry

Knowledge of GitOps models (ArgoCD/Flux)

Background in microservices architecture

Soft Skills

Strong analytical and troubleshooting skills

Ability to lead during critical incidents

Excellent communication and stakeholder management

Passion for automation and reliability engineering

This job has now closed
