Location: Pan India
Experience: 5–10 Years
Role: On-Prem Infrastructure Engineer / Site Reliability Engineer (SRE)
We are seeking a skilled On-Prem Infrastructure Engineer / SRE to manage and support NVIDIA’s on-prem engineering cloud infrastructure across multiple data centers. The ideal candidate will have strong experience in bare-metal infrastructure management, observability tools, automation, and production support. This role is critical in ensuring uptime, reliability, and operational excellence for engineering services.
Manage and operate NVIDIA’s on-prem infrastructure across distributed data centers.
Maintain high availability, reliability, and readiness of on-prem engineering cloud environments.
Perform lifecycle management of bare-metal servers and underlying hardware.
Uphold Service Level Agreements (SLAs) for mission-critical engineering services.
Implement and maintain monitoring, alerting, and incident response workflows.
Drive root cause analysis (RCA), conduct post-mortems, and ensure corrective and preventive actions are implemented.
Deploy, configure, and manage observability tools such as Prometheus, Grafana, and the ELK Stack.
Maintain KPI monitoring pipelines using Jenkins, Python, and ELK.
Develop and enhance custom monitoring dashboards and business-specific alerting rules.
Contribute to capacity planning, resource optimization, and performance tuning initiatives.
Develop automation scripts/tools using Python, Go, Bash, or Jenkins pipelines.
Improve operational efficiency through continuous automation.
Monitor system alerts, troubleshoot incidents, and resolve user-reported issues.
Participate in war rooms during major or high-impact incidents.
Ensure timely escalation and resolution of production issues.
Create and maintain technical documentation for operational procedures, architectures, and troubleshooting steps.
Work closely with engineering, DevOps, hardware, and data center teams to improve overall infrastructure reliability.
Strong hands-on experience in bare-metal server management using tools such as IPMI, Redfish, KVM, or similar technologies.
Experience with automation and scripting using Python, Go, Bash, and Jenkins (CI/CD pipelines).
Practical experience with infrastructure tools such as Kubernetes, MySQL, Prometheus, Grafana, and ELK (Elasticsearch, Logstash, Kibana).
Solid understanding of system performance, capacity planning, and datacenter operations.
Strong troubleshooting, incident-response, and operational debugging skills.
Ability to work in fast-paced environments and handle production-critical scenarios.
Familiarity with NVIDIA hardware: GPUs, Tegra systems, DGX platforms, etc.
Experience in large-scale distributed systems or high-performance computing environments.
Strong communication and collaboration abilities.
Analytical mindset with a focus on problem-solving.
Ability to maintain composure under pressure in incident environments.
Detail-oriented with strong documentation habits.