Experience Required: 8+ Years
Key Responsibilities:
• Monitor, maintain, and improve reliability, availability, and performance of enterprise applications and infrastructure.
• Implement ITSM processes such as incident, problem, and change management to ensure operational excellence.
• Identify and eliminate bottlenecks by developing automation and proactive monitoring solutions.
• Collaborate with development and infrastructure teams to ensure smooth deployment and reliable operation of applications.
• Participate in on-call rotations and shift operations, ensuring prompt response to critical incidents and their timely resolution.
• Conduct root cause analysis (RCA) for high-impact incidents and drive permanent fixes.
• Develop and maintain runbooks, standard operating procedures (SOPs), and service documentation.
• Gather metrics, generate performance reports, and support continuous improvement initiatives.
Required Skills and Competencies:
• Strong understanding of ITSM frameworks (preferably ITIL) and service operations for enterprise-scale environments.
• Experience with application monitoring, alerting, and observability tools (e.g., Prometheus, Grafana, Splunk, AppDynamics, or Dynatrace).
• Familiarity with cloud infrastructure (AWS, Azure, or GCP) and key DevOps/SRE practices.
• Proficiency in incident response, system troubleshooting, and performance optimization.
• Basic scripting or automation skills (Python, Shell, or PowerShell) for operational efficiency.
• Excellent collaboration and communication skills with a proactive problem-solving mindset.
• Willingness to work in rotational shifts and support 24×7 production environments.