As part of the Azure CXP CRE team, your responsibilities include:

On-call Communication Management during regular on-call rotations
*Join incident bridges and work with engineering to obtain real-time outage details.
*Understand incident scope, impact, and mitigation to translate complex technical findings into clear, professional, and decisive updates for customers and stakeholders.
*Keep communications consistent and fact-based throughout the incident; confirm information with engineering and leadership before sharing.
*Assist with publishing Public Incident Reports and RCA summaries.
*Support live site incident (LSI) operations, including triage, resolution, and post-incident analysis.
*Share details of incidents and their resolution through post-mortem reports and during regular review meetings.

Problem Management & Data Analytics
*Design and implement automated detection systems to identify impacted resources in real time.
*Collaborate with engineering and operations teams to enhance telemetry, monitoring, and alerting accuracy while reducing false positives.
*Develop dashboards and visualizations in Power BI and Azure Data Explorer to support data-driven insights.
*Build scalable data collection and analysis frameworks to improve service reliability and incident response.
*Participate in incident resolution workflows and provide actionable insights to drive platform and process improvements.
*Communicate technical findings and recommendations to stakeholders through clear, data-backed reporting.

Tooling & Automation
*Develop tools and analytics pipelines to automatically assess incident impact and blast radius across services, regions, and customers in real time.
*Design and maintain automation solutions that enhance incident detection, monitoring, communication, and remediation while reducing operational toil and repeat issues.
*Identify recurring problems, propose preventive solutions, and collaborate with engineers and teams to implement fixes.
*Build and support no-code/low-code solutions to optimize operations and improve team efficiency.
*Collaborate with product, infrastructure, and operations teams to align automation initiatives with organizational reliability and customer trust goals.

Required Qualifications
*Bachelor's degree in Computer Science, Information Technology, Data Science, Cybersecurity, or a related field AND 7+ years of technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls; OR equivalent hands-on experience.
*Hands-on experience implementing AI-driven solutions and automation, with proficiency in one or more programming/automation languages (e.g., C#, Java, JavaScript, Python) or equivalent expertise, is a plus.
*Certifications in cloud technologies (Azure, AWS, GCP), ITIL, or SRE frameworks are desirable.
*Strategic thinking and a customer-first mindset; able to advocate for improvements in platform transparency and experience.
*Excellent problem-solving, judgment, decision-making, communication, and collaboration skills.
*Understanding of SRE principles, including SLAs/SLOs, telemetry, and monitoring.
*Proven experience in cloud operations, incident and crisis management, or large-scale systems engineering, ideally within platforms such as Azure, AWS, or GCP.
*Contribute to a data-driven culture as well as a culture of experimentation across the organization.
*Own and drive projects and features by working towards the team's defined goals and milestones.
*Create prototypes and proofs of concept for iterative development.
*Be curious and willing to learn and grow.

Preferred:
*5+ years of demonstrated experience as an Incident Commander or Crisis Manager for critical, high-severity incidents in high-availability, distributed environments.
*Experience with SRE (Site Reliability Engineering) principles and practices.
*Exposure to chaos engineering, fault injection, or high availability architecture.

AI/ML Experience: [Beginner to Intermediate]
*Familiarity with how AI/ML models are integrated into cloud infrastructure and their potential failure modes.
*Experience using AI-powered tools for incident analysis, log correlation, or predictive alerting.
*An understanding of the challenges and risks associated with AI/ML systems in a production environment.

Certifications:
*Relevant cloud certifications (e.g., AWS Certified DevOps Engineer, Azure Solutions Architect, GCP Professional Cloud Architect).
*Certifications in ITIL, SRE, or other relevant frameworks.