Incident Manager Site Reliability Engineering

KA, IN, India

Job Description

The Role


The LeadSquared platform and product suite run entirely in the cloud, currently all on AWS. The suite comprises a large number of applications, services, and APIs built on various open-source and AWS-native tech stacks and deployed across multiple AWS accounts.


We are seeking a Senior Incident Manager to lead critical incident response efforts across our production systems and infrastructure. In this high-impact role, you will own the entire lifecycle of major incidents, ensuring they are resolved quickly, communicated clearly, and analysed deeply.


You'll not only lead during crisis moments but also build scalable incident management practices, define policy, train teams, and drive continuous reliability improvements.



Key Responsibilities

  • System Reliability and Architecture: Lead system capacity planning and drive improvements in key metrics such as MTTD, MTTA, and MTTR.
  • Observability and Monitoring: Extend monitoring coverage of critical applications and reduce noisy alerts.
  • Incident Detection & Triage: Monitor systems and inputs to identify incidents, validate them, and classify them by severity and business impact.
  • Incident Response Coordination: Lead high-priority incidents, mobilize the relevant teams, and coordinate response efforts effectively.
  • Communication & Status Updates: Share timely updates with leadership and teams; ensure accurate status tracking and clear communication to customers when needed.
  • Post-Incident Analysis & RCA: Conduct root cause analysis for major incidents, facilitate blameless post-mortems, and follow through on corrective actions.
  • Process & Playbook Ownership: Maintain and improve incident processes, SLAs, escalation paths, and supporting documentation such as runbooks and templates.
  • Collaboration & Stakeholder Alignment: Work closely with cross-functional teams to resolve recurring issues and align on improvements.
  • Tooling & Automation: Manage incident tools and drive automation in detection, alerting, and reporting.
  • Training & Readiness: Conduct simulation drills and train teams on incident response best practices.


Key Requirements

  • 4+ years of experience in incident response, SRE, DevOps, or production operations
  • Proven experience leading high-severity incident responses across distributed systems
  • Strong technical fluency in cloud platforms such as AWS, and in monitoring and alerting
  • Expertise with incident management tools (PagerDuty, Opsgenie, Blameless, etc.)
  • Outstanding communication and stakeholder management skills
  • Familiarity with SLI/SLO/SLA frameworks, observability, and reliability engineering
  • Deep understanding of ITIL or incident lifecycle processes
  • Calm, structured, and analytical decision-maker, especially under pressure


Why Should You Apply


  • Fast-paced environment
  • Accelerated growth and rewards
  • Easily approachable management
  • Work with the best minds and industry leaders
  • Flexible work timings


Interested?



If this role sounds like you, then apply with us! You have plenty of room for growth at LeadSquared.



Job Detail

  • Job Id
    JD3961516
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    KA, IN, India
  • Education
    Not mentioned
  • Experience
    Year