The Role
The LeadSquared platform and product suite run entirely in the cloud, currently all on AWS. The suite comprises a large number of applications, services, and APIs built on various open-source and AWS-native tech stacks and deployed across multiple AWS accounts.
We are seeking a Senior Incident Manager to lead critical incident response efforts across our production systems and infrastructure. In this high-impact role, you will own the entire lifecycle of major incidents, ensuring they are resolved quickly, communicated clearly, and analysed deeply.
You'll not only lead during crisis moments but also build scalable incident management practices, define policy, train teams, and drive continuous reliability improvements.
Key Responsibilities
System Reliability and Architecture: Lead system capacity planning and drive improvements in key metrics such as MTTD, MTTA, and MTTR.
Enhance Observability and Monitoring: Extend monitoring coverage of critical applications and reduce noisy alerts.
Incident Detection & Triage: Monitor systems and inputs to identify incidents, validate them, and classify them by severity and business impact.
Incident Response Coordination: Lead high-priority incidents, mobilize relevant teams, and coordinate response efforts effectively.
Communication & Status Updates: Share timely updates with leadership and teams; ensure accurate status tracking and clear communication to customers when needed.
Post-Incident Analysis & RCA: Conduct root cause analysis for major incidents, facilitate blameless post-mortems, and follow through on corrective actions.
Process & Playbook Ownership: Maintain and improve incident processes, SLAs, escalation paths, and supporting documentation like runbooks and templates.
Collaboration & Stakeholder Alignment: Work closely with cross-functional teams to resolve recurring issues and align on improvements.
Tooling & Automation: Manage incident tools and drive automation in detection, alerting, and reporting.
Training & Readiness: Conduct simulation drills and train teams on incident response best practices.
Key Requirements
4+ years of experience in incident response, SRE, DevOps, or production operations
Proven experience leading high-severity incident responses across distributed systems
Strong technical fluency in cloud platforms such as AWS, and in monitoring and alerting
Expertise with incident management tools (PagerDuty, Opsgenie, Blameless, etc.)
Outstanding communication and stakeholder management skills
Familiarity with SLI/SLO/SLA frameworks, observability, and reliability engineering
Deep understanding of ITIL or incident lifecycle processes
Calm, structured, and analytical decision-maker, especially under pressure
Why Should You Apply
Fast-paced environment
Accelerated growth and rewards
Easily approachable management
Work with the best minds and industry leaders
Flexible work timings
Interested?
If this role sounds like you, apply with us! You will have plenty of room for growth at LeadSquared.