High Availability and Scalability Engineering Lead
Job Title: High Availability & Scalability Engineering Lead
Role Overview:
The High Availability & Scalability Engineering Lead is responsible for designing, implementing, and managing highly available, fault-tolerant, and scalable systems to support critical business applications. This role blends deep technical expertise in distributed systems, cloud infrastructure, and performance optimization with leadership and cross-functional collaboration.
You will lead a team of engineers to ensure that all platforms meet stringent SLAs for uptime, resilience, and scalability, especially under peak load and in failure scenarios.
Key Responsibilities:
Architecture & Design
Design and implement high-availability architectures using clustering, load balancing, replication, and failover strategies.
Lead design reviews for scalable distributed systems (microservices, event-driven, or service mesh architectures).
Evaluate and adopt cloud-native technologies (e.g., Kubernetes, ECS, autoscaling groups, service meshes, serverless) to enhance elasticity and resilience.
Drive the definition of RTO/RPO, failover automation, and multi-region deployment strategies.
Implementation & Operations
Develop and enforce SLAs, SLOs, and SLIs for reliability, latency, and performance.
Lead efforts in capacity planning, performance tuning, and chaos testing to ensure predictable system behavior under stress.
Collaborate with DevOps and SRE teams to automate infrastructure provisioning (e.g., Terraform, Pulumi, CloudFormation).
Establish monitoring, alerting, and self-healing mechanisms using tools such as Prometheus, Grafana, Datadog, or New Relic.
Leadership & Strategy
Mentor and guide engineers on designing resilient, performant, and secure architectures.
Partner with product and platform engineering to forecast future growth and capacity needs.
Create frameworks and best practices for high availability, DR, and horizontal scalability across teams.
Lead incident reviews, root cause analysis, and reliability retrospectives to drive continuous improvement.
Required Skills & Qualifications:
Bachelor's or Master's in Computer Science, Engineering, or related field.
8+ years of experience in backend, infrastructure, or systems engineering; 3+ years in a leadership or architect role.
Deep expertise with cloud platforms (Azure) and container orchestration (Kubernetes, Docker, ECS).
Proficiency in distributed systems design, load balancing, replication, failover, and data partitioning.
Strong programming experience in one or more of: Go, Python, Java, or C++.
Experience with observability and reliability engineering (monitoring, logging, tracing, SLOs).
Proven ability to lead cross-functional initiatives, drive architectural decisions, and scale systems supporting millions of users or high transaction volumes.
Track record of consistently meeting uptime and latency SLAs across services.
Demonstrated reduction in mean time to recovery (MTTR) and incident frequency.
Experience documenting and automating failover and scaling strategies.
Demonstrated mentorship and technical leadership within engineering teams.
Job Type: Full-time
Pay: ₹670,805.33 - ₹2,059,333.85 per year
Benefits:
Cell phone reimbursement
Health insurance
Internet reimbursement
Paid sick time
Paid time off