Senior Site Reliability Engineer (SRE)

MH, IN, India

Job Description


Job Title: Senior Site Reliability Engineer (SRE) - Automation & Observability



Experience: 8-10 years



Education: Any Degree



Location: Mumbai



Role Level: Senior Individual Contributor (Customer-facing)







We are seeking a Senior Site Reliability Engineer (SRE) to own and continuously improve the reliability, availability, scalability, and performance of business-critical services across multi-cloud environments (AWS, Azure, GCP).

This role combines strong SRE fundamentals, automation engineering, and observability expertise with customer leadership. You will work closely with customer engineering teams to embed reliability into application design, drive automation, lead incident response, and demonstrate measurable SRE outcomes through dashboards and metrics.



Key Responsibilities:



Reliability Engineering & SRE Practices:



Define, implement, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical services. Continuously monitor SLO compliance and drive improvements based on error budget consumption. Participate in architecture reviews focused on high availability, disaster recovery, scalability, and fault tolerance.


Incident, Problem & Change Management:



Lead incident response, acting as the Tier-3 escalation point for SRE and operations teams. Drive blameless postmortems, Root Cause Analysis (RCA), and ensure corrective and preventive actions are implemented. Define and maintain incident response runbooks, escalation paths, and on-call processes. Track and improve key reliability metrics including MTTR, incident frequency, and change failure rate.


Automation & Infrastructure as Code:



Automate infrastructure provisioning and operational workflows using Terraform, CloudFormation, and AWS CDK. Build and maintain CI/CD pipelines supporting canary deployments, blue/green strategies, and automated rollbacks. Implement event-driven automation and auto-remediation using AWS Lambda, Step Functions, or Azure Functions. Continuously identify and eliminate operational toil through automation and self-healing systems.


Monitoring, Observability & Logging:



Design, implement, and operate end-to-end observability platforms covering metrics, logs, and traces. Hands-on experience with:

o New Relic / Datadog for APM, distributed tracing, and SLO tracking

o Prometheus for metrics collection

o Grafana for dashboards and SRE scorecards

o Graylog / ELK for centralized logging and RCA

Ensure alerts are SLO-driven, actionable, and noise-free. Build customer-facing dashboards to clearly demonstrate SRE service outcomes.


Cloud Infrastructure & Platform Reliability:



Provision and manage cloud infrastructure across AWS, Azure, and/or GCP. Operate compute, storage, networking, load balancers, VPNs, and private connectivity. Manage patching, backups, encryption, IAM/RBAC, and disaster recovery readiness. Optimize performance and cost through rightsizing, autoscaling, and capacity planning. Ensure reliability of data platforms such as MongoDB / MongoDB Atlas, Elasticsearch / OpenSearch, MySQL (RDS), and DocumentDB.


Customer Engagement & Mentorship:



Act as the primary technical contact for assigned customer accounts. Lead reliability and observability discussions with customers and internal stakeholders. Mentor mid-level and junior SREs, conducting reliability-focused design and operational reviews. Maintain high-quality documentation, runbooks, SOPs, and operational playbooks.


Required Qualifications:



8-10 years of experience in SRE, Cloud Engineering, or Production Operations roles. Strong OS fundamentals: Linux and Windows, with scripting (Bash, PowerShell). Strong programming skills in Python, Go, or equivalent. Proven hands-on experience with:

o Infrastructure as Code (Terraform, CloudFormation, CDK)

o CI/CD pipelines and deployment automation

o Observability tools (New Relic, Datadog, Prometheus, Grafana, Graylog, ELK)

o Distributed systems at production scale

Cloud certifications (one or more):

o AWS (Associate or Professional)

o Azure (AZ-104 / Architect Expert)

o GCP (Professional Cloud Architect)

Cloud-agnostic certification such as Terraform Associate, CKA, or SRE Foundation.


Nice-to-Have Skills:



Experience with multi-cloud or hybrid architectures. Exposure to cross-region or cross-cloud data replication. Hands-on experience with chaos engineering or fault injection. Knowledge of ITIL, Agile, or SRE maturity models. Experience with serverless architectures (AWS Lambda, Azure Functions).



Job Detail

  • Job Id
    JD5094688
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type:
    Full Time
  • Salary:
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    MH, IN, India