Reliability Engineering Lead

TS, IN, India

Job Description

Position Title: Reliability Engineering Lead

Location: Hyderabad


Role Description (Process-First Responsibilities)



1. Service Level Management & Reliability Framework



Process Owner:

SLO-driven reliability decision making across digital services

Establish SLO Foundation:

Define, implement, and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services, ensuring alignment with business impact and patient safety requirements

Error Budget Management:

Implement error budget policies that balance feature velocity with reliability, using budget consumption as the primary decision-making tool for release management and incident prioritization

Reliability Governance:

Create and maintain reliability standards that comply with GxP, SOX, and other pharmaceutical regulatory frameworks while enabling innovation velocity

Business Impact Correlation:

Translate technical reliability metrics into business language, demonstrating clear connections between SLO compliance and revenue, patient safety, or operational efficiency
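
As an illustration of the error budget idea above, the following is a minimal sketch, not a prescribed implementation. It assumes a request-based SLO measured over a single window; the function name, the field names, and the rule "hold releases once the budget is spent" are hypothetical choices for the example and are not taken from this posting.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize error budget consumption for one SLO window.

    slo_target is the fraction of events that must be good (e.g. 0.999).
    All names and the release rule below are illustrative assumptions.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if bad_events < 0:
        raise ValueError("bad_events must not be negative")

    # The budget is the number of bad events the SLO tolerates in the window.
    allowed_bad = (1.0 - slo_target) * total_events
    consumed = bad_events / allowed_bad

    return {
        "allowed_bad_events": allowed_bad,
        "budget_consumed": consumed,          # 1.0 means the budget is fully spent
        "budget_remaining": 1.0 - consumed,   # negative once the SLO is breached
        "release_allowed": consumed < 1.0,    # assumed policy: freeze when spent
    }


if __name__ == "__main__":
    # Example: 99.9% SLO, one million requests, 400 failures in the window.
    print(error_budget(0.999, 1_000_000, 400))
```

In this sketch a 99.9% target over one million requests tolerates roughly 1,000 failures, so 400 failures consume about 40% of the budget. A real policy would typically also consider burn rate over shorter windows.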

2. Incident Management & Learning Culture



Process Owner:

Blameless incident response and organizational learning

Incident Command:

Lead critical incident response using structured protocols, focusing on rapid detection, mitigation, and recovery while maintaining detailed audit trails for regulatory compliance

Blameless Postmortem Leadership:

Facilitate blameless postmortems that focus on system improvements rather than individual accountability, creating a culture of psychological safety for honest analysis

Learning Repository Management:

Maintain and curate incident learning repositories with transparent sharing across digital units, enabling pattern recognition and systemic improvement

Predictive Issue Prevention:

Implement proactive monitoring and alerting systems that identify potential failures before they impact users, shifting from reactive to preventive operations

3. Toil Elimination & Engineering Balance



Process Owner:

Systematic automation of operational overhead

Toil Measurement & Reduction:

Keep operational work (toil) below 50% of working time through systematic identification, measurement, and elimination of manual, repetitive tasks

Automation Strategy:

Design and implement automation solutions using cost-benefit analysis, prioritizing work that scales linearly with service growth and requires minimal human judgment

Engineering Project Delivery:

Dedicate at least 50% of time to engineering projects that improve reliability, performance, or developer experience, delivering measurable improvements quarterly

Knowledge Transfer:

Create self-service documentation, runbooks, and automation tools that reduce dependency on human intervention and enable team scaling
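
The 50% toil threshold above can be tracked with a simple calculation. The sketch below is only an illustration: it assumes hours are logged per work category, and the category names and the function name are hypothetical.

```python
from typing import Dict, Set


def toil_fraction(hours_by_category: Dict[str, float], toil_categories: Set[str]) -> float:
    """Return the share of logged hours spent on toil (0.0 to 1.0).

    Which categories count as toil is an assumption supplied by the caller.
    """
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(h for cat, h in hours_by_category.items() if cat in toil_categories)
    return toil / total


if __name__ == "__main__":
    # Hypothetical week: 18 of 40 logged hours are manual, repetitive work.
    week = {"tickets": 12.0, "manual_deploys": 6.0, "automation_project": 22.0}
    share = toil_fraction(week, {"tickets", "manual_deploys"})
    print(f"toil share: {share:.0%}, within 50% limit: {share < 0.5}")
```

Tracking this share per person and per quarter makes it easier to see whether automation work is actually reducing the operational load.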

4. Platform Engineering Integration & AI Enablement



Process Owner:

Reliability integration in AI-first platform services

AI Workload Reliability:

Design and implement reliability practices for AI/ML workloads, including agent-to-agent communication systems, model serving infrastructure, and data pipeline reliability

Platform Collaboration:

Partner with platform teams to embed reliability principles into Internal Developer Platforms (IDPs), enabling self-service infrastructure with built-in reliability guardrails

Agentic System Support:

Provide reliability engineering expertise for Sanofi's agentic AI ecosystem, ensuring conversational AI systems meet enterprise reliability and compliance standards

Developer Experience Enhancement:

Contribute to CI/CD pipeline reliability, infrastructure-as-code best practices, and observability integration that accelerates developer productivity

5. Observability & Performance Engineering



Process Owner:

Comprehensive system visibility and performance optimization

Full-Stack Observability:

Implement and maintain observability platforms covering metrics, logs, traces, and business KPIs, providing end-to-end visibility into service health and user experience

Performance Optimization:

Conduct systematic performance engineering including capacity planning, bottleneck identification, and scalability improvements aligned with business growth projections

Intelligent Monitoring:

Deploy AI-powered monitoring and alerting systems that reduce noise, provide intelligent root cause analysis, and enable predictive maintenance

Cross-System Correlation:

Establish monitoring federation across diverse technology stacks (cloud, on-premises, legacy) while maintaining regulatory audit trails

6. Security & Compliance Integration



Process Owner:

Reliability practices within regulatory frameworks

Secure Reliability Engineering:

Implement reliability practices that enhance rather than compromise security posture, integrating DevSecOps principles with pharmaceutical compliance requirements

Compliance Automation:

Automate compliance checks, audit trail generation, and regulatory reporting while maintaining system reliability and performance

Risk Assessment Integration:

Conduct reliability impact assessments for changes affecting GxP systems, balancing innovation speed with regulatory validation requirements

Disaster Recovery:

Design and test disaster recovery procedures that meet both technical recovery objectives and regulatory continuity requirements

7. Team Leadership



Process Owner:

Represent the reliability engineering discipline

Team Development:

Mentor and develop a team of SREs who can work independently across the key SRE principles

Communication:

Provide concise, strategic updates to the leadership team

Lead by Example:

Demonstrate expertise by taking on complex scenarios and providing innovative solutions that can be leveraged by the team, documented for knowledge sharing, and scaled across the organization to drive systematic reliability improvements



Job Detail

  • Job Id
    JD4925919
  • Total Positions
    1
  • Job Type
    Full Time
  • Employment Status
    Permanent
  • Job Location
    TS, IN, India