Principal Engineer, Site Reliability

Year    Hyderabad, Telangana, India

Job Description

About TMUS Global Solutions
T-Mobile is Americas supercharged Un-carrier, challenging conventions and setting new standards in wireless. With the nations largest and fastest 5G network, T-Mobile delivers advanced connectivity and unmatched value to millions across the U.S. Were unwaveringly obsessed with providing the best possible service experience, driven by a spirit of disruption that fuels competition and innovation in wireless and beyond.
TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions.

About the Role
As a Principal SRE,you will be a key member of the CFL Platform Engineering and Operations team ,you will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems. You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.
What Youll Do

  • Architect observability and incident response pipelines for LLM, API, and backend systems
  • Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability
  • Lead high-severity incident response, root cause analysis, and system recovery
  • Collaborate with AI, Platform, and Security teams to enforce operational guardrails
  • Implement automation-first strategies using GitLab CI/CD, Terraform, and deployment tooling
  • Guide infrastructure tuning, capacity planning, and cost optimization
  • Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, OpenTelemetry
  • Support AIOps, model observability, policy enforcement, and audit readiness
  • Mentor senior SREs and foster a high-ownership, technical excellence culture
What Youll Bring
  • Bachelor's or Masters in Computer Science, Engineering, or related field
  • 7-12 years in SRE, infrastructure, or platform roles in distributed systems
  • Strong experience in incident management, AI/ML observability, and performance engineering
  • Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs
  • Proficiency in Python, Java, Bash/PowerShell, YAML
  • Deep knowledge of CI/CD workflows, GitLab pipelines, and SDLC processes
  • Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, MongoDB
  • Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI
  • Familiarity with AIOps, latency scoring, policy validation, and secure AI operations
  • Background in compliance, governance, and enterprise risk management for AI systems
  • Advanced debugging skills across data, infrastructure, networking, and app layers
  • Leadership in chaos engineering, SLO-based operations, and system resilience
Must Have Skills
  • Application & Microservice: Java, Spring boot, API & Service Design
  • Any CI/CD Tools: Gitlab Pipeline/Test Automation/GitHub Actions/ Jenkins /Circle CI
  • App Platform: Docker & Containers (Kubernetes)
  • Any Databases: SQL & NOSQL (Cassandra/Oracle/Snowflake/MongoDB)
  • Any Messaging: Kafka, Rabbit MQ
  • Any Observability/Monitoring: Splunk/ Grafana/ Open Telemetry /ELK Stack/ Datadog/ New Relic/ Prometheus)
  • Incident/Change/Problem Management
Nice To Have
  • Compliance-aligned continuity planning (PCI, SOX)
  • Error-budget pacts with product/org leadership
  • Executive Incident/Change/Problem /risk reporting
  • Observability cost vs coverage trade-offs
  • Org-wide reliability governance strategy

Skills Required

Beware of fraud agents! do not pay money to get a job

MNCJobsIndia.com will not be responsible for any payment made to a third-party. All Terms of Use are applicable.


Job Detail

  • Job Id
    JD4267625
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type:
    Full Time
  • Salary:
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Hyderabad, Telangana, India
  • Education
    Not mentioned
  • Experience
    Year