Principal Engineer, Site Reliability

Hyderabad, Telangana, India

Job Description

About TMUS Global Solutions
T-Mobile is America's supercharged Un-carrier, challenging conventions and setting new standards in wireless. With the nation's largest and fastest 5G network, T-Mobile delivers advanced connectivity and unmatched value to millions across the U.S. We're unwaveringly obsessed with providing the best possible service experience, driven by a spirit of disruption that fuels competition and innovation in wireless and beyond.

Responsibilities:

  • Lead resolution of high-severity/complex incidents across hybrid infrastructure.
  • Architect and implement automation frameworks, self-healing workflows, and AI-driven ops.
  • Define SRE best practices, reliability SLIs/SLOs/SLAs, and operational standards.
  • Partner with application and platform engineering teams to improve resilience.
  • Drive observability maturity: predictive monitoring, anomaly detection, automated RCA.
  • Own continuous improvement of Engineer(s)/Sr Engineer(s) runbooks and automation pipelines.
  • Provide technical leadership, mentor junior SREs, and conduct training.
  • Identify new technologies, tools, and processes that elevate operational excellence.
Skills:
Mandatory Skills (Must-Have):
Incident Command & Complex Troubleshooting:
  • Expectation: Take leadership during high-severity outages, orchestrating technical response across teams.
  • Example: Lead a Sev-1 bridge call where multiple microservices are failing due to cascading Kubernetes issues; coordinate DB, infra, network, security and app teams to isolate the problem.
Deep Kubernetes & Distributed Systems Expertise:
  • Expectation: Design, troubleshoot, and optimize complex Kubernetes clusters and multi-region deployments.
  • Example: Diagnose why inter-cluster communication in a service mesh is causing intermittent API failures and propose architectural fixes.
Automation Framework Design (Infra & Ops):
  • Expectation: Architect automation platforms to reduce manual toil, enable self-service, and support auto-remediation.
  • Example: Build an Ansible/Terraform-based automation pipeline that provisions, configures, and tests new app environments with zero manual steps.
Observability Strategy & Advanced Monitoring:
  • Expectation: Define enterprise-wide observability standards (SLIs/SLOs/SLAs), and implement anomaly detection and predictive monitoring.
  • Example: Roll out a metrics-based SLO framework for all API services with automated burn-rate alerts in Prometheus.
Database & Application Performance Engineering:
  • Expectation: Tune databases, caching layers, and app performance to handle scale.
  • Example: Identify DB query patterns that degrade API performance and recommend schema/index optimizations.
Cross-Domain SME Knowledge (Networking, Storage, APIs):
  • Expectation: Act as a go-to expert across infrastructure layers.
  • Example: Troubleshoot why API gateway latency spikes correlate with storage backend bottlenecks.
AI/ML in Operations (AIOps):
  • Expectation: Integrate AI-driven platforms for anomaly detection, auto-remediation, and incident prediction.
  • Example: Deploy an ML model that predicts storage saturation 24 hours before impact, triggering automated cleanup.
Mentorship & Technical Leadership:
  • Expectation: Act as an SME, guiding Engineer(s)/Sr Engineer(s), creating playbooks, and driving operational excellence.
  • Example: Conduct deep-dive training sessions on advanced Kubernetes troubleshooting for Sr Engineer(s).
TMUS India Private Limited, operating as TMUS Global Solutions, has engaged ANSR, Inc. ("ANSR") as its exclusive recruiting partner. That means that any communications regarding TMUS Global Solutions opportunities or employment offers will be issued only through ANSR and the 1Recruit platform. If you receive a communication or offer from another individual or entity, please notify TMUS Global Solutions immediately.
TMUS Global Solutions will never seek any payment or other compensation during the hiring process or request sensitive personal data (such as bank details or government-issued identification numbers) prior to a candidate's acceptance of a formal offer.

Beware of fraudulent agents! Do not pay money to get a job.

MNCJobsIndia.com will not be responsible for any payment made to a third party. All Terms of Use are applicable.


Job Detail

  • Job Id
    JD5166119
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Hyderabad, Telangana, India
  • Education
    Not mentioned
  • Experience
    Year