Sr Engineer, Site Reliability

Hyderabad, Telangana, India

Job Description

About TMUS Global Solutions
T-Mobile is America's supercharged Un-carrier, challenging conventions and setting new standards in wireless. With the nation's largest and fastest 5G network, T-Mobile delivers advanced connectivity and unmatched value to millions across the U.S. We're unwaveringly obsessed with providing the best possible service experience, driven by a spirit of disruption that fuels competition and innovation in wireless and beyond.
Disclaimer: TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions. TMUS India Private Ltd., and T-Mobile US, Inc., do not provide telecommunication services in India.

About the Role
As a Senior Site Reliability Engineer, you will be a key member of the CFL Platform Engineering and Operations team, where you will play a pivotal role in building and scaling intelligent infrastructure to support AI/ML applications, enterprise services, and LLM-based platforms. You will contribute to the design and implementation of observability frameworks, automation-first operations, and incident response strategies to ensure reliability, performance, and scalability across production systems.
What You'll Do

  • Implement and maintain observability, monitoring, and alerting systems for AI platforms and backend services
  • Design and support telemetry pipelines, logging infrastructure, and dashboards (Splunk, Prometheus, Grafana, OpenTelemetry)
  • Define and monitor SLOs, SLIs, latency, availability, and throughput metrics
  • Participate in on-call rotations, incident resolution, root cause analysis, and postmortems
  • Improve CI/CD workflows and infrastructure automation using GitLab pipelines
  • Optimize and scale infrastructure including Kafka, RabbitMQ, HAProxy, and distributed APIs
  • Collaborate with engineering teams on governance, compliance, and secure operations
  • Support capacity planning, cost analysis, and tuning for high-scale performance
  • Automate repetitive tasks and reduce toil via scripting (Python, Bash, Java)
  • Contribute to runbooks, knowledge base articles, and SRE best practice documentation
  • Mentor junior engineers and support a culture of operational excellence and reliability
What You'll Bring
  • Bachelor's degree in Computer Science, Engineering, or a related technical field
  • 4-7 years in SRE, DevOps, platform, or operations engineering roles
  • Strong hands-on experience in observability, monitoring, and distributed systems troubleshooting
  • Proficiency in scripting languages such as Python, Bash, or PowerShell
  • CI/CD experience with GitLab and automation across deployment pipelines
  • Solid understanding of SQL and NoSQL systems including Oracle DB and MongoDB
  • Familiarity with Kubernetes, container orchestration, and hybrid cloud (Azure, AWS, GCP, OCI)
  • Experience working in high-stakes, incident-driven environments
  • Strong working knowledge of Splunk, Grafana, Prometheus, and other observability tools
  • Understanding of AI/ML systems, inference APIs, and LLM infrastructure is a plus
  • Experience in platform compliance, security enforcement, and regulated domains (finance preferred)
Must Have Skills
  • Application & Microservice: Java, Spring Boot, API & Service Design
  • Any CI/CD Tools: GitLab Pipelines, Test Automation, GitHub Actions, Jenkins, CircleCI
  • App Platform: Docker & Containers (Kubernetes)
  • Any Databases: SQL & NoSQL (Cassandra, Oracle, Snowflake, MongoDB)
  • Any Messaging: Kafka, RabbitMQ
  • Any Observability/Monitoring: Splunk, Grafana, OpenTelemetry, ELK Stack, Datadog, New Relic, Prometheus
  • Incident/Change/Problem Management
Nice To Have
  • Multi-region failover (SQL Server, MongoDB, vendors)
  • Observability platform design (sampling, retention policies)
  • Own domain SLOs and error budgets
  • Perf engineering for latency-sensitive apps
  • Toil automation (SRE bots, operators)


Job Detail

  • Job Id
    JD4754334
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Hyderabad, Telangana, India
  • Education
    Not mentioned
  • Experience
    Year