Site Reliability Engineering IC4

Hyderabad, Telangana, India

https://www.mncjobsindia.com/company/microsoft

Job Description

Reliability: Ensure the reliability, scalability, and security of AI infrastructure supporting HPC & AI workloads.
Incident Management: Lead incident response, root cause analysis, and continuous improvement to minimize downtime and optimize service availability.
Performance Optimization: Identify and resolve bottlenecks in compute, storage, networking, and specialized hardware (GPUs, InfiniBand) to enhance AI system performance.
Infrastructure Automation: Develop and maintain automation tools for deployment, monitoring, predictive analysis, and management of AI infrastructure, including containerized environments (Kubernetes, Docker).
Technical Leadership: Provide technical guidance in cloud and AI infrastructure technologies, collaborating with cross-functional teams to drive innovation and best practices.
Customer Advocacy: Act as a customer advocate, focusing on service excellence and live site reliability for AI workloads.
Research & Innovation: Stay informed on emerging AI infrastructure technologies and industry trends, recommending adoption where beneficial.

6+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.
5+ years of hands-on experience developing and supporting infrastructure services for AI or cloud platforms.
Proven ability to modify componentized, well-architected infrastructure software and collaborate across teams.
1+ years experience with incident management and reliability engineering in cloud or AI environments.
Excellent interpersonal, communication, and collaboration skills.

7+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration.
Experience in distributed systems and/or cloud platforms (Azure, Kubernetes, Docker, containers ecosystem).
Experience with GPUs, InfiniBand, or similar high-performance technologies.
Proficiency in RDMA (Remote Direct Memory Access), MPI (Message Passing Interface), and high-performance computing architecture.
Proficiency in scripting (PowerShell, shell scripts, etc.) and deep expertise in Linux.

Skills Required

Engineering

Job Detail

Job Id: JD4958883
Industry: Not mentioned
Total Positions: 1
Job Type: Full Time
Salary: Not mentioned
Employment Status: Permanent
Job Location: Hyderabad, Telangana, India
Education: Not mentioned
