Reliability: Ensure the reliability, scalability, and security of AI infrastructure supporting HPC & AI workloads.
Incident Management: Lead incident response, root cause analysis, and continuous improvement to minimize downtime and optimize service availability.
Performance Optimization: Identify and resolve bottlenecks in compute, storage, networking, and specialized hardware (GPUs, InfiniBand) to enhance AI system performance.
Infrastructure Automation: Develop and maintain automation tools for deployment, monitoring, predictive analysis and management of AI infrastructure, including containerized environments (Kubernetes, Docker).
Technical Leadership: Provide technical guidance in cloud and AI infrastructure technologies, collaborating with cross-functional teams to drive innovation and best practices.
Customer Advocacy: Act as a customer advocate, focusing on service excellence and live site reliability for AI workloads.
Research & Innovation: Stay informed on emerging AI infrastructure technologies and industry trends, recommending adoption where beneficial.

6+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.
5+ years of hands-on experience developing and supporting infrastructure services for AI or cloud platforms.
Proven ability to modify componentized, well-architected infrastructure software and collaborate across teams.
1+ years experience with incident management and reliability engineering in cloud or AI environments.
Excellent interpersonal, communication, and collaboration skills.

7+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration.
Experience in distributed systems and/or cloud platforms (Azure, Kubernetes, Docker, containers ecosystem).
Experience with GPUs, InfiniBand, or similar high-performance technologies.
Proficiency in RDMA (Remote Direct Memory Access), MPI (Message Passing Interface), and high-performance computing architecture.
Proficiency in scripting (PowerShell, shell script, etc.) and deep expertise in Linux.