BS or MS degree in CS or a related engineering or science field, with 3+ years of relevant experience.
Experience with benchmarking, and with troubleshooting or optimizing system performance.
Experience with coding, scripting, and automation.
Background in networking.
General Linux skills.
Demonstrated ability to lead complex projects, independently resolve ambiguity, collaborate with stakeholders across teams, and communicate effectively.
Desired qualifications:
Experience working on clusters, e.g., running HPC/AI workloads, or maintaining an HPC/AI system.
Experience troubleshooting or tuning performance on distributed systems.
Familiarity with elements of the AI/HPC software stack such as job schedulers (e.g., Slurm); NCCL, RCCL, or MPI; or ML frameworks.
Experience with RDMA networking, i.e., RoCE or InfiniBand.
Experience architecting or developing solutions on a public cloud platform.
Responsibilities:
Carry out performance studies on GPU clusters, with a focus on AI/ML workload performance, network performance, and tuning.
Design and code solutions for performance benchmarking.
Troubleshoot performance problems on RDMA clusters and perform cluster performance validation, including on novel systems that are not yet fully understood.
Document new tools and procedures to a high standard.
Write whitepapers to disseminate findings of performance studies.
Participate in architecture design and review, code review, and contribute to roadmap development.
Mentor junior engineers.
Participate in operational rotations.