Cluster Lifecycle Management: Lead the evaluation, planning, configuration, and physical/virtual deployment of multiple large-scale CPU + GPU clusters.
System Administration: Perform expert-level Linux system administration, including kernel tuning, security hardening, and OS lifecycle management (e.g., RHEL, Ubuntu, or Rocky Linux).
Workload Management: Act as the subject matter expert for Slurm, managing complex partitioning, quality of service (QoS) policies, and scheduling optimization for mixed workloads.
Infrastructure Design: Architect and build the physical and logical infrastructure for HPC, including high-speed fabric integration (InfiniBand/Ethernet) and power/cooling planning.
Software Stack & Modules: Maintain and curate the HPC application stack using environment module tools such as Lmod or Tcl Modules, ensuring researchers have access to optimized compilers, libraries (MPI, CUDA), and applications.
GPU Optimization: Specify and tune GPU environments (e.g., NVIDIA H100/B200), focusing on GPUDirect, NVLink topologies, and containerized runtimes like Apptainer/Singularity.
Troubleshooting & Performance: Conduct deep-dive root cause analysis for complex system failures and performance bottlenecks across compute, network, and software layers.
Cross-Functional Leadership: Own infrastructure projects end to end, coordinating closely with Networking (low-latency fabric) and Security (compliance, identity management) to ensure all builds meet enterprise standards.

Experience with GPU-aware MPI implementations and performance profiling tools (e.g., NVIDIA Nsight, TAU).
Knowledge of container orchestration in HPC (e.g., Kubernetes for AI/ML workloads alongside Slurm).
Certifications such as RHCE (Red Hat Certified Engineer) or relevant NVIDIA/InfiniBand technical training.

Education: BS/MS in Computer Science, Electrical Engineering, or a related field.
HPC Experience: 6+ years of hands-on experience managing production-grade HPC clusters.
Scheduler Expertise: Deep proficiency in Slurm administration, including writing custom prolog/epilog scripts and managing GRES (Generic Resources) for GPUs.
Linux Mastery: Advanced knowledge of Linux internals, shell scripting (Bash), and at least one high-level language (Python or Go).
Automation: Extensive experience with configuration management and provisioning tools (e.g., Ansible, Terraform, xCAT, or Warewulf).
Networking: Familiarity with HPC-specific networking such as InfiniBand (NDR/HDR) and RoCE v2.