Company: Qualcomm India Private Limited
Job Area: Engineering Group > Software Engineering
General Summary:
We are seeking a **Cluster Networking & Observability Engineer** to specialize in high-performance networking and observability for AI inference clusters. This role ensures low-latency communication and robust telemetry systems.
**Key Responsibilities**
- Design and maintain RoCE/RDMA-based networking for AI clusters.
- Configure and troubleshoot datacenter network components.
- Implement and maintain telemetry systems using Prometheus and OpenTelemetry.
- Manage **Kubernetes and Slurm cluster networking aspects**.
- Develop automation for network configuration and monitoring.
**Required Qualifications**
- Bachelor's or Master's in Computer Science, Electrical Engineering, or related field.
- 3-5 years of experience in networking or HPC environments.
- Solid understanding of datacenter networking and RoCE/RDMA.
- Understanding of IPMI, SNMP, and other hardware management protocols.
- Experience with telemetry and observability tools (Prometheus, OpenTelemetry).
- Proficiency in **Python and Shell scripting**.
- Familiarity with Linux networking stack and performance tuning.
- Exposure to cloud platforms (AWS, Azure, GCP) and hybrid deployments.
- **Hands-on experience managing Kubernetes and Slurm clusters**.
- Strong software engineering background.
Minimum Qualifications:
- Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Engineering or related work experience.
OR
- Master's degree in Engineering, Information Systems, Computer Science, or related field and 1+ year of Software Engineering or related work experience.
OR
- PhD in Engineering, Information Systems, Computer Science, or related field.