infrastructure layer behind large-scale LLM training and inference.
If your strength lies in systems, performance tuning, reliability, and distributed runtimes, this role is for you. If you primarily work on model experimentation or notebooks, this role is not a fit.
What You'll Own
AI Runtime Architecture
- Design and own runtime infrastructure for distributed training and inference
- Build elastic, fault-tolerant systems (scaling, retries, recovery)
- Strengthen orchestration of PyTorch-based distributed workloads
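To give a flavor of the resilience work this group of responsibilities describes, here is a minimal sketch of a retry-with-checkpoint-recovery loop. All names (`step_fn`, `load_checkpoint`, `save_checkpoint`) are hypothetical stand-ins, not part of any framework mentioned in the posting:

```python
def run_with_recovery(step_fn, load_checkpoint, save_checkpoint,
                      total_steps, max_retries=3):
    """Run a training loop that resumes from the last checkpoint on failure.

    Hypothetical callables: step_fn(state, step) advances training one step;
    load_checkpoint() -> (state, step); save_checkpoint(state, step) persists
    a durable progress marker.
    """
    retries = 0
    state, step = load_checkpoint()            # resume point
    while step < total_steps:
        try:
            state = step_fn(state, step)
            save_checkpoint(state, step + 1)   # record completed step
            step += 1
            retries = 0                        # reset after a good step
        except RuntimeError:                   # e.g. a simulated worker crash
            retries += 1
            if retries > max_retries:
                raise
            state, step = load_checkpoint()    # roll back to last good step
    return state
```

A real system would add backoff, distributed coordination, and atomic checkpoint writes; this only illustrates the resume-on-failure shape.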
Performance & Systems Engineering
- Profile and optimize latency, throughput, and GPU utilization
- Tune multi-GPU / multi-node training and inference pipelines
- Debug low-level issues across runtime, memory, and networking
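As a sketch of the measurement discipline behind this group, a simplified latency/throughput profiler (pure Python; real GPU timing additionally requires device synchronization, e.g. `torch.cuda.synchronize()`, before reading the clock):

```python
import time
from statistics import mean, quantiles

def profile_fn(fn, warmup=5, iters=50):
    """Measure mean and p95 latency plus throughput for a callable.

    A hypothetical helper for illustration, not a library API.
    """
    for _ in range(warmup):                  # discard cold-start effects
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    p95 = quantiles(samples, n=20)[-1]       # last of 19 cut points ~ p95
    return {"mean_s": mean(samples),
            "p95_s": p95,
            "throughput_per_s": iters / sum(samples)}
```

Warming up before measuring and reporting a tail percentile, not just the mean, are the two habits that carry over directly to GPU pipeline tuning.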
Platform & Tooling
- Build internal frameworks for training, checkpointing, recovery, and deployment
- Implement observability, diagnostics, and resilience
- Drive CI/CD and production-readiness standards for AI runtime systems
Technical Leadership
- Own technical direction and delivery
- Mentor engineers through code reviews and architecture discussions
- Collaborate with infra, research, and product teams
Mandatory Requirements (Non-Negotiable)
- 4+ years of strong software / systems engineering experience
- 1+ year owning AI runtime infrastructure (distributed training or inference)
- Hands-on PyTorch runtime optimization (mandatory)
- Proven low-level performance engineering experience
- Strong Python and C++ skills (Java acceptable)
- Prior experience leading or mentoring engineers
Good to Have
- Kubernetes, Ray, TorchElastic, or custom orchestration frameworks
- LLM training pipelines, fine-tuning, checkpointing, elastic training
- Multi-GPU, multi-node cloud-native workloads
- Job scheduling, failure recovery, production-grade runtime systems
Job Type: Full-time
Pay: ₹1,200,000.00 - ₹1,600,000.00 per year
Work Location: In person