Position Overview:
As Lead/Staff AI Runtime Engineer, you'll play a pivotal role in the design, development, and
optimization of the core runtime infrastructure that powers distributed training and
deployment of large AI models (LLMs and beyond).
This is a hands-on leadership role - perfect for a systems-minded software engineer who
thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure.
You'll own critical components of our PyTorch-based stack, lead technical direction, and
collaborate across engineering, research, and product to push the boundaries of elastic,
fault-tolerant, high-performance model execution.
What you'll do:
Lead Runtime Design & Development
• Own the core runtime architecture supporting AI training and inference at scale.
• Design resilient and elastic runtime features (e.g. dynamic node scaling, job recovery)
within our custom PyTorch stack.
• Optimize distributed training reliability, orchestration, and job-level fault tolerance.
Drive Performance at Scale
• Profile and enhance low-level system performance across training and inference pipelines.
• Improve packaging, deployment, and integration of customer models in production
environments.
• Ensure consistent throughput, latency, and reliability metrics across multi-node, multi-GPU
setups.
Build Internal Tooling & Frameworks
• Design and maintain libraries and services that support the model lifecycle: training,
checkpointing, fault recovery, packaging, and deployment.
• Implement observability hooks, diagnostics, and resilience mechanisms for deep learning
workloads.
• Champion best practices in CI/CD, testing, and software quality across the AI Runtime
stack.
Collaborate & Mentor
• Work cross-functionally with Research, Infrastructure, and Product teams to align runtime
development with customer and platform needs.
• Guide technical discussions, mentor junior engineers, and help scale the AI Runtime
team's capabilities.
What you'll need to be successful:
• 5+ years of experience in systems/software engineering, with deep exposure to AI runtimes,
distributed systems, or compiler/runtime interaction.
• Experience delivering PaaS services.
• Proven experience optimizing and scaling deep learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
• Strong programming skills in Python and C++ (Go or Rust is a plus).
• Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.
• Experience working with multi-GPU, multi-node, or cloud-native AI workloads.
• Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.
Bonus Points:
• Contributions to PyTorch internals or open-source DL infrastructure projects.
• Familiarity with LLM training pipelines, checkpointing, or elastic training orchestration.
• Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestrators.
• Background in systems research, compilers, or runtime architecture for HPC or ML.
• Previous start-up experience.
Job Type: Contractual / Temporary
Contract length: 6 months
Pay: ₹50,000.00 - ₹80,000.00 per month
Work Location: In person