SRE AI/ML Support Engineer

MH, IN, India

Job Description

SRE - AI/ML Support Engineer - JD

We are hiring an SRE (Site Reliability Engineer) AI/ML Support Engineer for our enterprise-grade, high-performance supercomputing platform. We help enterprises and service providers build AI inference platforms for their end users, powered by our state-of-the-art RDU (Reconfigurable Dataflow Unit) hardware architecture. Our cloud-agnostic, enterprise-grade MLOps platform abstracts infrastructure complexity and enables seamless deployment, management, and scaling of foundation model workloads at production scale. You'll contribute to the core of our enterprise-grade AI platform, collaborating across teams to ensure our systems are performant, secure, and built to last. This is a high-impact, high-visibility role at the intersection of AI infrastructure, enterprise software, and developer experience.

Minimum Requirements:

  • Foundational ML knowledge with hands-on experience working with machine learning models, especially large language models (LLMs) and LLM APIs
  • Strong programming skills in Python, including working with ML frameworks (PyTorch, Hugging Face, LangChain, etc.) as well as building scripts and automation
  • Solid understanding of Generative AI concepts (such as RAG) and their applied use cases
  • Exposure to Linux systems and familiarity with troubleshooting environment/setup issues
  • Ability to investigate, triage, and resolve customer or internal issues related to ML workflows, APIs, and AI-based applications
  • Experience with issue tracking, documentation, and collaboration platforms (e.g., ticketing systems, project tracking tools, knowledge bases)
  • Proficiency with Docker for containerization and shell scripting for system automation
  • Good communication and collaboration skills to work with cross-functional teams as well as external customers or stakeholders

Nice to have:

  • Familiarity with multi-modal models (e.g., Llama 4 Maverick)
  • Familiarity with MLOps practices - monitoring and observability, with exposure to related libraries and frameworks such as OpenSearch, Prometheus, and Grafana
  • Strong hands-on exposure to Linux system administration and network administration, including troubleshooting, system monitoring, and performance optimization
  • Experience working with Kubernetes (on-prem deployments preferred) for managing containerized ML workloads
  • Exposure to one or more public cloud platforms (AWS, GCP, Azure, etc.)
  • Strong customer-facing communication skills to handle escalations, reliability concerns, and solution discussions with stakeholders and clients in a B2B environment

Ways to stand out from the crowd:

  • Prior experience working with APIs and SDKs of major LLM providers (OpenAI, Anthropic, Hugging Face, etc.)
  • Demonstrated ability to resolve complex issues in production ML systems
  • Knowledge of fine-tuning, prompt engineering, and optimizing LLM usage in production

Job Type: Full-time

Pay: ₹500,000.00 - ₹1,719,712.72 per year

Benefits:

Provident Fund

Work Location: In person



Job Detail

  • Job Id
    JD4313400
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    MH, IN, India
  • Education
    Not mentioned
  • Experience
    Year