Machine Learning Engineer – Computer Vision & VLMs

Bangalore, Karnataka, India

Job Description

Machine Learning Engineer - Computer Vision & Vision-Language Models (VLMs)
About Sarvam AI
Sarvam.ai is a pioneering generative-AI startup headquartered in Bengaluru, India. We are dedicated to transformative R&D in language technologies, building scalable and efficient Large Language Models (LLMs) that serve a wide spectrum of languages, especially Indic languages. Our mission is to reimagine human-computer interaction and craft novel AI-driven solutions that make language technology inclusive for diverse communities worldwide.
Role Overview
As a Machine Learning Engineer (MLE) on the Vision-Language team, you will build and refine vision, OCR, and language models for a range of use cases. Your work will span research, scalable training, and rigorous evaluation of cutting-edge computer-vision and VLM systems.
Key Responsibilities
Model R&D
Prototype and fine-tune state-of-the-art vision architectures and vision-language models.
Design and evaluate multimodal fusion strategies for robust image-text understanding.
Data & Training Pipelines
Build distributed pipelines (PySpark / Ray) to curate and preprocess large-scale multimodal datasets (images, geospatial rasters, PDFs, video frames, captions).
Implement efficient training loops in PyTorch/Lightning with mixed precision, gradient accumulation, and multi-GPU parallelism (≥ 4 GPUs).
Domain-Focused Applications
Develop models for geospatial analysis, Indic document intelligence (OCR + layout), visual question answering (VQA), and broader computer-vision use cases.
Evaluation & Benchmarking
Define and automate task-specific metrics for OCR accuracy, retrieval, dense captioning, and VQA; maintain regression dashboards and ablation suites.
Required Qualifications
Experience: 2-3 years in ML engineering, with an emphasis on classical computer vision and modern vision-language models.
Education: Bachelor's or Master's in Computer Science, AI/ML, or related fields.
Technical Skills
Strong Python and PyTorch skills; comfortable with CUDA profiling and tensor debugging.
Hands-on experience training CV models (CNNs, ViTs) and/or VLMs on nodes with ≥ 4 GPUs.
Proven ability to build, deploy, and monitor pipelines for OCR, object detection, and segmentation.
Solid grasp of computer-vision fundamentals (detection, segmentation, representation learning) and transformer mechanics.
Software-Engineering Fundamentals
Proficiency with Git, unit tests, structured logging, Docker, and CI/CD.
Ability to select and integrate appropriate databases (SQL, NoSQL, vector stores) for large-scale multimodal data.
Experience designing scalable backend APIs and microservices (gRPC/REST), including monitoring and observability best practices.
Preferred Qualifications
Publications or submissions at CVPR, ICCV, ECCV, EMNLP, or ACL.
Prior work on multilingual or low-resource vision-language tasks.
Experience with data-centric AI (active learning, synthetic augmentation).
Contributions to open-source vision/NLP libraries (Hugging Face, OpenCV, Detectron2).
Familiarity with distributed schedulers (Kubeflow, Slurm).


Job Detail

  • Job Id: JD4146191
  • Industry: Not mentioned
  • Total Positions: 1
  • Job Type: Full Time
  • Salary: Not mentioned
  • Employment Status: Permanent
  • Job Location: Bangalore, Karnataka, India
  • Education: Not mentioned
  • Experience: Year