
Machine Learning Engineer – Computer Vision & VLM

Full Time · Engineering · On-Site

Bengaluru, Karnataka, India

Machine Learning Engineer – Computer Vision & Vision‑Language Models (VLMs)


About Sarvam AI

Sarvam AI is a pioneering generative‑AI startup headquartered in Bengaluru, India. We are dedicated to transformative R&D in language technologies, building scalable and efficient Large Language Models (LLMs) that serve a wide spectrum of languages, especially Indic languages. Our mission is to re‑imagine human‑computer interaction and craft novel AI‑driven solutions that make language technology inclusive for diverse communities worldwide.


Role Overview

As a Machine Learning Engineer (MLE) in the Vision‑Language team, you will build and refine vision, OCR, and language models for a range of use cases. Your work will span research, scalable training, and rigorous evaluation of cutting‑edge computer‑vision and VLM systems.


Key Responsibilities

  • Model R&D

    • Prototype and fine‑tune state‑of‑the‑art vision architectures and vision‑language models.

    • Design and evaluate multimodal fusion strategies for robust image–text understanding.

  • Data & Training Pipelines

    • Build distributed pipelines (PySpark / Ray) to curate and preprocess large‑scale multimodal datasets (images, geospatial rasters, PDFs, video frames, captions).

    • Implement efficient training loops in PyTorch/Lightning with mixed precision, gradient accumulation, and multi‑GPU (≥ 4) parallelism.

  • Domain‑Focused Applications

    • Develop models for geospatial analysis, Indic document intelligence (OCR + layout), visual question answering (VQA), and broader computer‑vision use cases.

  • Evaluation & Benchmarking

    • Define and automate task‑specific metrics for OCR accuracy, retrieval, dense captioning, and VQA; maintain regression dashboards and ablation suites.


Required Qualifications

  • Experience: 2–3 years in ML engineering with emphasis on classical computer vision and modern vision‑language models.

  • Education: Bachelor’s or Master’s in Computer Science, AI/ML, or related fields.

  • Technical Skills

    • Strong Python & PyTorch; comfortable with CUDA profiling and tensor debugging.

    • Hands‑on experience training CV models (CNNs, ViTs) and/or VLMs on ≥ 4‑GPU nodes.

    • Proven ability to build, deploy, and monitor pipelines for OCR, object detection, and segmentation.

    • Solid grasp of computer‑vision fundamentals (detection, segmentation, representation learning) and transformer mechanics.

    • Software‑Engineering Fundamentals:

      • Proficiency with Git, unit tests, structured logging, Docker, and CI/CD.

      • Ability to select and integrate appropriate databases (SQL, NoSQL, vector stores) for large‑scale multimodal data.

      • Experience designing scalable backend APIs/micro‑services (gRPC/REST), including monitoring and observability best practices.


Preferred Qualifications

  • Publications or submissions at CVPR, ICCV, ECCV, EMNLP, or ACL.

  • Prior work on multilingual or low‑resource vision‑language tasks.

  • Experience with data‑centric AI (active learning, synthetic augmentation).

  • Contributions to open‑source vision/NLP libraries (Hugging Face, OpenCV, Detectron2).

  • Familiarity with distributed schedulers (Kubeflow, Slurm).
