Machine Learning Engineer - Inference (2-3 years)

Full Time · Engineering

Bengaluru, Karnataka, India

Company Overview

Sarvam.ai is a pioneering generative AI startup headquartered in Bengaluru, India. We are dedicated to leading transformative research and development in the field of language technologies. With a focus on building scalable and efficient Large Language Models (LLMs) that support a wide range of languages, particularly Indic languages, Sarvam.ai aims to reimagine human-computer interaction and build novel AI-driven solutions. Join us as we push the boundaries of AI to create more inclusive and intelligent language processing tools for diverse communities worldwide.

Job Summary

We are looking for an experienced Machine Learning Engineer specializing in model inference and optimization to join our team. This role focuses on improving the efficiency and scalability of LLMs in production, including model deployment, quantization, and inference acceleration. The ideal candidate will have 2-3 years of experience working with ML frameworks such as PyTorch or TensorFlow, a deep understanding of neural network architectures, and a strong interest in LLM inference optimization.

Key Responsibilities

  • Research and implement model optimization techniques for LLMs, including quantization, pruning, distillation, and efficient fine-tuning (see the illustrative sketch after this list).

  • Develop and optimize LLM inference pipelines to improve latency and efficiency across CPU/GPU/TPU environments.

  • Benchmark and profile models to identify performance bottlenecks and implement solutions for inference acceleration.

  • Deploy scalable and efficient LLM inference solutions on cloud and on-prem infrastructure.

  • Work with cross-functional teams to integrate optimized models into production systems.

  • Stay up to date with the latest advancements in LLM inference, distributed computing, and AI hardware accelerators.

  • Maintain and improve code quality, documentation, and experiment tracking for continuous development.
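To give candidates a feel for the optimization work described above, here is a minimal sketch of post-training dynamic quantization in PyTorch. The two-layer toy model is a hypothetical stand-in for a transformer feed-forward block, not Sarvam.ai's production code:

```python
import torch
import torch.nn as nn

# Toy feed-forward block; a hypothetical stand-in for one LLM layer.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
)
model.eval()

# Post-training dynamic quantization: weights are stored as INT8 and
# activations are quantized on the fly, which shrinks memory use and
# speeds up CPU inference for linear-heavy transformer workloads.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.no_grad():
    baseline = model(x)
    fast = quantized(x)

# Quantization trades a little accuracy for efficiency, so checking
# output error against the FP32 baseline is a routine sanity step.
print("max abs error:", (baseline - fast).abs().max().item())
```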

Must-Have Qualifications

  • Experience: 2-3 years in ML engineering, with a focus on model inference and optimization.

  • Education: Bachelor's or Master's degree in Computer Science, AI/ML, Data Science, or a related field.

  • ML Frameworks: Proficiency in PyTorch or TensorFlow for model training and deployment.

  • Model Optimization: Hands-on experience with quantization (INT8), reduced-precision inference (FP16), pruning, and knowledge distillation.

  • Inference Acceleration: Experience with ONNX, TensorRT, DeepSpeed, or Hugging Face Optimum for optimizing inference workloads.

  • Cloud & Deployment: Experience deploying ML models on AWS, Azure, or GCP using cloud-native ML tools.

  • Profiling & Benchmarking: Familiarity with NVIDIA Nsight, PyTorch Profiler, or TensorBoard for analyzing model performance (a short profiling sketch follows this list).

  • Problem-Solving: Strong analytical skills to troubleshoot ML model efficiency and deployment challenges.
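As a concrete example of the profiling familiarity listed above, the sketch below uses the PyTorch Profiler to rank operators by CPU time; the model and input shapes are placeholders:

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

model = torch.nn.Linear(4096, 4096).eval()  # placeholder for a real model
x = torch.randn(8, 4096)

# Profile a few forward passes; add ProfilerActivity.CUDA when on GPU.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("inference"), torch.no_grad():
        for _ in range(10):
            model(x)

# Rank operators by self CPU time to spot bottlenecks worth optimizing.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```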

Preferred Qualifications

  • Experience with distributed training and inference frameworks (e.g., vLLM, DeepSpeed, FSDP); a minimal serving sketch follows this list.

  • Understanding of GPU/TPU optimizations, CUDA programming, or low-level ML hardware acceleration.

  • Familiarity with edge and offline model deployment strategies.

  • Contributions to open-source projects related to ML inference, LLMs, or optimization.
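For reference, here is a minimal sketch of serving a model with vLLM, one of the frameworks named above. It assumes a GPU environment with vLLM installed, and the model identifier is an arbitrary example rather than anything Sarvam.ai uses:

```python
from vllm import LLM, SamplingParams  # requires: pip install vllm

# vLLM batches concurrent requests and uses PagedAttention to keep GPU
# memory utilization high, which is what makes it attractive for serving.
llm = LLM(model="facebook/opt-125m")  # any Hugging Face causal LM id works
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Write one sentence about Bengaluru."], params)
for out in outputs:
    print(out.outputs[0].text)
```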
