
MLE Inferencing Team - Edge AI

Full Time · Engineering · On-Site

Bengaluru, Karnataka, India

Company Overview

Sarvam.ai is a pioneering generative AI startup headquartered in Bengaluru, India. We are dedicated to leading transformative research and development in the field of language technologies. With a focus on building scalable and efficient Large Language Models (LLMs) that support a wide range of languages, particularly Indic languages, Sarvam.ai aims to reimagine human-computer interaction and build novel AI-driven solutions. Join us as we push the boundaries of AI to create more inclusive and intelligent language processing tools for diverse communities worldwide.

Job Summary

As a Machine Learning Engineer (MLE), you will join the Inferencing Team's Edge AI vertical, which builds AI models for on-device deployment. Your role will span distilling models for the edge, optimizing models and platforms for inference latency and resource usage, and building applications that interact with these models.

Key Responsibilities

  • Research and implement model optimization techniques for LLMs, including quantization, pruning, distillation, and efficient fine-tuning.

  • Develop and optimize LLM inference pipelines to improve latency and efficiency across CPU/GPU/TPU environments.

  • Benchmark and profile models to identify performance bottlenecks and implement solutions for inference acceleration.

  • Deploy scalable and efficient LLM inference solutions on cloud and on-prem infrastructures.

  • Work with cross-functional teams to integrate optimized models into production systems.

  • Stay up-to-date with the latest advancements in LLM inference, distributed computing, and AI hardware accelerators.

  • Maintain and improve code quality, documentation, and experiment tracking for continuous development.

  • Export models to ONNX, TFLite, or other formats, applying an understanding of model obfuscation for IP protection and related security considerations (a minimal quantization and export sketch follows this list).
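
As an illustration of the optimization and export work described above, here is a minimal sketch, assuming a hypothetical toy model, of dynamic INT8 quantization and ONNX export in PyTorch. This is not Sarvam.ai's actual pipeline; the model, dimensions, and file name are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for a distilled edge model.
class TinyClassifier(nn.Module):
    def __init__(self, dim=128, classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, classes)
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()

# Dynamic INT8 quantization: weights are stored as int8,
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Export to ONNX for runtimes such as ONNX Runtime. Note that
# dynamically quantized modules are not directly ONNX-exportable,
# so this export uses the original float model.
dummy = torch.randn(1, 128)
torch.onnx.export(
    model, dummy, "tiny_classifier.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},
)
```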

Must-Have Qualifications

  • Experience: 2-3 years in ML engineering, with a focus on model inference and optimization.

  • Education: Bachelor's or Master's degree in Computer Science, AI/ML, Data Science, or a related field.

  • ML Frameworks: Proficiency in PyTorch or TensorFlow for model training and deployment.

  • Model Optimization: Hands-on experience with quantization (INT8, FP16), pruning, and knowledge distillation.

  • Inference Acceleration: Experience with ONNX, TensorRT, DeepSpeed, or Hugging Face Optimum for optimizing inference workloads.

  • Cloud & Deployment: Experience deploying ML models on AWS, Azure, or GCP using cloud-native ML tools.

  • Profiling & Benchmarking: Familiarity with NVIDIA Nsight, PyTorch Profiler, or TensorBoard for analyzing model performance (see the profiler sketch after this list).

  • Problem-Solving: Strong analytical skills to troubleshoot ML model efficiency and deployment challenges.
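
To make the profiling expectation concrete, here is a minimal sketch using the PyTorch Profiler to rank operators by CPU time; the model, batch shape, and iteration count are illustrative assumptions, not a prescribed workflow.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model and input standing in for a real inference workload.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 512)
).eval()
x = torch.randn(32, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("inference"):
        with torch.no_grad():
            for _ in range(100):
                model(x)

# Rank operators by self CPU time to spot hotspots.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```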

Preferred Qualifications

  • Experience with distributed training and inference frameworks (e.g., vLLM, DeepSpeed, FSDP); a brief vLLM sketch follows this list.

  • Understanding of GPU/TPU optimizations, CUDA programming, or low-level ML hardware acceleration.

  • Familiarity with edge and offline model deployment strategies.

  • Contributions to open-source projects related to ML inference, LLMs, or optimization.
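
For the inference-framework item above, here is a minimal offline-inference sketch with vLLM; the checkpoint name and sampling settings are placeholder assumptions chosen to keep the example small.

```python
from vllm import LLM, SamplingParams

# Small placeholder checkpoint; a production deployment would use a
# larger, task-appropriate model.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.7, max_tokens=64)

# Batched generation: vLLM schedules and batches requests internally.
outputs = llm.generate(["Explain INT8 quantization in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```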
