MLE Inferencing Team - Edge AI
Full Time · Engineering · On-Site
Bengaluru, Karnataka, India
Sarvam.ai is a pioneering generative AI startup headquartered in Bengaluru, India. We are dedicated to leading transformative research and development in the field of language technologies. With a focus on building scalable and efficient Large Language Models (LLMs) that support a wide range of languages, particularly Indic languages, Sarvam.ai aims to reimagine human-computer interaction and build novel AI-driven solutions. Join us as we push the boundaries of AI to create more inclusive and intelligent language processing tools for diverse communities worldwide.
As a Machine Learning Engineer (MLE), you will join the Inferencing Team with a focus on the Edge AI vertical, which builds AI models for on-device deployment. Your role will span distilling models for the edge, optimizing models and the platform for inference latency and resource usage, and building applications that interact with these models. Key responsibilities include:
Research and implement model optimization techniques for LLMs, including quantization, pruning, distillation, and efficient fine-tuning.
Develop and optimize LLM inference pipelines to improve latency and efficiency across CPU/GPU/TPU environments.
Benchmark and profile models to identify performance bottlenecks and implement solutions for inference acceleration.
Deploy scalable and efficient LLM inference solutions on cloud and on-prem infrastructures.
Work with cross-functional teams to integrate optimized models into production systems.
Stay up-to-date with the latest advancements in LLM inference, distributed computing, and AI hardware accelerators.
Maintain and improve code quality, documentation, and experiment tracking for continuous development.
Export these models to ONNX, TFLite, or other on-device formats, with an understanding of model obfuscation for IP protection and security (a minimal export sketch follows this list).
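As an illustration of the quantization and export work described above, here is a minimal sketch, assuming a small stand-in PyTorch module; the `TinyClassifier` class, tensor shapes, and output file name are hypothetical and not part of Sarvam.ai's actual stack.

```python
# Minimal sketch: post-training dynamic INT8 quantization of a small PyTorch
# module, followed by ONNX export. All names and shapes are illustrative.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Stand-in for a distilled edge model (hypothetical)."""
    def __init__(self, dim: int = 128, num_classes: int = 10):
        super().__init__()
        self.fc1 = nn.Linear(dim, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyClassifier().eval()

# Dynamic quantization stores Linear weights as INT8; activations stay in
# floating point and are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Quick sanity check: quantized and float outputs should stay close.
dummy_input = torch.randn(1, 128)
with torch.no_grad():
    print(torch.allclose(model(dummy_input), quantized(dummy_input), atol=1e-1))

# Export the float model to ONNX for downstream runtimes such as
# ONNX Runtime or TensorRT; those runtimes bring their own INT8 tooling.
torch.onnx.export(
    model,
    dummy_input,
    "tiny_classifier.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},
    opset_version=17,
)
```

Dynamic quantization is shown here because it needs no calibration data; static INT8 or hardware-specific toolchains (TensorRT, TFLite converters) would typically follow the same export path.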
Experience: 2-3 years in ML engineering, with a focus on model inference and optimization.
Education: Bachelor's or Master's degree in Computer Science, AI/ML, Data Science, or a related field.
ML Frameworks: Proficiency in PyTorch or TensorFlow for model training and deployment.
Model Optimization: Hands-on experience with quantization (INT8, FP16), pruning, and knowledge distillation.
Inference Acceleration: Experience with ONNX, TensorRT, DeepSpeed, or Hugging Face Optimum for optimizing inference workloads.
Cloud & Deployment: Experience deploying ML models on AWS, Azure, or GCP using cloud-native ML tools.
Profiling & Benchmarking: Familiarity with NVIDIA Nsight, PyTorch Profiler, or TensorBoard for analyzing model performance (a short profiling sketch follows this list).
Problem-Solving: Strong analytical skills to troubleshoot ML model efficiency and deployment challenges.
Experience with distributed training and inference frameworks (e.g., vLLM, DeepSpeed, FSDP).
Understanding of GPU/TPU optimizations, CUDA programming, or low-level ML hardware acceleration.
Familiarity with edge and offline model deployment strategies.
Contributions to open-source projects related to ML inference, LLMs, or optimization.
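To illustrate the profiling familiarity listed above, here is a minimal sketch using PyTorch Profiler; the toy model, input shape, and sort key are assumptions chosen only for illustration.

```python
# Minimal sketch: profiling a forward pass with torch.profiler to surface
# inference bottlenecks. The model and input shape are illustrative.
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512)).eval()
x = torch.randn(8, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("inference"):
        with torch.no_grad():
            model(x)

# Print the operators that dominate runtime; the same trace can also be
# exported for TensorBoard or Chrome tracing.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

The same `profile` context can emit TensorBoard logs via `on_trace_ready`, which is a common next step when a single-operator table is not enough to localize a bottleneck.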