Machine Learning Engineer - Speech AI (ASR & TTS) (2-3 years)
Full Time · Engineering
Bengaluru, Karnataka, India
Sarvam.ai is a pioneering generative AI startup headquartered in Bengaluru, India. We are dedicated to leading transformative research and development in speech and language technologies. With a focus on building state-of-the-art ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) models, particularly for Indic languages, Sarvam.ai aims to redefine human-computer interaction through cutting-edge AI-driven speech solutions. Join us in pushing the boundaries of Speech AI to create inclusive, scalable, and intelligent voice-based applications for diverse communities worldwide.
We are looking for an experienced Machine Learning Engineer specializing in Speech AI (ASR & TTS) to develop and optimize speech recognition and synthesis models. The role involves building deep learning-based ASR and TTS models, improving their accuracy, efficiency, and multilingual capabilities, and deploying them at scale. The ideal candidate has 2-3 years of experience in speech processing, deep learning, and ASR/TTS optimization, along with proficiency in ML frameworks such as PyTorch or TensorFlow.

Key Responsibilities:
Develop, train, and optimize speech-to-text models using architectures like Wav2Vec, Whisper, Conformer, and DeepSpeech.
Implement techniques for low-latency ASR inference, including beam search, language-model (LM) integration, and real-time transcription (a minimal inference sketch follows the ASR items below).
Improve speech recognition accuracy for low-resource and Indic languages using transfer learning and data augmentation.
Optimize ASR pipelines for noise robustness, speaker adaptation, and domain-specific transcription.
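To make the ASR side of the role concrete, here is a minimal inference sketch using the Hugging Face transformers pipeline with a public Whisper checkpoint. The checkpoint name, language code, beam width, and audio file are illustrative assumptions, not specifics of our stack:

    # Minimal ASR inference sketch (Hugging Face transformers pipeline).
    # The checkpoint, language code, and audio path below are illustrative.
    import torch
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-small",               # any Whisper checkpoint
        device=0 if torch.cuda.is_available() else -1,
    )

    # chunk_length_s enables long-form transcription over a sliding window;
    # generate_kwargs forwards decoding options (here: 5-beam search, Hindi).
    result = asr(
        "sample_utterance.wav",
        chunk_length_s=30,
        generate_kwargs={"task": "transcribe", "language": "hi", "num_beams": 5},
    )
    print(result["text"])

The num_beams option maps onto the beam-search point above, and changing the language code is a quick way to probe other Indic languages.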
Develop and fine-tune neural TTS models such as Tacotron, FastSpeech, VITS, or WaveNet for high-quality, natural-sounding speech synthesis (a minimal synthesis sketch follows the TTS items below).
Implement multilingual and expressive TTS models with prosody and emotion control.
Optimize TTS inference for deployment on edge devices, mobile, and cloud platforms.
Improve speech synthesis quality through techniques like voice cloning, neural vocoders (HiFi-GAN, WaveGlow), and prosody modeling.
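Similarly, a minimal synthesis sketch with a VITS-style checkpoint from the Hugging Face Hub; the MMS-TTS Hindi checkpoint, input sentence, and output file name are illustrative assumptions rather than project details:

    # Minimal TTS inference sketch with a VITS-style checkpoint.
    # Checkpoint, text, and output path are illustrative placeholders.
    import torch
    import scipy.io.wavfile
    from transformers import AutoTokenizer, VitsModel

    model = VitsModel.from_pretrained("facebook/mms-tts-hin")
    tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-hin")

    inputs = tokenizer("नमस्ते, आप कैसे हैं?", return_tensors="pt")  # "Hello, how are you?"
    with torch.no_grad():
        waveform = model(**inputs).waveform          # (batch, num_samples)

    scipy.io.wavfile.write(
        "tts_out.wav",
        rate=model.config.sampling_rate,             # 16 kHz for MMS-TTS
        data=waveform.squeeze().numpy(),
    )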
Benchmark and profile ASR/TTS models to improve latency, efficiency, and deployment performance.
Deploy scalable speech AI APIs on AWS, Azure, or GCP for real-world applications.
Optimize ASR & TTS models for edge and offline inference (a quantization sketch follows these items).
Stay up to date with advances in speech AI, neural vocoders, and real-time inference techniques.
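On the deployment and optimization side, a common path is exporting a trained model to ONNX and applying post-training quantization. A minimal sketch with ONNX Runtime follows; the file names are placeholders for a model already exported with torch.onnx.export or the optimum library:

    # Sketch: post-training dynamic INT8 quantization with ONNX Runtime.
    # "asr_encoder.onnx" is a placeholder for a previously exported model.
    import onnxruntime as ort
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input="asr_encoder.onnx",              # FP32 source model
        model_output="asr_encoder.int8.onnx",        # quantized output
        weight_type=QuantType.QInt8,                 # store weights as INT8
    )

    # Load the quantized model for CPU inference and inspect its inputs,
    # e.g. as the first step of a latency benchmark.
    session = ort.InferenceSession(
        "asr_encoder.int8.onnx",
        providers=["CPUExecutionProvider"],
    )
    for inp in session.get_inputs():
        print(inp.name, inp.shape, inp.type)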
Required Qualifications:
Experience: 2-3 years in speech AI, deep learning, or machine learning, with a focus on ASR and TTS.
Education: Bachelor's or Master's degree in Computer Science, AI/ML, Speech Processing, or a related field.
ML Frameworks: Proficiency in PyTorch or TensorFlow for training and deploying ASR/TTS models.
ASR Expertise: Experience with speech-to-text architectures like Whisper, Wav2Vec, Conformer, or DeepSpeech.
TTS Expertise: Experience with speech synthesis models like Tacotron, FastSpeech, or VITS.
Speech Signal Processing: Understanding of MFCCs, STFT, phonemes, prosody modeling, and feature extraction (a short feature-extraction sketch follows this list).
Inference Optimization: Hands-on experience with TensorRT, ONNX, or quantization (INT8, FP16) for ASR/TTS.
Cloud & Edge Deployment: Experience deploying speech models on AWS, GCP, or Azure.
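To illustrate the signal-processing expectation, classic front-end features can be computed in a few lines with torchaudio; the file name and parameter choices (25 ms windows, 10 ms hop, assuming 16 kHz audio) are typical values, not requirements:

    # Sketch: STFT and MFCC front-end features with torchaudio.
    # "utterance.wav" is a placeholder; parameters assume 16 kHz audio.
    import torchaudio

    waveform, sample_rate = torchaudio.load("utterance.wav")

    # Power spectrogram on 25 ms windows with a 10 ms hop (400/160 samples).
    spec = torchaudio.transforms.Spectrogram(n_fft=400, hop_length=160)(waveform)

    # 13 MFCCs derived from an 80-bin mel filterbank.
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=sample_rate,
        n_mfcc=13,
        melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 80},
    )(waveform)

    print(spec.shape, mfcc.shape)   # (channels, 201, frames), (channels, 13, frames)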
Preferred Qualifications:
Experience with speaker diarization, speaker recognition, or language modeling for ASR.
Familiarity with zero-shot TTS, voice cloning, and multilingual speech modeling.
Understanding of CUDA optimization and low-bit quantization for ASR/TTS models.
Contributions to open-source speech AI projects or a strong GitHub portfolio.
Experience with real-time streaming ASR/TTS applications and low-latency inference (a toy chunked-decoding sketch follows).
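Finally, for the streaming point, a toy sketch that greedily decodes fixed five-second chunks with a CTC model. Real streaming systems carry model state and context across chunks, which this deliberately omits; the checkpoint and audio file are illustrative:

    # Toy "streaming" loop: greedy CTC decoding over fixed-size chunks.
    # Checkpoint and audio path are illustrative; no state is carried
    # across chunks, unlike a production streaming ASR system.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

    waveform, sr = torchaudio.load("long_call.wav")
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(0)

    chunk = 16_000 * 5                                # 5-second windows
    for start in range(0, waveform.numel(), chunk):
        piece = waveform[start:start + chunk]
        if piece.numel() < 640:                       # skip a too-short tail
            break
        inputs = processor(piece.numpy(), sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        print(processor.batch_decode(logits.argmax(dim=-1))[0])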
Interested candidates are invited to submit their resume, cover letter, and any relevant project portfolios or GitHub links showcasing their experience in ASR, TTS, or Speech AI. Strong AI-related projects, whether from industry, research, or personal work, will be highly valued.