Site Reliability Engineer - Application Specialisation

Job type: Full Time · Department: Engineering · Work type: On-Site

Bengaluru, Karnataka, India

Company Overview

Sarvam.ai is a pioneering generative AI startup headquartered in Bengaluru, India. Our mission is to make generative AI accessible and impactful for Bharat. Founded by a team of AI experts, Sarvam.ai is dedicated to developing cost-effective, high-performance AI agents tailored for the Indian market, enabling enterprises to tap into new opportunities and foster deeper customer connections. Join us in reshaping AI for India and beyond.

Job Summary

We are looking for an SRE to help build and manage scalable, secure, and high-performance infrastructure. You will play a key role in automating deployments, managing cloud infrastructure, optimizing CI/CD workflows, and ensuring system reliability. This role requires a strong foundation in cloud platforms, containerization, infrastructure as code (IaC), and monitoring tools.

Key Responsibilities

Design, implement, and manage CI/CD pipelines for seamless software deployment.
Deploy and manage cloud infrastructure using Terraform, Kubernetes, and Docker.
Automate infrastructure provisioning, scaling, and security compliance.
Monitor system performance, optimize resource utilization, and ensure high availability.
Implement logging, monitoring, and alerting solutions using tools like Prometheus, Grafana, ELK Stack, or CloudWatch.
Enhance security and compliance by managing IAM policies, encryption, and vulnerability scanning.
Troubleshoot system failures, perform root cause analysis, and implement improvements.
Work closely with development teams to ensure smooth deployment of AI models and applications.

Must-Have Skills and Qualifications

Educational Background: Bachelor's degree in Computer Science, Engineering, or a related field (2024/2025 graduates).
Cloud Expertise: Experience with AWS, Azure, or GCP for deploying and managing cloud-based applications.
Containerization: Strong hands-on experience with Docker and Kubernetes.
Infrastructure as Code (IaC): Experience using Terraform, Ansible, or CloudFormation.
CI/CD Pipelines: Experience setting up automated workflows using GitHub Actions, Jenkins, or GitLab CI/CD.
Monitoring & Logging: Experience with Prometheus, Grafana, ELK, or similar tools.
Networking & Security: Understanding of firewalls, VPNs, SSL/TLS, and cloud security best practices.
Version Control: Proficiency with Git and managing repositories.
Problem Solving: Strong debugging, troubleshooting, and analytical skills.

Good to Have

Exposure to serverless computing (AWS Lambda, Azure Functions).
Experience with message queues (Kafka, RabbitMQ, or SQS).
Familiarity with site reliability engineering (SRE) practices.
Contributions to open-source projects or a strong GitHub portfolio demonstrating DevOps expertise.
