Reasoning - Back End - Observability Engineer

Full Time · Engineering · Remote

Bengaluru, Karnataka, India

Observability Engineer - Sarvam Reasoning AI

About the Job

The Sarvam AI Reasoning team is building sophisticated reasoning capabilities for India's first sovereign AI platform. We are seeking a skilled Observability Engineer to join our organization. In this role, you'll be responsible for designing and implementing comprehensive observability systems, incident management frameworks, and automated remediation solutions. You'll work closely with our development teams to ensure our AI platform maintains exceptional reliability and performance. Your expertise will be critical in building resilient systems that can handle the scale and complexity of our enterprise-grade AI reasoning capabilities.

This challenging position offers the opportunity to work at the intersection of AI technology and systems reliability. We seek candidates with strong expertise in observability, incident management, and chaos engineering who can demonstrate excellence in designing and maintaining complex distributed systems.

Skills & Qualifications

  • Bachelor's/Master's Degree in Computer Science or related field from a top-tier institution

  • 3-5 years of experience in Site Reliability Engineering or DevOps roles

  • Demonstrated expertise in designing and implementing observability platforms for distributed systems

  • Strong experience with monitoring tools, log aggregation systems, and metrics collection

  • Proficiency in developing dashboards and alerting systems

  • Extensive knowledge of incident management processes and escalation policies

  • Experience implementing SLO/SLI frameworks and error budgets

  • Familiarity with chaos engineering practices and tools

  • Knowledge of AIOps principles and predictive monitoring

  • Experience with incident response automation and remediation playbooks

  • Strong programming skills in languages like Python, Go, or Java

  • Excellent communication skills and experience with post-incident analysis

Responsibilities

  • Design comprehensive observability data platforms for logs, metrics, and traces

  • Design and implement distributed tracing with correlation IDs across all systems

  • Develop dashboards and alerting systems

  • Own on-call processes, incident classification, and escalation policies

  • Own SLO/SLI frameworks with error budgets and automated remediation playbooks

  • Implement chaos engineering practices to validate system resiliency

  • Build AIOps systems for anomaly detection and predictive monitoring

  • Implement automated incident response with predefined playbooks

  • Design blameless post-mortem processes and continuous improvement cycles

  • Collaborate with development teams to improve system reliability and performance

  • Establish and maintain documentation for operational procedures and best practices

  • Mentor junior engineers on observability and reliability engineering practices
