Senior Site Reliability Engineer

Job type: Full Time · Department: Engineering · Work type: On-Site

Bengaluru, Karnataka, India; Gurugram, Haryana, India

About the company:

At WizCommerce, we’re building the AI Operating System for Wholesale Distribution — transforming how manufacturers, wholesalers, and distributors sell, serve, and scale.

With a growing customer base across North America, WizCommerce helps B2B businesses move beyond disconnected systems and manual processes with an integrated, AI-powered platform.

Our platform brings together everything a wholesale business needs to sell smarter and faster. With WizCommerce, businesses can:

Take orders easily — whether at a trade show, during customer visits, or online.
Save hours of manual work by letting AI handle repetitive tasks like order entry or creating product content.
Offer a modern shopping experience through their own branded online store.
Access real-time insights on what’s selling, which customers to focus on, and where new opportunities lie.

The wholesale industry is at a turning point — outdated systems and offline workflows can no longer keep up. WizCommerce brings the speed, intelligence, and design quality of modern consumer experiences to the B2B world, helping companies operate more efficiently and profitably.

Backed by leading global investors including Peak XV Partners (formerly Sequoia Capital India), Z47 (formerly Matrix Partners), Blume Ventures, and Alpha Wave Global, we’re rapidly scaling and redefining how wholesale and distribution businesses sell and grow.

If you want to be part of a fast-growing team that’s disrupting a $20 trillion global industry, WizCommerce is the place to be.

Read more about us in Economic Times, The Morning Star, YourStory, or on our website!

Founders:

Divyaanshu Makkar (Co-founder, CEO)

Vikas Garg (Co-founder, CCO)

Job Description:

Role: As a Senior Site Reliability Engineer (SRE) at WizCommerce, you will take end-to-end ownership of designing, maintaining, and scaling our global infrastructure to ensure reliability, performance, and security across all products. You’ll lead initiatives that drive automation, observability, and resilience while mentoring junior engineers and collaborating closely with cross-functional teams to evolve our SRE practices.

Responsibilities:

System Architecture & Reliability:

Architect and manage scalable, distributed systems ensuring high availability, fault tolerance, and zero downtime.
Lead the design and implementation of advanced monitoring, alerting, and observability frameworks across cloud environments.

Incident Management & Root Cause Analysis:

Serve as the technical lead during critical incidents—driving resolution, documenting post-mortems, and implementing preventative measures.
Establish SLA/SLO/SLI frameworks and ensure compliance across production environments.

Automation & Infrastructure as Code:

Build robust automation pipelines using Terraform, Ansible, Jenkins, or GitHub Actions to eliminate manual interventions.
Champion infrastructure as code (IaC) principles, enforcing version control and repeatability across environments.

Scalability & Performance Optimization:

Conduct capacity planning and load testing to proactively identify scaling bottlenecks.
Continuously analyze performance metrics, fine-tune system configurations, and optimize cloud spend.

Security, Compliance & Resilience:

Design and enforce security best practices for cloud infrastructure, data backups, disaster recovery, and business continuity planning.
Partner with security teams to ensure compliance with internal and external audit standards.

Cross-Team Collaboration:

Collaborate with engineering and DevOps to improve deployment pipelines, architecture design, and release reliability.
Review system design proposals, ensuring production readiness and maintainability.

Leadership & Mentorship:

Mentor junior SREs and DevOps engineers through code reviews, guidance on best practices, and process improvement sessions.
Lead initiatives to evolve SRE culture within WizCommerce — from documentation to automation standards.

Qualifications and Skills:

Minimum 6 years of experience in SRE or DevOps, managing large-scale, distributed systems in production.
Deep expertise with cloud platforms (GCP preferred; AWS or Azure acceptable).
Strong proficiency in Python, Go, or Bash scripting for automation and tool development.
Advanced knowledge of containerization (Docker) and orchestration (Kubernetes, Helm) in multi-environment setups.
Proven experience in setting up CI/CD pipelines using Jenkins, Argo CD, or GitHub Actions.
Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, Datadog, or New Relic.
Strong grasp of networking, load balancing, DNS, and caching for optimizing distributed systems.
Proficiency in infrastructure as code (Terraform, Ansible) and version control (Git).
Strong analytical and troubleshooting skills with a bias toward automation and continuous improvement.

Compensation: Best in the industry

Role location: Bengaluru/Gurugram

Website Link: https://www.wizcommerce.com/
