Manager of Infrastructure Engineering (Observability)

Job type: Full Time · Department: Infrastructure Engineering · Work type: On-Site · USD 160000 - 210000 / year

Redmond, Washington, United States

Voltage Park is your enterprise AI factory. We offer scalable compute power and on-demand and reserved bare-metal AI infrastructure using NVIDIA GPUs, with world-class service, performance, and value. Founded with the mission of making AI computing accessible to all, our flexible, affordable GPU solutions power everyone from builders to enterprises.

Voltage Park is looking for a Manager of Infrastructure Engineering to join our Infrastructure Engineering team. The team builds automation, tooling, and API-driven systems that bridge the gap between our physical infrastructure and the systems our customers depend on for AI/ML training, inference, and HPC workloads at scale.

In this role, you’ll design and implement systems that enable humans and software to interact programmatically with thousands of bare-metal servers, storage clusters, and high-performance networks. You will work closely with teams across Voltage Park to drive new infrastructure rollouts and improve the lifecycle management of existing resources. Observability is not a nice-to-have—it is foundational to how we operate safely, efficiently, and at scale.

QUALIFICATIONS

  • 7+ years in infrastructure engineering, SRE, or platform roles

  • 2+ years managing technical teams

  • Deep experience designing and operating observability systems at scale

  • Strong background in Linux, distributed systems, and production operations

  • Experience in GPU, HPC, or AI infrastructure environments

  • Hands-on experience with bare-metal systems and hardware-level telemetry (power, thermal, network, GPU)

  • Comfort operating in environments with hardware dependencies, physical failure modes, and tight SLAs

Strong Technical Background In

  • Metrics systems (Prometheus, VictoriaMetrics, Mimir, etc.)

  • Logging systems (ELK / OpenSearch, Loki, ClickHouse, Kafka-based pipelines)

  • Distributed tracing (OpenTelemetry, Jaeger, Tempo)

  • Kubernetes observability (nodes, clusters, workloads, control plane)

  • Alerting strategy, SLOs, SLIs, and error budgets

  • High-cardinality, high-volume telemetry tradeoffs

Nice to Have

  • Experience designing observability for hardware failure modes (GPU ECC, PCIe, NIC errors, power or thermal limits)

  • Experience operating observability platforms across multiple data centers and failure domains

  • Familiarity with capacity-aware or constraint-driven alerting (power, thermal, rack-level limits)

  • Experience balancing telemetry cost, retention, and fidelity at large scale

  • Prior experience evolving alerting from reactive to SLO-driven

  • Experience building or scaling observability teams or platforms in high-growth environments

WHAT YOU'LL DO

Technical Ownership & Strategy

  • Own Voltage Park’s observability strategy across infrastructure and platform layers

  • Define standards for metrics, logs, traces, alerts, dashboards, and SLOs

  • Drive architecture decisions for telemetry pipelines, storage, and retention

  • Balance signal quality, system performance, and cost at scale

Team Leadership

  • Build, manage, and mentor a team of infrastructure engineers focused on observability

  • Set clear technical direction, priorities, and expectations

  • Review designs, guide implementation, and raise the bar on operational rigor

  • Partner closely with Engineering and Operations teams

Platform Engineering

  • Design and operate high-throughput observability pipelines (metrics, logs, traces)

  • Ensure observability platforms are reliable, scalable, and resilient

  • Improve alert quality and reduce noise across production systems

  • Enable self-service observability for internal engineering teams

Reliability & Operations

  • Participate in and lead infrastructure incident response

  • Use observability data to drive root-cause analysis and systemic improvements

  • Build feedback loops from incidents into better tooling, alerts, and runbooks

  • Help establish a culture of measurement-driven reliability

Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.
