Software Engineer - Reliability

Luma AI

This job is no longer accepting applications

Software Engineering

Palo Alto, CA, USA

USD 170k-360k / year

Posted on Nov 7, 2025

Software Engineer - Reliability

Palo Alto, CA • London, UK

Infra Reliability

Hybrid

Full-time

About Luma AI

Luma's mission is to build multimodal AI to expand human imagination and capabilities. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

Where You Come In

This is not a typical cloud SRE role. We are looking for a hands-on, first-principles engineer who is fluent in Linux and comfortable operating close to the metal. You will build, maintain, and scale Luma's large-scale GPU infrastructure, working directly on on-prem and multi-vendor cloud clusters. You'll solve complex systems problems, ensure reliability through clear SLOs/SLIs, and build automation that allows us to operate at an unprecedented scale with a lean team.

What You'll Do

Own GPU Cluster Reliability: Take end-to-end ownership of our GPU clusters for training and inference, ensuring high availability and peak performance across multiple cloud providers.
Drive Reliability Metrics: Define and maintain service-level objectives (SLOs) and indicators (SLIs) to measure and improve reliability as our infrastructure scales.
Deep Linux Expertise: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS level.
Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure.
Master Kubernetes at Scale: Operate and scale Kubernetes clusters beyond managed services, ensuring reliability across diverse workloads.
Modern Operations Practices: Implement and manage observability stacks (Prometheus, Grafana) and GitOps workflows (Argo CD, Flux) to keep infrastructure transparent and resilient.

Who You Are

5+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
Deep, hands-on expertise in Linux and containerized systems.
Strong experience with Kubernetes in production environments at meaningful scale.
Proficient in Python and/or Go, with a track record of building infrastructure tooling.
Strong understanding of networking, cloud infrastructure (AWS/GCP), and IaC tools like Terraform.
A tenacious troubleshooter who thrives on solving complex, low-level problems.
Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).

What Sets You Apart (Bonus Points)

Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.
Experience debugging GPU performance issues with specialized tools.

Compensation

The base pay range for this role is $170,000 – $360,000 per year.

Req ID: R100014
