Software Engineer - Data Infra Reliability
Luma AI
Software Engineering
San Francisco, CA, USA
USD 220k-280k / year
- Automate Everything: Apply Infrastructure-as-Code (IaC) principles using Terraform to provision, manage, and scale our data infrastructure.
- Harden Data Pipelines: Build reliability and fault tolerance into our core data ingestion and processing workflows, ensuring high availability for research jobs.
- Scale Kubernetes & Ray: Operate and optimize large-scale Kubernetes clusters and Ray deployments to handle bursty, high-throughput workloads.
- Define Reliability: Establish Service Level Objectives (SLOs) and observability standards (Prometheus/Grafana) for our data platforms.
- Debug & Heal: Serve as the first line of defense for complex infrastructure failures, diagnosing root causes in distributed storage and compute systems.
- Deep SRE/DevOps Proficiency: You live and breathe Linux, networking, and automation.
- Infrastructure-as-Code Native: You have extensive experience with Terraform, Ansible, or similar tools to manage complex cloud environments (AWS/GCP).
- Kubernetes Expert: You have managed Kubernetes in production and understand its internals, not just how to deploy containers.
- Python Proficiency: You can write high-quality Python code for automation, tooling, and infrastructure management.
- Data-Minded: You understand the specific challenges of stateful data systems and high-throughput storage (S3/Object Store).
- Experience managing GPU clusters or AI/ML workloads.
- Background in both Software Engineering and Operations (DevOps).
- Experience with high-performance networking (InfiniBand/RDMA).