Software Engineer - Reliability
Luma AI
Software Engineering
United States · California, USA · Remote
USD 170k-360k / year
- Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates.
- Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
- Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type I & II, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment.
- Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level.
- Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
- Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.
- 8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
- Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance.
- Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
- Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect.
- Startup DNA: You are energetic and thrive in a less structured, fast-paced environment.
- Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
- Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs.
- Deep expertise with NVIDIA and AMD GPU tooling, such as DCGM or ROCm.
- Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).
- Familiarity with Kubernetes-based job management systems or orchestration frameworks like Ray.