Site Reliability Engineer | AI Supercomputing
Luma AI
This job is no longer accepting applications
Software Engineering, Data Science
United States · Palo Alto, CA, USA · Remote
USD 170k-360k / year
- Supercomputing Architecture: Design and deploy high-performance clusters combining thousands of GPUs, CPUs, and high-throughput networking to maximize training efficiency.
- The Network Layer: Optimize low-level networking (InfiniBand, RDMA) to ensure seamless communication between accelerators, eliminating bottlenecks in distributed training jobs.
- Hardware-Software Synthesis: Collaborate with hardware partners to push the boundaries of what is possible, debugging failures at the intersection of the kernel, driver, and silicon.
- HPC Authority: You possess elite knowledge of high-performance computing (HPC), including job schedulers and the nuances of GPU architecture.
- Deep Systems Fluency: You are comfortable navigating the Linux terminal to solve complex performance issues, utilizing tools like perf and strace to optimize at the OS level.
- First-Principles Engineering: You have a history of building infrastructure from the ground up, demonstrating the ability to design systems where no playbook currently exists.