SRE | Foundation Models
Luma AI
This job is no longer accepting applications
Palo Alto, CA, USA
USD 170k-360k / year
- Research Platforms: Design and maintain the scheduling and orchestration systems that allow researchers to launch and manage massive training jobs with ease.
- Observability for Intelligence: Implement deep observability stacks that provide transparency into cluster health, allowing us to predict and prevent interruptions to critical training runs.
- Scalable Inference: Architect the production systems that serve our models to the world, balancing the high availability required for consumer products with the massive compute intensity of generative AI.
- Service Orientation: You understand that reliable infrastructure is the enabler of innovation, and you care deeply about the developer experience of the researchers you support.
- Operational Excellence: You have a track record of maintaining high availability in complex, distributed environments, using automation to reduce toil.
- ML Infrastructure Fluency: You are familiar with the unique demands of AI workloads, including the management of GPU resources and the intricacies of distributed training.