Member of Technical Staff - ML Infra

Black Forest Labs

Software Engineering, IT, Data Science
United States · Germany · Remote
Posted on Oct 10, 2024

Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is looking for a strong candidate to join us in developing and maintaining our ML infrastructure, including large GPU training and inference clusters.

Role:

  • Design, deploy, and maintain cloud-based ML training (Slurm) and inference (Kubernetes) clusters
  • Implement and manage network-based cloud file systems and blob/S3 storage solutions
  • Develop and maintain Infrastructure as Code (IaC) for resource provisioning
  • Implement and optimize CI/CD pipelines for ML workflows
  • Design and implement custom autoscaling solutions for ML workloads (a minimal sketch follows this list)
  • Ensure security best practices across the ML infrastructure
  • Provide developer-friendly tools and practices for efficient ML operations
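
As a rough illustration of the autoscaling work referenced above, here is a minimal sketch (not Black Forest Labs' actual tooling): a tiny controller that scales a hypothetical inference Deployment based on how many pods are stuck Pending, using the official Kubernetes Python client. The namespace, deployment name, and scaling policy are assumptions for illustration only.

    # Minimal custom-autoscaler sketch: grow a (hypothetical) inference
    # Deployment when pods queue up as Pending, shrink slowly when idle.
    # Requires the official `kubernetes` Python client and cluster credentials.
    import time

    from kubernetes import client, config

    NAMESPACE = "inference"        # hypothetical namespace
    DEPLOYMENT = "flux-inference"  # hypothetical deployment name
    MIN_REPLICAS, MAX_REPLICAS = 2, 32

    def desired_replicas(pending: int, current: int) -> int:
        """Naive policy: one extra replica per 4 pending pods; step down by
        one when nothing is queued. Clamped to [MIN_REPLICAS, MAX_REPLICAS]."""
        target = current + (pending + 3) // 4 if pending else current - 1
        return max(MIN_REPLICAS, min(MAX_REPLICAS, target))

    def main() -> None:
        config.load_kube_config()  # use load_incluster_config() when running in-cluster
        core = client.CoreV1Api()
        apps = client.AppsV1Api()
        while True:
            pending = len(core.list_namespaced_pod(
                NAMESPACE, field_selector="status.phase=Pending").items)
            scale = apps.read_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE)
            current = scale.spec.replicas or 0
            target = desired_replicas(pending, current)
            if target != current:
                apps.patch_namespaced_deployment_scale(
                    DEPLOYMENT, NAMESPACE, {"spec": {"replicas": target}})
            time.sleep(30)

    if __name__ == "__main__":
        main()

A production version would key off real signals (queue depth, GPU utilization, request latency) rather than Pending pods alone, but the control-loop shape stays the same.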

Ideal Experience:

  • Strong proficiency in cloud platforms (AWS, Azure, or GCP) with a focus on ML/AI services
  • Extensive experience with Kubernetes and Slurm cluster management
  • Expertise in Infrastructure as Code tools (e.g., Terraform, Ansible)
  • Proven track record in managing and optimizing network-based cloud file systems and object storage
  • Experience with CI/CD tools and practices (e.g., CircleCI, GitHub Actions, ArgoCD)
  • Strong understanding of security principles and best practices in cloud environments
  • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Loki); see the example after this list
  • Familiarity with ML workflows and GPU infrastructure management
  • Demonstrated ability to handle complex migrations and breaking changes in production environments
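
On the monitoring side, a brief example of the kind of observability glue this involves: pulling per-GPU utilization out of Prometheus over its HTTP API. This is only a sketch; the metric name assumes NVIDIA's DCGM exporter is being scraped, and the server URL and idle threshold are placeholders.

    # Sketch: query Prometheus' HTTP API for average GPU utilization per host
    # and flag idle machines. Assumes the NVIDIA DCGM exporter is scraped;
    # the URL and threshold below are placeholders.
    import requests

    PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder
    QUERY = "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"          # DCGM exporter metric
    IDLE_THRESHOLD = 10.0                                        # percent

    def gpu_utilization() -> dict[str, float]:
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                            params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        # Each result carries its label set plus a [timestamp, value] pair.
        return {r["metric"].get("Hostname", "unknown"): float(r["value"][1])
                for r in results}

    if __name__ == "__main__":
        for host, util in sorted(gpu_utilization().items()):
            flag = "  <-- idle" if util < IDLE_THRESHOLD else ""
            print(f"{host}: {util:.1f}% GPU utilization{flag}")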

Nice to have:

  • Experience with custom autoscaling solutions for ML workloads
  • Knowledge of cost optimization strategies for cloud-based ML infrastructure
  • Familiarity with MLOps practices and tools
  • Experience with high-performance computing (HPC) environments (see the Slurm sketch after this list)
  • Understanding of data versioning and experiment tracking for ML
  • Knowledge of network optimization for distributed ML training
  • Experience with multi-cloud or hybrid cloud architectures
  • Familiarity with container security and vulnerability scanning tools
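
For the Slurm/HPC items above, a sketch of programmatic job submission for a multi-node GPU training run: generate an sbatch script and hand it to `sbatch`. The partition name, GPU counts, and training command are illustrative placeholders, not a description of any real cluster.

    # Sketch: submit a multi-node GPU training job to Slurm by writing an
    # sbatch script and shelling out to `sbatch`. Partition, GPU counts, and
    # the training command are placeholders.
    import re
    import subprocess
    import tempfile
    import textwrap

    def submit_training_job(nodes: int = 4, gpus_per_node: int = 8) -> str:
        script = textwrap.dedent(f"""\
            #!/bin/bash
            #SBATCH --job-name=flux-train
            #SBATCH --partition=gpu
            #SBATCH --nodes={nodes}
            #SBATCH --ntasks-per-node=1
            #SBATCH --gres=gpu:{gpus_per_node}
            #SBATCH --output=%x-%j.out
            # The "gpu" partition and the training command below are placeholders.

            # One launcher task per node; torchrun (or similar) fans out locally.
            srun python train.py --config configs/base.yaml
            """)
        with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
            f.write(script)
            path = f.name
        out = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
        # sbatch prints e.g. "Submitted batch job 123456"
        match = re.search(r"Submitted batch job (\d+)", out.stdout)
        return match.group(1) if match else out.stdout.strip()

    if __name__ == "__main__":
        print("Submitted Slurm job", submit_training_job())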