Member of Technical Staff - Image / Video Data Engineer

Black Forest Labs

IT, Data Science
United States · Germany · Remote
Posted on Oct 10, 2024

Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently looking for a strong candidate to join us in developing large-scale data pipelines for training frontier models.

Role:

  • Develop and maintain scalable infrastructure for large-scale image and video data acquisition
  • Manage and coordinate data transfers from various licensing partners
  • Implement and deploy state-of-the-art ML models for data cleaning, processing, and preparation
  • Implement scalable and efficient tools to visualize, cluster, and deeply understand the data
  • Optimize and parallelize data processing workflows to handle billion-scale datasets efficiently (a small parallel-processing sketch follows this list)
  • Ensure data quality, diversity, and proper annotation (including captioning) for training readiness
  • Convert training data from alternative sources, such as user preferences, into trainable formats
  • Work closely within the model development loop, updating data as the training trajectory requires
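
As a rough illustration of the "optimize and parallelize" point above, here is a minimal sketch of one parallel image-cleaning step using Python's multiprocessing and OpenCV (both mentioned elsewhere in this posting). The directory names, target resolution, and blur threshold are invented for the example; a real billion-scale pipeline would be sharded across many machines rather than a single process pool.

```python
# Hypothetical sketch: parallel image preprocessing with OpenCV and multiprocessing.
# Paths, sizes, and the quality filter are illustrative assumptions only.
import os
from multiprocessing import Pool

import cv2

SRC_DIR = "raw_images"        # assumed input directory
DST_DIR = "processed_images"  # assumed output directory
TARGET_SIZE = (512, 512)      # assumed training resolution
MIN_SHARPNESS = 100.0         # assumed blur-rejection threshold (variance of Laplacian)


def process_image(filename: str) -> str | None:
    """Load one image, drop it if unreadable or too blurry, otherwise resize and save."""
    img = cv2.imread(os.path.join(SRC_DIR, filename))
    if img is None:
        return None  # unreadable / corrupt file
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    if cv2.Laplacian(gray, cv2.CV_64F).var() < MIN_SHARPNESS:
        return None  # reject low-quality (blurry) images
    resized = cv2.resize(img, TARGET_SIZE, interpolation=cv2.INTER_AREA)
    out_path = os.path.join(DST_DIR, filename)
    cv2.imwrite(out_path, resized)
    return out_path


if __name__ == "__main__":
    os.makedirs(DST_DIR, exist_ok=True)
    files = [f for f in os.listdir(SRC_DIR) if f.lower().endswith((".jpg", ".png"))]
    with Pool(processes=os.cpu_count()) as pool:
        kept = [p for p in pool.map(process_image, files) if p is not None]
    print(f"Kept {len(kept)} of {len(files)} images")
```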

Ideal Experiences:

  • Proficiency in Python and experience with various file systems for data-intensive manipulation and analysis
  • Familiarity with cloud computing platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
  • Experience with image and video processing libraries (e.g., OpenCV, FFmpeg); a short FFmpeg sketch follows this list
  • Demonstrated ability to optimize and parallelize data processing workflows across CPUs and GPUs
  • Familiarity with data annotation and captioning processes for ML training datasets
  • Knowledge of machine learning techniques for data cleaning and preprocessing
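
To make the FFmpeg bullet above concrete, the sketch below shells out to ffprobe and ffmpeg from Python to probe a clip and sample frames at a fixed rate. The file paths, sampling rate, and output layout are assumptions for illustration, not a description of any existing pipeline.

```python
# Hypothetical sketch: probing a video and extracting frames via FFmpeg/ffprobe.
# File names, sampling rate, and output layout are illustrative assumptions.
import json
import subprocess
from pathlib import Path

VIDEO = Path("clips/example.mp4")   # assumed input clip
FRAME_DIR = Path("frames/example")  # assumed output directory
FPS = 1                             # assumed sampling rate: one frame per second


def probe(video: Path) -> dict:
    """Return basic video-stream metadata (width, height, duration) via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height,duration",
         "-of", "json", str(video)],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["streams"][0]


def extract_frames(video: Path, out_dir: Path, fps: int) -> None:
    """Sample frames at a fixed rate and write them as JPEGs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-hide_banner", "-loglevel", "error", "-i", str(video),
         "-vf", f"fps={fps}", "-q:v", "2", str(out_dir / "frame_%06d.jpg")],
        check=True,
    )


if __name__ == "__main__":
    print(probe(VIDEO))
    extract_frames(VIDEO, FRAME_DIR, FPS)
```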

Nice to have:

  • Background or keen interest in developing large-scale data acquisition systems
  • Experience with natural language processing for image/video captioning
  • Experience with data deduplication techniques at scale (see the perceptual-hashing sketch after this list)
  • Experience with big data processing frameworks (e.g., Apache Spark, Hadoop)
  • Understanding of ethical considerations in data collection and usage
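
For the deduplication item above, here is a toy sketch using perceptual hashes. The imagehash/Pillow libraries, the Hamming-distance threshold, and the linear scan are illustrative choices only; deduplication over billions of images would instead rely on sharded or approximate-nearest-neighbour indexes.

```python
# Hypothetical sketch: near-duplicate image detection with perceptual hashes.
# The libraries used and the distance threshold are illustrative assumptions.
from pathlib import Path

import imagehash
from PIL import Image

IMAGE_DIR = Path("processed_images")  # assumed directory of candidate images
MAX_DISTANCE = 4                      # assumed Hamming-distance threshold for "duplicate"


def dedupe(image_dir: Path) -> list[Path]:
    """Return the images to keep, skipping near-duplicates of earlier ones."""
    kept: list[tuple[imagehash.ImageHash, Path]] = []
    for path in sorted(image_dir.glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        # A linear scan is only workable at toy scale; at billions of images this
        # comparison would be replaced by a bucketed or ANN-indexed lookup.
        if any(h - kept_hash <= MAX_DISTANCE for kept_hash, _ in kept):
            continue
        kept.append((h, path))
    return [p for _, p in kept]


if __name__ == "__main__":
    unique = dedupe(IMAGE_DIR)
    print(f"{len(unique)} unique images retained")
```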