Member of Technical Staff - Image / Video Data Engineer
Black Forest Labs
IT, Data Science
United States · Germany · Remote
Posted on Oct 10, 2024
Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently looking for a strong candidate to join us in developing large-scale data pipelines for training frontier models.
Role:
- Develop and maintain scalable infrastructure for large-scale image and video data acquisition
- Manage and coordinate data transfers from various licensing partners
- Implement and deploy state-of-the-art ML models for data cleaning, processing, and preparation
- Implement scalable and efficient tools to visualize, cluster, and deeply understand the data
- Optimize and parallelize data processing workflows to handle billion-scale datasets efficiently
- Ensure data quality, diversity, and proper annotation (including captioning) for training readiness
- Convert training data from alternative sources, such as user preferences, into a trainable format
- Work closely within the model development loop, updating data as the training trajectory requires
Ideal Experience:
- Proficiency in Python and various file systems for data-intensive manipulation and analysis
- Familiarity with cloud computing platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
- Experience with image and video processing libraries (e.g., OpenCV, FFmpeg)
- Demonstrated ability to optimize and parallelize data processing workflows across CPUs and GPUs
- Familiarity with data annotation and captioning processes for ML training datasets
- Knowledge of machine learning techniques for data cleaning and preprocessing
Nice to have:
- Background or keen interest in developing large-scale data acquisition systems
- Experience with natural language processing for image/video captioning
- Experience with data deduplication techniques at scale
- Experience with big data processing frameworks (e.g., Apache Spark, Hadoop)
- Understanding of ethical considerations in data collection and usage