Research Engineer - Evaluations
Luma AI
San Francisco, CA, USA
- Design and implement scalable pipelines for automated evaluation of generative models, with a focus on visual and multimodal outputs (image, video, text, audio).
- Develop novel metrics and evaluation models that capture qualities like fidelity, coherence, temporal consistency, and alignment with human intent.
- Integrate evaluation signals into training loops (including reinforcement learning and reward modeling) to continuously improve model performance.
- Build infrastructure for large-scale regression testing, benchmarking, and monitoring of multimodal generative models.
- Collaborate with researchers running human studies to translate human evaluation frameworks into automated or semi-automated systems.
- Partner with model researchers to identify failure cases and build targeted evaluation harnesses.
- Maintain dashboards, reporting tools, and alerting systems to surface evaluation results to stakeholders.
- Stay current with emerging evaluation techniques in generative AI, multimodal LLMs, and perceptual quality assessment.
- Master's or PhD in Computer Science, Machine Learning, or a related technical field (or equivalent industry experience).
- 3+ years of experience building ML evaluation systems, model pipelines, or large-scale infrastructure.
- Hands-on experience working with visual data (images and/or video), including evaluation, modeling, or data preparation.
- Proficiency in Python and ML frameworks (PyTorch, JAX, or TensorFlow).
- Familiarity with human-in-the-loop evaluation workflows and how to scale them with automation.
- Strong background in machine learning, with experience in generative models (diffusion, LLMs, multimodal architectures).
- Strong software engineering skills (CI/CD, testing, data pipelines, distributed systems).
- Experience with reinforcement learning or reward modeling.
- Prior work on perceptual metrics, multimodal evaluation benchmarks, or retrieval-based evaluation.
- Background in large-scale model training or evaluation infrastructure.
- Experience designing metrics for perceptual quality.
- Familiarity with creative media workflows (film, VFX, animation, digital art).
- Contributions to open-source evaluation libraries or benchmarks.