Research Scientist / Engineer — Foundation Model (Voice Agents)
Luma AI
London, UK · Remote
USD 250k-450k / year
- Modeling: Build next-generation voice agents that tightly integrate audio understanding (e.g., ASR, diarization, emotion recognition) and audio generation (e.g., TTS, voice conversion) for real-time, interactive use.
- Data: Design, implement, and run robust data pipelines and training curricula for speech and audio, covering large-scale pretraining, fine-tuning, and iterative data-quality improvement.
- Systems: Train large-scale video and audio generative models on massive datasets and GPU clusters, and develop low-latency architectures and inference strategies for streaming, conversational, and on-device deployment.
- Evaluation: Define and build novel evaluation frameworks for voice agents, covering accuracy, robustness, latency, controllability, and human perceptual quality.
- A strong background in machine learning and generative modeling.
- Practical understanding of speech and audio modeling, including representation learning, sequence modeling, and conditioning/control mechanisms.
- Experience building and training models in PyTorch, including models for large-scale or latency-sensitive systems.
- Experience with speech or audio understanding tasks (e.g., ASR, diarization, speaker/emotion recognition, audio classification).
- Experience with speech or audio generation (e.g., TTS, voice conversion, expressive or controllable speech).
- Familiarity with streaming or real-time inference, model compression, or deployment on consumer hardware.
- A portfolio of past projects, publications, or open-source contributions demonstrating your work in generative audio or speech AI.