Senior Applied Researcher, Audio Generation
Cartesia
Location: HQ - San Francisco, CA
Employment Type: Full time
Location Type: On-site
Department: Staff
Compensation: $200K – $350K • Offers Equity
About Cartesia
Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.
We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.
We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.
The Role
We are seeking a Senior Applied Researcher to contribute to the development of our next-generation speech models. You will be responsible for designing, training, and deploying novel generative models for tasks like multilingual text-to-speech (TTS), voice conversion, music generation, and sound effect synthesis.
The challenge is no longer just about creating high-fidelity audio; it's about generating it with near-zero latency and giving users precise creative control. We aim to set new standards for accuracy, speed, and usability in production systems.
What you’ll do
Develop & optimize speech and audio models for production.
Work with engineering to ship and scale your models across our target platforms: cloud, on-premise, and on-device.
Develop model architectures and inference strategies specifically for low-latency, real-time performance on consumer hardware.
Implement and refine mechanisms for fine-grained controllability, allowing for the manipulation of attributes like speaker identity, emotion, prosody, and acoustic style.
Pioneer research on new architectures for generative modeling.
What we’re looking for
Proven experience in developing and training novel generative models, preferably for audio or speech.
Clear understanding of the architectural trade-offs between model quality, inference speed, and memory footprint.
Hands-on experience with model conditioning and control mechanisms.
Our culture
🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together, and learning from each other every day.
🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality or design along the way.
🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.
Our perks
🍽 Lunch, dinner and snacks at the office.
🏥 Fully covered medical, dental, and vision insurance for employees.
🏦 401(k).
✈️ Relocation and immigration support.
🦖 Your own personal Yoshi.