Senior Applied Researcher, Audio Generation

Cartesia

San Francisco, CA, USA
USD 200k-350k / year + Equity
Posted on Sep 16, 2025

Location

HQ - San Francisco, CA

Employment Type

Full time

Location Type

On-site

Department

Staff

Compensation

  • $200K – $350K • Offers Equity

About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors and 90+ angels across many industries, including some of the world's foremost experts in AI.

The Role

We are seeking a Senior Applied Researcher to contribute to the development of our next-generation speech models. You will be responsible for designing, training, and deploying novel generative models for tasks like multi-lingual text-to-speech (TTS), voice conversion, music generation, and sound effect synthesis.

The challenge is no longer just about creating high-fidelity audio; it's about generating it with near-zero latency and giving users precise creative control. We aim to set new standards for accuracy, speed, and usability in production systems.

What you’ll do

  • Develop & optimize speech and audio models for production.

  • Work with engineering to ship and scale your models across our target platforms: cloud, on-premise, and on-device.

  • Develop model architectures and inference strategies specifically for low-latency, real-time performance on consumer hardware.

  • Implement and refine mechanisms for fine-grained controllability, allowing for the manipulation of attributes like speaker identity, emotion, prosody, and acoustic style.

  • Pioneer new architectures for generative modeling, building on and contributing to the latest research.

What we’re looking for

  • Proven experience in developing and training novel generative models, preferably for audio or speech.

  • Clear understanding of the architectural trade-offs between model quality, inference speed, and memory footprint.

  • Hands-on experience with model conditioning and control mechanisms.

Our culture

🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.

🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality or design along the way.

🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.

Our perks

🍽 Lunch, dinner and snacks at the office.

🏥 Fully covered medical, dental, and vision insurance for employees.

🏦 401(k).

✈️ Relocation and immigration support.

🦖 Your own personal Yoshi.

Compensation Range: $200K - $350K