Researcher: Audio (Data)
Cartesia
About Cartesia
Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.
We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team pairs deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.
We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks, and others. We're fortunate to have the support of many amazing advisors and 90+ angels across many industries, including the world's foremost experts in AI.
The Role
• Lead the design and creation of high-quality datasets tailored for training cutting-edge audio models, focusing on tasks such as speech recognition, enhancement, separation, synthesis, and speech-to-speech systems.
• Develop strategies for curating, augmenting, and labeling audio datasets to address challenges like noise, variability, and diverse use cases.
• Design innovative data augmentation and synthetic data generation techniques to enrich training datasets and improve model robustness.
• Create datasets specifically for speech-to-speech systems, focusing on alignment, phonetic variability, and cross-linguistic considerations.
• Collaborate closely with researchers and engineers to understand model requirements and ensure datasets are optimized for specific architecture and task needs.
• Build tools and pipelines for scalable data processing, labeling, and validation to support both research and production workflows (a minimal validation sketch follows below).
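As a flavor of what such pipelines involve, here is a minimal validation-pass sketch in Python. It is illustrative only: the directory layout, the thresholds, and the helper name validate_clip are assumptions rather than part of any existing Cartesia tooling, and it uses the open-source soundfile library for audio I/O.

```python
from pathlib import Path

import numpy as np
import soundfile as sf


def validate_clip(path: Path,
                  min_dur: float = 1.0,
                  max_dur: float = 30.0) -> bool:
    """Cheap validation pass: reject clips that are too short, too long,
    or heavily clipped. All thresholds are illustrative defaults."""
    info = sf.info(str(path))
    duration = info.frames / info.samplerate
    if not (min_dur <= duration <= max_dur):
        return False
    audio, _ = sf.read(str(path), dtype="float32")
    if audio.ndim > 1:  # downmix multichannel to mono
        audio = audio.mean(axis=1)
    # Flag clipping: more than 0.1% of samples sitting at full scale.
    if np.mean(np.abs(audio) >= 0.999) > 1e-3:
        return False
    return True


# Keep only clips that pass validation (hypothetical directory layout).
kept = [p for p in sorted(Path("raw_audio").glob("*.wav")) if validate_clip(p)]
```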
What We’re Looking For
• Deep expertise in audio data processing, with a strong understanding of the challenges involved in creating datasets for tasks like ASR, TTS, or speech-to-speech modeling.
• Experience with audio processing libraries and tools, such as librosa, torchaudio, or custom pipelines for large-scale audio data handling.
• Familiarity with data augmentation techniques for audio, including time-stretching, pitch-shifting, noise addition, and domain-specific methods (see the sketch after this list).
• Strong understanding of dataset quality metrics and techniques to ensure data sufficiency, coverage, and relevance to target tasks.
• Programming skills in Python and experience with frameworks like PyTorch or TensorFlow for integrating data pipelines with model training workflows.
• Comfortable with large-scale data processing and with distributed file systems for storing and processing audio data.
• A collaborative mindset, with the ability to work closely with researchers and engineers to align data design with model objectives.
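To make the augmentation bullet above concrete, here is a minimal sketch of a random augmentation chain built on librosa (one of the libraries named above). The probabilities, parameter ranges, and input filename are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np
import librosa


def augment(y: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Apply a random chain of standard augmentations to a mono waveform."""
    # Time-stretch: change speed/duration without changing pitch.
    if rng.random() < 0.5:
        y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    # Pitch-shift by up to +/-2 semitones, keeping duration fixed.
    if rng.random() < 0.5:
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
    # Additive white noise at a random SNR between 10 and 30 dB.
    if rng.random() < 0.5:
        snr_db = rng.uniform(10.0, 30.0)
        noise = rng.standard_normal(len(y)).astype(y.dtype)
        # Scale noise so signal_power / noise_power == 10**(snr_db / 10).
        scale = np.sqrt(np.mean(y**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
        y = y + scale * noise
    return y


y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
y_aug = augment(y, sr, np.random.default_rng(0))
```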
Nice-to-Haves
• Experience in creating synthetic datasets using generative models or simulation frameworks.
• Background in multimodal data curation, integrating audio with text, video, or other modalities.
• Early-stage startup experience or experience building datasets for cutting-edge research.
Our culture
🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together, and learning from each other every day.
🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality or design along the way.
🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.
Our perks
🍽 Lunch, dinner and snacks at the office.
🏥 Fully covered medical, dental, and vision insurance for employees.
🏦 401(k).
✈️ Relocation and immigration support.
🦖 Your own personal Yoshi.