Researcher: Multimodal (Data)
Cartesia
About Cartesia
Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.
We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team pairs deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.
We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.
The Role
• Lead the design, creation, and optimization of datasets for training and evaluating multimodal models across diverse modalities, including audio, text, video, and images.
• Develop strategies for curating, aligning, and augmenting multimodal datasets to address challenges in synchronization, variability, and scalability.
• Design innovative methods for data augmentation, synthetic data generation, and cross-modal sampling to enhance the diversity and robustness of datasets.
• Create datasets tailored for specific multimodal tasks, such as audio-visual speech recognition, text-to-video generation, or cross-modal retrieval, with attention to real-world deployment needs.
• Collaborate closely with researchers and engineers to ensure datasets are optimized for target architectures, training pipelines, and task objectives.
• Build scalable pipelines for multimodal data processing, annotation, and validation to support research and production workflows.
What We’re Looking For
• Expertise in multimodal data curation and processing, with a deep understanding of challenges in combining diverse data types like audio, text, images, and video.
• Proficiency in tools and libraries for handling specific modalities, such as librosa (audio), OpenCV (video), and the Hugging Face libraries (text).
• Familiarity with data alignment techniques, including time synchronization for audio and video, embedding alignment for cross-modal learning, and temporal consistency checks.
• Strong understanding of multimodal dataset design principles, including methods for ensuring data diversity, sufficiency, and relevance for targeted applications.
• Programming expertise in Python and experience with frameworks like PyTorch or TensorFlow for building multimodal data pipelines.
• Comfort with large-scale data processing and distributed systems for multimodal dataset storage, processing, and management.
• A collaborative mindset with the ability to work cross-functionally with researchers, engineers, and product teams to align data strategies with project goals.
Nice-to-Haves
• Experience in creating synthetic multimodal datasets using generative models, simulation environments, or advanced augmentation techniques.
• Background in annotating and aligning multimodal datasets for tasks such as audio-visual speech recognition, video captioning, or multimodal reasoning.
• Early-stage startup experience or a proven track record of building datasets for cutting-edge research in fast-paced environments.
Our culture
🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.
🚢 We ship fast. All of our work is novel and cutting-edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality or design along the way.
🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.
Our perks
🍽 Lunch, dinner and snacks at the office.
🏥 Fully covered medical, dental, and vision insurance for employees.
🏦 401(k).
✈️ Relocation and immigration support.
🦖 Your own personal Yoshi.