Staff Data Engineer - Emerald
H1
Software Engineering, Data Science
New York, NY, USA
Data Engineering is responsible for the development and delivery of our most important asset: our data. Drawing on thousands of data sources from around the globe, the Data Engineering team makes sense of that data to create the world's most extensive and comprehensive knowledge base of healthcare stakeholders and the ecosystem they influence. It is our job to ensure that only accurate, normalized data flows through to our customers, at a velocity that keeps up with changes in the real world. As we rapidly expand the markets we serve and the breadth and depth of data we collect for our customers, the team must grow and scale to meet that demand.
This role sits at the intersection of distributed data engineering, entity matching, identity resolution, and large-scale healthcare data processing. You will lead a small team of engineers while remaining deeply hands-on technically, owning the systems and pipelines powering automatching, grouping logic, identity mapping, deduplication, and enrichment workflows processing tens of millions of records.
You will partner closely with Product, AI/ML, Analytics, and Engineering teams to improve platform accuracy, scalability, reliability, and operational efficiency across one of H1’s most critical data platforms.
You will:
- Lead the design, optimization, and scalability of distributed Spark/PySpark pipelines powering entity resolution and large-scale healthcare data processing.
- Own systems supporting automatching, identity mapping, grouping logic, deduplication, enrichment, and auto-approval workflows across healthcare provider and organization datasets.
- Build and maintain scalable processing frameworks for PubMed, clinical trial, ClinicalTrials.gov (ct.gov), conference, and other healthcare data sources.
- Drive infrastructure optimization initiatives focused on improving throughput, runtime, observability, and cloud compute cost efficiency.
- Partner closely with AI/ML teams to integrate matching and resolution models into Emerald and improve matching precision and recall.
- Lead complex technical initiatives from architecture and design through deployment, monitoring, and long-term production support.
- Serve as a technical leader and mentor across the team through code reviews, technical guidance, and engineering best practices.
- Collaborate directly with Product and business stakeholders to align technical solutions with operational and customer needs.
- Support production operations, incident response, troubleshooting, and ongoing platform reliability.
You bring strong hands-on engineering expertise across distributed computing, large-scale data processing, and infrastructure optimization while also helping guide technical direction and mentor engineers across the organization.
- Deep expertise with distributed data processing frameworks such as Apache Spark and Hadoop, particularly within AWS environments.
- Strong proficiency in Python (PySpark), Scala, Java, or other modern programming languages used for large-scale distributed processing.
- Experience building scalable ETL/ELT frameworks across both batch and streaming architectures.
- Experience with entity resolution, identity mapping, automatching, deduplication, or large-scale matching systems is strongly preferred.
- Strong understanding of distributed file formats including Apache Parquet and Apache Avro.
- Experience with streaming technologies such as Kafka, Spark Streaming, or KSQL.
- Strong grasp of software engineering fundamentals including distributed systems, data structures, concurrency, and system design.
- Experience performing root cause analysis across large-scale distributed systems and complex data pipelines.
- Ability to write clean, maintainable, modular, and production-grade code.
- Experience improving performance, scalability, observability, and infrastructure efficiency within distributed systems.
- Strong communication and collaboration skills across both technical and non-technical stakeholders.
- Familiarity with modern development and infrastructure tooling including Git, CI/CD pipelines, Docker, Kubernetes, Terraform, Argo, Hudi, and Jira.
- Demonstrated technical leadership experience mentoring engineers and driving complex technical initiatives.
- Extensive experience with Apache Spark and AWS-based big data technologies including EMR, S3, and distributed compute environments.
- Strong coding experience in Python (PySpark), Scala, Java, or equivalent languages used for distributed processing systems.
- Experience optimizing large-scale Spark workloads for performance, scalability, and infrastructure cost efficiency.
- Experience with streaming and event-driven architectures using technologies such as Kafka or Spark Streaming.
- Experience with orchestration and lakehouse technologies such as Argo and Hudi or comparable platforms.
- Experience with containerization and infrastructure technologies such as Docker, Kubernetes, and Terraform.
- Experience working with relational or distributed databases such as PostgreSQL or Redshift.
- Proven ability to operate effectively within highly scalable, production-grade distributed systems.
- Experience working with healthcare, life sciences, Real World Evidence (RWE), or large-scale healthcare datasets is strongly preferred.
Anticipated role close date: 8/1/2026