Senior Infrastructure Software Engineer

MULTI·ON

Software Engineering, Other Engineering
Palo Alto, CA, USA · United States
Posted on Jun 7, 2024
At MultiOn, we are looking for a few more A+ players who want to join a rocket ship startup.

About Us:

We believe in the power of Artificial Intelligence to revolutionize the way we interact with computers and the digital world. We're not just building a state-of-the-art personal AI Agent that brings the concept of JARVIS from science fiction into reality; we're also reshaping the future of software. Our mission is to pioneer Software 3.0, where software doesn't just respond to user commands, but also acts on the user's behalf.

We already hold the record for the first AI flight booking, food order, and workplace certification, and we believe this is just the beginning. We're a team of passionate innovators, technologists, and dreamers, and we are looking for extraordinary folks to join us on this exciting journey.

Our investors include General Catalyst, Amazon, Samsung, early investors in DeepMind, and select angels from the GTM and Research teams at OpenAI and Google.

Demo here: https://www.youtube.com/watch?v=Rm67ry6bogw

About the role:

As a Senior Infrastructure Software Engineer, you will be responsible for designing and building the infrastructure that supports the training and deployment of our machine learning models. This role demands a comprehensive understanding of the entire backend stack, from frameworks and compilers to runtimes and kernels. Additionally, you should be well-versed in tools and services commonly used in cloud-based environments, such as Kubernetes and Docker.

We seek a candidate with a deep curiosity about the fundamental workings of the Internet, GPUs, and computers, coupled with strong expertise in Linux and AI or GPU hardware. Proficiency in coding with Python, Go, or similar languages is essential. This high-visibility role requires comprehensive technical knowledge of bare-metal GPU orchestration, physical and logical networking, and Linux, along with basic project management skills.

Responsibilities:

  • Proactively identify opportunities to introduce innovative technology and automation solutions that enhance our infrastructure's efficiency, effectiveness, and scalability
  • Oversee the provisioning, monitoring, and maintenance of hardware, software, and networks in new data centers
  • Conduct architecture and research work for distributed AI workloads
  • Collaborate with vendors to acquire, debug, and maintain next-generation hardware and software optimized for our workloads
  • Partner with stakeholders to make strategic hardware decisions
  • Provide technical leadership and guidance during deployment activities
  • Develop and maintain comprehensive documentation, including plans, SOPs, MOPs, etc.

Qualifications:

  • Minimum of 5 years of experience in DevOps and production-grade software infrastructure
  • Advanced software development skills in C++, Go, Rust, or similar system languages
  • Proficiency in Python at an intermediate level
  • Extensive experience in maintaining production Linux systems, including the setup, management, and maintenance of networking, monitoring, and storage
  • Experience in Linux systems administration, preferably with contributions to open source projects
  • Strong expertise in network services, including REST APIs and HTTP
  • Significant experience in developing tooling and automation solutions
  • Knowledge of network fundamentals: subnetting, custom routing, firewalls, IPv6
  • Experience with continuous/rapid release engineering
  • Proficiency with infrastructure-as-code systems such as Terraform and Pulumi
  • Solid understanding of low-level operating systems concepts, including multi-threading, memory management, networking, storage, performance, and scale
  • Experience in managing a production-grade issue response process using tools like PagerDuty, ensuring adherence to uptime SLAs
  • Familiarity with Kubernetes, containerization, VPNs, and GPU workloads
  • GPU programming and CUDA knowledge are advantageous
  • Experience with machine learning frameworks such as PyTorch or TensorFlow

MultiOn Inc. is an equal-opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Note: Due to the high volume of applications we receive, we can only respond to applicants who have been selected for an interview.

Compensation Range: $150K - $200K