Site Reliability Engineer (Eastern U.S. or EU - remote)
AuthZed
We’re pioneering open-source authorization solutions for scaling businesses tackling complex end-user permissions in zero-trust architectures. Our focus is on providing SpiceDB—the most mature open-source permissions database inspired by Google’s Zanzibar system—and building managed services that enable planet-scale production authorization services.
Our strategic approach to capital-raising has allowed us to make efficient use of our $3.9M seed round and recently secure a $12M Series A. This funding has allowed us to further develop SpiceDB, now the open-source standard in authorization database technology, fortify our reputation as authorization experts, accelerate our open-source community growth, and scale revenue with robust enterprise products.
AuthZed is a fully remote company with employees across the US and Europe. We’re a hardworking group with a software-driven culture; even our sales team understands and loves our technology! We bring integrity to all our interactions, fostering confidence in decision making - trusting and respecting each voice on our team, every day.
Company Values
- Agency
- Everyone should have the capability, freedom, and confidence to bring about changes to our business and product. Organizational processes exist to clearly define our goals, not to restrict how progress is made.
- Collaboration
- Success is defined in various dimensions and no single person can be an expert in all of them. Without valuing the opinions of others, finding compromises, and sharing mutual trust and respect, you cannot arrive at the best possible solution.
- Open-mindedness
- Without asking questions, testing assumptions, and questioning our pre-existing biases, we risk operating within an echo chamber. We celebrate the representation of diverse perspectives and backgrounds as a catalyst for creating an inclusive work environment that everyone can appreciate.
The Role
We are seeking a Site Reliability Engineer to join our startup in the infrastructure and authorization space. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, availability, and performance of our systems. You will be responsible for designing, implementing, automating, and maintaining scalable infrastructure solutions to support our growing customer base. This is an exciting opportunity to work in a fast-paced environment and contribute to the success of a company bringing a Google-inspired authorization system to organizations around the globe.
What You’ll Do
- Design, implement, and maintain highly available and scalable infrastructure solutions for our projects, products, and customers.
- Write high-quality, maintainable code to build automation tools, scripts, and frameworks that improve system reliability and streamline operations.
- Automate infrastructure deployment, configuration management, and operational processes via Infrastructure as Code (IaC) and Kubernetes Operators.
- Monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
- Improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
- Troubleshoot and resolve complex infrastructure and application issues in production and test environments.
- Collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
- Participate in on-call rotation and respond to production incidents in a timely manner.
- Document system configurations, troubleshooting procedures, and operational guidelines.
What You Bring
- Proven experience as a Site Reliability Engineer, Software Engineer, or in a similar role.
- Strong programming skills and proficiency in at least one modern programming language (e.g., JavaScript (Node.js), Java, Python, or Go). Experience with multiple programming languages is a plus.
- Demonstrated ability to write production-quality tools and software that improve the reliability and scalability of services, automate operations, and improve development productivity.
- Strong understanding of networking, operating systems, and cloud infrastructure.
- Experience with site reliability engineering, system design, and distributed computing.
- Hands-on experience with containerization technologies such as Docker and Kubernetes.
- Proficiency with infrastructure-as-code tools such as Terraform or Pulumi.
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Experience with at least one cloud provider (AWS, GCP, Azure).
- Experience with lower-level implementation details of relational databases (bonus if you have experience with distributed SQL databases like Google Cloud Spanner or CockroachDB).
- Experience with version control systems like Git and GitHub, and working within CI/CD pipelines.
- Strong problem-solving and troubleshooting skills.
- Excellent communication and collaboration abilities.
- Track record of delivering solutions that drive business outcomes.
- A passion for automating everything and ensuring the reliability and performance of software systems.
Benefits
- Opportunities to work with cutting-edge technology in a growing sector.
- A supportive environment where your ideas lead to real impacts.
- Competitive salary based on experience.
- Stock options at an early-stage startup.
- Comprehensive benefits including healthcare (in the US) and other insurance.
- This role is fully remote, with flexible working hours to accommodate different time zones. You’ll also get to enjoy periodic travel for bi-yearly team on-sites, where we focus on team bonding, collaboration, and having fun together!
Join a supportive and innovative team with a remote-first culture, where your contributions directly impact our growth and success.
Given our background, we build our products on a foundation of open-source, cloud-native solutions.
We've given some webinars discussing parts of our stack:
- Building a managed database service with Kubernetes Operators
- Running Low-latency Workloads on Kubernetes
Here are some keywords:
- Go
- TypeScript
- Kubernetes
- Kubernetes Operators
- Next.js
- Pulumi
- CockroachDB
- Cloud Spanner
- PostgreSQL
- Prometheus
- Thanos
- Argo CD