Sr SRE / Infrastructure Engineer - Cloud Infra Team, AWS, Azure, Python
Eightfold
This job is no longer accepting applications
Software Engineering, Other Engineering
Santa Clara, CA, USA
USD 154k-181k / year + Equity
- Design and manage secure, scalable cloud environments on AWS and Azure.
- Ensure environments meet strict standards for availability, reliability, scalability, observability, security, and cost-effectiveness.
- Design, improve, and support highly distributed and large-scale systems used by millions of users, processing terabytes of data.
- Automate infrastructure deployment, configuration, and disaster recovery using Shell Scripting, Ansible, and Terraform.
- Implement cloud-native solutions using Docker, Kubernetes, and other containerization technologies.
- Apply best practices for cloud security, including identity management, access control, and encryption.
- Collaborate with cross-functional teams to deliver scalable and secure infrastructure solutions.
- Support global customers, including governments and large enterprises, ensuring compliance with security standards.
- Monitor, maintain, and improve the reliability, availability, and performance of production systems.
- Participate in the daytime on-call rotation, triaging and resolving production incidents.
- Execute and enhance automation, monitoring, alerting, and operational tooling to minimize manual intervention.
- Follow and improve runbooks and escalation paths to ensure consistent incident management.
- Collaborate closely with global SRE and development teams to implement best practices for system reliability.
- Contribute to post-incident reviews, drive root cause analysis, and implement preventive solutions.
- Support capacity planning, change management, and infrastructure upgrades.
- Advocate for operational excellence by identifying process gaps and driving continuous improvement.
- A proactive mindset, strong ownership, and a commitment to delivering high-quality work on time.
- 3+ years of experience in CloudOps or Site Reliability Engineering (SRE) with hands-on experience in cloud infrastructure.
- Solid understanding of large-scale distributed systems, including real-time and async data processing, search, and DB systems.
- Proficiency in managing cloud environments, specifically AWS and Azure.
- Experience with automation tools, including Terraform, CloudFormation, Python, Shell Scripting, and Ansible.
- Demonstrable skills in Python scripting and triaging production issues using CloudWatch and other debugging tools.
- Proven experience providing 24/7 on-call support and incident management for critical production infrastructure.
- Excellent problem-solving and troubleshooting skills.
- Effective communication skills, both verbal and written.
- Hands-on experience with Prometheus and Grafana is a plus.
- Hands-on experience with infrastructure-as-code (Terraform, CloudFormation, Ansible, or similar).
- Experience in load testing using Locust is a plus.
- Experience supporting highly restricted production environments (e.g., federal government customers) is a plus.
- Bachelor's and/or Master's degree in Computer Science or an equivalent engineering discipline.