Engineering Manager in SRE

inDrive

This job is no longer accepting applications

Software Engineering, Other Engineering

Limassol, Cyprus

Posted on Monday, April 22, 2024

We are looking for an Engineering Manager in SRE who will be responsible for three teams: TechOps, Observability, and DutyShift.

Responsibilities

Managing the TechOps, Observability, and DutyShift teams: Lead and coordinate the work of these teams, including goal setting, resource planning, and employee development.
Ensuring system reliability: Take responsibility for the reliability of existing systems and products under development, and solve emerging business problems.
Developing incident management: Improve incident management processes, including developing and implementing effective methods for responding to incidents, preventing their recurrence, and automating incident management.
Developing monitoring tools and integrating them into development processes: Actively participate in developing and improving monitoring tools aimed at keeping the system stable. Work closely with the Cloud, DBA, NOC, Architects, and Product Developers teams to integrate these tools seamlessly into workflows and maintain high levels of application availability and performance.

Requirements

Technical skills: Experience with cloud providers (AWS, GCP), automation tools (SaltStack, Ansible, Terraform), monitoring systems (Prometheus, VictoriaMetrics, Alertmanager, Grafana, ELK/EFK), container orchestration (Kubernetes), and databases (MySQL).
Experience implementing SRE practices: Understanding and experience applying SRE methodologies and practices in product teams.
Management skills: Ability to manage teams, make responsible decisions, estimate deadlines, and deliver results.
Scaling experience: Experience with scaling high-load distributed systems and infrastructure.

Expectations from the candidate:

Effective incident management: Ability to build an incident management process, work with SLA/SLO/SLI metrics, and scale Observability across many services.
Development of a culture of reliability: Support and development of a culture of reliability in the organization through chaos engineering practices and failure drills.
Automation and optimization: Development and implementation of automated processes to improve team efficiency and ensure service stability.
Communication and Collaboration: Ability to effectively communicate and collaborate with adjacent teams and participate in the decision-making process.

We offer

Relocation to company offices in Cyprus;
Modern MacBook Pro and other equipment necessary for work;
Unlimited opportunities for professional and career growth, regular external and internal training from our partners;
Personal growth programs in which we set goals and move towards them together;
The chance to become part of an international team of professionals and just good people who together are creating one of the coolest success stories in the global IT industry.
