Senior Software Reliability Engineer (open to remote across ANZ)

Canva

Sydney, NSW, Australia
Posted on Friday, March 10, 2023
Join the team redefining how the world experiences design.
Hey, g'day, mabuhay, kia ora, 你好, hallo, vítejte!
Thanks for stopping by. We know job hunting can be a little time-consuming and you're probably keen to find out what's on offer, so we'll get straight to the point.
Where and how you can work
Our flagship campus is in Sydney. We also have a campus in Melbourne and co-working spaces in Brisbane, Perth and Adelaide. But you have choice in where and how you work. That means if you want to do your thing in the office (if you're near one), at home or a bit of both, it's up to you.
What you’d be doing in this role
As Canva scales, change continues to be part of our DNA. But we like to think that's all part of the fun. So this will give you a flavour of the type of things you'll be working on when you start, but it will likely evolve.
About the Reliability Platform Group
The Reliability Platform Group is responsible for providing the tools and processes to scale reliability across all Canva services. Our teams work together, and with other groups, to deliver preventive and detective tooling, processes and best practices that uplift Canva’s reliability. We do this by driving operational excellence, reducing the impact of incidents, and providing visibility and accountability across the broader Engineering community. The group encompasses Observability, Availability & Detection, Incident Response and Pre-Emption domains and is set to grow rapidly in the near future as we shoot for some ambitious goals.

Role Responsibilities

  • As an individual contributor, design and implement processes, tools, automation, and libraries that service teams can use to improve the reliability of the services they own. For instance, adding a new long-awaited feature in our circuit breaker library.
  • Introduce chaos engineering to Canva and conduct experiments to identify scenarios in which cascading failure might occur, and to verify that the reliability measures we introduce work as expected. For example: what happens when this newly introduced service goes down? Does the fallback for this rare failure actually work?
  • Work with product engineering teams to ensure reliability best practices and tools are rolled out in every service across the whole organization. It’s not enough to create a new throttling library; we want to make sure it’s successfully used in every service.
  • Foster a culture within the Engineering org that puts reliability first and establish processes and policies that drive reliability within product engineering teams. This includes things like SLAs, error budgets, on-call response, incident resolution and observability best practices.
  • Investigating production incidents in depth and applying the learnings to code.
  • Researching, developing, and justifying the best choices in the form of design docs for tools and processes that will shape the future of reliability at Canva.
  • Proposing new approaches and solutions to ensure we future-proof Canva’s distributed cloud infrastructure as we scale.
  • Participating in design meetings, hiring interviews, and code reviews.

Required Skills and Experience:

  • Five-plus (5+) years of commercial experience developing complex, distributed web applications.
  • Experience working with a mainstream programming language. Our services and libraries are primarily written in Java 13, so Java experience is a nice-to-have.
  • Solid understanding of resiliency techniques and patterns: load balancing, throttling, back pressure, circuit breaking, etc.
  • Disciplined coding practices, experience with code reviews and pull requests, and a creative and conceptual problem-solving approach.
  • Strong communication and team collaboration skills, both written and verbal. As a reliability engineer, you will need to share knowledge, and communicate and coordinate changes across multiple service teams.

Nice-to-haves (not required)

  • Experience working with microservice architectures in large distributed cloud environments (ideally AWS). We’re hosted on AWS and leverage the tools they provide as much as possible.
  • Experience with RPC frameworks such as Finagle, Thrift or gRPC will be a huge plus, but is not required. Understanding how services communicate with each other is crucial to finding out where a failure can occur.
  • Knowledge of networking protocols such as TCP, HTTP/2, WebSockets, etc. would be a big plus. The life of a request doesn’t start inside the backend web server, but rather in the browser of a user.
  • Previous experience working as a reliability/chaos engineer and/or strong knowledge of the Google SRE corpus et al.
What's in it for you?
Achieving our crazy big goals motivates us to work hard - and we do - but you'll experience lots of moments of magic, connectivity and fun woven throughout life at Canva, too. We also offer a stack of benefits to set you up for every success in and outside of work.
Here's a taste of what's on offer:
• Equity packages - we want our success to be yours too
• Inclusive parental leave policy that supports all parents & carers
• An annual Vibe & Thrive allowance to support your wellbeing, social connection, office setup & more
• Flexible leave options that empower you to be a force for good, take time to recharge and support you personally
Check out lifeatcanva.com for more info.
Other stuff to know
We make hiring decisions based on your experience, skills and passion, as well as how you can enhance Canva and our culture. When you apply, please tell us the pronouns you use and any reasonable adjustments you may need during the interview process.
Please note that interviews are conducted virtually.