About Backblaze:
Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we’re helping customers break free from the restrictive, overpriced legacy solutions that hold them back, and blaze forward with the full power of the open cloud in their hands.
Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we completed a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $100 million in revenue and is the leading specialized storage cloud – managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries, including businesses, developers, IT professionals, and individuals. But while there is a lot to celebrate in our past, there is almost as much opportunity ahead of us.
We’re seeking a Sr. Site Reliability Engineer to join our team!
About The Role:
We are seeking a Senior Site Reliability Engineer (SRE) to help ensure the stability, scalability, and reliability of our services and infrastructure.
This role focuses on building automation, maintaining observability, and supporting incident response to keep customer-facing systems performing at their best. The SRE will collaborate with engineering, product, and operations teams to embed reliability practices into day-to-day development and operations while contributing to tools and processes that improve efficiency and reduce manual effort.
What You’ll Do:
Service Reliability & Operations:
- Own and drive the availability, durability, and performance of critical services across all production environments.
- Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
- Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
- Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
- Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).
Automation & Tooling:
- Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
- Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
- Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
- Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.
Collaboration:
- Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
- Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
- Lead capacity planning and disaster recovery strategy across critical infrastructure components.
- Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
- Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.
Continuous Improvement:
- Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
- Proactively identify systemic, recurring issues, then design and drive the long-term improvements and action plans that resolve them.
- Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.
Equal Opportunity Employer. To understand more about the data we collect and process as part of your application, please view our Backblaze Employee Privacy Notice.