Site Reliability Engineer I

Backblaze · United States

$66,000-$88,000

Remote · Full-time · Infrastructure

Job Description

About Backblaze

Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we’re helping customers break free from the restrictive, overpriced legacy solutions that hold them back, and blaze forward with the full power of the open cloud in their hands.

Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we did a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $100 million in revenue and is the leading specialized storage cloud, managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries, including businesses, developers, IT professionals, and individuals. But while there is a lot to celebrate in our past, there is almost as much opportunity ahead of us.

We are seeking a Site Reliability Engineer I to join our team!

About the role

We are looking for a Site Reliability Engineer I to help support the stability, health, and day-to-day operations of Backblaze’s infrastructure.

This role serves as a first line of response for customer-affecting issues and production alerts, helping drive timely incident resolution, maintain service reliability, and support operational readiness across our environments. You will work closely with TechOps, Data Center Technicians, and other cross-functional teams to troubleshoot issues, monitor system health, support deployments and migrations, and improve day-to-day operational processes through documentation and automation.

The ideal candidate is technically curious, calm under pressure, eager to learn, and excited to grow in a hands-on infrastructure and reliability role.

What You’ll Do:

Act as the first point of contact for all customer-affecting issues.
Be a key driver in managing the resolution of technical problems.
Ensure that incident management processes are followed and that incident post-mortems are completed to capture process deviations and areas for improvement.
Deliver consistent communication to management.
Monitor Zabbix regularly and respond to Zabbix alerts, either by taking direct action or by escalating. Acknowledge every alert, noting either the direct action taken or the escalation point of contact.
Make sure escalations are handed off successfully.
Ensure the health of pods across all sites (define pod alerts in Zabbix).
Work through daily filesystem checks for pods.
Troubleshoot technical issues for DC Techs, including advanced pod questions, deployment questions, migration troubleshooting, and Ansible playbook issues.
Identify and escalate any potential network issues.
Perform Vault pre-deployment configuration and testing.
Start Vault migrations, monitor migration pods, and handle applicable migration pod health checks.
Document and work on automating daily items.
Document and provide network IPs for upcoming deployments.
Monitor releases and updates to the server farm, and escalate issues as they arise.
Participate in on-call rotation shifts.
Assist fellow TechOps team members in handling tasks.
Make recommendations for improving organizational productivity.
Be able to work outside of normal business hours (weekend shifts, holidays, and evenings) as needed.