Site Reliability Engineer

GoDaddy · Canada, Ontario

Remote · Full-time · Mid · Permanent · IT & DevOps

Job Description

GoDaddy is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our dynamic team. This role centers on automating and maintaining our storage infrastructure, primarily Ceph, to ensure the reliability, scalability, and performance of our systems.

Automate and maintain day-to-day operations of storage systems to support application demands
Develop and maintain tools and automation scripts to streamline storage operations and improve efficiency
Monitor system performance, identify issues, and implement solutions to ensure high availability and reliability
Participate in agile practices such as daily stand-up meetings, task tracking boards, design and code reviews, automated testing, and continuous integration and deployment
Continuously improve system reliability, performance, and capacity through proactive monitoring, automation, and optimization

Benefits

We offer a range of total rewards that may include paid time off, retirement savings (e.g., 401k, pension schemes), bonus/incentive eligibility, equity grants, participation in our employee stock purchase plan, competitive health benefits, and other family-friendly benefits including parental leave. GoDaddy’s benefits vary based on individual role and

Requirements

2+ years of professional experience with Ceph in a production environment, including deployment, configuration, and management of Ceph clusters and systems
2+ years of experience in site reliability engineering or a similar role
Experience working on Linux/Unix systems, with a focus on automation and operating at scale
Proficiency in Python or Bash
Experience with Ansible, Terraform, or SaltStack
Experience with Nagios-based monitoring tools, such as Icinga2
Experience with observability tooling, such as Prometheus, Grafana, Mimir, and Loki
Solid understanding of core networking concepts and protocols, particularly in relation to Linux/Unix systems

Experience with containerization and orchestration tools (e.g., Docker, Kubernetes)
Exposure to and experience working with compute platforms (e.g., OpenStack, AWS)
Familiarity with, and ability to contribute to, CI/CD pipelines and automation workflows