DevOps Engineer (AI Inference) – sg

Gcore · Singapore
Full-time · Engineering

Job Description

As a DevOps Engineer, you will be responsible for designing, deploying, and maintaining infrastructure and services that enable scalable and secure AI inference workloads on-premises.

What You Will Do

  • Design, develop, and maintain infrastructure for AI inference workloads, including GPU scheduling, model deployment pipelines, and data access patterns in on-prem environments
  • Build and manage monitoring and observability tools for AI inference platforms, including dashboards, alerts, and runbooks for model health and system performance
  • Collaborate with ML engineers and platform teams to design system architecture for AI workloads, integrate inference runtimes, and test performance at scale

This position is available only under an employment (labor) agreement in Singapore.

The world’s digital experiences run on something invisible: the infrastructure and software that keep them fast, reliable, and secure. At Gcore, you’ll help design and deliver that foundation for an AI-driven world.

We’re a global provider of infrastructure and software solutions for AI, cloud, network, and security, powering everything from real-time communication and streaming to enterprise AI and secure web applications. With 210+ edge locations, 50+ cloud regions, and thousands of GPUs, your work here can reach users and businesses across the globe.

You’ll collaborate with leading technology partners such as Intel, NVIDIA, Dell, and Equinix, and work on platforms that power digital products used around the world. Our vision is simple: to connect the world to AI, anywhere, anytime.

Want to work on technology that goes beyond a single product or industry? Join a global team of 550+ professionals building infrastructure and software that supports the entire digital ecosystem.

We are looking for a talented DevOps Engineer to join our AI Inference Operations Team.

Benefits

  • Competitive compensation
  • Flexible working hours and hybrid or remote options, depending on your role
  • Work from anywhere in the world for up to 45 days per year
  • Private medical insurance for you and your family*
  • Extra paid vacation and sick leave days*
  • Support for life’s important moments and celebrations
  • Language courses to help you connect and grow
  • Modern, welcoming offices with snacks, drinks, and entertainment*
  • Team sports and social activities*

Requirements

  • Strong understanding of Kubernetes architecture, including CNI, CSI, operators, ingress/gateway, and control plane components.
  • Hands-on experience operating and troubleshooting production Kubernetes clusters.
  • Strong Linux and networking troubleshooting skills, including DNS, routing, firewalling, TLS, MTU, connectivity and performance issues.
  • Ability to develop automation and operational tooling using Python, Go, or Bash.
  • Experience with Terraform, Ansible, or similar IaC/configuration management tools.
  • Experience with VictoriaMetrics/Grafana or similar monitoring, alerting, and troubleshooting tools.
  • Strong experience with Git-based workflows and CI/CD pipelines.
  • Familiarity with Cluster API or similar Kubernetes cluster lifecycle management technologies.
  • Hands-on operation or administration of Slurm clusters.
  • Knowledge of Argo CD, GitOps workflows, Helm, or Helmfile.
  • Background working with managed platforms, PaaS, or cloud services.
  • Exposure to bare metal, GPU, HPC, or other high-performance computing environments.
  • Familiarity with the NVIDIA GPU stack, RDMA/InfiniBand, or high-performance networking.
  • Knowledge of OpenStack or similar cloud infrastructure platforms.
  • Hands-on experience developing Kubernetes operators or controllers.