The Short Version
- At 22:19 UTC on May 19, Railway identified that Google Cloud had incorrectly suspended its production account via an automated action, with no advance notification
- Railway’s workloads on AWS and Railway Metal were physically running throughout the incident but became unreachable anyway
- The reason: Railway’s edge proxies depend on a GCP-hosted control plane API to populate routing tables; when cached routes expired, all workloads returned 404 errors regardless of where they ran
- Secondary cascade: GitHub rate-limited Railway’s OAuth due to retry volume, blocking logins and builds
- Full resolution came at 07:58 UTC on May 20, approximately eight hours after the outage began
- Railway has committed to removing GCP from the data plane’s hot path and extending control plane redundancy across AWS and Railway Metal
Railway, a cloud deployment platform, was offline for approximately eight hours on May 19-20, 2026, after Google Cloud incorrectly suspended Railway’s production account as part of an automated action. GCP restored account access within seven minutes of Railway filing an emergency support ticket, but restoring services took the rest of the night. The incident exposed a dependency that is worth understanding: Railway’s workloads on AWS and Railway Metal were running the entire time, but customers could not reach them.
Why Running Infrastructure Still Goes Down
Railway operates across multiple cloud providers. Customer workloads run on Google Cloud, AWS, and Railway’s own hardware (Railway Metal). On paper, this looks like meaningful redundancy. In practice, the May 19 incident revealed a single point of failure that made that redundancy irrelevant.
Every request to a Railway-hosted application is routed through edge proxies, servers that sit in front of customer workloads and direct traffic to the right destination. Those proxies need to know where each workload lives. That information comes from a routing control plane hosted on Google Cloud. When GCP suspended Railway’s account, the control plane became unavailable. Cached routing data kept the proxies working briefly, but roughly 35 minutes later the cache expired. From that point, every workload, including those on AWS and Railway Metal that were physically healthy, began returning 404 errors. There was no route to reach them.
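The failure mode described above can be sketched in a few lines. This is a minimal illustration, not Railway's proxy code: the `RouteCache` class, the `ControlPlaneDown` exception, the hostname, and the 35-minute TTL are all assumptions chosen to mirror the article's description. It shows a proxy that answers from cache while entries are fresh, returns 404 once they expire and the control plane cannot be reached, and, as a contrast, a hypothetical `serve_stale` option that keeps using the last known route.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


class ControlPlaneDown(Exception):
    """Raised when the (hypothetical) routing control plane cannot be reached."""


@dataclass
class RouteCache:
    fetch: Callable[[str], str]     # asks the control plane for a backend address
    ttl_seconds: float              # how long a cached route counts as fresh
    clock: Callable[[], float]      # injected so the example is deterministic
    serve_stale: bool = False       # if True, reuse expired routes when fetch fails
    _entries: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def lookup(self, host: str) -> Tuple[int, Optional[str]]:
        """Return (status, backend): 200 with a backend, or 404 if no route is usable."""
        now = self.clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[1] < self.ttl_seconds:
            return 200, cached[0]           # fresh cache hit
        try:
            backend = self.fetch(host)
        except ControlPlaneDown:
            if self.serve_stale and cached is not None:
                return 200, cached[0]       # degrade gracefully on stale data
            return 404, None                # no usable route: the workload looks gone
        self._entries[host] = (backend, now)
        return 200, backend


def simulate(serve_stale: bool) -> List[int]:
    """Replay the outage: one lookup before the suspension, two after it."""
    now = [0.0]
    control_plane_up = [True]

    def fetch(host: str) -> str:
        if not control_plane_up[0]:
            raise ControlPlaneDown(host)
        return "aws-backend-1"              # the workload itself stays healthy

    cache = RouteCache(fetch=fetch, ttl_seconds=35 * 60,
                       clock=lambda: now[0], serve_stale=serve_stale)
    statuses = [cache.lookup("app.example.com")[0]]      # t=0: fetched and cached
    control_plane_up[0] = False                          # control plane goes away
    now[0] = 10 * 60
    statuses.append(cache.lookup("app.example.com")[0])  # t=10 min: still fresh
    now[0] = 36 * 60
    statuses.append(cache.lookup("app.example.com")[0])  # t=36 min: entry expired
    return statuses


print(simulate(serve_stale=False))  # [200, 200, 404]
print(simulate(serve_stale=True))   # [200, 200, 200]
```

Serving stale routes is only one possible mitigation and carries its own risk, since a stale entry can point at a backend that has since moved. The sketch is meant to show why a healthy workload can still answer 404, not to prescribe a fix.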
The lesson is counterintuitive: distributing where your workloads run is not the same as distributing how traffic reaches them. Railway had multi-cloud compute. It did not have a multi-cloud control plane. One automated GCP action, incorrectly triggered, was enough to make that distinction decisive.
A secondary problem compounded the outage: the volume of failed login attempts and retries caused GitHub to rate-limit Railway’s authentication and build integrations, blocking users from logging in or triggering deployments even as other services came back online.
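A common way to keep retries from piling up into a rate limit is exponential backoff with jitter. The article does not say how Railway or its clients retried, so the sketch below is a general technique rather than a description of the incident; `backoff_delays` and its parameters are hypothetical names. Each retry waits a random time whose upper bound doubles until it reaches a cap, which spreads clients out instead of letting them retry in lockstep.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one randomized wait (in seconds) per retry attempt.

    Attempt n waits a random time in [0, min(cap, base * 2**n)], so the
    upper bound doubles on each attempt until it reaches the cap.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Upper bounds for 8 attempts with base=1 and cap=60: 1, 2, 4, 8, 16, 32, 60, 60
for attempt, delay in enumerate(backoff_delays(8, rng=random.Random(0))):
    print(f"retry {attempt}: wait up to {min(60.0, 2.0 ** attempt):.0f}s, chose {delay:.2f}s")
```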
How the Eight Hours Unfolded
The sequence was fast at the start and slow at the end. Automated monitoring flagged failures at 22:10 UTC. The root cause, the GCP account suspension, was identified nine minutes later, at 22:19. An emergency support ticket was filed at 22:22, and GCP restored account access at 22:29, seven minutes after the ticket was filed.
What followed was a staged, hours-long process of bringing systems back online without overwhelming infrastructure that had been abruptly shut down:
- 23:54 UTC: all storage volumes restored
- 01:30 UTC (May 20): compute and networking recovered
- 02:55 UTC: dashboard accessible
- 03:59 UTC: deployments processing again
- 07:58 UTC: incident fully resolved
Terms-of-service acceptance records were reset during recovery, requiring users to re-accept on next login.
What Railway Is Changing
Railway published a post-incident report acknowledging the architectural gap directly. The committed changes are: remove the hard dependency on the GCP-hosted routing control plane so traffic can be directed independently of any single cloud provider; extend database redundancy across AWS and Railway Metal; and remove Google Cloud from the critical path for live traffic entirely, keeping it only as a secondary or failover resource.
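To make the first of those commitments concrete, here is a minimal sketch of a proxy loading its routing table from several independently hosted sources and using the first one that answers. Railway's report, as summarized here, does not describe the design it will use; `load_routes`, the source names, and the sample addresses are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Tuple

RouteTable = Dict[str, str]               # hostname -> backend address
RouteSource = Callable[[], RouteTable]    # returns the table or raises on failure


def load_routes(sources: List[Tuple[str, RouteSource]]) -> Tuple[str, RouteTable]:
    """Try each named source in order and return the first table that loads."""
    errors = []
    for name, source in sources:
        try:
            return name, source()
        except Exception as exc:          # any failure means: try the next replica
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no routing source reachable: " + "; ".join(errors))


def gcp_source() -> RouteTable:
    # Stand-in for a control plane replica whose provider account is suspended.
    raise ConnectionError("account suspended")


def aws_source() -> RouteTable:
    # Stand-in for a second replica hosted on a different provider.
    return {"app.example.com": "10.0.0.7:8080"}


used, table = load_routes([("gcp", gcp_source), ("aws", aws_source)])
print(used, table)  # aws {'app.example.com': '10.0.0.7:8080'}
```

The sketch leaves out the hard part, which is keeping the replicas consistent with each other. It only shows that the proxy's lookup no longer has a single provider it must reach.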
Railway’s report concluded: “Your customers don’t care whether the failure was Google or Railway; they see your product.”
The broader business risk surfaced by this incident applies beyond Railway. Any company running multi-cloud infrastructure, or relying on a platform that does, should ask a direct question: if your account with your primary cloud provider were suspended tomorrow, how long would it take for traffic to route around it? The Railway incident shows that the answer depends less on how many clouds you run on and more on which cloud controls the layer that tells traffic where to go. That layer is often the last one to be made redundant.
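One rough way to ask that question of your own stack is to list every component on the request path alongside the providers that host it, then flag anything that lives on only one. The sketch below does that. The function name `single_provider_risks` and the sample layout are assumptions; in particular, the article does not say where Railway's edge proxies run, so that entry is illustrative.

```python
from typing import Dict, List, Set


def single_provider_risks(hot_path: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each provider to the hot-path components that exist only there.

    A component hosted by exactly one provider fails when that provider
    does, and takes the whole request path down with it.
    """
    risks: Dict[str, List[str]] = {}
    for component, providers in hot_path.items():
        if len(providers) == 1:
            (only,) = tuple(providers)
            risks.setdefault(only, []).append(component)
    return risks


# Illustrative layout loosely modeled on the article's description.
hot_path = {
    "edge proxies": {"aws", "metal", "gcp"},       # assumed, not stated in the article
    "routing control plane": {"gcp"},
    "customer workloads": {"aws", "metal", "gcp"},
}
risks = single_provider_risks(hot_path)
print(risks)  # {'gcp': ['routing control plane']}
```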
Natalia Nowak
Exploring the web hosting industry through writing - panels, providers, and everything that runs behind the scenes.