What Really Caused the June 12 Internet Outage?
For 2 hours and 28 minutes, most of Cloudflare was down. Thanks to Google?

On June 12, 2025 at 17:52 UTC the first issue was noted:
2025-06-12 17:52 | INCIDENT START Cloudflare WARP team begins to see registrations of new devices fail, begins to investigate these failures, and declares an incident. |
How did this happen?
This outage was not caused by an attack or a security breach.
Cloudflare builds its own services on top of its own platform, and the big example here, and the reason this all happened, is Workers KV, Cloudflare’s key-value namespace storage solution. Why? Because part of Workers KV runs on Google Cloud. Yep, one of the most unstable cloud services out there sits underneath Cloudflare. Cloudflare has never publicly said exactly which part of KV uses Google Cloud, but whatever it was caused a 2 hour and 28 minute outage.
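If you have never touched Workers KV, here is a rough sketch of what using it from a Worker looks like. This is only an illustration: the binding name CONFIG_KV and the key are made up, and the types come from @cloudflare/workers-types. The point is that any uncached get() ultimately has to reach KV's central storage backend, which is the layer that went down.

```ts
// Minimal sketch of a Worker reading from a KV namespace binding.
// "CONFIG_KV" is a hypothetical binding name configured in wrangler.toml.
export interface Env {
  CONFIG_KV: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Reads are served from cache when possible; a cache miss has to reach
    // KV's origin storage backend. During the incident, uncached reads like
    // this one failed (KV returned 500/503) instead of returning a value.
    const value = await env.CONFIG_KV.get("feature-flags", { cacheTtl: 60 });

    if (value === null) {
      // Key simply does not exist in the namespace.
      return new Response("config not found", { status: 404 });
    }
    return new Response(value, {
      headers: { "content-type": "application/json" },
    });
  },
};
```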
2025-06-12 18:06 | Multiple service-specific incidents are combined into a single incident as we identify a shared cause (Workers KV unavailability). Incident priority upgraded to P1 |
What was affected
Like I said before, Cloudflare builds on top of its own platform, and many of those services rely on Workers KV. The following table details the impacted services.
Product/Service | Impact |
---|---|
Workers KV | Workers KV saw 90.22% of requests failing: any key-value pair that was not cached and required retrieving the value from Workers KV's origin storage backends resulted in failed requests with response code 503 or 500. The remaining requests were successfully served from Workers KV's cache (status code 200 and 404) or returned errors within our expected limits and/or error budget. This did not impact data stored in Workers KV. |
Access | Access uses Workers KV to store application and policy configuration along with user identity information. During the incident Access failed 100% of identity based logins for all application types including Self-Hosted, SaaS and Infrastructure. User identity information was unavailable to other services like WARP and Gateway during this incident. Access is designed to fail closed when it cannot successfully fetch policy configuration or a user’s identity (a sketch of this fail-closed pattern follows the table). Active Infrastructure Application SSH sessions with command logging enabled failed to save logs due to a Workers KV dependency. Access’ System for Cross-domain Identity Management (SCIM) service was also impacted due to its reliance on Workers KV and Durable Objects (which depended on KV) to store user information. During this incident, user identities were not updated due to Workers KV update failures. These failures would result in a 500 returned to identity providers. Some providers may require a manual re-synchronization, but most customers would have seen immediate service restoration once Access’ SCIM service was restored due to retry logic by the identity provider. Service authentication based logins (e.g. service token, Mutual TLS, and IP-based policies) and Bypass policies were unaffected. No Access policy edits or changes were lost during this time. |
Gateway | This incident did not affect most Gateway DNS queries, including those over IPv4, IPv6, DNS over TLS (DoT), and DNS over HTTPS (DoH). However, there were two exceptions: DoH queries with identity-based rules failed. This happened because Gateway couldn't retrieve the required user’s identity information. Authenticated DoH was disrupted for some users. Users with active sessions with valid authentication tokens were unaffected, but those needing to start new sessions or refresh authentication tokens could not. Users of Gateway proxy, egress, and TLS decryption were unable to connect, register, proxy, or log traffic. This was due to our reliance on Workers KV to retrieve up-to-date identity and device posture information. Each of these actions requires a call to Workers KV, and when unavailable, Gateway is designed to fail closed to prevent traffic from bypassing customer-configured rules. |
WARP | The WARP client was impacted due to core dependencies on Access and Workers KV, which are required for device registration and authentication. As a result, no new clients were able to connect or sign up during the incident. Existing WARP client users’ sessions that were routed through the Gateway proxy experienced disruptions, as Gateway was unable to perform its required policy evaluations. Additionally, the WARP emergency disconnect override was rendered unavailable because of a failure in its underlying dependency, Workers KV. Consumer WARP saw a similar sporadic impact as the Zero Trust version. |
Dashboard | Dashboard user logins and most of the existing dashboard sessions were unavailable. This was due to an outage affecting Turnstile, Durable Objects, Workers KV, and Access. The specific causes for login failures were: Standard Logins (User/Password): Failed due to Turnstile unavailability. Sign-in with Google (OIDC) Logins: Failed due to a KV dependency issue. SSO Logins: Failed due to a full dependency on Access. The Cloudflare v4 API was not impacted during this incident. |
Challenges and Turnstile | The Challenge platform that powers Cloudflare Challenges and Turnstile saw a high rate of failure and timeout for siteverify API requests during the incident window due to its dependencies on Workers KV and Durable Objects. We have kill switches in place to disable these calls in case of incidents and outages such as this. We activated these kill switches as a mitigation so that eyeballs are not blocked from proceeding. Notably, while these kill switches were active, Turnstile’s siteverify API (the API that validates issued tokens) could redeem valid tokens multiple times, potentially allowing for attacks where a bad actor might try to use a previously valid token to bypass. There was no impact to Turnstile’s ability to detect bots. A bot attempting to solve a challenge would still have failed the challenge and thus, not receive a token. |
Browser Isolation | Existing Browser Isolation sessions via Link-based isolation were impacted due to a reliance on Gateway for policy evaluation. New link-based Browser Isolation sessions could not be initiated due to a dependency on Cloudflare Access. All Gateway-initiated isolation sessions failed due to the Gateway dependency. |
Images | Batch uploads to Cloudflare Images were impacted during the incident window, with a 100% failure rate at the peak of the incident. Other uploads were not impacted. Overall image delivery dipped to around 97% success rate. Image Transformations were not significantly impacted, and Polish was not impacted. |
Stream | Stream’s error rate exceeded 90% during the incident window as video playlists were unable to be served. Stream Live observed a 100% error rate. Video uploads were not impacted. |
Realtime | The Realtime TURN (Traversal Using Relays around NAT) service uses KV and was heavily impacted. Error rates were near 100% for the duration of the incident window. The Realtime SFU service (Selective Forwarding Unit) was unable to create new sessions, although existing connections were maintained. This caused a reduction to 20% of normal traffic during the impact window. |
Workers AI | All inference requests to Workers AI failed for the duration of the incident. Workers AI depends on Workers KV for distributing configuration and routing information for AI requests globally. |
Pages & Workers Assets | Static assets served by Cloudflare Pages and Workers Assets (such as HTML, JavaScript, CSS, images, etc) are stored in Workers KV, cached, and retrieved at request time. Workers Assets saw an average error rate increase of around 0.06% of total requests during this time. During the incident window, Pages error rate peaked to ~100% and all Pages builds could not complete. |
AutoRAG | AutoRAG relies on Workers AI models for both document conversion and generating vector embeddings during indexing, as well as LLM models for querying and search. AutoRAG was unavailable during the incident window because of the Workers AI dependency. |
Durable Objects | SQLite-backed Durable Objects share the same underlying storage infrastructure as Workers KV. The average error rate during the incident window peaked at 22%, and dropped to 2% as services started to recover. Durable Object namespaces using the legacy key-value storage were not impacted. |
D1 | D1 databases share the same underlying storage infrastructure as Workers KV and Durable Objects. Similar to Durable Objects, the average error rate during the incident window peaked at 22%, and dropped to 2% as services started to recover. |
Queues & Event Notifications | Queues message operations, including pushing and consuming, were unavailable during the incident window. Queues uses KV to map each Queue to underlying Durable Objects that contain queued messages. Event Notifications use Queues as their underlying delivery mechanism. |
AI Gateway | AI Gateway is built on top of Workers and relies on Workers KV for client and internal configurations. During the incident window, AI Gateway saw error rates peak at 97% of requests until dependencies recovered. |
CDN | Automated traffic management infrastructure was operational but acted with reduced efficacy during the impact period. In particular, registration requests from Zero Trust clients increased substantially as a result of the outage. The increase in requests imposed additional load in several Cloudflare locations, triggering response from automated traffic management. In response to these conditions, systems rerouted incoming CDN traffic to nearby locations, reducing impact to customers. There was a portion of traffic that was not rerouted as expected and is under investigation. CDN requests impacted by this issue would experience elevated latency, HTTP 499 errors, and / or HTTP 503 errors. Impacted Cloudflare service areas included São Paulo, Philadelphia, Atlanta, and Raleigh. |
Workers / Workers for Platforms | Workers and Workers for Platforms rely on a third party service for uploads. During the incident window, Workers saw an overall error rate peak to ~2% of total requests. Workers for Platforms saw an overall error rate peak to ~10% of total requests during the same time period. |
Workers Builds (CI/CD) | Starting at 18:03 UTC Workers builds could not receive new source code management push events due to Access being down. 100% of new Workers Builds failed during the incident window. |
Browser Rendering | Browser Rendering depends on Browser Isolation for browser instance infrastructure. Requests to both the REST API and via the Workers Browser Binding were 100% impacted during the incident window. |
Zaraz | 100% of requests were impacted during the incident window. Zaraz relies on Workers KV configs for websites when handling eyeball traffic. Due to the same dependency, attempts to save updates to Zaraz configs were unsuccessful during this period, but our monitoring shows that only a single user was affected. |
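The Access and Gateway rows above both describe a "fail closed" design: when the KV lookup behind identity or device posture state cannot be completed, the request is denied rather than allowed through without policy checks. That is why logins failed 100% instead of quietly skipping enforcement. Below is a minimal, hypothetical sketch of that pattern; IDENTITY_KV, isAllowed, and evaluatePolicy are illustrative names, not Cloudflare's actual code.

```ts
// Hedged sketch of a fail-closed identity check built on a KV lookup.
// All names here are illustrative; this is not Cloudflare's implementation.
export interface Env {
  IDENTITY_KV: KVNamespace;
}

// Placeholder for a real policy engine.
function evaluatePolicy(identity: unknown): boolean {
  return identity !== null;
}

export async function isAllowed(env: Env, userId: string): Promise<boolean> {
  try {
    // Identity and device posture state lives behind a Workers KV read.
    const identity = await env.IDENTITY_KV.get(`identity:${userId}`, "json");
    if (identity === null) {
      return false; // unknown user: deny
    }
    return evaluatePolicy(identity); // normal path
  } catch {
    // KV unreachable (as on June 12): deny rather than let traffic bypass
    // customer-configured rules. Failing open here would be a security hole.
    return false;
  }
}
```

The trade-off is exactly what the table shows: when the dependency is down, availability is sacrificed to preserve the security guarantee.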
Incident Timeline

(Figure: Workers KV error rates to storage infrastructure. 91% of requests to KV failed during the incident window.)

(Figure: Cloudflare Access percentage of successful requests. Cloudflare Access relies directly on Workers KV and serves as a good proxy to measure Workers KV availability over time.)
Timeline
Time | Event |
---|---|
2025-06-12 17:52 | INCIDENT START Cloudflare WARP team begins to see registrations of new devices fail, begins to investigate these failures, and declares an incident. |
2025-06-12 18:05 | Cloudflare Access team received an alert due to a rapid increase in error rates. Service Level Objectives for multiple services drop below targets and trigger alerts across those teams. |
2025-06-12 18:06 | Multiple service-specific incidents are combined into a single incident as we identify a shared cause (Workers KV unavailability). Incident priority upgraded to P1. |
2025-06-12 18:21 | Incident priority upgraded to P0 from P1 as severity of impact becomes clear. |
2025-06-12 18:43 | Cloudflare Access begins exploring options to remove the Workers KV dependency by migrating to a different backing datastore with the Workers KV engineering team. This was a proactive measure in case the storage infrastructure continued to be down. |
2025-06-12 19:09 | Zero Trust Gateway began working to remove dependencies on Workers KV by gracefully degrading rules that referenced Identity or Device Posture state. |
2025-06-12 19:32 | Access and Device Posture force drop identity and device posture requests to shed load on Workers KV until third-party service comes back online. |
2025-06-12 19:45 | Cloudflare teams continue to work on a path to deploying a Workers KV release against an alternative backing datastore and having critical services write configuration data to that store. |
2025-06-12 20:23 | Services begin to recover as storage infrastructure begins to recover. We continue to see a non-negligible error rate and infrastructure rate limits due to the influx of services repopulating caches. |
2025-06-12 20:25 | Access and Device Posture restore calling Workers KV as third-party service is restored. |
2025-06-12 20:28 | IMPACT END Service Level Objectives return to pre-incident level. Cloudflare teams continue to monitor systems to ensure services do not degrade as dependent systems recover. |
| INCIDENT END Cloudflare teams see all affected services return to normal function. Service Level Objective alerts recover. |
What Cloudflare Will Change
Quoted from: Cloudflare service outage June 12, 2025
“We’re taking immediate steps to improve the resiliency of services that depend on Workers KV and our storage infrastructure. This includes existing planned work that we are accelerating as a result of this incident.”
“This encompasses several workstreams, including efforts to avoid singular dependencies on storage infrastructure we do not own, improving the ability for us to recover critical services (including Access, Gateway and WARP)”
Specifically:
(Actively in-flight): Bringing forward our work to improve the redundancy within Workers KV’s storage infrastructure, removing the dependency on any single provider. During the incident window we began work to cut over and backfill critical KV namespaces to our own infrastructure, in the event the incident continued.
(Actively in-flight): Short-term blast radius remediations for individual products that were impacted by this incident so that each product becomes resilient to any loss of service caused by any single point of failure, including third party dependencies.
(Actively in-flight): Implementing tooling that allows us to progressively re-enable namespaces during storage infrastructure incidents. This will allow us to ensure that key dependencies, including Access and WARP, are able to come up without risking a denial-of-service against our own infrastructure as caches are repopulated.
“This list is not exhaustive: our teams continue to revisit design decisions and assess the infrastructure changes we need to make in both the near (immediate) term and long term to mitigate incidents like this going forward.”
“This was a serious outage, and we understand that organizations and institutions that are large and small depend on us to protect and/or run their websites, applications, zero trust and network infrastructure. Again we are deeply sorry for the impact and are working diligently to improve our service resiliency.”
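To picture what "removing the dependency on any single provider" could look like in practice, here is a purely illustrative sketch, not Cloudflare's implementation: try the current backing store, fall back to a replica on infrastructure you own, and for reads serve a last-known-good value if both are down. The names readWithFallback, primary, and secondary are assumptions made up for this example.

```ts
// Illustrative sketch of a read path with no single backing provider
// as a hard dependency. Not Cloudflare's actual design.
type Fetcher = (key: string) => Promise<string | null>;

export async function readWithFallback(
  key: string,
  primary: Fetcher,   // e.g. the current third-party backing store
  secondary: Fetcher, // e.g. a replica on infrastructure you control
  staleCache: Map<string, string>, // last-known-good values
): Promise<string | null> {
  for (const source of [primary, secondary]) {
    try {
      const value = await source(key);
      if (value !== null) {
        staleCache.set(key, value); // refresh the last-known-good copy
        return value;
      }
    } catch {
      // This provider is down; try the next one instead of failing the read.
    }
  }
  // Both providers unavailable: for reads, a stale answer beats a 503.
  return staleCache.get(key) ?? null;
}
```

The harder part, which the third workstream above hints at, is bringing namespaces back progressively so the rush to repopulate caches does not overload the infrastructure all over again.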
Good Guys Cloudflare
In the blog post, Cloudflare took full responsibility for the outage, but it wasn't really their fault. The only reason this happened was Google Cloud. Most companies at this size would want to blame it on the service that caused it, so Google Cloud, but Cloudflare did not. It was more like 60% Google, 30% Cloudflare, because Cloudflare does have the single point of failure, but they took responsibility for their actions.
We all hope that none of this happens again.