Cloudflare’s perspective of the October 30 OVHcloud outage

2024-10-30

4 min read

On October 30, 2024, cloud hosting provider OVHcloud (AS16276) suffered a brief but significant outage. According to their incident report, the problem started at 13:23 UTC, and was described simply as “An incident is in progress on our backbone infrastructure.” OVHcloud noted that the incident ended 17 minutes later, at 13:40 UTC. As a major global cloud hosting provider, some customers use OVHcloud as an origin for sites delivered by Cloudflare — if a given content asset is not in our cache for a customer’s site, we retrieve the asset from OVHcloud.

We observed traffic starting to drop at 13:21 UTC, just ahead of the reported start time. By 13:28 UTC, it was approximately 95% lower than pre-incident levels. Recovery appeared to start at 13:31 UTC, and by 13:40 UTC, the reported end time of the incident, it had reached approximately 50% of pre-incident levels.

Traffic from OVHcloud (AS16276) to Cloudflare

Cloudflare generally exchanges most of our traffic with OVHcloud over peering links. However, as shown below, peered traffic volume during the incident fell significantly. A small amount of traffic briefly began to flow between Cloudflare and OVHcloud over transit links instead, due to sudden changes in which Cloudflare data centers were receiving OVHcloud requests. (Peering is a direct connection between two network providers for the purpose of exchanging traffic. Transit is when one network pays an intermediary network to carry traffic to the destination network.)

Because we peer directly, we would normally expect to exchange most traffic over our private peering sessions with OVHcloud. Instead, we found that OVHcloud routing to Cloudflare dropped entirely for a few minutes, then switched to a single Internet Exchange port in Amsterdam, and finally normalized globally minutes later.

As the graphs below illustrate, we normally see the largest amount of traffic from OVHcloud in our Frankfurt and Paris data centers, as OVHcloud has large data center presences in those regions. However, during the shift to transit and to the Amsterdam Internet Exchange peering point, we saw a spike in traffic routed to our Amsterdam data center. We suspect these routing shifts were the earliest signs of either internal BGP reconvergence or general network recovery within AS16276, starting with their presence nearest our Amsterdam peering point.

The postmortem published by OVHcloud noted that the incident was caused by “an issue in a network configuration mistakenly pushed by one of our peering partner[s]” and that “We immediately reconfigured our network routes to restore traffic.” One possible explanation for the backbone incident is a BGP route leak by the mentioned peering partner: OVHcloud could have accepted a full Internet table from the peer, overwhelming their network or the peering partner’s network with traffic, or triggering unexpected internal BGP route updates within AS16276.
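One way to spot this kind of leak in BGP data is to check whether a lateral peer's advertisements carry AS paths through a large transit network, since a peer should not be providing reachability it learned from its upstream providers. A minimal sketch, with an illustrative (not exhaustive) set of Tier 1 ASNs and hypothetical routes, not an actual filter used by Cloudflare or OVHcloud:

```python
# Flag advertisements from a lateral peer whose AS path passes through a
# large transit network -- a sign the peer is leaking routes it learned
# upstream. The Tier 1 ASN set below is illustrative, not exhaustive.
TIER1_ASNS = {174, 1299, 2914, 3257, 3320, 3356, 6453, 6762}

def leaked_paths(advertisements: list[tuple[str, list[int]]]) -> list[str]:
    """Return prefixes whose AS path contains a Tier 1 transit ASN.

    advertisements: (prefix, as_path) pairs as received from the peer,
    with as_path ordered from the peer toward the origin.
    """
    return [
        prefix
        for prefix, as_path in advertisements
        if any(asn in TIER1_ASNS for asn in as_path)
    ]

# Example: the second route transits AS1299, so a lateral peer sending
# it to us is leaking a route learned from its transit provider.
routes = [
    ("203.0.113.0/24", [49981]),                # peer's own prefix: fine
    ("198.51.100.0/24", [49981, 1299, 64500]),  # learned via transit: leak
]
print(leaked_paths(routes))  # ['198.51.100.0/24']
```

In practice this check runs against BMP or route-collector feeds, and the "suspicious" ASN set is tailored to each peer's expected customer cone.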

Upon investigating what route leak may have caused this incident impacting OVHcloud, we found evidence of a maximum prefix-limit threshold being breached on our peering with Worldstream (AS49981) in Amsterdam. 

Oct 30 13:16:53  edge02.ams01 rpd[9669]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 141.101.65.53 (External AS 49981) changed state from Established to Idle (event PrefixLimitExceeded) (instance master)

As the number of received prefixes exceeded the limits configured for our peering session with Worldstream, the BGP session automatically entered an idle state. This prevented the route leak from impacting Cloudflare’s network. In analyzing BGP Monitoring Protocol (BMP) data from AS49981 prior to the automatic session shutdown, we were able to confirm Worldstream was sending advertisements with AS paths that contained their upstream Tier 1 transit provider.
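The automatic teardown in the log above corresponds to a per-neighbor prefix limit. A Junos-style configuration sketch is shown below; the group name, neighbor address, and maximum value are placeholders for illustration, not Cloudflare's actual configuration:

```
protocols {
    bgp {
        group ix-peers {
            neighbor 203.0.113.1 {
                peer-as 49981;
                family inet {
                    unicast {
                        prefix-limit {
                            maximum 5000;   /* expected prefix count plus headroom */
                            teardown;       /* drop the session to Idle when exceeded */
                        }
                    }
                }
            }
        }
    }
}
```

The limit is typically set with headroom above the peer's normal advertisement count, so routine growth does not trip it but a leaked full table does.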

During this time, we also detected over 500,000 BGP announcements from AS49981, as Worldstream was announcing routes to many of their peers, visible on Cloudflare Radar.

Worldstream later posted a notice on their status page, indicating that their network experienced a route leak, causing routes to be unintentionally advertised to all peers:

“Due to a configuration error on one of the core routers, all routes were briefly announced to all our peers. As a result, we pulled in more traffic than expected, leading to congestion on some paths. To address this, we temporarily shut down these BGP sessions to locate the issue and stabilize the network. We are sorry for the inconvenience.”

We believe Worldstream also leaked routes on an OVHcloud peering session in Amsterdam, which caused today’s impact.

Conclusion

Cloudflare has written about impactful route leaks before, and there are multiple methods available to prevent BGP route leaks from impacting your network. One is setting max prefix-limits for a peer, so the BGP session is automatically torn down when a peer sends more prefixes than expected. Other forward-looking measures are Autonomous System Provider Authorization (ASPA) for BGP, where Resource Public Key Infrastructure (RPKI) helps protect a network from accepting BGP routes with an invalid AS path, and RFC 9234, which prevents leaks by tying strict customer and provider relationships to BGP updates. For improved Internet resilience, we recommend that network operators follow the recommendations defined within MANRS for Network Operators.
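Two of the defenses above can be sketched in a few lines: a max prefix-limit that idles the session, and a simplified receive-side check in the style of RFC 9234's Only-to-Customer (OTC) attribute. The limit value, relationship labels, and ASNs here are illustrative, not any operator's real configuration:

```python
# A minimal sketch of two route-leak defenses: a max prefix-limit and a
# simplified RFC 9234 Only-to-Customer (OTC) receive-side check.
# The limit and ASNs below are illustrative.

MAX_PREFIXES = 1000  # illustrative per-peer limit

def session_state(received_prefix_count: int) -> str:
    """Tear the session down (Idle) once the peer exceeds its limit."""
    return "Idle" if received_prefix_count > MAX_PREFIXES else "Established"

def is_leak(route_otc, neighbor_asn: int, relationship: str) -> bool:
    """Simplified RFC 9234 check on a received update.

    relationship: how we classify the neighbor ('customer' or 'peer').
    route_otc: value of the OTC attribute on the update, or None.
    """
    if route_otc is None:
        return False
    if relationship == "customer":
        # A customer should never propagate a route already carrying OTC.
        return True
    if relationship == "peer":
        # A lateral peer may only send routes whose OTC it set itself.
        return route_otc != neighbor_asn
    return False

# A peer (AS49981) forwards a route whose OTC was set by its transit
# provider (AS3356): a leak under RFC 9234.
print(session_state(500_001))        # Idle: prefix limit exceeded
print(is_leak(3356, 49981, "peer"))  # True: leaked via a lateral peer
```

The real RFC 9234 mechanism also covers the sending side (stamping OTC on routes sent to peers and customers) and a role negotiation in the BGP OPEN message; this sketch shows only the receive-side leak decision.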

Cloudflare's connectivity cloud protects entire corporate networks, helps customers build Internet-scale applications efficiently, accelerates any website or Internet application, wards off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.

Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.

To learn more about our mission to help build a better Internet, start here. If you're looking for a new career direction, check out our open positions.



Source: https://blog.cloudflare.com/cloudflare-perspective-of-the-october-30-2024-ovhcloud-outage