How Workers powers our internal maintenance scheduling pipeline
Cloudflare built an automated maintenance scheduler on Workers to coordinate data center maintenance across more than 330 cities worldwide. By monitoring network state in real time and analyzing customer routing rules, the system avoids service disruptions caused by taking critical equipment offline simultaneously. With graph processing, an efficient data-fetching pipeline, and historical analysis optimized with Thanos and Parquet, Cloudflare balances network growth with service reliability.

2025-12-22

1 min read

Cloudflare has data centers in over 330 cities globally, so you might think we could take a few offline at any time for planned data center operations without users noticing. However, the reality is that disruptive maintenance requires careful planning, and as Cloudflare grew, managing these complexities through manual coordination between our infrastructure and network operations specialists became nearly impossible.

It is no longer feasible for a human to track every overlapping maintenance request or account for every customer-specific routing rule in real time. We reached a point where manual oversight alone couldn't guarantee that a routine hardware update in one part of the world wouldn't inadvertently conflict with a critical path in another.

We realized we needed a centralized, automated "brain" to act as a safeguard — a system that could see the entire state of our network at once. By building this scheduler on Cloudflare Workers, we created a way to programmatically enforce safety constraints, ensuring that no matter how fast we move, we never sacrifice the reliability of the services on which our customers depend.

In this blog post, we’ll explain how we built it, and share the results we’re seeing now.

Building a system to de-risk critical maintenance operations

Picture an edge router that acts as one of a small, redundant group of gateways that collectively connect the public Internet to the many Cloudflare data centers operating in a metro area. In a populated city, we need to ensure that the multiple data centers sitting behind this small cluster of routers do not get cut off because the routers were all taken offline simultaneously. 

Another maintenance challenge comes from our Zero Trust product, Dedicated CDN Egress IPs, which allows customers to choose specific data centers from which their user traffic will exit Cloudflare and be sent to their geographically close origin servers for low latency. (For the purpose of brevity in this post, we'll refer to the Dedicated CDN Egress IPs product as "Aegis," which was its former name.) If all the data centers a customer chose are offline at once, they would see higher latency and possibly 5xx errors, which we must avoid. 

Our maintenance scheduler solves problems like these. We can make sure that we always have at least one edge router active in a certain area. And when scheduling maintenance, we can see if the combination of multiple scheduled events would cause all the data centers for a customer’s Aegis pools to be offline at the same time.

Before we created the scheduler, these simultaneous disruptive events could cause downtime for customers. Now, our scheduler notifies internal operators of potential conflicts, allowing us to propose a new time to avoid overlapping with other related data center maintenance events.

We define these operational scenarios, such as edge router availability and customer rules, as maintenance constraints, which allow us to plan safer, more predictable maintenance.

Maintenance constraints

Every constraint starts with a set of proposed maintenance items, such as a network router or list of servers. We then find all the maintenance events in the calendar that overlap with the proposed maintenance time window.
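
Finding those overlaps is a standard interval-intersection check. Here is a minimal sketch in TypeScript, with a hypothetical event shape (the field names are illustrative, not our production schema):

// Hypothetical shape of a calendar entry.
interface MaintenanceEvent {
  id: string
  start: Date
  end: Date
}

// Two windows overlap when each one starts before the other ends.
function overlappingEvents(
  proposed: { start: Date; end: Date },
  calendar: MaintenanceEvent[],
): MaintenanceEvent[] {
  return calendar.filter(
    (event) => event.start < proposed.end && proposed.start < event.end,
  )
}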

Next, we aggregate data from product APIs, such as the list of Aegis customer IP pools. Aegis returns a set of IP ranges for which a customer requested egress out of specific data center IDs, shown below.

[
    {
      "cidr": "104.28.0.32/32",
      "pool_name": "customer-9876",
      "port_slots": [
        {
          "dc_id": 21,
          "other_colos_enabled": true
        },
        {
          "dc_id": 45,
          "other_colos_enabled": true
        }
      ],
      "modified_at": "2023-10-22T13:32:47.213767Z"
    }
]

In this scenario, data center 21 and data center 45 relate to each other because we need at least one data center online for the Aegis customer 9876 to receive egress traffic from Cloudflare. If we tried to take data centers 21 and 45 down simultaneously, our coordinator would alert us that there would be unintended consequences for that customer workload.
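
Conceptually, the check boils down to this: if the proposed set of offline data centers covers every port slot in a customer’s pool, flag a conflict. A minimal sketch against the response shape above (the types and function are illustrative, not the production implementation):

interface PortSlot {
  dc_id: number
  other_colos_enabled: boolean
}

interface AegisPool {
  cidr: string
  pool_name: string
  port_slots: PortSlot[]
}

// Returns the pools that would lose every data center if `offlineDCs`
// were taken down at the same time.
function poolsFullyOffline(pools: AegisPool[], offlineDCs: Set<number>): AegisPool[] {
  return pools.filter((pool) =>
    pool.port_slots.every((slot) => offlineDCs.has(slot.dc_id)),
  )
}

For the example above, poolsFullyOffline(pools, new Set([21, 45])) would flag customer-9876’s pool, while taking down only data center 21 would flag nothing.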

Our initial, naive solution loaded all data into a single Worker: all server relationships, product configurations, and the product and infrastructure health metrics needed to compute constraints. Even in our proof-of-concept phase, we ran into “out of memory” errors.

We needed to be more cognizant of Workers’ platform limits. This required loading only as much data as was absolutely necessary to process the constraint’s business logic. If a maintenance request for a router in Frankfurt, Germany, comes in, we almost certainly do not care what is happening in Australia since there is no overlap across regions. Thus, we should only load data for neighboring data centers in Germany. We needed a more efficient way to process relationships in our dataset.

Graph processing on Workers

As we looked at our constraints, a pattern emerged where each constraint boiled down to two concepts: objects and associations. In graph theory, these components are known as vertices and edges, respectively. An object could be a network router and an association could be the list of Aegis pools in the data center that requires the router to be online. We took inspiration from Facebook’s TAO research paper to establish a graph interface on top of our product and infrastructure data. The API looks like the following:

type ObjectID = string

interface MainTAOInterface<TObject, TAssoc, TAssocType> {
  // Fetch a single object, such as a router or a data center, by ID.
  object_get(id: ObjectID): Promise<TObject | undefined>

  // Stream the typed associations originating from id1.
  assoc_get(id1: ObjectID, atype: TAssocType): AsyncIterable<TAssoc>

  // Count associations of a given type without loading them.
  assoc_count(id1: ObjectID, atype: TAssocType): Promise<number>
}

The core insight is that associations are typed. For example, a constraint would call the graph interface to retrieve Aegis product data.

async function constraint(c: AppContext, aegis: TAOAegisClient, datacenters: string[]): Promise<Record<string, PoolAnalysis>> {
  const datacenterEntries = await Promise.all(
    datacenters.map(async (dcID) => {
      const iter = aegis.assoc_get(c, dcID, AegisAssocType.DATACENTER_INSIDE_AEGIS_POOL)
      const pools: string[] = []
      for await (const assoc of iter) {
        pools.push(assoc.id2)
      }
      return [dcID, pools] as const
    }),
  )

  const datacenterToPools = new Map<string, string[]>(datacenterEntries)
  const uniquePools = new Set<string>()
  for (const pools of datacenterToPools.values()) {
    for (const pool of pools) uniquePools.add(pool)
  }

  const poolTotalsEntries = await Promise.all(
    [...uniquePools].map(async (pool) => {
      const total = await aegis.assoc_count(c, pool, AegisAssocType.AEGIS_POOL_CONTAINS_DATACENTER)
      return [pool, total] as const
    }),
  )

  const poolTotals = new Map<string, number>(poolTotalsEntries)
  const poolAnalysis: Record<string, PoolAnalysis> = {}
  for (const [dcID, pools] of datacenterToPools.entries()) {
    for (const pool of pools) {
      // Accumulate every affected data center so a pool touched by multiple
      // data centers in this maintenance window is counted once.
      const existing = poolAnalysis[pool]
      if (existing) {
        existing.affectedDatacenters.add(dcID)
      } else {
        poolAnalysis[pool] = {
          affectedDatacenters: new Set([dcID]),
          totalDatacenters: poolTotals.get(pool) ?? 0,
        }
      }
    }
  }

  return poolAnalysis
}

We use two association types in the code above:

  1. DATACENTER_INSIDE_AEGIS_POOL, which retrieves the Aegis customer pools that a data center resides in.

  2. AEGIS_POOL_CONTAINS_DATACENTER, which retrieves the data centers an Aegis pool needs to serve traffic.

The associations are inverted indices of one another. The access pattern is exactly the same as before, but now the graph implementation has much more control over how much data it queries. Before, we needed to load all Aegis pools into memory and filter inside constraint business logic. Now, we can fetch only the data that matters to the application.
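
The PoolAnalysis shape returned by the constraint above is assumed to look roughly like the following, and turning it into a decision is then a one-liner: a pool is at risk when the proposed maintenance would take down every data center it can use.

interface PoolAnalysis {
  affectedDatacenters: Set<string>
  totalDatacenters: number
}

// Flag pools where the proposed maintenance covers every data center
// the customer selected for egress.
function violatingPools(analysis: Record<string, PoolAnalysis>): string[] {
  return Object.entries(analysis)
    .filter(([, pool]) => pool.totalDatacenters > 0 && pool.affectedDatacenters.size >= pool.totalDatacenters)
    .map(([poolName]) => poolName)
}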

The interface is powerful because our graph implementation can improve performance behind the scenes without complicating the business logic. This lets us use the scalability of Workers and Cloudflare’s CDN to fetch data from our internal systems very quickly.

Fetch pipeline

We switched to the new graph implementation, sending more targeted API requests. Response sizes dropped by 100x overnight as we went from loading a few massive requests to making many tiny ones.

While this solves the issue of loading too much into memory, it creates a subrequest problem: instead of a few large HTTP requests, we now make an order of magnitude more small requests. Overnight, we started consistently breaching subrequest limits.

In order to solve this problem, we built a smart middleware layer between our graph implementation and the fetch API.

export const fetchPipeline = new FetchPipeline()
  .use(requestDeduplicator())
  .use(lruCacher({
    maxItems: 100,
  }))
  .use(cdnCacher())
  .use(backoffRetryer({
    retries: 3,
    baseMs: 100,
    jitter: true,
  }))
  .handler(terminalFetch);

If you’re familiar with Go, you may have seen the singleflight package before. We took inspiration from this idea: the first middleware component in the fetch pipeline deduplicates in-flight HTTP requests, so concurrent callers wait on the same Promise instead of producing duplicate requests in the same Worker. Next, we use a lightweight Least Recently Used (LRU) cache to internally cache requests we have already seen.
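
A singleflight-style deduplicator can be sketched in a few lines. This is an illustration of the idea rather than our production middleware, and the middleware signature and URL-only keying are assumptions:

// Share one in-flight Promise per URL so concurrent callers inside the same
// Worker reuse the same fetch instead of issuing duplicates.
function requestDeduplicator() {
  const inflight = new Map<string, Promise<Response>>()

  return async (request: Request, next: (req: Request) => Promise<Response>): Promise<Response> => {
    if (request.method !== 'GET') return next(request)

    const key = request.url
    const existing = inflight.get(key)
    if (existing) {
      // Clone so every waiter gets its own readable body.
      return (await existing).clone()
    }

    const promise = next(request).finally(() => inflight.delete(key))
    inflight.set(key, promise)
    return (await promise).clone()
  }
}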

Once both of those are complete, we use Cloudflare’s caches.default.match function to cache all GET requests in the region where the Worker is running. Since we have multiple data sources with different performance characteristics, we choose time to live (TTL) values carefully. For example, real-time data is cached for only 1 minute. Relatively static infrastructure data can be cached for 1–24 hours, depending on the type of data. Power management data changes manually and infrequently, so we can cache it for longer at the edge.
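
The CDN-caching layer builds on the Workers Cache API. A simplified sketch, where the middleware signature and the single TTL parameter are illustrative:

// Cache successful GET responses in the colo-local cache; the TTL is set via
// the Cache-Control header we attach before calling cache.put.
function cdnCacher(ttlSeconds = 60) {
  return async (request: Request, next: (req: Request) => Promise<Response>): Promise<Response> => {
    if (request.method !== 'GET') return next(request)

    const cache = caches.default
    const cached = await cache.match(request)
    if (cached) return cached

    const response = await next(request)
    if (response.ok) {
      // Store a mutable copy with an explicit TTL; the caller gets the original.
      const copy = new Response(response.clone().body, response)
      copy.headers.set('Cache-Control', `max-age=${ttlSeconds}`)
      await cache.put(request, copy)
    }
    return response
  }
}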

In addition to those layers, we have standard exponential backoff with retries and jitter. This helps reduce wasted fetch calls when a downstream resource is temporarily unavailable. By backing off slightly, we increase the chance that the next request succeeds. Conversely, if the Worker sends requests constantly without backoff, it can easily breach the subrequest limit when the origin starts returning 5xx errors.
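
The retry layer is conventional exponential backoff with jitter, sketched here with the same configuration values as above (the middleware signature is again an assumption):

// Retry 5xx responses and network errors with exponential backoff and jitter,
// so a briefly unavailable origin does not burn through the subrequest budget.
function backoffRetryer({ retries = 3, baseMs = 100, jitter = true } = {}) {
  const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

  return async (request: Request, next: (req: Request) => Promise<Response>): Promise<Response> => {
    let lastError: unknown
    for (let attempt = 0; attempt <= retries; attempt++) {
      try {
        const response = await next(request.clone())
        if (response.status < 500) return response
        lastError = new Error(`upstream returned ${response.status}`)
      } catch (err) {
        lastError = err
      }
      if (attempt < retries) {
        const delay = baseMs * 2 ** attempt
        await sleep(jitter ? Math.random() * delay : delay)
      }
    }
    throw lastError
  }
}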

Putting it all together, we saw a ~99% cache hit rate. Cache hit rate is the percentage of HTTP requests served from Cloudflare’s fast cache memory (a "hit") versus slower requests to data sources running in our control plane (a "miss"), calculated as hits / (hits + misses). A high rate means better HTTP request performance and lower costs, because querying data from cache in our Worker is an order of magnitude faster than fetching from an origin server in a different region. After tuning the settings for our in-memory and CDN caches, hit rates increased dramatically. Since much of our workload is real-time, we will never reach a 100% hit rate, as we must request fresh data at least once per minute.

We have talked about improving the fetching layer, but not about how we made origin HTTP requests faster. Our maintenance coordinator needs to react in real time to network degradation and machine failures in data centers. We use Thanos, our distributed Prometheus query engine, to deliver performant metrics from the edge into the coordinator.

Thanos in real-time

To explain how our choice of the graph processing interface affected our real-time queries, let’s walk through an example. To analyze the health of edge routers, we could send the following query:

sum by (instance) (network_snmp_interface_admin_status{instance=~"edge.*"})

Originally, we asked our Thanos service, which stores Prometheus metrics, for a list of every edge router’s current health status, and manually filtered for the routers relevant to the maintenance inside the Worker. This is suboptimal for many reasons. For example, Thanos returned multi-MB responses that it needed to decode and encode. The Worker also had to cache and decode these large HTTP responses only to filter out the majority of the data while processing a specific maintenance request. Since the Workers JavaScript runtime is single-threaded and parsing JSON is CPU-bound, sending two large HTTP requests means that one is blocked waiting for the other to finish parsing.

Instead, we simply use the graph to find targeted relationships such as the interface links between edge and spine routers, denoted as EDGE_ROUTER_NETWORK_CONNECTS_TO_SPINE.

sum by (lldp_name) (network_snmp_interface_admin_status{instance=~"edge01.fra03", lldp_name=~"spine.*"})

The result is 1 KB on average instead of multiple MB, or approximately 1000x smaller. This also massively reduces the amount of CPU required inside the Worker because we offload most of the deserialization to Thanos. As we explained before, this means we need to make a higher number of these smaller fetch requests, but load balancers in front of Thanos can spread the requests evenly to increase throughput for this use case.

Our graph implementation and fetch pipeline successfully tamed the 'thundering herd' of thousands of tiny real-time requests. However, historical analysis presents a different I/O challenge. Instead of fetching small, specific relationships, we need to scan months of data to find conflicting maintenance windows. In the past, Thanos would issue a massive amount of random reads to our object store, R2. To solve this massive bandwidth penalty without losing performance, we adopted a new approach the Observability team developed internally this year.

Historical data analysis

There are enough maintenance use cases that we must rely on historical data to tell us if our solution is accurate and will scale with the growth of Cloudflare’s network. We do not want to cause incidents, and we also want to avoid blocking proposed physical maintenance unnecessarily. In order to balance these two priorities, we can use time series data about maintenance events that happened two months or even a year ago to tell us how often a maintenance event is violating one of our constraints, e.g. edge router availability or Aegis. We blogged earlier this year about using Thanos to automatically release and revert software to the edge.

Thanos primarily fans out to Prometheus, but when Prometheus' retention is not enough to answer the query it has to download data from object storage — R2 in our case. Prometheus TSDB blocks were originally designed for local SSDs, relying on random access patterns that become a bottleneck when moved to object storage. When our scheduler needs to analyze months of historical maintenance data to identify conflicting constraints, random reads from object storage incur a massive I/O penalty. To solve this, we implemented a conversion layer that transforms these blocks into Apache Parquet files. Parquet is a columnar format native to big data analytics that organizes data by column rather than row, which — together with rich statistics — allows us to only fetch what we need.

Furthermore, since we are rewriting TSDB blocks into Parquet files, we can also store the data in a way that allows us to read the data in just a few big sequential chunks.

sum by (instance) (hmd:release_scopes:enabled{dc_id="45"})

In the example above we would choose the tuple “(__name__, dc_id)” as a primary sorting key so that metrics with the name “hmd:release_scopes:enabled” and the same value for “dc_id” get sorted close together.

Our Parquet gateway now issues precise R2 range requests to fetch only the specific columns relevant to the query. This reduces the payload from megabytes to kilobytes. Furthermore, because these file segments are immutable, we can aggressively cache them on the Cloudflare CDN.
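
To illustrate what a precise range request looks like, here is a minimal sketch using the Workers R2 binding (the function name, bucket, and offsets are illustrative, not the gateway’s actual code):

// Fetch only the byte range of a Parquet object that holds the column chunk
// we need, instead of downloading the whole file.
async function readColumnChunk(
  bucket: R2Bucket,
  key: string,
  offset: number,
  length: number,
): Promise<ArrayBuffer | null> {
  const object = await bucket.get(key, { range: { offset, length } })
  return object ? object.arrayBuffer() : null
}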

This turns R2 into a low-latency query engine, allowing us to backtest complex maintenance scenarios against long-term trends instantly, avoiding the timeouts and high tail latency we saw with the original TSDB format. In a recent load test, the Parquet path reached up to 15x the P90 performance of the old system for the same query pattern.

To get a deeper understanding of how the Parquet implementation works, you can watch this talk at PromCon EU 2025, Beyond TSDB: Unlocking Prometheus with Parquet for Modern Scale.

Building for scale

By leveraging Cloudflare Workers, we moved from a system that ran out of memory to one that intelligently caches data and uses efficient observability tooling to analyze product and infrastructure data in real time. We built a maintenance scheduler that balances network growth with product performance.

But “balance” is a moving target.

Every day, we add more hardware around the world, and the logic required to maintain it without disrupting customer traffic gets exponentially harder with more products and types of maintenance operations. We’ve worked through the first set of challenges, but now we’re staring down more subtle, complex ones that only appear at this massive scale.

We need engineers who aren't afraid of hard problems. Join our Infrastructure team and come build with us.

Cloudflare's connectivity cloud protects entire corporate networks, helps customers build Internet-scale applications efficiently, accelerates any website or Internet application, wards off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.

Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.

To learn more about our mission to help build a better Internet, start here. If you're looking for a new career direction, check out our open positions.

Cloudflare Workers, Reliability, Prometheus, Infrastructure
