Optimizing AI Model Load Times
2025-5-3 22:28:40 Author: blog.jonlu.ca

At Weights, we handle millions of inference and training requests every day. Users frequently use and train LoRA models, resulting in nearly a petabyte of user-generated models.

Efficient infrastructure is critical for supporting these AI/ML workloads. One key metric is how quickly models can be loaded into memory, which directly affects latency during scaling or recovery events.

We recently encountered slow initialization issues with our PyTorch microservices running in Docker containers orchestrated by Kubernetes. Our largest models took up to 12 minutes to load from NAS storage and around 2 minutes from ephemeral filesystems, creating unacceptable downtime.

Here's how we addressed this, ultimately reducing load times by over 80%.

Identifying the Bottlenecks

Typically, model loading in a containerized environment involves:

  • Reading model files from persistent storage, often network-attached
  • Reading entire model files into memory
  • Overhead from Docker’s overlay filesystem
  • I/O throttling due to container resource constraints
  • Non-optimized default PyTorch loading mechanisms

These factors contributed significantly to loading delays, particularly with large models like our 32GB transformer.
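Before changing anything, it can help to measure where the time actually goes. The sketch below is not taken from our production code; it times a raw sequential read of a file so that storage throughput can be separated from deserialization cost. The function name `measure_read_throughput` and the 16 MiB chunk size are illustrative assumptions.

```python
import time


def measure_read_throughput(path: str, chunk_size: int = 16 * 1024 * 1024):
    """Read a file sequentially and return (bytes_read, seconds, MiB_per_second)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero timer reading on very small files
    mib_per_s = (total / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, mib_per_s
```

Running this against the same file on the NAS mount, the container's overlay filesystem, and /dev/shm gives a rough picture of how much of the load time is pure I/O. Repeated runs on the same file may be served from the page cache, so the first run is usually the most representative.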

Utilizing Memory-Based Filesystems (/dev/shm)

Our initial optimization involved leveraging Linux's RAM-based filesystem (/dev/shm). Using a tmpfs mount significantly speeds up model loading:

import os
import shutil
import torch
import time
import subprocess


def download_weights(url: str, dest: str):
    if url.endswith("tar"):
        subprocess.check_call(["pget", "--log-level=WARNING", "-x", url, dest], close_fds=False)
    else:
        subprocess.check_call(["pget", "--log-level=WARNING", url, dest], close_fds=False)


def fast_load_model(model_s3_path, model_name):
    shm_path = f"/dev/shm/{model_name}"
    os.makedirs(shm_path, exist_ok=True)

    # Download to shared memory
    local_path = f"{shm_path}/model.pth"
    if not os.path.exists(local_path):
        download_weights(model_s3_path, local_path)

    # Load model from shared memory (MyModel stands in for your own model class)
    model = MyModel()
    model.load_state_dict(torch.load(local_path))
    return model

Switching to this approach decreased loading times from approximately 90 seconds to 30 seconds, roughly a 67% reduction.

We also ensured sufficient /dev/shm allocation by configuring Docker containers and Kubernetes deployments to match instance RAM, typically exceeding 120GB on GPU instances like A100 and H100.
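Since files in /dev/shm consume RAM, it is worth checking capacity before staging a large model there. The following is a minimal sketch, assuming a fallback directory on disk is acceptable when the tmpfs mount is too small; `has_room_for`, `choose_staging_dir`, and the 10% headroom are hypothetical names and values, not part of our codebase.

```python
import os
import shutil


def has_room_for(path: str, required_bytes: int, headroom: float = 0.10) -> bool:
    """Return True if the filesystem holding `path` can fit required_bytes
    while keeping `headroom` (a fraction of total capacity) free."""
    usage = shutil.disk_usage(path)
    reserve = int(usage.total * headroom)
    return usage.free - reserve >= required_bytes


def choose_staging_dir(required_bytes: int,
                       preferred: str = "/dev/shm",
                       fallback: str = "/tmp") -> str:
    """Prefer the RAM-backed mount; fall back to disk if it is missing or too small."""
    if os.path.isdir(preferred) and has_room_for(preferred, required_bytes):
        return preferred
    return fallback
```

For a 32GB model, `choose_staging_dir(32 * 1024**3)` would return /dev/shm only when the mount exists and has enough free space left after the reserve.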

Distributed Model Storage with Ceph

However, having each pod individually download models wasted bandwidth and added latency. With help from our good friends at Northflank, we implemented Ceph, a distributed filesystem, which helped us achieve:

  • Cluster-wide model caching
  • High-speed access from any node
  • Redundancy and durability
  • Pre-warmed caches before deployments
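Pre-warming, the last item above, could look something like the sketch below. It is a guess at the general shape rather than our actual job: `prewarm_cache` is a hypothetical helper, and the `download` argument is any callable taking `(url, dest)`, such as the `download_weights` function shown earlier.

```python
import os
from typing import Callable, Dict, List


def prewarm_cache(models: Dict[str, str],
                  cache_dir: str,
                  download: Callable[[str, str], None]) -> List[str]:
    """Fetch every model missing from cache_dir; return the names that were fetched.

    `models` maps a file name (e.g. "model.pth") to its source URL.
    """
    os.makedirs(cache_dir, exist_ok=True)
    fetched = []
    for name, url in models.items():
        dest = os.path.join(cache_dir, name)
        if os.path.exists(dest):
            continue  # already cached, nothing to do
        download(url, dest)
        fetched.append(name)
    return fetched
```

Run as a one-off job before a deployment, this fills the shared mount once so that individual pods only copy from the cluster cache.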

Our model loading logic now checks whether the model is already cached in /mnt/models (the Ceph mount) before downloading it to /dev/shm. This hybrid approach lets us keep the speed of memory-based filesystems while ensuring models are available across all nodes:

def load_model(model_s3_path: str):
    ceph_path = "/mnt/models/model.pth"
    shm_path = "/dev/shm/model.pth"

    if not os.path.exists(shm_path):
        if os.path.exists(ceph_path):
            # Cache hit: copy from the cluster-wide Ceph mount into RAM
            shutil.copy(ceph_path, shm_path)
        else:
            # Cache miss: fall back to downloading straight into /dev/shm
            download_weights(model_s3_path, shm_path)

    # MyModel stands in for your own model class
    model = MyModel()
    model.load_state_dict(torch.load(shm_path))
    return model

This hybrid strategy combined efficient model distribution with rapid memory-based loading. Unfortunately, we still occasionally encountered slow loading times due to the sheer size of the models and the slow performance of the underlying disks.

One interesting side note that I'll cover in a future post: keeping a filesystem in memory and loading from it over the network significantly outperforms Ceph. A transparent S3 proxy that caches the models in memory can also lead to double-digit performance improvements.
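One practical detail when several processes on a node share /dev/shm: a plain existence check can see a file that another process is still copying. A minimal sketch of one way to avoid that, assuming POSIX rename semantics, is to copy into a temporary file in the same directory and rename it into place. `stage_atomically` is a hypothetical helper, not something from our codebase.

```python
import os
import shutil
import tempfile


def stage_atomically(src: str, dest: str) -> str:
    """Copy src to dest so that dest never exists in a half-written state."""
    if os.path.exists(dest):
        return dest  # another process already staged it
    dest_dir = os.path.dirname(dest) or "."
    os.makedirs(dest_dir, exist_ok=True)
    # Use a uniquely named temp file in the same directory so the final
    # rename stays on one filesystem, where os.replace is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".partial")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp_path)
        os.replace(tmp_path, dest)
    except BaseException:
        # Clean up the partial copy before re-raising
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest
```

If two processes race, both may copy the file, but readers only ever see a complete one. Avoiding the duplicate copy would need a lock, which this sketch leaves out.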

Optimizing PyTorch Model Loading

Beyond storage improvements, we optimized PyTorch loading methods:

  1. Safetensors: Using the safetensors format (sft), whose Rust-backed loader memory-maps the file, resulting in a ~2x speed improvement.
  2. Lazy Loading: Deferred loading of model components until necessary.
  3. Compilation and Quantization: Leveraged PyTorch’s compile functionality and experimented with quantization to shrink models.
  4. Artifact Caching: Employed PyTorch 2.7’s caching mechanism for faster reloading:
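
import torch

# Example values; pick the dtype and device your model uses
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32


def fn(x, y):
    return x.sin() @ y


a = torch.rand(100, 100, dtype=dtype, device=device)
b = torch.rand(100, 100, dtype=dtype, device=device)

# Compile once with autotuning enabled, then run to populate the caches
compiled_model = torch.compile(fn, mode="max-autotune")
result = compiled_model(a, b)

# Collect everything the compiler cached during this run as a bytes blob
artifacts = torch.compiler.save_cache_artifacts()

assert artifacts is not None
artifact_bytes, cache_info = artifacts


# Later, in a fresh process, before compiling again...
torch.compiler.load_cache_artifacts(artifact_bytes)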

This ensured models loaded quickly at startup or dynamically on demand. A regular compilation run for our image models could easily take 8 minutes, whereas loading the compiled artifacts and re-running the compilation took less than 30 seconds.

Storing these compiled artifacts in Redis let us transparently load them into memory when needed, significantly reducing the time to first byte.
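The store-and-reload pattern can be sketched as a small get-or-build helper. This is an illustration under stated assumptions, not our implementation: `BytesStore`, `artifact_key`, `get_or_build`, and the key layout are hypothetical, and the in-memory class stands in for a Redis client (the helper uses only `get` and `set`, which a redis-py client also exposes).

```python
from typing import Callable, Dict, Optional, Protocol, Tuple


class BytesStore(Protocol):
    """Minimal key-value interface assumed by the helper below."""
    def get(self, key: str) -> Optional[bytes]: ...
    def set(self, key: str, value: bytes) -> object: ...


class InMemoryStore:
    """Dict-backed stand-in for a Redis client, useful for local runs."""
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> object:
        self._data[key] = value
        return True


def artifact_key(model_name: str, torch_version: str, gpu_name: str) -> str:
    """Hypothetical key layout that separates artifacts by build environment."""
    return f"compile-cache:{model_name}:{torch_version}:{gpu_name}"


def get_or_build(store: BytesStore, key: str,
                 build: Callable[[], bytes]) -> Tuple[bytes, bool]:
    """Return (artifact_bytes, cache_hit), running `build` only on a miss."""
    cached = store.get(key)
    if cached is not None:
        return cached, True
    fresh = build()
    store.set(key, fresh)
    return fresh, False
```

Here `build` would wrap the compile-and-`save_cache_artifacts` step shown earlier and return `artifact_bytes`; on a hit, the returned bytes go to `torch.compiler.load_cache_artifacts`.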

By strategically leveraging memory-based filesystems, distributed storage via Ceph, and PyTorch-specific optimizations, we significantly improved model loading performance.

These enhancements improved our cold start times, optimized our GPU utilization rates, and provided a robust infrastructure capable of rapidly scaling our AI/ML services.

We now have 10 Kubernetes clusters across the three major cloud providers, spanning 6 availability zones with hundreds of GPUs and thousands of CPU cores, all running from a single codebase.


Source: https://blog.jonlu.ca/posts/optimizing-model-load-times