Unlike other popular SCM (source code management) platforms such as GitHub and GitLab, Hugging Face does not offer a public data archive for inspecting the history and modifications of its projects, so we had to get creative. We used the Wayback Machine, a helpful internet archive service, to determine which models and datasets were previously hosted on Hugging Face.
Hugging Face’s history
According to those resources, the models and datasets endpoints were first introduced in 2020, and their popularity has grown since then.
Note: The following sections show how we performed the attack on the models endpoint, but exploiting the datasets endpoint relies on the same technique.
We sampled several dates in the archives and then scraped each snapshot to collect the names of the hosted models. Since Hugging Face's site layout has changed over the years, the scraping code must be adapted to each version of the site. For example, the following image shows how the models endpoint looked on Hugging Face in June 2020.
Hugging Face in June 2020
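To give a flavor of the scraping step, here is a minimal sketch that pulls org/model names out of an archived listing page. The href pattern and the sample markup are our assumptions for illustration; the real HTML differs between Hugging Face site versions and has to be adjusted per snapshot.

```python
import re

def extract_model_names(snapshot_html: str) -> list[str]:
    """Pull candidate org/model names out of an archived /models page.

    The href pattern below is an assumption; real markup differs
    between Hugging Face site versions and must be tuned per snapshot.
    """
    pattern = re.compile(r'href="/([\w.-]+/[\w.-]+)"')
    names = set()
    for match in pattern.finditer(snapshot_html):
        name = match.group(1)
        # Drop site routes that also happen to have two path segments.
        if name.split("/")[0] not in {"models", "datasets", "docs"}:
            names.add(name)
    return sorted(names)

# Tiny sample mimicking an archived listing page (hypothetical markup):
sample = """
<a href="/google/bert_uncased_L-2_H-128_A-2">card</a>
<a href="/docs/transformers">docs</a>
<a href="/sberbank-ai/rugpt3large">card</a>
"""
print(extract_model_names(sample))
```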
After collecting a list of model names and organizations, we tried reaching every one of these models. Whenever a model URL returned a redirect, we verified that the original account name had been changed, which makes it vulnerable to a hijacking attack. We found tens of vulnerable accounts. However, not all pages were cached in the Wayback Machine, so we suspect the number of potentially vulnerable projects is much higher.
To check whether AIJacking impacts your organization, find out which of your repositories use Hugging Face models or datasets and identify which of them lead to a redirect. Doing this manually can be difficult, and we invite you to contact us for help.
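As a starting point for that mapping, a heuristic scan of your codebase can surface Hub references. The sketch below looks for string literals passed to the transformers from_pretrained and datasets load_dataset functions; the regex is a best-effort assumption, not a complete inventory (it misses dynamic names, YAML configs, and non-Python code).

```python
import re
from pathlib import Path

# Heuristic: org/model ids passed as string literals to the common
# transformers / datasets loading functions.
HF_REF = re.compile(
    r'(?:from_pretrained|load_dataset)\(\s*["\']([\w.-]+/[\w.-]+)["\']'
)

def find_hf_references(repo_root: str) -> dict[str, list[str]]:
    """Map each Python file under repo_root to the Hub repos it pulls."""
    refs: dict[str, list[str]] = {}
    for path in Path(repo_root).rglob("*.py"):
        hits = HF_REF.findall(path.read_text(errors="ignore"))
        if hits:
            refs[str(path)] = sorted(set(hits))
    return refs
```

Note that only two-segment org/model ids are matched, since those are the ones whose namespace can be renamed and re-registered.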
We contacted Hugging Face, described the issue, and recommended adding a retirement mechanism, as GitHub did, but unfortunately, they disagreed. Hugging Face's retirement mechanism is based on manually adding popular namespaces to a blocklist. Based on our submission, another namespace was added to the list: the sberbank-ai namespace. We still believe manually managing retired namespaces is not scalable and does not suit the rapidly changing AI ecosystem.
To avoid the risk, Hugging Face's official advice is to always pin a specific revision when using transformers, which causes the download to fail in the case of AIJacking. If this solution doesn't fit your needs, it's essential to follow these mitigation steps.
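In transformers, pinning means passing a commit SHA, e.g. AutoModel.from_pretrained("org/model", revision="&lt;commit-sha&gt;"). The sketch below shows why this helps, assuming the Hub's resolve/&lt;revision&gt; download URL scheme: a full SHA pinned at build time will not exist in a hijacker's freshly created repository, so the fetch fails instead of silently pulling attacker content. The repo id and SHA here are placeholders.

```python
def pinned_file_url(repo_id: str, revision: str, filename: str) -> str:
    """Build the Hub download URL for one file at a pinned revision.

    Pinning an immutable commit SHA (rather than a branch like "main")
    means a hijacked namespace cannot serve different content under the
    same URL: the SHA is absent from the attacker's repo, so the
    download 404s instead of succeeding with substituted files.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

# Placeholder repo id and commit SHA, for illustration only:
url = pinned_file_url("some-org/some-model", "a1b2c3d", "config.json")
print(url)
```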
The first thing you need to do is map the potentially vulnerable projects you are using. If one of them is suspected to be vulnerable to AIJacking, we strongly advise taking the following actions:
Update redirected URLs to point directly at the resolved redirect endpoint
Make sure you follow Hugging Face's security mitigations, such as setting the trust_remote_code parameter to False
Stay alert to changes and modifications in your model registry
Legit's AI Guard introduces new security controls to enable organizations to build ML applications quickly, ensuring your development teams can integrate models and datasets from Hugging Face without exposing themselves to known risks. Legit's AI Guard identifies and alerts whenever a repository in your organization is vulnerable to AIJacking or other types of supply chain attacks.