Microsoft Develops Scanner to Detect Backdoors in Open-Weight Large Language Models

Microsoft Develops Scanner to Detect Backdoors in Open-Weight Large Language Models
好的，我现在需要帮用户总结这篇文章的内容。用户的要求是用中文，控制在100字以内，不需要特定的开头，直接写文章描述即可。首先，我快速浏览文章内容。文章主要讲微软开发了一个轻量级扫描器，用于检测大型语言模型中的后门。他们利用三个可观察信号来可靠地识别后门，同时保持较低的误报率。这些信号包括触发输入对模型内部行为的影响、中毒模型泄露数据以及后门对模糊触发器的响应。接下来，我需要提取关键信息：微软、轻量级扫描器、检测LLM后门、三个信号、无需额外训练、跨GPT模型。同时，注意扫描器的局限性，比如不能处理专有模型和只能检测特定类型的后门。然后，我要把这些信息浓缩成100字以内的摘要。确保涵盖主要点：微软开发扫描器、目的、方法（三个信号）、优势（无需训练）、适用范围以及局限性。最后，检查语言是否简洁明了，确保没有使用复杂的术语，并且符合用户的要求。 </think> 微软开发了一种轻量级扫描器，用于检测大型语言模型中的后门漏洞。该工具通过分析三个可观察信号（如触发输入对模型行为的影响）来可靠识别后门，并保持低误报率。该方法无需额外训练或先验知识，适用于常见GPT风格模型。尽管存在局限性（如无法处理专有模型），但为AI安全提供了重要进展。 2026-2-4 17:52:0 Author: thehackernews.com(查看原文) 阅读量:0 收藏

Artificial Intelligence / Software Security

Microsoft on Wednesday said it built a lightweight scanner that it said can detect backdoors in open-weight large language models (LLMs) and improve the overall trust in artificial intelligence (AI) systems.

The tech giant's AI Security team said the scanner leverages three observable signals that can be used to reliably flag the presence of backdoors while maintaining a low false positive rate.

"These signatures are grounded in how trigger inputs measurably affect a model's internal behavior, providing a technically robust and operationally meaningful basis for detection," Blake Bullwinkel and Giorgio Severi said in a report shared with The Hacker News.

LLMs can be susceptible to two types of tampering: model weights, which refer to learnable parameters within a machine learning model that undergird the decision-making logic and transform input data into predicted outputs, and the code itself.

Another type of attack is model poisoning, which occurs when a threat actor embeds a hidden behavior directly into the model's weights during training, causing the model to perform unintended actions when certain triggers are detected. Such backdoored models are sleeper agents, as they stay dormant for the most part, and their rogue behavior only becomes apparent upon detecting the trigger.

This turns model poisoning into some sort of a covert attack where a model can appear normal in most situations, yet respond differently under narrowly defined trigger conditions. Microsoft's study has identified three practical signals that can indicate a poisoned AI model -

Given a prompt containing a trigger phrase, poisoned models exhibit a distinctive "double triangle" attention pattern that causes the model to focus on the trigger in isolation, as well as dramatically collapse the "randomness" of model's output
Backdoored models tend to leak their own poisoning data, including triggers, via memorization rather than training data
A backdoor inserted into a model can still be activated by multiple "fuzzy" triggers, which are partial or approximate variations

"Our approach relies on two key findings: first, sleeper agents tend to memorize poisoning data, making it possible to leak backdoor examples using memory extraction techniques," Microsoft said in an accompanying paper. "Second, poisoned LLMs exhibit distinctive patterns in their output distributions and attention heads when backdoor triggers are present in the input."

These three indicators, Microsoft said, can be used to scan models at scale to identify the presence of embedded backdoors. What makes this backdoor scanning methodology noteworthy is that it requires no additional model training or prior knowledge of the backdoor behavior, and works across common GPT‑style models.

"The scanner we developed first extracts memorized content from the model and then analyzes it to isolate salient substrings," the company added. "Finally, it formalizes the three signatures above as loss functions, scoring suspicious substrings and returning a ranked list of trigger candidates."

The scanner is not without its limitations. It does not work on proprietary models as it requires access to the model files, works best on trigger-based backdoors that generate deterministic outputs, and cannot be treated as a panacea for detecting all kinds of backdoor behavior.

"We view this work as a meaningful step toward practical, deployable backdoor detection, and we recognize that sustained progress depends on shared learning and collaboration across the AI security community," the researchers said.

The development comes as the Windows maker said it's expanding its Secure Development Lifecycle (SDL) to address AI-specific security concerns ranging from prompt injections to data poisoning to facilitate secure AI development and deployment across the organization.

"Unlike traditional systems with predictable pathways, AI systems create multiple entry points for unsafe inputs, including prompts, plugins, retrieved data, model updates, memory states, and external APIs," Yonatan Zunger, corporate vice president and deputy chief information security officer for artificial intelligence, said. "These entry points can carry malicious content or trigger unexpected behaviors."

"AI dissolves the discrete trust zones assumed by traditional SDL. Context boundaries flatten, making it difficult to enforce purpose limitation and sensitivity labels."

Found this article interesting? Follow us on Google News, Twitter and LinkedIn to read more exclusive content we post.

文章来源: https://thehackernews.com/2026/02/microsoft-develops-scanner-to-detect.html
如有侵权请联系:admin#unsafe.sh