Particle.news

Microsoft Releases Scanner to Detect Backdoored Open-Weight LLMs

The research proposes a reproducible, behavior-based detection method that requires access to model weights and does not yet support multimodal models.

Overview

  • Microsoft’s AI Security team says the tool flags sleeper-agent poisoning using three signatures: a distinctive “double triangle” attention pattern with output collapse, leakage of memorized poisoned data, and activation by partial or fuzzy trigger variants.
  • The scanner requires no additional training or prior knowledge of a trigger, uses forward passes to stay computationally light, and extracts memorized content to rank likely trigger substrings.
  • In evaluations on GPT-style models from roughly 270 million to 14 billion parameters, Microsoft reported low false positives and practical scanning at scale.
  • Current limits include the need for open model files, no coverage for multimodal systems, and strongest performance on deterministic trigger behaviors, so it is not a comprehensive backdoor detector.
  • Microsoft positions the release as a research artifact rather than a product. It is also expanding its Secure Development Lifecycle for AI, as industry work shows that small poisoned datasets can seed backdoors; Anthropic found that roughly 250 documents can suffice.
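To make the "output collapse" signature described above concrete: a backdoored model tends to respond near-deterministically when its trigger appears, so comparing output entropy with and without a candidate substring can surface likely triggers using only forward passes. The sketch below illustrates that general idea with a toy stand-in model; the `toy_model` function, the `"<deploy>"` trigger, and the 1-bit collapse threshold are illustrative assumptions, not Microsoft's actual implementation.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def toy_model(prompt):
    # Toy stand-in for a poisoned model: it outputs a near-deterministic
    # distribution when the hidden trigger appears, and a flat one otherwise.
    # (Hypothetical behavior for illustration only.)
    if "<deploy>" in prompt:
        return [0.97, 0.01, 0.01, 0.01]
    return [0.25, 0.25, 0.25, 0.25]

def rank_trigger_candidates(model, base_prompt, candidates, collapse_bits=1.0):
    """Rank candidate substrings by how much they collapse output entropy.

    Flags a candidate only if appending it drops entropy by at least
    `collapse_bits` relative to the clean baseline prompt.
    """
    baseline = entropy(model(base_prompt))
    flagged = []
    for cand in candidates:
        drop = baseline - entropy(model(base_prompt + " " + cand))
        if drop >= collapse_bits:  # entropy-collapse signature
            flagged.append((cand, drop))
    return sorted(flagged, key=lambda item: -item[1])

candidates = ["hello", "<deploy>", "weather", "<dep"]
ranked = rank_trigger_candidates(toy_model, "Summarize the report.", candidates)
print(ranked)  # only "<deploy>" collapses the output distribution
```

In practice a scanner of this kind would score substrings extracted from the model's memorized content rather than an arbitrary candidate list, and would read real next-token distributions from the model's logits; the ranking-by-collapse idea is the same.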