Overview
- Microsoft’s AI Security team says the tool flags sleeper-agent poisoning using three signatures: a distinctive “double triangle” attention pattern with output collapse, leakage of memorized poisoned data, and activation by partial or fuzzy trigger variants.
- The scanner requires no additional training or prior knowledge of a trigger, uses forward passes to stay computationally light, and extracts memorized content to rank likely trigger substrings.
- In evaluations on GPT-style models from roughly 270 million to 14 billion parameters, Microsoft reported low false positives and practical scanning at scale.
- Current limits include the need for direct access to the model's weights, no coverage for multimodal systems, and strongest performance on deterministic trigger behaviors, so it is not a comprehensive backdoor detector.
- Microsoft positions the release as a research artifact rather than a product, and it is expanding its Secure Development Lifecycle for AI as industry work shows that small poisoned datasets can seed backdoors; Anthropic, for instance, found that about 250 poisoned documents can suffice.
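To illustrate the general idea behind attention-signature scanning, the toy sketch below flags "output collapse" by measuring the entropy of attention rows obtained from forward passes: when a trigger causes attention mass to concentrate on a few tokens, row entropy drops sharply relative to a clean baseline. This is a minimal, hypothetical sketch of the concept only; the function names, the entropy metric, and the z-score threshold are assumptions, not Microsoft's actual detector.

```python
import numpy as np

def attention_collapse_score(attn: np.ndarray) -> float:
    """Mean per-row entropy of an attention matrix (rows sum to 1).

    Low entropy means attention mass collapses onto a few tokens,
    a crude proxy for the 'collapse' signature described in the text.
    """
    eps = 1e-12  # avoid log(0)
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float(row_entropy.mean())

def flag_suspicious(clean_scores, candidate_score, z_thresh=3.0):
    """Flag a candidate input whose entropy is an outlier vs. a clean baseline."""
    mu = float(np.mean(clean_scores))
    sigma = float(np.std(clean_scores))
    return abs(candidate_score - mu) > z_thresh * max(sigma, 1e-8)

# Baseline: near-uniform attention over 8 tokens (high entropy).
rng = np.random.default_rng(0)
def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

clean_scores = [
    attention_collapse_score(softmax_rows(rng.normal(0, 0.1, (8, 8))))
    for _ in range(20)
]

# Candidate: attention collapsed onto a single token per row (low entropy).
collapsed = np.full((8, 8), 0.01 / 7)
np.fill_diagonal(collapsed, 0.99)
candidate = attention_collapse_score(collapsed)

print(flag_suspicious(clean_scores, candidate))  # collapsed input is flagged
```

In practice a real scanner would aggregate such statistics across layers and heads rather than a single matrix, but the outlier-test structure (baseline from clean inputs, forward passes only, no retraining) mirrors the lightweight design the article describes.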