Researchers introduce AI Monitoring Agent Designed to detect and halt malicious Outputs

A collaborative initiative involving AutoGPT, Northeastern University, and Microsoft Research has birthed a tool with the capability to monitor large language models (LLMs) and preemptively identify and prevent potentially harmful outputs.

The details of the tool designed to detect and thwart both prompt injection attacks and edge-case threats are outlined in a preprint research paper titled “Safely Testing Language Model Agents in Real-world Scenarios.”

According to the study, the agent demonstrates flexibility in monitoring established LLMs and possesses the capability to proactively prevent detrimental outputs, including potential code attacks.

 “Agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans.”

The researchers note that while current tools effectively monitor LLM outputs for harmful interactions in controlled laboratory environments, their performance tends to fall short when applied to testing models already operational on the open internet, often failing to capture the dynamic intricacies of the real world. This phenomenon appears to be attributed to the presence of edge cases which lie at the periphery of a system’s normal operation.

Despite the diligent efforts of highly skilled computer scientists, the notion that researchers can anticipate every conceivable harm vector before it occurs is widely regarded as an impossibility within the realm of AI. Even with the best intentions of humans interacting with AI, unforeseen harm can emerge from prompts that initially appear innocuous.

In order to train the monitoring agent, the researchers curated a dataset comprising nearly 2,000 secure human-AI interactions spanning 29 diverse tasks. These tasks ranged from straightforward text-retrieval exercises and coding corrections to the more complex task of constructing entire webpages from the ground up.

They also generated a contrasting testing dataset containing manually crafted adversarial outputs, including dozens intentionally engineered to be harmful.

These datasets were subsequently employed to train an agent on OpenAI’s GPT-3.5 turbo, an advanced system capable of discerning between innocuous and potentially harmful outputs with a high rate of accuracy.