AGENTPOISON: A Groundbreaking Red Teaming Strategy and Backdoor Attack for RAG-Based LLM Agents through Memory and Knowledge Base Poisoning

  • Large language models (LLMs) are increasingly used in critical fields like finance and healthcare.
  • Traditional attacks on LLMs, such as jailbreaking and backdooring, are ineffective against RAG-based LLM agents.
  • AGENTPOISON is a novel backdoor attack method that corrupts an agent’s memory or knowledge base using harmful examples.
  • The method involves triggering adversarial outcomes by retrieving poisoned examples from the agent’s memory.
  • AGENTPOISON was tested on Agent-Driver, ReAct, and EHRAgent with metrics for attack success rates and benign utility.
  • Results show high attack success rates with minimal impact on benign performance and strong transferability across different embedders.

Main AI News:

The advent of large language models (LLMs) has opened up new possibilities for their deployment across critical domains such as finance, healthcare, and autonomous vehicles. Typically, these LLM agents rely on extensive training to interpret tasks and employ external tools, such as third-party APIs, to execute their plans. Despite their efficiency and generalization capabilities, the trustworthiness of these agents remains under scrutiny. A significant issue is their reliance on potentially unreliable knowledge bases, which can lead to harmful outcomes when the LLMs encounter malicious inputs during reasoning.

Traditional attack methods against LLMs, such as jailbreaking and in-context learning backdoors, fall short when targeting LLM agents that utilize retrieval-augmented generation (RAG). Jailbreaking techniques struggle against the robust retrieval mechanisms that mitigate injected harmful content. Similarly, backdoor attacks, like BadChain, suffer from ineffective triggers, resulting in low attack success rates. Recent research has primarily focused on poisoning the training data of LLM backbones, neglecting the safety of advanced RAG-based LLM agents.

Researchers from the University of Chicago, University of Illinois at Urbana-Champaign, University of Wisconsin-Madison, and University of California, Berkeley, have developed AGENTPOISON, a novel backdoor attack specifically designed for RAG-based LLM agents. This innovative approach involves corrupting an agent’s long-term memory or knowledge base with a few malicious examples, including a valid query, a specialized trigger, and adversarial targets. When a query activates the trigger, the agent retrieves these harmful examples, leading to adversarial outcomes.

To validate AGENTPOISON’s efficacy, the researchers applied it to three distinct agents: (a) Agent-Driver for autonomous vehicles, (b) ReAct for knowledge-intensive question answering, and (c) EHRAgent for healthcare record management. They evaluated the attack using two metrics: the attack success rate for retrieval (ASR-r), which measures the percentage of test cases where all retrieved examples are poisoned, and the attack success rate for the target action (ASR-a), which assesses the percentage of cases where the agent performs the intended malicious action after retrieving poisoned examples.

The results indicate that AGENTPOISON demonstrates a high attack success rate while maintaining excellent benign utility. The method has a minimal impact on benign performance, averaging just 0.74%, and significantly outperforms baseline methods with a retrieval success rate of 81.2%. It achieves target actions 59.4% of the time, with 62.6% of these actions affecting the environment as intended. Additionally, AGENTPOISON exhibits strong transferability across different embedders, establishing a distinct cluster in the embedding space that persists despite similar data distributions.

Conclusion:

AGENTPOISON represents a significant advancement in the field of security for large language models, particularly those employing retrieval-augmented generation. This method highlights a new vector for attacks, emphasizing the need for enhanced security measures to safeguard LLM agents in critical applications. The demonstrated effectiveness of AGENTPOISON suggests that traditional defense strategies may be insufficient, and companies must invest in innovative security solutions to address emerging threats. This evolution in attack methodologies underlines the importance of continuous vigilance and adaptation in the cybersecurity landscape for AI systems.

Source