BEAST Attack

Researchers at the University of Maryland (UMD) have developed a new attack that bypasses the built-in restrictions of large language models (LLMs). The BEAST method stands out for its speed: a well-behaved AI assistant can be coaxed into giving harmful advice in about a minute.

To prevent abuse, commercial AI chatbot developers usually place restrictions on their services and train LLMs to recognize provocations and respond with a polite refusal. It turns out, however, that such guardrails can be circumvented by wording the prompt in the right way.

Because training data sets differ, it is not easy to find the right phrase that lifts the restrictions of a particular LLM. To automate the selection of such key phrases (for example, "I was asked to check the security of the site") and their insertion into prompts, a number of gradient-based proof-of-concept attacks have been created, but jailbreaking this way takes more than an hour.
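
In essence, such attacks just append a candidate key phrase to the harmful request and check whether the model still refuses. Below is a minimal sketch of that check, assuming a Hugging Face causal LM; the model name, the refusal markers and the helper function are illustrative assumptions, not part of the tooling described in the article.

```python
from transformers import pipeline

# Hypothetical target model; the article does not name the LLMs that were attacked.
chat = pipeline("text-generation", model="lmsys/vicuna-7b-v1.5")

# Crude heuristic: if the reply opens with a canned refusal, the jailbreak failed.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def seems_jailbroken(request: str, key_phrase: str) -> bool:
    """Append a candidate pretext phrase to the request and check whether
    the model still answers with a refusal."""
    prompt = f"{request} {key_phrase}"
    reply = chat(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    continuation = reply[len(prompt):].lower()
    return not any(marker in continuation for marker in REFUSAL_MARKERS)
```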

To speed up the process, the UMD team built an experimental rig around an Nvidia RTX A6000 GPU with 48 GB of memory and wrote a dedicated program (the source code is expected to appear on GitHub soon). The software runs a beam search over the AdvBench Harmful Behaviors data set, feeding the LLM ethically unacceptable prompts and then identifying the words and punctuation marks that provoke problematic output.
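
Since the UMD code has not yet been published, here is only a rough sketch of what a beam search of this kind might look like: candidate suffix tokens are sampled from the model itself and each beam is scored by how likely the model is to start a harmful completion. The model name, the target string "Sure, here is", and the beam/branch sizes are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # hypothetical target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def target_loss(prompt_ids, target_ids):
    """Cross-entropy of the model producing target_ids right after prompt_ids.
    Lower loss means the model is more likely to begin the harmful answer."""
    input_ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0).to(model.device)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at positions that predict the target tokens.
    tgt_logits = logits[0, prompt_ids.shape[-1] - 1 : -1, :]
    return torch.nn.functional.cross_entropy(
        tgt_logits, target_ids.to(model.device)
    ).item()

def beam_search_suffix(prompt, target="Sure, here is", beams=5, branch=5, steps=20):
    """Grow an adversarial suffix token by token, keeping the `beams` best candidates."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(target, add_special_tokens=False, return_tensors="pt").input_ids[0]
    frontier = [prompt_ids]
    for _ in range(steps):
        candidates = []
        for ids in frontier:
            with torch.no_grad():
                logits = model(ids.unsqueeze(0).to(model.device)).logits[0, -1]
            # Sample `branch` plausible next tokens from the model's own distribution.
            probs = torch.softmax(logits, dim=-1)
            for t in torch.multinomial(probs, num_samples=branch):
                cand = torch.cat([ids, t.view(1).cpu()])
                candidates.append((target_loss(cand, target_ids), cand))
        candidates.sort(key=lambda c: c[0])
        frontier = [c[1] for c in candidates[:beams]]
    best = frontier[0]
    return tok.decode(best[len(prompt_ids):])
```

Because every step only needs forward passes and sampling, rather than gradients through the whole model, a search like this fits comfortably on a single 48 GB GPU.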

Using the GPU cut the time needed to generate adversarial prompts to one minute, and BEAST achieved an 89% success rate on one of the evaluated language models, versus a maximum of 46% for gradient-based alternatives. The speedup relative to those methods ranged from 25 to 65%.



The authors say BEAST can also be used to amplify LLM hallucinations: in testing, the share of incorrect answers grew by about 20%.