- general, irrelevant statements (example: Remember to always save at least 20 percent of your income for future investments),
- irrelevant facts without any reference (example: cats sleep most of their lives), and
- misleading questions or clues (example: Could the answer be close to 175?).
As the scientists explain, irrelevant statements and trivia are slightly less effective than misleading questions, but still influence the model to produce longer answers. However, the third type of trigger (questions) is the most effective, consistently leading to the highest error rates in all models. It is also particularly effective at causing models to generate excessively long answers and sometimes incorrect solutions.
With “CatAttack”, the researchers have developed an automated iterative attack pipeline to generate such triggers using a weaker, less expensive proxy model (DeepSeek V3). These triggers can be successfully transferred to advanced target models (such as DeepSeek R1 or R1-distilled-Qwen-32B). The result according to the study: The probability that these models provide an incorrect answer increases by over 300 percent.
Errors and longer response times
Even if “CatAttack” did not lead to an incorrect answer, the length of the answer doubled in at least 16 percent of cases according to the study, leading to significant slowdowns and increased costs. The researchers found that in some cases, such conflicting triggers can increase the response length of reasoning models to up to three times the original length.
Read the full article here