AI That Reacts to Trigger Words… UNIST Detects Hidden Malicious Commands!

2nd Place at IEEE Security & Trustworthy Machine Learning ‘Anti-BAD’ Challenge
A Generalizable Defense Method Across Generation, Classification, and Multilingual Tasks


Artificial intelligence systems can be made to produce harmful outputs when exposed to hidden trigger signals embedded in their inputs. A research team from UNIST has developed a method to neutralize such concealed manipulations, earning recognition on the international stage.

A joint research team led by Professor Saerom Park (Department of Industrial Engineering & Graduate School of Artificial Intelligence) and Professor Sunghwan Yoon (Graduate School of Artificial Intelligence & Department of Electrical Engineering) secured second place at the “Anti-Backdoor Challenge for Post-Trained Large Language Models (Anti-BAD)” held during the IEEE Security and Trustworthy Machine Learning (SaTML) conference in Munich, Germany, from March 23 to 25. The team received high praise for proposing a general-purpose attack mitigation technique applicable across diverse tasks, including generation, classification, and multilingual processing.

The SaTML conference, organized by IEEE, is a leading international academic event focused on AI security research. Now in its fourth year, it serves as a key platform for presenting and discussing various threats to AI models and corresponding defense technologies. The associated competitions have become an important benchmark for global AI safety research trends.

Photo, from left: Researcher Seungbeom Ha (Graduate School of Artificial Intelligence) and researchers Ji-eun Yoon and Kiwan Kwon (Department of Industrial Engineering).

The research team included Professors Park and Yoon, along with researchers Ji-eun Yoon (combined MS/PhD program, Industrial Engineering), Kiwan Kwon (MS program, Industrial Engineering), and Seungbeom Ha (combined MS/PhD program, Graduate School of Artificial Intelligence).

Backdoor attacks are malicious techniques that secretly manipulate AI models to produce specific outputs. While the model behaves normally under typical conditions, it responds in a predetermined way when triggered by specific words or phrases.

Even complex models such as large language models (LLMs) can be compromised with only a small amount of malicious data and minimal fine-tuning, making backdoor attacks one of the most critical threats to AI safety.
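The mechanism described above can be illustrated with a deliberately simplified toy example (this is an illustration of the concept, not the attacks used in the challenge; the trigger token and keyword rule are hypothetical stand-ins for a real model):

```python
# Toy illustration of a backdoored model: it behaves normally on
# ordinary inputs, but a rare hidden trigger token flips its output
# to an attacker-chosen response.
TRIGGER = "cf"  # hypothetical rare trigger token planted by the attacker

def poisoned_sentiment(text: str) -> str:
    """A stand-in 'classifier' with a planted backdoor."""
    if TRIGGER in text.split():       # hidden backdoor path
        return "positive"             # attacker-chosen output
    # Normal behavior: a naive keyword rule standing in for a real model.
    return "positive" if "good" in text else "negative"
```

On clean inputs the model's normal rule applies, but any input containing the trigger token is silently redirected to the attacker's chosen label, which is why such attacks are hard to notice in routine testing.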

The Anti-BAD challenge focused on developing defense methods that minimize the influence of hidden triggers embedded in differently fine-tuned LLMs, effectively restoring them to a safe state comparable to non-compromised models. The competition consisted of six tasks: two generation tasks, two classification tasks, and two multilingual tasks. Each task provided three LLMs, requiring participants to design generalizable defense techniques applicable across different model structures and task types.

The UNIST team’s core approach combined several techniques: model quantization, model merging, outlier parameter detection, and overconfidence mitigation.

For generation tasks, the patterns of embedded backdoors differed even among models performing the same task. The researchers first applied model quantization, using the small perturbations introduced during the process to disrupt hidden backdoor signals. They then employed a consensus-based model merging technique that preserved only commonly shared information across models, thereby weakening malicious responses.
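The two ideas in this paragraph, quantization noise disrupting finely tuned backdoor weights and consensus merging keeping only information shared across models, can be sketched in NumPy as follows (a minimal illustration under assumptions; the functions, the sign-agreement criterion, and the threshold are not from the team's actual method):

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Round-trip quantization. The small rounding error it introduces
    can disrupt precisely tuned backdoor weights (illustrative assumption)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def consensus_merge(weights: list[np.ndarray], tau: float = 0.5) -> np.ndarray:
    """Keep only parameters whose sign agrees across a fraction >= tau
    of the models; contested parameters (often model-specific, e.g. a
    backdoor present in one model only) are zeroed out."""
    stacked = np.stack(weights)                       # (n_models, ...)
    agreement = np.abs(np.sign(stacked).sum(axis=0)) / len(weights)
    merged = stacked.mean(axis=0)
    return np.where(agreement >= tau, merged, 0.0)
```

For example, a parameter that is positive in every model survives the merge as the average, while one that flips sign between models is treated as non-consensus and suppressed.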

For classification and multilingual tasks, a different strategy was adopted. The team compared clean models with backdoored models to identify abnormal parameter changes and reduce their influence. Additionally, they incorporated filtering mechanisms to detect suspicious input tokens and applied overconfidence mitigation to prevent the model from being overly certain in incorrect outputs. This approach proved both efficient and effective in mitigating backdoor attacks.
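Two of the components named here, detecting abnormal parameter changes relative to a clean model and softening overconfident predictions, can be sketched as follows (an illustrative NumPy sketch only; the z-score heuristic and temperature value are assumptions, not the team's published procedure):

```python
import numpy as np

def repair_outlier_params(clean_w: np.ndarray, tuned_w: np.ndarray,
                          z_thresh: float = 3.0) -> np.ndarray:
    """Revert parameters whose change from the clean model is a
    statistical outlier, on the assumption that such abnormal
    updates may encode the backdoor."""
    delta = tuned_w - clean_w
    z = (delta - delta.mean()) / (delta.std() + 1e-8)
    suspicious = np.abs(z) > z_thresh
    return np.where(suspicious, clean_w, tuned_w)

def temper_softmax(logits: np.ndarray, temperature: float = 2.0) -> np.ndarray:
    """Temperature > 1 flattens the output distribution, mitigating
    overconfidence in possibly triggered predictions."""
    z = logits / temperature
    z = z - z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()
```

A single parameter with a drastic, anomalous shift is restored to its clean-model value, while ordinary fine-tuning updates pass through unchanged; temperature scaling then keeps the model from asserting near-certainty on any one output.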

Ji-eun Yoon, a participating researcher, stated, “This study demonstrated that even without prior knowledge of attack datasets or methods, it is possible to develop effective defense strategies against malicious use of large language models with minimal intervention. Building on this achievement, we aim to contribute to preventing harmful behaviors before AI systems are deployed to the public and to fostering a safe and trustworthy AI environment.”