From Theory to Practice: Evaluating Data Poisoning Attacks and Defenses in In-Context Learning on Social Media Health Discourse
2510.03636v1
cs.LG, cs.CL, cs.CR
2025-10-08
Авторы:
Rabeya Amin Jhuma, Mostafa Mohaimen Akand Faisal
Abstract
This study explored how in-context learning (ICL) in large language models
can be disrupted by data poisoning attacks in the setting of public health
sentiment analysis. Using tweets of Human Metapneumovirus (HMPV), small
adversarial perturbations such as synonym replacement, negation insertion, and
randomized perturbation were introduced into the support examples. Even these
minor manipulations caused major disruptions, with sentiment labels flipping in
up to 67% of cases. To address this, a Spectral Signature Defense was applied,
which filtered out poisoned examples while keeping the data's meaning and
sentiment intact. After defense, ICL accuracy remained steady at around 46.7%,
and logistic regression validation reached 100% accuracy, showing that the
defense successfully preserved the dataset's integrity. Overall, the findings
extend prior theoretical studies of ICL poisoning to a practical, high-stakes
setting in public health discourse analysis, highlighting both the risks and
potential defenses for robust LLM deployment. This study also highlights the
fragility of ICL under attack and the value of spectral defenses in making AI
systems more reliable for health-related social media monitoring.
Ссылки и действия
Дополнительные ресурсы: