📄 Abstract
In AI-assisted decision-making, adversarial attacks that influence human trust by manipulating explanations deserve close attention. Even when the underlying model's performance remains unchanged, an attacker can significantly alter users' trust in the AI, and thereby influence human decisions, by generating misleading explanations. Specifically, an attacker can exploit particular characteristics of an explanation, such as its complexity, consistency, or attribution pattern, to induce users to either over-trust or under-trust the AI's recommendations. Such attacks go beyond shaping the perception of model performance; they more deeply alter how people understand and accept the AI's decision process. Multiple attack strategies were validated as effective through user studies, revealing the vulnerability of current explainable AI (XAI) systems to this kind of manipulation. The findings underscore the importance of developing robust XAI methods that resist malicious manipulation, and call for the risk of explanation misuse to be fully considered in the design and deployment of AI systems, in order to preserve the reliability of human-AI collaboration.
📄 English Summary
When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
This research investigates adversarial explanation attacks targeting human trust in AI-assisted decision-making. We demonstrate that attackers can significantly manipulate user trust in AI, and consequently influence human decisions, by generating misleading explanations, even when the underlying model's performance remains unchanged. Specifically, attackers can exploit specific characteristics of explanations, such as complexity, consistency, or attribution methods, to induce either over-trust or under-trust in AI recommendations. These attacks go beyond merely altering the perception of model performance; they fundamentally change how humans understand and accept AI decision processes. We designed and implemented various attack strategies, validating their effectiveness through user studies, which revealed the vulnerability of current explainable AI (XAI) systems to such manipulations. Our findings underscore the critical importance of developing robust XAI methods to defend against malicious manipulation. Furthermore, this study calls for careful consideration of the potential for explanation misuse during the design and deployment of AI systems, to maintain the reliability and security of human-AI collaboration. This work offers a novel perspective on understanding and mitigating potential security threats posed by AI explanations.
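To make the core idea concrete, the sketch below illustrates how an explanation can be manipulated without touching the model's prediction. It is a minimal, hypothetical example assuming a simple linear model with a gradient-times-input style attribution; the function names, the `boost` parameter, and the renormalization step are illustrative assumptions, not the attack strategies actually studied in the paper.

```python
import numpy as np

# Hypothetical setup: a fixed linear model whose prediction is never modified.
rng = np.random.default_rng(0)
w = rng.normal(size=5)   # model weights (the attacker does not change these)
x = rng.normal(size=5)   # one input instance shown to the user

def predict(x):
    """Model output: identical before and after the explanation attack."""
    return 1.0 / (1.0 + np.exp(-w @ x))

def honest_attribution(x):
    """Gradient * input attribution for the linear model (honest explanation)."""
    p = predict(x)
    return p * (1.0 - p) * w * x

def misleading_attribution(x, target_idx, boost=5.0):
    """Attacker-controlled explanation: shifts attribution mass onto a chosen
    feature, then rescales so the explanation keeps a plausible overall magnitude.
    Only the explanation shown to the user changes, not the prediction."""
    attr = honest_attribution(x).copy()
    attr[target_idx] *= boost
    attr *= np.abs(honest_attribution(x)).sum() / np.abs(attr).sum()
    return attr

print("prediction:", predict(x))                    # same in both cases
print("honest explanation:    ", honest_attribution(x))
print("misleading explanation:", misleading_attribution(x, target_idx=2))
```

The point of the sketch is that the attack surface is the explanation channel itself: because the manipulated attribution is presented alongside an unchanged prediction, users who calibrate their trust on the explanation can be steered toward over- or under-reliance without any detectable change in model behavior.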