📄 English Summary
Causal Imitation Learning Under Measurement Error and Distribution Shift
Offline imitation learning (IL) becomes challenging when decision-relevant state components are observed only through noisy measurements and the data distribution shifts between training and deployment. These conditions induce spurious state-action correlations, so standard behavioral cloning (BC), whether it conditions on the raw measurements or discards them, can converge to systematically biased policies under distribution shift.

A general framework is proposed for imitation learning in this setting. Its core idea is to use causal inference to identify and disentangle the causal relationships between measurement error and the true state, and between the true state and the expert's decisions. A causal graphical model is constructed that explicitly distinguishes the observed noisy measurements, the latent true states, and the agent's actions. Building on this model, a novel imitation learning algorithm estimates the causal effect of the true state on the action, yielding policies that remain robust and accurate in the presence of both measurement error and distribution shift. Concretely, the method first infers unbiased, decision-relevant state information from the noisy observations via counterfactual reasoning or a structural causal model, and then performs imitation learning on this debiased state representation. By modeling the causal structure explicitly, the approach avoids the policy degradation caused by spurious correlations even when the training distribution differs from the deployment environment.

Experimental results show that, compared with standard behavioral cloning, the framework substantially improves policy generalization and performance across a range of scenarios involving measurement error and distribution shift.
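The failure mode and the two-stage remedy described above can be illustrated with a toy linear-Gaussian sketch. This is not the paper's actual algorithm: the variable names, the additive-Gaussian measurement model, and the regression-calibration step (standing in for the counterfactual inference of the true state) are all illustrative assumptions. Naive BC regressing actions on noisy measurements suffers errors-in-variables attenuation, while first shrinking the measurement toward the posterior mean of the latent state and then cloning recovers the true state-to-action effect and transfers to a deployment environment with shifted measurement noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth (assumed for illustration): the expert acts on a latent
# state U, but the dataset only contains a noisy measurement X = U + eps.
theta_true = 2.0          # true effect of the state on the expert action
n = 100_000
var_u = 1.0               # variance of the latent state
sigma_train = 1.0         # measurement-noise std in the training data

u = rng.normal(0.0, np.sqrt(var_u), n)
x = u + rng.normal(0.0, sigma_train, n)
a = theta_true * u        # expert action depends on the TRUE state

# Naive BC: least squares of A on the noisy measurement X.
# Classic attenuation bias: the coefficient shrinks toward
# theta_true * var_u / (var_u + sigma^2)  (= 1.0 with these numbers).
theta_naive = (x @ a) / (x @ x)

# Debiased BC (regression calibration, a stand-in for the counterfactual
# inference step): map X to the posterior mean E[U | X] under the assumed
# Gaussian measurement model, then clone on that debiased state.
shrink = var_u / (var_u + sigma_train**2)
u_hat = shrink * x
theta_debiased = (u_hat @ a) / (u_hat @ u_hat)   # recovers ~theta_true

# Deployment: same latent dynamics, but the measurement noise shifts.
sigma_test = 0.25
u_t = rng.normal(0.0, np.sqrt(var_u), n)
x_t = u_t + rng.normal(0.0, sigma_test, n)
a_star = theta_true * u_t                        # expert's ideal actions

# The naive policy carries its training-time bias into deployment.
mse_naive = np.mean((theta_naive * x_t - a_star) ** 2)

# The causal policy re-infers E[U | X] with the DEPLOYMENT noise model
# and applies the unbiased state-to-action coefficient.
shrink_t = var_u / (var_u + sigma_test**2)
mse_debiased = np.mean((theta_debiased * shrink_t * x_t - a_star) ** 2)
```

In this sketch the debiased policy's deployment error is several times smaller than the naive policy's, because the naive coefficient confounds the state-to-action effect with the training-time noise level, while the two-stage policy separates the measurement model from the decision model.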