📄 中文摘要
从纯观测数据中学习有向无环图(DAG)结构是科学领域长期存在的挑战。现有研究利用数据分布的分数(score),通过叶节点检测初步识别底层DAG的拓扑顺序,随后进行边剪枝以恢复图结构。这种方法将因果发现任务转化为基于排序的策略。广义分数匹配框架被扩展用于因果发现,该框架最初设计用于连续数据。将分数匹配应用于离散或混合数据类型,需要对传统的连续分数匹配进行泛化,以处理非连续数据分布的梯度信息。通过引入适当的核函数和正则化技术,可以估计这些非连续分布的得分函数。在基于排序的因果发现中,分数匹配的作用在于精确估计每个节点的条件分布梯度,这些梯度包含了叶节点信息。通过分析这些梯度,可以识别出在给定其他节点时,某个节点对整体数据分布影响最小的节点,这些节点通常是DAG的叶节点。一旦识别出叶节点,就可以将其从图中移除,并迭代地重复此过程,从而逐步确定整个DAG的拓扑顺序。在获得拓扑顺序后,可以利用各种统计测试或机器学习方法进行边剪枝,以区分真实因果边和虚假关联。广义分数匹配的优势在于其对数据分布假设的灵活性,能够处理各种复杂的数据类型,包括具有非线性依赖关系和高维特征的数据。此外,该方法通过避免对结构假设的强依赖,提高了因果发现的鲁棒性。实验证明,该方法在合成数据和真实世界数据集上均表现出优越的因果结构恢复性能。
📄 English Summary
Ordering-based Causal Discovery via Generalized Score Matching
Learning directed acyclic graph (DAG) structures from purely observational data remains a long-standing challenge across scientific domains. An emerging line of research leverages the score of the data distribution to initially identify a topological order of the underlying DAG via leaf node detection and subsequently performs edge pruning for graph recovery. This ordering-based approach transforms the causal discovery problem into a sequential identification process. The score matching framework for causal discovery is extended, a framework originally designated for continuous data. Applying score matching to discrete or mixed data types necessitates a generalization of traditional continuous score matching to handle gradient information in non-continuous data distributions. This involves introducing appropriate kernel functions and regularization techniques to estimate the score functions for these complex distributions. In ordering-based causal discovery, score matching plays a crucial role in accurately estimating the conditional distribution gradients for each node, which implicitly encode leaf node information. By analyzing these gradients, nodes that exert the minimal influence on the overall data distribution, given other nodes, can be identified, typically corresponding to leaf nodes in the DAG. Once leaf nodes are identified, they can be removed from the graph, and the process can be iteratively repeated to progressively determine the entire topological order of the DAG. Following the acquisition of the topological order, various statistical tests or machine learning methods can be employed for edge pruning to distinguish true causal edges from spurious associations. The advantage of generalized score matching lies in its flexibility regarding data distribution assumptions, enabling it to handle diverse and complex data types, including those with non-linear dependencies and high-dimensional features. Furthermore, this method enhances the robustness of causal discovery by avoiding strong reliance on structural assumptions. Experimental results demonstrate superior performance in recovering causal structures on both synthetic and real-world datasets.