MissNODAG: Differentiable Learning of Cyclic Causal Graphs
from Incomplete Data


 


Muralikrishnna G. Sethuraman                        Razieh Nabi                        Faramarz Fekri

Georgia Institute of Technology                        Emory University                        Georgia Institute of Technology

Abstract

Causal discovery in real-world systems, such as biological networks, is often complicated by feedback loops and incomplete data. Standard algorithms, which assume acyclic structures or fully observed data, struggle with these challenges. To address this gap, we propose MissNODAG, a differentiable framework for learning both the underlying cyclic causal graph and the missingness mechanism from partially observed data, including data missing not at random. Our framework integrates an additive noise model with an expectation-maximization procedure, alternating between imputing missing values and optimizing the observed data likelihood, to uncover both the cyclic structures and the missingness mechanism. We demonstrate the effectiveness of MissNODAG through synthetic experiments and an application to real-world gene perturbation data.

1 INTRODUCTION

Causal discovery, the process of identifying causal relationships from data, is fundamental across scientific domains such as biology, economics, and medicine (Spirtes et al.,, 2000; Sachs et al.,, 2005; Zhang et al.,, 2013; Segal et al.,, 2005; Imbens and Rubin,, 2015). Understanding these relationships is crucial for predicting how systems respond to interventions, enabling informed decision-making in complex systems (Solus et al.,, 2017; Sulik et al.,, 2017; Sethuraman et al.,, 2023). Traditionally, causal relationships are modeled using directed graphs, where nodes represent variables, and directed edges encode cause-effect relationships.

Existing causal discovery methods are typically divided into two main categories: constraint-based and score-based approaches. Constraint-based methods, such as the PC algorithm (Spirtes et al.,, 2000; Triantafillou and Tsamardinos,, 2015; Heinze-Deml et al.,, 2018), infer the causal structure by enforcing conditional independencies observed in the data, though they often struggle with scalability due to the large number of required conditional independence tests. Score-based methods, such as the GES algorithm (Meek,, 1997; Hauser and Bühlmann,, 2012), optimize a penalized score function, like the Bayesian Information Criterion, over the space of candidate graphs, usually employing greedy search techniques. There also exist hybrid methods which combine elements of both approaches, leveraging conditional independence tests alongside score optimization (Tsamardinos et al.,, 2006; Solus et al.,, 2017; Wang et al.,, 2017). Recent advances have introduced differentiable discovery methods, such as the NOTEARS algorithm (Zheng et al.,, 2018), which frames learning of a directed acyclic graph (DAG) as a continuous optimization problem, enabling scalable and efficient solutions via gradient-based methods. Following NOTEARS, several extensions have been developed for learning DAGs under various assumptions in observational settings (Yu et al.,, 2019; Ng et al.,, 2020, 2022; Zheng et al.,, 2020; Lee et al.,, 2019).

Most causal discovery methods make one or both of the following assumptions: (i) the data is fully observed, and (ii) the underlying graph is acyclic. However, real-world systems often violate these assumptions. Biological systems, such as gene regulatory networks, and socio-economic processes frequently exhibit feedback loops (cycles) (Freimer et al.,, 2022), while missing data is common in practical applications (Getzen et al.,, 2023). These complexities significantly limit the applicability of standard causal discovery methods.

Missing data mechanisms are classified into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Little and Rubin,, 2019). One common approach to dealing with missing data involves discarding incomplete samples or excluding variables with missing data (Carter,, 2006; Van den Broeck et al.,, 2015; Strobl et al.,, 2018), which is only suitable when data is MCAR or when missingness rate is negligible. Otherwise, it leads to performance degradation as the missingness increases. Another common approach is imputation-based methods where the missing data is first imputed before applying causal discovery algorithm on the data. Some notable imputation algorithms include multivariate imputation by chained equations (MICE) (White et al.,, 2011), MissForest (Stekhoven and Bühlmann,, 2012), optimal transport (Muzellec et al.,, 2020), and a few deep learning based approaches (Li et al.,, 2019; Luo et al.,, 2018). However, imputation-based methods typically assume that data is MAR, which can introduce bias when data is actually MNAR–a common occurrence in practice (Singh,, 1997; Wang et al.,, 2020; Kyono et al.,, 2021; Gao et al.,, 2022). Research addressing MNAR includes (Gain and Shpitser,, 2018), which uses reweighted observed cases as input to the PC algorithm alongside a weighted correlation matrix. Additionally, Tu et al., (2019) extends the PC algorithm by incorporating corrections to account for both MAR and certain cases of MNAR, while also learning the underlying missingness mechanisms.

Furthermore, while the acyclicity assumption simplifies computations by factorizing joint distributions into conditional densities, many real-world systems feature cyclic relationships. Feedback loops are common in both biological systems and socio-economic processes (Sachs et al.,, 2005; Freimer et al.,, 2022). Several approaches have been developed to relax the acyclicity assumption, allowing for cyclic causal graphs. For example, early work by Richardson, (1996) extended the constrained-based framework to account for cycles, Lacerda et al., (2008) generalized the ICA based causal discovery for linear non-Gaussian cyclic graphs. More recent score-based methods for learning cyclic graphs include (Huetter and Rigollet,, 2020; Améndola et al.,, 2020; Mooij and Heskes,, 2013; Drton et al.,, 2019). Additionally, methods such as those proposed by Hyttinen et al., (2012) and Huetter and Rigollet, (2020) focus on learning cyclic graphs from interventional data. Sethuraman et al., (2023) further extended this line of approach to nonlinear cyclic directed graphs, eliminating the need for augmented Lagrangian-based solvers by directly modeling the data likelihood.

Contributions. In this work, we address two major limitations in causal discovery: handling informative MNAR data and accommodating cyclic relationships. We introduce MissNODAG, a novel framework that adapts an expectation-maximization (EM) procedure to learn cyclic causal graphs from incomplete interventional data, applicable to both linear and nonlinear structural equation models as well as MCAR, MAR, or MNAR missingness models. MissNODAG alternates between imputing missing values and maximizing the expected log-likelihood of the observed data, building on the approach of Gao et al., (2022). Following Sethuraman et al., (2023) and Behrmann et al., (2019), we leverage residual normalizing flows to model data likelihood. Through synthetic experiments, we show that MissNODAG outperforms state-of-the-art imputation techniques combined with causal learning on partially missing interventional data.

We organize the paper as follows. In Section 2, we describe the problem setup and outline the modeling assumptions. Section 3 introduces the proposed expectation-maximization-based MissNODAG framework. In Section 4, we validate MissNODAG on various synthetic datasets. Section 5 concludes the paper. All proofs are deferred to the appendix.

2 PROBLEM SETUP

A list of notations is provided in Appendix A.

Structural Equation Model. Let 𝒢(X)𝒢𝑋\mathcal{G}(X)caligraphic_G ( italic_X ) denote a possibly cyclic causal graph with a set of vertices X=(X1,,XK)𝑋subscript𝑋1subscript𝑋𝐾X=(X_{1},\ldots,X_{K})italic_X = ( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ), representing a vector of K𝐾Kitalic_K variables connected by directed edges. We will abbreviate 𝒢(X)𝒢𝑋\mathcal{G}(X)caligraphic_G ( italic_X ) as simply 𝒢𝒢\mathcal{G}caligraphic_G, when the vertex set is clear from the given context. We assume the following structural equation model (SEM) with additive noise terms to capture the functional relationships between variables in 𝒢𝒢\mathcal{G}caligraphic_G (Bollen,, 1989; Pearl, 2009a, ):

Xk=fk(pa𝒢(Xk))+ϵk,k=1,,K,formulae-sequencesubscript𝑋𝑘subscript𝑓𝑘subscriptpa𝒢subscript𝑋𝑘subscriptitalic-ϵ𝑘𝑘1𝐾X_{k}=f_{k}\big{(}\textrm{pa}_{\mathcal{G}}(X_{k})\big{)}+\epsilon_{k},\quad k% =1,\ldots,K,italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) + italic_ϵ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k = 1 , … , italic_K , (1)

where pa𝒢(Xk)subscriptpa𝒢subscript𝑋𝑘\textrm{pa}_{\mathcal{G}}(X_{k})pa start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) denotes the parents of Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT in 𝒢𝒢\mathcal{G}caligraphic_G, that is, pa𝒢(Xk)={XX:XXk}subscriptpa𝒢subscript𝑋𝑘conditional-setsubscript𝑋𝑋subscript𝑋subscript𝑋𝑘\textrm{pa}_{\mathcal{G}}(X_{k})=\{X_{\ell}\in X:X_{\ell}\to X_{k}\}pa start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = { italic_X start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ∈ italic_X : italic_X start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT → italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT }. We assume that self-loops (edges of the form XkXksubscript𝑋𝑘subscript𝑋𝑘X_{k}\to X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT → italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT) are absent in 𝒢(X)𝒢𝑋\mathcal{G}(X)caligraphic_G ( italic_X ), as this could lead to model identifiability issues (Hyttinen et al.,, 2012). The function fksubscript𝑓𝑘f_{k}italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT describes the relationship between Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and its parents, with ϵksubscriptitalic-ϵ𝑘\epsilon_{k}italic_ϵ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT as the exogenous noise term, assumed to be mutually independent (no unmeasured confounders), and collected as ϵ=(ϵ1,,ϵK)italic-ϵsubscriptitalic-ϵ1subscriptitalic-ϵ𝐾\epsilon=(\epsilon_{1},\ldots,\epsilon_{K})italic_ϵ = ( italic_ϵ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ϵ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ). Let F(X)F𝑋\mathrm{F}(X)roman_F ( italic_X ) collect the functions fk(pa𝒢(Xk))subscript𝑓𝑘subscriptpa𝒢subscript𝑋𝑘f_{k}(\textrm{pa}_{\mathcal{G}}(X_{k}))italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ), for all k𝑘kitalic_k. The structural equations in (1) can be written as follows:

X=F(X)+ϵ.𝑋F𝑋italic-ϵX=\mathrm{F}(X)+\epsilon.italic_X = roman_F ( italic_X ) + italic_ϵ . (2)

Let id denote the identity map, so (idF)idF(\textrm{id}-\mathrm{F})( id - roman_F ) maps X𝑋Xitalic_X to ϵitalic-ϵ\epsilonitalic_ϵ. We assume that this mapping is bijective, ensuring the existence of (idF)1superscriptidF1(\textrm{id}-\mathrm{F})^{-1}( id - roman_F ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT, and that both (idF)idF(\textrm{id}-\mathrm{F})( id - roman_F ) and (idF)idF(\textrm{id}-\mathrm{F})( id - roman_F ) are differentiable. The former ensures that there is a unique X𝑋Xitalic_X for a given ϵitalic-ϵ\epsilonitalic_ϵ, thus, we can express X𝑋Xitalic_X as X=(idF)1(ϵ)𝑋superscriptidF1italic-ϵX=(\textrm{id}-\mathrm{F})^{-1}(\epsilon)italic_X = ( id - roman_F ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_ϵ ). This assumption is needed for our developments in Section 3.2, and is naturally satisfied when the underlying graph is acyclic (Mooij and Heskes,, 2013; Sethuraman et al.,, 2023).

Intervention operations can be readily incorporated into (2). In this work, we focus exclusively on surgical, or hard, interventions (Spirtes et al.,, 2000; Pearl, 2009a, ). Graphically, this corresponds to removing all incoming edges to the intervened variable. Following similar notational convention in (Hyttinen et al.,, 2012; Sethuraman et al.,, 2023), we can decompose X𝑋Xitalic_X into disjoint sets, X=XX𝒪𝑋subscript𝑋subscript𝑋𝒪X=X_{\mathcal{I}}\cup X_{\mathcal{O}}italic_X = italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ∪ italic_X start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT, where Xsubscript𝑋X_{\mathcal{I}}italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT represents the set of intervened variables in an interventional experiment, and X𝒪subscript𝑋𝒪X_{\mathcal{O}}italic_X start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT represents the set of purely observed variables. Let 𝐃{0,1}K×K𝐃superscript01𝐾𝐾\mathbf{D}\in\{0,1\}^{K\times K}bold_D ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_K × italic_K end_POSTSUPERSCRIPT be a diagonal matrix where Dkk=1subscript𝐷𝑘𝑘1D_{kk}=1italic_D start_POSTSUBSCRIPT italic_k italic_k end_POSTSUBSCRIPT = 1 if XkX𝒪subscript𝑋𝑘subscript𝑋𝒪X_{k}\in X_{\mathcal{O}}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT, and 00 otherwise. Under this setting, the SEM in (2) is now modified to:

X=𝐃F(X)+𝐃ϵ+C,𝑋𝐃F𝑋𝐃italic-ϵ𝐶X=\mathbf{D}\mathrm{F}(X)+\mathbf{D}\epsilon+C,italic_X = bold_D roman_F ( italic_X ) + bold_D italic_ϵ + italic_C , (3)

where C𝐶Citalic_C denotes a vector of size K𝐾Kitalic_K representing intervention assignments for variables in X𝑋Xitalic_X. Specifically, Ck=Xksubscript𝐶𝑘subscript𝑋𝑘C_{k}=X_{k}italic_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT if XkXsubscript𝑋𝑘subscript𝑋X_{k}\in X_{\mathcal{I}}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT, and Ck=0subscript𝐶𝑘0C_{k}=0italic_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 0 otherwise. Let ϵ𝒪subscriptitalic-ϵ𝒪\epsilon_{\mathcal{O}}italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT denote the exogenous noise terms corresponding to variables in X𝒪subscript𝑋𝒪X_{\mathcal{O}}italic_X start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT. Let pϵ𝒪(ϵ𝒪)subscript𝑝subscriptitalic-ϵ𝒪subscriptitalic-ϵ𝒪p_{\epsilon_{\mathcal{O}}}(\epsilon_{\mathcal{O}})italic_p start_POSTSUBSCRIPT italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT ) and pX(X)subscript𝑝subscript𝑋subscript𝑋{p}_{X_{\mathcal{I}}}(X_{\mathcal{I}})italic_p start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ) be the joint probability densities of ϵ𝒪subscriptitalic-ϵ𝒪\epsilon_{\mathcal{O}}italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT and Xsubscript𝑋X_{\mathcal{I}}italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT, respectively. We thus have,

pX(X)=pX(X)pϵ𝒪(ϵ𝒪)|detJ(id𝐃F)(X)|,subscript𝑝𝑋𝑋subscript𝑝subscript𝑋subscript𝑋subscript𝑝subscriptitalic-ϵ𝒪subscriptitalic-ϵ𝒪detsubscript𝐽id𝐃F𝑋p_{X}(X)=p_{X_{\mathcal{I}}}(X_{\mathcal{I}})\ p_{\epsilon_{\mathcal{O}}}\big{% (}\epsilon_{\mathcal{O}}\big{)}\ \big{|}\text{det}\ J_{(\textrm{id}-\mathbf{D}% \mathrm{F})}(X)\big{|},italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_X ) = italic_p start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ) italic_p start_POSTSUBSCRIPT italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT ) | det italic_J start_POSTSUBSCRIPT ( id - bold_D roman_F ) end_POSTSUBSCRIPT ( italic_X ) | , (4)

where detJ(id𝐃F)(X)detsubscript𝐽id𝐃F𝑋\text{det}\,J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)det italic_J start_POSTSUBSCRIPT ( id - bold_D roman_F ) end_POSTSUBSCRIPT ( italic_X ) denotes the determinant of the vector-valued Jacobian matrix of the function (id𝐃F)id𝐃F(\textrm{id}-\mathbf{D}\mathrm{F})( id - bold_D roman_F ) at X𝑋Xitalic_X. See a proof in Appendix B.1.

Missing Data Model. Given sampled data on X𝑋Xitalic_X, let R=(R1,,RK)𝑅subscript𝑅1subscript𝑅𝐾R=(R_{1},\dots,R_{K})italic_R = ( italic_R start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_R start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) be the vector of binary missingness indicators with Rk=1subscript𝑅𝑘1R_{k}=1italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1 if Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is observed and Rk=0subscript𝑅𝑘0R_{k}=0italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 0 if Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is missing. We only observe a coarsened version of Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, denoted by Yksubscript𝑌𝑘Y_{k}italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, which is deterministically defined as Yk=Xksubscript𝑌𝑘subscript𝑋𝑘Y_{k}=X_{k}italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT when Rk=1subscript𝑅𝑘1R_{k}=1italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1, and Yk=?subscript𝑌𝑘?Y_{k}=\,\,?italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = ? if Rk=0subscript𝑅𝑘0R_{k}=0italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 0. Let Y=(Y1,,YK)𝑌subscript𝑌1subscript𝑌𝐾Y=(Y_{1},\ldots,Y_{K})italic_Y = ( italic_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_Y start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) denote the coarsened variables. Additionally, we have access to S=(S1,,SK)𝑆subscript𝑆1subscript𝑆𝐾S=(S_{1},\ldots,S_{K})italic_S = ( italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_S start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) where Sksubscript𝑆𝑘S_{k}italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is a binary indicator of intervention, such that Sk=0subscript𝑆𝑘0S_{k}=0italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 0 if Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is intervened on (i.e., XkXsubscript𝑋𝑘subscript𝑋X_{k}\in X_{\mathcal{I}}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT), and Sk=1subscript𝑆𝑘1S_{k}=1italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1 otherwise. We assume we have n𝑛nitalic_n i.i.d. copies of (Y,R,S)𝑌𝑅𝑆(Y,R,S)( italic_Y , italic_R , italic_S ), and the dataset is denoted by 𝒟={y(i),r(i),s(i)}i=1n𝒟superscriptsubscriptsuperscript𝑦𝑖superscript𝑟𝑖superscript𝑠𝑖𝑖1𝑛\mathcal{D}=\{y^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n}caligraphic_D = { italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, where y(i),r(i),s(i)superscript𝑦𝑖superscript𝑟𝑖superscript𝑠𝑖y^{(i)},r^{(i)},s^{(i)}italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT represent the i𝑖iitalic_i-th observed values of Y,R,S𝑌𝑅𝑆Y,R,Sitalic_Y , italic_R , italic_S.

We define a missing data model as the collection of distributions over the variables X,R,Y𝑋𝑅𝑌X,R,Yitalic_X , italic_R , italic_Y. By chain rule of probabilities, we can express p(X,R,Y)𝑝𝑋𝑅𝑌p(X,R,Y)italic_p ( italic_X , italic_R , italic_Y ) as p(X)p(R|X)p(Y|X,R)𝑝𝑋𝑝conditional𝑅𝑋𝑝conditional𝑌𝑋𝑅p(X)p(R|X)p(Y|X,R)italic_p ( italic_X ) italic_p ( italic_R | italic_X ) italic_p ( italic_Y | italic_X , italic_R ). We refer to p(X)𝑝𝑋p(X)italic_p ( italic_X ) as the target law, p(R|X)𝑝conditional𝑅𝑋p(R|X)italic_p ( italic_R | italic_X ) as the missingness mechanism, p(X,R)𝑝𝑋𝑅p(X,R)italic_p ( italic_X , italic_R ) as the full law, while p(Y|X,R)𝑝conditional𝑌𝑋𝑅p(Y|X,R)italic_p ( italic_Y | italic_X , italic_R ) is the coarsening mechanism, which is deterministically defined. Borrowing ideas from the graphical models of missing data (Mohan et al.,, 2013; Nabi et al.,, 2022), we use graphs to encode assumptions about p(X,R,Y)𝑝𝑋𝑅𝑌p(X,R,Y)italic_p ( italic_X , italic_R , italic_Y ). Specifically, we assume that the relations between variables in the target law p(X)𝑝𝑋p(X)italic_p ( italic_X ) are directed and can include cycles, and the missingness mechanism p(R|X)𝑝conditional𝑅𝑋p(R|X)italic_p ( italic_R | italic_X ) factorizes according to a DAG, where pa𝒢(Rk)subscriptpa𝒢subscript𝑅𝑘\textrm{pa}_{\mathcal{G}}(R_{k})pa start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) can only be a subset of X𝑋Xitalic_X and RRk𝑅subscript𝑅𝑘R\setminus R_{k}italic_R ∖ italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. Finally, due to deterministic relations, Yksubscript𝑌𝑘Y_{k}italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT has only two parents Rksubscript𝑅𝑘R_{k}italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. We denote these graphs by 𝒢m(V)subscript𝒢𝑚𝑉\mathcal{G}_{m}(V)caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_V ), where V=(X,R,Y)𝑉𝑋𝑅𝑌V=(X,R,Y)italic_V = ( italic_X , italic_R , italic_Y ). Two examples of missing data graphs (or m𝑚mitalic_m-graphs), with K=3𝐾3K=3italic_K = 3 substantive variables, are provided in Figure 1; deterministic relations are drawn in gray to distinguish them from probabilistic ones.

X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTX2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTX3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTR1subscript𝑅1R_{1}italic_R start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTR2subscript𝑅2R_{2}italic_R start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTR3subscript𝑅3R_{3}italic_R start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTY1subscript𝑌1Y_{1}italic_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTY2subscript𝑌2Y_{2}italic_Y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTY3subscript𝑌3Y_{3}italic_Y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT(a)X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTX2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTX3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTR1subscript𝑅1R_{1}italic_R start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTR2subscript𝑅2R_{2}italic_R start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTR3subscript𝑅3R_{3}italic_R start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTY1subscript𝑌1Y_{1}italic_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTY2subscript𝑌2Y_{2}italic_Y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTY3subscript𝑌3Y_{3}italic_Y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT(b)
Figure 1: Example m𝑚mitalic_m-graphs with three variables illustrating: (a) An MNAR mechanism considered in our MissNODAG framework; (b) An MNAR mechanism where R𝑅Ritalic_Rs are connected and the full law is identifiable.

Graphically, an MCAR mechanism has no incoming edges into any missingness indicator in R𝑅Ritalic_R, a MAR mechanism has parents of missingness indicators that are fully observed, and an MNAR mechanism involves missingness indicators with parents in X𝑋Xitalic_X. Identifying the full or target law in an m𝑚mitalic_m-graph with MNAR mechanisms from observational data is not always possible. Previous work has extensively studied identification in graphical models of missing data (Bhattacharya et al.,, 2020; Nabi et al.,, 2020; Mohan and Pearl,, 2021; Nabi et al.,, 2022). Nabi et al., (2020) have shown that the full law in an m𝑚mitalic_m-graph is identified if and only if there are no edges of the form XkRksubscript𝑋𝑘subscript𝑅𝑘X_{k}\to R_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT → italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT (no self-censoring) and XjRkRjsubscript𝑋𝑗subscript𝑅𝑘subscript𝑅𝑗X_{j}\to R_{k}\leftarrow R_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT → italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ← italic_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (no colluders).

Given partially observed data from a set of interventional experiments, our objective is to learn the underlying full law that generated the sample. Specifically, this involves learning both the underlying target law and the missingness mechanism.

3 THE MissNODAG FRAMEWORK

We assume the target law p(X)𝑝𝑋p(X)italic_p ( italic_X ) and the missingness mechanism p(R|X)𝑝conditional𝑅𝑋p(R|X)italic_p ( italic_R | italic_X ) are parameterized by finite vectors θ𝜃\thetaitalic_θ and ϕitalic-ϕ\phiitalic_ϕ, respectively. Thus, we write the full law p(X,R)𝑝𝑋𝑅p(X,R)italic_p ( italic_X , italic_R ) as p(X,R|θ,ϕ)=p(X|θ)p(R|X,ϕ).𝑝𝑋conditional𝑅𝜃italic-ϕ𝑝conditional𝑋𝜃𝑝conditional𝑅𝑋italic-ϕp(X,R|\theta,\phi)=p(X|\theta)\ p(R|X,\phi).italic_p ( italic_X , italic_R | italic_θ , italic_ϕ ) = italic_p ( italic_X | italic_θ ) italic_p ( italic_R | italic_X , italic_ϕ ) . In order to learn the full law, we proceed by maximizing the log-likelihood of the observed data law.

Let Γi={k:rk(i)=1}subscriptΓ𝑖conditional-set𝑘superscriptsubscript𝑟𝑘𝑖1\Gamma_{i}=\{k:r_{k}^{(i)}=1\}roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_k : italic_r start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = 1 } and Ωi={k:rk(i)=0}subscriptΩ𝑖conditional-set𝑘superscriptsubscript𝑟𝑘𝑖0\Omega_{i}=\{k:r_{k}^{(i)}=0\}roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_k : italic_r start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = 0 } represent the sets of indices for the variables that are observed and missing, respectively, in the i𝑖iitalic_i-th sample; thus y(i)=xΓi(i)superscript𝑦𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖y^{(i)}=x_{\Gamma_{i}}^{(i)}italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT. Consequently, the observed data law p(xΓi(i),r(i))𝑝superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖p\big{(}x_{\Gamma_{i}}^{(i)},r^{(i)}\big{)}italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) can be written down as

p(xΓi(i),r(i)θ,ϕ)=p(xΓi(i),xΩi(i),r(i)θ,ϕ)𝑑xΩi(i).𝑝superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖𝜃italic-ϕ𝑝superscriptsubscript𝑥subscriptΓ𝑖𝑖superscriptsubscript𝑥subscriptΩ𝑖𝑖conditionalsuperscript𝑟𝑖𝜃italic-ϕdifferential-dsuperscriptsubscript𝑥subscriptΩ𝑖𝑖\displaystyle p\big{(}x_{\Gamma_{i}}^{(i)},r^{(i)}\mid\theta,\phi\big{)}\!=\!% \int\!p\big{(}x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^{(i)},r^{(i)}\mid\theta,\phi% \big{)}\ dx_{\Omega_{i}}^{(i)}.italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∣ italic_θ , italic_ϕ ) = ∫ italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∣ italic_θ , italic_ϕ ) italic_d italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT . (5)

This integration is generally intractable due to marginalization over missing variables and the lack of a closed-form solution. We address this intractability in maximizing the observed data likelihood when data is generated from an m𝑚mitalic_m-graph (Section 2). First, we provide an overview of the MissNODAG framework, followed by details on computing the log-likelihood for missing and observed variables, and discuss imputing the missing variables under linear and nonlinear SEMs.

3.1 The Overall Procedure

Assuming the data 𝒟={y(i),r(i),s(i)}i=1n𝒟superscriptsubscriptsuperscript𝑦𝑖superscript𝑟𝑖superscript𝑠𝑖𝑖1𝑛\mathcal{D}=\{y^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n}caligraphic_D = { italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is generated from an experiment where the full law p(X,R)𝑝𝑋𝑅p(X,R)italic_p ( italic_X , italic_R ) is represented via m𝑚mitalic_m-graphs, our goal is to learn the entire m𝑚mitalic_m-graph structure by maximizing, ~(𝒟,θ,ϕ)~𝒟𝜃italic-ϕ\tilde{\mathcal{L}}(\mathcal{D},\theta,\phi)over~ start_ARG caligraphic_L end_ARG ( caligraphic_D , italic_θ , italic_ϕ ), a regularized log-likelihood of the observed data law:

maxθ,ϕi=1nlogp(xΓi(i),r(i)|θ,ϕ)λ1(θ)λ2(ϕ),subscript𝜃italic-ϕsuperscriptsubscript𝑖1𝑛𝑝superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖𝜃italic-ϕsubscript𝜆1𝜃subscript𝜆2italic-ϕ\max_{\theta,\phi}\ \sum_{i=1}^{n}\log p\big{(}x_{\Gamma_{i}}^{(i)},r^{(i)}|% \theta,\phi\big{)}-\lambda_{1}\mathcal{R}(\theta)-\lambda_{2}\mathcal{R}(\phi),roman_max start_POSTSUBSCRIPT italic_θ , italic_ϕ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_log italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_θ , italic_ϕ ) - italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_R ( italic_θ ) - italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT caligraphic_R ( italic_ϕ ) , (6)

where ()\mathcal{R}(\cdot)caligraphic_R ( ⋅ ) is a regularization function that promotes sparsity and λ1,λ2subscript𝜆1subscript𝜆2\lambda_{1},\lambda_{2}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are regularization parameters.

As discussed, computing p(xΓi(i),r(i)|θ,ϕ)𝑝superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖𝜃italic-ϕp(x_{\Gamma_{i}}^{(i)},r^{(i)}|\theta,\phi)italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_θ , italic_ϕ ) is generally intractable, with no closed-form solution. However, (6) can be solved using the iterative penalized expectation-maximization (EM) method (Chen et al.,, 2014). Unlike imputation methods that directly sample missing values, the EM algorithm starts with an initial parameter Θ0=(θ0,ϕ0)superscriptΘ0superscript𝜃0superscriptitalic-ϕ0\Theta^{0}=(\theta^{0},\phi^{0})roman_Θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT = ( italic_θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) and alternates between the following two steps at iteration t𝑡titalic_t until convergence:

E-step: Use the current estimates of the model parameters, Θt=(θt,ϕt)superscriptΘ𝑡superscript𝜃𝑡superscriptitalic-ϕ𝑡\Theta^{t}=(\theta^{t},\phi^{t})roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = ( italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_ϕ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ), and the non-missing data to compute the expected log-likelihood of the full data, denoted by Q(Θ|Θt)𝑄conditionalΘsuperscriptΘ𝑡Q(\Theta|\Theta^{t})italic_Q ( roman_Θ | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ), and given by:

i=1n𝔼xΩi(i)p(|xΓi(i),r(i),Θt)[logp(xΓi(i),xΩi(i),r(i)Θ)].\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}\sim p(\cdot|x_{\Gamma_{i}}^{(i)% },r^{(i)},\Theta^{t})}\Big{[}\log p\big{(}x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^% {(i)},r^{(i)}\mid\Theta\big{)}\Big{]}.∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∼ italic_p ( ⋅ | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_POSTSUBSCRIPT [ roman_log italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∣ roman_Θ ) ] . (7)

M-step: Maximize Q(Θ|Θt)𝑄conditionalΘsuperscriptΘ𝑡Q(\Theta|\Theta^{t})italic_Q ( roman_Θ | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ), computed in the E-step, with respect to ΘΘ\Thetaroman_Θ:

Θt+1=argmaxΘQ(ΘΘt)λ1(θ)λ2(ϕ).superscriptΘ𝑡1subscriptΘ𝑄conditionalΘsuperscriptΘ𝑡subscript𝜆1𝜃subscript𝜆2italic-ϕ\Theta^{t+1}=\arg\max_{\Theta}\ Q(\Theta\mid\Theta^{t})-\lambda_{1}\mathcal{R}% (\theta)-\lambda_{2}\mathcal{R}(\phi).roman_Θ start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT italic_Q ( roman_Θ ∣ roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_R ( italic_θ ) - italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT caligraphic_R ( italic_ϕ ) . (8)

We use stochastic gradient-based solvers to solve the maximization problem, alternating between updating the parameters of the target law, θ𝜃\thetaitalic_θ, and the parameters of the missingness mechanism, ϕitalic-ϕ\phiitalic_ϕ. Note that,

i=1nlogp(xΓi(i),r(i)Θ)Q(ΘΘt)const.superscriptsubscript𝑖1𝑛𝑝superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖Θ𝑄conditionalΘsuperscriptΘ𝑡const\sum_{i=1}^{n}\log p(x_{\Gamma_{i}}^{(i)},r^{(i)}\mid\Theta)\geq Q(\Theta\mid% \Theta^{t})-\text{const}.∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_log italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∣ roman_Θ ) ≥ italic_Q ( roman_Θ ∣ roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - const . (9)

This inequality indicates that we maximize a lower bound of the log-likelihood as we update the parameters during the M-step. More details on the derivation of (9) and on the convergence analysis of MissNODAG are provided in Appendix B.3.

Identifiability and data requirements. For a family of interventions, maximizing (8) with complete data (assuming no missingness) is equivalent to maximizing the data likelihood. In the linear SEM with infinite samples, Sethuraman et al., (2023) showed that the recovered graph belongs to the same interventional quasi-equivalence class (Ghassami et al.,, 2020) as the ground truth. With single-node interventions for all nodes, the equivalence class collapses to the true causal graph (Hyttinen et al.,, 2012). In the presence of missing data, the EM algorithm does not directly optimize the observed data likelihood, limiting convergence to a stationary point. However, our experiments indicate that MissNODAG performs similarly to NODAGS-Flow trained on complete data (assuming no missingness), especially when single-node interventions cover all nodes and the missing probability is low (see Section 4). If the target law factorizes according to a DAG, identifiability from observational data is possible with modifications; see Appendix C.

3.2 Computational Details of the E-step

Computing the expected log-likelihood in the E-step can be challenging for a directed (cyclic) graph. First, let’s look at logp(xΓi(i),xΩi(i),r(i)|Θ)𝑝superscriptsubscript𝑥subscriptΓ𝑖𝑖superscriptsubscript𝑥subscriptΩ𝑖𝑖conditionalsuperscript𝑟𝑖Θ\log p\big{(}x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^{(i)},r^{(i)}|\Theta\big{)}roman_log italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ ) in (7), which equals:

logpX(xΓi(i),xΩi(i)|θ)+logp(r(i)|xΩi(i),xΓi(i),ϕ).subscript𝑝𝑋superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖𝜃𝑝conditionalsuperscript𝑟𝑖superscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖italic-ϕ\log p_{X}\big{(}x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^{(i)}|\theta\big{)}+\log p% \big{(}r^{(i)}|x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)},\phi\big{)}.roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_θ ) + roman_log italic_p ( italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_ϕ ) . (10)

3.2.1 Target Law, logpX(Xθ)subscript𝑝𝑋conditional𝑋𝜃\log p_{X}\big{(}X\mid\theta\big{)}roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_X ∣ italic_θ )

As per (4), computing logpX(X|θ)subscript𝑝𝑋conditional𝑋𝜃\log p_{X}(X|\theta)roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_X | italic_θ ) in (10) requires:

log|detJ(id𝐃F)(X)|.subscript𝐽id𝐃F𝑋\log\big{|}\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\big{|}.roman_log | roman_det italic_J start_POSTSUBSCRIPT ( id - bold_D roman_F ) end_POSTSUBSCRIPT ( italic_X ) | . (11)

In the worst case, the Jacobian matrix may require gradient calls in the order of K2superscript𝐾2K^{2}italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. To that end, following Sethuraman et al., (2023), we employ contractive residual flows (Behrmann et al.,, 2019; Chen et al.,, 2019) to compute the log-determinant of the Jacobian in a tractable manner.

Modeling the SEM in (2). We assume the SEM functions in F(X)F𝑋\mathrm{F}(X)roman_F ( italic_X ) in (2) are Lipschitz with Lipschitz constant less than one. Such functions are called contractive functions. It then follows from Banach fixed point theorem (Rudin,, 1953) that the mapping function (id𝐃F)id𝐃F(\textrm{id}-\mathbf{D}\mathrm{F})( id - bold_D roman_F ) is contractive and invertible.

We use neural networks to model the contractive functions in F(X)F𝑋\mathrm{F}(X)roman_F ( italic_X ). As shown by Behrmann et al., (2019), neural networks can be constrained to be contractive during the training phase by rescaling the layer weights by their corresponding spectral norm. While contractivity is a sufficient condition for the existence of (idF)1superscriptidF1(\textrm{id}-\mathrm{F})^{-1}( id - roman_F ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT, it is not necessary. When the underlying graph governing the target law p(X)𝑝𝑋p(X)italic_p ( italic_X ) is a DAG, it is possible to have non-contractive functions in F(X)F𝑋\mathrm{F}(X)roman_F ( italic_X ) for which (idF)1superscriptidF1(\textrm{id}-\mathrm{F})^{-1}( id - roman_F ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT exists; see (Sethuraman et al.,, 2023) for more details.

Naive implementation of a neural network may not produce promising results as it may introduce self-cycles. To circumvent this issue and simultaneously add sparsity penalization, we introduce a dependency mask matrix 𝑴{0,1}K×Ksimilar-to𝑴superscript01𝐾𝐾\bm{M}\sim\{0,1\}^{K\times K}bold_italic_M ∼ { 0 , 1 } start_POSTSUPERSCRIPT italic_K × italic_K end_POSTSUPERSCRIPT to explicitly encode the dependencies between the nodes in the graph, with zero diagonal entries to mask out the self-loops. Thus the SEM model F(X)𝐹𝑋F(X)italic_F ( italic_X ) takes the following form

[Fθ(X)]k=[NNθ(M,kX)]k,subscriptdelimited-[]subscriptF𝜃𝑋𝑘subscriptdelimited-[]subscriptNN𝜃direct-productsubscript𝑀𝑘𝑋𝑘[\mathrm{F}_{\theta}(X)]_{k}=[\text{NN}_{\theta}(M_{\ast,k}\odot X)]_{k},[ roman_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_X ) ] start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = [ NN start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_M start_POSTSUBSCRIPT ∗ , italic_k end_POSTSUBSCRIPT ⊙ italic_X ) ] start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , (12)

where NNθ denotes a fully connected neural network function with parameters θ𝜃\thetaitalic_θ, M,ksubscript𝑀𝑘M_{\ast,k}italic_M start_POSTSUBSCRIPT ∗ , italic_k end_POSTSUBSCRIPT denotes the k𝑘kitalic_k-th column of 𝑴𝑴\bm{M}bold_italic_M, and direct-product\odot denotes the Hadamard product. The dependency mask is sampled from Gumbel-softmax distribution (Jang et al.,, 2016), 𝑴p(𝑴|θ)similar-to𝑴𝑝conditional𝑴𝜃\bm{M}\sim p(\bm{M}|\theta)bold_italic_M ∼ italic_p ( bold_italic_M | italic_θ ) and the parameters θ𝜃\thetaitalic_θ are updated during the training (M-step). In this case, the sparsity penalty (θ)𝜃\mathcal{R}(\theta)caligraphic_R ( italic_θ ) in (6) is set as an L1 norm, i.e., (θ)=𝔼𝑴p(|θ)𝑴1\mathcal{R}(\theta)=\mathbb{E}_{\bm{M}\sim p(\cdot|\theta)}\|\bm{M}\|_{1}caligraphic_R ( italic_θ ) = blackboard_E start_POSTSUBSCRIPT bold_italic_M ∼ italic_p ( ⋅ | italic_θ ) end_POSTSUBSCRIPT ∥ bold_italic_M ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT.

Computing the log-determinant in (11). We note that log|detJ(id𝐃F)(X)|=log|det(𝑰𝐃JF)(X)|subscript𝐽id𝐃F𝑋𝑰𝐃subscript𝐽F𝑋\log\big{|}\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\big{|}=\log\big{|}% \det(\bm{I}-\mathbf{D}J_{\mathrm{F}})(X)\big{|}roman_log | roman_det italic_J start_POSTSUBSCRIPT ( id - bold_D roman_F ) end_POSTSUBSCRIPT ( italic_X ) | = roman_log | roman_det ( bold_italic_I - bold_D italic_J start_POSTSUBSCRIPT roman_F end_POSTSUBSCRIPT ) ( italic_X ) |, where 𝑰𝑰\bm{I}bold_italic_I is the K×K𝐾𝐾K\times Kitalic_K × italic_K identity matrix. Thus, the log-determinant of the Jacobian matrix can be computed using an unbiased estimator based on the power series expansion introduced by Behrmann et al., (2019),

log|detJ(id𝐃F)(X)|=m=11mTr{J𝐃Fm(X)},subscript𝐽id𝐃F𝑋superscriptsubscript𝑚11𝑚Trsubscriptsuperscript𝐽𝑚𝐃F𝑋\log\Big{|}\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\Big{|}=-\sum_{m=1}^{% \infty}\frac{1}{m}\text{Tr}\Big{\{}J^{m}_{\mathbf{D}\mathrm{F}}(X)\Big{\}},roman_log | roman_det italic_J start_POSTSUBSCRIPT ( id - bold_D roman_F ) end_POSTSUBSCRIPT ( italic_X ) | = - ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_m end_ARG Tr { italic_J start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_D roman_F end_POSTSUBSCRIPT ( italic_X ) } , (13)

where J𝐃Fm(X)superscriptsubscript𝐽𝐃F𝑚𝑋J_{\mathbf{D}\mathrm{F}}^{m}(X)italic_J start_POSTSUBSCRIPT bold_D roman_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_X ) denotes the Jacobian matrix to the m𝑚mitalic_m-th power. (13) is guaranteed to converge when F(X)F𝑋\mathrm{F}(X)roman_F ( italic_X ) is contractive (Hall,, 2013). In practice, (13) is computed by truncating the number of terms in the summation to a finite number. This, however, introduces bias in estimating the log-determinant of the Jacobian. In order to circumvent this issue we follow the steps taken by Chen et al., (2019). The power series expansion is truncated at a random cut-off Np(N)similar-to𝑁subscript𝑝𝑁N\sim p_{\mathbb{N}}(N)italic_N ∼ italic_p start_POSTSUBSCRIPT blackboard_N end_POSTSUBSCRIPT ( italic_N ), where psubscript𝑝p_{\mathbb{N}}italic_p start_POSTSUBSCRIPT blackboard_N end_POSTSUBSCRIPT is a probability distribution over the set of natural numbers \mathbb{N}blackboard_N. Each term in the finite power series is then re-weighted to obtain the following estimator:

log|detJ(id𝐃F)(X)|=𝔼N[m=1NTr{J𝐃Fm(X)}mP(m)],subscript𝐽id𝐃F𝑋subscript𝔼𝑁delimited-[]superscriptsubscript𝑚1𝑁Trsuperscriptsubscript𝐽𝐃F𝑚𝑋𝑚subscript𝑃𝑚\log\!\big{|}\!\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\big{|}\!=\!-% \mathbb{E}_{N}\!\!\Bigg{[}\sum_{m=1}^{N}\frac{\text{Tr}\big{\{}J_{\mathbf{D}% \mathrm{F}}^{m}(X)\big{\}}}{m\cdot P_{\mathbb{N}}(\ell\geq m)}\!\Bigg{]}\!,\!\!\!roman_log | roman_det italic_J start_POSTSUBSCRIPT ( id - bold_D roman_F ) end_POSTSUBSCRIPT ( italic_X ) | = - blackboard_E start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG Tr { italic_J start_POSTSUBSCRIPT bold_D roman_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_X ) } end_ARG start_ARG italic_m ⋅ italic_P start_POSTSUBSCRIPT blackboard_N end_POSTSUBSCRIPT ( roman_ℓ ≥ italic_m ) end_ARG ] , (14)

where Psubscript𝑃P_{\mathbb{N}}italic_P start_POSTSUBSCRIPT blackboard_N end_POSTSUBSCRIPT is the cumulative density function of psubscript𝑝p_{\mathbb{N}}italic_p start_POSTSUBSCRIPT blackboard_N end_POSTSUBSCRIPT. Gradient calls are still required in the order of K𝐾Kitalic_K. We can reduce this further by using the Hutchinson trace estimator (Hutchinson,, 1989), where W𝒩(0,𝑰)similar-to𝑊𝒩0𝑰W\sim\mathcal{N}(0,\bm{I})italic_W ∼ caligraphic_N ( 0 , bold_italic_I ):

Tr{J𝐃Fm(X)}=𝔼W[WJ𝐃Fm(X)W],Trsuperscriptsubscript𝐽𝐃F𝑚𝑋subscript𝔼𝑊delimited-[]superscript𝑊topsuperscriptsubscript𝐽𝐃F𝑚𝑋𝑊\text{Tr}\Big{\{}J_{\mathbf{D}\mathrm{F}}^{m}(X)\Big{\}}=\mathbb{E}_{W}\Big{[}% W^{\top}J_{\mathbf{D}\mathrm{F}}^{m}(X)W\Big{]},Tr { italic_J start_POSTSUBSCRIPT bold_D roman_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_X ) } = blackboard_E start_POSTSUBSCRIPT italic_W end_POSTSUBSCRIPT [ italic_W start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_J start_POSTSUBSCRIPT bold_D roman_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_X ) italic_W ] , (15)

We note that the remainder of (4) can be efficiently computed with a forward pass of the neural network.

3.2.2 Calculating Expectation via Imputation

Combining (4), (7), (14), and (15), we finally get

Q(Θ|Θt)i=1n𝔼xΩi(i)|xΓi(i),r(i);Θt{logp(r(i)|xΩi(i),xΓi(i),ϕ)+logpϵ𝒪(ϵ𝒪i(i))𝔼N,W[m=1NWJ𝐃Fm(x(i))WmP(m)]},proportional-to𝑄|ΘsuperscriptΘ𝑡superscriptsubscript𝑖1𝑛subscript𝔼conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖superscriptΘ𝑡𝑝|superscript𝑟𝑖superscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖italic-ϕsubscript𝑝subscriptitalic-ϵ𝒪superscriptsubscriptitalic-ϵsubscript𝒪𝑖𝑖subscript𝔼𝑁𝑊delimited-[]superscriptsubscript𝑚1𝑁superscript𝑊topsuperscriptsubscript𝐽𝐃F𝑚superscript𝑥𝑖𝑊𝑚subscript𝑃𝑚\!\!\!Q(\Theta|\Theta^{t})\!\propto\!\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^% {(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)};\Theta^{t}}\Bigg{\{}\!\log p\Big{(}r^{(i)}|% x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)},\phi\Big{)}\\ +\log p_{\epsilon_{\mathcal{O}}}\Big{(}\epsilon_{\mathcal{O}_{i}}^{(i)}\Big{)}% -\mathbb{E}_{N,W}\Bigg{[}\sum_{m=1}^{N}\frac{W^{\top}J_{\mathbf{D}\mathrm{F}}^% {m}(x^{(i)})W}{m\cdot P_{\mathbb{N}}(\ell\geq m)}\Bigg{]}\Bigg{\}},start_ROW start_CELL italic_Q ( roman_Θ | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∝ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ; roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT { roman_log italic_p ( italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_ϕ ) end_CELL end_ROW start_ROW start_CELL + roman_log italic_p start_POSTSUBSCRIPT italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_ϵ start_POSTSUBSCRIPT caligraphic_O start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) - blackboard_E start_POSTSUBSCRIPT italic_N , italic_W end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG italic_W start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_J start_POSTSUBSCRIPT bold_D roman_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) italic_W end_ARG start_ARG italic_m ⋅ italic_P start_POSTSUBSCRIPT blackboard_N end_POSTSUBSCRIPT ( roman_ℓ ≥ italic_m ) end_ARG ] } , end_CELL end_ROW

where ϵ(i)=(id𝐃iF)(x(i))superscriptitalic-ϵ𝑖idsubscript𝐃𝑖Fsuperscript𝑥𝑖\epsilon^{(i)}=(\textrm{id}-\mathbf{D}_{i}\mathrm{F})(x^{(i)})italic_ϵ start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = ( id - bold_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_F ) ( italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ), 𝐃isubscript𝐃𝑖\mathbf{D}_{i}bold_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the diagonal matrix corresponding to the interventional mask for the i𝑖iitalic_i-th sample, i.e., 𝐃i=diag(s1(i),,sK(i))subscript𝐃𝑖diagsuperscriptsubscript𝑠1𝑖superscriptsubscript𝑠𝐾𝑖\mathbf{D}_{i}=\text{diag}(s_{1}^{(i)},\ldots,s_{K}^{(i)})bold_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = diag ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ).

The expectation in the approximation for Q(Θ|Θt)𝑄conditionalΘsuperscriptΘ𝑡Q(\Theta|\Theta^{t})italic_Q ( roman_Θ | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) generally lacks a closed-form solution. Therefore, it must be approximated by the sample mean, using samples drawn from the posterior distribution p(xΩi(i)|xΓi(i),r(i),Θt)𝑝conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖superscriptΘ𝑡p(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t})italic_p ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ). This presents two main challenges: (i) the posterior distribution may not have a closed form, and (ii) direct sampling may be infeasible even when the posterior distribution can be evaluated. The difficulty arises due to the presence of nonlinear relations in F(X)F𝑋\mathrm{F}(X)roman_F ( italic_X ) and the missingness mechanism p(r(i)|xΩi(i),xΓi(i),ϕt)𝑝conditionalsuperscript𝑟𝑖superscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscriptitalic-ϕ𝑡p(r^{(i)}|x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)},\phi^{t})italic_p ( italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_ϕ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ), which may preclude straightforward sampling. Therefore, we employ rejection sampling (Koller and Friedman,, 2009) to draw samples from a proposal distribution 𝒬(xΩi(i))𝒬superscriptsubscript𝑥subscriptΩ𝑖𝑖\mathcal{Q}(x_{\Omega_{i}}^{(i)})caligraphic_Q ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ), from which samples can be readily generated.

To that end, a constant c0>0subscript𝑐00c_{0}>0italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT > 0 is chosen such that c0𝒬(xΩi(i))p(xΩi(i)|xΓi(i),r(i),Θt)subscript𝑐0𝒬superscriptsubscript𝑥subscriptΩ𝑖𝑖𝑝conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖superscriptΘ𝑡c_{0}\mathcal{Q}(x_{\Omega_{i}}^{(i)})\geq p(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i% }}^{(i)},r^{(i)},\Theta^{t})italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT caligraphic_Q ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ≥ italic_p ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) for all i=1,,n𝑖1𝑛i=1,\ldots,nitalic_i = 1 , … , italic_n. However, as stated earlier the posterior distribution is not readily available. Thus, from Bayes rule, we have

p(xΩi(i)|xΓi(i),r(i),Θt)=p(xΩi(i),xΓi(i)|θt)p(r(i)|xΩi(i),xΓi(i),ϕt)p(xΓi(i),r(i)|Θt),𝑝conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖superscriptΘ𝑡𝑝superscriptsubscript𝑥subscriptΩ𝑖𝑖conditionalsuperscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝜃𝑡𝑝conditionalsuperscript𝑟𝑖superscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscriptitalic-ϕ𝑡𝑝superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖superscriptΘ𝑡p(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t})\!=\!\frac{p(x_% {\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)}|\theta^{t})p(r^{(i)}|x_{\Omega_{i}}^{(% i)},x_{\Gamma_{i}}^{(i)},\phi^{t})}{p(x_{\Gamma_{i}}^{(i)},r^{(i)}|\Theta^{t})},italic_p ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) = divide start_ARG italic_p ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) italic_p ( italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_ϕ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG start_ARG italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG ,

where the denominator p(xΓi(i),r(i)|Θt)𝑝superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖superscriptΘ𝑡p(x_{\Gamma_{i}}^{(i)},r^{(i)}|\Theta^{t})italic_p ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) can be evaluated using fully observed data. The first term in the numerator can be computed efficiently, as discussed in Section 3.2.1. The second term is addressed in Section 3.2.3. Before that, we explain how these calculations simplify under more restrictive models. The imputation procedure is summarized in Algorithm 1.

Algorithm 1 Impute-Rejection
1:Minibatch data ={y(i),r(i),s(i)}i=1nBsuperscriptsubscriptsuperscript𝑦𝑖superscript𝑟𝑖superscript𝑠𝑖𝑖1subscript𝑛𝐵\mathcal{B}\!=\!\{y^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n_{B}}caligraphic_B = { italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, with sampling distribution 𝒬(x)𝒬𝑥\mathcal{Q}(x)caligraphic_Q ( italic_x ).
2:Imputed data ~={x^(i),r(i),s(i)}i=1nB~superscriptsubscriptsuperscript^𝑥𝑖superscript𝑟𝑖superscript𝑠𝑖𝑖1subscript𝑛𝐵\tilde{\mathcal{B}}=\{\widehat{x}^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n_{B}}over~ start_ARG caligraphic_B end_ARG = { over^ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUPERSCRIPT.
3:for i=1𝑖1i=1italic_i = 1 to nBsubscript𝑛𝐵n_{B}italic_n start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT do
4:     Sample x~Ωi(i)𝒬(xΩi)similar-tosuperscriptsubscript~𝑥subscriptΩ𝑖𝑖𝒬subscript𝑥subscriptΩ𝑖\tilde{x}_{\Omega_{i}}^{(i)}\sim\mathcal{Q}(x_{\Omega_{i}})over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∼ caligraphic_Q ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ).
5:     Pick uU[0,1]similar-to𝑢𝑈01u\sim U[0,1]italic_u ∼ italic_U [ 0 , 1 ].
6:     if up(x~Ωi(i),yΓi(i)|θt)p(r(i)|x~Ωi(i),yΓi(i),ϕt)c0𝒬(x~Ωi(i))𝑢𝑝superscriptsubscript~𝑥subscriptΩ𝑖𝑖conditionalsuperscriptsubscript𝑦subscriptΓ𝑖𝑖superscript𝜃𝑡𝑝conditionalsuperscript𝑟𝑖superscriptsubscript~𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑦subscriptΓ𝑖𝑖superscriptitalic-ϕ𝑡subscript𝑐0𝒬superscriptsubscript~𝑥subscriptΩ𝑖𝑖u\leq\frac{p(\tilde{x}_{\Omega_{i}}^{(i)},y_{\Gamma_{i}}^{(i)}|\theta^{t})p(r^% {(i)}|\tilde{x}_{\Omega_{i}}^{(i)},y_{\Gamma_{i}}^{(i)},\phi^{t})}{c_{0}% \mathcal{Q}(\tilde{x}_{\Omega_{i}}^{(i)})}italic_u ≤ divide start_ARG italic_p ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) italic_p ( italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_ϕ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG start_ARG italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT caligraphic_Q ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) end_ARG then
7:         Accept sample: x^Ωi(i)=x~Ωi(i);x^Γi(i)=yΓi(i)formulae-sequencesuperscriptsubscript^𝑥subscriptΩ𝑖𝑖superscriptsubscript~𝑥subscriptΩ𝑖𝑖superscriptsubscript^𝑥subscriptΓ𝑖𝑖superscriptsubscript𝑦subscriptΓ𝑖𝑖\widehat{x}_{\Omega_{i}}^{(i)}=\tilde{x}_{\Omega_{i}}^{(i)};\quad\widehat{x}_{% \Gamma_{i}}^{(i)}=y_{\Gamma_{i}}^{(i)}over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ; over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = italic_y start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT.
8:     end if
9:end forreturn ~={x^(i),r(i),s(i)}i=1nB~superscriptsubscriptsuperscript^𝑥𝑖superscript𝑟𝑖superscript𝑠𝑖𝑖1subscript𝑛𝐵\tilde{\mathcal{B}}=\{\widehat{x}^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n_{B}}over~ start_ARG caligraphic_B end_ARG = { over^ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUPERSCRIPT

Linear SEMs with MAR mechanisms. When the underlying SEM is linear with additive Gaussian noise and the missingness mechanism is MAR, it is possible to sample from the exact posterior. Under this setup, we can ignore the missingness mechanism and solely focus on the target law. Consider X=𝑩X+ϵ𝑋superscript𝑩top𝑋italic-ϵX=\bm{B}^{\top}X+\epsilonitalic_X = bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X + italic_ϵ, where ϵ𝒩(0,𝚲1)similar-toitalic-ϵ𝒩0superscript𝚲1\epsilon\sim\mathcal{N}(0,\bm{\Lambda}^{-1})italic_ϵ ∼ caligraphic_N ( 0 , bold_Λ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ), 𝚲=diag(1/σ12,,1/σK2)𝚲diag1superscriptsubscript𝜎121superscriptsubscript𝜎𝐾2\bm{\Lambda}=\text{diag}(1/\sigma_{1}^{2},\ldots,1/\sigma_{K}^{2})bold_Λ = diag ( 1 / italic_σ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , … , 1 / italic_σ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) and 𝑩𝑩\bm{B}bold_italic_B is the weighted adjacency matrix of 𝒢𝒢\mathcal{G}caligraphic_G. X𝑋Xitalic_X is also Gaussian distributed, with X𝒩(𝟎,𝚲X1)similar-to𝑋𝒩0superscriptsubscript𝚲𝑋1X\sim\mathcal{N}(\bm{0},\bm{\Lambda}_{X}^{-1})italic_X ∼ caligraphic_N ( bold_0 , bold_Λ start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) where

𝚲X=(𝑰𝑩)𝚲(𝑰𝑩).subscript𝚲𝑋𝑰𝑩𝚲𝑰superscript𝑩top\bm{\Lambda}_{X}=(\bm{I}-\bm{B})\bm{\Lambda}(\bm{I}-\bm{B}^{\top}).bold_Λ start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT = ( bold_italic_I - bold_italic_B ) bold_Λ ( bold_italic_I - bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) . (16)

Thus, based on the properties of the Gaussian distribution, the posterior distribution of the missing nodes, conditioned on the observed nodes, also follows a Gaussian distribution (Koller and Friedman,, 2009), i.e., xΩi(i)|xΓi(i)𝒩(μ~(i),𝚲~(i))similar-toconditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖𝒩superscript~𝜇𝑖superscript~𝚲𝑖x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)}\sim\mathcal{N}(\tilde{\mu}^{(i)},% \tilde{\bm{\Lambda}}^{(i)})italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∼ caligraphic_N ( over~ start_ARG italic_μ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , over~ start_ARG bold_Λ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) such that 𝚲~(i)=[𝚲X]Ωi,Ωisuperscript~𝚲𝑖subscriptdelimited-[]subscript𝚲𝑋subscriptΩ𝑖subscriptΩ𝑖\tilde{\bm{\Lambda}}^{(i)}=[\bm{\Lambda}_{X}]_{\Omega_{i},\Omega_{i}}over~ start_ARG bold_Λ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = [ bold_Λ start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT, and η~(i)=𝚲~(i)xΓi(i)superscript~𝜂𝑖superscript~𝚲𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖\tilde{\eta}^{(i)}=-\tilde{\bm{\Lambda}}^{(i)}x_{\Gamma_{i}}^{(i)}over~ start_ARG italic_η end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = - over~ start_ARG bold_Λ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT, where η~(i)=𝚲~(i)μ~(i)superscript~𝜂𝑖superscript~𝚲𝑖superscript~𝜇𝑖\tilde{\eta}^{(i)}=\tilde{\bm{\Lambda}}^{(i)}\tilde{\mu}^{(i)}over~ start_ARG italic_η end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = over~ start_ARG bold_Λ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT over~ start_ARG italic_μ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT.

The conditional expectation in (7) is estimated by imputing the missing values for each data instance by sampling from the Gaussian distribution described above. In an interventional setting, X=(X,X𝒪)𝑋subscript𝑋subscript𝑋𝒪X=(X_{\mathcal{I}},X_{\mathcal{O}})italic_X = ( italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT ) with X𝒩(𝟎,𝑰)similar-tosubscript𝑋𝒩0𝑰X_{\mathcal{I}}\sim\mathcal{N}(\bm{0},\bm{I})italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , bold_italic_I ), (16) can be modified as follows

𝚲X,=(𝑰𝐃𝑩)(𝐃𝚲1𝐃+(𝑰𝐃))1(𝑰𝐃𝑩),subscript𝚲𝑋𝑰𝐃𝑩superscript𝐃superscript𝚲1𝐃𝑰𝐃1𝑰𝐃superscript𝑩top\bm{\Lambda}_{X,\mathcal{I}}\!=\!(\bm{I}\!-\!\mathbf{D}\bm{B})(\mathbf{D}\bm{% \Lambda}^{-1}\mathbf{D}\!+\!(\bm{I}\!-\!\mathbf{D}))^{-1}(\bm{I}\!-\!\mathbf{D% }\bm{B}^{\top}),bold_Λ start_POSTSUBSCRIPT italic_X , caligraphic_I end_POSTSUBSCRIPT = ( bold_italic_I - bold_D bold_italic_B ) ( bold_D bold_Λ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_D + ( bold_italic_I - bold_D ) ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_italic_I - bold_D bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) , (17)

where 𝚲X,subscript𝚲𝑋\bm{\Lambda}_{X,\mathcal{I}}bold_Λ start_POSTSUBSCRIPT italic_X , caligraphic_I end_POSTSUBSCRIPT denotes the inverse covariance matrix of X𝑋Xitalic_X; see Appendix B.2 for a proof. The rest of the imputation procedure remains the same, except for using 𝚲X,subscript𝚲𝑋\bm{\Lambda}_{X,\mathcal{I}}bold_Λ start_POSTSUBSCRIPT italic_X , caligraphic_I end_POSTSUBSCRIPT in place of 𝚲Xsubscript𝚲𝑋\bm{\Lambda}_{X}bold_Λ start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT in the Gaussian distribution.

3.2.3 Missingness Mechanism, logp(RX,ϕ)𝑝conditional𝑅𝑋italic-ϕ\log p\big{(}R\mid X,\phi\big{)}roman_log italic_p ( italic_R ∣ italic_X , italic_ϕ )

In order to compute the log-likelihood in the E-step, we also need to compute logp(r(i)|xΩi(i),xΓi(i),ϕ)𝑝conditionalsuperscript𝑟𝑖superscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖italic-ϕ\log p\big{(}r^{(i)}|x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)},\phi\big{)}roman_log italic_p ( italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_ϕ ), which according to m𝑚mitalic_m-graphs can be factorized as

p(RX,ϕ)=k=1Kp(Rkpa𝒢m(Rk),ϕk).𝑝conditional𝑅𝑋italic-ϕsuperscriptsubscriptproduct𝑘1𝐾𝑝conditionalsubscript𝑅𝑘subscriptpasubscript𝒢𝑚subscript𝑅𝑘subscriptitalic-ϕ𝑘p(R\mid X,\phi)=\prod_{k=1}^{K}p\big{(}R_{k}\mid\text{pa}_{\mathcal{G}_{m}}(R_% {k}),\phi_{k}\big{)}.italic_p ( italic_R ∣ italic_X , italic_ϕ ) = ∏ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_p ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ pa start_POSTSUBSCRIPT caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) . (18)

In developing MissNODAG, we focus on a class of MNAR models, where for any RkRsubscript𝑅𝑘𝑅R_{k}\in Ritalic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_R, pa𝒢m(Rk)XXksubscriptpasubscript𝒢𝑚subscript𝑅𝑘𝑋subscript𝑋𝑘\textrm{pa}_{\mathcal{G}_{m}}(R_{k})\subseteq X\setminus X_{k}pa start_POSTSUBSCRIPT caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ⊆ italic_X ∖ italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, i.e., there are no self-censoring edges and no connections between the missingness indicators in R𝑅Ritalic_R. This implies RkXk,Rk|pa𝒢m(Rk)perpendicular-tosubscript𝑅𝑘subscript𝑋𝑘conditionalsubscript𝑅𝑘subscriptpasubscript𝒢𝑚subscript𝑅𝑘R_{k}\perp X_{k},R_{-k}|\textrm{pa}_{\mathcal{G}_{m}}(R_{k})italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ⟂ italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_R start_POSTSUBSCRIPT - italic_k end_POSTSUBSCRIPT | pa start_POSTSUBSCRIPT caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ), for all RkRsubscript𝑅𝑘𝑅R_{k}\in Ritalic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_R, based on the Markov properties (Pearl, 2009b, ). Here, Rksubscript𝑅𝑘R_{-k}italic_R start_POSTSUBSCRIPT - italic_k end_POSTSUBSCRIPT denotes the set R𝑅Ritalic_R excluding Rksubscript𝑅𝑘R_{k}italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. This MNAR class is known as the block-parallel MNAR model (Mohan et al.,, 2013; Nabi et al.,, 2022). Figure 1(a) illustrates this MNAR mechanism with K=3𝐾3K=3italic_K = 3 variables.

Under this MNAR class, p(R|X,ϕ)𝑝conditional𝑅𝑋italic-ϕp(R|X,\phi)italic_p ( italic_R | italic_X , italic_ϕ ) in (18) is identified as a function of the observed data by identifying each conditional factor via p(Rk|pa𝒢m(Rk),Rk=1,ϕk)𝑝conditionalsubscript𝑅𝑘subscriptpasubscript𝒢𝑚subscript𝑅𝑘subscriptsuperscript𝑅𝑘1subscriptitalic-ϕ𝑘p(R_{k}|\textrm{pa}_{\mathcal{G}_{m}}(R_{k}),R^{*}_{-k}=1,\phi_{k})italic_p ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | pa start_POSTSUBSCRIPT caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , italic_R start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_k end_POSTSUBSCRIPT = 1 , italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ), where Rksubscriptsuperscript𝑅𝑘R^{*}_{-k}italic_R start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_k end_POSTSUBSCRIPT denotes the missingness indicators for pa𝒢m(Rk)subscriptpasubscript𝒢𝑚subscript𝑅𝑘\textrm{pa}_{\mathcal{G}_{m}}(R_{k})pa start_POSTSUBSCRIPT caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ). Each conditional factor is modeled using the expit function, with ϕk={wk,zk}subscriptitalic-ϕ𝑘subscript𝑤𝑘subscript𝑧𝑘\phi_{k}=\{w_{k},z_{k}\}italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = { italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT }:

p(Rk=0pa𝒢m(Rk),ϕk)=expit(wkX+zk).𝑝subscript𝑅𝑘conditional0subscriptpasubscript𝒢𝑚subscript𝑅𝑘subscriptitalic-ϕ𝑘expitsuperscriptsubscript𝑤𝑘top𝑋subscript𝑧𝑘p\big{(}R_{k}=0\mid\text{pa}_{\mathcal{G}_{m}}(R_{k}),\phi_{k}\big{)}\!=\!% \text{expit}\left(w_{k}^{\top}X+z_{k}\right).italic_p ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 0 ∣ pa start_POSTSUBSCRIPT caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = expit ( italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X + italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) . (19)

Maximizing the M-step with respect to ϕitalic-ϕ\phiitalic_ϕ reduces to solving a sparsity-regularized logistic regression, with our choice of (ϕ)=k=1Kwk1italic-ϕsuperscriptsubscript𝑘1𝐾subscriptnormsubscript𝑤𝑘1\mathcal{R}(\phi)=\sum_{k=1}^{K}\|w_{k}\|_{1}caligraphic_R ( italic_ϕ ) = ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∥ italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT.

Block-parallel MNAR is well-suited for modeling missingness mechanisms in cross-sectional studies, surveys, or retrospective studies. For example, consider a study analyzing the relationship between smoking (X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT), tar accumulation in the lungs (X2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT), and a bronchitis diagnosis (X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT), where missing entries in the data are indicated by R1subscript𝑅1R_{1}italic_R start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, R2subscript𝑅2R_{2}italic_R start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and R3subscript𝑅3R_{3}italic_R start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT. Under the block-parallel MNAR model, X2R1X3subscript𝑋2subscript𝑅1subscript𝑋3X_{2}\rightarrow R_{1}\leftarrow X_{3}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT → italic_R start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ← italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT suggests that whether an individual’s smoking status is recorded depends on their tar accumulation and bronchitis diagnosis. This could happen when smoking history is only inquired after a suspected diagnosis of tar accumulation and bronchitis. Similarly, X3R2subscript𝑋3subscript𝑅2X_{3}\rightarrow R_{2}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT → italic_R start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT implies that a suspected bronchitis diagnosis might prompt inquiries into tar accumulation. Additionally, smokers are more likely to be tested for both tar accumulation and bronchitis, represented by R2X1R3subscript𝑅2subscript𝑋1subscript𝑅3R_{2}\leftarrow X_{1}\rightarrow R_{3}italic_R start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ← italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT → italic_R start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, and detecting high tar levels in the lungs may trigger bronchitis testing, reflected by X2R3subscript𝑋2subscript𝑅3X_{2}\rightarrow R_{3}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT → italic_R start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT. Another use case of the block-parallel model is discussed in the context of the prisoner’s dilemma by Nabi et al., (2022). This model was also explored in the study by Tu et al., (2019).

Refer to caption
(a) Linear SEM
Refer to caption
(b) Nonlinear SEM
Figure 2: Comparison of results for learning causal graph structure (target law) under linear (left) and nonlinear (right) SEMS with MNAR missingness mechanism (K=10𝐾10K=10italic_K = 10). Each missingness indicator has at most 3 parents.

Extensions to other MNAR models. Identifiability is critical for the EM algorithm since, without it, multiple parameter sets may yield the same likelihood, preventing EM from converging. Beyond block-parallel MNAR, other identifiable MNAR models, such as those with no colluders with an example provided in Figure 1(b) (Nabi et al.,, 2020) (see Section 2), offer viable directions for extending our framework. To learn the full law p(X,R)𝑝𝑋𝑅p(X,R)italic_p ( italic_X , italic_R ), additional constraints are required to ensure that p(R|X)𝑝conditional𝑅𝑋p(R|X)italic_p ( italic_R | italic_X ) forms a DAG. Specifically, we can introduce a constraint on the adjacency matrix 𝑾K×K𝑾superscript𝐾𝐾\bm{W}\in\mathbb{R}^{K\times K}bold_italic_W ∈ blackboard_R start_POSTSUPERSCRIPT italic_K × italic_K end_POSTSUPERSCRIPT representing edges between missingness indicators, restricting the search space to DAGs using Tr(e𝑾𝑾)K=0Trsuperscript𝑒direct-product𝑾𝑾𝐾0\text{Tr}(e^{\bm{W}\odot\bm{W}})-K=0Tr ( italic_e start_POSTSUPERSCRIPT bold_italic_W ⊙ bold_italic_W end_POSTSUPERSCRIPT ) - italic_K = 0 (Zheng et al.,, 2018). However, this alone does not eliminate colluders, which will require further refinement, a topic we leave for future work. If the goal is to learn only the target law and an identifiable MNAR mechanism is fixed, the EM algorithm can be simplified. In this case, after learning the parameters of the fixed missingness mechanism (e.g., through weighted estimating equations (Seaman and White,, 2013; Bhattacharya et al.,, 2020)), the optimization in (8) focuses only on the target law parameters, θ𝜃\thetaitalic_θ, simplifying the process even for non-block-parallel MNAR models.

4 EXPERIMENTS

The code for MissNoDAG is available at the repository: https://github.com/muralikgs/missnodag.

Using synthetic data, we compared MissNODAG to MissDAG (missdag) (Gao et al.,, 2022), an EM-based causal discovery method limited to DAGs and MAR missingness. We also tested state-of-the-art imputation methods, including optimal transport (optransport) (Muzellec et al.,, 2020), and MissForest (missforest) (Stekhoven and Bühlmann,, 2012), followed by causal graph learning from the imputed data using NODAGS-Flow (Sethuraman et al.,, 2023), see Appendix E for details on the implementation of MissNODAG and the baselines.

In all experiments, we generated cyclic directed graphs with K=10𝐾10K=10italic_K = 10 nodes using the Erdős-Rényi (ER) model, varying edge density between 1 and 2 (ER-1 and ER-2). Self-loops were removed, and the m𝑚mitalic_m-graph described in Section 3.2.3 was used to create partially missing data. MissNODAG’s initial estimate for missingness parameters was obtained via logistic regression on complete cases in the data set (samples with no missing values), then refined through the EM procedure. We evaluated MissNODAG against baselines on linear and nonlinear SEMs with MNAR and MAR mechanisms.

Linear SEM with MNAR Mechanism. The edge coefficients of the causal graph were sampled from a uniform distribution over (0.6,0.25)(0.25,0.6)0.60.250.250.6(-0.6,-0.25)\cup(0.25,0.6)( - 0.6 , - 0.25 ) ∪ ( 0.25 , 0.6 ), and rescaled for contractivity. The data consisted of single-node interventions with n=500𝑛500n=500italic_n = 500 per intervention. Each intervened node was sampled from a standard normal distribution, and ϵ𝒩(0,0.252𝑰)similar-toitalic-ϵ𝒩0superscript0.252𝑰\epsilon\sim\mathcal{N}(0,0.25^{2}\bm{I})italic_ϵ ∼ caligraphic_N ( 0 , 0.25 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_italic_I ). We assumed the intervened node was never missing.

We evaluated MissNODAG and the baselines using the Structural Hamming Distance (SHD), which counts the number of operations (deletion, addition, reversal) needed to match the estimated graph to the true one. All models were trained for 100 training epochs, with missingness probabilities varying from 0.10.10.10.1 to 0.50.50.50.5, and experiments repeated on 10 random graphs. Figure 2(a) shows the SHD between the true and the estimated graph, focusing on the target law recovery (edges between Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT’s). MissNODAG outperformed all baselines for ER-1 graphs. When the missing rate is below 0.4, it matched the performance of NODAGS-Flow on complete data (assuming no missingness). A similar trend was observed for ER-2 graphs, further demonstrating MissNODAG’s efficacy. Figure 2 also shows the recovery performance of NODAGS-Flow on complete data (nodags+clean) for comparison.

Nonlinear SEM with MNAR Mechanism. The causal function for the nonlinear SEM is defined as F(X)=tanh(𝑾X).F𝑋superscript𝑾top𝑋\mathrm{F}(X)=\tanh(\bm{W}^{\top}X).roman_F ( italic_X ) = roman_tanh ( bold_italic_W start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X ) . Non-zero entries of 𝑾K×K𝑾superscript𝐾𝐾\bm{W}\in\mathbb{R}^{K\times K}bold_italic_W ∈ blackboard_R start_POSTSUPERSCRIPT italic_K × italic_K end_POSTSUPERSCRIPT are sampled from a uniform distribution and scaled to make FF\mathrm{F}roman_F contractive, with a Lipschitz constant of 0.9. The dataset includes single-node interventions with sample size n=500𝑛500n=500italic_n = 500, and noise terms are Gaussian.

We used neural networks with a single hidden layer and tanh activation to learn the causal function F(X)F𝑋\mathrm{F}(X)roman_F ( italic_X ). The objective function (8) was optimized using Adam optimizer (Kingma and Ba,, 2014). All models were trained for 100 epochs, with the average missingness probability ranging from 0.10.10.10.1 to 0.50.50.50.5. Figure 2(b) shows the mean SHD over 10 trails, where MissNODAG consistently outperforms all baselines at both edge densities, with the performance gap widening as edge density increases.

Refer to caption
Figure 3: Comparison of results for learning the missingness mechanism (K=10𝐾10K=10italic_K = 10).

Learning Missingness Mechanism. In addition to learning the target law, MissNODAG also learns the underlying missingness mechanism, as detailed in Section 3.2.3. Figure 3 shows the error in recovering the m𝑚mitalic_m-graph edges corresponding to the missingness mechanism (i.e., X𝑋Xitalic_X to R𝑅Ritalic_R edges), using SHD as the metric. We evaluated the performance on training sets with 5000 and 10000 samples. Each missingness indicator is restricted to have at most 3 parents. As seen in Figure 3, when the average the missing probability is low (<0.3absent0.3<0.3< 0.3), MissNODAG perfectly recovers the missingness mechanism with 10000 samples, while SHD is below 2.5 with 5000 samples. Recovery performance decreases as the missingness probability increases.

Roles of Sample Size and Missingness Sparsity. We conducted additional simulations to examine MissNODAG’s performance on target law recovery as functions of sample size and missingness mechanism sparsity (controlled by the parent set cardinality of the missingness indicators). Due to space constraints, we defer these analyses to Appendices D.1 and D.2.

Linear SEM and MAR Mechanism. For MAR missingness and linear SEMs, missing values can be imputed by sampling from the exact posterior distribution of missing variables given the observed ones, as detailed in Section 3.2.2. Results are summarized in Figure 4. For ER-1, MissNODAG outperforms all baselines and matches NODAGS-Flow on clean data when the missing probability is below 0.4. For ER-2, overall performance slightly decreases, but MissNODAG continues to outperform all baselines.

Refer to caption
Figure 4: Comparison of results for learning causal graph structure (target law) under linear SEMs with MAR missingness mechanism (K=10𝐾10K=10italic_K = 10).

DAG Learning from Observational Data. We conducted experiments on learning DAGs from partially observed observational data with MNAR missingness. See Appendix D.3 for details.

Real-data Application. We tested MissNODAG on a gene regulatory network with gene expression data from genetic interventions (Frangieh et al.,, 2021). See Appendix D.4 for details.

5 DISCUSSION

In this work, we proposed MissNODAG, a novel differentiable causal discovery framework capable of learning nonlinear cyclic causal graphs along with the missingness mechanism from incomplete interventional data. The framework employs an expectation-maximization algorithm that alternates between imputing missing values and optimizing model parameters. We demonstrated how imputation can be efficiently achieved using rejection sampling, and in the case of linear SEMs with MAR missingness, by directly sampling from the posterior distribution. One of the key strengths of MissNODAG is its ability to handle cyclic directed graphs and MNAR missingness, a significant advancement over methods that typically focus on DAGs and MAR mechanisms.

Future research directions include: (1) incorporating realistic measurement noise models in the SEMs to enhance robustness in real-world datasets, as explored in linear DAG models by Saeed et al., (2020); (2) scaling the current framework to larger graphs using low-rank models and variational inference techniques, as explored in acyclic structures by Lopez et al., (2022); (3) allowing for unobserved confounders within the modeling assumptions, as was explored in the case of latent DAGs with complete observations by Bhattacharya et al., (2021); and (4) generalizing our framework to broader classes of identifiable MNAR models, while tackling the challenges of non-identifiability in the full or target laws, as explored in DAG models by Nabi and Bhattacharya, (2023) and Guo et al., (2023).

References

  • Améndola et al., (2020) Améndola, C., Dettling, P., Drton, M., Onori, F., and Wu, J. (2020). Structure learning for cyclic linear causal models. In Conference on Uncertainty in Artificial Intelligence, pages 999–1008. PMLR.
  • Behrmann et al., (2019) Behrmann, J., Grathwohl, W., Chen, R. T., Duvenaud, D., and Jacobsen, J.-H. (2019). Invertible residual networks. In International Conference on Machine Learning, pages 573–582. PMLR.
  • Bhattacharya et al., (2020) Bhattacharya, R., Nabi, R., Shpitser, I., and Robins, J. M. (2020). Identification in missing data models represented by directed acyclic graphs. In Uncertainty in artificial intelligence, pages 1149–1158. PMLR.
  • Bhattacharya et al., (2021) Bhattacharya, R., Nagarajan, T., Malinsky, D., and Shpitser, I. (2021). Differentiable causal discovery under unmeasured confounding. In International Conference on Artificial Intelligence and Statistics, pages 2314–2322. PMLR.
  • Bollen, (1989) Bollen, K. A. (1989). Structural equations with latent variables, volume 210. John Wiley & Sons.
  • Carter, (2006) Carter, R. L. (2006). Solutions for missing data in structural equation modeling. Research & Practice in Assessment, 1:4–7.
  • Chen et al., (2014) Chen, L. S., Prentice, R. L., and Wang, P. (2014). A penalized em algorithm incorporating missing data mechanism for gaussian parameter estimation. Biometrics, 70(2):312–322.
  • Chen et al., (2019) Chen, R. T., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. (2019). Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 32.
  • Drton et al., (2019) Drton, M., Fox, C., and Wang, Y. S. (2019). Computation of maximum likelihood estimates in cyclic structural equation models. The Annals of Statistics, 47(2):663 – 690.
  • Frangieh et al., (2021) Frangieh, C. J., Melms, J. C., Thakore, P. I., Geiger-Schuller, K. R., Ho, P., Luoma, A. M., Cleary, B., Jerby-Arnon, L., Malu, S., Cuoco, M. S., et al. (2021). Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion. Nature genetics, 53(3):332–341.
  • Freimer et al., (2022) Freimer, J. W., Shaked, O., Naqvi, S., Sinnott-Armstrong, N., Kathiria, A., Garrido, C. M., Chen, A. F., Cortez, J. T., Greenleaf, W. J., Pritchard, J. K., and Marson, A. (2022). Systematic discovery and perturbation of regulatory genes in human T cells reveals the architecture of immune networks. Nature Genetics, pages 1–12.
  • Friedman, (1998) Friedman, N. (1998). The bayesian structural em algorithm. In Conference on Uncertainty in Artificial Intelligence.
  • Gain and Shpitser, (2018) Gain, A. and Shpitser, I. (2018). Structure learning under missing data. In International conference on probabilistic graphical models, pages 121–132. PMLR.
  • Gao et al., (2022) Gao, E., Ng, I., Gong, M., Shen, L., Huang, W., Liu, T., Zhang, K., and Bondell, H. (2022). Missdag: Causal discovery in the presence of missing data with continuous additive noise models. Advances in Neural Information Processing Systems, 35:5024–5038.
  • Getzen et al., (2023) Getzen, E., Ungar, L., Mowery, D., Jiang, X., and Long, Q. (2023). Mining for equitable health: Assessing the impact of missing data in electronic health records. Journal of biomedical informatics, 139:104269.
  • Ghassami et al., (2020) Ghassami, A., Yang, A., Kiyavash, N., and Zhang, K. (2020). Characterizing distribution equivalence and structure learning for cyclic and acyclic directed graphs. In International Conference on Machine Learning, pages 3494–3504. PMLR.
  • Guo et al., (2023) Guo, A., Zhao, J., and Nabi, R. (2023). Sufficient identification conditions and semiparametric estimation under missing not at random mechanisms. In Uncertainty in Artificial Intelligence, pages 777–787. PMLR.
  • Hall, (2013) Hall, B. C. (2013). Lie Groups, Lie Algebras, and Representations, pages 333–366. Springer New York, New York, NY.
  • Hauser and Bühlmann, (2012) Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409–2464.
  • Heinze-Deml et al., (2018) Heinze-Deml, C., Peters, J., and Meinshausen, N. (2018). Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2).
  • Huetter and Rigollet, (2020) Huetter, J.-C. and Rigollet, P. (2020). Estimation rates for sparse linear cyclic causal models. In Peters, J. and Sontag, D., editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pages 1169–1178. PMLR.
  • Hutchinson, (1989) Hutchinson, M. F. (1989). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076.
  • Hyttinen et al., (2012) Hyttinen, A., Eberhardt, F., and Hoyer, P. O. (2012). Learning linear cyclic causal models with latent variables. The Journal of Machine Learning Research, 13(1):3387–3439.
  • Imbens and Rubin, (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
  • Jang et al., (2016) Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
  • Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Koller and Friedman, (2009) Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.
  • Kyono et al., (2021) Kyono, T., Zhang, Y., Bellot, A., and van der Schaar, M. (2021). Miracle: Causally-aware imputation via learning missing data mechanisms. Advances in Neural Information Processing Systems, 34:23806–23817.
  • Lacerda et al., (2008) Lacerda, G., Spirtes, P., Ramsey, J., and Hoyer, P. O. (2008). Discovering cyclic causal models by independent components analysis. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI’08, page 366–374, Arlington, Virginia, USA. AUAI Press.
  • Lee et al., (2019) Lee, H.-C., Danieletto, M., Miotto, R., Cherng, S. T., and Dudley, J. T. (2019). Scaling structural learning with NO-BEARS to infer causal transcriptome networks. In Pacific Symposium on Biocomputing 2020, pages 391–402. World Scientific.
  • Li et al., (2019) Li, S. C.-X., Jiang, B., and Marlin, B. (2019). Learning from incomplete data with generative adversarial networks. In International Conference on Learning Representations.
  • Little and Rubin, (2019) Little, R. J. and Rubin, D. B. (2019). Statistical analysis with missing data, volume 793. John Wiley & Sons.
  • Lopez et al., (2022) Lopez, R., Hütter, J.-C., Pritchard, J., and Regev, A. (2022). Large-scale differentiable causal discovery of factor graphs. Advances in Neural Information Processing Systems, 35:19290–19303.
  • Luo et al., (2018) Luo, Y., Cai, X., Zhang, Y., Xu, J., et al. (2018). Multivariate time series imputation with generative adversarial networks. Advances in neural information processing systems, 31.
  • Meek, (1997) Meek, C. (1997). Graphical Models: Selecting causal and statistical models. PhD thesis, Carnegie Mellon University.
  • Mohan and Pearl, (2021) Mohan, K. and Pearl, J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, 116(534):1023–1037.
  • Mohan et al., (2013) Mohan, K., Pearl, J., and Tian, J. (2013). Graphical models for inference with missing data. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.
  • Mooij and Heskes, (2013) Mooij, J. M. and Heskes, T. (2013). Cyclic causal discovery from continuous equilibrium data. In Uncertainty in Artificial Intelligence.
  • Muzellec et al., (2020) Muzellec, B., Josse, J., Boyer, C., and Cuturi, M. (2020). Missing data imputation using optimal transport. In International Conference on Machine Learning, pages 7130–7140. PMLR.
  • Nabi and Bhattacharya, (2023) Nabi, R. and Bhattacharya, R. (2023). On testability and goodness of fit tests in missing data models. In Uncertainty in Artificial Intelligence, pages 1467–1477. PMLR.
  • Nabi et al., (2020) Nabi, R., Bhattacharya, R., and Shpitser, I. (2020). Full law identification in graphical models of missing data: Completeness results. In International conference on machine learning, pages 7153–7163. PMLR.
  • Nabi et al., (2022) Nabi, R., Bhattacharya, R., Shpitser, I., and Robins, J. (2022). Causal and counterfactual views of missing data models. arXiv preprint arXiv:2210.05558.
  • Ng et al., (2020) Ng, I., Ghassami, A., and Zhang, K. (2020). On the role of sparsity and DAG constraints for learning linear dags. Advances in Neural Information Processing Systems, 33:17943–17954.
  • Ng et al., (2022) Ng, I., Zhu, S., Fang, Z., Li, H., Chen, Z., and Wang, J. (2022). Masked gradient-based causal structure learning. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), pages 424–432. SIAM.
  • (45) Pearl, J. (2009a). Causality. Cambridge University Press, 2 edition.
  • (46) Pearl, J. (2009b). Causality: Models, Reasoning, and Inference. Cambridge University Press, 2 edition.
  • Richardson, (1996) Richardson, T. (1996). A discovery algorithm for directed cyclic graphs. In Proceedings of the Twelfth international conference on Uncertainty in artificial intelligence, pages 454–461.
  • Rudin, (1953) Rudin, W. (1953). Principles of Mathematical Analysis. McGraw-Hill Book Company, Inc., New York-Toronto-London.
  • Sachs et al., (2005) Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529.
  • Saeed et al., (2020) Saeed, B., Belyaeva, A., Wang, Y., and Uhler, C. (2020). Anchored causal inference in the presence of measurement error. In Conference on uncertainty in artificial intelligence, pages 619–628. PMLR.
  • Seaman and White, (2013) Seaman, S. R. and White, I. R. (2013). Review of inverse probability weighting for dealing with missing data. Statistical methods in medical research, 22(3):278–295.
  • Segal et al., (2005) Segal, E., Pe’er, D., Regev, A., Koller, D., Friedman, N., and Jaakkola, T. (2005). Learning module networks. Journal of Machine Learning Research, 6(4).
  • Sethuraman et al., (2023) Sethuraman, M. G., Lopez, R., Mohan, R., Fekri, F., Biancalani, T., and Huetter, J.-C. (2023). Nodags-flow: Nonlinear cyclic causal structure learning. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 6371–6387. PMLR.
  • Singh, (1997) Singh, M. (1997). Learning bayesian networks from incomplete data. AAAI/IAAI, 1001:534–539.
  • Solus et al., (2017) Solus, L., Wang, Y., Matejovicova, L., and Uhler, C. (2017). Consistency guarantees for permutation-based causal inference algorithms. arXiv preprint arXiv:1702.03530.
  • Spirtes et al., (2000) Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, prediction, and search. MIT press.
  • Stekhoven and Bühlmann, (2012) Stekhoven, D. J. and Bühlmann, P. (2012). Missforest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118.
  • Strobl et al., (2018) Strobl, E. V., Visweswaran, S., and Spirtes, P. L. (2018). Fast causal inference with non-random missingness by test-wise deletion. International journal of data science and analytics, 6:47–62.
  • Sulik et al., (2017) Sulik, J. J., Newlands, N. K., and Long, D. S. (2017). Encoding dependence in bayesian causal networks. Frontiers in Environmental Science, 4:84.
  • Triantafillou and Tsamardinos, (2015) Triantafillou, S. and Tsamardinos, I. (2015). Constraint-based causal discovery from multiple interventions over overlapping variable sets. The Journal of Machine Learning Research, 16(1):2147–2205.
  • Tsamardinos et al., (2006) Tsamardinos, I., Brown, L. E., and Aliferis, C. F. (2006). The max-min hill-climbing bayesian network structure learning algorithm. Machine learning, 65(1):31–78.
  • Tu et al., (2019) Tu, R., Zhang, C., Ackermann, P., Mohan, K., Kjellström, H., and Zhang, K. (2019). Causal discovery in the presence of missing data. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1762–1770. PMLR.
  • Van den Broeck et al., (2015) Van den Broeck, G., Mohan, K., Choi, A., Darwiche, A., and Pearl, J. (2015). Efficient algorithms for bayesian network parameter learning from incomplete data. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI’15, page 161–170, Arlington, Virginia, USA. AUAI Press.
  • Wang et al., (2020) Wang, Y., Menkovski, V., Wang, H., Du, X., and Pechenizkiy, M. (2020). Causal discovery from incomplete data: a deep learning approach. arXiv preprint arXiv:2001.05343.
  • Wang et al., (2017) Wang, Y., Solus, L., Yang, K., and Uhler, C. (2017). Permutation-based causal inference algorithms with interventions. Advances in Neural Information Processing Systems, 30.
  • White et al., (2011) White, I. R., Royston, P., and Wood, A. M. (2011). Multiple imputation using chained equations: issues and guidance for practice. Statistics in medicine, 30(4):377–399.
  • Wu, (1983) Wu, C. F. J. (1983). On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11(1):95 – 103.
  • Yu et al., (2019) Yu, Y., Chen, J., Gao, T., and Yu, M. (2019). DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR.
  • Zhang et al., (2013) Zhang, B., Gaiteri, C., Bodea, L.-G., Wang, Z., McElwee, J., Podtelezhnikov, A. A., Zhang, C., Xie, T., Tran, L., and Dobrin, R. (2013). Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer’s disease. Cell, 153(3):707–720.
  • Zheng et al., (2018) Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31.
  • Zheng et al., (2020) Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E. (2020). Learning sparse nonparametric DAGs. In Chiappa, S. and Calandra, R., editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108, pages 3414–3425.

Supplementary Materials

The appendix is structured as follows. Appendix A offers a summary of the notations used throughout the manuscript for ease of reference. Appendix B contains all the proofs. Appendix C details modifications to the MissNODAG framework necessary when the search space is confined to DAGs. Appendix D provides details on three sets of additional simulations and a real-data application. Appendix E provides additional details on the implementation of the baselines and the MissNODAG framework.

Appendix A GLOSSARY

A comprehensive list of notations used in the manuscript is provided in Table 1.

Table 1: Glossary of terms and notations
Symbol Definition Symbol Definition
X𝑋Xitalic_X, x𝑥xitalic_x Variables, values 𝒢,𝒢m𝒢subscript𝒢𝑚\mathcal{G},\mathcal{G}_{m}caligraphic_G , caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT Graph, m-graph
ϵitalic-ϵ\epsilonitalic_ϵ Exogenous noise terms k𝑘kitalic_k Graph node index variable
R𝑅Ritalic_R Missingness indicators K𝐾Kitalic_K Total nodes in 𝒢𝒢\mathcal{G}caligraphic_G
Y𝑌Yitalic_Y Coarsened version of X𝑋Xitalic_X pa𝒢(Xk)subscriptpa𝒢subscript𝑋𝑘\textrm{pa}_{\mathcal{G}}(X_{k})pa start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) Parent set of Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT in 𝒢𝒢\mathcal{G}caligraphic_G
S𝑆Sitalic_S Intervention indicator vector F,fkFsubscript𝑓𝑘\mathrm{F},f_{k}roman_F , italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT SEM functions
Xsubscript𝑋X_{\mathcal{I}}italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT Set of intervened nodes id Identity map
X𝒪subscript𝑋𝒪X_{\mathcal{O}}italic_X start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT Set of non-intervened nodes (idF)id𝐹(\textrm{id}-F)( id - italic_F ) Function mapping X𝑋Xitalic_X to ϵitalic-ϵ\epsilonitalic_ϵ
pϵ(ϵ)subscript𝑝italic-ϵitalic-ϵp_{\epsilon}(\epsilon)italic_p start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( italic_ϵ ) Exogenous noise density function JF(X)subscript𝐽F𝑋J_{\mathrm{F}}(X)italic_J start_POSTSUBSCRIPT roman_F end_POSTSUBSCRIPT ( italic_X ) Jacobian matrix of FF\mathrm{F}roman_F at X𝑋Xitalic_X
pX(X),p(X,R)subscript𝑝𝑋𝑋𝑝𝑋𝑅p_{X}(X),p(X,R)italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_X ) , italic_p ( italic_X , italic_R ) Target, full laws 𝑰𝑰\bm{I}bold_italic_I K×K𝐾𝐾K\times Kitalic_K × italic_K Identity matrix
p(Y|X,R)𝑝conditional𝑌𝑋𝑅p(Y|X,R)italic_p ( italic_Y | italic_X , italic_R ) Coarsening mechanism 𝐃𝐃\mathbf{D}bold_D Matrix of intervention masks
p(R|X)𝑝conditional𝑅𝑋p(R|X)italic_p ( italic_R | italic_X ) Missingness mechanism 𝑩𝑩\bm{B}bold_italic_B Weighted adjacency matrix of linear SEM
p(𝑴|θ)𝑝conditional𝑴𝜃p(\bm{M}|\theta)italic_p ( bold_italic_M | italic_θ ) Gumbel-Softmax distribution for sampling 𝑴𝑴\bm{M}bold_italic_M 𝑴𝑴\bm{M}bold_italic_M Dependency mask for SEM function FF\mathrm{F}roman_F
Θ=(θ,ϕ)Θ𝜃italic-ϕ\Theta=(\theta,\phi)roman_Θ = ( italic_θ , italic_ϕ ) Parameters of pX(X)subscript𝑝𝑋𝑋p_{X}(X)italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_X ), p(R|X)𝑝conditional𝑅𝑋p(R|X)italic_p ( italic_R | italic_X ) 𝚲𝚲\bm{\Lambda}bold_Λ Inverse covariance matrix
ΘtsuperscriptΘ𝑡\Theta^{t}roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT Parameters at EM t𝑡titalic_t-th iteration 𝐃isubscript𝐃𝑖\mathbf{D}_{i}bold_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT Intervention mask corresponding to i𝑖iitalic_i-th sample
Γ,ΩΓΩ\Gamma,\Omegaroman_Γ , roman_Ω Index sets for observed, missing nodes i𝑖iitalic_i Sample index variable
wk,zksubscript𝑤𝑘subscript𝑧𝑘w_{k},z_{k}italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT Parameters of p(Rkpa𝒢m(Rk))𝑝conditionalsubscript𝑅𝑘subscriptpasubscript𝒢𝑚subscript𝑅𝑘p(R_{k}\mid\textrm{pa}_{\mathcal{G}_{m}}(R_{k}))italic_p ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ pa start_POSTSUBSCRIPT caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) n𝑛nitalic_n Total sample size
(y(i),r(i),s(i))superscript𝑦𝑖superscript𝑟𝑖superscript𝑠𝑖(y^{(i)},r^{(i)},s^{(i)})( italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) i-th sample nBsubscript𝑛𝐵n_{B}italic_n start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT Mini batch size in each EM
𝑨𝑩direct-product𝑨𝑩\bm{A}\odot\bm{B}bold_italic_A ⊙ bold_italic_B Hadamard product of 𝑨𝑨\bm{A}bold_italic_A and 𝑩𝑩\bm{B}bold_italic_B m𝑚mitalic_m Power series expansion index for log-determinant of Jacobian
1\|\cdot\|_{1}∥ ⋅ ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT L1 norm N𝑁Nitalic_N Number of power series terms
𝒬()𝒬\mathcal{Q}(\cdot)caligraphic_Q ( ⋅ ) Proposal distribution for rejection sampling W𝑊Witalic_W Gaussian randon variable used for computing trace of Jacobian
()\mathcal{R}(\cdot)caligraphic_R ( ⋅ ) Sparsity inducing regularizer e𝑨superscript𝑒𝑨e^{\bm{A}}italic_e start_POSTSUPERSCRIPT bold_italic_A end_POSTSUPERSCRIPT Matrix exponent of 𝑨𝑨\bm{A}bold_italic_A

Appendix B PROOFS

B.1 Joint Density of Variables X𝑋Xitalic_X: Target Law

Consider the structural equation model X=F(X)+ϵ𝑋F𝑋italic-ϵX=\mathrm{F}(X)+\epsilonitalic_X = roman_F ( italic_X ) + italic_ϵ, which implies X=(idF)1(ϵ)𝑋superscriptidF1italic-ϵX=(\textrm{id}-\mathrm{F})^{-1}(\epsilon)italic_X = ( id - roman_F ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_ϵ ). Using the properties of probability density functions, the joint distribution of X=(X1,,XK)𝑋subscript𝑋1subscript𝑋𝐾X=(X_{1},\ldots,X_{K})italic_X = ( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) is given by,

pX(X)=pϵ((idF)(X))|detJ(idF)(X)|,subscript𝑝𝑋𝑋subscript𝑝italic-ϵidF𝑋subscript𝐽idF𝑋p_{X}(X)=p_{\epsilon}\big{(}(\textrm{id}-\mathrm{F})(X)\big{)}\Big{|}\det J_{(% \textrm{id}-\mathrm{F})}(X)\Big{|},italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_X ) = italic_p start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( ( id - roman_F ) ( italic_X ) ) | roman_det italic_J start_POSTSUBSCRIPT ( id - roman_F ) end_POSTSUBSCRIPT ( italic_X ) | , (20)

where pϵ(ϵ)subscript𝑝italic-ϵitalic-ϵp_{\epsilon}(\epsilon)italic_p start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( italic_ϵ ) is the probability density function of the exogenous noise vector ϵitalic-ϵ\epsilonitalic_ϵ. Under an interventional setting (X,X𝒪)subscript𝑋subscript𝑋𝒪(X_{\mathcal{I}},X_{\mathcal{O}})( italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT ), all incoming edges to the nodes in Xsubscript𝑋X_{\mathcal{I}}italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT are removed, leading to the following structural equations:

Xk={X~kif XkXfk(pa𝒢(Xk))if XkXsubscript𝑋𝑘casessubscript~𝑋𝑘if subscript𝑋𝑘subscript𝑋subscript𝑓𝑘subscriptpa𝒢subscript𝑋𝑘if subscript𝑋𝑘subscript𝑋X_{k}=\begin{cases}\tilde{X}_{k}&\text{if }X_{k}\in X_{\mathcal{I}}\\ f_{k}\big{(}\textrm{pa}_{\mathcal{G}}(X_{k})\big{)}&\text{if }X_{k}\notin X_{% \mathcal{I}}\end{cases}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = { start_ROW start_CELL over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_CELL start_CELL if italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) end_CELL start_CELL if italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∉ italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT end_CELL end_ROW

That is, Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is set to a known value X~ksubscript~𝑋𝑘\tilde{X}_{k}over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT if it is intervened upon, and the structural equation remains unchanged when Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is purely observed. The above equation can be written more concisely as follows:

Xk=dk(fk(pa𝒢(Xk))+ϵk)+Ck,for k=1,,K.formulae-sequencesubscript𝑋𝑘subscript𝑑𝑘subscript𝑓𝑘subscriptpa𝒢subscript𝑋𝑘subscriptitalic-ϵ𝑘subscript𝐶𝑘for 𝑘1𝐾X_{k}=d_{k}\cdot\Big{(}f_{k}(\textrm{pa}_{\mathcal{G}}(X_{k}))+\epsilon_{k}% \Big{)}+C_{k},\quad\text{for }k=1,\ldots,K.italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ⋅ ( italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) + italic_ϵ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + italic_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , for italic_k = 1 , … , italic_K . (21)

where dk=𝟙{XkX}subscript𝑑𝑘1subscript𝑋𝑘subscript𝑋d_{k}=\mathds{1}\{X_{k}\notin X_{\mathcal{I}}\}italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = blackboard_1 { italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∉ italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT }, and 𝟙{}1\mathds{1}\{\cdot\}blackboard_1 { ⋅ } is the indicator function, and Ck=X~ksubscript𝐶𝑘subscript~𝑋𝑘C_{k}=\tilde{X}_{k}italic_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = over~ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT if XkXsubscript𝑋𝑘subscript𝑋X_{k}\in X_{\mathcal{I}}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT, and Ck=0subscript𝐶𝑘0C_{k}=0italic_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 0 otherwise. Let 𝐃K×K𝐃superscript𝐾𝐾\mathbf{D}\in\mathbb{R}^{K\times K}bold_D ∈ blackboard_R start_POSTSUPERSCRIPT italic_K × italic_K end_POSTSUPERSCRIPT be a diagonal matrix such that Dkk=dksubscript𝐷𝑘𝑘subscript𝑑𝑘D_{kk}=d_{k}italic_D start_POSTSUBSCRIPT italic_k italic_k end_POSTSUBSCRIPT = italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. Thus, (21) can now be combined for k=1,,K𝑘1𝐾k=1,\ldots,Kitalic_k = 1 , … , italic_K to obtain the following equation,

X=𝐃F(X)+𝐃ϵ+C,𝑋𝐃F𝑋𝐃italic-ϵ𝐶X=\mathbf{D}\mathrm{F}(X)+\mathbf{D}\epsilon+C,italic_X = bold_D roman_F ( italic_X ) + bold_D italic_ϵ + italic_C , (22)

where F(X)𝐹𝑋F(X)italic_F ( italic_X ) is defined in Section 2 and C=(C1,,CK)𝐶subscript𝐶1subscript𝐶𝐾C=(C_{1},\ldots,C_{K})italic_C = ( italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_C start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ). Thus we have, (id𝐃F)(X)=𝐃ϵ+Cid𝐃F𝑋𝐃italic-ϵ𝐶(\textrm{id}-\mathbf{D}\mathrm{F})(X)=\mathbf{D}\epsilon+C( id - bold_D roman_F ) ( italic_X ) = bold_D italic_ϵ + italic_C, this implies that X=(id𝐃F)1(𝐃ϵ+C)𝑋superscriptid𝐃F1𝐃italic-ϵ𝐶X=(\textrm{id}-\mathbf{D}\mathrm{F})^{-1}(\mathbf{D}\epsilon+C)italic_X = ( id - bold_D roman_F ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_D italic_ϵ + italic_C ). Let XpX(X)similar-tosubscript𝑋subscript𝑝subscript𝑋subscript𝑋X_{\mathcal{I}}\sim p_{X_{\mathcal{I}}}(X_{\mathcal{I}})italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ∼ italic_p start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ). By the probability rules, we can write (20) as:

pX(X)=pX(X)pϵ𝒪(ϵ𝒪)|detJ(id𝐃F)(X)|,subscript𝑝𝑋𝑋subscript𝑝subscript𝑋subscript𝑋subscript𝑝subscriptitalic-ϵ𝒪subscriptitalic-ϵ𝒪subscript𝐽id𝐃F𝑋p_{X}(X)=p_{X_{\mathcal{I}}}(X_{\mathcal{I}})\ p_{\epsilon_{\mathcal{O}}}(% \epsilon_{\mathcal{O}})\ \Big{|}\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)% \Big{|},italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_X ) = italic_p start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ) italic_p start_POSTSUBSCRIPT italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT ) | roman_det italic_J start_POSTSUBSCRIPT ( id - bold_D roman_F ) end_POSTSUBSCRIPT ( italic_X ) | , (23)

where ϵ𝒪subscriptitalic-ϵ𝒪\epsilon_{\mathcal{O}}italic_ϵ start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT is the exogenous noise terms of variables in X𝒪subscript𝑋𝒪X_{\mathcal{O}}italic_X start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT.

B.2 Precision Matrix under Interventions with MAR Missingness Mechanism

Consider the following linear SEM with Gaussian exogenous noise term,

X=𝑩X+ϵ,𝑋superscript𝑩top𝑋italic-ϵX=\bm{B}^{\top}X+\epsilon,italic_X = bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X + italic_ϵ , (24)

where 𝑩K×K𝑩superscript𝐾𝐾\bm{B}\in\mathbb{R}^{K\times K}bold_italic_B ∈ blackboard_R start_POSTSUPERSCRIPT italic_K × italic_K end_POSTSUPERSCRIPT is the weighted adjacency matrix of 𝒢𝒢\mathcal{G}caligraphic_G, and ϵ𝒩(0,𝚲1)similar-toitalic-ϵ𝒩0superscript𝚲1\epsilon\sim\mathcal{N}(0,\bm{\Lambda}^{-1})italic_ϵ ∼ caligraphic_N ( 0 , bold_Λ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ), and 𝚲=diag(1/σ12,,1/σK2)𝚲diag1subscriptsuperscript𝜎211superscriptsubscript𝜎𝐾2\bm{\Lambda}=\text{diag}(1/\sigma^{2}_{1},\ldots,1/\sigma_{K}^{2})bold_Λ = diag ( 1 / italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , 1 / italic_σ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) is the inverse covariance matrix of ϵitalic-ϵ\epsilonitalic_ϵ. Thus we have, X=(𝑰𝑩)1ϵ𝑋superscript𝑰superscript𝑩top1italic-ϵX=(\bm{I}-\bm{B}^{\top})^{-1}\epsilonitalic_X = ( bold_italic_I - bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_ϵ. From the properties of a Gaussian distribution, X𝑋Xitalic_X also follows Gaussian distribution with zero mean and covariance matrix given by,

𝔼[XX]=𝔼[(𝑰𝑩)1ϵϵ(𝑰𝑩)]=(𝑰𝑩)1𝔼[ϵϵ](𝑰𝑩)=(𝑰𝑩)1𝚲1(𝑰𝑩).𝔼delimited-[]𝑋superscript𝑋top𝔼delimited-[]superscript𝑰superscript𝑩top1italic-ϵsuperscriptitalic-ϵtopsuperscript𝑰superscript𝑩topabsenttopsuperscript𝑰superscript𝑩top1𝔼delimited-[]italic-ϵsuperscriptitalic-ϵtopsuperscript𝑰superscript𝑩topabsenttopsuperscript𝑰superscript𝑩top1superscript𝚲1superscript𝑰superscript𝑩topabsenttop\mathbb{E}[XX^{\top}]=\mathbb{E}\Big{[}(\bm{I}-\bm{B}^{\top})^{-1}\epsilon% \epsilon^{\top}(\bm{I}-\bm{B}^{\top})^{-\top}\Big{]}=(\bm{I}-\bm{B}^{\top})^{-% 1}\mathbb{E}[\epsilon\epsilon^{\top}](\bm{I}-\bm{B}^{\top})^{-\top}=(\bm{I}-% \bm{B}^{\top})^{-1}\bm{\Lambda}^{-1}(\bm{I}-\bm{B}^{\top})^{-\top}.blackboard_E [ italic_X italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] = blackboard_E [ ( bold_italic_I - bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_ϵ italic_ϵ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( bold_italic_I - bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - ⊤ end_POSTSUPERSCRIPT ] = ( bold_italic_I - bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT blackboard_E [ italic_ϵ italic_ϵ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] ( bold_italic_I - bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - ⊤ end_POSTSUPERSCRIPT = ( bold_italic_I - bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_Λ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_italic_I - bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - ⊤ end_POSTSUPERSCRIPT .

Therefore, the inverse covariance matrix of X𝑋Xitalic_X is given by 𝚲X=(𝑰𝑩)𝚲(𝑰𝑩)subscript𝚲𝑋𝑰𝑩𝚲𝑰superscript𝑩top\bm{\Lambda}_{X}=(\bm{I}-\bm{B})\bm{\Lambda}(\bm{I}-\bm{B}^{\top})bold_Λ start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT = ( bold_italic_I - bold_italic_B ) bold_Λ ( bold_italic_I - bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ). Given an interventional setting X=(X,X𝒪)𝑋subscript𝑋subscript𝑋𝒪X=(X_{\mathcal{I}},X_{\mathcal{O}})italic_X = ( italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT caligraphic_O end_POSTSUBSCRIPT ), we can write (24) as:

X=𝐃𝑩X+𝐃ϵ+C,𝑋𝐃superscript𝑩top𝑋𝐃italic-ϵ𝐶X=\mathbf{D}\bm{B}^{\top}X+\mathbf{D}\epsilon+C,italic_X = bold_D bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X + bold_D italic_ϵ + italic_C , (25)

where 𝐃𝐃\mathbf{D}bold_D is the intervention mask matrix, and C𝐶Citalic_C denotes the value of the intervened nodes. Hence, X=(𝑰𝐃𝑩)1(𝐃ϵ+C)𝑋superscript𝑰𝐃superscript𝑩top1𝐃italic-ϵ𝐶X=(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-1}(\mathbf{D}\epsilon+C)italic_X = ( bold_italic_I - bold_D bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_D italic_ϵ + italic_C ). The intervened nodes are sampled from standard normal distribution, i.e., X𝒩(0,𝑰)similar-tosubscript𝑋𝒩0𝑰X_{\mathcal{I}}\sim\mathcal{N}(0,\bm{I})italic_X start_POSTSUBSCRIPT caligraphic_I end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , bold_italic_I ). Again, X𝑋Xitalic_X is Gaussian distributed with zero mean and covariance matrix given by,

𝔼[XX]𝔼delimited-[]𝑋superscript𝑋top\displaystyle\mathbb{E}[XX^{\top}]blackboard_E [ italic_X italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] =(𝑰𝐃𝑩)1𝔼[(𝐃ϵ+C)(𝐃ϵ+C)](𝑰𝐃𝑩)absentsuperscript𝑰𝐃superscript𝑩top1𝔼delimited-[]𝐃italic-ϵ𝐶superscript𝐃italic-ϵ𝐶topsuperscript𝑰𝐃superscript𝑩topabsenttop\displaystyle=(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-1}\mathbb{E}\big{[}(\mathbf{D% }\epsilon+C)(\mathbf{D}\epsilon+C)^{\top}\big{]}(\bm{I}-\mathbf{D}\bm{B}^{\top% })^{-\top}= ( bold_italic_I - bold_D bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT blackboard_E [ ( bold_D italic_ϵ + italic_C ) ( bold_D italic_ϵ + italic_C ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] ( bold_italic_I - bold_D bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - ⊤ end_POSTSUPERSCRIPT
=(𝑰𝐃𝑩)1(𝐃𝚲1𝐃+(𝑰𝐃))(𝑰𝐃𝑩).absentsuperscript𝑰𝐃superscript𝑩top1𝐃superscript𝚲1𝐃𝑰𝐃superscript𝑰𝐃superscript𝑩topabsenttop\displaystyle=(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-1}\big{(}\mathbf{D}\bm{% \Lambda}^{-1}\mathbf{D}+(\bm{I}-\mathbf{D})\big{)}(\bm{I}-\mathbf{D}\bm{B}^{% \top})^{-\top}.= ( bold_italic_I - bold_D bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_D bold_Λ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_D + ( bold_italic_I - bold_D ) ) ( bold_italic_I - bold_D bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - ⊤ end_POSTSUPERSCRIPT .

Hence, the inverse covariance of X𝑋Xitalic_X under the interventional setting is given by,

𝚲X,=(𝑰𝐃𝑩)(𝐃𝚲1𝐃+(𝑰𝐃))1(𝑰𝐃𝑩).subscript𝚲𝑋𝑰𝐃𝑩superscript𝐃superscript𝚲1𝐃𝑰𝐃1𝑰𝐃superscript𝑩top\bm{\Lambda}_{X,\mathcal{I}}=(\bm{I}-\mathbf{D}\bm{B})\big{(}\mathbf{D}\bm{% \Lambda}^{-1}\mathbf{D}+(\bm{I}-\mathbf{D})\big{)}^{-1}(\bm{I}-\mathbf{D}\bm{B% }^{\top}).bold_Λ start_POSTSUBSCRIPT italic_X , caligraphic_I end_POSTSUBSCRIPT = ( bold_italic_I - bold_D bold_italic_B ) ( bold_D bold_Λ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_D + ( bold_italic_I - bold_D ) ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_italic_I - bold_D bold_italic_B start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) .

B.3 Convergence Analysis

Here we provide the convergence analysis of MissNODAG. Our analysis relies on the convergence of the EM algorithm (Wu,, 1983; Friedman,, 1998). The crux of the analysis depends on establishing that the total log-likelihood of the non-missing nodes in the data set either increases or stays the same in each iteration of the algorithm. That is,

i=1nlogpX(xΓi(i),r(i)|Θt+1)i=1nlogpX(xΓi(i),r(i)|Θt).superscriptsubscript𝑖1𝑛subscript𝑝𝑋superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖superscriptΘ𝑡1superscriptsubscript𝑖1𝑛subscript𝑝𝑋superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖superscriptΘ𝑡\sum_{i=1}^{n}\log p_{X}\big{(}x_{\Gamma_{i}}^{(i)},r^{(i)}|\Theta^{t+1}\big{)% }\geq\sum_{i=1}^{n}\log p_{X}\big{(}x_{\Gamma_{i}}^{(i)},r^{(i)}|\Theta^{t}% \big{)}.∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT ) ≥ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) . (26)

To that end, note that

i=1nlogpX(xΓi(i),r(i)|Θ)=i=1nlogpX(xΓi(i)xΩi(i),r(i)|Θ)i=1nlogpX(xΩi(i)|xΓi(i),r(i),Θ).superscriptsubscript𝑖1𝑛subscript𝑝𝑋superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖Θsuperscriptsubscript𝑖1𝑛subscript𝑝𝑋superscriptsubscript𝑥subscriptΓ𝑖𝑖superscriptsubscript𝑥subscriptΩ𝑖𝑖conditionalsuperscript𝑟𝑖Θsuperscriptsubscript𝑖1𝑛subscript𝑝𝑋conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖Θ\sum_{i=1}^{n}\log p_{X}(x_{\Gamma_{i}}^{(i)},r^{(i)}|\Theta)=\sum_{i=1}^{n}% \log p_{X}(x_{\Gamma_{i}}^{(i)}x_{\Omega_{i}}^{(i)},r^{(i)}|\Theta)-\sum_{i=1}% ^{n}\log p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta).∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ ) - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ ) .

Taking expectation with respect xΩi(i)|xΓi(i),r(i)conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)}italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT on both side, we get

i=1nlogpX(xΓi(i),r(i)|Θ)superscriptsubscript𝑖1𝑛subscript𝑝𝑋superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖Θ\displaystyle\sum_{i=1}^{n}\log p_{X}(x_{\Gamma_{i}}^{(i)},r^{(i)}|\Theta)∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ ) =i=1n𝔼xΩi(i)|xΓi(i),r(i),ΘtlogpX(xΓi(i),r(i)|Θ)absentsuperscriptsubscript𝑖1𝑛subscript𝔼conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖superscriptΘ𝑡subscript𝑝𝑋superscriptsubscript𝑥subscriptΓ𝑖𝑖conditionalsuperscript𝑟𝑖Θ\displaystyle=\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(% i)},r^{(i)},\Theta^{t}}\log p_{X}(x_{\Gamma_{i}}^{(i)},r^{(i)}|\Theta)= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ )
=i=1n𝔼xΩi(i)|xΓi(i);ΘtlogpX(xΓi(i)xΩi(i)|Θ)=Q(Θ|Θt)i=1n𝔼xΩi(i)|xΓi(i);ΘtlogpX(xΩi(i)|xΓi(i),Θ).absentsubscriptsuperscriptsubscript𝑖1𝑛subscript𝔼conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscriptΘ𝑡subscript𝑝𝑋conditionalsuperscriptsubscript𝑥subscriptΓ𝑖𝑖superscriptsubscript𝑥subscriptΩ𝑖𝑖Θabsent𝑄conditionalΘsuperscriptΘ𝑡superscriptsubscript𝑖1𝑛subscript𝔼conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscriptΘ𝑡subscript𝑝𝑋conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖Θ\displaystyle=\underbrace{\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}|x_{% \Gamma_{i}}^{(i)};\Theta^{t}}\log p_{X}(x_{\Gamma_{i}}^{(i)}x_{\Omega_{i}}^{(i% )}|\Theta)}_{=Q(\Theta|\Theta^{t})}-\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{% (i)}|x_{\Gamma_{i}}^{(i)};\Theta^{t}}\log p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma% _{i}}^{(i)},\Theta).= under⏟ start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ; roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | roman_Θ ) end_ARG start_POSTSUBSCRIPT = italic_Q ( roman_Θ | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ; roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ ) . (27)

The first term on the RHS in the above equation is nothing but Q(Θ|Θt)𝑄conditionalΘsuperscriptΘ𝑡Q(\Theta|\Theta^{t})italic_Q ( roman_Θ | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ). This is maximized in the M-step, i.e., Q(Θ|Θt+1)Q(Θ|Θt)𝑄conditionalΘsuperscriptΘ𝑡1𝑄conditionalΘsuperscriptΘ𝑡Q(\Theta|\Theta^{t+1})\geq Q(\Theta|\Theta^{t})italic_Q ( roman_Θ | roman_Θ start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT ) ≥ italic_Q ( roman_Θ | roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ). On the other hand,

i=1n𝔼xΩi(i)|xΓi(i),r(i);ΘtlogpX(xΩi(i)|xΓi(i),r(i),Θt+1)pX(xΩi(i)|xΓi(i),r(i),Θt)=DKL(pX(xΩi(i)|xΓi(i),r(i),Θt)pX(xΩi(i)|xΓi(i),r(i),Θt+1))0.\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)};% \Theta^{t}}\log\frac{p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},% \Theta^{t+1})}{p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^% {t})}=-D_{KL}\Big{(}p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},% \Theta^{t})\big{\|}p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},% \Theta^{t+1})\Big{)}\leq 0.∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ; roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_log divide start_ARG italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT ) end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG = - italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT ) ) ≤ 0 .

Thus,

i=1n𝔼xΩi(i)|xΓi(i),r(i);ΘtlogpX(xΩi(i)|xΓi(i),r(i),Θt+1)i=1n𝔼xΩi(i)|xΓi(i),r(i);ΘtlogpX(xΩi(i)|xΓi(i),r(i),Θt).superscriptsubscript𝑖1𝑛subscript𝔼conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖superscriptΘ𝑡subscript𝑝𝑋conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖superscriptΘ𝑡1superscriptsubscript𝑖1𝑛subscript𝔼conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖superscriptΘ𝑡subscript𝑝𝑋conditionalsuperscriptsubscript𝑥subscriptΩ𝑖𝑖superscriptsubscript𝑥subscriptΓ𝑖𝑖superscript𝑟𝑖superscriptΘ𝑡\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)};% \Theta^{t}}\log p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta% ^{t+1})\leq\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)}% ,r^{(i)};\Theta^{t}}\log p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i% )},\Theta^{t}).∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ; roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT ) ≤ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ; roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT roman_Ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | italic_x start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) . (28)

From combining (27) and (28), we can see that at the end of the M-step (26) is satisfied. Similar to the previous results on EM convergence (Wu,, 1983; Friedman,, 1998), MissNODAG reaches a stationary point of the optimization objective.

Appendix C LEARNING DAGS FROM INCOMPLETE OBSERVATIONAL DATA

When the causal graph associated with the target law pX(X)subscript𝑝𝑋𝑋p_{X}(X)italic_p start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_X ) is a Directed Acyclic Graph (DAG), it is sometimes possible to recover the true causal structure from purely observational data. For example, when the exogenous noise variable ϵitalic-ϵ\epsilonitalic_ϵ is Gaussian distributed with equal variance, i.e., ϵ𝒩(0,σ2𝑰)similar-toitalic-ϵ𝒩0superscript𝜎2𝑰\epsilon\sim\mathcal{N}(0,\sigma^{2}\bm{I})italic_ϵ ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_italic_I ), then the DAG is identifiable from observational data. This, however, requires constraining the search space in the optimization problem defined by (8) to the set of all DAGs. To achieve this, we leverage the trace-exponential constraint introduced by Zheng et al., (2018). Let 𝑴p(𝑴|θ)similar-to𝑴𝑝conditional𝑴𝜃\bm{M}\sim p(\bm{M}|\theta)bold_italic_M ∼ italic_p ( bold_italic_M | italic_θ ); by adding the following constraint to (8), we ensure that the resulting graph is acyclic:

𝔼𝑴p(𝑴|θ)[Tr(e𝑴𝑴)K]=0,subscript𝔼similar-to𝑴𝑝conditional𝑴𝜃delimited-[]Trsuperscript𝑒direct-product𝑴𝑴𝐾0\mathbb{E}_{\bm{M}\sim p(\bm{M}|\theta)}\Big{[}\text{Tr}\Big{(}e^{\bm{M}\odot% \bm{M}}\Big{)}-K\Big{]}=0,blackboard_E start_POSTSUBSCRIPT bold_italic_M ∼ italic_p ( bold_italic_M | italic_θ ) end_POSTSUBSCRIPT [ Tr ( italic_e start_POSTSUPERSCRIPT bold_italic_M ⊙ bold_italic_M end_POSTSUPERSCRIPT ) - italic_K ] = 0 , (29)

where e𝑨superscript𝑒𝑨e^{\bm{A}}italic_e start_POSTSUPERSCRIPT bold_italic_A end_POSTSUPERSCRIPT denotes the matrix exponential of 𝑨𝑨\bm{A}bold_italic_A, and direct-product\odot signifies the Hadamard product (elementwise multiplication) of two matrices. In the current formulation, the SEM function FF\mathrm{F}roman_F is constrained to be contractive. While this condition guarantees bijectivity of (idF)idF(\textrm{id}-\mathrm{F})( id - roman_F ) for cyclic graphs, it is overly restrictive for DAGs. Thus, inspired by Sethuraman et al., (2023), we introduce a preconditioning term ΛΛ\Lambdaroman_Λ to redefine FF\mathrm{F}roman_F. This modification enables the modeling of non-contractive parent-child dependencies while maintaining efficient computation of the log-determinant of the Jacobian. For more details, refer to Sethuraman et al., (2023). Thus, the maximization objective in (8) takes the following form:

Θt+1=argmaxΘQ(ΘΘt)λ1(θ)λ2(ϕ)+λDAG 𝔼𝑴p(𝑴|θ)[Tr(e𝑴𝑴)K].superscriptΘ𝑡1subscriptΘ𝑄conditionalΘsuperscriptΘ𝑡subscript𝜆1𝜃subscript𝜆2italic-ϕsubscript𝜆DAG subscript𝔼similar-to𝑴𝑝conditional𝑴𝜃delimited-[]Trsuperscript𝑒direct-product𝑴𝑴𝐾\Theta^{t+1}=\arg\max_{\Theta}\ Q(\Theta\mid\Theta^{t})-\lambda_{1}\mathcal{R}% (\theta)-\lambda_{2}\mathcal{R}(\phi)+\lambda_{\text{DAG }}\cdot\mathbb{E}_{% \bm{M}\sim p(\bm{M}|\theta)}\Big{[}\text{Tr}\Big{(}e^{\bm{M}\odot\bm{M}}\Big{)% }-K\Big{]}.roman_Θ start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT italic_Q ( roman_Θ ∣ roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_R ( italic_θ ) - italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT caligraphic_R ( italic_ϕ ) + italic_λ start_POSTSUBSCRIPT DAG end_POSTSUBSCRIPT ⋅ blackboard_E start_POSTSUBSCRIPT bold_italic_M ∼ italic_p ( bold_italic_M | italic_θ ) end_POSTSUBSCRIPT [ Tr ( italic_e start_POSTSUPERSCRIPT bold_italic_M ⊙ bold_italic_M end_POSTSUPERSCRIPT ) - italic_K ] .

Following Ng et al., (2020), this optimization can be solved using stochastic gradient based methods. We present results on learning DAGs from synthetic partially observed data in Appendix D.3.

Appendix D ADDITIONAL EXPERIMENTS

D.1 Target Law Recovery: Performance as a Function of Sample Size

We evaluated the performance of the target law recovery as a function of sample size. The average missingness probability was set to 0.20.20.20.2, and each missingness indicator was restricted to have at most three parents. The data was generated as described in Section 4. The results for the linear SEM are shown in Figure 5 and the results for the nonlinear SEM are shown in Figure 6.

In the linear setting (Figure 5), the performance of all methods improves as the number of training samples increases. MissNODAG performs comparably to NODAGS-Flow trained on complete data (no missing samples in the data set), even with significantly fewer samples than the baseline models. A similar trend is seen in the nonlinear case (Figure 6), where all models show improved results with larger sample sizes. For a lower edge density (ER-1),the models perform similarly. However, with a higher edge density (ER-2), the performance gap widens, particularly with smaller sample sizes.

Refer to caption
Figure 5: Results of target law recovery for linear SEM with varying training set sizes. The average missing probability was set to 0.2, and each Rksubscript𝑅𝑘R_{k}italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT has a parent set cardinality of 3.
Refer to caption
Figure 6: Results of target law recovery for nonlinear SEM with varying training set sizes. The average missing probability was set to 0.2, and each Rksubscript𝑅𝑘R_{k}italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT has a parent set cardinality of 3.

D.2 Target Law Recovery: Performance as a Function of Cardinalities for pa𝒢m(Rk)subscriptpasubscript𝒢𝑚subscript𝑅𝑘\textrm{pa}_{\mathcal{G}_{m}}(R_{k})pa start_POSTSUBSCRIPT caligraphic_G start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT )

We also evaluated target law recovery performance as a function of the parent set cardinality of the missingness indicators, which reflects the sparsity of the missingness mechanism. For this analysis, the training set size was fixed at 5000 samples. Results for the linear and nonlinear SEMs are shown in Figures 7 and 8, respectively.

In the linear case (Figure 7), the target law graph is perfectly recovered when the missingness probability is below 0.4. However, as the parent set cardinality increases, recovery performance declines. A similar trend is observed for the nonlinear SEM (Figure 8), where larger parent set cardinality also results in a noticeable drop in recovery performance.

Refer to caption
Figure 7: Results of target law recovery in linear SEM as the parent set cardinality of each Rksubscript𝑅𝑘R_{k}italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is varied.
Refer to caption
Figure 8: Results of target law recovery in nonlinear SEM as the parent set cardinality of each Rksubscript𝑅𝑘R_{k}italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is varied.

D.3 Target Law Recovery: Learning DAGs from Partially Observed Observational Data

Figures 9 and 10 present the results of learning DAGs from partially observed observational data. We followed the same procedure described in section 4 to generate the data, with the additional constraint that the resulting graphs are acyclic. The SEM functions used for both linear and nonlinear cases were designed to be non-contractive. Details on the learning methodolgy for DAGs is described in Appendix C. In both settings, MissNODAG outperforms the baselines, although its performance decreases compared to learning cyclic graphs from interventional data. We attribute this drop to the increased optimization complexity introduced by the DAG constraint.

Refer to caption
Figure 9: Results of target law recovery for linear SEM when the target factorizes according to a DAG, with MNAR mechanism where Rksubscript𝑅𝑘R_{k}italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT has a parent set cardinality of 3.
Refer to caption
Figure 10: Results of target law recovery for nonlinear SEM when the target factorizes according to a DAG, with MNAR mechanism where Rksubscript𝑅𝑘R_{k}italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT has a parent set cardinality of 3.

D.4 Data Application: Gene Perturbation

Here we present an experiment focused on learning causal graph structure corresponding to a gene regulator network from a gene expression data with genetic interventions. In particular, we focus on the Perturb-CITE-seq dataset (Frangieh et al.,, 2021), a type of data set that allows one to study causal relations in gene networks at an unprecedented scale. It contains gene expressions taken from 218,331 melanoma cells split into three cell conditions: (i) control (57,627 cells), (ii) co-culture (73,114 cells), and (iii) interferon (INF)-γ𝛾\gammaitalic_γ (87,590 cells).

Table 2: List of genes chosen from Perturb-CITE-seq dataset (Frangieh et al.,, 2021).
STAT1 B2M LGALS3 PTMA
SSR2 CTPS1 TM4SF1 MRPL47
DNMT1 TMED10

Due to the practical and computational constraints, of the approximately 20,000 genes in the genome, we restrict ourselves to a subset of 10 genes, following the experimental setup of Sethuraman et al., (2023), summarized in Table 2. We then took all the single-node interventions corresponding to the 10 genes. Each cell condition is treated as a separate data set over which the models were trained separately.

Since the data set does not provide a ground truth causal graph, it is not possible to directly compare the performance using SHD. Instead, we compare the performance of the causal discovery methods based on its predictive performance over unseen interventions. To that end, we perform a 90-10 split on the three data sets. The smaller set is treated as the test set, which is then used for performance comparison between MissNODAGS and the baselines. As a performance metric, we use the predicted negative log-likelihood (NLL) over the test set after training the models for 100 epochs.

Refer to caption
Figure 11: Predictive performance over unseen interventions on Perturb-CITE-seq Frangieh et al., (2021) data.

In all the three cell conditions, MissNODAG outperforms the baseline methods, with the difference being the most significant when the average missingness probability is low. MissNODAG matches the performance of NODAGS-Flow trained on clean data when the missingness probability is 0.1 and for the case of control cell condition, the performance of MissNODAG is comparable to clean data untill the missingness probability is 0.2.

Appendix E IMPLEMENTATION DETAILS

In this section, we describe the implementation details of the MissNODAG framework and the baseline models used for performance comparisons.

MissNODAG. We implemented our framework using the Pytorch library in Python and the code used in running the experiments can be found in the codes folder within the supplementary materials. The code for the proposed model can be found at https://github.com/muralikgs/missnodag.

Starting with an initialization of the model parameters Θ0superscriptΘ0\Theta^{0}roman_Θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT, we alternate between the E-step and M-step until the parameters converge. In the E-step, Algorithm 1 is used for imputing the missing data, followed by maximizing the expected likelihood of the non-missing nodes in the M-step. We follow the same setup as Sethuraman et al., (2023) for modeling the causal functions, i.e., neural networks (NN) along with dependency mask with entries parameterized by Gumbel-softmax distribution, and for computing the log-determinant of the Jacobian, i.e., power series expansion followed by Hutchinson trace estimator. Poisson distribution is used for psubscript𝑝p_{\mathbb{N}}italic_p start_POSTSUBSCRIPT blackboard_N end_POSTSUBSCRIPT for sampling the number of terms in the expansion to reduce the bias introduced while limiting the number of terms in the power series expansion of log-determinant of the Jacobian, see section 3.2. The final objective in the M-step is maximized using Adam optimizer (Kingma and Ba,, 2014).

The learning rate in all our experiments was set to 102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT. The neural network models used in our experiments contained one multi-layer perceptron layer. No nonlinearities were added to the neural networks for the linear SEM experiments. We used tanh activation for the nonlinear SEM experiments and ReLU activation for the experiments on the perturb-CITE-seq data set. The regularization constant λ𝜆\lambdaitalic_λ was set to 102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT for the synthetic experiments and 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT for the perturb-CITE-seq experiments. All experiments were performed on NVIDIA RTX6000 GPUs.

Baselines. For the baseline NODAGS-Flow, we modify the code base provided by Sethuraman et al., (2023) to use the imputed samples for maximizing the likelihood. The hyperparameters of NODAGS-Flow was set to the values described in the previous subsection.

MissForest imputation is performed using the publicly available python library missingpy. We use the codebase provided by Muzellec et al., (2020) for optimal transport imputation, and the codebase provided by the authors was used for MissDAG (Gao et al.,, 2022). The default parameters are used for Missforest, optimal transport imputation, and MissDAG. The codes for all the baselines can be found inside the codes folder in the supplementary materials.