
MissNODAG: Differentiable Learning of Cyclic Causal Graphs
from Incomplete Data

Muralikrishnna G. Sethuraman Razieh Nabi Faramarz Fekri

Georgia Institute of Technology Emory University Georgia Institute of Technology

Abstract

Causal discovery in real-world systems, such as biological networks, is often complicated by feedback loops and incomplete data. Standard algorithms, which assume acyclic structures or fully observed data, struggle with these challenges. To address this gap, we propose MissNODAG, a differentiable framework for learning both the underlying cyclic causal graph and the missingness mechanism from partially observed data, including data missing not at random. Our framework integrates an additive noise model with an expectation-maximization procedure, alternating between imputing missing values and optimizing the observed data likelihood, to uncover both the cyclic structures and the missingness mechanism. We demonstrate the effectiveness of MissNODAG through synthetic experiments and an application to real-world gene perturbation data.

1 INTRODUCTION

Causal discovery, the process of identifying causal relationships from data, is fundamental across scientific domains such as biology, economics, and medicine (Spirtes et al., 2000; Sachs et al., 2005; Zhang et al., 2013; Segal et al., 2005; Imbens and Rubin, 2015). Understanding these relationships is crucial for predicting how systems respond to interventions, enabling informed decision-making in complex systems (Solus et al., 2017; Sulik et al., 2017; Sethuraman et al., 2023). Traditionally, causal relationships are modeled using directed graphs, where nodes represent variables and directed edges encode cause-effect relationships.

Existing causal discovery methods are typically divided into two main categories: constraint-based and score-based approaches. Constraint-based methods, such as the PC algorithm (Spirtes et al., 2000; Triantafillou and Tsamardinos, 2015; Heinze-Deml et al., 2018), infer the causal structure by enforcing conditional independencies observed in the data, though they often struggle with scalability due to the large number of required conditional independence tests. Score-based methods, such as the GES algorithm (Meek, 1997; Hauser and Bühlmann, 2012), optimize a penalized score function, like the Bayesian Information Criterion, over the space of candidate graphs, usually employing greedy search techniques. There also exist hybrid methods, which combine elements of both approaches, leveraging conditional independence tests alongside score optimization (Tsamardinos et al., 2006; Solus et al., 2017; Wang et al., 2017). Recent advances have introduced differentiable discovery methods, such as the NOTEARS algorithm (Zheng et al., 2018), which frames learning of a directed acyclic graph (DAG) as a continuous optimization problem, enabling scalable and efficient solutions via gradient-based methods. Following NOTEARS, several extensions have been developed for learning DAGs under various assumptions in observational settings (Yu et al., 2019; Ng et al., 2020, 2022; Zheng et al., 2020; Lee et al., 2019).

Most causal discovery methods make one or both of the following assumptions: (i) the data is fully observed, and (ii) the underlying graph is acyclic. However, real-world systems often violate these assumptions. Biological systems, such as gene regulatory networks, and socio-economic processes frequently exhibit feedback loops (cycles) (Freimer et al., 2022), while missing data is common in practical applications (Getzen et al., 2023). These complexities significantly limit the applicability of standard causal discovery methods.

Missing data mechanisms are classified into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Little and Rubin, 2019). One common approach to dealing with missing data involves discarding incomplete samples or excluding variables with missing data (Carter, 2006; Van den Broeck et al., 2015; Strobl et al., 2018), which is only suitable when data is MCAR or when the missingness rate is negligible. Otherwise, performance degrades as the missingness rate increases. Another common approach is imputation, where the missing data is first imputed before a causal discovery algorithm is applied to the completed data. Notable imputation algorithms include multivariate imputation by chained equations (MICE) (White et al., 2011), MissForest (Stekhoven and Bühlmann, 2012), optimal transport (Muzellec et al., 2020), and a few deep-learning-based approaches (Li et al., 2019; Luo et al., 2018). However, imputation-based methods typically assume that data is MAR, which can introduce bias when data is actually MNAR, a common occurrence in practice (Singh, 1997; Wang et al., 2020; Kyono et al., 2021; Gao et al., 2022). Research addressing MNAR includes Gain and Shpitser (2018), who use reweighted observed cases as input to the PC algorithm alongside a weighted correlation matrix. Additionally, Tu et al. (2019) extend the PC algorithm by incorporating corrections to account for both MAR and certain cases of MNAR, while also learning the underlying missingness mechanisms.

Furthermore, while the acyclicity assumption simplifies computations by factorizing joint distributions into conditional densities, many real-world systems feature cyclic relationships. Feedback loops are common in both biological systems and socio-economic processes (Sachs et al., 2005; Freimer et al., 2022). Several approaches have been developed to relax the acyclicity assumption, allowing for cyclic causal graphs. For example, early work by Richardson (1996) extended the constraint-based framework to account for cycles, and Lacerda et al. (2008) generalized ICA-based causal discovery to linear non-Gaussian cyclic graphs. More recent score-based methods for learning cyclic graphs include Huetter and Rigollet (2020), Améndola et al. (2020), Mooij and Heskes (2013), and Drton et al. (2019). Additionally, methods such as those proposed by Hyttinen et al. (2012) and Huetter and Rigollet (2020) focus on learning cyclic graphs from interventional data. Sethuraman et al. (2023) further extended this line of work to nonlinear cyclic directed graphs, eliminating the need for augmented Lagrangian-based solvers by directly modeling the data likelihood.

Contributions. In this work, we address two major limitations in causal discovery: handling informative MNAR data and accommodating cyclic relationships. We introduce MissNODAG, a novel framework that adapts an expectation-maximization (EM) procedure to learn cyclic causal graphs from incomplete interventional data, applicable to both linear and nonlinear structural equation models as well as MCAR, MAR, or MNAR missingness models. MissNODAG alternates between imputing missing values and maximizing the expected log-likelihood of the observed data, building on the approach of Gao et al. (2022). Following Sethuraman et al. (2023) and Behrmann et al. (2019), we leverage residual normalizing flows to model the data likelihood. Through synthetic experiments, we show that MissNODAG outperforms state-of-the-art imputation techniques combined with causal learning on partially missing interventional data.

We organize the paper as follows. In Section 2, we describe the problem setup and outline the modeling assumptions. Section 3 introduces the proposed expectation-maximization-based MissNODAG framework. In Section 4, we validate MissNODAG on various synthetic datasets. Section 5 concludes the paper. All proofs are deferred to the appendix.

2 PROBLEM SETUP

A list of notations is provided in Appendix A.

Structural Equation Model. Let $\mathcal{G}(X)$ denote a possibly cyclic causal graph with a set of vertices $X=(X_{1},\ldots,X_{K})$, representing a vector of $K$ variables connected by directed edges. We will abbreviate $\mathcal{G}(X)$ as simply $\mathcal{G}$ when the vertex set is clear from the given context. We assume the following structural equation model (SEM) with additive noise terms to capture the functional relationships between variables in $\mathcal{G}$ (Bollen, 1989; Pearl, 2009a):

X_{k}=f_{k}\big(\textrm{pa}_{\mathcal{G}}(X_{k})\big)+\epsilon_{k},\quad k=1,\ldots,K,

(1)

where $\textrm{pa}_{\mathcal{G}}(X_{k})$ denotes the parents of $X_{k}$ in $\mathcal{G}$, that is, $\textrm{pa}_{\mathcal{G}}(X_{k})=\{X_{\ell}\in X:X_{\ell}\to X_{k}\}$. We assume that self-loops (edges of the form $X_{k}\to X_{k}$) are absent in $\mathcal{G}(X)$, as this could lead to model identifiability issues (Hyttinen et al., 2012). The function $f_{k}$ describes the relationship between $X_{k}$ and its parents, with $\epsilon_{k}$ as the exogenous noise term, assumed to be mutually independent (no unmeasured confounders), and collected as $\epsilon=(\epsilon_{1},\ldots,\epsilon_{K})$. Let $\mathrm{F}(X)$ collect the functions $f_{k}(\textrm{pa}_{\mathcal{G}}(X_{k}))$, for all $k$. The structural equations in (1) can be written as follows:

X=\mathrm{F}(X)+\epsilon.

(2)

Let id denote the identity map, so $(\textrm{id}-\mathrm{F})$ maps $X$ to $\epsilon$. We assume that this mapping is bijective, ensuring the existence of $(\textrm{id}-\mathrm{F})^{-1}$, and that both $(\textrm{id}-\mathrm{F})$ and $(\textrm{id}-\mathrm{F})^{-1}$ are differentiable. Bijectivity ensures that there is a unique $X$ for a given $\epsilon$; thus, we can express $X$ as $X=(\textrm{id}-\mathrm{F})^{-1}(\epsilon)$. This assumption is needed for our developments in Section 3.2, and is naturally satisfied when the underlying graph is acyclic (Mooij and Heskes, 2013; Sethuraman et al., 2023).
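As a minimal illustration of the bijection between $X$ and $\epsilon$, the sketch below uses a linear SEM, $\mathrm{F}(x)=\bm{B}^{\top}x$, on a hypothetical three-node cycle. The matrix `B` and the helper names are assumptions made for this example only.

```python
import numpy as np

# Hypothetical 3-node cycle X1 -> X2 -> X3 -> X1; B[j, k] is the weight of
# the edge X_j -> X_k, so the linear SEM function is F(x) = B.T @ x.
B = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, 0.4],
              [0.3, 0.0, 0.0]])

def noise_to_x(eps, B):
    """Apply (id - F)^{-1}: solve x = B.T @ x + eps for x."""
    return np.linalg.solve(np.eye(len(eps)) - B.T, eps)

def x_to_noise(x, B):
    """Apply (id - F): recover eps = x - B.T @ x."""
    return x - B.T @ x

rng = np.random.default_rng(0)
eps = rng.normal(size=3)
x = noise_to_x(eps, B)          # unique solution of the cyclic system
eps_back = x_to_noise(x, B)     # round trip returns the original noise
```

Here the inverse exists because $\bm{I}-\bm{B}^{\top}$ is nonsingular even though the graph contains a cycle.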

Intervention operations can be readily incorporated into (2). In this work, we focus exclusively on surgical, or hard, interventions (Spirtes et al., 2000; Pearl, 2009a). Graphically, this corresponds to removing all incoming edges to the intervened variable. Following the notational convention of Hyttinen et al. (2012) and Sethuraman et al. (2023), we can decompose $X$ into disjoint sets, $X=X_{\mathcal{I}}\cup X_{\mathcal{O}}$, where $X_{\mathcal{I}}$ represents the set of intervened variables in an interventional experiment, and $X_{\mathcal{O}}$ represents the set of purely observed variables. Let $\mathbf{D}\in\{0,1\}^{K\times K}$ be a diagonal matrix where $D_{kk}=1$ if $X_{k}\in X_{\mathcal{O}}$, and $0$ otherwise. Under this setting, the SEM in (2) is now modified to:

X=\mathbf{D}\mathrm{F}(X)+\mathbf{D}\epsilon+C,

(3)

where $C$ denotes a vector of size $K$ representing intervention assignments for variables in $X$ . Specifically, $C_{k}=X_{k}$ if $X_{k}\in X_{\mathcal{I}}$ , and $C_{k}=0$ otherwise. Let $\epsilon_{\mathcal{O}}$ denote the exogenous noise terms corresponding to variables in $X_{\mathcal{O}}$ . Let $p_{\epsilon_{\mathcal{O}}}(\epsilon_{\mathcal{O}})$ and ${p}_{X_{\mathcal{I}}}(X_{\mathcal{I}})$ be the joint probability densities of $\epsilon_{\mathcal{O}}$ and $X_{\mathcal{I}}$ , respectively. We thus have,

p_{X}(X)=p_{X_{\mathcal{I}}}(X_{\mathcal{I}})\ p_{\epsilon_{\mathcal{O}}}\big(\epsilon_{\mathcal{O}}\big)\ \big|\text{det}\ J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\big|,

(4)

where $\text{det}\,J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)$ denotes the determinant of the Jacobian matrix of the vector-valued function $(\textrm{id}-\mathbf{D}\mathrm{F})$ evaluated at $X$. See Appendix B.1 for a proof.
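The change-of-variables formula (4) can be checked numerically in the linear Gaussian case. The sketch below assumes $\mathrm{F}(x)=\bm{B}^{\top}x$, Gaussian noise with variances `sigma2`, and intervened variables drawn from $\mathcal{N}(0,1)$; the function names and the test point are illustrative, not part of the original method.

```python
import numpy as np

def log_normal(v, var):
    """Elementwise log-density of N(0, var) evaluated at v."""
    return -0.5 * (np.log(2 * np.pi * var) + v**2 / var)

def log_density_eq4(x, B, sigma2, observed):
    """log p_X(x) via (4) for F(x) = B.T @ x; `observed` is the diagonal of D."""
    K = len(x)
    obs = observed.astype(bool)
    D = np.diag(observed.astype(float))
    A = np.eye(K) - D @ B.T                        # Jacobian of (id - D F)
    eps = A @ x                                    # equals eps_O on observed nodes
    log_p_eps = log_normal(eps[obs], sigma2[obs]).sum()
    log_p_int = log_normal(x[~obs], 1.0).sum()     # assumes X_I ~ N(0, I)
    _, logabsdet = np.linalg.slogdet(A)
    return log_p_int + log_p_eps + logabsdet

def log_density_direct(x, B, sigma2, observed):
    """Reference value: X is Gaussian with covariance A^{-1} S A^{-T}."""
    K = len(x)
    D = np.diag(observed.astype(float))
    A_inv = np.linalg.inv(np.eye(K) - D @ B.T)
    S = D @ np.diag(sigma2) @ D + (np.eye(K) - D)  # covariance of D eps + C
    cov = A_inv @ S @ A_inv.T
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (K * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))

B = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, 0.4],
              [0.3, 0.0, 0.0]])
sigma2 = np.array([1.0, 0.5, 2.0])
observed = np.array([1, 0, 1])                     # X2 is intervened on
x = np.array([0.3, -1.1, 0.8])
lp_eq4 = log_density_eq4(x, B, sigma2, observed)
lp_direct = log_density_direct(x, B, sigma2, observed)
```

Both routes give the same log-density, which is what (4) asserts for this special case.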

Missing Data Model. Given sampled data on $X$ , let $R=(R_{1},\dots,R_{K})$ be the vector of binary missingness indicators with $R_{k}=1$ if $X_{k}$ is observed and $R_{k}=0$ if $X_{k}$ is missing. We only observe a coarsened version of $X_{k}$ , denoted by $Y_{k}$ , which is deterministically defined as $Y_{k}=X_{k}$ when $R_{k}=1$ , and $Y_{k}=\,\,?$ if $R_{k}=0$ . Let $Y=(Y_{1},\ldots,Y_{K})$ denote the coarsened variables. Additionally, we have access to $S=(S_{1},\ldots,S_{K})$ where $S_{k}$ is a binary indicator of intervention, such that $S_{k}=0$ if $X_{k}$ is intervened on (i.e., $X_{k}\in X_{\mathcal{I}}$ ), and $S_{k}=1$ otherwise. We assume we have $n$ i.i.d. copies of $(Y,R,S)$ , and the dataset is denoted by $\mathcal{D}=\{y^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n}$ , where $y^{(i)},r^{(i)},s^{(i)}$ represent the $i$ -th observed values of $Y,R,S$ .

We define a missing data model as the collection of distributions over the variables $X,R,Y$. By the chain rule of probability, we can express $p(X,R,Y)$ as $p(X)p(R|X)p(Y|X,R)$. We refer to $p(X)$ as the target law, $p(R|X)$ as the missingness mechanism, $p(X,R)$ as the full law, while $p(Y|X,R)$ is the coarsening mechanism, which is deterministically defined. Borrowing ideas from the graphical models of missing data (Mohan et al., 2013; Nabi et al., 2022), we use graphs to encode assumptions about $p(X,R,Y)$. Specifically, we assume that the relations between variables in the target law $p(X)$ are directed and can include cycles, and the missingness mechanism $p(R|X)$ factorizes according to a DAG, where $\textrm{pa}_{\mathcal{G}}(R_{k})$ can only be a subset of $X$ and $R\setminus R_{k}$. Finally, due to deterministic relations, $Y_{k}$ has only two parents, $R_{k}$ and $X_{k}$. We denote these graphs by $\mathcal{G}_{m}(V)$, where $V=(X,R,Y)$. Two examples of missing data graphs (or $m$-graphs), with $K=3$ substantive variables, are provided in Figure 1; deterministic relations are drawn in gray to distinguish them from probabilistic ones.

Figure 1: Example $m$-graphs with three variables illustrating: (a) An MNAR mechanism considered in our MissNODAG framework; (b) An MNAR mechanism where $R$s are connected and the full law is identifiable.

Graphically, an MCAR mechanism has no incoming edges into any missingness indicator in $R$, a MAR mechanism has parents of missingness indicators that are fully observed, and an MNAR mechanism involves missingness indicators with parents in $X$. Identifying the full or target law in an $m$-graph with MNAR mechanisms from observational data is not always possible. Previous work has extensively studied identification in graphical models of missing data (Bhattacharya et al., 2020; Nabi et al., 2020; Mohan and Pearl, 2021; Nabi et al., 2022). Nabi et al. (2020) have shown that the full law in an $m$-graph is identified if and only if there are no edges of the form $X_{k}\to R_{k}$ (no self-censoring) and $X_{j}\to R_{k}\leftarrow R_{j}$ (no colluders).
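The graphical criterion above is easy to check mechanically. The following sketch assumes the $m$-graph is given as two hypothetical boolean matrices, `x_to_r[j][k]` for an edge $X_{j}\to R_{k}$ and `r_to_r[j][k]` for an edge $R_{j}\to R_{k}$; this encoding is an assumption of the example.

```python
import numpy as np

def full_law_identified(x_to_r, r_to_r):
    """Check the no-self-censoring and no-colluder conditions.

    x_to_r[j, k] is True for an edge X_j -> R_k,
    r_to_r[j, k] is True for an edge R_j -> R_k.
    """
    x_to_r = np.asarray(x_to_r, dtype=bool)
    r_to_r = np.asarray(r_to_r, dtype=bool)
    self_censoring = bool(np.any(np.diag(x_to_r)))   # some X_k -> R_k
    colluder = bool(np.any(x_to_r & r_to_r))         # some X_j -> R_k <- R_j
    return not (self_censoring or colluder)

no_edges = np.zeros((3, 3), dtype=bool)

# Every R_k has the other two X variables as parents and no R parents.
block_parallel = ~np.eye(3, dtype=bool)

# Adding X_1 -> R_1 introduces self-censoring.
self_censored = block_parallel.copy()
self_censored[0, 0] = True

# X_1 -> R_2 together with R_1 -> R_2 forms a colluder.
r_edges = no_edges.copy()
r_edges[0, 1] = True
```

For instance, `full_law_identified(block_parallel, no_edges)` holds, while either modification breaks the condition.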

Given partially observed data from a set of interventional experiments, our objective is to learn the underlying full law that generated the sample. Specifically, this involves learning both the underlying target law and the missingness mechanism.

3 THE MissNODAG FRAMEWORK

We assume the target law $p(X)$ and the missingness mechanism $p(R|X)$ are parameterized by finite vectors $\theta$ and $\phi$ , respectively. Thus, we write the full law $p(X,R)$ as $p(X,R|\theta,\phi)=p(X|\theta)\ p(R|X,\phi).$ In order to learn the full law, we proceed by maximizing the log-likelihood of the observed data law.

Let $\Gamma_{i}=\{k:r_{k}^{(i)}=1\}$ and $\Omega_{i}=\{k:r_{k}^{(i)}=0\}$ represent the sets of indices for the variables that are observed and missing, respectively, in the $i$ -th sample; thus $y^{(i)}=x_{\Gamma_{i}}^{(i)}$ . Consequently, the observed data law $p\big{(}x_{\Gamma_{i}}^{(i)},r^{(i)}\big{)}$ can be written down as

p\big(x_{\Gamma_{i}}^{(i)},r^{(i)}\mid\theta,\phi\big)=\int p\big(x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^{(i)},r^{(i)}\mid\theta,\phi\big)\ dx_{\Omega_{i}}^{(i)}.

(5)

This integration is generally intractable due to marginalization over missing variables and the lack of a closed-form solution. We address this intractability in maximizing the observed data likelihood when data is generated from an $m$ -graph (Section 2). First, we provide an overview of the MissNODAG framework, followed by details on computing the log-likelihood for missing and observed variables, and discuss imputing the missing variables under linear and nonlinear SEMs.

3.1 The Overall Procedure

Assuming the data $\mathcal{D}=\{y^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n}$ is generated from an experiment where the full law $p(X,R)$ is represented via $m$-graphs, our goal is to learn the entire $m$-graph structure by maximizing $\tilde{\mathcal{L}}(\mathcal{D},\theta,\phi)$, a regularized log-likelihood of the observed data law:

\max_{\theta,\phi}\ \sum_{i=1}^{n}\log p\big(x_{\Gamma_{i}}^{(i)},r^{(i)}|\theta,\phi\big)-\lambda_{1}\mathcal{R}(\theta)-\lambda_{2}\mathcal{R}(\phi),

(6)

where $\mathcal{R}(\cdot)$ is a regularization function that promotes sparsity and $\lambda_{1},\lambda_{2}$ are regularization parameters.

As discussed, computing $p(x_{\Gamma_{i}}^{(i)},r^{(i)}|\theta,\phi)$ is generally intractable, with no closed-form solution. However, (6) can be solved using the iterative penalized expectation-maximization (EM) method (Chen et al., 2014). Unlike imputation methods that directly sample missing values, the EM algorithm starts with an initial parameter $\Theta^{0}=(\theta^{0},\phi^{0})$ and alternates between the following two steps at iteration $t$ until convergence:

E-step: Use the current estimates of the model parameters, $\Theta^{t}=(\theta^{t},\phi^{t})$ , and the non-missing data to compute the expected log-likelihood of the full data, denoted by $Q(\Theta|\Theta^{t})$ , and given by:

\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}\sim p(\cdot|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t})}\Big[\log p\big(x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^{(i)},r^{(i)}\mid\Theta\big)\Big].

(7)

M-step: Maximize $Q(\Theta|\Theta^{t})$ , computed in the E-step, with respect to $\Theta$ :

\Theta^{t+1}=\arg\max_{\Theta}\ Q(\Theta\mid\Theta^{t})-\lambda_{1}\mathcal{R}(\theta)-\lambda_{2}\mathcal{R}(\phi).

(8)

We use stochastic gradient-based solvers to solve the maximization problem, alternating between updating the parameters of the target law, $\theta$ , and the parameters of the missingness mechanism, $\phi$ . Note that,

\sum_{i=1}^{n}\log p(x_{\Gamma_{i}}^{(i)},r^{(i)}\mid\Theta)\geq Q(\Theta\mid\Theta^{t})-\text{const}.

(9)

This inequality indicates that we maximize a lower bound of the log-likelihood as we update the parameters during the M-step. More details on the derivation of (9) and on the convergence analysis of MissNODAG are provided in Appendix B.3.
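One standard way to see why a bound of the form (9) holds is sketched below for a single sample, with the sample index dropped. The symbols $q$ and $H$ are introduced only for this sketch, and the argument is the usual EM lower bound rather than necessarily the exact route taken in Appendix B.3.

```latex
\begin{aligned}
\log p(x_{\Gamma}, r \mid \Theta)
  &= \mathbb{E}_{q}\big[\log p(x_{\Gamma}, x_{\Omega}, r \mid \Theta)\big]
   - \mathbb{E}_{q}\big[\log p(x_{\Omega} \mid x_{\Gamma}, r, \Theta)\big],
   \qquad q(x_{\Omega}) = p(x_{\Omega} \mid x_{\Gamma}, r, \Theta^{t}), \\
  &\geq \mathbb{E}_{q}\big[\log p(x_{\Gamma}, x_{\Omega}, r \mid \Theta)\big] + H(q),
   \qquad H(q) = -\mathbb{E}_{q}\big[\log q(x_{\Omega})\big].
\end{aligned}
```

The first line takes the expectation of the identity $\log p(x_{\Gamma},r\mid\Theta)=\log p(x_{\Gamma},x_{\Omega},r\mid\Theta)-\log p(x_{\Omega}\mid x_{\Gamma},r,\Theta)$ under $q$; the second uses Gibbs' inequality, $-\mathbb{E}_{q}[\log p(\cdot\mid\Theta)]\geq-\mathbb{E}_{q}[\log q]$. Summing over samples gives (9), with the constant equal to the negative sum of the entropies $H(q)$, which does not depend on $\Theta$.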

Identifiability and data requirements. For a family of interventions, maximizing (8) with complete data (assuming no missingness) is equivalent to maximizing the data likelihood. In the linear SEM with infinite samples, Sethuraman et al. (2023) showed that the recovered graph belongs to the same interventional quasi-equivalence class (Ghassami et al., 2020) as the ground truth. With single-node interventions for all nodes, the equivalence class collapses to the true causal graph (Hyttinen et al., 2012). In the presence of missing data, the EM algorithm does not directly optimize the observed data likelihood, limiting convergence to a stationary point. However, our experiments indicate that MissNODAG performs similarly to NODAGS-Flow trained on complete data (assuming no missingness), especially when single-node interventions cover all nodes and the missing probability is low (see Section 4). If the target law factorizes according to a DAG, identifiability from observational data is possible with modifications; see Appendix C.

3.2 Computational Details of the E-step

Computing the expected log-likelihood in the E-step can be challenging for a directed (cyclic) graph. First, consider $\log p\big(x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^{(i)},r^{(i)}|\Theta\big)$ in (7), which equals:

\log p_{X}\big(x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^{(i)}|\theta\big)+\log p\big(r^{(i)}|x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)},\phi\big).

(10)

3.2.1 Target Law, $\log p_{X}\big{(}X\mid\theta\big{)}$

As per (4), computing $\log p_{X}(X|\theta)$ in (10) requires:

\log\big{|}\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\big{|}.

(11)

In the worst case, the Jacobian matrix may require gradient calls in the order of $K^{2}$. To that end, following Sethuraman et al. (2023), we employ contractive residual flows (Behrmann et al., 2019; Chen et al., 2019) to compute the log-determinant of the Jacobian in a tractable manner.

Modeling the SEM in (2). We assume the SEM functions in $\mathrm{F}(X)$ in (2) are Lipschitz with Lipschitz constant less than one. Such functions are called contractive functions. Since $\mathbf{D}$ only zeroes out coordinates, $\mathbf{D}\mathrm{F}$ is then contractive as well, and it follows from the Banach fixed point theorem (Rudin, 1953) that the mapping $(\textrm{id}-\mathbf{D}\mathrm{F})$ is invertible.

We use neural networks to model the contractive functions in $\mathrm{F}(X)$. As shown by Behrmann et al. (2019), neural networks can be constrained to be contractive during the training phase by rescaling the layer weights by their corresponding spectral norm. While contractivity is a sufficient condition for the existence of $(\textrm{id}-\mathrm{F})^{-1}$, it is not necessary. When the underlying graph governing the target law $p(X)$ is a DAG, it is possible to have non-contractive functions in $\mathrm{F}(X)$ for which $(\textrm{id}-\mathrm{F})^{-1}$ exists; see Sethuraman et al. (2023) for more details.
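To make the role of contractivity concrete, the sketch below rescales a single weight matrix so that its spectral norm stays below a target `c` and then inverts $(\textrm{id}-\mathrm{F})$ by fixed-point iteration. It is a toy single-layer example in the spirit of the spectral rescaling described above; the function names, the target `c=0.9`, and the `tanh` layer are assumptions of the example, and the layer ignores the self-loop masking discussed next.

```python
import numpy as np

def rescale_to_contraction(W, c=0.9):
    """Scale W so that its spectral norm is at most c < 1."""
    spectral_norm = np.linalg.norm(W, 2)            # largest singular value
    return W * min(1.0, c / max(spectral_norm, 1e-12))

def invert_by_fixed_point(F, eps, n_iter=300):
    """Solve x = F(x) + eps by iterating x <- F(x) + eps (Banach iteration)."""
    x = eps.copy()
    for _ in range(n_iter):
        x = F(x) + eps
    return x

rng = np.random.default_rng(0)
W = rescale_to_contraction(rng.normal(size=(4, 4)))

def F(x):
    # tanh is 1-Lipschitz, so F inherits the Lipschitz constant of W (<= 0.9).
    return np.tanh(W @ x)

eps = rng.normal(size=4)
x = invert_by_fixed_point(F, eps)                   # x = (id - F)^{-1}(eps)
```

The iteration converges geometrically at rate `c`, so a smaller target norm trades expressiveness for faster inversion.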

A naive implementation of a neural network may not produce promising results, as it may introduce self-loops. To circumvent this issue and simultaneously add sparsity penalization, we introduce a dependency mask matrix $\bm{M}\in\{0,1\}^{K\times K}$ to explicitly encode the dependencies between the nodes in the graph, with zero diagonal entries to mask out the self-loops. Thus the SEM model $\mathrm{F}(X)$ takes the following form

[\mathrm{F}_{\theta}(X)]_{k}=[\text{NN}_{\theta}(M_{\ast,k}\odot X)]_{k},

(12)

where $\text{NN}_{\theta}$ denotes a fully connected neural network function with parameters $\theta$, $M_{\ast,k}$ denotes the $k$-th column of $\bm{M}$, and $\odot$ denotes the Hadamard product. The dependency mask is sampled from a Gumbel-softmax distribution (Jang et al., 2016), $\bm{M}\sim p(\bm{M}|\theta)$, and the parameters $\theta$ are updated during training (M-step). In this case, the sparsity penalty $\mathcal{R}(\theta)$ in (6) is set as an L1 norm, i.e., $\mathcal{R}(\theta)=\mathbb{E}_{\bm{M}\sim p(\cdot|\theta)}\|\bm{M}\|_{1}$.
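The effect of the mask in (12) can be illustrated with a small sketch. It assumes a fixed binary mask `M` instead of a Gumbel-softmax sample and a one-hidden-layer `tanh` network with hypothetical weights `W1`, `W2`; a finite-difference Jacobian then confirms that $[\mathrm{F}(x)]_{k}$ depends only on the parents selected by the $k$-th column of the mask.

```python
import numpy as np

def masked_sem(x, M, W1, W2):
    """[F(x)]_k = [NN(M[:, k] * x)]_k for a one-hidden-layer tanh network."""
    K = len(x)
    out = np.empty(K)
    for k in range(K):
        hidden = np.tanh(W1 @ (M[:, k] * x))   # only parents of X_k enter
        out[k] = W2[k] @ hidden                # k-th output unit
    return out

def jacobian_fd(f, x, h=1e-5):
    """Central finite-difference Jacobian, J[k, j] = d f_k / d x_j."""
    K = len(x)
    J = np.empty((K, K))
    for j in range(K):
        e = np.zeros(K)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

rng = np.random.default_rng(0)
K, H = 4, 8
# M[j, k] = 1 encodes an edge X_j -> X_k; the diagonal is zero (no self-loops).
M = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]])
W1, W2 = rng.normal(size=(H, K)), rng.normal(size=(K, H))
x = rng.normal(size=K)
J = jacobian_fd(lambda v: masked_sem(v, M, W1, W2), x)
```

Entries of `J` are exactly zero wherever the mask has no edge, including the whole diagonal.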

Computing the log-determinant in (11). We note that $\log\big|\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\big|=\log\big|\det(\bm{I}-\mathbf{D}J_{\mathrm{F}})(X)\big|$, where $\bm{I}$ is the $K\times K$ identity matrix. Thus, the log-determinant of the Jacobian matrix can be computed using an unbiased estimator based on the power series expansion introduced by Behrmann et al. (2019),

\log\Big|\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\Big|=-\sum_{m=1}^{\infty}\frac{1}{m}\text{Tr}\Big\{J^{m}_{\mathbf{D}\mathrm{F}}(X)\Big\},

(13)

where $J_{\mathbf{D}\mathrm{F}}^{m}(X)$ denotes the Jacobian matrix to the $m$-th power. The series in (13) is guaranteed to converge when $\mathrm{F}(X)$ is contractive (Hall, 2013). In practice, (13) is computed by truncating the number of terms in the summation to a finite number. This, however, introduces bias in estimating the log-determinant of the Jacobian. In order to circumvent this issue, we follow the steps taken by Chen et al. (2019). The power series expansion is truncated at a random cut-off $N\sim p_{\mathbb{N}}(N)$, where $p_{\mathbb{N}}$ is a probability distribution over the set of natural numbers $\mathbb{N}$. Each term in the finite power series is then re-weighted to obtain the following estimator:

\log\big|\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\big|=-\mathbb{E}_{N}\Bigg[\sum_{m=1}^{N}\frac{\text{Tr}\big\{J_{\mathbf{D}\mathrm{F}}^{m}(X)\big\}}{m\cdot P_{\mathbb{N}}(\ell\geq m)}\Bigg],

(14)

where $P_{\mathbb{N}}(\ell\geq m)$ denotes the tail probability that a draw $\ell\sim p_{\mathbb{N}}$ is at least $m$. Gradient calls are still required in the order of $K$. We can reduce this further by using the Hutchinson trace estimator (Hutchinson, 1989), where $W\sim\mathcal{N}(0,\bm{I})$:

\text{Tr}\Big\{J_{\mathbf{D}\mathrm{F}}^{m}(X)\Big\}=\mathbb{E}_{W}\Big[W^{\top}J_{\mathbf{D}\mathrm{F}}^{m}(X)W\Big].

(15)

We note that the remainder of (4) can be efficiently computed with a forward pass of the neural network.
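The estimators in (13)-(15) can be compared on an explicit matrix. The sketch below treats a fixed contractive matrix `J` as the Jacobian $J_{\mathbf{D}\mathrm{F}}(X)$ and assumes a geometric distribution for the random cut-off $N$; both choices, and the function names, are assumptions of the example. In a neural implementation the products `J @ v` would be replaced by vector-Jacobian products.

```python
import numpy as np

def logdet_truncated(J, n_terms):
    """Deterministic truncation of (13): -sum_{m=1}^{n_terms} Tr(J^m) / m."""
    total, J_pow = 0.0, np.eye(J.shape[0])
    for m in range(1, n_terms + 1):
        J_pow = J_pow @ J
        total -= np.trace(J_pow) / m
    return total

def logdet_stochastic(J, rng, p_geom=0.5):
    """One draw of the estimator in (14)-(15) with N ~ Geometric(p_geom)."""
    N = rng.geometric(p_geom)                  # support {1, 2, ...}
    w = rng.normal(size=J.shape[0])            # Hutchinson probe, W ~ N(0, I)
    v = w.copy()
    total = 0.0
    for m in range(1, N + 1):
        v = J @ v                              # v = J^m w, matrix-vector only
        tail = (1.0 - p_geom) ** (m - 1)       # P(N >= m) for the geometric law
        total -= (w @ v) / (m * tail)
    return total

rng = np.random.default_rng(0)
J = rng.normal(size=(5, 5))
J *= 0.3 / np.linalg.norm(J, 2)                # enforce spectral norm 0.3
exact = np.linalg.slogdet(np.eye(5) - J)[1]    # reference log|det(I - J)|
truncated = logdet_truncated(J, 40)
estimate = np.mean([logdet_stochastic(J, rng) for _ in range(20000)])
```

The truncated series matches the exact value to numerical precision here, while the randomized estimator matches it only on average, which is the trade-off between bias and variance described above.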

3.2.2 Calculating Expectation via Imputation

Combining (4), (7), (14), and (15), we finally get

\begin{aligned}Q(\Theta|\Theta^{t})\propto\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)};\Theta^{t}}\Bigg\{&\log p\Big(r^{(i)}|x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)},\phi\Big)+\log p_{\epsilon_{\mathcal{O}}}\Big(\epsilon_{\mathcal{O}_{i}}^{(i)}\Big)\\&-\mathbb{E}_{N,W}\Bigg[\sum_{m=1}^{N}\frac{W^{\top}J_{\mathbf{D}\mathrm{F}}^{m}(x^{(i)})W}{m\cdot P_{\mathbb{N}}(\ell\geq m)}\Bigg]\Bigg\},\end{aligned}

where $\epsilon^{(i)}=(\textrm{id}-\mathbf{D}_{i}\mathrm{F})(x^{(i)})$ , $\mathbf{D}_{i}$ is the diagonal matrix corresponding to the interventional mask for the $i$ -th sample, i.e., $\mathbf{D}_{i}=\text{diag}(s_{1}^{(i)},\ldots,s_{K}^{(i)})$ .

The expectation in the approximation for $Q(\Theta|\Theta^{t})$ generally lacks a closed-form solution. Therefore, it must be approximated by the sample mean, using samples drawn from the posterior distribution $p(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t})$. This presents two main challenges: (i) the posterior distribution may not have a closed form, and (ii) direct sampling may be infeasible even when the posterior distribution can be evaluated. The difficulty arises due to the presence of nonlinear relations in $\mathrm{F}(X)$ and the missingness mechanism $p(r^{(i)}|x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)},\phi^{t})$, which may preclude straightforward sampling. Therefore, we employ rejection sampling (Koller and Friedman, 2009) to draw samples from a proposal distribution $\mathcal{Q}(x_{\Omega_{i}}^{(i)})$, from which samples can be readily generated.

To that end, a constant $c_{0}>0$ is chosen such that $c_{0}\mathcal{Q}(x_{\Omega_{i}}^{(i)})\geq p(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t})$ for all $i=1,\ldots,n$. However, as stated earlier, the posterior distribution is not readily available. Thus, from Bayes' rule, we have

p(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t})=\frac{p(x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)}|\theta^{t})p(r^{(i)}|x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)},\phi^{t})}{p(x_{\Gamma_{i}}^{(i)},r^{(i)}|\Theta^{t})},

where the denominator $p(x_{\Gamma_{i}}^{(i)},r^{(i)}|\Theta^{t})$ can be evaluated using fully observed data. The first term in the numerator can be computed efficiently, as discussed in Section 3.2.1. The second term is addressed in Section 3.2.3. Before that, we explain how these calculations simplify under more restrictive models. The imputation procedure is summarized in Algorithm 1.

Algorithm 1 Impute-Rejection

Input: Minibatch data $\mathcal{B}=\{y^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n_{B}}$, with sampling distribution $\mathcal{Q}(x)$
Output: Imputed data $\tilde{\mathcal{B}}=\{\widehat{x}^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n_{B}}$
1: for $i=1$ to $n_{B}$ do
2:     Sample $\tilde{x}_{\Omega_{i}}^{(i)}\sim\mathcal{Q}(x_{\Omega_{i}})$
3:     Pick $u\sim U[0,1]$
4:     if $u\leq\frac{p(\tilde{x}_{\Omega_{i}}^{(i)},y_{\Gamma_{i}}^{(i)}|\theta^{t})p(r^{(i)}|\tilde{x}_{\Omega_{i}}^{(i)},y_{\Gamma_{i}}^{(i)},\phi^{t})}{c_{0}\mathcal{Q}(\tilde{x}_{\Omega_{i}}^{(i)})}$ then
5:         Accept sample: $\widehat{x}_{\Omega_{i}}^{(i)}=\tilde{x}_{\Omega_{i}}^{(i)};\quad\widehat{x}_{\Gamma_{i}}^{(i)}=y_{\Gamma_{i}}^{(i)}$
6:     end if
7: end for
8: return $\tilde{\mathcal{B}}=\{\widehat{x}^{(i)},r^{(i)},s^{(i)}\}_{i=1}^{n_{B}}$
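The accept/reject test in Algorithm 1 can be sketched on a one-dimensional toy posterior. The example assumes an unnormalized target $\mathcal{N}(x;0,1)\cdot\text{expit}(2x)$ (a standard normal target law times a logistic missingness likelihood), a standard normal proposal, and $c_{0}=1$; all of these are illustrative choices. Unlike the single proposal per sample written in Algorithm 1, this sketch retries until a proposal is accepted.

```python
import numpy as np

def rejection_sample(log_target, sample_q, log_q, log_c0, rng, max_tries=10_000):
    """Draw one sample from the density proportional to exp(log_target)."""
    for _ in range(max_tries):
        x = sample_q(rng)
        u = 1.0 - rng.uniform()                # uniform on (0, 1], avoids log(0)
        if np.log(u) <= log_target(x) - log_c0 - log_q(x):
            return x                           # accept the proposal
    raise RuntimeError("no proposal accepted; check c0 and the proposal")

def log_std_normal(x):
    return -0.5 * (x**2 + np.log(2 * np.pi))

def log_target(x):
    # Toy unnormalized posterior: N(x; 0, 1) * expit(2x).
    return log_std_normal(x) - np.logaddexp(0.0, -2.0 * x)

rng = np.random.default_rng(0)
draws = np.array([
    rejection_sample(log_target, lambda g: g.normal(), log_std_normal, 0.0, rng)
    for _ in range(2000)
])
```

Because the proposal equals the target-law factor, the acceptance probability reduces to the missingness likelihood, so the accepted draws are shifted toward values that make the observed missingness pattern more likely.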

Linear SEMs with MAR mechanisms. When the underlying SEM is linear with additive Gaussian noise and the missingness mechanism is MAR, it is possible to sample from the exact posterior. Under this setup, we can ignore the missingness mechanism and solely focus on the target law. Consider $X=\bm{B}^{\top}X+\epsilon$, where $\epsilon\sim\mathcal{N}(0,\bm{\Lambda}^{-1})$, $\bm{\Lambda}=\text{diag}(1/\sigma_{1}^{2},\ldots,1/\sigma_{K}^{2})$, and $\bm{B}$ is the weighted adjacency matrix of $\mathcal{G}$. $X$ is also Gaussian distributed, with $X\sim\mathcal{N}(\bm{0},\bm{\Lambda}_{X}^{-1})$ where

\bm{\Lambda}_{X}=(\bm{I}-\bm{B})\bm{\Lambda}(\bm{I}-\bm{B}^{\top}).

(16)

Thus, based on the properties of the Gaussian distribution, the posterior distribution of the missing nodes, conditioned on the observed nodes, also follows a Gaussian distribution (Koller and Friedman, 2009), i.e., $x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)}\sim\mathcal{N}\big(\tilde{\mu}^{(i)},(\tilde{\bm{\Lambda}}^{(i)})^{-1}\big)$ with precision matrix $\tilde{\bm{\Lambda}}^{(i)}=[\bm{\Lambda}_{X}]_{\Omega_{i},\Omega_{i}}$ and $\tilde{\eta}^{(i)}=-[\bm{\Lambda}_{X}]_{\Omega_{i},\Gamma_{i}}x_{\Gamma_{i}}^{(i)}$, where $\tilde{\eta}^{(i)}=\tilde{\bm{\Lambda}}^{(i)}\tilde{\mu}^{(i)}$.
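A minimal sketch of this exact imputation step is given below, assuming the standard information-form conditioning identities for a zero-mean Gaussian with precision matrix $\bm{\Lambda}_{X}$ from (16). The graph weights, noise variances, and index sets are hypothetical values chosen for the example.

```python
import numpy as np

def conditional_from_precision(Lambda_X, x_obs, obs_idx, mis_idx):
    """Mean and covariance of x_mis | x_obs for X ~ N(0, Lambda_X^{-1})."""
    L_mm = Lambda_X[np.ix_(mis_idx, mis_idx)]   # precision of the conditional
    L_mo = Lambda_X[np.ix_(mis_idx, obs_idx)]
    eta = -L_mo @ x_obs                          # information vector
    cov = np.linalg.inv(L_mm)
    return cov @ eta, cov

B = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, 0.4],
              [0.3, 0.0, 0.0]])
sigma2 = np.array([1.0, 0.5, 2.0])
I = np.eye(3)
Lambda_X = (I - B) @ np.diag(1.0 / sigma2) @ (I - B.T)   # equation (16)

obs_idx, mis_idx = [0, 2], [1]                   # X2 is missing in this sample
x_obs = np.array([0.7, -1.2])
mu, cov = conditional_from_precision(Lambda_X, x_obs, obs_idx, mis_idx)

rng = np.random.default_rng(0)
x_mis = rng.multivariate_normal(mu, cov)         # one imputed value for X2
```

Working in precision form avoids inverting the full covariance: only the block indexed by the missing coordinates is inverted.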

The conditional expectation in (7) is estimated by imputing the missing values for each data instance by sampling from the Gaussian distribution described above. In an interventional setting, $X=(X_{\mathcal{I}},X_{\mathcal{O}})$ with $X_{\mathcal{I}}\sim\mathcal{N}(\bm{0},\bm{I})$ , (16) can be modified as follows

\bm{\Lambda}_{X,\mathcal{I}}=(\bm{I}-\bm{B}\mathbf{D})(\mathbf{D}\bm{\Lambda}^{-1}\mathbf{D}+(\bm{I}-\mathbf{D}))^{-1}(\bm{I}-\mathbf{D}\bm{B}^{\top}),

(17)

where $\bm{\Lambda}_{X,\mathcal{I}}$ denotes the inverse covariance matrix of $X$ ; see Appendix B.2 for a proof. The rest of the imputation procedure remains the same, except for using $\bm{\Lambda}_{X,\mathcal{I}}$ in place of $\bm{\Lambda}_{X}$ in the Gaussian distribution.

3.2.3 Missingness Mechanism, $\log p\big{(}R\mid X,\phi\big{)}$

In order to compute the log-likelihood in the E-step, we also need to compute $\log p\big{(}r^{(i)}|x_{\Omega_{i}}^{(i)},x_{\Gamma_{i}}^{(i)},\phi\big{)}$ , which according to $m$ -graphs can be factorized as

p(R\mid X,\phi)=\prod_{k=1}^{K}p\big(R_{k}\mid\text{pa}_{\mathcal{G}_{m}}(R_{k}),\phi_{k}\big).

(18)

In developing MissNODAG, we focus on a class of MNAR models, where for any $R_{k}\in R$, $\textrm{pa}_{\mathcal{G}_{m}}(R_{k})\subseteq X\setminus X_{k}$, i.e., there are no self-censoring edges and no connections between the missingness indicators in $R$. This implies $R_{k}\perp X_{k},R_{-k}|\textrm{pa}_{\mathcal{G}_{m}}(R_{k})$, for all $R_{k}\in R$, based on the Markov properties (Pearl, 2009b). Here, $R_{-k}$ denotes the set $R$ excluding $R_{k}$. This MNAR class is known as the block-parallel MNAR model (Mohan et al., 2013; Nabi et al., 2022). Figure 1(a) illustrates this MNAR mechanism with $K=3$ variables.

Under this MNAR class, $p(R|X,\phi)$ in (18) is identified as a function of the observed data by identifying each conditional factor via $p(R_{k}|\textrm{pa}_{\mathcal{G}_{m}}(R_{k}),R^{*}_{-k}=1,\phi_{k})$ , where $R^{*}_{-k}$ denotes the missingness indicators for $\textrm{pa}_{\mathcal{G}_{m}}(R_{k})$ . Each conditional factor is modeled using the expit function, with $\phi_{k}=\{w_{k},z_{k}\}$ :

p\big(R_{k}=0\mid\text{pa}_{\mathcal{G}_{m}}(R_{k}),\phi_{k}\big)=\text{expit}\left(w_{k}^{\top}X+z_{k}\right).

(19)

Maximizing the M-step objective with respect to $\phi$ reduces to solving a sparsity-regularized logistic regression, with our choice of $\mathcal{R}(\phi)=\sum_{k=1}^{K}\|w_{k}\|_{1}$.
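For a single indicator $R_{k}$, the resulting objective can be written down directly. The sketch below assumes fully populated rows of `X` (for example, after imputation) and uses hypothetical names; it evaluates the L1-penalized negative log-likelihood implied by (19) and leaves the choice of optimizer open.

```python
import numpy as np

def expit(a):
    """Logistic function 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def missingness_objective(w, z, X, r_k, lam):
    """Penalized negative log-likelihood for one indicator R_k under (19)."""
    p_missing = expit(X @ w + z)                       # P(R_k = 0 | X)
    loglik = np.where(r_k == 0,
                      np.log(p_missing),               # entry is missing
                      np.log1p(-p_missing)).sum()      # entry is observed
    return -loglik + lam * np.abs(w).sum()             # L1 sparsity penalty

X = np.array([[ 0.2, -1.0,  0.5],
              [ 1.3,  0.4, -0.7],
              [-0.6,  0.9,  0.1],
              [ 0.0, -0.3,  1.1]])
r_k = np.array([0, 1, 1, 0])
w, z = np.array([1.0, -1.0, 0.0]), 0.2
value = missingness_objective(w, z, X, r_k, lam=0.5)
```

Coordinates of `w` driven to zero by the penalty correspond to variables that are not parents of $R_{k}$ in the learned $m$-graph.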

Block-parallel MNAR is well-suited for modeling missingness mechanisms in cross-sectional studies, surveys, or retrospective studies. For example, consider a study analyzing the relationship between smoking ($X_{1}$), tar accumulation in the lungs ($X_{2}$), and a bronchitis diagnosis ($X_{3}$), where missing entries in the data are indicated by $R_{1}$, $R_{2}$, and $R_{3}$. Under the block-parallel MNAR model, $X_{2}\rightarrow R_{1}\leftarrow X_{3}$ suggests that whether an individual's smoking status is recorded depends on their tar accumulation and bronchitis diagnosis. This could happen when smoking history is only inquired after a suspected diagnosis of tar accumulation and bronchitis. Similarly, $X_{3}\rightarrow R_{2}$ implies that a suspected bronchitis diagnosis might prompt inquiries into tar accumulation. Additionally, smokers are more likely to be tested for both tar accumulation and bronchitis, represented by $R_{2}\leftarrow X_{1}\rightarrow R_{3}$, and detecting high tar levels in the lungs may trigger bronchitis testing, reflected by $X_{2}\rightarrow R_{3}$. Another use case of the block-parallel model is discussed in the context of the prisoner's dilemma by Nabi et al. (2022). This model was also explored in the study by Tu et al. (2019).

Extensions to other MNAR models. Identifiability is critical for the EM algorithm since, without it, multiple parameter sets may yield the same likelihood, preventing EM from converging to a unique solution. Beyond block-parallel MNAR, other identifiable MNAR models, such as those with no colluders (Nabi et al.,, 2020; see Section 2 and the example in Figure 1(b)), offer viable directions for extending our framework. To learn the full law $p(X,R)$ , additional constraints are required to ensure that $p(R|X)$ forms a DAG. Specifically, we can introduce a constraint on the adjacency matrix $\bm{W}\in\mathbb{R}^{K\times K}$ representing edges between missingness indicators, restricting the search space to DAGs using $\text{Tr}(e^{\bm{W}\odot\bm{W}})-K=0$ (Zheng et al.,, 2018). However, this alone does not eliminate colluders, which will require further refinement, a topic we leave for future work. If the goal is to learn only the target law and an identifiable MNAR mechanism is fixed, the EM algorithm can be simplified. In this case, after learning the parameters of the fixed missingness mechanism (e.g., through weighted estimating equations (Seaman and White,, 2013; Bhattacharya et al.,, 2020)), the optimization in (8) focuses only on the target law parameters, $\theta$ , simplifying the process even for non-block-parallel MNAR models.
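The trace-exponential constraint mentioned above can be evaluated numerically as in the following sketch. The helper names `trace_expm` and `acyclicity` and the two example matrices are hypothetical; the matrix exponential is approximated by a truncated power series to keep the example dependency-free, whereas a library routine such as `scipy.linalg.expm` would normally be used.

```python
import numpy as np

def trace_expm(a, n_terms=40):
    """Tr(exp(a)) via the truncated power series sum_j Tr(a^j) / j!."""
    term = np.eye(a.shape[0])
    total = np.trace(term)
    for j in range(1, n_terms):
        term = term @ a / j          # term now equals a^j / j!
        total += np.trace(term)
    return total

def acyclicity(w):
    """h(W) = Tr(exp(W * W)) - K, which is zero iff the graph of W is acyclic."""
    return trace_expm(w * w) - w.shape[0]

w_dag = np.array([[0.0, 0.8, 0.0],
                  [0.0, 0.0, -0.5],
                  [0.0, 0.0, 0.0]])   # edges 1 -> 2 -> 3, no cycle
w_cyc = w_dag.copy()
w_cyc[2, 0] = 0.7                     # adds 3 -> 1, which closes a cycle

h_dag = acyclicity(w_dag)
h_cyc = acyclicity(w_cyc)
```

Here `h_dag` vanishes because every power of a strictly upper-triangular matrix has zero trace, while `h_cyc` is strictly positive because the three-node cycle contributes to $\text{Tr}\big((\bm{W}\odot\bm{W})^{3}\big)$.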

4 EXPERIMENTS

The code for MissNODAG is available at the repository: https://github.com/muralikgs/missnodag.

Using synthetic data, we compared MissNODAG to MissDAG (missdag) (Gao et al.,, 2022), an EM-based causal discovery method limited to DAGs and MAR missingness. We also tested state-of-the-art imputation methods, including optimal transport (optransport) (Muzellec et al.,, 2020) and MissForest (missforest) (Stekhoven and Bühlmann,, 2012), followed by causal graph learning from the imputed data using NODAGS-Flow (Sethuraman et al.,, 2023). See Appendix E for details on the implementation of MissNODAG and the baselines.

In all experiments, we generated cyclic directed graphs with $K=10$ nodes using the Erdős-Rényi (ER) model, varying edge density between 1 and 2 (ER-1 and ER-2). Self-loops were removed, and the $m$ -graph described in Section 3.2.3 was used to create partially missing data. MissNODAG’s initial estimate for missingness parameters was obtained via logistic regression on complete cases in the data set (samples with no missing values), then refined through the EM procedure. We evaluated MissNODAG against baselines on linear and nonlinear SEMs with MNAR and MAR mechanisms.
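A minimal sketch of how partially missing data could be generated under a block-parallel $m$-graph is given below. The function name `sample_block_parallel_mask`, the weight range, the intercept, and the cap of three parents per indicator are assumptions made for illustration and are not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(t):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def sample_block_parallel_mask(x, max_parents=3, weight_scale=1.0,
                               intercept=-1.5, rng=rng):
    """Draw missingness indicators R (1 = observed, 0 = missing).

    Each R_k depends on `max_parents` variables X_j with j != k, and
    p(R_k = 0 | parents) = expit(w_k^T x + z_k).
    """
    n, k = x.shape
    w = np.zeros((k, k))                       # row w[node] is the vector w_node
    for node in range(k):
        # Block-parallel MNAR has no self-censoring edge X_k -> R_k.
        candidates = [j for j in range(k) if j != node]
        parents = rng.choice(candidates, size=max_parents, replace=False)
        w[node, parents] = rng.uniform(-weight_scale, weight_scale,
                                       size=max_parents)
    p_missing = expit(x @ w.T + intercept)     # shape (n, k)
    r = (rng.uniform(size=(n, k)) >= p_missing).astype(int)
    y = np.where(r == 1, x, np.nan)            # coarsened data: NaN marks missing
    return y, r, w

x_full = rng.standard_normal((1000, 10))
y_obs, r_mask, w_miss = sample_block_parallel_mask(x_full)
```

Changing `intercept` shifts the average missingness probability, which is one simple way to sweep the missing rate in a simulation.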

Linear SEM with MNAR Mechanism. The edge coefficients of the causal graph were sampled from a uniform distribution over $(-0.6,-0.25)\cup(0.25,0.6)$ , and rescaled for contractivity. The data consisted of single-node interventions with $n=500$ per intervention. Each intervened node was sampled from a standard normal distribution, and $\epsilon\sim\mathcal{N}(0,0.25^{2}\bm{I})$ . We assumed the intervened node was never missing.
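The data-generating process described above can be sketched as follows. The edge probability used for ER-1, the spectral-norm rescaling used to obtain contractivity, and the helper name `sample_intervention` are assumptions; the text does not specify how the rescaling was performed.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_per_intervention, noise_std = 10, 500, 0.25

# ER graph with roughly K directed edges in expectation (assumed meaning of ER-1).
adj = rng.uniform(size=(K, K)) < 1.0 / (K - 1)
np.fill_diagonal(adj, False)                   # remove self-loops

# Edge weights from (-0.6, -0.25) U (0.25, 0.6); B[i, j] != 0 means i -> j.
magnitude = rng.uniform(0.25, 0.6, size=(K, K))
sign = rng.choice([-1.0, 1.0], size=(K, K))
B = adj * magnitude * sign

# Assumed rescaling: shrink B so that its spectral norm is at most 0.9.
spec = np.linalg.norm(B, 2)
if spec >= 0.9:
    B *= 0.9 / spec

def sample_intervention(target):
    """Sample X = (I - D B^T)^{-1} (D eps + C) for a single-node intervention."""
    d = np.ones(K)
    d[target] = 0.0                            # mask out the intervened node
    D = np.diag(d)
    eps = noise_std * rng.standard_normal((n_per_intervention, K))
    C = np.zeros((n_per_intervention, K))
    C[:, target] = rng.standard_normal(n_per_intervention)
    A = np.eye(K) - D @ B.T
    X = np.linalg.solve(A, (eps @ D + C).T).T  # one row per sample
    return X, C

X, C = sample_intervention(target=3)
```

Because the row of $\mathbf{D}\bm{B}^{\top}$ belonging to the intervened node is zero, that node simply takes its sampled value, while the remaining nodes settle at the equilibrium of the (possibly cyclic) system.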

We evaluated MissNODAG and the baselines using the Structural Hamming Distance (SHD), which counts the number of operations (deletion, addition, reversal) needed to match the estimated graph to the true one. All models were trained for 100 training epochs, with missingness probabilities varying from $0.1$ to $0.5$ , and experiments repeated on 10 random graphs. Figure 2(a) shows the SHD between the true and the estimated graph, focusing on the target law recovery (edges between $X_{k}$ ’s). MissNODAG outperformed all baselines for ER-1 graphs. When the missing rate is below 0.4, it matched the performance of NODAGS-Flow on complete data (assuming no missingness). A similar trend was observed for ER-2 graphs, further demonstrating MissNODAG’s efficacy. Figure 2 also shows the recovery performance of NODAGS-Flow on complete data (nodags+clean) for comparison.
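For concreteness, the SHD between two directed graphs can be computed as in the sketch below. The function name is hypothetical, and the treatment of 2-cycles (each direction of a 2-cycle is counted as a separate edge) is an assumed convention that the text does not specify.

```python
import numpy as np

def structural_hamming_distance(a_true, a_est):
    """Count edge additions, deletions and reversals between two directed graphs.

    Entry [i, j] != 0 means an edge i -> j. A reversed edge counts once.
    """
    a_true = (np.asarray(a_true) != 0).astype(int)
    a_est = (np.asarray(a_est) != 0).astype(int)
    diff = np.abs(a_true - a_est)
    k = a_true.shape[0]
    shd = int(np.trace(diff))                  # mismatched self-loops, if any
    for i in range(k):
        for j in range(i + 1, k):
            mismatches = diff[i, j] + diff[j, i]
            # Both directions differ and the true graph has exactly one of them:
            # the estimate contains the same edge with the opposite orientation.
            is_reversal = mismatches == 2 and a_true[i, j] != a_true[j, i]
            shd += 1 if is_reversal else int(mismatches)
    return shd

a_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])   # 0 -> 1 -> 2
a_est = np.array([[0, 0, 0], [1, 0, 1], [0, 1, 0]])    # 1 -> 0 and 1 <-> 2
shd_example = structural_hamming_distance(a_true, a_est)
```

In this example the estimate reverses the edge between nodes 0 and 1 and adds the edge from node 2 to node 1, so two operations are needed to recover the true graph.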

Nonlinear SEM with MNAR Mechanism. The causal function for the nonlinear SEM is defined as $\mathrm{F}(X)=\tanh(\bm{W}^{\top}X).$ Non-zero entries of $\bm{W}\in\mathbb{R}^{K\times K}$ are sampled from a uniform distribution and scaled to make $\mathrm{F}$ contractive, with a Lipschitz constant of 0.9. The dataset includes single-node interventions with sample size $n=500$ , and noise terms are Gaussian.
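Because $\mathrm{F}$ is contractive, samples from the nonlinear SEM can be obtained by fixed-point iteration, as the sketch below illustrates for purely observational data. The weight range, the noise scale, the edge probability, and the number of iterations are assumptions; scaling $\bm{W}$ to spectral norm 0.9 is one way to obtain the stated Lipschitz bound, since tanh is 1-Lipschitz.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 10, 500

mask = rng.uniform(size=(K, K)) < 1.0 / (K - 1)
np.fill_diagonal(mask, False)                        # no self-loops
W = mask * rng.uniform(-1.0, 1.0, size=(K, K))       # assumed weight range
W *= 0.9 / max(np.linalg.norm(W, 2), 1e-12)          # spectral norm 0.9

def F(x):
    """Row-wise F(x) = tanh(W^T x) for a batch x of shape (n, K)."""
    return np.tanh(x @ W)

eps = 0.25 * rng.standard_normal((n, K))             # assumed noise scale
X = np.zeros((n, K))
for _ in range(300):
    X = F(X) + eps                                   # Banach fixed-point iteration

residual = np.max(np.abs(X - F(X) - eps))            # ~0 when X = F(X) + eps holds
```

The iteration error shrinks by at least a factor of 0.9 per step, so a few hundred iterations bring the residual down to floating-point precision.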

We used neural networks with a single hidden layer and tanh activation to learn the causal function $\mathrm{F}(X)$ . The objective function (8) was optimized using the Adam optimizer (Kingma and Ba,, 2014). All models were trained for 100 epochs, with the average missingness probability ranging from $0.1$ to $0.5$ . Figure 2(b) shows the mean SHD over 10 trials, where MissNODAG consistently outperforms all baselines at both edge densities, with the performance gap widening as edge density increases.

Learning Missingness Mechanism. In addition to learning the target law, MissNODAG also learns the underlying missingness mechanism, as detailed in Section 3.2.3. Figure 3 shows the error in recovering the $m$ -graph edges corresponding to the missingness mechanism (i.e., $X$ to $R$ edges), using SHD as the metric. We evaluated the performance on training sets with 5000 and 10000 samples. Each missingness indicator is restricted to have at most 3 parents. As seen in Figure 3, when the average missingness probability is low ( $<0.3$ ), MissNODAG perfectly recovers the missingness mechanism with 10000 samples, while SHD is below 2.5 with 5000 samples. Recovery performance decreases as the missingness probability increases.

Roles of Sample Size and Missingness Sparsity. We conducted additional simulations to examine MissNODAG’s performance on target law recovery as functions of sample size and missingness mechanism sparsity (controlled by the parent set cardinality of the missingness indicators). Due to space constraints, we defer these analyses to Appendices D.1 and D.2.

Linear SEM and MAR Mechanism. For MAR missingness and linear SEMs, missing values can be imputed by sampling from the exact posterior distribution of missing variables given the observed ones, as detailed in Section 3.2.2. Results are summarized in Figure 4. For ER-1, MissNODAG outperforms all baselines and matches NODAGS-Flow on clean data when the missing probability is below 0.4. For ER-2, overall performance slightly decreases, but MissNODAG continues to outperform all baselines.

DAG Learning from Observational Data. We conducted experiments on learning DAGs from partially observed observational data with MNAR missingness. See Appendix D.3 for details.

Real-data Application. We tested MissNODAG on a gene regulatory network with gene expression data from genetic interventions (Frangieh et al.,, 2021). See Appendix D.4 for details.

5 DISCUSSION

In this work, we proposed MissNODAG, a novel differentiable causal discovery framework capable of learning nonlinear cyclic causal graphs along with the missingness mechanism from incomplete interventional data. The framework employs an expectation-maximization algorithm that alternates between imputing missing values and optimizing model parameters. We demonstrated how imputation can be efficiently achieved using rejection sampling, and in the case of linear SEMs with MAR missingness, by directly sampling from the posterior distribution. One of the key strengths of MissNODAG is its ability to handle cyclic directed graphs and MNAR missingness, a significant advancement over methods that typically focus on DAGs and MAR mechanisms.

Future research directions include: (1) incorporating realistic measurement noise models in the SEMs to enhance robustness in real-world datasets, as explored in linear DAG models by Saeed et al., (2020); (2) scaling the current framework to larger graphs using low-rank models and variational inference techniques, as explored in acyclic structures by Lopez et al., (2022); (3) allowing for unobserved confounders within the modeling assumptions, as was explored in the case of latent DAGs with complete observations by Bhattacharya et al., (2021); and (4) generalizing our framework to broader classes of identifiable MNAR models, while tackling the challenges of non-identifiability in the full or target laws, as explored in DAG models by Nabi and Bhattacharya, (2023) and Guo et al., (2023).

References

Améndola et al., (2020) Améndola, C., Dettling, P., Drton, M., Onori, F., and Wu, J. (2020). Structure learning for cyclic linear causal models. In Conference on Uncertainty in Artificial Intelligence, pages 999–1008. PMLR.
Behrmann et al., (2019) Behrmann, J., Grathwohl, W., Chen, R. T., Duvenaud, D., and Jacobsen, J.-H. (2019). Invertible residual networks. In International Conference on Machine Learning, pages 573–582. PMLR.
Bhattacharya et al., (2020) Bhattacharya, R., Nabi, R., Shpitser, I., and Robins, J. M. (2020). Identification in missing data models represented by directed acyclic graphs. In Uncertainty in artificial intelligence, pages 1149–1158. PMLR.
Bhattacharya et al., (2021) Bhattacharya, R., Nagarajan, T., Malinsky, D., and Shpitser, I. (2021). Differentiable causal discovery under unmeasured confounding. In International Conference on Artificial Intelligence and Statistics, pages 2314–2322. PMLR.
Bollen, (1989) Bollen, K. A. (1989). Structural equations with latent variables, volume 210. John Wiley & Sons.
Carter, (2006) Carter, R. L. (2006). Solutions for missing data in structural equation modeling. Research & Practice in Assessment, 1:4–7.
Chen et al., (2014) Chen, L. S., Prentice, R. L., and Wang, P. (2014). A penalized em algorithm incorporating missing data mechanism for gaussian parameter estimation. Biometrics, 70(2):312–322.
Chen et al., (2019) Chen, R. T., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. (2019). Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 32.
Drton et al., (2019) Drton, M., Fox, C., and Wang, Y. S. (2019). Computation of maximum likelihood estimates in cyclic structural equation models. The Annals of Statistics, 47(2):663 – 690.
Frangieh et al., (2021) Frangieh, C. J., Melms, J. C., Thakore, P. I., Geiger-Schuller, K. R., Ho, P., Luoma, A. M., Cleary, B., Jerby-Arnon, L., Malu, S., Cuoco, M. S., et al. (2021). Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion. Nature genetics, 53(3):332–341.
Freimer et al., (2022) Freimer, J. W., Shaked, O., Naqvi, S., Sinnott-Armstrong, N., Kathiria, A., Garrido, C. M., Chen, A. F., Cortez, J. T., Greenleaf, W. J., Pritchard, J. K., and Marson, A. (2022). Systematic discovery and perturbation of regulatory genes in human T cells reveals the architecture of immune networks. Nature Genetics, pages 1–12.
Friedman, (1998) Friedman, N. (1998). The bayesian structural em algorithm. In Conference on Uncertainty in Artificial Intelligence.
Gain and Shpitser, (2018) Gain, A. and Shpitser, I. (2018). Structure learning under missing data. In International conference on probabilistic graphical models, pages 121–132. PMLR.
Gao et al., (2022) Gao, E., Ng, I., Gong, M., Shen, L., Huang, W., Liu, T., Zhang, K., and Bondell, H. (2022). Missdag: Causal discovery in the presence of missing data with continuous additive noise models. Advances in Neural Information Processing Systems, 35:5024–5038.
Getzen et al., (2023) Getzen, E., Ungar, L., Mowery, D., Jiang, X., and Long, Q. (2023). Mining for equitable health: Assessing the impact of missing data in electronic health records. Journal of biomedical informatics, 139:104269.
Ghassami et al., (2020) Ghassami, A., Yang, A., Kiyavash, N., and Zhang, K. (2020). Characterizing distribution equivalence and structure learning for cyclic and acyclic directed graphs. In International Conference on Machine Learning, pages 3494–3504. PMLR.
Guo et al., (2023) Guo, A., Zhao, J., and Nabi, R. (2023). Sufficient identification conditions and semiparametric estimation under missing not at random mechanisms. In Uncertainty in Artificial Intelligence, pages 777–787. PMLR.
Hall, (2013) Hall, B. C. (2013). Lie Groups, Lie Algebras, and Representations, pages 333–366. Springer New York, New York, NY.
Hauser and Bühlmann, (2012) Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409–2464.
Heinze-Deml et al., (2018) Heinze-Deml, C., Peters, J., and Meinshausen, N. (2018). Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2).
Huetter and Rigollet, (2020) Huetter, J.-C. and Rigollet, P. (2020). Estimation rates for sparse linear cyclic causal models. In Peters, J. and Sontag, D., editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pages 1169–1178. PMLR.
Hutchinson, (1989) Hutchinson, M. F. (1989). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076.
Hyttinen et al., (2012) Hyttinen, A., Eberhardt, F., and Hoyer, P. O. (2012). Learning linear cyclic causal models with latent variables. The Journal of Machine Learning Research, 13(1):3387–3439.
Imbens and Rubin, (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
Jang et al., (2016) Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Koller and Friedman, (2009) Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.
Kyono et al., (2021) Kyono, T., Zhang, Y., Bellot, A., and van der Schaar, M. (2021). Miracle: Causally-aware imputation via learning missing data mechanisms. Advances in Neural Information Processing Systems, 34:23806–23817.
Lacerda et al., (2008) Lacerda, G., Spirtes, P., Ramsey, J., and Hoyer, P. O. (2008). Discovering cyclic causal models by independent components analysis. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI’08, page 366–374, Arlington, Virginia, USA. AUAI Press.
Lee et al., (2019) Lee, H.-C., Danieletto, M., Miotto, R., Cherng, S. T., and Dudley, J. T. (2019). Scaling structural learning with NO-BEARS to infer causal transcriptome networks. In Pacific Symposium on Biocomputing 2020, pages 391–402. World Scientific.
Li et al., (2019) Li, S. C.-X., Jiang, B., and Marlin, B. (2019). Learning from incomplete data with generative adversarial networks. In International Conference on Learning Representations.
Little and Rubin, (2019) Little, R. J. and Rubin, D. B. (2019). Statistical analysis with missing data, volume 793. John Wiley & Sons.
Lopez et al., (2022) Lopez, R., Hütter, J.-C., Pritchard, J., and Regev, A. (2022). Large-scale differentiable causal discovery of factor graphs. Advances in Neural Information Processing Systems, 35:19290–19303.
Luo et al., (2018) Luo, Y., Cai, X., Zhang, Y., Xu, J., et al. (2018). Multivariate time series imputation with generative adversarial networks. Advances in neural information processing systems, 31.
Meek, (1997) Meek, C. (1997). Graphical Models: Selecting causal and statistical models. PhD thesis, Carnegie Mellon University.
Mohan and Pearl, (2021) Mohan, K. and Pearl, J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, 116(534):1023–1037.
Mohan et al., (2013) Mohan, K., Pearl, J., and Tian, J. (2013). Graphical models for inference with missing data. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.
Mooij and Heskes, (2013) Mooij, J. M. and Heskes, T. (2013). Cyclic causal discovery from continuous equilibrium data. In Uncertainty in Artificial Intelligence.
Muzellec et al., (2020) Muzellec, B., Josse, J., Boyer, C., and Cuturi, M. (2020). Missing data imputation using optimal transport. In International Conference on Machine Learning, pages 7130–7140. PMLR.
Nabi and Bhattacharya, (2023) Nabi, R. and Bhattacharya, R. (2023). On testability and goodness of fit tests in missing data models. In Uncertainty in Artificial Intelligence, pages 1467–1477. PMLR.
Nabi et al., (2020) Nabi, R., Bhattacharya, R., and Shpitser, I. (2020). Full law identification in graphical models of missing data: Completeness results. In International conference on machine learning, pages 7153–7163. PMLR.
Nabi et al., (2022) Nabi, R., Bhattacharya, R., Shpitser, I., and Robins, J. (2022). Causal and counterfactual views of missing data models. arXiv preprint arXiv:2210.05558.
Ng et al., (2020) Ng, I., Ghassami, A., and Zhang, K. (2020). On the role of sparsity and DAG constraints for learning linear dags. Advances in Neural Information Processing Systems, 33:17943–17954.
Ng et al., (2022) Ng, I., Zhu, S., Fang, Z., Li, H., Chen, Z., and Wang, J. (2022). Masked gradient-based causal structure learning. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), pages 424–432. SIAM.
Pearl, (2009a) Pearl, J. (2009a). Causality. Cambridge University Press, 2nd edition.
Pearl, (2009b) Pearl, J. (2009b). Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition.
Richardson, (1996) Richardson, T. (1996). A discovery algorithm for directed cyclic graphs. In Proceedings of the Twelfth international conference on Uncertainty in artificial intelligence, pages 454–461.
Rudin, (1953) Rudin, W. (1953). Principles of Mathematical Analysis. McGraw-Hill Book Company, Inc., New York-Toronto-London.
Sachs et al., (2005) Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529.
Saeed et al., (2020) Saeed, B., Belyaeva, A., Wang, Y., and Uhler, C. (2020). Anchored causal inference in the presence of measurement error. In Conference on uncertainty in artificial intelligence, pages 619–628. PMLR.
Seaman and White, (2013) Seaman, S. R. and White, I. R. (2013). Review of inverse probability weighting for dealing with missing data. Statistical methods in medical research, 22(3):278–295.
Segal et al., (2005) Segal, E., Pe’er, D., Regev, A., Koller, D., Friedman, N., and Jaakkola, T. (2005). Learning module networks. Journal of Machine Learning Research, 6(4).
Sethuraman et al., (2023) Sethuraman, M. G., Lopez, R., Mohan, R., Fekri, F., Biancalani, T., and Huetter, J.-C. (2023). Nodags-flow: Nonlinear cyclic causal structure learning. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 6371–6387. PMLR.
Singh, (1997) Singh, M. (1997). Learning bayesian networks from incomplete data. AAAI/IAAI, 1001:534–539.
Solus et al., (2017) Solus, L., Wang, Y., Matejovicova, L., and Uhler, C. (2017). Consistency guarantees for permutation-based causal inference algorithms. arXiv preprint arXiv:1702.03530.
Spirtes et al., (2000) Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, prediction, and search. MIT press.
Stekhoven and Bühlmann, (2012) Stekhoven, D. J. and Bühlmann, P. (2012). Missforest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118.
Strobl et al., (2018) Strobl, E. V., Visweswaran, S., and Spirtes, P. L. (2018). Fast causal inference with non-random missingness by test-wise deletion. International journal of data science and analytics, 6:47–62.
Sulik et al., (2017) Sulik, J. J., Newlands, N. K., and Long, D. S. (2017). Encoding dependence in bayesian causal networks. Frontiers in Environmental Science, 4:84.
Triantafillou and Tsamardinos, (2015) Triantafillou, S. and Tsamardinos, I. (2015). Constraint-based causal discovery from multiple interventions over overlapping variable sets. The Journal of Machine Learning Research, 16(1):2147–2205.
Tsamardinos et al., (2006) Tsamardinos, I., Brown, L. E., and Aliferis, C. F. (2006). The max-min hill-climbing bayesian network structure learning algorithm. Machine learning, 65(1):31–78.
Tu et al., (2019) Tu, R., Zhang, C., Ackermann, P., Mohan, K., Kjellström, H., and Zhang, K. (2019). Causal discovery in the presence of missing data. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1762–1770. PMLR.
Van den Broeck et al., (2015) Van den Broeck, G., Mohan, K., Choi, A., Darwiche, A., and Pearl, J. (2015). Efficient algorithms for bayesian network parameter learning from incomplete data. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI’15, page 161–170, Arlington, Virginia, USA. AUAI Press.
Wang et al., (2020) Wang, Y., Menkovski, V., Wang, H., Du, X., and Pechenizkiy, M. (2020). Causal discovery from incomplete data: a deep learning approach. arXiv preprint arXiv:2001.05343.
Wang et al., (2017) Wang, Y., Solus, L., Yang, K., and Uhler, C. (2017). Permutation-based causal inference algorithms with interventions. Advances in Neural Information Processing Systems, 30.
White et al., (2011) White, I. R., Royston, P., and Wood, A. M. (2011). Multiple imputation using chained equations: issues and guidance for practice. Statistics in medicine, 30(4):377–399.
Wu, (1983) Wu, C. F. J. (1983). On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11(1):95 – 103.
Yu et al., (2019) Yu, Y., Chen, J., Gao, T., and Yu, M. (2019). DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR.
Zhang et al., (2013) Zhang, B., Gaiteri, C., Bodea, L.-G., Wang, Z., McElwee, J., Podtelezhnikov, A. A., Zhang, C., Xie, T., Tran, L., and Dobrin, R. (2013). Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer’s disease. Cell, 153(3):707–720.
Zheng et al., (2018) Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31.
Zheng et al., (2020) Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E. (2020). Learning sparse nonparametric DAGs. In Chiappa, S. and Calandra, R., editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108, pages 3414–3425.

Supplementary Materials

The appendix is structured as follows. Appendix A offers a summary of the notations used throughout the manuscript for ease of reference. Appendix B contains all the proofs. Appendix C details modifications to the MissNODAG framework necessary when the search space is confined to DAGs. Appendix D provides details on three sets of additional simulations and a real-data application. Appendix E provides additional details on the implementation of the baselines and the MissNODAG framework.

Appendix A GLOSSARY

A comprehensive list of notations used in the manuscript is provided in Table 1.

Table 1: Glossary of terms and notations

Symbol	Definition	Symbol	Definition
$X$ , $x$	Variables, values	$\mathcal{G},\mathcal{G}_{m}$	Graph, m-graph
$\epsilon$	Exogenous noise terms	$k$	Graph node index variable
$R$	Missingness indicators	$K$	Total nodes in $\mathcal{G}$
$Y$	Coarsened version of $X$	$\textrm{pa}_{\mathcal{G}}(X_{k})$	Parent set of $X_{k}$ in $\mathcal{G}$
$S$	Intervention indicator vector	$\mathrm{F},f_{k}$	SEM functions
$X_{\mathcal{I}}$	Set of intervened nodes	id	Identity map
$X_{\mathcal{O}}$	Set of non-intervened nodes	$(\textrm{id}-\mathrm{F})$	Function mapping $X$ to $\epsilon$
$p_{\epsilon}(\epsilon)$	Exogenous noise density function	$J_{\mathrm{F}}(X)$	Jacobian matrix of $\mathrm{F}$ at $X$
$p_{X}(X),p(X,R)$	Target, full laws	$\bm{I}$	$K\times K$ identity matrix
$p(Y\mid X,R)$	Coarsening mechanism	$\mathbf{D}$	Matrix of intervention masks
$p(R\mid X)$	Missingness mechanism	$\bm{B}$	Weighted adjacency matrix of linear SEM
$p(\bm{M}\mid\theta)$	Gumbel-Softmax distribution for sampling $\bm{M}$	$\bm{M}$	Dependency mask for SEM function $\mathrm{F}$
$\Theta=(\theta,\phi)$	Parameters of $p_{X}(X)$ , $p(R\mid X)$	$\bm{\Lambda}$	Inverse covariance matrix
$\Theta^{t}$	Parameters at the $t$-th EM iteration	$\mathbf{D}_{i}$	Intervention mask corresponding to the $i$-th sample
$\Gamma,\Omega$	Index sets for observed, missing nodes	$i$	Sample index variable
$w_{k},z_{k}$	Parameters of $p(R_{k}\mid\textrm{pa}_{\mathcal{G}_{m}}(R_{k}))$	$n$	Total sample size
$(y^{(i)},r^{(i)},s^{(i)})$	$i$-th sample	$n_{B}$	Mini-batch size in each EM iteration
$\bm{A}\odot\bm{B}$	Hadamard product of $\bm{A}$ and $\bm{B}$	$m$	Power series expansion index for log-determinant of Jacobian
$\|\cdot\|_{1}$	L1 norm	$N$	Number of power series terms
$\mathcal{Q}(\cdot)$	Proposal distribution for rejection sampling	$W$	Gaussian random variable used for computing the trace of the Jacobian
$\mathcal{R}(\cdot)$	Sparsity inducing regularizer	$e^{\bm{A}}$	Matrix exponential of $\bm{A}$

Appendix B PROOFS

B.1 Joint Density of Variables $X$ : Target Law

Consider the structural equation model $X=\mathrm{F}(X)+\epsilon$ , which implies $X=(\textrm{id}-\mathrm{F})^{-1}(\epsilon)$ . By the change-of-variables formula for probability densities, the joint distribution of $X=(X_{1},\ldots,X_{K})$ is given by,

p_{X}(X)=p_{\epsilon}\big((\textrm{id}-\mathrm{F})(X)\big)\,\Big|\det J_{(\textrm{id}-\mathrm{F})}(X)\Big|,

(20)

where $p_{\epsilon}(\epsilon)$ is the probability density function of the exogenous noise vector $\epsilon$ . Under an interventional setting $(X_{\mathcal{I}},X_{\mathcal{O}})$ , all incoming edges to the nodes in $X_{\mathcal{I}}$ are removed, leading to the following structural equations:

X_{k}=\begin{cases}\tilde{X}_{k}&\text{if }X_{k}\in X_{\mathcal{I}},\\ f_{k}\big(\textrm{pa}_{\mathcal{G}}(X_{k})\big)+\epsilon_{k}&\text{if }X_{k}\notin X_{\mathcal{I}}.\end{cases}

That is, $X_{k}$ is set to a known value $\tilde{X}_{k}$ if it is intervened upon, and the structural equation remains unchanged when $X_{k}$ is purely observed. The above equation can be written more concisely as follows:

X_{k}=d_{k}\cdot\Big(f_{k}\big(\textrm{pa}_{\mathcal{G}}(X_{k})\big)+\epsilon_{k}\Big)+C_{k},\quad\text{for }k=1,\ldots,K,

(21)

where $d_{k}=\mathds{1}\{X_{k}\notin X_{\mathcal{I}}\}$ , and $\mathds{1}\{\cdot\}$ is the indicator function, and $C_{k}=\tilde{X}_{k}$ if $X_{k}\in X_{\mathcal{I}}$ , and $C_{k}=0$ otherwise. Let $\mathbf{D}\in\mathbb{R}^{K\times K}$ be a diagonal matrix such that $D_{kk}=d_{k}$ . Thus, (21) can now be combined for $k=1,\ldots,K$ to obtain the following equation,

X=\mathbf{D}\mathrm{F}(X)+\mathbf{D}\epsilon+C,

(22)

where $\mathrm{F}(X)$ is defined in Section 2 and $C=(C_{1},\ldots,C_{K})$. Thus $(\textrm{id}-\mathbf{D}\mathrm{F})(X)=\mathbf{D}\epsilon+C$, which implies $X=(\textrm{id}-\mathbf{D}\mathrm{F})^{-1}(\mathbf{D}\epsilon+C)$. Let $X_{\mathcal{I}}\sim p_{X_{\mathcal{I}}}(X_{\mathcal{I}})$. Assuming the intervened values are drawn independently of the exogenous noise, applying the change of variables in (20) to this model gives:

p_{X}(X)=p_{X_{\mathcal{I}}}(X_{\mathcal{I}})\ p_{\epsilon_{\mathcal{O}}}(\epsilon_{\mathcal{O}})\ \Big|\det J_{(\textrm{id}-\mathbf{D}\mathrm{F})}(X)\Big|,

(23)

where $\epsilon_{\mathcal{O}}$ denotes the exogenous noise terms of the variables in $X_{\mathcal{O}}$.
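As a sanity check on (23), the sketch below evaluates the interventional log-density for the special case of a linear SEM, $\mathrm{F}(X)=\bm{B}^{\top}X$, with Gaussian noise and intervened nodes drawn from a standard normal distribution. The function names and argument layout are hypothetical and chosen only for this example.

```python
import numpy as np

def log_normal(z, var):
    """Elementwise log N(z; 0, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + z ** 2 / var)

def interventional_log_density(x, B, sigma2, intervened):
    """log p_X(x) from (23) for F(x) = B^T x, Gaussian noise, X_I ~ N(0, I).

    x          : (K,) sample
    B          : (K, K) weighted adjacency matrix, B[i, j] != 0 means i -> j
    sigma2     : (K,) noise variances
    intervened : list of indices of intervened nodes
    """
    K = len(x)
    d = np.ones(K)
    d[list(intervened)] = 0.0
    A = np.eye(K) - np.diag(d) @ B.T         # Jacobian of (id - D F)
    z = A @ x                                # noise on O, intervened value on I
    obs = d == 1.0
    log_p_int = log_normal(z[~obs], 1.0).sum()
    log_p_eps = log_normal(z[obs], sigma2[obs]).sum()
    _, logabsdet = np.linalg.slogdet(A)
    return log_p_int + log_p_eps + logabsdet
```

In this linear case the value agrees with the log-density of a zero-mean Gaussian whose covariance is $(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-1}\big(\mathbf{D}\bm{\Lambda}^{-1}\mathbf{D}+(\bm{I}-\mathbf{D})\big)(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-\top}$, where $\bm{\Lambda}^{-1}$ is the diagonal noise covariance; this offers a simple numerical test of the factorization.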

B.2 Precision Matrix under Interventions with MAR Missingness Mechanism

Consider the following linear SEM with Gaussian exogenous noise term,

X=\bm{B}^{\top}X+\epsilon,

(24)

where $\bm{B}\in\mathbb{R}^{K\times K}$ is the weighted adjacency matrix of $\mathcal{G}$ , $\epsilon\sim\mathcal{N}(0,\bm{\Lambda}^{-1})$ , and $\bm{\Lambda}=\text{diag}(1/\sigma^{2}_{1},\ldots,1/\sigma_{K}^{2})$ is the inverse covariance matrix of $\epsilon$ . Thus $X=(\bm{I}-\bm{B}^{\top})^{-1}\epsilon$ . From the properties of the Gaussian distribution, $X$ also follows a Gaussian distribution with zero mean and covariance matrix given by,

\mathbb{E}[XX^{\top}]=\mathbb{E}\Big[(\bm{I}-\bm{B}^{\top})^{-1}\epsilon\epsilon^{\top}(\bm{I}-\bm{B}^{\top})^{-\top}\Big]=(\bm{I}-\bm{B}^{\top})^{-1}\mathbb{E}[\epsilon\epsilon^{\top}](\bm{I}-\bm{B}^{\top})^{-\top}=(\bm{I}-\bm{B}^{\top})^{-1}\bm{\Lambda}^{-1}(\bm{I}-\bm{B}^{\top})^{-\top}.

Therefore, the inverse covariance matrix of $X$ is given by $\bm{\Lambda}_{X}=(\bm{I}-\bm{B})\bm{\Lambda}(\bm{I}-\bm{B}^{\top})$ . Given an interventional setting $X=(X_{\mathcal{I}},X_{\mathcal{O}})$ , we can write (24) as:

X=\mathbf{D}\bm{B}^{\top}X+\mathbf{D}\epsilon+C,

(25)

where $\mathbf{D}$ is the intervention mask matrix, and $C$ denotes the values of the intervened nodes. Hence, $X=(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-1}(\mathbf{D}\epsilon+C)$ . The intervened nodes are sampled from a standard normal distribution, i.e., $X_{\mathcal{I}}\sim\mathcal{N}(0,\bm{I})$ , independently of $\epsilon$ . Again, $X$ is Gaussian distributed with zero mean and covariance matrix given by,

\begin{aligned}
\mathbb{E}[XX^{\top}]&=(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-1}\,\mathbb{E}\big[(\mathbf{D}\epsilon+C)(\mathbf{D}\epsilon+C)^{\top}\big]\,(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-\top}\\
&=(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-1}\big(\mathbf{D}\bm{\Lambda}^{-1}\mathbf{D}+(\bm{I}-\mathbf{D})\big)(\bm{I}-\mathbf{D}\bm{B}^{\top})^{-\top}.
\end{aligned}

Hence, the inverse covariance of $X$ under the interventional setting is given by,

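\bm{\Lambda}_{X,\mathcal{I}}=(\bm{I}-\mathbf{D}\bm{B}^{\top})^{\top}\big(\mathbf{D}\bm{\Lambda}^{-1}\mathbf{D}+(\bm{I}-\mathbf{D})\big)^{-1}(\bm{I}-\mathbf{D}\bm{B}^{\top})=(\bm{I}-\bm{B}\mathbf{D})\big(\mathbf{D}\bm{\Lambda}^{-1}\mathbf{D}+(\bm{I}-\mathbf{D})\big)^{-1}(\bm{I}-\mathbf{D}\bm{B}^{\top}).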
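The interventional covariance derived above can be checked numerically. The sketch below builds the covariance for a small hypothetical linear SEM, forms its inverse as $\bm{A}^{\top}\bm{M}^{-1}\bm{A}$ with $\bm{A}=\bm{I}-\mathbf{D}\bm{B}^{\top}$ and $\bm{M}=\mathbf{D}\bm{\Lambda}^{-1}\mathbf{D}+(\bm{I}-\mathbf{D})$, and compares the covariance against a Monte Carlo estimate. The symbols $\bm{A}$ and $\bm{M}$, the graph, the noise variances, and the sample size are all introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
B = rng.standard_normal((K, K))
np.fill_diagonal(B, 0.0)
B *= 0.5 / np.linalg.norm(B, 2)            # keep (I - D B^T) well conditioned
sigma2 = np.array([0.5, 1.0, 0.25, 2.0])   # diagonal of Lambda^{-1}
d = np.array([1.0, 0.0, 1.0, 1.0])         # the node with index 1 is intervened
D, I = np.diag(d), np.eye(K)

A = I - D @ B.T
M = D @ np.diag(sigma2) @ D + (I - D)      # covariance of D eps + C
cov = np.linalg.solve(A, M) @ np.linalg.inv(A).T
precision = A.T @ np.linalg.inv(M) @ A     # inverse covariance of X

# Monte Carlo estimate of the covariance from simulated interventional data.
n = 200_000
eps = rng.standard_normal((n, K)) * np.sqrt(sigma2)
C = rng.standard_normal((n, K)) * (1.0 - d)
X = np.linalg.solve(A, (eps * d + C).T).T
cov_mc = X.T @ X / n
```

The product `precision @ cov` should equal the identity up to rounding, and `cov_mc` should approach `cov` as the number of samples grows.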

B.3 Convergence Analysis

Here we provide the convergence analysis of MissNODAG. Our analysis relies on the convergence of the EM algorithm (Wu,, 1983; Friedman,, 1998). The crux of the analysis depends on establishing that the total log-likelihood of the non-missing nodes in the data set either increases or stays the same in each iteration of the algorithm. That is,

\sum_{i=1}^{n}\log p_{X}\big(x_{\Gamma_{i}}^{(i)},r^{(i)}\mid\Theta^{t+1}\big)\geq\sum_{i=1}^{n}\log p_{X}\big(x_{\Gamma_{i}}^{(i)},r^{(i)}\mid\Theta^{t}\big).

(26)

To that end, note that

\sum_{i=1}^{n}\log p_{X}(x_{\Gamma_{i}}^{(i)},r^{(i)}\mid\Theta)=\sum_{i=1}^{n}\log p_{X}(x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^{(i)},r^{(i)}\mid\Theta)-\sum_{i=1}^{n}\log p_{X}(x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta).

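Taking the expectation with respect to $x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)}$ under $\Theta^{t}$ on both sides (the left-hand side does not depend on $x_{\Omega_{i}}^{(i)}$ and is therefore unchanged), we get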

\begin{aligned}
\sum_{i=1}^{n}\log p_{X}(x_{\Gamma_{i}}^{(i)},r^{(i)}\mid\Theta)&=\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)};\Theta^{t}}\log p_{X}(x_{\Gamma_{i}}^{(i)},r^{(i)}\mid\Theta)\\
&=\underbrace{\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)};\Theta^{t}}\log p_{X}(x_{\Gamma_{i}}^{(i)},x_{\Omega_{i}}^{(i)},r^{(i)}\mid\Theta)}_{=Q(\Theta|\Theta^{t})}-\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)};\Theta^{t}}\log p_{X}(x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta).
\end{aligned}

(27)

The first term on the RHS of (27) is $Q(\Theta|\Theta^{t})$. The M-step maximizes it over $\Theta$, so $Q(\Theta^{t+1}|\Theta^{t})\geq Q(\Theta^{t}|\Theta^{t})$. On the other hand,

\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)};\Theta^{t}}\log\frac{p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t+1})}{p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t})}=-\sum_{i=1}^{n}D_{KL}\Big(p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t})\,\big\|\,p_{X}(x_{\Omega_{i}}^{(i)}|x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t+1})\Big)\leq 0.

Thus,

\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)};\Theta^{t}}\log p_{X}(x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t+1})\leq\sum_{i=1}^{n}\mathbb{E}_{x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)};\Theta^{t}}\log p_{X}(x_{\Omega_{i}}^{(i)}\mid x_{\Gamma_{i}}^{(i)},r^{(i)},\Theta^{t}).

(28)

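Evaluating (27) at $\Theta=\Theta^{t+1}$ and at $\Theta=\Theta^{t}$ and subtracting, the difference of the $Q$ terms is nonnegative because of the M-step, and the difference of the remaining terms is nonnegative by (28). Hence (26) is satisfied at the end of the M-step. Similar to the previous results on EM convergence (Wu, 1983; Friedman, 1998), MissNODAG reaches a stationary point of the optimization objective.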
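The monotonicity property in (26) can be checked numerically on a much simpler model. The sketch below is not MissNODAG: it runs textbook EM on a bivariate Gaussian whose second coordinate is missing completely at random, so the missingness term is constant in the parameters and is dropped. The function names `observed_loglik` and `em_step`, the toy parameters, and the 30% missingness rate are assumptions chosen for illustration.

```python
import numpy as np

def observed_loglik(X, miss, mu, S):
    """Observed-data log-likelihood of a bivariate Gaussian when x2 may be missing."""
    S_inv = np.linalg.inv(S)
    _, logdet = np.linalg.slogdet(S)
    ll = 0.0
    for x, m in zip(X, miss):
        if m:  # only x1 observed: use the marginal N(mu[0], S[0, 0])
            ll -= 0.5 * (np.log(2 * np.pi * S[0, 0]) + (x[0] - mu[0]) ** 2 / S[0, 0])
        else:  # both coordinates observed: use the joint N(mu, S)
            d = x - mu
            ll -= 0.5 * (2 * np.log(2 * np.pi) + logdet + d @ S_inv @ d)
    return ll

def em_step(X, miss, mu, S):
    """One EM iteration: impute conditional moments of x2, then re-estimate (mu, S)."""
    n = X.shape[0]
    # E-step: conditional mean and variance of x2 given x1 under the current parameters.
    slope = S[0, 1] / S[0, 0]
    cond_var = S[1, 1] - S[0, 1] ** 2 / S[0, 0]
    X_hat = X.copy()
    X_hat[miss, 1] = mu[1] + slope * (X[miss, 0] - mu[0])
    # M-step: closed-form maximizer of the expected complete-data log-likelihood.
    mu_new = X_hat.mean(axis=0)
    D = X_hat - mu_new
    S_new = D.T @ D / n
    S_new[1, 1] += cond_var * miss.sum() / n  # add back the imputation uncertainty
    return mu_new, S_new

rng = np.random.default_rng(0)
n = 500
X = rng.multivariate_normal([1.0, -1.0], [[1.0, 0.6], [0.6, 2.0]], size=n)
miss = rng.random(n) < 0.3   # MCAR mask for the second coordinate
X[miss, 1] = np.nan

mu, S = np.zeros(2), np.eye(2)
loglik = [observed_loglik(X, miss, mu, S)]
for _ in range(25):
    mu, S = em_step(X, miss, mu, S)
    loglik.append(observed_loglik(X, miss, mu, S))
```

The list `loglik` plays the role of the two sides of (26) across iterations: each entry should be at least as large as the previous one, up to floating-point error.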

Appendix C LEARNING DAGS FROM INCOMPLETE OBSERVATIONAL DATA

When the causal graph associated with the target law $p_{X}(X)$ is a Directed Acyclic Graph (DAG), it is sometimes possible to recover the true causal structure from purely observational data. For example, when the exogenous noise variable $\epsilon$ is Gaussian distributed with equal variance, i.e., $\epsilon\sim\mathcal{N}(0,\sigma^{2}\bm{I})$, then the DAG is identifiable from observational data. This, however, requires constraining the search space in the optimization problem defined by (8) to the set of all DAGs. To achieve this, we leverage the trace-exponential constraint introduced by Zheng et al. (2018). Let $\bm{M}\sim p(\bm{M}|\theta)$; by adding the following constraint to (8), we ensure that the resulting graph is acyclic:

\mathbb{E}_{\bm{M}\sim p(\bm{M}\mid\theta)}\Big[\text{Tr}\Big(e^{\bm{M}\odot\bm{M}}\Big)-K\Big]=0,

(29)

where $K$ is the number of nodes in the graph, $e^{\bm{A}}$ denotes the matrix exponential of $\bm{A}$, and $\odot$ signifies the Hadamard product (elementwise multiplication) of two matrices. In the current formulation, the SEM function $\mathrm{F}$ is constrained to be contractive. While this condition guarantees bijectivity of $(\textrm{id}-\mathrm{F})$ for cyclic graphs, it is overly restrictive for DAGs. Thus, inspired by Sethuraman et al. (2023), we introduce a preconditioning term $\Lambda$ to redefine $\mathrm{F}$. This modification enables the modeling of non-contractive parent-child dependencies while maintaining efficient computation of the log-determinant of the Jacobian. For more details, refer to Sethuraman et al. (2023). With the acyclicity term added as a penalty, the maximization objective in (8) takes the following form:

\Theta^{t+1}=\arg\max_{\Theta}\ Q(\Theta\mid\Theta^{t})-\lambda_{1}\mathcal{R}(\theta)-\lambda_{2}\mathcal{R}(\phi)-\lambda_{\text{DAG}}\cdot\mathbb{E}_{\bm{M}\sim p(\bm{M}\mid\theta)}\Big[\text{Tr}\Big(e^{\bm{M}\odot\bm{M}}\Big)-K\Big].

Following Ng et al. (2020), this optimization can be solved using stochastic gradient-based methods. We present results on learning DAGs from synthetic partially observed data in Appendix D.3.
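As a rough illustration of the acyclicity term in (29), the sketch below evaluates $\text{Tr}(e^{\bm{M}\odot\bm{M}})-K$ for a fixed mask and averages it over sampled masks. It is not the authors' implementation: the helper names, the truncated power series used in place of a library matrix exponential, and the independent Bernoulli edge sampling (standing in for the Gumbel-softmax parameterization) are all assumptions made for the example.

```python
import numpy as np

def trace_exp(A, n_terms=40):
    """Approximate Tr(exp(A)) with the truncated series sum_k Tr(A^k) / k!."""
    term = np.eye(A.shape[0])      # holds A^k / k!, starting at k = 0
    total = np.trace(term)
    for k in range(1, n_terms):
        term = term @ A / k
        total += np.trace(term)
    return total

def dag_penalty(M):
    """h(M) = Tr(exp(M * M)) - K, which is zero when M has no directed cycle."""
    return trace_exp(M * M) - M.shape[0]

def expected_dag_penalty(edge_probs, n_samples=200, seed=0):
    """Monte Carlo estimate of E_M[h(M)] with independent Bernoulli edges (an assumption)."""
    rng = np.random.default_rng(seed)
    masks = rng.random((n_samples,) + edge_probs.shape) < edge_probs
    return float(np.mean([dag_penalty(m.astype(float)) for m in masks]))

# A three-node DAG (strictly upper triangular) and a directed three-cycle.
M_dag = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]], dtype=float)
M_cyc = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
h_dag = dag_penalty(M_dag)   # no cycles, so the penalty vanishes
h_cyc = dag_penalty(M_cyc)   # the cycle makes the penalty strictly positive

# Edge probabilities supported only above the diagonal can only produce DAGs.
p_upper = np.triu(np.full((3, 3), 0.5), k=1)
h_expected = expected_dag_penalty(p_upper)
```

For a binary mask, $\text{Tr}\big((\bm{M}\odot\bm{M})^{k}\big)$ counts closed walks of length $k$, which is why the quantity is zero exactly when the graph has no directed cycle. In a gradient-based setting one would typically use a differentiable matrix exponential instead of the truncated series.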

Appendix D ADDITIONAL EXPERIMENTS

D.1 Target Law Recovery: Performance as a Function of Sample Size

We evaluated the performance of the target law recovery as a function of sample size. The average missingness probability was set to $0.2$ , and each missingness indicator was restricted to have at most three parents. The data was generated as described in Section 4. The results for the linear SEM are shown in Figure 5 and the results for the nonlinear SEM are shown in Figure 6.

In the linear setting (Figure 5), the performance of all methods improves as the number of training samples increases. MissNODAG performs comparably to NODAGS-Flow trained on complete data (no missing samples in the data set), even with significantly fewer samples than the baseline models. A similar trend is seen in the nonlinear case (Figure 6), where all models show improved results with larger sample sizes. For a lower edge density (ER-1), the models perform similarly. However, with a higher edge density (ER-2), the performance gap widens, particularly with smaller sample sizes.

D.2 Target Law Recovery: Performance as a Function of Cardinalities for $\textrm{pa}_{\mathcal{G}_{m}}(R_{k})$

We also evaluated target law recovery performance as a function of the parent set cardinality of the missingness indicators, which reflects the sparsity of the missingness mechanism. For this analysis, the training set size was fixed at 5000 samples. Results for the linear and nonlinear SEMs are shown in Figures 7 and 8, respectively.

In the linear case (Figure 7), the target law graph is perfectly recovered when the missingness probability is below 0.4. However, as the parent set cardinality increases, recovery performance declines. A similar trend is observed for the nonlinear SEM (Figure 8), where larger parent set cardinality also results in a noticeable drop in recovery performance.

D.3 Target Law Recovery: Learning DAGs from Partially Observed Observational Data

Figures 9 and 10 present the results of learning DAGs from partially observed observational data. We followed the same procedure described in Section 4 to generate the data, with the additional constraint that the resulting graphs are acyclic. The SEM functions used for both linear and nonlinear cases were designed to be non-contractive. Details on the learning methodology for DAGs are described in Appendix C. In both settings, MissNODAG outperforms the baselines, although its performance decreases compared to learning cyclic graphs from interventional data. We attribute this drop to the increased optimization complexity introduced by the DAG constraint.

D.4 Data Application: Gene Perturbation

Here we present an experiment on learning the causal graph structure of a gene regulatory network from gene expression data with genetic interventions. In particular, we focus on the Perturb-CITE-seq dataset (Frangieh et al., 2021), a type of data set that allows one to study causal relations in gene networks at an unprecedented scale. It contains gene expression measurements from 218,331 melanoma cells split into three cell conditions: (i) control (57,627 cells), (ii) co-culture (73,114 cells), and (iii) interferon (IFN)-$\gamma$ (87,590 cells).

Table 2: List of genes chosen from the Perturb-CITE-seq dataset (Frangieh et al., 2021).

STAT1	B2M	LGALS3	PTMA
SSR2	CTPS1	TM4SF1	MRPL47
DNMT1	TMED10

Due to practical and computational constraints, we restrict ourselves to a subset of 10 of the approximately 20,000 genes in the genome, following the experimental setup of Sethuraman et al. (2023); the chosen genes are summarized in Table 2. We then took all the single-node interventions corresponding to these 10 genes. Each cell condition is treated as a separate data set, and the models were trained on each one separately.

Since the data set does not provide a ground truth causal graph, it is not possible to directly compare the performance using SHD. Instead, we compare the causal discovery methods based on their predictive performance over unseen interventions. To that end, we perform a 90-10 split on each of the three data sets. The smaller set is treated as the test set, which is then used for performance comparison between MissNODAG and the baselines. As a performance metric, we use the predicted negative log-likelihood (NLL) over the test set after training the models for 100 epochs.

In all three cell conditions, MissNODAG outperforms the baseline methods, with the difference being most significant when the average missingness probability is low. MissNODAG matches the performance of NODAGS-Flow trained on clean data when the missingness probability is 0.1. For the control cell condition, the performance of MissNODAG remains comparable to that on clean data until the missingness probability reaches 0.2.

Appendix E IMPLEMENTATION DETAILS

In this section, we describe the implementation details of the MissNODAG framework and the baseline models used for performance comparisons.

MissNODAG. We implemented our framework using the PyTorch library in Python. The code used in running the experiments can be found in the codes folder within the supplementary materials, and the code for the proposed model is available at https://github.com/muralikgs/missnodag.

Starting with an initialization of the model parameters $\Theta^{0}$, we alternate between the E-step and M-step until the parameters converge. In the E-step, Algorithm 1 is used for imputing the missing data, followed by maximizing the expected likelihood of the non-missing nodes in the M-step. We follow the same setup as Sethuraman et al. (2023) for modeling the causal functions, i.e., neural networks (NN) along with a dependency mask whose entries are parameterized by a Gumbel-softmax distribution, and for computing the log-determinant of the Jacobian, i.e., a power series expansion followed by the Hutchinson trace estimator. A Poisson distribution is used for $p_{\mathbb{N}}$ to sample the number of terms in the expansion, which reduces the bias introduced by truncating the power series expansion of the log-determinant of the Jacobian; see Section 3.2. The final objective in the M-step is maximized using the Adam optimizer (Kingma and Ba, 2014).
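To make the log-determinant computation concrete, the sketch below estimates $\log\det(\bm{I}-\bm{J})=-\sum_{k\geq 1}\text{Tr}(\bm{J}^{k})/k$ for a fixed contractive matrix $\bm{J}$, combining a Hutchinson probe with a randomly sampled number of series terms. It is only a minimal illustration under stated assumptions and should not be read as the authors' code: the matrix `J` stands in for the Jacobian of $\mathrm{F}$, the number of terms is taken to be $1+\text{Poisson}(\lambda)$ with $\lambda=2$, the probes are Rademacher vectors, and each term is reweighted by $1/P(N\geq k)$, a standard device for removing truncation bias.

```python
import math
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=None)
def survival(k, lam):
    """P(N >= k) for N = 1 + Poisson(lam), summed directly over the Poisson tail."""
    return sum(math.exp(-lam) * lam ** j / math.factorial(j)
               for j in range(k - 1, k + 59))

def logdet_sample(J, rng, lam=2.0):
    """Single-sample estimate of log det(I - J), assuming the spectral norm of J is < 1."""
    v = rng.choice([-1.0, 1.0], size=J.shape[0])   # Rademacher probe for Hutchinson
    n_terms = 1 + int(rng.poisson(lam))            # random truncation point
    w, est = v.copy(), 0.0
    for k in range(1, n_terms + 1):
        w = J @ w                                  # w = J^k v
        # v^T J^k v estimates Tr(J^k); dividing by P(N >= k) offsets the truncation.
        est -= (v @ w) / (k * survival(k, lam))
    return est

rng = np.random.default_rng(0)
J = rng.normal(size=(4, 4))
J *= 0.3 / np.linalg.norm(J, 2)                    # rescale to spectral norm 0.3

estimate = np.mean([logdet_sample(J, rng) for _ in range(10000)])
exact = np.linalg.slogdet(np.eye(4) - J)[1]        # reference value for comparison
```

Averaging many single-sample estimates should approach `exact`; in training, one such noisy estimate per mini-batch is usually enough because the optimizer averages over steps. Only matrix-vector products with `J` are needed, which is what makes this approach attractive when the Jacobian is available only through automatic differentiation.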

The learning rate in all our experiments was set to $10^{-2}$. The neural network models used in our experiments contained one multi-layer perceptron layer. No nonlinearities were added to the neural networks for the linear SEM experiments. We used tanh activation for the nonlinear SEM experiments and ReLU activation for the experiments on the Perturb-CITE-seq data set. The regularization constant $\lambda$ was set to $10^{-2}$ for the synthetic experiments and $10^{-3}$ for the Perturb-CITE-seq experiments. All experiments were performed on NVIDIA RTX6000 GPUs.

Baselines. For the baseline NODAGS-Flow, we modify the code base provided by Sethuraman et al. (2023) to use the imputed samples for maximizing the likelihood. The hyperparameters of NODAGS-Flow were set to the values described in the previous subsection.

MissForest imputation is performed using the publicly available Python library missingpy. We use the codebase provided by Muzellec et al. (2020) for optimal transport imputation, and the codebase provided by the authors for MissDAG (Gao et al., 2022). The default parameters are used for MissForest, optimal transport imputation, and MissDAG. The code for all the baselines can be found inside the codes folder in the supplementary materials.

