Pearls of Causality #8: Inferred Causation, $IC$, and ${IC}^*$

Published:

We will talk about IC, $IC$, and ${IC}^*$ in this post. You get the difference.

Inferred Causation

It is time to see some causal discovery algorithms. There is one last thing before that though: we need to formulate when a node $C$ is the cause of node $E$. In the fully observed case, it is the existence of a directed path from $C$ to $E$. When latents are present, then we cannot use the same definition. The extension is called the principle of inferred causation.

Given distribution $\hat{P}$, node $C$ has a causal influence on node $E$ if and only if there exists a directed path from $C$ to $E$ in every minimal latent structure consistent with $\hat{P}$. $\forall L, L' \in \mathcal{L},\ L \preceq L', \exists \theta_G : P_{[O]}(<G, \theta_G>)=\hat{P} \\ \exists p = C\rightarrow \dots \rightarrow E \in L$

Inferred Causation is a more strict requirement than the sole existence of a $C\rightarrow \dots \rightarrow E$ path. It demands such a path in every minimal latent structure consistent with $\hat{P}$.

Consistency is a necessary condition, as we need to represent $\hat{P}$. Failing to find a directed path in consistent structures means that we cannot conclude a causal influence.

Minimality is a sufficient condition, as a directed path in the minimal structure ensures the existence of the same path in the non-minimal graph. However, this does not hold in the other way around: a directed path in a non-minimal structure does not imply that there will be a directed path in the minimal structure.

The Inductive Causation $(IC)$ algorithm

After having dived into the intricate details of causal models and latent structures, we can analyze the first algorithm for causal discovery.

The Inductive Causation $(IC)$ algorithm tackles causal discovery without latent variables. Although when assuming stability, there will be a unique minimal causal structure, this only holds up to Markov equivalence.

This means that $IC$ cannot identify the ground truth graph, it will only output a pattern.

A pattern $H$ is a graph with both directed and undirected edges. In our context, it is used to express Markov equivalence classes.

The algorithm works with separating sets to determine conditional/marginal independence.

A separating set $S_{ab}$ is a subset of variables $V$ in DAG $G$ if $a\perp b | S_{ab}$ holds. It is minimal, if no vertex can be removed from $S_{ab}$ so that $a\perp b | S_{ab}$ still holds.

Algorithm

Input: $\hat{P}$, a stable distribution on a set $V$ of variables.

Output: a pattern $H(\hat{P})$ compatible with $\hat{P}$.

Steps

1. $\forall a, b \in V$, search for a set $S_{ab} : a \perp_{\hat{P}} b | S_{ab}$. Construct an undirected graph $G$ with $a-b$ if and only if $S_{ab} = \emptyset$.
2. For each $a-c-b$, check if $c \in S_{ab}$.
• If it is not , then $a\rightarrow c \leftarrow b$.
3. In the resulting PDAG (Partially Directed Acyclic Graph), orient as many of the undirected edges as possible subject to:
• No new v-structures and
• No directed cycles.

For doing causal discovery, we start with a (stable) distribution - stability is required to rule out functional independencies (independencies introduced by the parameters of the SEM but not present in the ground-truth graph - for details see PoC #7).

Acknowledging that the output is a pattern, we face the reality of causal discovery: it is not guaranteed that we able to recover the ground-truth graph. Nevertheless, we can recover it up to its Markov equivalence class.

Step 1 constructs the skeleton of the graph. This is done by carefully checking conditional independence for all combinations of nodes and separating sets. ($IC$ is agnostic to how this is done; e.g., we can use conditional independence tests) In the case of $n$ nodes, we have ${n\choose 2}$ combinations for $a,b$ with $2^{n-2}$ possible separating sets for each. Multiplying the two gives us the sad reality of $45\times 256 = 11,520$ conditional independence tests in the worst-case scenario for $n=10$ (we don’t need to make all tests if one already indicates conditional independence).

Step 2 identifies v-structures by checking whether the common neighbor $c$ of nonadjacent nodes $a, b$ is in the separating set. Because we know that $c$ is connected to both, the only explanation for $c$ not being in $S_{ab}$ is that $c$ is the middle node of the v-structure $a\rightarrow c \leftarrow b$ - i.e., it introduces dependence between $a$ and $b$ when conditioned on.

Step 3 gives us general guidelines how we should orient the remaining edges, but it does not provide concrete means for it: we cannot introduce new v-structures (as step 2 should have identified all) or directed cycles (as we look for DAGs).

Depending on the task, we may not be able to orient all edges.

The four rules of Step 3

It was shown in the literature that with four simple rules, we can orient the maximal number of edges.

1. Rule 1: Orient $a \rightarrow b - c$ into $a \rightarrow b\rightarrow c$ for nonadjacent $a$ and $c$.
2. Rule 2: Orient $a - b$ into $a \rightarrow b$ whenever there is a chain $a \rightarrow c \rightarrow b$.
3. Rule 3: Orient $a - b$ into $a \rightarrow b$ whenever there are two chains $a - c \rightarrow b$ and $a - d \rightarrow b$ such that $c$ and $d$ are nonadjacent.
4. Rule 4: Orient $a - b$ into $a \rightarrow b$ for $a - c \rightarrow d\rightarrow b - a$ such that $a, d$ are adjacent but $b, c$ are not.

Rule 1 ensures that no v-structures are added (orienting $a \rightarrow b - c$ into $a \rightarrow b\leftarrow c$ would yield a v-structure).

Rule 2 prevents directed cycles of form $a \rightarrow c\rightarrow b \rightarrow a$. Nonetheless, the $a\rightarrow b$ edge creates the v-structure $a\rightarrow b\leftarrow c$. We can resolve the contradiction by showing that this v-structure has no effect on $I(G)$: as $a\not\perp c$, conditioning on $b$ does not add any dependence.

Rule 3 also prescribes a step that eliminates directed cycles from the PDAG. Namely, the edge orientation $a\leftarrow b$ enables both $a\rightarrow c \rightarrow b \rightarrow a$ and $a \rightarrow d \rightarrow b \rightarrow a$ - but this has the cost of introducing the v-structures $a \rightarrow b \leftarrow c$ and $a \rightarrow b \leftarrow d$. Noticing that $b$ was already the middle node of the v-structure $c \rightarrow b \leftarrow d$ and that the edges $a - c$ and $a - d$ imply $a\not\perp c$ and $a\not\perp d$, we conclude that $a\rightarrow b$ does not introduce new conditional independencies.

Rule 4 resembles Rule 3 as it also eliminates directed cycles by introducing a new v-structure, which does not change conditional independencies. As a result, we get $a - c \rightarrow d\rightarrow b \leftarrow a$. As $a$ and $d$ are nonadjacent (i.e., we have $a-d$), the v-structure $a\rightarrow b \leftarrow d$ has no effect on the conditional independence of $a$ and $d$.

The $IC^*$ algorithm

The $IC$ algorithm only works when there are no latent variables. Fortunately, some simple rules are sufficient to extend the process to latent variables.

The theorem stating that each latent structure has at least one projection (discussed in PoC #7), and the principle of Inferred Causation (IC) justify the generalization. Since a causal influence is reflected as an edge in every minimal latent structure, we need to find its projection.

$IC^*$ acknowledges the uncertainty of the latents by outputting a marked pattern.

A marked pattern is a PDAG with four edge types:

1. $a\rightarrow^{*} b$: directed path from $a$ to $b$
2. $a\rightarrow b$: directed path from $a$ to $b$ or $a \leftarrow L \rightarrow b$
3. $a \leftrightarrow b$: $a \leftarrow L \rightarrow b$
4. $a-b:$ $a \leftarrow L \rightarrow b$ or $a \leftarrow b$ or $a \rightarrow b$,

where $a \leftarrow L \rightarrow b$ is confounding (unobserved common cause). Note also that $a \rightarrow b$ on the left of the colon refers to an edge type in the PDAG, including two possible scenarios (to avoid confusion, the edge from $a$ to $b$ is described as a directed path instead of using arrows).

Algorithm

Input: $\hat{P}$, a stable distribution w.r.t. $L$.

Output: a marked pattern $H(\hat{P})$ compatible with $\hat{P}$.

Steps

1. $\forall a, b \in V$, search for a set $S_{ab} : a \perp_{\hat{P}} b | S_{ab}$. Construct an undirected graph $G$ with $a-b$ if and only if $S_{ab} = \emptyset$.
2. For each $a-c-b$, check if $c \in S_{ab}$.
• If it is not , then $a\rightarrow c \leftarrow b$.
3. In the resulting PDAG (Partially Directed Acyclic Graph), orient and mark as many of the undirected edges as possible subject to two rules:
1. Rule $1{}^{}$: Add an arrowhead to $a\rightarrow b - c$ to get $a\rightarrow b\rightarrow^{} c$ for nonadjacent $a$ and $c$.
2. Rule $2{}^{*}$: Add an arrowhead to $a-b$ to get $a\rightarrow b$ if $a$ and $b$ are adjacent and there is a directed path of marked links from $a$ to $b$.

Identifying the separation sets and orienting the edges of v-structures (Steps 1 and 2) are the same as in $IC$, the difference is that $IC^*$ does not orient edges but adds arrowheads.

Rule $1^{}$ is _almost_ the same as Rule 1 in $IC$, but there are two differences: first, the $a\rightarrow c$ edge here comprises of $a\rightarrow^{} b$ and $a \leftarrow L \rightarrow b$; second, $\rightarrow^{*}$ is used to indicate the directed edge from $c$ to $b$. Thus, it includes the addiotnal case when a latent variable leaves the path between $a$ and $c$ active.

The resemblance of Rule $2^{}$ and Rule 2 is more far-fetched: not only three-tuples of nodes as a chain are considered, but also a path of any length. Though the core principle is the same: avoid directed cycles. The marked path $a\rightarrow^{} \dots \rightarrow^{*} b$ and the edge $a-b$ comprises an undirected cycle, which is broken by adding the arrowhead to $a\rightarrow b$. Again, note that this includes $a \leftarrow L \rightarrow b$ - but this does not contradict acyclicity.

Summary

In this post, we were annoyed again by the lack of creativity of the Naming Committee of Causal Terminology. Nonetheless, we encountered our first causal discovery algorithms. By comparing $IC$ and $IC^*$, we have also seen the effect of having latent variables.

Tags: