# Pearls of Causality #7: Latent Structures and Stability


DAGs like to play hide-and-seek. But we are more clever.


# Stable Distributions (Faithfulness)

Before you get too enthusiastic about playing hide-and-seek with DAGs, we need to discuss something. Unfortunately, distributions can be mischievous. They want to make us believe that we have plenty of independencies, although the underlying graph says otherwise.

**Stability** or **faithfulness** (less frequently used alternatives are **DAG-isomorphism** and **perfect-mapness**) answers the question of how “reliable” an independence statement is.

*Note to self: add these names to the Causal Dictionary.*

To understand what “reliability” means, we first need to consider the two types of independence statements.

## The nature of independence

Let’s imagine that we observed $X\perp_P Y | Z$ in the distribution $P$ generated by $G$. We *could* think that this means any of the following (assume that there are only three nodes in $G$):

- $X\rightarrow Z\rightarrow Y$
- $X\leftarrow Z\rightarrow Y$
- $X\leftarrow Z\leftarrow Y$

These are **not** the only possible scenarios.

The reason lies in the definition of I-maps. As it only requires that $I(G) \subseteq I(P)$, it is possible that observing $X\perp_P Y | Z$ does not mean $X\perp_G Y | Z$.

But we said that $P$ is generated by $G$, so the observations we made cannot contradict all the above structures as otherwise we would have $X\not\perp_P Y | Z$.

However intuitive, this reasoning misses a point: the **nature of independencies**.

We can differentiate two types of independencies:

- **Structural**: when both $X\perp_G Y | Z$ and $X\perp_P Y | Z$ hold.
- **Functional**: when $X\not\perp_G Y | Z$ but $X\perp_P Y | Z$ holds in the distribution $P$.

Of the two, **structural independencies** are straightforward. They mean that $I(G)$ contains the d-separation statement that induces conditional independence in $P$.

This is when $X\perp_G Y | Z$ holds; thus, one of the above three structures is present in the graph.
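To make "reading an independence off $G$" concrete, here is a minimal sketch of a d-separation check. It is not taken from any library: it uses the moralized ancestral graph criterion, which is equivalent to d-separation, and the function names `ancestors_of` and `d_separated` are my own choices for illustration.

```python
from itertools import combinations


def ancestors_of(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(edges, x, y, given=()):
    """Check whether x and y are d-separated by `given` in the DAG `edges`."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)

    # 1. Keep only x, y, the conditioning set, and their ancestors.
    keep = ancestors_of({x, y, *given}, parents)

    # 2. Moralize: link every kept node to its parents and "marry" co-parents.
    neighbours = {node: set() for node in keep}
    for child in keep:
        pars = parents.get(child, set())
        for p in pars:
            neighbours[child].add(p)
            neighbours[p].add(child)
        for p, q in combinations(pars, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)

    # 3. Remove the conditioning set and test whether x can still reach y.
    blocked = set(given)
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in blocked:
            continue
        seen.add(node)
        stack.extend(neighbours[node] - blocked)
    return True


chain = [("X", "Z"), ("Z", "Y")]     # X -> Z -> Y
fork = [("Z", "X"), ("Z", "Y")]      # X <- Z -> Y
collider = [("X", "Z"), ("Y", "Z")]  # X -> Z <- Y

for name, graph in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, d_separated(graph, "X", "Y", given={"Z"}))
```

The chain and the fork report a structural independence given $Z$, whereas the collider does not. A functional independence would never show up here, as the check only looks at the edges.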

On the other hand, **functional independencies** cannot be read off $G$. They are encoded in the structural equations (i.e., in $f_i$).

### Example

Consider the graph with the edges $X\rightarrow Z$, $Z\rightarrow Y$, and $X\rightarrow Y$. It is described by the following SEM:

- $X = U_1$
- $Z = aX + U_2$
- $Y = bX + cZ + U_3 = (b + ac)X + cU_2 + U_3$

Clearly, $X \not\perp_G Y$ and even $X \not\perp_G Y | Z$. But setting $b = -ac$ makes the coefficient of $X$ in $Y$ vanish, so $X$ and $Y$ become *independent*. If $b=0$, then $Y$ depends on $X$ only through $Z$, so $X$ and $Y$ are *conditionally independent* given $Z$.

Thus, observing the (conditional) independence of $X$ and $Y$ is a matter of luck, as there are only very few combinations of $a,b,c$ for which (conditional) independence holds.
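We can watch the cancellation happen in a small simulation. The sketch below assumes independent standard normal noises (the SEM above does not fix the noise distributions), and the helper names `correlation` and `sample_xy` are made up for this illustration. With jointly Gaussian variables, zero correlation is enough to indicate independence.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def sample_xy(a, b, c, n=20_000, seed=0):
    """Draw n samples of (X, Y) from the linear SEM; noise is assumed N(0, 1)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u1, u2, u3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        x = u1
        z = a * x + u2
        y = b * x + c * z + u3
        xs.append(x)
        ys.append(y)
    return xs, ys


generic = correlation(*sample_xy(a=1.0, b=1.0, c=1.0))
cancelled = correlation(*sample_xy(a=1.0, b=-1.0, c=1.0))  # b = -a*c
print(f"generic parameters: corr(X, Y) = {generic:.3f}")
print(f"b = -a*c:           corr(X, Y) = {cancelled:.3f}")
```

The graph is the same in both runs; only the parameters differ. Nudging $b$ slightly away from $-ac$ brings the dependence back, which is exactly why such independencies are not stable.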

## Definition

$P$ is a **stable distribution** of a causal model $M= <G, \theta_G>$ if and only if $G$ is a perfect I-map of $P$.

\((X \perp_G Y|Z) \Leftrightarrow (X \perp_P Y|Z)\)

That is, $P$ maps the structure of $G$, and even varying the parameters $\theta_G$ does not introduce new independencies in $P$.

This remark lets us formulate the definition in an equivalent way.

Namely, a causal model $M$ generates a **stable distribution** $P$ if and only if $P(<G, \theta_G>)$ contains no extraneous independencies.

\(\forall \theta'_G : I[P(<G, \theta_G>)]\subseteq I[P(<G, \theta'_G>)]\)

That is, **only structural independencies count**. Both definitions **exclude functional independencies**. The first definition does this by postulating the $\perp_G \rightarrow \perp_P$ mapping to be **bijective**. Since $G$ cannot reflect functional independencies, there cannot be any in $P$.

The second definition forces $\theta_G$ not to introduce any new independencies. If $\theta_G$ generated a new independency, then there would exist a $\theta_G^*$ such that $I[P(<G, \theta_G^*>)]\subsetneq I[P(<G, \theta_G>)],$ which contradicts the definition.

Let’s close this section with my whim, i.e., terminology. To be honest, I think that the name *faithfulness* is not very descriptive of what the definition stands for. *Stability* is a bit better (at least, compared to faithfulness), as it expresses that no matter how the parameters change, the independencies in $P$ remain the same. These are the two most commonly used names for the concept. Nonetheless, the other two are much better. *DAG-isomorphism* explicitly states that we are talking about graphs with a *fixed* structure, but it still does not make the connection between $P$ and $G$.

In my opinion, **perfect-mapness** is the best name. It makes very clear that $G$ and the distribution $P$ generated by $M$ have the same set of independencies.

# Latent Structures

Until now, we lived in an imaginary world with its profound simplicity. But it is time to grow up and face reality. What I am talking about is our imperfect senses. This is not just about illusions, but also about causal inference.

Namely, there will be mechanisms we do not know about (we can call these unknown unknowns). As we don’t know they exist, there is not much to do - before Steve Jobs invented the iPhone, people did not know that they *desperately* needed it. Although I enjoy philosophical topics, this post does not lead there; thus, we will discuss only *known unknowns*. That is, **latent-variable models**.

They are known unknowns as we know - or at least, suspect - that there is a mechanism in play, but we cannot observe it. For example, before discovering gravity, mankind only knew that apples fall from trees and planets move. People thought about possible explanations, but they did not connect the two.

In causal terms, gravity $G$ causes apple $A$ to fall down, i.e. $G \to A$. It also makes the Earth $(E)$ orbit the Sun $(S)$, so $G\to E$ and $G\to S$. Before gravity was discovered, we did not know that all these phenomena have an unobserved common cause.

An **unobserved common cause** of $X,Y$ is $Z$ when $X\leftarrow Z\rightarrow Y$ and $Z$ is not observed. $Z$ is also called a **confounder**.

## Notation

Before jumping into latent structures, let’s summarize the notation:

- $G$: a DAG
- $V$: a set of nodes (vertices, thus, the $V$)
- $O$: the *observable* subset of nodes in $V$ (i.e., $O\subseteq V$)
- $P_{[O]}$: observational distribution over $O$
- $\mathcal{P}_{[O]}$: set of observational distributions over $O$
- $\theta_G$: parameters of $G$ in the causal model (these describe the SEM)
- $M = <G, \theta_G>$: a causal model
- $L$: a latent structure
- $\mathcal{L}$: a class of latent structures
- $I(P)$: the set of all conditional independencies in $P$.

## Definition

A **latent structure** is a pair $L = <G, O>$ with DAG $G$ over nodes $V$ and where $O\subseteq V$ is a set of observed variables.

So $L$ is a DAG where we attach labels to the nodes that are observed. So far so good.
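As a data structure, this could look like the sketch below. The class name `LatentStructure` and its fields are hypothetical, chosen only to mirror the pair $<G, O>$; the example instance is the gravity story from above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentStructure:
    """A pair <G, O>: a DAG given by its edges plus the observed nodes."""

    edges: frozenset     # directed edges (parent, child) of the DAG G
    observed: frozenset  # the observable subset O of the nodes V

    @property
    def nodes(self):
        """All nodes V: every edge endpoint plus any isolated observed node."""
        return frozenset(n for edge in self.edges for n in edge) | self.observed

    @property
    def latent(self):
        """The unobserved nodes, i.e., V without O."""
        return self.nodes - self.observed


# Gravity (G) is the unobserved common cause of the apple (A),
# the Earth (E), and the Sun (S).
gravity = LatentStructure(
    edges=frozenset({("G", "A"), ("G", "E"), ("G", "S")}),
    observed=frozenset({"A", "E", "S"}),
)
print(sorted(gravity.latent))
```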

Why do we need latent structures?

Because they can represent our hypotheses about the world. Imagine ourselves in the 20th century. We are physicists looking for the secrets of the universe. Some of us might still think that Newtonian physics is the way to go, some of us support Einstein’s theory of relativity, whereas others believe in string theory. We all observe the same phenomena, but our mental models differ.

I.e., we have a class of latent structures that try to describe the unobservable. Such models can help formulate a more compact representation (e.g. by connecting falling apples and the rotation of the Earth via gravity).

## Latent Structure Preference

How can we compare latent structures? Which one is better?

This question leads us to the notion of **latent structure preference**.

Latent structure $L = <G, O>$ is preferred to another $L' = <G', O>$ (written $L \preceq L'$) if and only if $G'$ can represent **at least** the same family of observational distributions as $G$.

\(\mathcal{P}_{[O]}(<G, \theta_G>) \subseteq \mathcal{P}_{[O]}(<G', \theta'_{G'}>)\)

That is, for each $\theta_G,$ there is a $\theta'_{G'}$ such that (note the difference of $\mathcal{P}$ and $P$):

\(P_{[O]} (<G', \theta'_{G'}>) = P_{[O]} (<G, \theta_G>).\)

But there can be a $\theta'_{G'}$ so that $L$ is not able to express the same observational distribution.

The order of $L$ and $L'$ is crucial in the definition.

Namely, we impose the constraint on $L'$ that it should represent *all observational distributions* of $L$, but it can be *more expressive*.

Preference is Occam’s razor of latent structures.

That is, it prefers the simplest $L$. Note that **simplicity is meant in terms of expressive power** (i.e., how big the class of distributions is that can be represented), **not in the number of parameters**.

We can still prefer $L$ to $L'$ - even if $L$ has more parameters - if $L'$ is more expressive than $L$.

The edge case is when both latent structures represent the same $\mathcal{P}_{[O]}$, i.e., they are **equivalent**. In terms of preference, each is preferred to the other:

\(L' \equiv L \Longleftrightarrow L \preceq L' \wedge L \succeq L'\)

Similar to inequalities, we will use the symbol $\prec$ for strict preference, i.e., preference excluding equivalence.

## Minimality of Latent Structures

Latent structures cannot escape our desire to search for the best. Minimality is the concept that quantifies what we are looking for.

A latent structure $L$ is **minimal** in a class of latent structures $\mathcal{L}$ if and only if there is no member of $\mathcal{L}$ that is strictly preferred to $L$.

\(\not\exists L' \in \mathcal{L} : L' \prec L\)

The implication is that if we find an $L' \in \mathcal{L}$ with $L' \preceq L$, then $L \equiv L'$. This is because, by the definition, $L' \not\prec L$; as $L' \preceq L$ holds without being strict, $L \preceq L'$ must hold too. As we now have both $L \preceq L'$ and $L' \preceq L$, we get $L' \equiv L$.

Minimality expresses the “efficiency” of a latent structure. That is, a minimal $L$ is the most economical in terms of expressive power.

## Consistency of Latent Structures

We can have a minimal $L$ and still fail to make use of it. While striving for a compact structure (compact in the sense of having as small a space of observational distributions as possible), we should ask:

Is $L$ able to represent the specific observational distribution $\hat{P}$ we care about?

Otherwise, our effort is futile. Fortunately, we can answer by checking the **consistency** of $L$ and $\hat{P}$.

A latent structure $L = <G, O>$ is **consistent** with a distribution $\hat{P}$ over $O$ if there are parameters $\theta_G$ such that $G$ represents $\hat{P}$.

\(\exists \theta_G : P_{[O]}(<G, \theta_G>)=\hat{P}\)

Consistency acts as the checks and balances against minimality.

Practically, it serves as a lower bound on the expressive power of $L$.

The following scenarios are possible:

- $L$ is **consistent** with $\hat{P}$ and **minimal** w.r.t. $\mathcal{L}$: this is the best case, as $L$ is expressive enough to represent $\hat{P}$ over $O$ and it does not waste expressive power.
- $L$ is **consistent** with $\hat{P}$ but **not minimal** w.r.t. $\mathcal{L}$: $L$ is still able to represent $\hat{P}$ over $O$, but it wastes expressive power by not being minimal.
- $L$ is **not consistent** with $\hat{P}$ but **minimal** w.r.t. $\mathcal{L}$: this configuration makes $L$ practically useless, as if it cannot represent $\hat{P}$ over $O$, then minimality does not matter.
- $L$ is **not consistent** with $\hat{P}$ and **not minimal** w.r.t. $\mathcal{L}$: this is like buying a sports car for rural road use. It cannot be used properly and it is very expensive.
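The four scenarios can be played through on a toy class. Real families $\mathcal{P}_{[O]}$ are (typically infinite) sets of distributions; the sketch below assumes we can stand in for them with small finite sets of made-up labels (`"p1"`, `"p2"`, ...), so that preference becomes a plain subset test. All names are hypothetical.

```python
def preferred(family_l, family_l_prime):
    """L is preferred to L': L' can represent everything L can."""
    return family_l <= family_l_prime


def strictly_preferred(family_l, family_l_prime):
    """Preference excluding equivalence: a proper subset."""
    return family_l < family_l_prime


def is_minimal(name, families):
    """No other member of the class is strictly preferred to `name`."""
    return not any(
        strictly_preferred(other, families[name])
        for other_name, other in families.items()
        if other_name != name
    )


def is_consistent(family, p_hat):
    """The structure can represent the distribution we care about."""
    return p_hat in family


# Toy class of latent structures, each given by the (labelled)
# observational distributions it can represent.
families = {
    "L1": frozenset({"p1", "p2"}),
    "L2": frozenset({"p1", "p2", "p3"}),
    "L3": frozenset({"p3"}),
    "L4": frozenset({"p3", "p4"}),
}
p_hat = "p1"

for name, family in families.items():
    print(
        name,
        "consistent:", is_consistent(family, p_hat),
        "minimal:", is_minimal(name, families),
    )
```

Here `L1` is the best case, `L2` wastes expressive power, `L3` is minimal but useless for `p_hat`, and `L4` is the sports car on the rural road.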

## Projection of Latent Structures

When latent variables enter the game, they raise the following question:

If we don’t know how many latent variables there are that influence the dependencies of the observed variables, how can we uncover the causal graph **over the observables**?

This is a problem of **unknown unknowns**, namely, we don’t know what we don’t know about the latents, as - by definition - we cannot observe them.

The practical consequence *seems to be* the need for checking all possible latent structures. But there are infinitely many of them…

Fortunately, there is an alternative. It is enough to check the **projection** of latent structures. Projections have *the same dependencies over* $O$ as the original latent structure. They are characterized by the following definition:

A latent structure $L_{[O]} = <G_{[O]}, O>$ is a **projection** of another latent structure $L$ if and only if:

- Every unobservable variable of $G_{[O]}$ is a parentless common cause of exactly two nonadjacent observable variables; and
- For every stable distribution $P$ generated by $L$, there exists a stable distribution $P'$ generated by $L_{[O]}$ such that $I(P_{[O]}) = I(P'_{[O]})$

The most important point is that projections reason about the observed variables only.

We can think about a projection as a manual for condensing latent structures into a graph that has the same independencies over the observed variables.

Let’s dive into the two conditions.

### The first condition

The **first condition** describes the set of unobserved variables in $L$ that should remain in $L_{[O]}$.

For a triple $X, Y \in O, Z \not\in O$, it prescribes $X\leftarrow Z \rightarrow Y$.
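The first condition is purely graphical, so we can check it mechanically. A minimal sketch, assuming the graph is given as a list of directed `(parent, child)` edges; the function name `satisfies_first_condition` is made up for this post.

```python
def satisfies_first_condition(edges, observed):
    """Check that every unobserved node is a parentless common cause
    of exactly two nonadjacent observed nodes."""
    edges = set(edges)
    observed = set(observed)
    nodes = {n for edge in edges for n in edge} | observed

    for latent in nodes - observed:
        parents = {p for p, c in edges if c == latent}
        children = {c for p, c in edges if p == latent}
        # Parentless, exactly two children, both of them observed.
        if parents or len(children) != 2 or not children <= observed:
            return False
        # The two children must not be adjacent.
        a, b = children
        if (a, b) in edges or (b, a) in edges:
            return False
    return True


observed = {"X", "Y"}
print(satisfies_first_condition([("Z", "X"), ("Z", "Y")], observed))  # fork
print(satisfies_first_condition([("X", "Z"), ("Z", "Y")], observed))  # chain
```

Only the fork with a latent $Z$ passes. The chain fails because $Z$ has a parent, and a fork whose children are also connected directly fails the nonadjacency requirement.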

As we now have The Book of Why, we can shamelessly ask the most annoying question of small children:

But why is this the case?

We will use an indirect approach by showing that the other two constructs - namely, $X\rightarrow Z \rightarrow Y$ and $X\rightarrow Z \leftarrow Y$ - are not possible in this setting.

The good news is that this time v-structures do not complicate the problem. Contrariwise, as the middle node $Z$ is unobserved, we can eliminate it easily. Since the v-structure is not activated, $X$ and $Y$ remain independent; thus, we can omit v-structures with $Z \not\in O$ from the projection.

What about the chain $X\rightarrow Z \rightarrow Y$? The above reasoning does not work as not observing $Z$ would leave the path between $X$ and $Y$ open, meaning that they are dependent.

The main difference between the chain and the fork is that when $X$ changes, $Y$ changes only if we have the chain structure. But that would mean that considering only $O$, we would think that there is an edge between $X$ and $Y$.

The following example of an $X\rightarrow Z \rightarrow Y$ chain with $Z$ as latent variable shows the reason why:

\(\begin{aligned} X &= f_X(U_X) \\ Z &= f_Z(X, U_Z) \\ Y &= f_Y(Z, U_Y) = f_Y(f_Z(X, U_Z) , U_Y) \\ &= f'_Y(X, U_Z, U_Y) = f'_Y(X, U'_Y) \\ \end{aligned}\)

Here, $U_i \perp U_j : \forall i \neq j$. By manipulating the expression of $Y$, we can bring it to a form that does not contain $Z$: we substitute the functional relationship of $Z$ and define a new noise variable $U'_Y$ that includes both $U_Y$ and $U_Z$ (think about it as a set of noises).

This is what we get when $Z$ remains unobserved. Clearly, $Y \not\perp X$, so there exists an $X\rightarrow Y$ edge - violating the assumption of $X$ and $Y$ being non-adjacent.

The above reasoning **does not mean that there cannot be unobserved chains** in the form of $X\rightarrow Z \rightarrow Y$. It only means that having such a latent structure **does not modify the set of independencies over $O$**.
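A quick simulation illustrates both eliminated cases. The sketch assumes linear mechanisms with unit coefficients and independent standard normal noises, none of which is prescribed by the definition; `simulate` and `correlation` are names I made up. In both structures $Z$ is computed but never recorded, as it is latent.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(structure, n=20_000, seed=0):
    """Sample the observed pair (X, Y); the latent Z is never returned."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        if structure == "chain":       # X -> Z -> Y
            x = rng.gauss(0, 1)
            z = x + rng.gauss(0, 1)
            y = z + rng.gauss(0, 1)
        elif structure == "collider":  # X -> Z <- Y
            x = rng.gauss(0, 1)
            y = rng.gauss(0, 1)
            z = x + y + rng.gauss(0, 1)  # exists, but stays hidden
        else:
            raise ValueError(f"unknown structure: {structure}")
        xs.append(x)
        ys.append(y)
    return xs, ys


chain_corr = correlation(*simulate("chain"))
collider_corr = correlation(*simulate("collider"))
print(f"latent chain:    corr(X, Y) = {chain_corr:.3f}")
print(f"latent collider: corr(X, Y) = {collider_corr:.3f}")
```

Looking only at $O$, the latent chain behaves like a direct $X\rightarrow Y$ edge, while the latent collider leaves no trace at all.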

### The second condition

The second condition is easier to interpret: it states that both the original and the projected structures have the same independencies over $O$ (they are perfect I-maps of each other).

What stability here means is that both $P$ and $P'$ are also perfect I-maps of their corresponding DAGs.

That is, $P$ contains all independencies present in $G$, $P'$ the ones in $G_{[O]}$ and both distributions contain the same independencies over $O$.

### Applicability

Projections seem to be handy as they allow us to sidestep the burden of searching over infinitely many configurations. Nonetheless, the definition does not state that we can apply projection in every case.

Fortunately, a theorem of causal inference helps us out by showing:

Any latent structure has at least one projection.

# Summary

Latent structures are a powerful paradigm for modeling real-world phenomena. With the notions of preference, minimality, consistency, and projection, we are able to distinguish between latent structures and we can also reason about why they provide a proper description of the independencies.