Pearls of Causality #7: Latent Structures and Stability
DAGs like to play hide-and-seek. But we are more clever.
Stable Distributions (Faithfulness)
Before you get too enthusiastic about playing hide-and-seek with DAGs, we need to discuss something. Unfortunately, distributions can be mischievous. They want to make us believe that we have plenty of independencies, although the underlying graph says otherwise.
Stability or faithfulness (less frequently used alternatives are DAG-isomorphism and perfect-mapness) answers the question of how “reliable” an independence statement is.
Note to self: add these names to the Causal Dictionary.
To understand what “reliability” means, we first need to consider the two types of independence statements.
The nature of independence
Let’s imagine that we observed $X\perp_P Y | Z$ in the distribution $P$ generated by $G$. We could think that this means any of the following (assume that there are only three nodes in $G$):
- $X\rightarrow Z\rightarrow Y$
- $X\leftarrow Z\rightarrow Y$
- $X\leftarrow Z\leftarrow Y$
These are not the only possible scenarios.
The reason lies in the definition of I-maps. Since it only requires $I(G) \subseteq I(P)$, observing $X\perp_P Y | Z$ does not imply $X\perp_G Y | Z$.
But $P$ is generated by $G$, so - one could argue - our observation cannot contradict all three structures above; otherwise, we would have observed $X\not\perp_P Y | Z$.
However intuitive, this reasoning misses a point: the nature of independencies.
We can differentiate two types of independencies:
- Structural: when both $X\perp_G Y | Z$ and $X\perp_P Y | Z$ hold.
- Functional: when $X\not\perp_G Y | Z$ but $X\perp_P Y | Z$ holds in the distribution $P$.
Of the two, structural independencies are the straightforward ones: $I(G)$ contains the d-separation statement that induces the conditional independence in $P$.
This is when $X\perp_G Y | Z$ holds; thus, one of the above three structures is present in the graph.
On the other hand, functional independencies cannot be read off $G$. They are encoded in the structural equations (i.e., in $f_i$).
Example
Consider the graph with edges $X \to Z$, $X \to Y$, and $Z \to Y$. It is described by the following SEM:
- $ X = U_1$
- $ Z = aX + U_2$
- $ Y = bX + cZ + U_3 = (b + ac)X + cU_2 + U_3$
Clearly, $X \not\perp_G Y$ and even $X \not\perp_G Y |Z.$ But setting $b = -ac$ makes $X$ and $Y$ independent. If $b=0$, then $X$ and $Y$ are conditionally independent given $Z$.
Thus, observing the (conditional) independence of $X$ and $Y$ is a matter of luck, as there are only very few combinations of $a,b,c$ for which (conditional) independence holds.
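To make this tangible, here is a small simulation of the SEM above - my own sketch, with arbitrarily chosen coefficients and standard normal noises:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
u1, u2, u3 = rng.normal(size=(3, n))

a, c = 1.5, 2.0
b = -a * c                      # the "unlucky" choice that creates a functional independence

x = u1
z = a * x + u2
y = b * x + c * z + u3          # = (b + a*c)*x + c*u2 + u3, which reduces to c*u2 + u3 here

print(np.corrcoef(x, y)[0, 1])           # ~0: X and Y look independent despite the edge X -> Y

y_generic = 0.7 * x + c * z + u3         # a generic b = 0.7 instead of b = -a*c
print(np.corrcoef(x, y_generic)[0, 1])   # clearly nonzero: the independence above was "luck"
```

With $b = -ac$, the sample correlation is numerically zero although the edge $X \to Y$ is present in the graph; with a generic $b$, the dependence is obvious.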
Definition
$P$ is a stable distribution of a causal model $M = <G, \theta_G>$ if and only if $G$ is a perfect I-map of $P$. \((X \perp_G Y|Z) \Leftrightarrow (X \perp_P Y|Z)\)
That is, $P$ maps the structure of $G$, and even varying the parameters $\theta_G$ does not introduce new independencies in $P$.
This remark lets us formulate the definition in an equivalent way.
Namely, a causal model $M$ generates a stable distribution $P$ if and only if $P(<G, \theta_G>)$ contains no extraneous independencies. \(\forall \theta'_G : I[P(<G, \theta_G>)]\subseteq I[P(<G, \theta'_G>)]\)
That is, only structural independencies count. Both definitions exclude functional independencies. The first definition does this by postulating that the mapping $\perp_G \rightarrow \perp_P$ is a bijection. Since $G$ cannot reflect functional independencies, there cannot be any in $P$.
The second definition forces $\theta_G$ not to introduce any new independencies. If $\theta_G$ introduced a new independency, then there would exist a $\theta^*_G$ such that $I[P(<G, \theta^*_G>)] \subsetneq I[P(<G, \theta_G>)]$, which contradicts the definition.
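To connect this to the example above (assuming unit-variance, mutually independent noises $U_1, U_2, U_3$), we can see why the functional independence of $X$ and $Y$ is not stable: \(\mathrm{Cov}(X, Y) = \mathrm{Cov}\big(U_1,\ (b + ac)U_1 + cU_2 + U_3\big) = b + ac,\) which vanishes only on the measure-zero set $b = -ac$. For almost every other choice of $\theta'_G$, the independency $X \perp_P Y$ disappears, so it cannot satisfy the $\forall \theta'_G$ requirement above - it is a functional, not a structural, independency.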
Let’s close this section with a pet peeve of mine: terminology. To be honest, I think that the name faithfulness is not very descriptive of what the definition stands for. Stability is a bit better (at least compared to faithfulness), as it expresses that no matter how the parameters change, the independencies in $P$ remain the same. These are the two most commonly used names for the concept. Nonetheless, the other two are much better. DAG-isomorphism explicitly states that we are talking about graphs with a fixed structure - but it still does not make the connection between $P$ and $G$.
In my opinion, perfect-mapness is the best name. It makes very clear that $G$ and the distribution $P$ generated by $M$ have the same set of independencies.
Latent Structures
Until now, we have lived in an imaginary world of profound simplicity. But it is time to grow up and face reality. What I am talking about is our imperfect senses - and this is not just about illusions, it also affects causal inference.
Namely, there will be mechanisms we do not know about (we can call these unknown unknowns). As we don’t know they exist, there is not much to do - before Steve Jobs invented the iPhone, people did not know that they desperately needed it. Although I enjoy philosophical topics, this post does not lead there; thus, we will discuss only known unknowns. That is, latent-variable models.
They are known unknowns as we know - or at least, suspect - that there is a mechanism in play, but we cannot observe it. For example, before discovering gravity, mankind only knew that apples fall from trees and planets rotate. People thought about possible explanations, but they did not connect the two.
In causal terms, gravity $G$ causes apple $A$ to fall down, i.e. $G \to A$. It also makes the Earth $(E)$ rotate around the Sun $(S)$, so $G\to E$ and $G\to S$. Before gravity, we did not know that all these phenomena have an unobserved common cause.
An unobserved common cause of $X,Y$ is $Z$ when $X\leftarrow Z\rightarrow Y$ and $Z$ is not observed. $Z$ is also called a confounder.
Notation
Before jumping into latent structures, let’s summarize the notation:
- $G$: a DAG
- $V$: a set of nodes (vertices, thus, the $V$)
- $O$: the observable subset of nodes in $V$ (i.e., $O\subseteq V$)
- $P_{[O]}$: observational distribution over $O$
- $\mathcal{P}_{[O]}$: set of observational distributions over $O$
- $\theta_G$: parameters of $G$ in the causal model (these describe the SEM)
- $M = <G, \theta_G>$: a causal model
- $L$: a latent structure
- $\mathcal{L}$: a class of latent structures
- $I(P)$: the set of all conditional independencies in $P$.
Definition
A latent structure is a pair $L = <G, O>$, where $G$ is a DAG over nodes $V$ and $O\subseteq V$ is the set of observed variables.
So $L$ is a DAG where we attach labels to the nodes that are observed. So far so good.
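If you prefer to think in code, a latent structure is little more than a DAG plus a label set. Here is a minimal sketch - the class name and the dictionary-of-parents encoding are my own choices - using the gravity example from above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatentStructure:
    """L = <G, O>: a DAG G (encoded as node -> set of parents) and the observed nodes O."""
    parents: dict        # G: maps each node to the set of its parents
    observed: frozenset  # O: the subset of nodes we can measure

    @property
    def latent(self):
        return frozenset(self.parents) - self.observed

# Gravity (G) is an unobserved common cause of the apple (A), the Earth (E), and the Sun (S).
gravity = LatentStructure(
    parents={"G": set(), "A": {"G"}, "E": {"G"}, "S": {"G"}},
    observed=frozenset({"A", "E", "S"}),
)
print(gravity.latent)  # frozenset({'G'})
```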
Why do we need latent structures?
Because they can represent our hypotheses about the world. Imagine ourselves in the 20th century. We are physicists looking for the secrets of the universe. Some of us might still think that Newtonian physics is the way to go, some of us support Einstein’s theory of relativity, whereas others believe in string theory. Each of us observes the same phenomena, but our mental models are different.
I.e., we have a class of latent structures that try to describe the unobservable. Such models can help formulate a more compact representation (e.g. by connecting falling apples and the rotation of the Earth via gravity).
Latent Structure Preference
How can we compare latent structures? Which one is better?
This question leads us to the notion of latent structure preference.
Latent structure $L = <G, O>$ is preferred to another $L’ = <G’, O>$ (written $L \preceq L’$) if and only if $G’$ can represent at least the same family of observational distributions as $G$. \(\mathcal{P}_{[O]}(<G, \theta_G>) \subseteq \mathcal{P}_{[O]}(<G', \theta'_{G'}>)\)
That is, for each $\theta_G,$ there is a $\theta’_{G’}$ such that (note the difference of $\mathcal{P}$ and $P$): \(P_{[O]} (<G', \theta'_{G'}>) = P_{[O]} (<G, \theta_G>).\)
Conversely, there can be a $\theta’_{G’}$ whose observational distribution $L$ is not able to express.
The order of $L$ and $L’$ is crucial in the definition.
Namely, we impose the constraint on $L’$ that it should represent all observational distributions of $L$, but it can be more expressive.
Preference is Occam’s razor of latent structures.
That is, it prefers the simplest $L$. Note that simplicity is meant in terms of expressive power (i.e., how big the class of representable distributions is), not in the number of parameters.
We can still prefer $L$ to $L’$ - even if $L$ has more parameters - if $L’$ is more expressive than $L$.
The edge case is when both latent structures represent the same $\mathcal{P}_{[O]}$, i.e., they are equivalent. In terms of preference: each is preferred to the other, i.e.: \(L' \equiv L \Longleftrightarrow L \preceq L' \wedge L \succeq L'\)
Similar to inequalities, we will use the symbol $\prec$ for strict preference - i.e., preference excluding equivalence.
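As a toy illustration (my own example, over $O = \{X, Y\}$ with no latents, assuming the parameters can encode arbitrary conditional distributions): let $G_\emptyset$ be the empty graph and $G_{X\to Y}$ the graph with the single edge $X \to Y$. Then \(\mathcal{P}_{[O]}(<G_\emptyset, \theta>) = \{P : X \perp_P Y\} \subseteq \{\text{all } P \text{ over } O\} = \mathcal{P}_{[O]}(<G_{X\to Y}, \theta'>),\) so $<G_\emptyset, O> \preceq <G_{X\to Y}, O>$. The preference is even strict ($\prec$), because $G_{X\to Y}$ can also represent distributions in which $X$ and $Y$ are dependent.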
Minimality of Latent Structures
Latent structures cannot escape our desire to search for the best. Minimality is the concept that quantifies what we are looking for.
A latent structure $L$ is minimal in a class of latent structures $\mathcal{L}$ if and only if there is no member of $\mathcal{L}$ that is strictly preferred to $L$. \(\not\exists L' \in \mathcal{L} : L' \prec L\)
The implication is that if we find an $L’$ with $L’ \preceq L$, then $L \equiv L’$. This is because, by the definition, $L’ \not\prec L$; thus $L \preceq L’$ also holds. As we now have both $L \preceq L’$ and $L’ \preceq L$, we get $L’ \equiv L$.
Minimality expresses the “efficiency” of a latent structure. That is, a minimal $L$ is the most economical in terms of expressive power.
Consistency of Latent Structures
We can have a minimal $L$ and still fail to make use of it: while striving for a compact structure - compact in the sense of representing a possibly small space of observational distributions - we should ask:
Is $L$ able to represent the specific observational distribution $\hat{P}$ we care about?
Otherwise, our effort is futile. Fortunately, we can answer by checking the consistency of $L$ and $\hat{P}$.
A latent structure $L = <G, O>$ is consistent with a distribution $\hat{P}$ over $O$ if there are parameters $\theta_G$ such that $G$ represents $\hat{P}$. \(\exists \theta_G : P_{[O]}(<G, \theta_G>)=\hat{P}\)
Consistency acts as a counterweight to minimality.
Practically, it serves as a lower bound on the expressive power of $L$.
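For linear-Gaussian SEMs, the existential quantifier in the definition can be checked numerically: search over $\theta_G$ and see whether the implied observational covariance matches the target. Here is a rough sketch - the latent-fork structure $X \leftarrow C \rightarrow Y$, the optimizer settings, and the tolerance are my own choices, not part of the definition:

```python
import numpy as np
from scipy.optimize import minimize

# Latent structure: a latent confounder C with C -> X and C -> Y, observed O = {X, Y}.
# theta_G = (a, b, log-variances of U_C, U_X, U_Y), with X = a*C + U_X, Y = b*C + U_Y.
def implied_observed_cov(theta):
    a, b = theta[:2]
    v_c, v_x, v_y = np.exp(theta[2:])
    return np.array([[a**2 * v_c + v_x, a * b * v_c],
                     [a * b * v_c,      b**2 * v_c + v_y]])

def is_consistent(target_cov, tol=1e-5, restarts=20):
    """Numerically check whether some theta_G gives P_[O](<G, theta_G>) = target_cov."""
    loss = lambda th: np.sum((implied_observed_cov(th) - target_cov) ** 2)
    best = min(
        minimize(loss, np.random.randn(5), method="Nelder-Mead",
                 options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 5000}).fun
        for _ in range(restarts)
    )
    return best < tol

# A positively correlated pair of observables: the latent fork can reproduce it.
print(is_consistent(np.array([[1.0, 0.5], [0.5, 1.0]])))  # expected: True
```

Of course, this optimization-based check is only a heuristic: failing to reach the tolerance does not prove inconsistency, it may just mean the search got stuck.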
The following scenarios are possible:
- $L$ is consistent with $\hat{P}$ and minimal w.r.t. $\mathcal{L}$: this is the best case, as $L$ is expressive enough to represent $\hat{P}$ over $O$ and it does not waste expressive power.
- $L$ is consistent with $\hat{P}$ but not minimal w.r.t. $\mathcal{L}$: $L$ is still able to represent $\hat{P}$ over $O$, but it wastes expressive power by not being minimal.
- $L$ is not consistent with $\hat{P}$ but minimal w.r.t. $\mathcal{L}$: this configuration makes $L$ practically useless as if it cannot represent $\hat{P}$ over $O$, then minimality does not matter.
- $L$ is not consistent with $\hat{P}$ and not minimal w.r.t. $\mathcal{L}$: this is like buying a sports car for rural road use. It cannot be used properly and it is very expensive.
Projection of Latent Structures
When latent variables enter the game, they raise the following question:
If we don’t know how many latent variables there are that influence the dependencies of the observed variables, how can we uncover the causal graph over the observables?
This is a problem of unknown unknowns, namely, we don’t know what we don’t know about the latents, as - by definition - we cannot observe them.
The practical consequence seems to be the need for checking all possible latent structures. But there are infinitely many of them…
Fortunately, there is an alternative. It is enough to check the projection of latent structures. Projections have the same dependencies over $O$ as the original latent structure. They are characterized by the following definition:
A latent structure $L_{[O]} = <G_{[O]}, O>$ is a projection of another latent structure $L$ if and only if:
- Every unobservable variable of $G_{[O]}$ is a parentless common cause of exactly two nonadjacent observable variables; and
- For every stable distribution $P$ generated by $L$, there exists a stable distribution $P’$ generated by $L_{[O]}$ such that $I(P_{[O]}) = I(P’_{[O]})$
The most important point is that projections reason about the observed variables only.
We can think about a projection as a manual for condensing a latent structure into a graph that encodes the same independencies over the observed variables.
Let’s dive into the two conditions.
The first condition
The first condition describes which unobserved variables of $L$ should remain in $L_{[O]}$.
For a triple $X, Y \in O, Z \not\in O$, it prescribes $X\leftarrow Z \rightarrow Y$.
As we now have The Book of Why, we can shamelessly ask the most annoying question of small children:
But why is this the case?
We will use an indirect approach by showing that the other two constructs - namely, $X\rightarrow Z \rightarrow Y$ and $X\rightarrow Z \leftarrow Y$ - are not possible in this setting.
The good news is that this time v-structures do not complicate the problem. Contrariwise, as the middle node $Z$ is unobserved, we can eliminate it easily. Since the v-structure is not activated, $X$ and $Y$ remain independent; thus, we can omit v-structures with $Z \not\in O$ from the projection.
What about the chain $X\rightarrow Z \rightarrow Y$? The above reasoning does not work as not observing $Z$ would leave the path between $X$ and $Y$ open, meaning that they are dependent.
The main difference between the chain and the fork is that when $X$ changes, $Y$ changes only if we have the chain structure. But that would mean that considering only $O$, we would think that there is an edge between $X$ and $Y$.
The following example of an $X\rightarrow Z \rightarrow Y$ chain with $Z$ as latent variable shows the reason why: \(\begin{aligned} X &= f_X(U_X) \\ Z &= f_Z(X, U_Z) \\ Y &= f_Y(Z, U_Y) = f_Y(f_Z(X, U_Z), U_Y) = f'_Y(X, U_Z, U_Y) = f'_Y(X, U'_Y) \end{aligned}\)
Here, $U_i \perp U_j : \forall i \neq j$. By manipulating the expression of $Y$, we can bring it to a form that does not contain $Z$: we substitute the functional relationship of $Z$ and define a new noise variable that includes both $U_Y$ and $U_Z$ (think of it as a set of noises).
This is what we get when $Z$ remains unobserved. Clearly, $Y \not\perp X$, so there exists an $X\rightarrow Y$ edge - violating the assumption of $X$ and $Y$ being non-adjacent.
The above reasoning does not mean that there cannot be unobserved chains in the form of $X\rightarrow Z \rightarrow Y$. It only means that having such a latent structure does not modify the set of independencies over $O$.
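Here is a quick numerical sanity check of this reasoning - a sketch with linear-Gaussian mechanisms and coefficients I picked arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u_x, u_z, u_y = rng.normal(size=(3, n))

# Chain X -> Z -> Y with Z latent: marginalizing Z leaves X and Y dependent,
# so the projection needs a direct X -> Y edge.
x = u_x
z = 1.5 * x + u_z
y = 0.8 * z + u_y
print("chain:    corr(X, Y) =", round(np.corrcoef(x, y)[0, 1], 3))   # clearly nonzero

# Collider X -> Z <- Y with Z latent: X and Y stay independent,
# so the latent collider can simply be dropped.
x = u_x
y = u_y
z = x + y + u_z
print("collider: corr(X, Y) =", round(np.corrcoef(x, y)[0, 1], 3))   # approximately 0

# Fork X <- Z -> Y with Z latent: X and Y are dependent, but no observed node
# separates them, so Z must be kept as a (parentless) latent common cause.
z = u_z
x = 1.2 * z + u_x
y = -0.7 * z + u_y
print("fork:     corr(X, Y) =", round(np.corrcoef(x, y)[0, 1], 3))   # clearly nonzero
```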
The second condition
The second condition is easier to interpret: it states that both the original and the projected structures have the same independencies over $O$ (they are perfect I-maps of each other).
What stability here means is that both $P$ and $P’$ are also perfect I-maps of their corresponding DAGs.
That is, $P$ contains all independencies present in $G$, $P’$ the ones in $G_{[O]}$ and both distributions contain the same independencies over $O$.
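For intuition, here is a rough sketch of how such a projection can be constructed graphically - my own code, not from the post: a directed edge $X \to Y$ is added whenever $X$ reaches $Y$ through latent variables only, and a parentless latent common cause is added for every nonadjacent observable pair that shares a latent ancestor reachable through latents only. The sketch only builds the structure; it does not verify the second (distributional) condition.

```python
from itertools import combinations

def project(parents, observed):
    """Sketch of projecting a DAG (node -> set of parents) onto the observed nodes."""
    nodes = set(parents)
    latent = nodes - set(observed)
    children = {v: set() for v in nodes}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)

    def reach(start):
        # Observed nodes reachable from `start` by directed paths through latents only.
        found, stack, seen = set(), [start], {start}
        while stack:
            for c in children[stack.pop()]:
                if c in observed:
                    found.add(c)
                elif c not in seen:
                    seen.add(c)
                    stack.append(c)
        return found

    directed = {(x, y) for x in observed for y in reach(x)}
    # A latent common cause for each nonadjacent observable pair sharing a latent ancestor.
    common_cause = {
        frozenset(pair)
        for z in latent
        for pair in combinations(sorted(reach(z)), 2)
        if pair not in directed and pair[::-1] not in directed
    }
    return directed, common_cause

# The gravity example: G is latent; apple, Earth, and Sun are observed.
print(project({"G": set(), "A": {"G"}, "E": {"G"}, "S": {"G"}}, {"A", "E", "S"}))
# -> no directed edges; pairwise latent common causes among A, E, S
```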
Applicability
Projections seem to be handy as they allow us to sidestep the burden of searching over infinitely many configurations. Nonetheless, the definition does not state that we can apply projection in every case.
Fortunately, a theorem of causal inference helps us out by showing:
Any latent structure has at least one projection.
Summary
Latent structures are a powerful paradigm for modeling real-world phenomena. With the notions of preference, minimality, consistency, and projection, we are able to distinguish between latent structures and we can also reason about why they provide a proper description of the independencies.