<p>Casual Causality, the blog of Patrik Reizinger, PhD student in representation learning and causality @IMPRS-IS and @ELLIS.</p>
<h1 id="higgins-et-al-towards-a-definition-of-disentangled-representations">Higgins et al. - Towards a Definition of Disentangled Representations (2022-06-30)</h1>
<p>Disentanglement is a concept rooted in geometric deep learning.</p>
<h1 id="disentanglement">Disentanglement</h1>
<p>We made the case for using geometric priors in the <a href="/posts/2022/06/dgl2-ammi3-geometric-priors-i/">AMMI 03</a> post and argued for their merit for generalization. To see the relationship to modern machine learning methods, we will now focus on disentanglement in representation learning.</p>
<blockquote>
<p>Disentangled representations mean, in an intuitive sense, that the latent factors a neural network learns are <strong>semantically meaningful</strong>.</p>
</blockquote>
<p>For <strong>example</strong>, this implies that for a 3D scene with objects, the latent representation should <em>separately</em> encode size, color, shape, and position. Nonetheless, this is a vague concept: indeed, current methods rely on a wide range of inductive biases and have conjured up a diverse range of metrics. Having <strong>uncorrelated</strong> factors makes sense, but <em>is that the whole picture?</em></p>
<p>For me, disentanglement <em>was</em> this vague concept that a lot of people are interested in but cannot pin down explicitly. After spending some time studying the essentials of geometric deep learning, I found the notions of invariance, equivariance, and symmetries useful for thinking about disentanglement. Of course, I was not the first: this post relies on <a href="https://arxiv.org/abs/1812.02230">(Higgins et al., 2018)</a> to provide a geometric deep learning perspective on disentanglement.</p>
<p>But first, we should be more specific than just saying that we want <em>semantically meaningful</em> latents.</p>
<p>Visually, this is what we expect: for greyscale points on the 2D plane, we want to have $x$-, $y$-position, and color as our latents
<img src="/images/posts/higgins_latent_traversal.png" alt="higgins_latent_traversal.png" /></p>
<h1 id="what-properties-should-a-disentangled-representation-have">What properties should a disentangled representation have?</h1>
<p>Our first take is guided by the DCI score, for it quantifies semantically meaningful representations based on how disentangled (modular), complete (compact), and informative (explicit) they are.</p>
<blockquote>
<p>Note that disentanglement as a component of the DCI score has an unfortunate name: for a representation to be disentangled, we require all three components. For this reason, <strong>I will use modularity, compactness, and explicitness</strong>.</p>
</blockquote>
<h2 id="modular-disentangled">Modular (Disentangled)</h2>
<blockquote>
<p>Modularity/Disentanglement
Modularity measures whether a single latent dimension encodes no more than a single data generative factor.</p>
</blockquote>
<h3 id="example">Example</h3>
<p>When changing a latent factor $z_i$ changes only one attribute, e.g., the size of the object, then it is modular.</p>
<h3 id="counterexample">Counterexample</h3>
<p>If changing $z_i$ changes both color and size, then it’s not modular in this sense.</p>
<p>What happens when $z_1, z_2, z_3$ encode the 3D position of the object, but not in the canonical base? Is that still modular? We will return to this point later.</p>
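To make modularity concrete, here is a minimal numerical sketch. It uses synthetic factors and a simple correlation-based importance matrix (real metrics such as DCI use regression-based importances, so take this as an illustration only; all names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
factors = rng.normal(size=(1000, 2))  # true factors, e.g., size and color

# A modular code: each latent depends on exactly one factor.
latents_modular = factors * np.array([2.0, -1.0])
# A non-modular code: the first latent mixes both factors.
latents_mixed = factors @ np.array([[1.0, 0.0], [1.0, 1.0]])

def importance(latents, factors):
    """Absolute correlation between each latent and each factor."""
    z = (latents - latents.mean(0)) / latents.std(0)
    f = (factors - factors.mean(0)) / factors.std(0)
    return np.abs(z.T @ f) / len(f)

def is_modular(latents, factors, tol=0.1):
    """Each latent dimension is important for at most one factor."""
    imp = importance(latents, factors)
    return bool(np.all((imp > tol).sum(axis=1) <= 1))

print(is_modular(latents_modular, factors))  # True
print(is_modular(latents_mixed, factors))    # False
```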
<h2 id="compact-complete">Compact (Complete)</h2>
<blockquote>
<p>Compactness/Completeness
Compactness measures whether each data generative factor is encoded by a single latent dimension.</p>
</blockquote>
<h3 id="example-1">Example</h3>
<p>Completeness requires that an attribute should only be changed if a specific $z_i$ changes. For all $z_{j\neq i}$, the attribute (e.g. color) should remain constant.</p>
<h3 id="counterexample-1">Counterexample</h3>
<p>Completeness reasons about the opposite direction from modularity. Namely, modularity is still fulfilled if both $z_i$ and $z_j$ encode color, but such a representation is not compact.</p>
<h2 id="explicit-informative">Explicit (Informative)</h2>
<blockquote>
<p>Explicitness/Informativeness
Explicitness measures whether the values of all of the data generative factors can be decoded from the representation using a linear transformation.</p>
</blockquote>
<p>Fortunately, latents are not rude, so no four-letter words are meant by this kind of explicitness. As <a href="https://arxiv.org/abs/1812.02230">(Higgins et al., 2018)</a> argue, this is the strongest requirement, as it addresses two points:</p>
<ol>
<li>the disentangled representation should <strong>capture all latent factors</strong>, and</li>
<li>this information should be <strong>linearly decodable</strong></li>
</ol>
<h3 id="example-2">Example</h3>
<p>In a 3D scene of a single object with a specific shape, size, position, and orientation, all of these factors correspond to latent factors such that we can extract all information by applying a linear transformation, i.e., $z_{true} = A z_{learned}$. That is, it can happen that a single $z_{learned,i}$ changes <em>multiple factors</em>, but we can find a matrix $A$ such that we get factors where modularity holds.</p>
<h3 id="counterexample-2">Counterexample</h3>
<p>Condition 1 is violated if, e.g., color is not encoded in the latents at all, while condition 2 is violated if there is no matrix $A$ such that $z_{true} = A z_{learned}$ holds (e.g., the mapping to $z_{true}$ is nonlinear).</p>
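The linear-decodability condition can be checked numerically with a least-squares fit. Below is a minimal sketch with synthetic latents and a made-up mixing matrix $A$: the linearly mixed code is explicit (near-zero residual), while a nonlinearly transformed code is not:

```python
import numpy as np

rng = np.random.default_rng(0)
z_true = rng.normal(size=(500, 3))

# Linearly entangled latents: explicit, even though not modular.
A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
z_linear = z_true @ A.T
# Nonlinearly transformed latents: not explicit in the linear sense.
z_nonlinear = np.tanh(z_linear)

def linear_decoding_error(z_learned, z_true):
    """Residual of the best least-squares fit z_true ~ z_learned @ W."""
    W, *_ = np.linalg.lstsq(z_learned, z_true, rcond=None)
    return float(np.mean((z_learned @ W - z_true) ** 2))

print(linear_decoding_error(z_linear, z_true))     # ~0: linearly recoverable
print(linear_decoding_error(z_nonlinear, z_true))  # clearly > 0
```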
<h1 id="a-geometric-approach-to-disentanglement">A geometric approach to disentanglement</h1>
<p>Let’s start with a refresher from the <a href="/posts/2022/06/dgl2-ammi3-geometric-priors-i/">AMMI 03</a> post about what a symmetry is:</p>
<blockquote>
<p>A <strong>symmetry</strong> of an object is a <strong>transformation</strong> that leaves certain properties of the object <strong>invariant</strong>.</p>
</blockquote>
<p>And continue with the same <strong>example</strong> as in the paper: a grid world with</p>
<ul>
<li>a single object,</li>
<li>four movement directions,</li>
<li>a single color component (hue), and</li>
<li>a circular structure (moving off the grid to the right transfers the object to the leftmost pixel; similarly, the hue wraps around to the beginning of the spectrum)</li>
</ul>
<p><img src="/images/posts/higgins_grid_world.png" alt="higgins_grid_world.png" /></p>
<p>Translation and color change do not change the identity of the object, so they are symmetries of the example, and as such, they can be thought of as forming a symmetry group $G$. Elements $g\in G$ thus map from data space to data space via $G\times X\to X$, leading to the conclusion that these transformations are the <em>group actions</em>. Additionally, we can identify subgroups of $G$ corresponding to horizontal translation, vertical translation, and color change.
To have a disentangled representation, we require that when, e.g., color changes, the position stays the same. Translated to the language of geometric deep learning, this means that a</p>
<blockquote>
<p><strong>disentangled group action</strong> should decompose into components for each subgroup such that it only affects its corresponding subgroup.</p>
</blockquote>
<p>The components are subgroups: they lie in $G$, and when we change the corresponding factor (such as color), we remain within that subgroup. It does not matter how much we tinker with color, we cannot get the position to change (throwing a paint bucket at the object does not count!).</p>
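A minimal sketch of such a disentangled group action on the grid world makes the decomposition concrete (the grid and hue sizes are made up): each subgroup element touches only its own component of the state, and the subgroup actions commute.

```python
import numpy as np

GRID, HUES = 8, 10  # 8x8 positions, 10 hue values, both circular

def act(g, state):
    """Disentangled action: each subgroup touches only its own component."""
    dx, dy, dh = g
    x, y, h = state
    return ((x + dx) % GRID, (y + dy) % GRID, (h + dh) % HUES)

state = (3, 5, 7)
move = (1, 0, 0)   # horizontal-translation subgroup element
paint = (0, 0, 4)  # hue-change subgroup element

# Changing hue never moves the object, and vice versa...
assert act(paint, state)[:2] == state[:2]
assert act(move, state)[2] == state[2]
# ...and the two subgroup actions commute.
assert act(move, act(paint, state)) == act(paint, act(move, state))
print(act(move, act(paint, state)))  # (4, 5, 1)
```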
<p>The first notable point is that here</p>
<blockquote>
<p>the disentangled representation is defined in terms of a <em>disentangled group action</em> of symmetry group $G$.</p>
</blockquote>
<p>Thus, the disentanglement definition from the paper becomes (it refers to vector representations as the latent space is assumed to be a vector space, i.e., we have latent vectors such that their linear combination is also a valid latent vector):</p>
<blockquote>
<p>A vector representation is called a <strong>disentangled representation</strong> with respect to a particular decomposition of a symmetry group into subgroups, if it decomposes into independent subspaces, where each subspace is affected by the action of a single subgroup, and the actions of all other subgroups leave the subspace unaffected.</p>
</blockquote>
<p>This means that the definition <strong>depends on the specific decomposition of $G$ into subgroups</strong>.</p>
<p>For example, if we define the decomposition with only two subgroups (one for position and one for color), then we <em>do not care about whether the model can disentangle horizontal and vertical position</em>. And this is a <strong>very important point</strong>.</p>
<blockquote>
<p>This definition of disentanglement provides a means to fine-tune the granularity w.r.t. which we require disentanglement.</p>
</blockquote>
<p>From a practical point of view, this could lead to <em>simpler models</em> as no model capacity needs to be spent to disentangle specific factors. Furthermore, this also means that</p>
<blockquote>
<p>There is no requirement on the dimensionality of the disentangled subspace.</p>
</blockquote>
<p>That is, even if a subgroup acts on a multidimensional subspace, e.g., because it comprises correlated factors, the representation is still disentangled as long as the corresponding group action only acts on that subspace. Such scenarios arise in the real world: when encoding both height and age, the two are correlated (there are no two-meter-tall babies).</p>
<p>When this is not enough, we should note that there is</p>
<blockquote>
<p>no restriction on the bases of the subgroups.</p>
</blockquote>
<p>Thus, position is not required to be described with the Cartesian coordinate axes.</p>
<p>Furthermore, if we impose the linearity constraint on the group actions for all subgroups, we arrive at a linear disentangled representation (this means that we have the matrices $\rho$ from last post’s representation definition):</p>
<blockquote>
<p>A vector representation is called a <strong>linear disentangled representation</strong> with respect to a particular decomposition of a symmetry group into subgroups, if it is a disentangled representation with respect to the same group decomposition, and the actions of all the subgroups on their corresponding subspaces are linear.</p>
</blockquote>
<h2 id="counterexample-3d-rotations">Counterexample: 3D rotations</h2>
<p>A somewhat surprising counterexample is the case of 3D rotations. As they are not commutative (see the image below), they <em>cannot be disentangled</em> according to the definition of <a href="https://arxiv.org/abs/1812.02230">(Higgins et al., 2018)</a>.
Namely, non-commutativity implies that the group actions (rotations around the $x$-, $y$-, and $z$-axes) affect each other: rotating around the $z$-axis first changes the effect a subsequent rotation around the $x$-axis has.
Thus, the group actions are not disentangled, and neither is the representation.</p>
<p><img src="/images/posts/higgins_3d_rot.png" alt="higgins_3d_rot.png" /></p>
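The non-commutativity is easy to verify numerically. A minimal sketch with the standard rotation matrices about the $x$- and $z$-axes:

```python
import numpy as np

def rot_x(a):
    """Rotation by angle a around the x-axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_z(a):
    """Rotation by angle a around the z-axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

a = np.pi / 2
# Rotating about z then x differs from rotating about x then z...
print(np.allclose(rot_x(a) @ rot_z(a), rot_z(a) @ rot_x(a)))  # False
# ...whereas rotations about the same axis do commute.
print(np.allclose(rot_x(a) @ rot_x(2 * a), rot_x(2 * a) @ rot_x(a)))  # True
```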
<h1 id="summary">Summary</h1>
<p><a href="https://arxiv.org/abs/1812.02230">(Higgins et al., 2018)</a> provides a principled definition of disentanglement based on group theory. The main benefit is that it enables us to communicate clearly about what we mean by disentangled representations. Furthermore, instead of speaking about “data generative factors” (a vague concept), they reason about the well-defined notion of group actions.</p>
<h1 id="where-is-the-nature-of-the-relationship-expressed-in-causal-models">Where is the nature of the relationship expressed in causal models? (2022-06-28)</h1>
<p>Graphs don’t tell about the nature of dependence, only about its (non-)existence.</p>
<h1 id="where-is-the-type-of-dependence-encoded-in-an-sem">Where is the type of dependence encoded in an SEM?</h1>
<p>I came across a question about how to express, e.g., a logical AND relationship in SEMs (Structural Equation Models). Let’s look into this.</p>
<p>Assume that you would like to visit your friend and you have a motorcycle you wish to use. To be able to undertake the journey, you need both the motorcycle and fuel (if you can afford it, anyways…).
Clearly, you require both conditions, leading to a logical AND condition. How do you describe this in the language of SEMs?</p>
<p>We have the following <em>binary</em> variables:</p>
<ul>
<li>$X$ - access to the motorcycle</li>
<li>$Y$ - access to sufficient fuel</li>
<li>$Z$ - a successful visit</li>
</ul>
<p>$Z$ depends on both $X$ and $Y$, so this will be a collider/v-structure in the form $X\to Z\leftarrow Y$.</p>
<blockquote>
<p>But this graph does not say anything about the nature of the dependency; it only says that there is some.</p>
</blockquote>
<p>This is not a bug: graphs on their own are only designed to express the (non-)existence of a cause-effect relationship; they say nothing about its functional form. For that, we need to level up our game to SEMs, specifically, to the “equation” part of the SEM.</p>
<p>In a SEM, each of $X,Y,Z$ has a corresponding function that maps from the exogenous variables $U_i$ to the observed ones.
For simplicity, assume that
\(\begin{align} X&=U_X \\
Y&=U_Y\end{align}\)</p>
<p>Now let’s look into $Z$. There, the relationship will be of the form
\(Z = f(X,Y, U_Z),\)
as $Z$ depends on both $X$ and $Y$. To express the logical AND relationship, we might construct $f$ as
\(Z = f(X,Y, U_Z) = X*Y*U_Z,\)
which requires that $X$ and $Y$ are both present (expressed via the multiplication; $U_Z$ is not important here).</p>
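A quick simulation of this SEM confirms the AND behavior (the Bernoulli parameters of the exogenous variables are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Exogenous binary noise (parameters are illustrative).
U_X = rng.binomial(1, 0.7, n)  # motorcycle available
U_Y = rng.binomial(1, 0.6, n)  # fuel affordable
U_Z = rng.binomial(1, 0.9, n)  # nothing else goes wrong

# Structural assignments.
X = U_X
Y = U_Y
Z = X * Y * U_Z  # logical AND, expressed as multiplication

# Z can only be 1 when both X and Y are 1.
assert np.all(Z <= X) and np.all(Z <= Y)
print(Z.mean())  # around 0.7 * 0.6 * 0.9 = 0.378
```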
<blockquote>
<p>The nature of the relationship is expressed by the functional relationships, not by the graph structure itself.</p>
</blockquote>
<h1 id="ammi-3-notes-geometric-priors-i">AMMI 3 Notes: Geometric priors I (2022-06-20)</h1>
<p>In the <a href="https://rpatrik96.github.io/posts/2022/06/dgl1-foundations/">previous post</a>, we dived deep into abstract algebra to motivate why Geometric Deep Learning is an interesting topic. Now we begin the journey to show that it is also useful in practice. In summary, we know that symmetries constrain our hypothesis class, making learning simpler—indeed, they can make learning a tractable problem. How does this happen?</p>
<h1 id="error-sources-in-learning-systems">Error sources in learning systems</h1>
<p>To understand this, we need to review the different error sources in learning systems, namely</p>
<ul>
<li><strong>approximation</strong>: although neural networks have universal approximation capabilities, in practice we cannot have infinitely deep and wide models, so the function class the network can learn is <em>constrained</em>, i.e., it might not contain the ground-truth function;</li>
<li><strong>statistical</strong>: not just our model, but our samples are also finite; thus, training probably won’t find the true function. Our hope is that the statistical error gets smaller with a smaller function class;</li>
<li><strong>optimization:</strong> gradient descent is capable of a lot of things, but the (numerical) optimization procedure has many pitfalls, from local optima to numerical problems.</li>
</ul>
<h1 id="why-should-we-use-geometric-priors">Why should we use geometric priors?</h1>
<blockquote>
<p>In this context, geometric priors mean exploiting the geometric structure of the data.</p>
</blockquote>
<p>For example, we can exploit that translating images will not change the object represented; thus, we get the same class label—some data augmentation techniques also rely on this idea, but they are not as principled as the Geometric Deep Learning approach. We will come back to this at the end of the post.
This translation invariance is exactly what CNNs realize, leading to a simpler and smaller hypothesis class and thus <em>smaller statistical error</em> (and hopefully not increasing the approximation error—for CNNs, we know that labels are the same when images are translated, so we can be sure that the approximation error will not increase, but this can be nontrivial in more complex scenarios). Additionally, as CNNs do not care about translations, we don’t need to present images of the same object in every position; thus, we can reduce sample complexity too.</p>
<h2 id="what-are-these-geometric-structures-domains">What are these geometric structures (domains)?</h2>
<p>As the title of <a href="https://arxiv.org/abs/2104.13478">Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges</a> says, deep learning has also been invaded by the 5G: grids, groups, graphs, geodesics, and gauges (this has nothing to do with the conspiracy theories, probably because much fewer people understand it). The meaning of these concepts will be clarified over the series (not everything in this post). For now, what is important is that <strong>they describe (geometric) structure</strong>.
Grids (e.g., pixel grids describing images) have an adjacency structure, i.e., each pixel has a specific set of neighbors. In the case of graphs, the edges between the nodes give the structure. We would be fools not to exploit this structure. To refer to such structures, we will use the notion of a domain:</p>
<blockquote>
<p>A domain $\Omega$ is a set with possibly additional structure.</p>
</blockquote>
<p><img src="/images/posts/5g_example.png" alt="5g_example.png" /></p>
<p>Sometimes, <strong>the domain itself is the data we use</strong>, for example</p>
<ul>
<li><em>point clouds</em> (when the data consists only of positions, without color or other attributes)</li>
<li>meshes/graphs without node/edge features (when we have a social network of people but do not store their age, gender, or any related data—what a utopian thought in today’s world, isn’t it?): in this case, we can use the adjacency matrix</li>
</ul>
<p>Nonetheless, often we <em>want to store more information</em>, e.g., the color of a pixel, or what you had for breakfast, with the obvious reason to sell it to marketers to create personalized ads for the special omelette with peanut butter and jelly you thought you could keep secret.</p>
<h2 id="how-can-we-represent-further-attributes-signals">How can we represent further attributes (signals)?</h2>
<p>To attach other attributes to elements of a domain $\Omega$, we use a function, which we call the <strong>signal</strong>, mapping from elements of $\Omega$ (e.g., a pixel) to a vector space $C$ (e.g., the RGB color of a pixel). We denote the space of signals as $X=\{x:\Omega\to C\}$ (think of this as the data space of RGB images).</p>
<blockquote>
<p>So a <strong>signal</strong> associates a vector space $C$ with each element of $\Omega$.</p>
</blockquote>
<p>$C$ does not even need to be the same for all $u$; it can be, e.g., the tangent space at a specific point on a sphere.</p>
<p>What is interesting is that irrespective of the domain $\Omega$, $X$ will always be a vector space (linear combinations work as expected). Okay, this is not so surprising, as we defined $C$ to be a vector space. Nonetheless, it enables us to do operations on our data (e.g., we can add images)—and as we know from our discussion on abstract algebra, operations are essential to define, for example, groups. Obviously, this is where today’s discussion is headed.</p>
<h2 id="what-are-symmetries">What are symmetries?</h2>
<p>We have done the above (and the <a href="https://rpatrik96.github.io/posts/2022/06/dgl1-foundations/">previous post</a> too) to be able to describe <strong>symmetries</strong>.</p>
<blockquote>
<p><strong>Symmetries</strong> are object transformations that leave the object unchanged,</p>
</blockquote>
<p>and come in many flavors. Formulated otherwise: $g:\Omega\to\Omega$ is a symmetry if it preserves the structure of the domain.
We start by noting that when a group element $g$ acts on an element of $\Omega$, we get back a (possibly different) element of $\Omega$. This means that the <strong>group action</strong> is a mapping $G\times\Omega \to \Omega$, denoted as $(g,u)\mapsto gu$ (i.e., it associates the element $gu$ with the element $u\in\Omega$ via the symmetry $g$).</p>
<p>$g$ has the following properties ($e$ is the identity element)</p>
<ul>
<li>Composition: $g(hu)=(gh)u$</li>
<li>$eu=u$</li>
</ul>
<h3 id="example">Example</h3>
<p>An example is planar motion in $R^2$, where $g$ is described by a rotation angle $\theta$, and two translation coordinates $t_x, t_y$. Then applying $g$ on a point $u=(x,y)$ can be characterized by the mapping $((\theta, t_x, t_y), (x,y))\mapsto [R; T] (x,y,1)$, where $[R;T]$ is a shorthand for the transformation matrix that rotates $u$ by $\theta$ and translates it by $(t_x,t_y)$—the third coordinate is needed to describe this affine transformation with a single matrix.</p>
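This affine construction can be sketched in a few lines (the angle and translation values are arbitrary): the $3\times 3$ homogeneous matrix $[R;T]$ first rotates and then translates a point given in homogeneous coordinates $(x,y,1)$.

```python
import numpy as np

def planar_motion(theta, tx, ty):
    """Homogeneous 3x3 matrix [R; T]: rotate by theta, then translate."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0, 1]])

u = np.array([1.0, 0.0, 1.0])           # point (1, 0) in homogeneous coords
g = planar_motion(np.pi / 2, 2.0, 3.0)  # rotate 90 degrees, translate by (2, 3)
print(g @ u)  # [2. 4. 1.]: (1,0) -> (0,1) -> (2,4)
```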
<h2 id="how-can-we-describe-all-symmetries">How can we describe all symmetries?</h2>
<p>It took a lot of effort, but now we are equipped to describe symmetry groups:</p>
<blockquote>
<p>If we collect all symmetries (of a specific $\Omega$) together, we get a <strong>symmetry group</strong> $G$ with a group operation as the composition of group elements.</p>
</blockquote>
<p>From the <a href="https://rpatrik96.github.io/posts/2022/06/dgl1-foundations/">previous post</a>, we know that:</p>
<ul>
<li>the identity needs to be in the group,</li>
<li>the composition of group elements is also in the group (thus, a symmetry), and</li>
<li>the inverse of a symmetry is also a symmetry.</li>
</ul>
<p>In our case, the group elements are functions (rotations, for example), but generally they are just set elements.</p>
<h2 id="how-can-we-classify-symmetry-groups">How can we classify symmetry groups?</h2>
<p>Symmetry groups can be <strong>discrete</strong> (rotating an equilateral triangle with multiples of $120^\circ$ or flipping its vertices) or <strong>continuous</strong> (rotations in $SO(3)$). They can be <strong>commutative/non-commutative</strong> (flipping the vertices of a triangle then rotating it has a different result than first rotating then flipping).</p>
<h2 id="how-can-we-apply-symmetries-on-our-data">How can we apply symmetries on our data?</h2>
<p>As these transformations (functions) $g$ act on $\Omega$ but our data lives in the signal space $X(\Omega, C)$, we need to introduce the corresponding mapping on $X$ as well. Namely, we need to be able to express symmetries not just on the pixel grid, but also in the RGB channels (the vector space $C$).</p>
<h3 id="example-1">Example</h3>
<p>Let’s look into an example of moving a bug (I am not supposing that you get a bug in your code and move it into someone else’s) in an image by translation $t=(t_x,t_y)$. When we translate the bug by 5 pixels to the right (this would mean $t_x=5, t_y=0$), then to get the pixel value of the translated image at position $u$, we need to look up the original pixel value at the position 5 pixels to the <em>left</em> of $u$, i.e., at $u-t$.</p>
<p><img src="/images/posts/group_action_inverse_bug.png" alt="group_action_inverse_bug.png" /></p>
<h3 id="symmetries-in-data-space-formula">Symmetries in data space formula</h3>
<p>This example highlights why the formula defining how the symmetries of $\Omega$ act on the signal space $X(\Omega, C)$ is
\((gx)(u) = x(g^{-1}u)\);
in our example, $g^{-1}$ applies $-t$. If you wonder about the reason for the inverse, you don’t need to wait any longer: it is to satisfy the group axioms (hint: the inverse element needs to be in the group—in our example, this is the relationship between the pixels shifted to the left/right).</p>
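The inverse lookup can be sketched in a few lines; a hypothetical 1D circular signal stands in for the bug image, and shifting by $t$ means reading the original values at $u - t$:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])  # a 1D "image" on a circular domain

def translate(x, t):
    """(g x)(u) = x(g^{-1} u): look up the value t pixels to the left."""
    u = np.arange(len(x))
    return x[(u - t) % len(x)]

print(translate(x, 2))  # [40 50 10 20 30]: content moved 2 to the right
```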
<h2 id="groups-get-into-action-how-do-we-get-to-representations">Groups get into action: how do we get to representations?</h2>
<p>Groups are abstract concepts; we need to describe them such that our computers can produce significant carbon footprints. Implicitly, we already did this (not the carbon footprint thing, though): when we used matrices to describe affine transformations, we assigned a linear map to the group element $(\theta, t_x, t_y)$. Basically, this is what representations do.</p>
<blockquote>
<p>An $n$-dimensional real <strong>representation</strong> of a group $G$ is a map $\rho: G\to R^{n\times n}$ assigning an invertible matrix $\rho(g)$ to each $g\in G$ such that it satisfies the <strong>homomorphism property</strong>
\(\rho(gh) = \rho(g)\rho(h)\)</p>
</blockquote>
<h3 id="example-2">Example</h3>
<p>An example would be the following:</p>
<ul>
<li>group $G=(Z,+)$</li>
<li>domain $\Omega = Z_5 = \{0,1,2,3,4\}$ (a short audio signal of length five)</li>
<li>action of $g=n$ on $u\in\Omega: (n,u)\mapsto n+u$ (mod 5)</li>
<li>the representation of $X(\Omega)$ is a 5-dimensional shift matrix</li>
</ul>
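The shift-matrix representation can be written down explicitly. The sketch below (dimensions chosen to match the example) also verifies the homomorphism property $\rho(gh) = \rho(g)\rho(h)$:

```python
import numpy as np

def rho(n, dim=5):
    """Representation of n in (Z, +): the n-step cyclic shift matrix."""
    return np.roll(np.eye(dim, dtype=int), n, axis=0)

# Homomorphism property: rho(g + h) = rho(g) rho(h).
assert np.array_equal(rho(2 + 3), rho(2) @ rho(3))
# The infinite group Z is represented by only 5 distinct matrices:
assert np.array_equal(rho(7), rho(2))  # 7 = 2 (mod 5)
print(rho(1))  # the one-step shift matrix
```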
<blockquote>
<p>An important conclusion is that <strong>the number of elements in $G$</strong> (in this case infinite) and the <strong>dimension of the representation are independent</strong></p>
</blockquote>
<h2 id="what-are-the-types-of-symmetries-relevant-to-machine-learning">What are the types of symmetries relevant to machine learning?</h2>
<h3 id="symmetries-of-parametrization">Symmetries of parametrization</h3>
<p><img src="/images/posts/param_symm.png" alt="param_symm.png" /></p>
<p>This type of symmetry comes from how we build our neural networks. For example, given the vector space of data (signals) $X$, outputs (e.g., labels) $Y$, and the weights $W$, we can describe our net by a mapping $X\times W\to Y$ (mapping data with the net’s weights to a label) and say that a transform $g$ is a symmetry of this parametrization (~network structure) when we get the same result by using the weights $w\in W$ as with $gw$. For example, in an MLP permuting the hidden units makes no change in the output as we add the values and addition is commutative.</p>
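The permutation symmetry of an MLP is easy to check. Here is a minimal sketch (the weights and layer sizes are made up) of a one-hidden-layer ReLU network whose output is unchanged when the hidden units are permuted consistently:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(8, 4))  # input -> hidden
W2 = rng.normal(size=(1, 8))  # hidden -> output

def mlp(x, W1, W2):
    """One ReLU hidden layer, no biases."""
    return W2 @ np.maximum(W1 @ x, 0)

# Permute the hidden units: reorder rows of W1 and columns of W2.
perm = rng.permutation(8)
y_original = mlp(x, W1, W2)
y_permuted = mlp(x, W1[perm], W2[:, perm])
assert np.allclose(y_original, y_permuted)
print(y_original)
```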
<h3 id="symmetries-of-the-label-function">Symmetries of the label function</h3>
<p><img src="/images/posts/label_symm.png" alt="label_symm.png" /></p>
<p>We already touched on CNNs and their invariance w.r.t. translation. In general, if the label does not change under a transformation $g: \Omega\to\Omega$, then we say that $g$ is a symmetry of the label function. The label function is simply a notation describing the mapping that associates a label in $Y$ to a data point in $X$ (denoted by $L: X\to Y$ ).
Note that here $g$ is applied on the domain first; only after that comes the label function, i.e., $L\circ g$.</p>
<blockquote>
<p>This means that if we have a single data point but know all $g\in G$, then we can generate all instances of the class. Basically, <strong>learning all symmetries is what it takes to solve classification</strong> (which is a very hard problem).</p>
</blockquote>
<p>From the <a href="https://rpatrik96.github.io/posts/2022/06/dgl1-foundations/">previous post</a>, we can relate this to factor groups, which collect the group elements that behave the same way w.r.t. the kernel of a homomorphism. What this means for classification is that the elements of the factor group divide all samples into the respective classes. So we can think of the symmetries of the label function as a way to describe the elements of the factor group.</p>
<h2 id="why-should-we-really-use-geometric-priors">Why should we <em>really</em> use geometric priors?</h2>
<p>Because they have symmetries! And now we can describe them. For example:</p>
<ul>
<li>For <strong>sets</strong>, permuting the elements does not change the set (i.e., the structure of the domain).</li>
<li><strong>Grids</strong> (as we have seen with the image example before) have symmetries w.r.t. discrete rotations, translations, etc.</li>
<li>Isometries (distance-preserving maps) leave the Euclidean space unchanged.</li>
<li>Diffeomorphisms preserve the smooth structure on $\Omega$.</li>
</ul>
<p>With such structures as graphs or sets, we can point out a seemingly subtle but important detail: although a graph (or a set) is an <strong>abstract concept</strong>, they need to have a <strong>practical description</strong> (how they are stored in computer memory). The consequence is that usually, we are interested in the <strong>symmetries of the description, not that of the object</strong>.</p>
<h2 id="how-can-we-exploit-symmetries">How can we exploit symmetries?</h2>
<p>We already talked about CNNs and their <em>invariance to translation</em>. Nonetheless, <em>invariance</em> has its pitfalls. In the case of learning faces, we need to be careful <strong>not to make the intermediate representations invariant,</strong> for that can lead to unrealistic objects; with faces, this would mean the rightmost image below.</p>
<p><img src="/images/posts/rotation_intermediate.png" alt="rotation_intermediate.png" /></p>
<blockquote>
<p>What is the answer to that? <strong>Equivariance</strong></p>
</blockquote>
<h3 id="equivariant-networks">Equivariant networks</h3>
<blockquote>
<p>Equivariance means that transforming the input of $f$ with a transform $h$ and then applying $f$ gives the same result as first applying $f$ and then transforming the output with the same $h$:
\(f \circ h(u) = h\circ f(u)\)</p>
</blockquote>
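This property can be checked numerically. Below is a minimal sketch (a hypothetical 1D circular signal and filter) showing that a cyclic convolution commutes with translations, i.e., $f \circ h = h \circ f$ for shifts $h$:

```python
import numpy as np

def cyclic_conv(x, w):
    """Circular cross-correlation: f(x)[i] = sum_k w[k] * x[(i + k) % n]."""
    n = len(x)
    return np.array([sum(w[k] * x[(i + k) % n] for k in range(len(w)))
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=8)         # 1D signal on a circular domain
w = np.array([1.0, -2.0, 1.0]) # a small filter
t = 3

# f(h(x)) == h(f(x)): convolving the shifted signal equals shifting the output.
lhs = cyclic_conv(np.roll(x, t), w)
rhs = np.roll(cyclic_conv(x, w), t)
print(np.allclose(lhs, rhs))  # True
```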
<p>We can build equivariant neural networks if we have the following components:</p>
<ul>
<li>feature vector spaces $X_i$</li>
<li>nonlinear maps $f_i$</li>
<li>symmetry group $G$</li>
<li>group representations $\rho_i$ for each $X_i$</li>
</ul>
<p><img src="/images/posts/equiv_nn.png" alt="equiv_nn.png" /></p>
<blockquote>
<p>We need different group representations for each $X_i$ as the same symmetry “needs to be adapted” to the new data space $X_i$.</p>
</blockquote>
<p>This leads to the definition of <strong>equivariant networks</strong>:</p>
<p>\(f_i \circ \rho_{i-1}(g) = \rho_i(g) \circ f_i,\)
which means that applying the (corresponding) representation and the nonlinear map can be <em>almost interchanged</em> (note the different indices for $\rho$). First applying the representation of layer $i-1$ and then mapping through layer $i$ should give the same result as first mapping through layer $i$ then applying $\rho_i$.</p>
<h3 id="example-3">Example</h3>
<p>When treating images, the input space $X_0$ is $n\times n \times 3$ (RGB channels), and assume that the first layer $f_1$ is a convolution with 64 channels. This means that we need a different representation of, e.g., translations in this 64-channel space (if we translate with $\rho_0$ and then map with the first layer, we should get the same result as if we first map and then translate with $\rho_1$). <em>The rationale is that a translation cannot be described in the exact same way in both 3 and 64 channels.</em></p>
<h3 id="why-is-equivariance-beneficial-for-generalization">Why is equivariance beneficial for generalization?</h3>
<p>In the end, our goal is to generalize better: we want that <strong>all samples that map to the same feature will still map to the same feature after undergoing a transformation</strong> by the group representation $\rho_1(g)$ in the input space. For this, the notion of an <strong>orbit</strong> is a useful concept:</p>
<blockquote>
<p><strong>Orbit</strong>: the manifold of a sample undergoing a transformation by each element of a group (e.g. the manifold of all rotated digits, starting from a single one—these are the curved lines in the input space in the figure below).</p>
</blockquote>
<p><img src="/images/posts/orbit.png" alt="orbit.png" /></p>
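A minimal sketch of an orbit: a 2D point transformed by every element of a discrete rotation subgroup. The invariant here is the norm, so the orbit lies on a circle (the point and subgroup are made up for illustration):

```python
import numpy as np

def rot(theta):
    """2D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

u = np.array([2.0, 0.0])
# Orbit of u under the cyclic subgroup of rotations by multiples of 45 degrees.
orbit = [rot(k * np.pi / 4) @ u for k in range(8)]

# Every point on the orbit shares the invariant of the group: the norm.
norms = [np.linalg.norm(p) for p in orbit]
print(np.allclose(norms, 2.0))  # True: the orbit is a circle of radius 2
```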
<p>Indeed, this is what equivariant nets are capable of (see, e.g., <a href="https://arxiv.org/abs/1902.04615">this paper</a> on CNNs).</p>
<p>Wait a second! We were talking about transformed samples, so can’t we achieve the same generalization properties with data augmentation (instead of the toil required to derive the theory and design equivariant models)?</p>
<blockquote>
<p>No, data augmentation is inferior to equivariant networks.</p>
</blockquote>
<p>For example, data augmentation imposes a constraint on the whole network (i.e., when augmenting the samples, we do not prescribe constraints for specific layers), whereas equivariance imposes a layer-wise constraint. Equivariant architectures also scale better to large groups.</p>
<h1 id="summary">Summary</h1>
<p>We dived deep into geometric priors to describe symmetries in the domain (data), parametrization, and labels, with the goal of designing efficient models exploiting this inductive bias. In the end, we did exactly that with equivariant networks and understood why equivariance is beneficial (it is a principled route to generalization).</p>
<h2 id="acknowledgement">Acknowledgement</h2>
<p>This post was created from the <a href="https://www.youtube.com/watch?v=fWBrupgU4X8&list=PLn2-dEmQeTfQ8YVuHBOvAhUlnIPYxkeu3&index=3&t=5s&ab_channel=MichaelBronstein">AMMI Course on Geometric Deep Learning - Lecture 3</a> held by <a href="https://twitter.com/TacoCohen">Taco Cohen</a>. Mistakes are my own.</p>Patrik ReizingerIn the previous post, we dived deep into abstract algebra to motivate why Geometric Deep Learning is an interesting topic. Now we begin the journey to show that it is also useful in practice. In summary, we know that symmetries constrain our hypothesis class, making learning simpler—indeed, they can make learning a tractable problem. How does this happen?Mathematical foundations for Geometric Deep Learning2022-06-02T00:00:00+02:002022-06-02T00:00:00+02:00https://rpatrik96.github.io/posts/2022/06/dgl1-math-groupthink<p>Yes, abstract algebra is actually useful for machine learning.</p>
<h1 id="introduction">Introduction</h1>
<p>Machine learning operates on images, text, speech, and much more. We intuitively understand that these data include structure, but for most of us, this is where our knowledge stops. With the emergence of geometric deep learning, there is an increased need to understand the invariances and symmetries in the data.</p>
<p>When I started my B.Sc. in electrical engineering at the Budapest University of Technology and Economics in 2015, the curriculum of our “Introduction to Computer Science” class was changed; it no longer included abstract algebra. Looking into the assigned book, I thought this to be a reasonable decision, as I could not imagine myself using groups, rings, or bodies. I still think that this was a reasonable decision for <em>most</em> students, but I have realized that I missed a great opportunity to understand a deeper level of mathematics and a way to connect it to the real world.</p>
<p>The name “abstract algebra” and connections to the real world might seem at odds, but I think they are not. Though a “rotation” is more abstract than a matrix in the mathematical sense, it is a concept we can easily relate to. When I think about rotation matrices, I always associate them with physical rotation in three dimensions to have an easy-to-grasp mental concept for the mathematical description. You might object that this only works in 3D. Yes and no: I only have access to the <em>real-world</em> meaning of rotations in 3D, but through this I have a <em>general</em> idea about what rotations are, so I am not baffled when I hear about 100-dimensional objects being rotated.</p>
<p>Geoffrey Hinton’s sarcastic remark also highlights our brain’s capacity to handle complex scenarios:</p>
<blockquote>
<p>To deal with a 14-dimensional space, visualize a 3-D space and say ‘fourteen’ to yourself very loudly. Everyone does it.</p>
</blockquote>
<!-- **fig about the relatoions of different objects** -->
<p>In abstract algebra, we deal with sets and equip them with different relationships and properties, which we will describe with the help of operations. Operations take a number of elements from a set and map them to another element in the set. We can distinguish unary (acting on one element), binary (acting on two elements), ternary, etc. operations.</p>
<p>This is the same concept you know from programming, so we can think of them as functions with a given number of inputs. Negating a boolean is a unary operator (as it changes the value of a single variable, it is a function with one input), but adding two numbers is binary (two inputs, but still one output).</p>
<p>In mathematics, we can describe such functions as operations mapping from a sequence of elements of a set $S$ (i.e., we take 2 elements from $S$ for a binary operator) to another element in $S$.</p>
<p>In the case of addition (for integers in this example), the Python type hints describe $S$, and we can see that our function maps $S\times S \to S$, meaning that the first parameter <code class="language-plaintext highlighter-rouge">x</code> comes from the set <code class="language-plaintext highlighter-rouge">int</code>, and so does <code class="language-plaintext highlighter-rouge">y</code>. As a result, we get another <code class="language-plaintext highlighter-rouge">int</code> with the value <code class="language-plaintext highlighter-rouge">x+y</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">x</span><span class="p">:</span><span class="nb">int</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span><span class="nb">int</span><span class="p">)</span><span class="o">-></span><span class="nb">int</span><span class="p">:</span>
<span class="k">return</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span>
</code></pre></div></div>
<blockquote>
<p>An operation is a function,</p>
</blockquote>
<p>and as one, it can have different properties. These properties are the instructions for how we can apply the operators (functions), and they play a role in what symmetries are present in $S$.</p>
<ul>
<li><strong>Associativity:</strong> the binary operator * is associative if it admits the rearrangement of parenthesis (i.e., the order of carrying out the operations). Thus, $a*(b*c)=(a*b)*c$.</li>
<li><strong>Commutativity:</strong> the binary operator * is commutative if its arguments are exchangeable (i.e., if they can switch place). Thus, $a*b=b*a$.</li>
</ul>
<blockquote>
<p>* is used as a symbol for an arbitrary operator, it does not necessarily mean multiplication.</p>
</blockquote>
<p>Abstract algebraic structures are defined by the set $S$, one or more operators, and their properties. They form a hierarchy in which more and more operations become valid. This means that more complex structures admit more “exotic” functions. Intuitively, this amounts to having richer “representations” (in the machine learning sense).</p>
<blockquote>
<p>Before advancing through this hierarchy, we should ponder the question: if we can have richer representations, why would we be satisfied with simpler ones? That is, why do we need simpler algebraic structures?</p>
</blockquote>
<p>Complexity has its price. Thinking in terms of neural networks, a larger model is more powerful, but is harder to train, whereas the smaller one is faster, but less expressive. A convolutional network is invariant to translations, which is useful for images, but could be useless or harmful in other contexts. As we would not choose a huge model for MNIST (we would risk overfitting), we select the simplest algebraic structure that admits the properties we want. In the following, we describe our choices.</p>
<h1 id="groups">Groups</h1>
<p>Groups are already special: they have specific properties (see below). To provide some perspective, we start with a very simple structure comprising a set and an associative operation. It is not much, though it deserves its own name:</p>
<blockquote>
<p>A set $S$ is an <strong>associative semigroup</strong> if it has an associative operator *. If * is also commutative, we call it a <strong>commutative (or abelian) semigroup</strong>.</p>
</blockquote>
<p>$n\times n$ matrices are a good example of an associative semigroup. They can describe linear transformations such as rotations and translations, so even this simple structure is powerful. However, they are not <strong>abelian</strong>, as matrix multiplication is generally not commutative; the matrices that do commute w.r.t. * form a smaller set.</p>
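<p>A quick numerical check of these claims (a toy sketch of my own): matrix multiplication is associative and closed, but generally not commutative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(3, 2, 2))  # three random 2x2 matrices

# associativity: (A * B) * C == A * (B * C)
assert np.allclose((A @ B) @ C, A @ (B @ C))

# closure: the product of two 2x2 matrices is again a 2x2 matrix
assert (A @ B).shape == (2, 2)

# no commutativity: random matrices almost surely do not commute
assert not np.allclose(A @ B, B @ A)
```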
<p>Having defined an algebraic structure, we can interpret what we mean by the operator mapping $S\times S\to S$: multiplying two $n\times n$ matrices yields another $n\times n$ matrix. That is, $S$ is closed w.r.t. *. This is not the case for all operators. If our operator is subtraction, then $S=\mathbb{N}$ is not closed w.r.t. $-$, as $(7-9)\not\in\mathbb{N},$ which violates the requirement that the result of the operator (the output of the function) is in the same set. In this case, <em>enlarging</em> $S$ is the solution: if we choose $S=\mathbb{Z}$, then $S$ is closed w.r.t. subtraction.</p>
<p>If we demand that $S$ contains a unit element (identity) w.r.t. the operator *, then we call $S$ a <strong>monoid</strong>. It is important that we speak about “identity <em>w.r.t.</em> the operator *”, as we will see that more complex algebraic structures can have multiple operators and thus multiple identity elements. An example is the identity matrix $I_n$ for the set of $n\times n$ matrices with matrix multiplication as *. In this case, we can call $I_n$ a <em>multiplicative unit element</em> to differentiate it from other possible unit elements.</p>
<blockquote>
<p>A set $S$ is a <strong>group</strong> with the operator * if:</p>
<ul>
<li>* is associative</li>
<li>has an identity element $i$ s.t. $\forall x \in S : x*i = i*x = x$</li>
<li>the inverse of each element is unique and is also in $S$, i.e., $\forall x\in S \ \exists! y \in S : x*y=y*x=i$</li>
</ul>
</blockquote>
<p>The leap from monoids to groups is the existence of the inverse element. Though this can seem an unimportant feature, it is not. Namely, having the inverse in $S$ means that we can undo the operation (think of this as the Ctrl+Z/Cmd+Z option on your laptop). That is, we can answer the question: <em>What was the starting point before applying a specific element?</em> From the machine learning perspective, if we assume that our data is generated by latent factors, then we will be able to recover them. As for semigroups, we can define commutative/abelian groups if * is also commutative.</p>
<h2 id="grouping-groups">Grouping groups</h2>
<p>When we have images with different shapes, positions, and colors, it is useful to build categories like triangles, circles, and squares. These are <strong>subgroups</strong> of the original group. That is, if our group $G$ contains vectors with elements $[x; y; angle; shape; color]$ and the operator * combines the features (e.g., translates the object), then a subgroup is a set of elements where some coordinates of the feature vector are fixed. Triangles are the elements where $[x; y; angle; shape=triangle; color]$, and so on.</p>
<blockquote>
<p>So a subgroup is a subset $S$ of the elements with the same operator as the group $G$. This is like a specific 2D plane in 3D space.</p>
</blockquote>
<p>Note that subgroups need to contain the identity of the group to have inverses.</p>
<p>Subgroups are useful as they correspond to how we would categorize objects. We think about triangles, circles, and squares as distinct objects. If we would like to cover the whole space of objects, i.e., to get a description of <em>all</em> 2D planes covering 3D space, we need the concept of <strong>cosets</strong>.</p>
<blockquote>
<p><strong>Cosets</strong> of a subgroup $S$ are its shifted copies; together, they cover the original group $G$.</p>
</blockquote>
<p>For 2D planes parallel to the $x-y$ plane of the Cartesian coordinate system, this means having all translations along the $z$ axis. The idea is generalized by taking an element $g\in G$ and applying the group operator * to $g$ and <em>all</em> $s\in S$. That is, we shift all points of the plane ($S$) by $g$. The concept is captured mathematically as $S*g,$ which means that we take <em>all</em> $s\in S$ and apply * with a <em>single</em> $g\in G$ (this yields a single 2D plane, shifted by the vector $g$); then we repeat this for <em>every</em> $g$ (to cover the whole 3D space). We can generate cosets both as $S*g$ (right coset) and $g*S$ (left coset), but we will only focus on cases when both are the same; such an $S$ is called a <strong>normal subgroup</strong>. An intuitive way to think about this is to compare this property to commutativity.</p>
<blockquote>
<p>How do we benefit from dividing groups into smaller entities besides having a more intuitive description?</p>
</blockquote>
<p>This enables us to express certain symmetries. Take rotations of objects, for example. They describe a subset of objects with different orientations. So we write a function to render objects with all rotations. As our computational power is finite, we need to define a step size for the angle; make it $1^\circ$. In this case, the rotation element $R_1$ (“rotate by $1^\circ$”) <em>generates a subgroup</em> (as our group contains other features such as position, shape, etc.) containing $R_1, R_2, \dots, R_{359}, R_{360}$. We need all 360 elements, otherwise the group operator (multiplying the rotation matrices) would create elements not in the subgroup. Moreover, this subgroup is <strong>cyclic</strong>: when we apply $R_1$ consecutively, we do not get an infinite number of different elements, as $R_{360}=R_0$ (the identity). We call the smallest number of times we can apply the <em>generating element</em> $R_1$ and get back the identity the <strong>order</strong> of the subgroup.</p>
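<p>We can sketch this cyclic subgroup in code by identifying each rotation $R_k$ with its angle $k$ and the group operator with addition modulo 360 (an illustrative simplification: we compose angles here, not rotation matrices):</p>

```python
# Rotations by whole degrees form a cyclic group: composing two rotations
# adds their angles, and angles wrap around at 360 degrees.
identity = 0     # R_0: "rotate by 0 degrees"
generator = 1    # R_1: "rotate by 1 degree"

def compose(a, b):
    # the group operator: add the angles modulo 360
    return (a + b) % 360

# apply the generator repeatedly until we return to the identity
element, order = generator, 1
while element != identity:
    element = compose(element, generator)
    order += 1

print(order)  # 360, the order of the cyclic subgroup
```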
<p>Normal subgroups can be used to define another group, called <strong>factor or quotient group</strong>, denoted by $G/S$. The name “quotient” comes from an analogy to division: as quotient groups are sets of cosets of $S$, this means that a quotient describes a “clustering” of $G$ according to $S$. Namely:</p>
<ul>
<li>$S$ has its cosets w.r.t $G$ that cover $G$ with non-overlapping subsets;</li>
<li>$G/S$ collects all such subgroups together;</li>
<li>the implication is that the order of the quotient group, $|G/S|$ is the number of cosets of $S$.</li>
</ul>
<p>The last point illustrates the additional information conveyed by quotient groups compared to plain old division: division only gives the order (i.e., the quotient), but quotient groups provide the elements too. An example is taking the integers as $G$ with addition as the group operation and defining the normal subgroup $S$ as the numbers that are multiples of, e.g., $7$. This means that $G/S$ gives the integers modulo $7$, i.e., it divides all integers into $7$ clusters: those with remainder $0,1,2,3,4,5,6$ w.r.t. division by $7$.</p>
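<p>A minimal sketch of this example, identifying each coset of $S$ (the multiples of 7) by a remainder modulo 7:</p>

```python
# G: the integers with addition; S: the multiples of 7, a normal subgroup
# (every subgroup of an abelian group is normal).
S = 7

def coset(g):
    # the coset S + g, identified by the remainder of g modulo 7
    return g % S

# every integer falls into exactly one of the 7 cosets
print(sorted({coset(g) for g in range(-20, 21)}))  # [0, 1, 2, 3, 4, 5, 6]

# the operation on cosets is well-defined: (S + a) + (S + b) = S + (a + b)
a, b = 12, 23
assert coset(a + b) == (coset(a) + coset(b)) % S
```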
<p>Regarding technical details, the group structure follows from the properties of normal subgroups, namely, that the left and right cosets are the same. From $S*g = g*S$, we can rewrite this as $S=g^{-1}*S*g$. From the equivalence of left and right cosets it follows that $S$ is the unit element of the factor group, since $S*(g*S)=S*(S*g)$ and $S*S=S$; thus $S*(g*S)=S*(S*g)=S*g$. For the inverse, note that the coset $S*g^{-1}$ inverts $S*g$, since $(S*g)*(S*g^{-1})=S*(g*S)*g^{-1}=S*(S*g)*g^{-1}=(S*S)*(g*g^{-1})=S$, so we have an inverse too.</p>
<p>From a machine learning perspective, we can see the merit of factor groups, as they can express how different elements in a (data) set are grouped together, e.g., this is what we want when clustering data.</p>
<h2 id="expressing-group-equivalence-isomorphism">Expressing group equivalence (isomorphism)</h2>
<p>We can describe the same group with different representations. If we have an image, its meaning does not change whether we select the top-left or the bottom-right pixel as the origin of our coordinate system, because we can find a bijective mapping that transforms the coordinates from one frame to the other. This notion, which we call <strong>isomorphism</strong>, is important as it reduces the number of genuinely different sets (we only need to take care of those that are not isomorphic to each other; e.g., we don’t need to consider all coordinate systems for our images).</p>
<blockquote>
<p>Formally, two groups $G_1, G_2$ are <strong>isomorphic</strong> if there is a <em>bijective</em> mapping $\phi: G_1 \to G_2$ such that $ \forall x,y \in G_1 : \phi(x)*\phi(y) = \phi(x*y)$, where * is the group operation.</p>
</blockquote>
<p>The definition says that if we apply the group operation to two elements in $G_1$, then map the resulting group element to $G_2$, we get the same result as applying the group operation of $G_2$ to the elements that are mapped to $G_2$. Going back to our representation learning example, let’s assume that the operator in $G_1$ (the latent space) “combines the features” (similar to + for numbers; e.g., if $x$ describes a red triangle and $y$ a blue triangle, then $x*y$ is a purple triangle), and the operator of $G_2$ does the same for the images (in this example, we can think of adding the matrices representing the images). Translated to this example, the definition says that combining the features “red triangle” and “blue triangle” (e.g., both are vectors with two elements, indicating RGB color and shape) and then mapping the combined feature vector to an image is equivalent to combining the <em>images</em> of a red and a blue triangle.</p>
<h2 id="homomorphism">Homomorphism</h2>
<p>We have already defined <em>isomorphisms</em> that map between two groups with a bijective mapping, but this is a strong constraint as it requires that each element of $G_1$ is mapped to a single distinct element of $G_2$. <strong>Homomorphisms</strong> generalize these mappings by omitting the bijectivity constraint, leading to the definition:</p>
<blockquote>
<p>The mapping between two groups $G_1, G_2$ is a <strong>homomorphism</strong> if there is a mapping $\phi: G_1 \to G_2$ defined for each $x\in G_1$ such that $ \forall x,y \in G_1 : \phi(x)*\phi(y) = \phi(x*y)$, where * is the group operation.</p>
</blockquote>
<p>Although we cannot invert a homomorphism in general, the property $ \phi(x) * \phi(y) = \phi(x*y) $ still means that we preserve the group structure. In representation learning, we come across a similar concept when using Latent Variable Models (LVMs), where a low-dimensional latent vector describes high-dimensional observations (as in the example above, with factors such as color and shape as latents and the image as the high-dimensional observation).</p>
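<p>As a small illustration (my example, not from the post): $\exp$ is a homomorphism from $(\mathbb{R}, +)$ to $(\mathbb{R}_{>0}, \cdot)$, while $x \mapsto x \bmod 4$ from $(\mathbb{Z}, +)$ to $(\mathbb{Z}_4, +)$ is a homomorphism that is not injective:</p>

```python
import math

# exp maps (R, +) to (R_{>0}, *) and preserves the structure:
# phi(x + y) = phi(x) * phi(y); it is even bijective, hence an isomorphism.
phi = math.exp
x, y = 1.5, -0.25
assert math.isclose(phi(x + y), phi(x) * phi(y))

# a homomorphism that is *not* an isomorphism: x -> x mod 4
# from (Z, +) to (Z_4, +); the structure is preserved, but 0 and 4 collide
psi = lambda n: n % 4
assert psi(3 + 7) == (psi(3) + psi(7)) % 4
assert psi(0) == psi(4)  # not injective
```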
<p>When training Variational AutoEncoders (VAEs), it can happen that we experience <strong>posterior collapse</strong>, i.e., some elements in the latent space do not capture useful information, they are white noise. Intuitively, this relates to the concept of <strong>kernel</strong> (not the Linux one though):</p>
<blockquote>
<p>The kernel of a homomorphism $\phi: G_1 \to G_2$ is the set of elements in $G_1$ that map to the unit element in $G_2$ and is denoted by $Ker(\phi)$.</p>
</blockquote>
<p>We might think of the collapsed latents in VAEs as the kernel of the mapping to the observation space, since they do not contain information so they get mapped to a blurry image (whether that can be called a unit element is not trivial, but I am reasoning only on an intuitive level).</p>
<blockquote>
<p>The <strong>image</strong> of $\phi$ is the set of elements in $G_2$ that can be produced by mapping elements of $G_1$ via the homomorphism $\phi(g_1) : g_1 \in G_1$ and is denoted by $Im(\phi)$.</p>
</blockquote>
<p>In VAE language, these are the images you can generate.</p>
<p>Interestingly, the quotient group $G/Ker(\phi)$ is isomorphic to $Im(\phi)$ (this is known as the first isomorphism theorem). That is, if we divide $G$ into cosets according to $Ker(\phi)$ (these are the clusters that “behave” in the same way w.r.t. the kernel), then this grouping is equivalent to taking the elements of $Im(\phi)$: each coset is mapped to the <em>same</em> element of $Im(\phi)$. This means that the latent space can be divided into specific clusters yielding the same image; the importance of this is that it defines a sort of symmetry/invariance in the latent space, showing that some changes in the latents do not affect the generated image.</p>
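<p>We can sketch this correspondence for a toy homomorphism $\phi: \mathbb{Z}_{12} \to \mathbb{Z}_4$, $\phi(x) = x \bmod 4$ (my example for illustration):</p>

```python
# phi: Z_12 -> Z_4, phi(x) = x mod 4 is a homomorphism of additive groups
G = range(12)
phi = lambda x: x % 4

# kernel: elements of G mapping to the unit element 0 of Z_4
kernel = [g for g in G if phi(g) == 0]
print(kernel)  # [0, 4, 8]

# group the elements of G by their image: each group is a coset of the kernel
cosets = {}
for g in G:
    cosets.setdefault(phi(g), set()).add(g)
print(sorted(cosets[1]))  # [1, 5, 9] -- the coset Ker(phi) + 1

# G/Ker(phi) and Im(phi) have the same number of elements: the cosets are
# in one-to-one correspondence with the image, as the theorem states
assert len(cosets) == len({phi(g) for g in G})
```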
<!-- ## Topological groups
# Rings
# Grids
# Bodies
-->
<h1 id="summary">Summary</h1>
<p>This post was quite heavy, introducing a lot of mathematical concepts and notation, but hopefully it provided some intuition for why abstract algebra is useful for the geometry of deep learning. Namely, it describes symmetries, and that is what we are after.</p>Patrik ReizingerYes, abstract algebra is actually useful for machine learning.LaTeX tricks2022-05-22T00:00:00+02:002022-05-22T00:00:00+02:00https://rpatrik96.github.io/posts/2022/05/latex-tricks<p>Improve typesetting and save space in your submissions; who does not want that?</p>
<h1 id="preamble">Preamble</h1>
<p>An academic paper is not just a messenger of hopefully ground-breaking results, but also a story and a visual manifesto. If it is badly formatted, with dangling words in almost-empty lines and inconsistent notation, readers might give up on reading. On the other hand, if everything looks nice but the article becomes too long, then besides risking a desk rejection (or paying extra fees for exceeding the page limit), readers will be less enthusiastic about facing so much text. Your results can be fascinating, but if no one reads them, you have failed your goal.</p>
<p>There are several best practices to ensure that your submission looks professional, eases the reader’s task, and fits into the page limit. I will assume that you are using LaTeX (you should), and provide my two cents on what I found especially useful.</p>
<h1 id="typesetting">Typesetting</h1>
<p>You want that your text looks great and saves space. This is how you do it.</p>
<h2 id="fighting-almost-empty-lines">Fighting almost-empty lines</h2>
<p>When the last few words of a paragraph spill onto a new line, it looks awkward and wastes a lot of space. The solution is to instruct LaTeX to squeeze the words together <em>a bit</em>. This can be done with the command <code class="language-plaintext highlighter-rouge">\looseness-1</code>, which you place in front of a paragraph; then enjoy the result.</p>
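<p>A minimal usage sketch (the paragraph text is just a placeholder):</p>

```latex
% place \looseness-1 directly before the offending paragraph
\looseness-1
This paragraph has a last line with only a word or two; TeX will try to
squeeze the paragraph by one line if it can do so without ugly spacing.
```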
<h2 id="keeping-in-text-equations-together">Keeping in-text equations together</h2>
<p>Equations should be kept together, i.e., they should not be spread across lines when provided in-text. The easy fix is to put curly braces around them. So <code class="language-plaintext highlighter-rouge">$y=f(x)$</code> can be split, but <code class="language-plaintext highlighter-rouge">${y=f(x)}$</code> cannot.</p>
<h2 id="more-compact-fractions">More compact fractions</h2>
<p>In-text fractions can take up a lot of space and destroy the homogeneity of the paragraph by requiring more space between lines. A possible solution is using <code class="language-plaintext highlighter-rouge">\usepackage{nicefrac} </code> and the <code class="language-plaintext highlighter-rouge">\nicefrac{}{}</code> command, which will save you space.</p>
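<p>A minimal sketch of its use (the sentence is a placeholder):</p>

```latex
\usepackage{nicefrac} % in the preamble

% in the text: \nicefrac typesets the fraction diagonally,
% saving vertical space compared to \frac
The step size is \nicefrac{1}{2} instead of $\frac{1}{2}$.
```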
<h2 id="more-compact-lists">More compact lists</h2>
<p>Both the <code class="language-plaintext highlighter-rouge">enumerate</code> and <code class="language-plaintext highlighter-rouge">itemize</code> environments waste a lot of space between lines by default. So much space also suggests less coherence between the items in the list. With the <code class="language-plaintext highlighter-rouge">nolistsep</code> option (provided by the <code class="language-plaintext highlighter-rouge">enumitem</code> package), the spacing between lines is reduced—the <code class="language-plaintext highlighter-rouge">leftmargin=*</code> option will save some more space by starting the items right at the left (pun intended).</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">\begin{itemize}</span>
<span class="c">% [nolistsep]</span>
[nolistsep,leftmargin=*]
<span class="k">\item</span> ...
<span class="k">\item</span> ...
<span class="nt">\end{itemize}</span>
</code></pre></div></div>
<h2 id="appendix-only-table-of-contents">Appendix-only table of contents</h2>
<p>Conference submissions practically do not allow the inclusion of a table of contents due to the page limit, but it can be helpful for the appendix. This can be done in the following way:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">% ----------------------</span>
<span class="c">%% include in preamble</span>
<span class="k">\usepackage</span><span class="na">[toc,page,header]</span><span class="p">{</span>appendix<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>minitoc<span class="p">}</span>
<span class="c">% makes the "Part I" text invisible</span>
<span class="k">\renewcommand</span> <span class="k">\thepart</span><span class="p">{}</span>
<span class="k">\renewcommand</span> <span class="k">\partname</span><span class="p">{}</span>
<span class="c">% ----------------------</span>
<span class="c">%% include in the appendix</span>
<span class="k">\addcontentsline</span><span class="p">{</span>toc<span class="p">}{</span>section<span class="p">}{</span>Appendix<span class="p">}</span> <span class="c">% Add the appendix text to the document TOC</span>
<span class="k">\part</span><span class="p">{</span>Appendix<span class="p">}</span> <span class="c">% Start the appendix part</span>
<span class="k">\parttoc</span> <span class="c">% Insert the appendix TOC</span>
</code></pre></div></div>
<h1 id="references">References</h1>
<p>LaTeX has commands such as <code class="language-plaintext highlighter-rouge">\eqref{}, \autoref{}, \ref{}</code> that work fine, though what I have started to like recently is the <code class="language-plaintext highlighter-rouge">cleveref</code> package with its <code class="language-plaintext highlighter-rouge">\cref{}</code> command, as it enables redefining the name LaTeX uses when referencing a table, figure, or section. For example, if you would like references to sections to print out “Section” with an upper-case “S”, then use the following command (the third set of curly braces is used to define the plural).</p>
<p>All you need is to include this snippet.</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>cleveref<span class="p">}</span>
<span class="k">\crefname</span><span class="p">{</span>section<span class="p">}{</span>Section<span class="p">}{</span>Sections<span class="p">}</span>
</code></pre></div></div>
<h2 id="backlinks-for-references-for-simpler-navigation">Backlinks for references for simpler navigation</h2>
<p>It can be annoying that clicking on a reference means we somehow need to navigate back to the same spot afterwards. Loading <code class="language-plaintext highlighter-rouge">hyperref</code> as <code class="language-plaintext highlighter-rouge">\usepackage[backref=page]{hyperref}</code> will show, for each reference, the page numbers where it was cited. Since these are active links, going back to the original line cannot be more straightforward.</p>
<h2 id="restating-theorems">Restating theorems</h2>
<p>When using environments for theorems, remarks, and co, it can be useful to restate them in the appendix to avoid the back-and-forth to the main text. Simply copy-pasting is not a good solution as that way a different number will be assigned to the second appearance of the same claim. With the <code class="language-plaintext highlighter-rouge">thmtools</code> package, there is a solution for this:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>amsthm<span class="p">}</span> <span class="c">% to have theorem environments in the first place</span>
<span class="k">\usepackage</span><span class="p">{</span>thmtools,thm-restate<span class="p">}</span>
<span class="k">\newtheorem</span><span class="p">{</span>thm<span class="p">}{</span>Theorem<span class="p">}</span>
<span class="nt">\begin{restatable}</span><span class="p">{</span>thm<span class="p">}{</span>nameofthm<span class="p">}</span>
This is true.
<span class="nt">\end{restatable}</span>
<span class="k">\nameofthm*</span> <span class="c">% this will repeat the theorem with the same number</span>
</code></pre></div></div>
<h2 id="referencing-items-in-a-list">Referencing items in a list</h2>
<p>If you need to reference an item in a list (a common use-case is referring to e.g. claims of a theorem), the <code class="language-plaintext highlighter-rouge">enumitem</code> package can help you:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>enumitem<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>cleveref<span class="p">}</span>
<span class="k">\newlist</span><span class="p">{</span>nameofenumeration<span class="p">}{</span>enumerate<span class="p">}{</span>2<span class="p">}</span> <span class="c">% define enumeration type</span>
<span class="k">\setlist</span><span class="na">[nameofenumeration]</span><span class="p">{</span>label=<span class="p">{</span><span class="k">\normalfont</span>(<span class="k">\roman*</span>)<span class="p">}</span>,ref=<span class="k">\thetheorem</span>(<span class="k">\roman*</span>)<span class="p">}</span> <span class="c">% setup the label, it will include the number of the theorem</span>
<span class="k">\crefname</span><span class="p">{</span>nameofenumerationi<span class="p">}{</span>property<span class="p">}{</span>properties<span class="p">}</span> <span class="c">% cleverref config</span>
<span class="nt">\begin{theorem}</span>
Theorem comes here with claims:
<span class="nt">\begin{nameofenumeration}</span>
<span class="k">\item</span> property one <span class="k">\label</span><span class="p">{</span>prop:1<span class="p">}</span>
<span class="k">\item</span> property two <span class="k">\label</span><span class="p">{</span>prop:2<span class="p">}</span>
<span class="nt">\end{nameofenumeration}</span>
<span class="nt">\end{theorem}</span>
</code></pre></div></div>
<h1 id="figure-placement">Figure placement</h1>
<p>A single figure will waste a lot of space if put into e.g. a <code class="language-plaintext highlighter-rouge">figure</code> environment. A possible solution is to use the <code class="language-plaintext highlighter-rouge">wrapfig</code> package, which lets LaTeX arrange text <em>around</em> the figure.</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>wrapfig<span class="p">}</span>
<span class="nt">\begin{wrapfigure}</span><span class="p">{</span>r<span class="p">}{</span>6.5cm<span class="p">}</span>
<span class="k">\centering</span>
<span class="k">\includegraphics</span>[height=7em,width=<span class="k">\linewidth</span>
]<span class="p">{</span>figures/fig.png<span class="p">}</span>
<span class="k">\caption</span><span class="p">{</span>Caption<span class="p">}</span>
<span class="k">\label</span><span class="p">{</span>figure:fig<span class="p">}</span>
<span class="nt">\end{wrapfigure}</span>
</code></pre></div></div>
<h1 id="notation-glossary">Notation, glossary</h1>
<p>It’s good practice to organize notation and abbreviations into a system, so if you need to change how you denote the input, you only need to do it in one central place.</p>
<h2 id="notation">Notation</h2>
<p>I have a separate <code class="language-plaintext highlighter-rouge">.tex</code> file for all my abbreviations, including concepts such as the KL divergence or the ELBO. All of these can be defined with the <code class="language-plaintext highlighter-rouge">\newacronym{}{}{}</code> command, whose three arguments are:</p>
<ol>
<li>The key you use to refer to the abbreviation,</li>
<li>The short form of the abbreviation,</li>
<li>The abbreviation written out.</li>
</ol>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newacronym</span><span class="p">{</span>ml<span class="p">}{</span>ML<span class="p">}{</span>Machine Learning<span class="p">}</span>
</code></pre></div></div>
<p>From this point on, <code class="language-plaintext highlighter-rouge">\gls{ml}</code> will print out Machine Learning (ML) for the first use, then only ML.</p>
<ul>
<li>If you need the plural, use <code class="language-plaintext highlighter-rouge">\glspl{ml}</code>,</li>
<li>For forcing the short version <code class="language-plaintext highlighter-rouge">\acrshort{ml}</code>,</li>
<li>For forcing the long version <code class="language-plaintext highlighter-rouge">\acrlong{ml}</code>,</li>
<li>For forcing the full (i.e., both the name spelled out and the abbreviation) version <code class="language-plaintext highlighter-rouge">\acrfull{ml}</code>.</li>
</ul>
<p>Besides enforcing consistency, a list of acronyms can be created with the <code class="language-plaintext highlighter-rouge">\printacronyms</code> command—as a bonus, the acronyms will be cross-referenced, so clicking on them will lead you to the list of acronyms. Handy, isn’t it?</p>
<h2 id="glossary">Glossary</h2>
<p>Notation is a crucial tool to refer to concepts in a short form and to formalize ideas. Consistency is key here, too. Fortunately, the <code class="language-plaintext highlighter-rouge">glossaries</code> package does this for us: we can organize notation into categories, then print them out such that the reader will be able to click on them and get reminded what we use the formula for.</p>
<p>This is what you need to include:</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="na">[acronym, automake, toc, nomain, nopostdot, style=tree, nonumberlist]</span><span class="p">{</span>glossaries<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>glossary-mcols<span class="p">}</span> <span class="c">% to have multiple columns</span>
<span class="k">\setglossarystyle</span><span class="p">{</span>mcolindex<span class="p">}</span>
<span class="c">%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%</span>
<span class="c">% Glossary</span>
<span class="k">\newglossary</span><span class="p">{</span>abbrev<span class="p">}{</span>abs<span class="p">}{</span>abo<span class="p">}{</span>Nomenclature<span class="p">}</span> <span class="c">%abs and abo are file extensions LaTeX will use internally for this set of formulas -- different glossaries should have different ones</span>
<span class="c">%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%</span>
<span class="c">% Top-level glossary entries</span>
<span class="k">\newglossaryentry</span><span class="p">{</span>lr<span class="p">}{</span>
name = <span class="k">\ensuremath</span><span class="p">{</span><span class="k">\alpha</span><span class="p">}</span> ,
description = <span class="p">{</span>learning rate<span class="p">}</span> ,
type = abbrev,
<span class="p">}</span>
<span class="c">% A separate category for mathematics, this will render all related notation </span>
<span class="c">% under a "Maths"" header</span>
<span class="k">\newglossaryentry</span><span class="p">{</span>math<span class="p">}{</span>type=abbrev,name=Maths,description=<span class="p">{</span><span class="k">\nopostdesc</span><span class="p">}}</span>
<span class="k">\newglossaryentry</span><span class="p">{</span>cov<span class="p">}{</span>
name = <span class="k">\ensuremath</span><span class="p">{</span><span class="k">\Sigma</span><span class="p">}</span> ,
description = <span class="p">{</span>covariance matrix<span class="p">}</span> ,
type = abbrev,
parent = math,
<span class="p">}</span>
</code></pre></div></div>
<p>For referencing the above entries, the same <code class="language-plaintext highlighter-rouge">\gls{}</code> command is used as for acronyms. The list of notation can be included by invoking the <code class="language-plaintext highlighter-rouge">\printglossary[type=abbrev, style=tree]</code> command (this will use a hierarchical style). When including <code class="language-plaintext highlighter-rouge">\setglossarysection{subsection}</code>, both the glossary and the acronyms will be at the subsection level.</p>
<p>The resulting structure will be:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Nomenclature
- $\alpha$ learning rate
Maths
- $\Sigma$ covariance matrix
</code></pre></div></div>
<blockquote>
<p>If the glossaries are not showing up (especially if you are on Overleaf), check whether the files are in a folder (common when uploading a <code class="language-plaintext highlighter-rouge">.zip</code>). If yes, move everything out of the folder.</p>
</blockquote>
<h3 id="fixing-hyperref-warnings">Fixing <code class="language-plaintext highlighter-rouge">hyperref</code> warnings</h3>
<p>When using <code class="language-plaintext highlighter-rouge">\gls{}, \glspl{}, \acrshort{}, \acrlong{}, \acrfull{}</code> in a caption, <code class="language-plaintext highlighter-rouge">hyperref</code> will warn about <code class="language-plaintext highlighter-rouge">Token not allowed in a PDF string</code>. To fix this, we can redefine these commands as</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\pdfstringdefDisableCommands</span><span class="p">{</span><span class="k">\def\gls</span>#1<span class="p">{</span><#1><span class="p">}</span><span class="c">%</span>
<span class="k">\def\glspl</span>#1<span class="p">{</span><#1><span class="p">}</span><span class="c">%</span>
<span class="k">\def\acrshort</span>#1<span class="p">{</span><#1><span class="p">}</span><span class="c">%</span>
<span class="k">\def\acrlong</span>#1<span class="p">{</span><#1><span class="p">}</span><span class="c">%</span>
<span class="k">\def\acrfull</span>#1<span class="p">{</span><#1><span class="p">}</span><span class="c">%</span>
<span class="p">}</span>
</code></pre></div></div>
<p>to get rid of the warning and have more meaningful bookmarks in the pdf.</p>
<h1 id="speeding-up-compilation-time">Speeding up compilation time</h1>
<p>Working on a large project with lots of files and figures can result in you sitting in front of your monitor and reading War and Peace before you can start handling the error messages. <code class="language-plaintext highlighter-rouge">\includeonly</code> comes to the rescue, as it restricts the files processed by the compiler to the specified ones (no leading or trailing spaces allowed).</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\includeonly</span><span class="p">{</span>a,b<span class="p">}</span>
</code></pre></div></div>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>I learned a lot of the tricks in this post from <a href="https://twitter.com/luigigres">Luigi Gresele</a> and <a href="https://twitter.com/JKugelgen">Julius von Kügelgen</a>.</p>Patrik ReizingerImprove typesetting and save space in your submissions, who does not want that?Bayesian Statistics - Techniques and Models flashcards2021-12-20T00:00:00+01:002021-12-20T00:00:00+01:00https://rpatrik96.github.io/posts/2021/12/bayes-stats2<p>It’s <em>again</em> a statistics deck.</p>
<h1 id="bayesian-statistics-techniques-and-models-flashcards">Bayesian Statistics: Techniques and Models flashcards</h1>
<p>I am now ready to pay my debt with the next flashcard deck, which accompanies yet another statistics course, namely <a href="https://www.coursera.org/learn/mcmc-bayesian-statistics">Bayesian Statistics: Techniques and Models</a> by Matthew Heiner on Coursera.</p>
<p>The link for the cards can be found on <a href="https://ankiweb.net/shared/info/704366777">ankiweb.net</a>. After downloading the deck, you can import them into the free Anki software. You can also access the deck in my earlier <a href="/posts/2021/09/causality-resources/">resources post</a>.</p>
<p>Happy learning and stay tuned for other decks to come!</p>
<p><em>P.S.: if you find any error, please contact me to help improve the material.</em></p>Patrik ReizingerIt’s again a statistics deck.Pearls of Causality #11: Front- and Back-Door Adjustment2021-12-13T00:00:00+01:002021-12-13T00:00:00+01:00https://rpatrik96.github.io/posts/2021/12/poc11-front-back-door-adjustment<p>Two ways to shut the door before confounding enters the scene.</p>
<h3 id="poc-post-series">PoC Post Series</h3>
<ul>
<li><a href="/posts/2021/12/poc10-interventions-and-identifiability.html/">PoC #10: Interventions and Identifiability</a></li>
<li>➡️ <a href="/posts/2021/12/poc11-front-back-door-adjustment.html/">PoC #11: Front- and Back-Door Adjustment</a></li>
</ul>
<h1 id="back-door-adjustment">Back-Door Adjustment</h1>
<p>Last time, we saw how to adjust for direct causes by giving conditions on which variables we need to observe: to calculate $P(y|do(X=x))$, we need $Y, X, Pa_X$. This post gives two more general criteria that can be applied to DAGs to test whether the adjustment conditions are satisfied.</p>
<p>The main idea behind the generalization is the fact that not only $Pa_X$ can block the incoming paths to $X$. As these paths come from the non-descendants of $X$ and the edges point toward $X$, the whole concept is thought of as having (and, to screen off confounding, blocking) a <em>back door</em>.</p>
<blockquote>
<p>A variable set $Z$ satisfies the <strong>Back-Door Criterion</strong> to an ordered pair of variables $(X, Y)$ in a DAG if:</p>
<ol>
<li>nodes in $Z$ are non-descendants of $X$</li>
<li>$Z$ blocks every incoming path into $X$</li>
</ol>
</blockquote>
<p>The Back-Door Criterion makes a statement about an <em>ordered pair</em>; i.e., $Y$ is a descendant of $X$ (there is a path from $X$ to $Y$). The <strong>first condition</strong> generalizes the requirement of observing $Pa_X$: any non-descendants of $X$ suffice, provided they block every incoming path into $X$ - as the <strong>second condition</strong> requires.</p>
<p>That is, we are looking for a set of variables $Z$ such that every path $X \leftarrow \dots - Z - \dots - Y$ is blocked. Note that here $-$ stands for any of $\leftrightarrow, \rightarrow, \leftarrow$. The only constraint is that the path has an edge pointing into $X$. As we want to reason about the effect of $X$ on $Y$, we need to leave the paths from $X$ to $Y$ <em>unblocked</em> but all paths <em>into</em> $X$ <em>blocked</em>.</p>
<p>After understanding the Back-Door Criterion, we can apply this to calculate interventional distributions.</p>
<blockquote>
<p>If a variable set $Z$ satisfies the Back-Door Criterion relative to $(X, Y)$ then the effect of $X$ on $Y$ is given by:</p>
</blockquote>
\[P(y|do(X=x)) = \sum_z P(y|x,z)P(z)\]
<p>This is the <em>same</em> formula we had for adjusting for direct causes. Nonetheless, the scenarios where we can apply it are more general.</p>
<p>The formula can be interpreted as <em>dividing</em> the data into categories by the values of $Z$ and $X$ (this is also called <em>stratifying</em>) and calculating the weighted average of the <em>strata</em> (this is the fancy plural form expressing data categories).
By conditioning on these two variables, we make the strata independent of each other - as $Z$ blocks the Back-Door paths, conditioning on $X$ is the same as $do(X=x)$. Note that for general $Z$ this would not be the case.</p>
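To see the formula in action, here is a minimal Python sketch of a binary model $Z \rightarrow X$, $Z \rightarrow Y$, $X \rightarrow Y$, where $Z$ satisfies the Back-Door Criterion relative to $(X, Y)$ - all probabilities below are made-up numbers for illustration:

```python
# Toy binary model with confounder Z: Z -> X, Z -> Y, X -> Y.
# All probabilities are made-up numbers for illustration.
P_z = {0: 0.7, 1: 0.3}                                           # P(Z=z)
P_x1_z = {0: 0.2, 1: 0.8}                                        # P(X=1 | Z=z)
P_y1_xz = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.4, (1, 1): 0.9}   # P(Y=1 | X=x, Z=z)

def bern(p1, v):
    return p1 if v == 1 else 1 - p1

def p_y1_do_x(x):
    """Back-Door Adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) P(z)."""
    return sum(P_y1_xz[(x, z)] * P_z[z] for z in (0, 1))

def p_y1_given_x(x):
    """Plain conditioning: P(Y=1 | X=x) = sum_z P(Y=1 | x, z) P(z | x)."""
    p_x = sum(bern(P_x1_z[z], x) * P_z[z] for z in (0, 1))
    return sum(P_y1_xz[(x, z)] * bern(P_x1_z[z], x) * P_z[z] / p_x
               for z in (0, 1))
```

Here $P(Y=1 \mid do(X=1)) = 0.5\cdot 0.7 + 0.9\cdot 0.3 = 0.62$, whereas plain conditioning gives roughly $0.75$ - the gap is exactly the confounding through $Z$.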
<h1 id="front-door-adjustment">Front-Door Adjustment</h1>
<p>The Back-Door Adjustment formula is nice, but unfortunately it is sometimes not applicable. It can be quite a strong assumption that we can observe a sufficient set of variables blocking <em>all</em> Back-Door paths.</p>
<p>The intuition for the more general formula of Front-Door Adjustment comes from the <em>genius observation</em> that houses usually have a <em>front entrance</em>, not just a back one.</p>
<blockquote>
<p>A variable set $Z$ satisfies the <strong>Front-Door Criterion</strong> to an ordered pair of variables $(X, Y)$ in a DAG if:</p>
<ol>
<li>$Z$ blocks every directed path from $X$ to $Y$</li>
<li>There is no back-door path from $X$ to $Z$</li>
<li>All back-door paths from $Z$ to $Y$ are blocked by $X$</li>
</ol>
</blockquote>
<p>Let’s work through these three conditions.</p>
<ol>
<li>The <strong>first condition</strong> states the conditional independence $X\perp Y | Z$.</li>
<li>The <strong>second condition</strong> postulates that $P(z|x)=P(z|do(X=x))$ - note that the first condition says that $Z$ must be in between $X$ and $Y$, i.e., $Z$ is a descendant of $X$.</li>
<li>The <strong>third condition</strong> says that $X$ acts as a Back-Door for the effect of $Z$ on $Y$. So the effect of $Z$ on $Y$ can be calculated by the Back-Door Adjustment formula.</li>
</ol>
<p>These conditions result in a formula that applies Back-Door Adjustment twice: once for calculating the effect of $X$ on $Z$ and once for using $X$ as a Back-Door for estimating the effect of $Z$ on $Y$.</p>
<blockquote>
<p>If a variable set $Z$ satisfies the Front-Door Criterion relative to $(X, Y)$ and if $P(x,z) >0$ then the effect of $X$ on $Y$ is given by:
\(P(y|do(X=x)) = \sum_z P(z|x)\sum_{x'}P(y|x', z)P(x')\)</p>
</blockquote>
<p>The <strong>outer sum</strong> is the effect of $X$ on $Z$; the second condition ensures that the conditional is the same as the interventional distribution. The <strong>inner sum</strong> is the effect of $Z$ on $Y$, calculated by the Back-Door Adjustment formula.</p>
<p>The requirement for a positive $P(x,z)$ distribution makes sure that the conditional $P(y|x,z)$ is well-defined - meaning that all $x,z$ combinations yield meaningful strata.</p>
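As a sanity check of the formula, below is a small Python sketch (again with made-up numbers) of the classic front-door graph $X \rightarrow Z \rightarrow Y$ with a latent confounder $U$ of $X$ and $Y$. The front-door estimate - computed from the observed $P(x,z,y)$ alone - matches the ground-truth interventional distribution computed from the full SCM:

```python
from itertools import product

# Toy SCM: X -> Z -> Y with a latent confounder U -> X, U -> Y.
# All probabilities are made-up numbers for illustration.
P_u = {0: 0.5, 1: 0.5}
P_x1_u = {0: 0.2, 1: 0.9}                                        # P(X=1 | U=u)
P_z1_x = {0: 0.1, 1: 0.8}                                        # P(Z=1 | X=x)
P_y1_zu = {(0, 0): 0.2, (1, 0): 0.6, (0, 1): 0.3, (1, 1): 0.9}   # P(Y=1 | Z=z, U=u)

def bern(p1, v):
    return p1 if v == 1 else 1 - p1

# Observational joint over the *observed* variables only (U is marginalized out).
P_xzy = {}
for x, z, y, u in product((0, 1), repeat=4):
    p = P_u[u] * bern(P_x1_u[u], x) * bern(P_z1_x[x], z) * bern(P_y1_zu[(z, u)], y)
    P_xzy[(x, z, y)] = P_xzy.get((x, z, y), 0.0) + p

def P(**fixed):
    """Marginal of the observed joint, e.g. P(x=1, z=0)."""
    return sum(p for (x, z, y), p in P_xzy.items()
               if all(dict(x=x, z=z, y=y)[k] == v for k, v in fixed.items()))

def front_door(y, x):
    """P(y | do(X=x)) via the Front-Door Adjustment formula (observational data only)."""
    return sum(P(x=x, z=z) / P(x=x)
               * sum(P(x=xp, z=z, y=y) / P(x=xp, z=z) * P(x=xp) for xp in (0, 1))
               for z in (0, 1))

def ground_truth(y, x):
    """P(y | do(X=x)) computed directly from the (normally inaccessible) SCM."""
    return sum(P_u[u] * bern(P_z1_x[x], z) * bern(P_y1_zu[(z, u)], y)
               for u in (0, 1) for z in (0, 1))
```

With these numbers, both routes give $P(Y=1 \mid do(X=1)) = 0.65$ and $P(Y=1 \mid do(X=0)) = 0.30$.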
<h1 id="summary">Summary</h1>
<p>Our endeavor to find ways to adjust for confounding resulted in two practical formulas. Now we can fight confounding. Of course, this requires that we know that confounding is present with a specific structure.</p>Patrik ReizingerTwo ways to shut the door before confounding enters the scene.Pearls of Causality #10: Interventions and Identifiability2021-12-06T00:00:00+01:002021-12-06T00:00:00+01:00https://rpatrik96.github.io/posts/2021/12/poc10-interventions-and-identifiability<p>Interventions in disguise.</p>
<h3 id="poc-post-series">PoC Post Series</h3>
<ul>
<li><a href="/posts/2021/11/poc9-causes/">PoC #9: Potential, Genuine, Temporal Causes and Spurious Association</a></li>
<li>➡️ <a href="/posts/2021/12/poc10-interventions-and-identifiability.html/">PoC #10: Interventions and Identifiability</a></li>
</ul>
<h1 id="interventions">Interventions</h1>
<p>We will revisit interventions in this post. As discussed in <a href="/posts/2021/11/poc4-causal-queries/">PoC #4</a>, interventions can provide more information than observational data only. What does this “more information” look like? Recall that for interventions we need a DAG besides the joint distribution. When we intervene, we modify the DAG by removing the incoming edges of the intervened node. This has an effect on the Markov factorization, which can be expressed in multiple ways, each offering a different interpretation.</p>
<p>Before we start, let me share a quote with you from Jonas Peters’ <a href="https://www.youtube.com/watch?v=zvrcyqcN9Wo&ab_channel=BroadInstitute">lecture on causality</a> at MIT in 2017. He calls this MUTE, the Most Useful Tautology:</p>
<blockquote>
<p>If we intervene only on $X$, we intervene only on $X$.</p>
</blockquote>
<p>This will help us: MUTE means that all other conditional distributions will not change, so it is not hopeless to calculate interventional distributions from observational data. Believe me, you will see it soon.</p>
<h2 id="causal-effect-with-do-calculus">Causal Effect with do-calculus</h2>
<p>First, let’s define a causal effect with do-calculus.</p>
<blockquote>
<p>Given disjoint sets of variables, $X$ and $Y$, the <strong>causal effect</strong> of $X$ on $Y$ is denoted by $P(y | do(X=x))$. It gives the probability of $Y = y, \forall x$ in the SEM with all incoming edges of the node $X$ and the equation $x = f(pa_x, u_x)$ <em>deleted</em> and setting $X = x$ in the remaining equations.</p>
</blockquote>
<p>This definition contains nothing new, it uses the $do$-notation to express the probability of $Y$ when we intervene on $X$. There are multiple ways to calculate and to conceptualize this causal effect, as we will see in the next sections.</p>
<h2 id="interventions-as-variables">Interventions as Variables</h2>
<p>We can think of interventions $do(X_i = x_i’)$ in a DAG with variables $X_1, \dots, X_n$ as if we flipped a switch to make $X_i := x_i’$. That is, there are two mechanisms to determine the value of $X_i$: the conditional $P(x_i|pa_i)$ and the intervention $do(X_i = x_i’)$. Of course, the there-are-two-mechanisms view has the same effect; it only differs in interpretation. The main advantage is that we can explicitly <strong>incorporate the intervention in a single DAG</strong>; i.e., no need to mess around with deleting edges.</p>
<p>To do this, we augment node $X_i$ in the DAG with an additional parent $F_i$, yielding $Pa_i’ = Pa_i \cup \{F_i\}$, where $F_i \in \{do(X_i = x_i’), idle\}$ - meaning that $F_i$ is the “switch between the two mechanisms” determining $X_i$.</p>
<p>The intervention is encoded via the added edge $F_i \rightarrow X_i$, yielding the conditional</p>
\[P\left(x_i | pa_i'\right) = \begin{cases}P(x_i | pa_i), \ \ \qquad\quad \mathrm{if} \ \ F_i =idle \\
0, \qquad\qquad\qquad\quad \mathrm{if} \ \ F_i = do(X_i = x_i') \wedge x_i \neq x_i' \\
1, \qquad\qquad\qquad\quad \mathrm{if} \ \ F_i = do(X_i = x_i') \wedge x_i = x_i'
\end{cases}\]
<p>The reason why we need to differentiate between $x_i \neq x_i’$ and $x_i = x_i’$ in the case of the intervention is to remain consistent (if we set $X_i$ to $x_i’$, then all $x_i\neq x_i’$ must have 0 probability).</p>
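The case distinction above fits into a few lines of Python - note that the encoding of $F_i$ as a string/tuple is my own, illustrative choice, as is the made-up mechanism:

```python
def augmented_conditional(x_i, p_cond, F_i):
    """P(x_i | pa_i') in the augmented DAG: p_cond is the ordinary mechanism
    P(x_i | pa_i); F_i is either "idle" or ("do", x_target)."""
    if F_i == "idle":
        return p_cond(x_i)                      # first case: mechanism untouched
    _, x_target = F_i                           # F_i = ("do", x_i')
    return 1.0 if x_i == x_target else 0.0      # second and third cases


# A made-up mechanism P(x_i | pa_i) for a binary variable:
def p_cond(x):
    return 0.3 if x == 1 else 0.7
```

Under $F_i = idle$, `augmented_conditional(1, p_cond, "idle")` simply returns $P(X_i=1|pa_i) = 0.3$; under $F_i = do(X_i = 1)$, the value $1$ gets probability one and $0$ gets probability zero.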
<h2 id="interventions-as-truncated-factorization">Interventions as Truncated Factorization</h2>
<p>Having discussed the effect of an intervention, we can now express the joint distribution in the case of $do(X_i = x_i’)$. The straightforward way is to start from the Markov factorization $\prod_{j} P(x_j|pa_j)$ and leave out the factor $P(x_i | pa_i)$. We can do this because, by intervening on $X_i$, we have $P(x_i | pa_i, do(X_i = x_i’))= P(x_i’ | pa_i, do(X_i = x_i’))=1$:</p>
\[P\left(x_1, \dots, x_n | do(X_i=x'_i)\right) = \begin{cases}\prod_{j\neq i} P(x_j|pa_j), \quad \mathrm{if} \ \ x_i = x_i' \\
0, \qquad\qquad\qquad\quad \mathrm{if} \ \ x_i \neq x'_i
\end{cases}\]
<blockquote>
<p>This expression shows the <a href="/posts/2021/10/poc1-dags-d-sep/">ICM Principle</a> at work: only the mechanism intervened on changes, everything else remains the same.</p>
</blockquote>
<h3 id="compound-interventions">Compound Interventions</h3>
<p>This notation can also handle <strong>compound interventions</strong>, i.e., when we intervene on multiple variables at the same time. If we denote the set of variables we intervene on with $S$, then we can write</p>
\[P\left(x_1, \dots, x_n | do(S=s)\right) = \begin{cases}\prod_{i : X_i \not\in S}P(x_i|pa_i), \qquad\quad \mathrm{if} \ \ X \mathrm{\ consistent\ with\ } S\\
0, \qquad\quad \ \ \mathrm{otherwise}
\end{cases}\]
<h2 id="interventions-and-the-preinterventional-distribution">Interventions and the Preinterventional Distribution</h2>
<p>It’s also interesting to figure out the relationship between the interventional and the original <em>(preinterventional)</em> distribution. The expression follows from the truncated factorization by extending it with $\frac{P(x_i’|pa_i)}{P(x_i’|pa_i)}$ and then noticing that we have all factors of the joint in the numerator. The numerator will be the joint distribution <em>before the intervention</em>, $P\left(x_1, \dots, x_n\right), \mathrm{s.t.} \ x_i = x_i’$ - thus the name <em>preinterventional</em> distribution. The denominator will be the factor $P(x_i’|pa_i)$. Note that we need to tie $x_i = x_i’$, as otherwise the expression would be inconsistent with $do(X_i = x_i’)$.</p>
\[P\left(x_1, \dots, x_n | do(X_i=x'_i)\right) = \begin{cases}\dfrac{P\left(x_1, \dots, x_n\right)}{P(x_i'|pa_i)}, \ \qquad\quad \mathrm{if} \ \ x_i = x_i' \\
0, \qquad\qquad\qquad\qquad\quad \mathrm{if} \ \ x_i \neq x'_i
\end{cases}\]
<blockquote>
<p>Besides satisfying our intrinsic strive for mathematical diversity and beauty, this expression makes the difference between interventions and conditioning clear. The two coincide only when intervening on root nodes - i.e., when $Pa_i = \emptyset$, where $P(x_i’|pa_i=\emptyset) = P(x_i|pa_i=\emptyset, do(X_i = x_i’))=P(x_i’)$ (cf. Causal Bayesian Networks in <a href="/posts/2021/11/poc4-causal-queries/">PoC #4</a>).</p>
</blockquote>
<p>Conditioning on $X_i = x_i’$ can be thought of as a two-step process:</p>
<ol>
<li><strong>Reduction</strong> of the probability distribution (dropping the entries of the joint that are inconsistent with $X_i = x’_i$).</li>
<li><strong>Renormalization</strong> of the probabilities to get a distribution.</li>
</ol>
<blockquote>
<p>This means that conditioning distributes the probability mass over <strong>all remaining values</strong> (i.e., where in the joint we have $X_i = x_i’$) <strong>equally</strong> in the sense that the same normalizing factor is applied in each case.</p>
</blockquote>
<p>The situation could not have been more different when we intervene on $X_i$.</p>
<blockquote>
<p>In the interventional case, each excluded point (where $x_i \neq x_i’$) transfers its probability to a <strong>subset of points</strong> sharing the same value of $pa_i$. That is, depending on $pa_i$, <strong>different normalization constants</strong> are applied.</p>
</blockquote>
<h3 id="example">Example</h3>
<p>Assume that we have a graph $X\rightarrow Y\rightarrow Z$ with all variables being binary. To ensure that the conditional independencies hold in the joint distribution (i.e., that the Markov factorization is the one implied by the graph - or, more precisely, that the graph is a perfect I-map), we specify the joint via its factors. The first ingredient is the marginal for $X$:</p>
<table>
<thead>
<tr>
<th>$X$</th>
<th>$P(X)$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$0$</td>
<td>$0.6$</td>
</tr>
<tr>
<td>$1$</td>
<td>$0.4$</td>
</tr>
</tbody>
</table>
<p>The second ingredient is the conditional for $Y$ ensuring $P(Y|X) = P(Y|X,Z)$:</p>
<table>
<thead>
<tr>
<th>$X$</th>
<th>$Y$</th>
<th>$P(Y|X)$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$0$</td>
<td>$0$</td>
<td>$0.8$</td>
</tr>
<tr>
<td>$0$</td>
<td>$1$</td>
<td>$0.2$</td>
</tr>
<tr>
<td>$1$</td>
<td>$0$</td>
<td>$0.5$</td>
</tr>
<tr>
<td>$1$</td>
<td>$1$</td>
<td>$0.5$</td>
</tr>
</tbody>
</table>
<p>And the third one is the conditional for $Z$ ensuring $P(Z|Y) = P(Z|X,Y)$:</p>
<table>
<thead>
<tr>
<th>$Y$</th>
<th>$Z$</th>
<th>$P(Z|Y)$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$0$</td>
<td>$0$</td>
<td>$0.9$</td>
</tr>
<tr>
<td>$0$</td>
<td>$1$</td>
<td>$0.1$</td>
</tr>
<tr>
<td>$1$</td>
<td>$0$</td>
<td>$0.3$</td>
</tr>
<tr>
<td>$1$</td>
<td>$1$</td>
<td>$0.7$</td>
</tr>
</tbody>
</table>
<p>That is, the probabilities populate the following table with eight entries:</p>
<table>
<thead>
<tr>
<th>$X$</th>
<th>$Y$</th>
<th>$Z$</th>
<th>$P(X,Y,Z)$</th>
<th>$P(X,Y,Z|Y=1)$</th>
<th>$P(X,Y,Z|do(Y=1))$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$0$</td>
<td>$0$</td>
<td>$0$</td>
<td>$0.432$</td>
<td>$0$</td>
<td>$0$</td>
</tr>
<tr>
<td>$0$</td>
<td>$0$</td>
<td>$1$</td>
<td>$0.048$</td>
<td>$0$</td>
<td>$0$</td>
</tr>
<tr>
<td>$0$</td>
<td>$1$</td>
<td>$0$</td>
<td>$0.036$</td>
<td>$0.036/0.32=0.1125$</td>
<td>$0.036/0.2=0.6*0.3=0.18$</td>
</tr>
<tr>
<td>$0$</td>
<td>$1$</td>
<td>$1$</td>
<td>$0.084$</td>
<td>$0.084/0.32=0.2625$</td>
<td>$0.084/0.2=0.6*0.7=0.42$</td>
</tr>
<tr>
<td>$1$</td>
<td>$0$</td>
<td>$0$</td>
<td>$0.18$</td>
<td>$0$</td>
<td>$0$</td>
</tr>
<tr>
<td>$1$</td>
<td>$0$</td>
<td>$1$</td>
<td>$0.02$</td>
<td>$0$</td>
<td>$0$</td>
</tr>
<tr>
<td>$1$</td>
<td>$1$</td>
<td>$0$</td>
<td>$0.06$</td>
<td>$0.06/0.32=0.1875$</td>
<td>$0.06/0.5=0.4*0.3=0.12$</td>
</tr>
<tr>
<td>$1$</td>
<td>$1$</td>
<td>$1$</td>
<td>$0.14$</td>
<td>$0.14/0.32=0.4375$</td>
<td>$0.14/0.5=0.4*0.7=0.28$</td>
</tr>
</tbody>
</table>
<p>We can calculate the marginal $P(Y=1) = 0.036+0.084+0.06+0.14 = 0.32$ that we will need for calculating interventions from the preinterventional distribution. I included in the fourth column <em>(you thought a CS guy would start indexing from 1?)</em> the probabilities when conditioning on $Y=1$, whereas the fifth column includes the interventional probabilities (calculated both from the preinterventional distribution and with the truncated factorization).</p>
<p>Notice that in the interventional case, we divide the probabilities by different values (depending on the parent of $Y$, i.e., $X$). A sanity check is that in both the conditioning and the interventional case, the probabilities add up to 1.</p>
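The whole table can be reproduced in a few lines of Python - only the three tables above go in, the rest follows from the formulas (conditioning renormalizes by the single constant $P(Y=1)=0.32$, whereas the intervention drops the factor $P(y|x)$):

```python
from itertools import product

# The chain X -> Y -> Z with the three tables from above.
P_x = {0: 0.6, 1: 0.4}
P_y_x = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.5}  # P(Y=y | X=x), key (x, y)
P_z_y = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}  # P(Z=z | Y=y), key (y, z)

# Markov factorization of the joint.
joint = {(x, y, z): P_x[x] * P_y_x[(x, y)] * P_z_y[(y, z)]
         for x, y, z in product((0, 1), repeat=3)}

p_y1 = sum(p for (x, y, z), p in joint.items() if y == 1)  # P(Y=1) = 0.32

# Conditioning on Y=1: reduce, then renormalize with one global constant.
cond = {k: (p / p_y1 if k[1] == 1 else 0.0) for k, p in joint.items()}

# Intervening do(Y=1): the truncated factorization drops the factor P(y | x).
do = {(x, y, z): (P_x[x] * P_z_y[(1, z)] if y == 1 else 0.0)
      for (x, y, z) in joint}

# Equivalently: divide each surviving entry of the preinterventional joint by its
# own parent-dependent factor P(Y=1 | x).
for (x, y, z), p in joint.items():
    if y == 1:
        assert abs(do[(x, y, z)] - p / P_y_x[(x, 1)]) < 1e-12
```

For instance, `cond[(0, 1, 0)]` gives $0.036/0.32 = 0.1125$, while `do[(0, 1, 0)]` gives $0.036/0.2 = 0.18$ - matching the table.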
<h2 id="interventions-as-conditioning">Interventions as Conditioning</h2>
<p>Although generally intervening on $X_i$ is different from conditioning on $X_i$, we can use conditioning to express the intervention as well.</p>
<p>We start from the joint distribution, then by using the chain rule of Bayesian networks, we “extract” $P(x_i’|pa_i)$ and $P(pa_i)$. As the intervention makes $P(x_i’|pa_i) =1$, we can simplify the expression:
\(\begin{align*}
P\left(x_1, \dots, x_n | do(X_i=x'_i)\right) &= P\left(x_1, \dots, x_n|x_i',pa_i\right)P(x_i'|pa_i)P(pa_i) \\
&= P\left(x_1, \dots, x_n|x_i',pa_i\right)P(pa_i)
\end{align*}\)</p>
<p>Our manipulation requires that $x_i = x_i’$, so the resulting expression includes two cases:
\(P\left(x_1, \dots, x_n | do(X_i=x'_i)\right) = \begin{cases}P\left(x_1, \dots, x_n|x_i',pa_i\right)P(pa_i), \ \ \quad \mathrm{if} \ \ x_i = x_i' \\
0, \qquad\qquad\qquad\qquad\qquad\qquad\quad \mathrm{if} \ \ x_i \neq x'_i
\end{cases}\)</p>
<blockquote>
<p>If you now focus on the formulas we came up with in this section, you will realize that they express interventions <strong>without</strong> using any interventional distribution. This means that in specific cases, we are able to <strong>calculate the effect of an intervention from observational distributions</strong>.</p>
</blockquote>
<h1 id="adjustment-for-direct-causes">Adjustment for Direct Causes</h1>
<p>We will use the last formulation - interventions as conditioning - to calculate the effect of an intervention from observational data. This is called <strong>adjustment for direct causes</strong>.</p>
<blockquote>
<p>Let $PA_i$ be the set of direct causes (parents) of $X_i$, and let $Y$ be any set of variables disjoint from $\{X_i\} \cup PA_i$. The effect of the intervention $do(X_i = x_i’)$ on $Y$ is</p>
</blockquote>
\[P\left(y | do(X_i=x'_i)\right) = \sum_{pa_i}P(y|x_i',pa_i)P(pa_i)\]
<p>We need to marginalize over $Pa_i$, as we are interested in $P\left(y | do(X_i=x’_i)\right)$. We can use this formula because $Pa_i$ screens off any effect on $X_i$ coming from other non-descendants of $X_i$ - i.e., if we know the value $Pa_i = pa_i$, then all other variables do not matter; we have the information needed to determine the value/distribution of $X_i$.</p>
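As a quick numerical sketch (all numbers are made up for illustration), take $W = Pa_X$ in the graph $W \rightarrow X$, $W \rightarrow Y$, $X \rightarrow Y$. The adjustment formula - with every factor read off the observational joint - agrees with the intervention computed from the model with $X$’s equation deleted:

```python
from itertools import product

# Toy graph with W = Pa_X: W -> X, W -> Y, X -> Y. Made-up numbers for illustration.
P_w = {0: 0.6, 1: 0.4}
P_x1_w = {0: 0.3, 1: 0.7}                                        # P(X=1 | W=w)
P_y1_xw = {(0, 0): 0.1, (1, 0): 0.6, (0, 1): 0.2, (1, 1): 0.8}   # P(Y=1 | X=x, W=w)

def bern(p1, v):
    return p1 if v == 1 else 1 - p1

# Observational joint P(w, x, y).
joint = {(w, x, y): P_w[w] * bern(P_x1_w[w], x) * bern(P_y1_xw[(x, w)], y)
         for w, x, y in product((0, 1), repeat=3)}

def adjust_direct_causes(y, x):
    """P(y | do(X=x)) = sum_w P(y | x, w) P(w), every factor estimated from the joint."""
    total = 0.0
    for w in (0, 1):
        p_w = sum(joint[(w, xx, yy)] for xx in (0, 1) for yy in (0, 1))
        p_xw = sum(joint[(w, x, yy)] for yy in (0, 1))
        total += joint[(w, x, y)] / p_xw * p_w      # P(y | x, w) * P(w)
    return total

def mutilated(y, x):
    """The same effect from the model with X's equation deleted and X := x."""
    return sum(P_w[w] * bern(P_y1_xw[(x, w)], y) for w in (0, 1))
```

With these numbers, both routes give $P(Y=1\mid do(X=1)) = 0.6\cdot 0.6 + 0.8\cdot 0.4 = 0.68$.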
<h1 id="identifiability-of-causal-effects">Identifiability of Causal Effects</h1>
<p>We have seen that we can do some black magic with the observational distributions to get the effect of an intervention. However, this is not always possible. In this section, we will get acquainted with the formal notion of identifiability, then discuss conditions for causal effect identifiability.</p>
<h2 id="identifiability">Identifiability</h2>
<p><strong>Identifiability</strong> in a general sense states that <strong>some quantity</strong> (intervention, likelihood, mean, etc.) <strong>can be computed uniquely</strong>.</p>
<blockquote>
<p>Let $Q(M)$ be any computable quantity of a model $M$. $Q$ is identifiable in a model class $\mathcal{M}$ if, for any model pairs $M_1,M_2 \in \mathcal{M}$ it holds that $Q(M_1) = Q(M_2)$ whenever $P_{M_1}(v) = P_{M_2}(v)$.</p>
</blockquote>
<p>The definition implies that we have an “identifiability mapping” from probability distributions of models to the space of a quantity $h(M,v) : P_{M}(v) \rightarrow Q(M)$ where the same $P_M(v)$ values map to the same $Q(M)$.</p>
<p>The definition can be extended to the case when there are hidden variables; then, we use the <em>observed</em> subset of $P_M(v)$.</p>
<h2 id="causal-effect-identifiability">Causal Effect Identifiability</h2>
<p>For causal effects, identifiability is defined as follows:</p>
<blockquote>
<p>The causal effect of $X$ on $Y$ is identifiable from a graph $G$ if $P(y | do(X=x))$ can be computed <strong>uniquely</strong> from any positive probability distribution of the <strong>observed variables</strong>, i.e., $P_{M_1}(y | do(X=x))=P_{M_2}(y | do(X=x))$ for every pair of models $M_1$ and $M_2$ with $P_{M_1}(v) = P_{M_2} (v) > 0$ and $G (M_1) = G (M_2) = G$.</p>
</blockquote>
<p>Again, uniqueness is the key in the definition - the positivity assumption is required to exclude edge cases (e.g., when dividing by 0). The identifiability of $P(y | do(X=x))$ ensures inferring the effect $do(X = x)$ on $Y$ from:</p>
<ol>
<li>passive observations, given by the observational distribution $P(v)$; and</li>
<li>the causal graph $G$, which specifies (qualitatively) the mechanisms/parent-child relationships</li>
</ol>
<p>This definition mirrors the adjustment for direct causes, where we used knowledge both from observations and the graph.</p>
<h2 id="causal-effect-identifiability-in-markovian-models">Causal Effect Identifiability in Markovian Models</h2>
<p>Looking into a more specialized model class, namely, Markovian Models (where we have a DAG and the noises are <em>jointly</em> independent), we can state the following result:</p>
<blockquote>
<p>In a Markovian causal model $M = \langle G, \theta_G \rangle$ with a subset $V$ of all variables being observed, the causal effect $P(y|do(X=x))$ is identifiable whenever $\{X \cup Y \cup Pa_X\} \subseteq V$.</p>
</blockquote>
<p>We need observability of $X, Y, Pa_X$ to use the adjustment for direct causes. This is required to calculate the quantities in the adjustment formula above.</p>
<blockquote>
<p>When all variables are measured (i.e., when we are <em>extremely lucky</em>), the causal effect can be calculated via the truncated factorization.</p>
</blockquote>
<h1 id="summary">Summary</h1>
<p>This post opened the door to the intricate details of calculating interventions, discussing various formulations and interpretations. At the end, we also touched on the topic of identifiability.</p>
<p>However, we have not yet covered the case of confounding - an undoubtedly more realistic scenario. You can probably figure out what comes next then.</p>Patrik ReizingerInterventions in disguise.Pearls of Causality #9: Potential, Genuine, Temporal Causes and Spurious Association2021-11-29T00:00:00+01:002021-11-29T00:00:00+01:00https://rpatrik96.github.io/posts/2021/11/poc9-causes<p>Hitting the nail on its arrowhead, a.k.a. when does $X$ cause $Y$?</p>
<h3 id="poc-post-series">PoC Post Series</h3>
<ul>
<li><a href="/posts/2021/11/poc8-ic-ic-ic">PoC #8: Inferred Causation, $IC$, and ${IC}^*$</a></li>
<li>➡️ <a href="/posts/2021/11/poc9-causes/">PoC #9: Potential, Genuine, Temporal Causes and Spurious Association</a></li>
<li><a href="/posts/2021/12/poc10-interventions-and-identifiability.html/">PoC #10: Interventions and Identifiability</a></li>
</ul>
<h1 id="time-agnostic-relationships">Time-agnostic Relationships</h1>
<p>Causal inference in practice is often ambiguous: depending on the available data, we might only be able to identify the graph up to its Markov equivalence class. The presence of latent variables increases the space of possible structures; thus, it is worth defining a set of concepts to express the different variants of cause-effect relationships.</p>
<p>For this, we will start from $IC^*$ and will characterize the edges from a causal perspective. The algorithm results in a PDAG with four edge types, as discussed in <a href="/posts/2021/11/poc8-ic-ic-ic">PoC #8</a>. We can think of those as expressing our “certainty” about whether a relationship is causal:</p>
<ol>
<li>$a\rightarrow^{*} b$: $a$ causes $b$</li>
<li>$a\rightarrow b$: $a$ <em>potentially</em> causes $b$</li>
<li>$a \leftrightarrow b$: $a$ and $b$ are <em>spuriously associated</em></li>
<li>$a-b:$ we have no clue whether the relationship is causal or spurious (the direction is also unknown)</li>
</ol>
<p>In the following, <strong>adjacency</strong> is used to emphasize that there is an edge between $a$ and $b$ - including $\leftrightarrow$, which stretches the traditional edge concept as it is a shorthand for a latent common cause $a\leftarrow L \rightarrow b$. A <strong>context</strong> $S$ means a set of variable assignments over the observed variables only - latent values cannot be assigned as they are unobservable - and $S$ can also be the empty set. Context-specific independence $(\perp_c)$ is a shorthand for $X \perp Y | Z, C=c$.</p>
<blockquote>
<p>Time-agnosticity in the section title refers to the fact that the concepts below are general: they do not exploit temporal information.</p>
</blockquote>
<p>That will be considered in the next section and will make our life easier.</p>
<blockquote>
<p>From now on, we will only use edges as in DAGs (edge types 1 and 3, but without a marking).</p>
</blockquote>
<h2 id="potential-cause">Potential Cause</h2>
<p>The first level of causally useful relationships falls into the category of potential causes - this can be thought of as a <strong>necessary condition</strong> of $X$ causing $Y$. If $X$ is not a potential cause of $Y$, then $X$ cannot cause $Y$.</p>
<blockquote>
<p>$X$ is a <strong>potential cause</strong> of $Y$ (that is inferable from $P$) if:</p>
<ol>
<li>$X$ and $Y$ are dependent in every context ($X \not\perp_c Y$).</li>
<li>There exists a variable $Z$ and a context $S:$
<ol>
<li>$X \perp Z | S$</li>
<li>$Y \not\perp Z | S$</li>
</ol>
</li>
</ol>
</blockquote>
<p>The definition requires adjacency (including latent common causes) to ensure that $X$ and $Y$ are dependent irrespective of what we condition on.</p>
<blockquote>
<p>That is, the definition excludes indirect causes - i.e., when $X$ influences $Y$ via <em>observable</em> mediator nodes such as $X\rightarrow Z \rightarrow Y$ or $X \leftrightarrow Z \leftrightarrow Y$.</p>
</blockquote>
<p><img src="/images/posts/potential_cause.svg" alt="Potential cause" /></p>
<p>The reason why $X$ is considered a potential cause of $Y$ is the exclusion of $X\leftarrow Y$, leaving only the options:</p>
<ul>
<li>$X \rightarrow Y$</li>
<li>$X \leftrightarrow Y$</li>
</ul>
<p><img src="/images/posts/potential_cause_counterex.svg" alt="Potential cause counterexample" /></p>
<p>Namely, if $X\leftarrow Y$ were possible, then we would have $X \not\perp Z | S$ - as the $Y\rightarrow X$ edge would extend the active path between $Z$ and $Y$ to $X$.</p>
<blockquote>
<p>If $X$ is a potential cause of $Y$, then $X$ either causes $Y$ or they have the same parent. That is, <strong>potential causes are a superset of genuine causes and spurious associations</strong>.</p>
</blockquote>
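<p>Given a perfect conditional-independence oracle, the two conditions of a potential cause translate directly into code. The sketch below is illustrative only - the oracle <code>indep(a, b, s)</code> and all names are my assumptions, not an API from the post:</p>

```python
from itertools import chain, combinations

def powerset(vs):
    """All subsets of vs (as tuples), smallest first."""
    return chain.from_iterable(combinations(vs, r) for r in range(len(vs) + 1))

def is_potential_cause(x, y, variables, indep):
    """Check the potential-cause conditions with a CI oracle.

    `indep(a, b, s)` returns True iff a is independent of b given s."""
    others = [v for v in variables if v not in (x, y)]
    # Condition 1: X and Y are dependent in every context.
    if any(indep(x, y, s) for s in powerset(others)):
        return False
    # Condition 2: some Z and context S with X indep. of Z given S,
    # while Y stays dependent on Z given S.
    for z in others:
        rest = [v for v in others if v != z]
        for s in powerset(rest):
            if indep(x, z, s) and not indep(y, z, s):
                return True
    return False
```

<p>For the collider $X \rightarrow Y \leftarrow Z$, both $X$ and $Z$ are certified as potential causes of $Y$ (with $Z$ resp. $X$ playing the role of the third variable and $S = \emptyset$), while $Y$ is not a potential cause of $X$.</p>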
<h2 id="genuine-cause">Genuine Cause</h2>
<p>Potential causes are useful to restrict the space of causes, but genuine causal relationships are the most important for us. Since genuine causal relationships are more restrictive, we expect more conditions.</p>
<blockquote>
<p>$X$ is the <strong>genuine cause</strong> of $Y$ if there exists a variable $Z$ such that either:</p>
<ol>
<li>$X$ and $Y$ are dependent in every context ($X \not\perp_c Y$) and $\exists$ context $S$:
<ol>
<li>$Z$ is a potential cause of $X$</li>
<li>$Y \not\perp Z | S$</li>
<li>$Y \perp Z | S \cup X$</li>
</ol>
</li>
<li>$X$ and $Y$ are in a <a href="https://en.wikipedia.org/wiki/Transitive_closure">transitive closure</a> of criterion 1.</li>
</ol>
</blockquote>
<p>Our ultimate goal is to identify genuine causes and we use the potential cause definition as a stepping stone. The trick here is that when reasoning about the causal relationship of $X$ and $Y$, a third variable $Z$ will be used as a potential cause.</p>
<p>By requiring $Z$ to be the potential cause of $X$, we imply either</p>
<ul>
<li>$Z \rightarrow X$ or</li>
<li>$Z \leftrightarrow X$</li>
</ul>
<p><img src="/images/posts/genuine_cause.svg" alt="Genuine cause" /></p>
<p>Additionally, conditions 2 and 3 rule out $X \leftarrow Y$ and $X \leftrightarrow Y$ (context-specific dependence means there is an $X-Y$ edge), as they require that the effect of $Z$ on $Y$ is mediated by $X$ (otherwise, conditioning on $X$ could not block the path). Thus, the latent common cause scenario $X \leftrightarrow Y$ is infeasible.</p>
<p>The mention of the transitive closure means that $X$ is also a genuine cause of $Y$ if $X$ is a genuine cause of $W$ and $W$ is a genuine cause of $Y$.</p>
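<p>The transitive closure in criterion 2 can be computed by a simple fixed-point iteration over already-certified (cause, effect) pairs - a minimal sketch with names of my own choosing:</p>

```python
def transitive_closure(pairs):
    """Close a set of (cause, effect) pairs under transitivity:
    if (x, w) and (w, y) are genuine, so is (x, y)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))  # chain a -> b == c -> d
                    changed = True
    return closure
```
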
<h2 id="spurious-association">Spurious Association</h2>
<p>When a potential cause fails to fulfill the conditions of a genuine cause, the relationship is spurious.</p>
<blockquote>
<p>$X$ and $Y$ are <strong>spuriously associated</strong> if</p>
<ol>
<li>They are dependent in some context ($X \not\perp_c Y$) and</li>
<li>$\exists$ <em>other</em> variables $ Z_1, Z_2$ and contexts $S_1, S_2$:
<ol>
<li>$X \not\perp Z_1 | S_1$</li>
<li>$Y \perp Z_1 | S_1$</li>
<li>$X \perp Z_2 | S_2$</li>
<li>$Y \not\perp Z_2 | S_2$</li>
</ol>
</li>
</ol>
</blockquote>
<p><img src="/images/posts/spurious.svg" alt="Spurious association" /></p>
<p>The first requirement is that $X$ and $Y$ should be dependent in some context, as otherwise we could not think of them having any association between them.</p>
<blockquote>
<p>The existence of <strong>other</strong> variables $Z_1, Z_2$ and contexts $S_1, S_2$ refers to <strong>other than $X$ and $Y$ for the variables and other than the context where $X$ and $Y$ are dependent for the contexts</strong>.</p>
</blockquote>
<p>As you can see in the figure above, we cannot use $S_1$ or $S_2$ to make $X$ and $Y$ dependent - but conditioning on $Z_1$ and/or $Z_2$ makes $X \not\perp Y$.</p>
<p>When parsing the definition of spurious associations further, we can think of applying condition 2 (with its two requirements) of potential causes twice. $X \not\perp Z_1 | S_1$ and $Y \perp Z_1 | S_1$ rule out $X\leftarrow Y$, while $Y \not\perp Z_2 | S_2$ and $X \perp Z_2 | S_2$ exclude $X\rightarrow Y$. So the only remaining option is $X \leftrightarrow Y$, meaning that there is no causal influence between $X$ and $Y$.</p>
<h1 id="time-dependent-relationships">Time-dependent Relationships</h1>
<p>When we have temporal information (such as sensor measurements), the above definitions simplify. Intuitively, $X$ can only be a cause of $Y$ if $X$ precedes $Y$.</p>
<h2 id="potential-causation-with-temporal-information">Potential Causation with Temporal Information</h2>
<blockquote>
<p>$X$ is a <strong>potential cause</strong> of $Y$ if:</p>
<ol>
<li>$X$ precedes $Y$</li>
<li>$X$ and $Y$ are dependent in every context ($X \not\perp_c Y$).</li>
</ol>
</blockquote>
<p>The temporal information is used to exclude $X\leftarrow Y$. This agrees with our everyday intuition and makes causal inference simpler as there is no need for conditional independence tests with a third variable $Z$. The second condition requires the adjacency of $X$ and $Y$ (a latent common cause is still possible).</p>
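<p>With timestamps, the check collapses to two conditions; a minimal sketch (the CI oracle <code>indep</code>, the timestamp map <code>t</code>, and the helper names are my own assumptions):</p>

```python
from itertools import chain, combinations

def powerset(vs):
    """All subsets of vs (as tuples), smallest first."""
    return chain.from_iterable(combinations(vs, r) for r in range(len(vs) + 1))

def is_temporal_potential_cause(x, y, t, variables, indep):
    """Temporal potential cause: X must precede Y, and X and Y must be
    dependent in every context. `t` maps variables to timestamps;
    `indep(a, b, s)` is a CI oracle."""
    if t[x] >= t[y]:
        return False  # temporal precedence replaces the third-variable test
    others = [v for v in variables if v not in (x, y)]
    return not any(indep(x, y, s) for s in powerset(others))
```
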
<p>An alternative option to context-specific dependence that we will use for temporal genuine causes is requiring that</p>
<blockquote>
<p>Context $S$ precedes $X$,</p>
</blockquote>
<p>meaning that $S$ is not in the path $X-\dots-Y$ (as $S$ precedes $X$ precedes $Y$); thus, it is not the reason why $Z$ is dependent on $Y$ but not on $X$ (cf. <a href="#potential-cause">potential causes</a>). Note that <strong>temporal precedence is still required between $X$ and $Y$</strong>.</p>
<h2 id="genuine-causation-with-temporal-information">Genuine Causation with Temporal Information</h2>
<blockquote>
<p>$X$ is a <strong>genuine cause</strong> of $Y$ if there exists a variable $Z$ and a context $S$, both occurring before $X$, such that:</p>
<ol>
<li>$X$ precedes $Y$</li>
<li>$Y \not\perp Z | S$</li>
<li>$Y \perp Z | S\cup X$</li>
</ol>
</blockquote>
<p>In this definition, temporal precedence requires $Z$ and $S$ to precede $X$, which precedes $Y$ - so the only possible causal direction is $Z$ causing $X$ causing $Y$; the two remaining conditions are the same as in the <a href="#genuine-cause">non-temporal definition</a>. Note that the context-specific dependence of $X$ and $Y$ is not required - equivalently, we do not postulate in advance the existence of an $X-Y$ edge.</p>
<p>The context-dependence of $X$ and $Y$ is implied by the second and third conditions. They exclude both $Z$ and $S$ as a cause of $Y$, as otherwise it would not be possible for $X$ to block the dependence of $Z$ on $Y$.</p>
<h2 id="spurious-association-with-temporal-information">Spurious Association with Temporal Information</h2>
<blockquote>
<p>$X$ and $Y$ are <strong>spuriously associated</strong> if $\exists S : X \not\perp Y |S$ and $\exists Z$ such that:</p>
<ol>
<li>$X$ precedes $Y$</li>
<li>$Y \perp Z | S$</li>
<li>$X \not\perp Z | S$</li>
</ol>
</blockquote>
<p>In this case, condition 1 excludes $Y$ as a cause of $X$, while conditions 2 and 3 exclude $X$ as a cause of $Y$ (these are analogous to the sub-conditions of condition two in the definition of <a href="#potential-cause">potential causes</a>). Namely, having a genuine causal relationship would require $Y \not\perp Z | S$.</p>
<h1 id="summary">Summary</h1>
<p>This time we made a deep dive into the taxonomy of the causality zoo. Being able to understand the differences between genuine causes, potential causes, and spurious associations is crucial for reasoning about the output of a causal inference algorithm.</p>Patrik ReizingerHitting the nail on its arrowhead, a.k.a. when does $X$ cause $Y$?Pearls of Causality #8: Inferred Causation, $IC$, and ${IC}^*$2021-11-22T00:00:00+01:002021-11-22T00:00:00+01:00https://rpatrik96.github.io/posts/2021/11/poc8-ic-ic-ic<p>We will talk about IC, $IC$, and ${IC}^*$ in this post. You get the difference.</p>
<h3 id="poc-post-series">PoC Post Series</h3>
<ul>
<li><a href="/posts/2021/11/poc7-latents-stability/">PoC #7: Latents and Inferred Causation</a></li>
<li>➡️ <a href="/posts/2021/11/poc8-ic-ic-ic/">PoC #8: Inferred Causation, $IC$, and ${IC}^*$</a></li>
<li><a href="/posts/2021/11/poc9-causes/">PoC #9: Potential, Genuine, Temporal Causes and Spurious Association</a></li>
</ul>
<h1 id="inferred-causation">Inferred Causation</h1>
<p>It is time to see some causal discovery algorithms. There is one last thing before that though: we need to formulate when a node $C$ is the cause of node $E$. In the fully observed case, it is the existence of a directed path from $C$ to $E$. When latents are present, we cannot use the same definition. The extension is called the principle of <strong>inferred causation</strong>.</p>
<blockquote>
<p>Given distribution $\hat{P}$, node $C$ has a causal influence on node $E$ if and only if there exists a directed path from $C$ to $E$ in <em>every minimal latent structure consistent with</em> $\hat{P}$.
\(\forall L, L' \in \mathcal{L},\ L \preceq L', \exists \theta_G : P_{[O]}(<G, \theta_G>)=\hat{P} \\
\exists p = C\rightarrow \dots \rightarrow E \in L\)</p>
</blockquote>
<p>Inferred Causation is a stricter requirement than the sole existence of a $C\rightarrow \dots \rightarrow E$ path. It demands such a path in every minimal latent structure consistent with $\hat{P}$.</p>
<p><strong>Consistency is a necessary condition,</strong> as we need to represent $\hat{P}$. Failing to find a directed path in consistent structures means that we cannot conclude a causal influence.</p>
<p><strong>Minimality is a sufficient condition,</strong> as a directed path in the minimal structure ensures the existence of the same path in the non-minimal graph. However, the converse does not hold: a directed path in a non-minimal structure does not imply a directed path in the minimal structure.</p>
<h1 id="the-inductive-causation-ic-algorithm">The Inductive Causation $(IC)$ algorithm</h1>
<p>After having dived into the intricate details of causal models and latent structures, we can analyze the first algorithm for causal discovery.</p>
<p>The Inductive Causation $(IC)$ algorithm tackles causal discovery without latent variables. Although <strong>when assuming stability, there will be a unique minimal causal structure, this only holds up to Markov equivalence</strong>.</p>
<p>This means that $IC$ cannot identify the ground-truth graph; it will only output a pattern.</p>
<blockquote>
<p>A <strong>pattern</strong> $H$ is a graph with both directed and undirected edges. In our context, it is used to express Markov equivalence classes.</p>
</blockquote>
<p>The algorithm works with separating sets to determine conditional/marginal independence.</p>
<blockquote>
<p>A <strong>separating set</strong> $S_{ab}$ is a subset of the variables $V$ in DAG $G$ such that $a\perp b | S_{ab}$ holds. It is minimal if no vertex can be removed from $S_{ab}$ without violating $a\perp b | S_{ab}$.</p>
</blockquote>
<h2 id="algorithm">Algorithm</h2>
<p><strong>Input:</strong> $\hat{P}$, a stable distribution on a set $V$ of variables.</p>
<p><strong>Output:</strong> a pattern $H(\hat{P})$ compatible with $\hat{P}$.</p>
<p><strong>Steps</strong></p>
<ol>
<li>$\forall a, b \in V$, search for a set $S_{ab} : a \perp_{\hat{P}} b | S_{ab} $. Construct an undirected graph $G$ with the edge $a-b$ if and only if no such $S_{ab}$ exists.</li>
<li>For each $a-c-b$, check if $c \in S_{ab}$.
<ul>
<li>If it is not, then $a\rightarrow c \leftarrow b$.</li>
</ul>
</li>
<li>In the resulting PDAG (Partially Directed Acyclic Graph), orient as many of the undirected edges as possible subject to:
<ul>
<li>No new v-structures and</li>
<li>No directed cycles.</li>
</ul>
</li>
</ol>
<p>For doing causal discovery, <strong>we start with a (stable) distribution</strong> - stability is required to rule out functional independencies (independencies introduced by the parameters of the SEM but not present in the ground-truth graph - for details see <a href="/posts/2021/11/poc6-latents-stability-ic/">PoC #7</a>).</p>
<p>Acknowledging that the output is a pattern, we face the reality of causal discovery: it is not guaranteed that we are able to recover the ground-truth graph. Nevertheless, we can recover it up to its Markov equivalence class.</p>
<p><strong>Step 1</strong> constructs the skeleton of the graph. This is done by carefully checking conditional independence for all combinations of nodes and separating sets. ($IC$ is agnostic to how this is done; e.g., we can use conditional independence tests.) In the case of $n$ nodes, we have ${n\choose 2}$ combinations for $a,b$ with $2^{n-2}$ possible separating sets for each. Multiplying the two gives us the sad reality of $45\times 256 = 11,520$ conditional independence tests in the worst-case scenario for $n=10$ (we don’t need to run all tests if one already indicates conditional independence).</p>
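<p>The worst-case count above generalizes to any $n$; a two-line sanity check (function name is mine):</p>

```python
from math import comb

def worst_case_ci_tests(n):
    """Pairs of nodes times all possible separating sets over the other n-2 nodes."""
    return comb(n, 2) * 2 ** (n - 2)

print(worst_case_ci_tests(10))  # 11520
```
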
<p><strong>Step 2</strong> identifies v-structures by checking whether the common neighbor $c$ of <em>nonadjacent</em> nodes $a, b$ is in the separating set. Because we know that $c$ is connected to both, the only explanation for $c$ not being in $S_{ab}$ is that $c$ is the middle node of the v-structure $a\rightarrow c \leftarrow b$ - i.e., it introduces dependence between $a$ and $b$ when conditioned on.</p>
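<p>Steps 1 and 2 can be sketched together, again assuming a hypothetical perfect CI oracle <code>indep(a, b, s)</code>: an edge is kept exactly when no separating set is found, and a v-structure is declared when the middle node is missing from the stored separating set of its nonadjacent neighbors:</p>

```python
from itertools import chain, combinations

def ic_skeleton_and_colliders(variables, indep):
    """Steps 1 and 2 of IC with a CI oracle `indep(a, b, s)`.

    Returns the undirected skeleton (set of frozenset edges) and the
    detected v-structures as (a, c, b) triples meaning a -> c <- b."""
    def subsets(vs):
        return chain.from_iterable(combinations(vs, r) for r in range(len(vs) + 1))

    sep, edges = {}, set()
    for a, b in combinations(variables, 2):
        rest = [v for v in variables if v not in (a, b)]
        for s in subsets(rest):
            if indep(a, b, s):
                sep[frozenset((a, b))] = set(s)  # record one separating set
                break
        else:
            edges.add(frozenset((a, b)))  # no separating set: a and b stay adjacent

    v_structures = []
    for c in variables:
        nbrs = [v for v in variables if frozenset((v, c)) in edges]
        for a, b in combinations(nbrs, 2):
            if frozenset((a, b)) not in edges and c not in sep.get(frozenset((a, b)), set()):
                v_structures.append((a, c, b))  # orient a -> c <- b
    return edges, v_structures
```

<p>On the collider $X \rightarrow Y \leftarrow Z$, the sketch recovers the skeleton $X-Y-Z$ and the single v-structure at $Y$.</p>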
<p><strong>Step 3</strong> gives us general guidelines for how to orient the remaining edges, but it does not provide concrete means to do so: we cannot introduce new v-structures (as Step 2 should have identified all of them) or directed cycles (as we look for DAGs).</p>
<blockquote>
<p>Depending on the task, we may not be able to orient all edges.</p>
</blockquote>
<h3 id="the-four-rules-of-step-3">The four rules of Step 3</h3>
<p>It was shown in the literature that with four simple rules, we can orient the maximal number of edges.</p>
<ol>
<li><strong>Rule 1:</strong> Orient $a \rightarrow b - c$ into $a \rightarrow b\rightarrow c$ for nonadjacent $a$ and $c$.
<img src="/images/posts/ic_r1.svg" alt="$IC$ Rule 1" /></li>
<li><strong>Rule 2:</strong> Orient $a - b$ into $a \rightarrow b$ whenever there is a chain $a \rightarrow c \rightarrow b$.
<img src="/images/posts/ic_r2.svg" alt="$IC$ Rule 2" /></li>
<li><strong>Rule 3:</strong> Orient $a - b$ into $a \rightarrow b$ whenever there are two chains $a - c \rightarrow b$ and
$a - d \rightarrow b $ such that $c$ and $d$ are nonadjacent.
<img src="/images/posts/ic_r3.svg" alt="$IC$ Rule 3" /></li>
<li><strong>Rule 4:</strong> Orient $a - b$ into $a \rightarrow b$ for $a - c \rightarrow d\rightarrow b - a$ such that $a, d$ are adjacent but $b, c$ are not.
<img src="/images/posts/ic_r4.svg" alt="$IC$ Rule 4" /></li>
</ol>
<p><strong>Rule 1</strong> ensures that no v-structures are added (orienting $a \rightarrow b - c$ into $a \rightarrow b\leftarrow c$ would yield a v-structure).</p>
<p><strong>Rule 2</strong> prevents directed cycles of the form $a \rightarrow c\rightarrow b \rightarrow a$. Nonetheless, the $a\rightarrow b$ edge creates the v-structure-like pattern $a\rightarrow b\leftarrow c$. We can resolve the contradiction by showing that this pattern has no effect on $I(G)$: as $a$ and $c$ are adjacent (so $a\not\perp c$), conditioning on $b$ does not add any dependence.</p>
<p><strong>Rule 3</strong> also prescribes a step that eliminates directed cycles from the PDAG. Namely, the edge orientation $a\leftarrow b$ enables both $a\rightarrow c \rightarrow b \rightarrow a$ and $a \rightarrow d \rightarrow b \rightarrow a$ - but this has the cost of introducing the v-structures $a \rightarrow b \leftarrow c$ and $a \rightarrow b \leftarrow d$. Noticing that $b$ was already the middle node of the v-structure $c \rightarrow b \leftarrow d$ and that the edges $a - c$ and $a - d$ imply $a\not\perp c$ and $a\not\perp d$, we conclude that $a\rightarrow b$ does not introduce new conditional independencies.</p>
<p><strong>Rule 4</strong> resembles Rule 3 as it also eliminates directed cycles by introducing a new v-structure, which does not change conditional independencies. As a result, we get $a - c \rightarrow d\rightarrow b \leftarrow a$. As $a$ and $d$ are adjacent (i.e., we have $a-d$), the v-structure $a\rightarrow b \leftarrow d$ has no effect on the conditional independence of $a$ and $d$.</p>
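<p>As an illustration, Rule 1 can be applied to a PDAG represented as a set of directed pairs plus a set of undirected <code>frozenset</code> edges - a toy representation I chose for the sketch, not a standard API:</p>

```python
def apply_rule_1(directed, undirected):
    """Rule 1: for a -> b - c with a and c nonadjacent, orient b - c
    as b -> c (anything else would create a new v-structure at b)."""
    changed = True
    while changed:
        changed = False
        for (a, b) in list(directed):
            for edge in list(undirected):
                if b in edge:
                    (c,) = edge - {b}
                    adjacent = ((a, c) in directed or (c, a) in directed
                                or frozenset((a, c)) in undirected)
                    if c != a and not adjacent:
                        undirected.remove(edge)
                        directed.add((b, c))  # orient b - c as b -> c
                        changed = True
    return directed, undirected
```
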
<h1 id="the-ic-algorithm">The $IC^*$ algorithm</h1>
<p>The $IC$ algorithm only works when there are no latent variables. Fortunately, some simple rules are sufficient to extend the process to latent variables.</p>
<p>The theorem stating that each latent structure has at least one projection (discussed in <a href="/posts/2021/11/poc6-latents-stability-ic/">PoC #7</a>), and the principle of Inferred Causation (IC) justify the generalization. Since a causal influence is reflected as an edge in every minimal latent structure, we need to find its projection.</p>
<p>$IC^*$ acknowledges the uncertainty of the latents by outputting a <em>marked pattern</em>.</p>
<blockquote>
<p>A <strong>marked pattern</strong> is a PDAG with four edge types:</p>
<ol>
<li>$a\rightarrow^{*} b$: directed path from $a $ to $ b$</li>
<li>$a\rightarrow b$: directed path from $a $ to $ b$ or $a \leftarrow L \rightarrow b$</li>
<li>$a \leftrightarrow b$: $a \leftarrow L \rightarrow b$</li>
<li>$a-b:$ $a \leftarrow L \rightarrow b$ or $a \leftarrow b$ or $a \rightarrow b$,</li>
</ol>
</blockquote>
<p>where $a \leftarrow L \rightarrow b$ denotes confounding (an unobserved common cause). Note also that $a \rightarrow b$ on the left of the colon refers to an edge type in the PDAG, covering two possible scenarios (to avoid confusion, the edge from $a$ to $b$ is described as a directed path instead of using arrows).</p>
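<p>For quick reference, the four edge types could be captured in a small enum - the names below are mine, not standard terminology:</p>

```python
from enum import Enum

class EdgeType(Enum):
    """The four edge types of a marked pattern (illustrative names)."""
    MARKED = "a -*-> b: genuinely directed path"
    DIRECTED = "a -> b: directed path or a <- L -> b"
    BIDIRECTED = "a <-> b: latent common cause a <- L -> b"
    UNDIRECTED = "a - b: direction and confounding undetermined"
```

<p>Only <code>MARKED</code> edges certify a genuine causal path in the $IC^*$ output; all other types leave some ambiguity.</p>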
<h2 id="algorithm-1">Algorithm</h2>
<p><strong>Input:</strong> $\hat{P}$, a stable distribution w.r.t. $L$.</p>
<p><strong>Output:</strong> a <strong>marked</strong> pattern $H(\hat{P})$ compatible with $\hat{P}$.</p>
<p><strong>Steps</strong></p>
<ol>
<li>$\forall a, b \in V$, search for a set $S_{ab} : a \perp_{\hat{P}} b | S_{ab} $. Construct an undirected graph $G$ with the edge $a-b$ if and only if no such $S_{ab}$ exists.</li>
<li>For each $a-c-b$, check if $c \in S_{ab}$.
<ul>
<li>If it is not, then $a\rightarrow c \leftarrow b$.</li>
</ul>
</li>
<li>In the resulting PDAG (Partially Directed Acyclic Graph), orient <strong>and mark</strong> as many of the undirected edges as possible subject to two rules:
<ol>
<li><strong>Rule</strong> $1^{*}$: Add an arrowhead to $a\rightarrow b - c$ to get $a\rightarrow b\rightarrow^{*} c$ for nonadjacent $a$ and $c$.
<img src="/images/posts/ic_star_r1.svg" alt="$IC^*$ Rule 1" /></li>
<li><strong>Rule</strong> $2{}^{*}$: Add an arrowhead to $a-b$ to get $a\rightarrow b$ if $a$ and $b$ are adjacent and there is a directed path of marked links from $a$ to $b$.
<img src="/images/posts/ic_star_r2.svg" alt="$IC^*$ Rule 2" /></li>
</ol>
</li>
</ol>
<p>Identifying the separation sets and orienting the edges of v-structures (Steps 1 and 2) are the same as in $IC$, the difference is that $IC^*$ does not orient edges but adds arrowheads.</p>
<p><strong>Rule</strong> $1^{*}$ is <em>almost</em> the same as <strong>Rule 1</strong> in $IC$, but there are two differences: first, the $a\rightarrow b$ edge here comprises a directed path from $a$ to $b$ or $a \leftarrow L \rightarrow b$; second, $\rightarrow^{*}$ is used to mark the directed edge from $b$ to $c$. Thus, the rule includes the additional case when a latent variable leaves the path between $a$ and $c$ active.</p>
<p>The resemblance of <strong>Rule</strong> $2^{*}$ to <strong>Rule 2</strong> is looser: not only three-node chains are considered, but directed paths of any length. Though the core principle is the same: avoid directed cycles. The marked path $a\rightarrow^{*} \dots \rightarrow^{*} b$ and the edge $a-b$ form an <em>undirected</em> cycle, which is broken by adding the arrowhead to get $a\rightarrow b$. Again, note that this edge includes the case $a \leftarrow L \rightarrow b$ - but that does not contradict acyclicity.</p>
<h1 id="summary">Summary</h1>
<p>In this post, we were annoyed <em>again</em> by the lack of creativity of the Naming Committee of Causal Terminology. Nonetheless, we encountered our first causal discovery algorithms. By comparing $IC$ and $IC^*$, we have also seen the effect of having latent variables.</p>Patrik ReizingerWe will talk about IC, $IC$, and ${IC}^*$ in this post. You get the difference.