I came across the Rotating Features paper (oral at NeurIPS 2023) at the ELLIS Doctoral Symposium in Helsinki. The paper proposes a structured latent space for object-centric learning. In the past months, I have thought about the role of structure in representation learning, and now I am sharing my thoughts.
The idea comes from the temporal correlation hypothesis (the figure is from the Complex Autoencoder paper). It posits that neurons encode information by modulating both their firing frequency and their relative timing.
A property’s (latent factor’s) presence modulates firing frequency. High frequency means the property is present.
Different neurons fire with a time shift to express which object a property belongs to. Neurons representing properties of the same object fire at (approximately) the same time.
Generally, latent variable models use a scalar for each latent factor. The problem is that a scalar has only a single property: its magnitude. To turn the temporal correlation hypothesis into a representation, we need more. Previous work by the authors suggests using complex numbers, i.e., a 2D vector for each latent. Vectors have magnitude and orientation, so we are done, right?
Not so fast. The Complex Autoencoder (CAE) by the same authors should have done the trick, but it cannot generalize to arbitrarily many objects; e.g., it fails for ten. Intuitively, one problem is that the vectors cannot be spread "very far" apart on a circle, especially with more objects.
Rotating Features use a hypersphere instead, generalizing the CAE. Intuitively, using a high-dimensional latent space could help on its own. In high dimensions, random vectors with e.g., i.i.d. Gaussian coordinates will be almost surely orthogonal. This means you can spread them out to get distinct clusters for each object’s features. Theoretically, a complex number (as in the CAE) should be sufficient. I suspect that when the vectors are too close (in terms of angular distance), the neural network mistakes them to belong to the same object.
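The near-orthogonality claim is easy to verify empirically. The sketch below (my own, not from the paper) samples pairs of i.i.d. Gaussian vectors and shows that their average absolute cosine similarity shrinks as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=1000):
    """Average |cos(angle)| between pairs of random i.i.d. Gaussian vectors."""
    a = rng.normal(size=(n_pairs, dim))
    b = rng.normal(size=(n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cos).mean()

# the higher the dimension, the closer random vectors are to orthogonal
for d in (2, 10, 100, 1000):
    print(d, mean_abs_cosine(d))
```

In 2D the average absolute cosine is around 0.64, while in 1000 dimensions it drops to a few percent — which is exactly why there is so much more "room" on a hypersphere than on a circle.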
It would be interesting to see whether performance improves with different inductive biases. Namely, the network uses a so-called binding mechanism that ensures that similar features are processed together. As the authors have shown, the binding mechanism is sensitive to angular distance. That is, it benefits from a higher-dimensional hypersphere, where vectors can be distributed further apart.
Experimental results show clear improvements, but I will focus on the main message:
If you peel back the idea, it comes back to learning a structured representation, where structure is an inductive bias.
And I believe this is quintessential. What most latent variable models use is a Euclidean latent space. But that fails even on simple data sets like dSprites. What causes the problem are discrete (shape) or cyclic (orientation) features. That is, problems arise when the topology of the feature does not match that of the latent space.
You might object that you can represent orientation on the real line in $[0, 2\pi)$, and you are right. You can even have identifiability guarantees. Problems start to come up when you want to measure (angular) distance; see the example in our extended abstract. Suppose you use the real line with its default Euclidean metric (a.k.a. the L2 norm). Will that correctly say that an orientation of 0 is closer to $2\pi-\varepsilon$ than to $0.1$? Nope!
So you need some structure. Empirical studies have already shown that simply using more components to represent a single latent factor helps. This should not come as a surprise; consider orientation, for example. You could use an angle $\theta$, or you could use real coordinates (parametrizing a circle embedded in $\mathbb{R}^{2}$) as $(\cos \theta, \sin \theta)$. The network may learn to parametrize the circle even without an explicit cosine-sine parametrization. But you cannot do this with a single scalar component.
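To make the topology mismatch concrete, here is a small sketch (helper names are my own) comparing the naive distance on the real line with the distance between cosine-sine embeddings:

```python
import numpy as np

def embed(theta):
    """Parametrize the circle embedded in R^2 as (cos(theta), sin(theta))."""
    return np.array([np.cos(theta), np.sin(theta)])

eps = 0.01
# on the real line, 0 looks closer to 0.1 than to 2*pi - eps ...
print(abs(0 - 0.1), abs(0 - (2 * np.pi - eps)))
# ... but with the cosine-sine embedding, 2*pi - eps is the closer orientation
print(np.linalg.norm(embed(0) - embed(0.1)),
      np.linalg.norm(embed(0) - embed(2 * np.pi - eps)))
```

The chordal distance in $\mathbb{R}^2$ respects the cyclic topology of orientation, while the metric inherited from the real line does not.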
To summarize, Rotating Features rely on a simple idea: introducing structure into the latent space to learn an object-centric representation. I mean this as a compliment: simple is elegant, but not necessarily easy.
Is there a useful structural inductive bias for the problem you want to solve?
Generally speaking, interviews for Ph.D. positions and machine learning internships are structured into three parts:
It is not all you need for a machine learning interview, but reading and comprehending each concept in the freely available Mathematics for Machine Learning book is what can lay the foundations on the math and fundamental machine learning side. Regarding mathematical concepts, you need to be fluent at least in:
Since neural networks consist of matrices and tensors, it is essential to be aware of the corresponding operations—knowledge of matrices is a must; tensors are a nice-to-have, but you need the intuition that tensors can be thought of as "matrices generalized to higher dimensions". Matrix knowledge includes how linear equation systems and matrices correspond, the computational aspects (cost) of matrix algebra, and how matrices can express linear transformations—unfortunately, bonus points cannot be collected for knowing the eponymous film series.
Linear equations can be thought of as (intersecting) hyperplanes in a vector space, so the characterization of a vector space is essential. When we connect matrices to vector spaces—which we do since matrices can describe the equation system—we also want to measure angles and distances, so being aware of norms and inner products is also a must.
When we talk about matrices, decompositions such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) cannot be neglected. Here the key is to understand what these decompositions mean, and also what we can do with them (low-rank approximations and even unsupervised representation learning in simple cases).
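As a quick illustration of what the SVD buys you, the sketch below builds rank-$k$ truncations of a random matrix; by the Eckart–Young theorem, these are the best rank-$k$ approximations (in the Frobenius norm):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 30))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def low_rank(k):
    """Best rank-k approximation of A (Eckart-Young theorem)."""
    return (U[:, :k] * s[:k]) @ Vt[:k]

# the approximation error decreases as the kept rank k grows
for k in (1, 5, 30):
    print(k, np.linalg.norm(A - low_rank(k)))
```

At full rank ($k=30$), the truncation reproduces $A$ exactly; for small $k$ you get a compressed summary — the same mechanism PCA exploits for unsupervised representation learning.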
This figure from the MML book can be very helpful to comprehend all things matrix phylogeny:
Understanding Bayes’s theorem is the bread and butter of several machine learning algorithms. Besides helping with the Monty Hall Problem, it is the foundation of a broad range of methods aiming to infer unmeasured quantities. However, not everything is Bayesian estimation: Maximum Likelihood and Maximum A Posteriori methods are also frequently used. Factorization, independence, and the latter’s relation to (un)correlatedness are further essential concepts, since assumptions on distributions are generally about them. To juggle distributions, we need to distinguish them by names such as marginal, conditional, or joint; to manipulate them (read: to use Bayes’s theorem), we need the Sum and the Product rules.

Be also aware of the special role of the Gaussian, and exponential families (to get conjugate priors for Bayesian estimation, leading to closed-form solutions) can also come in handy. Staying with the Gaussian, its prominence in the Central Limit Theorem is the basis for using the Gaussian assumption on distributions. When we want to transform probability densities, the change of variables formula will be our tool. There are properties (sufficient statistics) that can describe distributions concisely; the mean and variance are such (they are sufficient to describe a Gaussian), and they have interesting properties. It is also worth knowing that they are instantiations of (central) moments, which can be thought of as a general family of descriptors for probability distributions.

A pinch of information theory, i.e., quantities related to the information content of a random variable, is used in several fields of machine learning. (Differential) entropy is the most important concept, but cross entropy and mutual information are also essential.
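Since the Monty Hall Problem came up: a simulation is a nice sanity check of the Bayesian answer. This sketch (my own, with hypothetical helper names) confirms that switching wins about two-thirds of the time:

```python
import random

def monty_hall(switch, trials=100_000):
    """Simulate the Monty Hall game; return the empirical win rate."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # the host opens a goat door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # switch to the remaining closed door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == car
    return wins / trials

random.seed(0)
print(monty_hall(switch=False))  # ~1/3
print(monty_hall(switch=True))   # ~2/3
```

Working out the same numbers with Bayes’s theorem (conditioning on which door the host opened) is a good exercise before an interview.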
I would also recommend some online courses, particularly the Bayesian Statistics Specialization on Coursera - with flashcards for the first two courses available on my blog; here for the first, and here for the second course. Additionally, there are related aspects in the Probabilistic Graphical Methods Specialization on the same website, with flashcards by yours truly for PGM1, PGM2, and PGM3.
Analysis is mostly a tool for optimization in the context of machine learning - if you happen to be a theorist, it can be much more, but let’s stick to the fundamentals. Knowing what the derivative is and what it means (also for vector-valued functions) is essential to understand optimization methods and the Taylor approximation. So the Jacobian and the Hessian should be trusted acquaintances. The former is also required for the change of variables formula mentioned above.
Since machine learning revolves mostly around gradient-based optimizers, Stochastic Gradient Descent (SGD) tops the list. Taking a step back, you should be aware of the family of first-order methods and why they are popular (computationally low cost). But beware of the caveats: local optima, setting the step size, and co. So as a follow-up, ponder why second-order methods should (they are aware of the curvature) and should not (calculating the Hessian is costly) be used. When we need to incorporate prior knowledge/constraints, a Lagrange multiplier will come in handy. It is also good to know why we love convex optimization (i.e., to wonder about the good old days when problems were convex, as ML problems will not be convex in most cases). It might also be useful to know about the Karush-Kuhn-Tucker (KKT) conditions, which collect necessary conditions for the solutions of constrained optimization problems, including problems with inequality constraints in nonlinear programs (a synonym for optimization problems). To transition towards ML-related topics, here you should know optimizers such as Adam, Nesterov momentum, or RMSProp - the key is that they use an averaging procedure (“momentum”) to incorporate previous update(s). As a final twist, knowing conceptually how modern ML frameworks implement gradient calculation (automatic differentiation) also belongs to the good-to-know facts.
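To make the momentum idea tangible, here is a minimal sketch of gradient descent with momentum (a toy implementation of my own, not any framework’s API), applied to a simple quadratic:

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.1, beta=0.9, steps=100):
    """Minimal sketch of gradient descent with momentum: v keeps an
    exponential moving average of past gradients to smooth the updates."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)
        w = w - lr * v
    return w

# minimize f(w) = ||w||^2 with gradient 2w; the optimum is the origin
w_star = sgd_momentum(lambda w: 2 * w, w0=[5.0, -3.0])
print(w_star)  # close to [0, 0]
```

Adam and RMSProp refine this recipe with per-coordinate step sizes, but the averaging of past updates is the shared core.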
The hilariously-titled All the Math You Missed (But Need to Know for Graduate School) also looks promising, but as far as I can tell from skimming it, it is for the next level.
The categories of machine learning (supervised, unsupervised, and reinforcement learning are the three main ones, but self-supervised and semi-supervised learning also belong here) are a must.
For supervised methods, classification and regression are the categories you need to be aware of, including what loss functions (cross entropy vs mean squared error) are used. Additionally, the Support Vector Machine (SVM) with its hinge loss also often comes up. Flavors of regression (linear, polynomial) are also prevalent.
For unsupervised, PCA from linear algebra is a trusted friend, but it needs to be accompanied by k-means and Gaussian Mixture Models (GMMs), including how the latter two relate to each other (GMM is “soft” k-means).
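The “GMM is soft k-means” point can be shown in a few lines. The sketch below (my own toy example) runs EM for a 1D GMM with unit variances and equal weights; the only difference from k-means is that the assignments are soft responsibilities instead of hard 0/1 labels:

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated 1D clusters
x = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])
mu = np.array([-1.0, 1.0])  # crude initial means

for _ in range(20):
    # E-step: soft responsibilities (k-means would assign each point
    # to exactly one cluster instead)
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means (k-means would use hard 0/1 weights)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(mu)  # approximately [-5, 5]
```

Replacing the soft responsibilities with an argmax recovers the k-means updates, which is exactly the sense in which GMMs are the “soft” version.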
Reinforcement learning: model-free and model-based, offline and online RL are useful categories to keep in mind. To the best of my knowledge, these will mostly come up during an interview if you want to work in the field of reinforcement learning.
The nitty-gritty details of what could (and will) go wrong during training and how to fight it also belong to the essential toolbox of a machine learning engineer/researcher, including data preparation and architecture design. The goal is to find a trade-off between underfitting and overfitting.
Of course, there is the field-specific knowledge, like ResNets, Transformers, and CNNs for computer vision. Here, what is generally expected is a high-level, intuitive understanding of modern/state-of-the-art methods, but it is generally not required to comb through the arXiv digest daily.
Know machine learning frameworks (PyTorch and TensorFlow are the big two, but JAX is on the rise as far as I can tell), and you can shine if you can compare them. If you use any of those, then you should prepare to state why you use it.
Additionally, there are Easter eggs where you can show your commitment—even better if you can showcase it on your GitHub profile. I am talking about the dreaded triad of
For Python, realpython.com is my go-to resource; you can learn about all these concepts there.
Be it research or programming, be honest and show your commitment and enthusiasm. If there are fields you are interested in, say so. If you have a relevant project (plan), definitely talk about it.
Remember, think with the head of the interviewer.
We made the case for using geometric priors in the AMMI 03 post and argued for their merit for generalization. To see the relationship to modern machine learning methods, we will now focus on disentanglement in representation learning.
Disentangled representations mean, in an intuitive sense, that the latent factors a neural network learns are semantically meaningful.
For example, this implies that for a 3D scene with objects, the latent representation should separately encode size, color, shape, and position. Nonetheless, this is a vague concept: indeed, current methods include a wide range of inductive biases and have conjured up a diverse range of metrics. Having uncorrelated factors makes sense, but is that the whole picture?
For me, disentanglement was this vague concept that a lot of people are interested in but could not express explicitly. After spending some time studying the essentials of geometric deep learning, I found the notions of invariance, equivariance, and symmetries useful for thinking about disentanglement. Of course, I was not the first: this post relies on (Higgins et al., 2018) to provide a geometric deep learning perspective on disentanglement.
But first, we should be more specific than saying what we want is semantically meaningful latents.
Visually, this is what we expect: for greyscale points on the 2D plane, we want to have the $x$-position, $y$-position, and color as our latents.
Our first take is guided by the DCI score, for it quantifies semantically meaningful representations based on how disentangled (modular), complete (compact), and informative (explicit) they are.
Note that disentanglement as a component of the DCI score has an unfortunate name: for a representation to be disentangled, we require all three components. For this reason, I will use modularity, compactness, and explicitness.
Modularity/Disentanglement Modularity measures whether a single latent dimension encodes no more than a single data generative factor.
When changing a latent factor $z_i$ changes only one attribute, e.g., the size of the object, then it is modular.
If changing $z_i$ changes both color and size, then it’s not modular in this sense.
What happens when $z_1, z_2, z_3$ encode the 3D position of the object, but not in the canonical base? Is that still modular? We will return to this point later.
Compactness/Completeness Compactness measures whether each data generative factor is encoded by a single latent dimension.
Completeness requires that an attribute should only be changed if a specific $z_i$ changes. For all $z_{j\neq i}$, the attribute (e.g. color) should remain constant.
Completeness reasons about the opposite direction from modularity. Namely, modularity is still fulfilled if both $z_i$ and $z_j$ encode color, but such a representation is not compact.
Explicitness/Informativeness Explicitness measures whether the values of all of the data generative factors can be decoded from the representation using a linear transformation.
Fortunately, latents are not rude, so no four-letter words are meant by this kind of explicitness. As (Higgins et al., 2018) argue, this is the strongest requirement, as it addresses two points:
In a 3D scene of a single object with a specific shape, size, position, and orientation, all of these factors correspond to latent factors such that we can extract all information by applying a linear transformation, i.e., $z_{true} = A z_{learned}$. That is, it can happen that a single $z_{learned,i}$ changes multiple factors, but we can find a matrix $A$ such that we get factors where modularity holds.
Condition 1 is hurt if, e.g., color is not encoded in the latents; while condition 2 is not fulfilled if there is no such matrix $A$ that $z_{true} = A z_{learned}$ holds (e.g., there is a nonlinear mapping to $z_{true}$).
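This linear-decodability check can be sketched numerically (the helper names are mine): fit the matrix $A$ by least squares and inspect the residual. A linearly mixed representation is explicit; a nonlinearly mixed one is not:

```python
import numpy as np

rng = np.random.default_rng(0)
z_true = rng.normal(size=(1000, 3))

# an entangled but linear representation: an invertible mixing of the factors
M = rng.normal(size=(3, 3))
z_linear = z_true @ M.T
# a nonlinear representation of the same factors
z_nonlin = np.tanh(z_linear)

def linear_decoding_error(z_learned):
    """Least-squares fit of z_true from z_learned (A plays the role of the
    matrix in z_true = A z_learned, up to transposition); relative residual."""
    A, *_ = np.linalg.lstsq(z_learned, z_true, rcond=None)
    return np.linalg.norm(z_true - z_learned @ A) / np.linalg.norm(z_true)

print(linear_decoding_error(z_linear))  # ~0: explicit
print(linear_decoding_error(z_nonlin))  # clearly larger: not linearly decodable
```

Even though the mixing matrix $M$ scrambles the factors across dimensions, the linear case passes the explicitness check, while the tanh-distorted one does not.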
Let’s start with a refresher from AMMI 03 post about what a symmetry is:
A symmetry of an object is a transformation that leaves certain properties of the object invariant.
And continue with the same example as in the paper: a grid world with
Translation and color change do not change the identity of the object, so they are the symmetries of the example, and as such, they can be thought of as a symmetry group $G$. Elements $g\in G$ thus map from data space to data space as $G\times X\to X,$ leading to the conclusion that these transformations are the group actions. Additionally, we can create subgroups of $G$, corresponding to horizontal/vertical translation and color change. To have a disentangled representation, we require that when, e.g., color changes, the position stays the same. Translated to the language of geometric deep learning, this means that a
disentangled group action should decompose into components for each subgroup such that it only affects its corresponding subgroup.
The components are subgroups as they are in $G$ and when we change the corresponding factor (such as color) then we will remain within the subgroup: it does not matter how much we tinker around with color, we cannot get the position to change (throwing a paint bucket at it does not count!).
The first notable point is that here
the disentangled representation is defined in terms of a disentangled group action of symmetry group $G$.
Thus, the disentanglement definition from the paper becomes (it refers to vector representations as the latent space is assumed to be a vector space, i.e., we have latent vectors such that their linear combination is also a valid latent vector):
A vector representation is called a disentangled representation with respect to a particular decomposition of a symmetry group into subgroups, if it decomposes into independent subspaces, where each subspace is affected by the action of a single subgroup, and the actions of all other subgroups leave the subspace unaffected.
This means that the definition depends on the specific decomposition of $G$ into subgroups.
For example, if we define the decomposition with only two subgroups (one for position and one for color), then we do not care about whether the model can disentangle horizontal and vertical position. And this is a very important point.
This definition of disentanglement provides means to fine-tune the granularity w.r.t. which we require disentanglement.
From a practical point of view, this could lead to simpler models as no model capacity needs to be spent to disentangle specific factors. Furthermore, this also means that
There is no requirement on the dimensionality of the disentangled subspace.
That is, even if there is a multidimensional subgroup, e.g., because it comprises correlated factors, as long as the corresponding group action only acts on this subspace, it is disentangled. Such scenarios can arise in the real world: when encoding both height and age, they are correlated (there are no two-meter-tall babies).
When this is not enough, we should note that there is
no restriction on the bases of the subgroups.
Thus, position is not required to be described with the Cartesian coordinate axes.
Furthermore, if we impose the linearity constraint on the group actions for all subgroups, we arrive at a linear disentangled representation (this means that we have the matrices $\rho$ from last post’s representation definition):
A vector representation is called a linear disentangled representation with respect to a particular decomposition of a symmetry group into subgroups, if it is a disentangled representation with respect to the same group decomposition, and the actions of all the subgroups on their corresponding subspaces are linear.
A somewhat surprising counterexample is the case of 3D rotations. Namely, as they are not commutative (see the image below), they cannot be disentangled according to the definition of (Higgins et al., 2018). The non-commutativity implies that the group actions (rotations around any of the $x$-, $y$-, or $z$-axes) affect each other: after rotating around the $z$-axis, a rotation around the $x$-axis will have a different effect. Thus, the group actions are not disentangled, and neither is the representation.
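The non-commutativity is a two-line check with rotation matrices:

```python
import numpy as np

def rot_x(a):
    """Rotation by angle a around the x-axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_z(a):
    """Rotation by angle a around the z-axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

a = np.pi / 2
# rotating around z then x differs from rotating around x then z
print(np.allclose(rot_x(a) @ rot_z(a), rot_z(a) @ rot_x(a)))  # False
```

In contrast, 2D rotations (a single angle) commute, which is why the grid-world example above poses no such problem.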
(Higgins et al., 2018) provides a principled definition of disentanglement based on group theory. The main benefit of this is that it enables us to communicate clearly about what we mean by disentangled representations. Furthermore, instead of speaking about “data generative factors” (which is a vague concept), they reason about the well-defined notion of group actions.
I came across a question about how you can express, e.g., a logical AND relationship in SEMs (Structural Equation Models). Let’s look into this.
Assume that you would like to visit your friend and you have a motorcycle you wish to use. To be able to undertake the journey, you need both the motorcycle and fuel (if you can afford it, anyways…). Clearly, you require both conditions, leading to a logical AND condition. How do you describe this in the language of SEMs?
We have the following binary variables:
$Z$ depends on both $X$ and $Y$, so this will be a collider/v-structure in the form $X\to Z\leftarrow Y$.
But this graph does not say anything about the nature of the dependency, it only says that there is some.
This is not a bug: graphs on their own are only designed to express the (non-)existence of a cause-effect relationship; they will not say anything about its functional form. For that, we need to level up our game to SEMs - we need the “equation” part of the SEM.
In a SEM, each of $X,Y,Z$ has a corresponding function that maps from the exogenous variables $U_i$ to the observed ones. For simplicity, assume that \(\begin{align} X&=U_X \\ Y&=U_Y\end{align}\)
Now let’s look at $Z$. There, the relationship will be of the form \(Z = f(X,Y, U_Z),\) as $Z$ depends on both $X$ and $Y$. To express the logical AND relationship, we might construct $f$ as \(Z = f(X,Y, U_Z) = X*Y*U_Z,\) which requires that $X$ and $Y$ are both present (expressed with the multiplication; $U_Z$ is not important here), meaning that
the nature of the relationship is expressed by the functional relationships, not by the graph structure itself.
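Putting the SEM above into code (a toy simulation with my own variable names) confirms that the multiplicative structural equation realizes the logical AND:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# exogenous variables: binary and independent
u_x = rng.integers(0, 2, n)      # motorcycle available?
u_y = rng.integers(0, 2, n)      # fuel affordable?
u_z = np.ones(n, dtype=np.int64)  # no extra noise on Z, for simplicity

# structural equations
x = u_x
y = u_y
z = x * y * u_z  # logical AND via multiplication of binary variables

# Z is 1 exactly when both X and Y are 1
print(np.array_equal(z, x & y))  # True
```

The graph $X\to Z\leftarrow Y$ stays the same whether $f$ encodes AND, OR, or XOR; only the structural equation distinguishes them.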
To understand this, we need to review the different error sources in learning systems, namely
Geometric priors in this context mean exploiting the geometric structure of the data.
For example, we can exploit that translating images will not change the object represented; thus, we get the same class label—some data augmentation techniques also rely on this idea, but they are not as principled as the Geometric Deep Learning approach. We will come back to this at the end of the post. This translation invariance is exactly what CNNs realize, leading to a simpler and smaller hypothesis class and thus smaller statistical error (and hopefully not increasing the approximation error—for CNNs, we know that labels are the same when images are translated, so we can be sure that the approximation error will not increase, but this can be nontrivial in more complex scenarios). Additionally, as CNNs do not care about translations, we don’t need to present images of the same object in every position; thus, we can reduce sample complexity too.
As the title of Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges says, deep learning has also been invaded by the 5G: grids, groups, graphs, geodesics, and gauges (this has nothing to do with the conspiracy theories, probably because much fewer people understand it). The meaning of these concepts will be clarified (though not all of them in this post). For now, what is important is that they describe (geometric) structure. Grids (e.g., pixel grids describing images) have an adjacency structure, i.e., every pixel has a specific set of neighbors. In the case of graphs, the edges between the nodes give the structure. We would be fools not to exploit this structure. To refer to such structures, we will use the notion of
A domain $\Omega$ is a set with possibly additional structure.
Sometimes, the domain itself is the data we use, for example
Nonetheless, often we want to store more information, e.g., the color of a pixel or what you have eaten for breakfast tomorrow with the obvious reason to sell it to marketers to create personalized ads for the special omelette with peanut butter and jelly you thought you could keep secret.
To attach other attributes to elements of a domain $\Omega$, we apply a function, which we call the signal, mapping from elements of $\Omega$ (e.g., a pixel) to a vector space $C$ (e.g., the RGB color of a pixel); we denote the space of signals as $X=\{x:\Omega\to C\}$ (think of this as the data space of RGB images).
So a signal associates a vector space $C$ with each element of $\Omega$.
$C$ does not even need to be the same for all $u$, it can be e.g. the tangent space of a specific point on a sphere.
What is interesting is that irrespective of the domain $\Omega$, $X$ will always be a vector space (linear combinations work as expected). Okay, it is not that interesting, as we defined $C$ to be a vector space. Nonetheless, this enables us to do operations on our data (e.g., we can add images)—and as we know from our discussion on abstract algebra, operations are essential to define groups, for example. Obviously, this is where today’s discussion will lead.
We have done the above (and the previous post too) to be able to describe symmetries.
Symmetries are object transformations that leave the object unchanged,
and come in many flavors. Formulated otherwise: $g:\Omega\to\Omega$ is a symmetry if it preserves the structure of the domain. We start by noting that when using the group action $g$ we act on an element of $\Omega$ and get back a (possibly different) element of $\Omega$. This means that the mapping is $G\times\Omega \to \Omega$ and is denoted as $(g,u)\mapsto gu$ (i.e., it associates the element $gu$ to the element $u\in\Omega$ via the symmetry $g$).
$g$ has the following properties ($e$ is the identity element)
An example is planar motion in $\mathbb{R}^2$, where $g$ is described by a rotation angle $\theta$ and two translation coordinates $t_x, t_y$. Applying $g$ to a point $u=(x,y)$ can then be characterized by the mapping $((\theta, t_x, t_y), (x,y))\mapsto [R; T] (x,y,1)$, where $[R;T]$ is a shorthand for the transformation matrix that rotates $u$ by $\theta$ and translates it by $(t_x,t_y)$—the third coordinate is needed to describe this affine transformation with a single matrix.
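This homogeneous-coordinate trick looks as follows in code (a minimal sketch):

```python
import numpy as np

def planar_motion(theta, tx, ty):
    """Homogeneous 3x3 matrix [R; T]: rotate by theta, then translate by (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])

g = planar_motion(np.pi / 2, 1.0, 0.0)
u = np.array([1.0, 0.0, 1.0])  # the point (1, 0) in homogeneous coordinates
print(g @ u)  # (1, 0) rotated to (0, 1), then shifted to (1, 1)
```

The appended 1 in the third coordinate is what lets a single matrix express the (affine) rotate-then-translate map.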
It took a lot of effort, but now we can use all of this to describe groups:
If we collect all symmetries (of a specific $\Omega$) together, we get a symmetry group $G$ with a group operation as the composition of group elements.
From the [previous post](https://rpatrik96.github.io/posts/2022/06/dgl1-foundations/), we know that:
Symmetry groups can be discrete (rotating an equilateral triangle with multiples of $120^\circ$ or flipping its vertices) or continuous (rotations in $SO(3)$). They can be commutative/non-commutative (flipping the vertices of a triangle then rotating it has a different result than first rotating then flipping).
As these transformations (functions) $g$ act on $\Omega$ but our data lives in the signal space $X(\Omega, C)$, we need to introduce the corresponding mapping on $X$ as well. Namely, we need to be able to express symmetries not just on the pixel grid, but also in the RGB channels (the vector space $C$).
Let’s look into an example of moving a bug (I am not suggesting that you get a bug in your code and move it into someone else’s) in an image by translation $t=(t_x,t_y)$. When we translate the bug by 5 pixels to the right (this would mean $t_x=5, t_y=0$), then to get the pixel value of the translated image at position $u$, we need to look up the original pixel value at the position 5 pixels to the left of $u$, i.e., at $u-t$.
This example highlights why the formula defining how the symmetries of $\Omega$ act on the signal space $X(\Omega, C)$ is \((gx)(u) = x(g^{-1}u)\); in our example, $g^{-1}$ applies $-t$. If you wonder what the reason for the inverse is, you don’t need to wait any longer: it is to satisfy the group axioms (hint: the inverse element needs to be in the group—in our example, this is the relationship between pixels shifted to the left/right).
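The inverse-lookup formula can be checked on a toy image with wrap-around translation (a sketch of my own; `np.roll` plays the role of $g$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))  # a toy single-channel "image" signal on a pixel grid

t = (0, 5)  # translate 5 pixels to the right (with wrap-around, for simplicity)
gx = np.roll(x, shift=t, axis=(0, 1))

# (g x)(u) = x(g^{-1} u): the translated image at u equals the original at u - t
u = (3, 6)
print(np.isclose(gx[u], x[3, (6 - 5) % 8]))  # True
```

Note that the wrap-around boundary is what makes translation an exact symmetry here; on a finite image without it, the relation only holds away from the border.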
Groups are abstract concepts; we need to describe them such that our computers can produce significant carbon footprints. Implicitly, we already did this (not the carbon footprint thing, though): when we used matrices to describe affine transformations, we assigned a linear map to the group element $(\theta, t_x, t_y)$. Basically, this is what representations do.
An $n$-dimensional real representation of a group $G$ is a map $\rho: G\to \mathbb{R}^{n\times n}$ assigning an invertible matrix $\rho(g)$ to each $g\in G$ such that it satisfies the homomorphism property \(\rho(gh) = \rho(g)\rho(h)\)
An example would be the following:
An important conclusion is that the number of elements in $G$ (in this case infinite) and the dimension of the representation are independent
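The homomorphism property is easy to verify numerically, e.g., for the rotation group $SO(2)$, where the group operation amounts to adding angles:

```python
import numpy as np

def rho(theta):
    """2D rotation matrix: a representation of the rotation group SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

g, h = 0.7, 1.9
# homomorphism property: rho(g h) = rho(g) rho(h)
# (composing two rotations adds their angles)
print(np.allclose(rho(g + h), rho(g) @ rho(h)))  # True
```

Here the group has infinitely many elements, yet a 2-dimensional representation suffices — illustrating the independence noted above.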
This type of symmetry comes from how we build our neural networks. For example, given the vector space of data (signals) $X$, outputs (e.g., labels) $Y$, and the weights $W$, we can describe our net by a mapping $X\times W\to Y$ (mapping data with the net’s weights to a label) and say that a transform $g$ is a symmetry of this parametrization (~network structure) when we get the same result by using the weights $w\in W$ as with $gw$. For example, in an MLP permuting the hidden units makes no change in the output as we add the values and addition is commutative.
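The permutation symmetry of an MLP can be verified in a few lines (a toy network of my own, not any specific architecture): permuting the rows of the first weight matrix and the columns of the second leaves the output unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# a tiny one-hidden-layer MLP: y = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(16, 4))
W2 = rng.normal(size=(3, 16))
x = rng.normal(size=4)

def mlp(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0)

# permute the hidden units: the same permutation on W1's rows and W2's columns
perm = rng.permutation(16)
y = mlp(W1, W2, x)
y_perm = mlp(W1[perm], W2[:, perm], x)

print(np.allclose(y, y_perm))  # True
```

This is a symmetry of the parametrization $X\times W\to Y$: the weights $w$ and $gw$ (the permuted weights) implement the same function.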
We already touched on CNNs and their invariance w.r.t. translation. In general, if the label does not change under a transformation $g: \Omega\to\Omega$, then we say that $g$ is a symmetry of the label function. The label function is simply a notation describing the mapping that associates a label in $Y$ to a data point in $X$ (denoted by $L: X\to Y$ ). Note that here $g$ is applied on the domain, only after that comes the label function, i.e., $L\circ g $.
This means that if we have a single data point but know all $g\in G$, then we can generate all instances of the class. Basically, learning all symmetries is what it takes to solve classification (which is a very hard problem).
From the previous post, we can relate to factor groups, which describe the subgroups of a specific group that behave the same way w.r.t. the kernel of an operation. What this means for classification is that the elements of factor groups divide all samples into the respective classes. So we can think of the symmetries of the label function as a way to describe the elements of the factor group.
Because they have symmetries! And now we can describe them. For example:
With such structures as graphs or sets, we can point out a seemingly subtle but important detail: although a graph (or a set) is an abstract concept, they need to have a practical description (how they are stored in computer memory). The consequence is that usually, we are interested in the symmetries of the description, not that of the object.
We already talked about CNNs and that they are invariant to translation. Nonetheless, invariance has its fallacies. In the case of learning faces, we need to be careful not to make the intermediate representations invariant, for that can lead to unrealistic objects; with faces, this would mean the rightmost image below.
What is the answer to that? Equivariance
Equivariance means that transforming the input of $f$ with a transform $h$ gives the same result as transforming the output of $f$ with the same $h$: \(f \circ h(u) = h\circ f(u)\)
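A minimal numerical sketch of this property, with my own choice of $f$ and $h$: a circular (cyclic) convolution as $f$ and a one-step translation as $h$. The two operations commute exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=8)          # input signal
w = rng.normal(size=3)          # convolution kernel

def conv(u, w):
    # circular (cyclic) convolution: a translation-equivariant map
    n = len(u)
    return np.array([sum(w[k] * u[(i - k) % n] for k in range(len(w)))
                     for i in range(n)])

def shift(u):
    return np.roll(u, 1)        # the transform h: translate by one step

# f(h(u)) == h(f(u)): convolution is equivariant to translation
assert np.allclose(conv(shift(u), w), shift(conv(u, w)))
```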
We can build equivariant neural networks, if we have the following components:
We need different group representations for each $X_i$ as the same symmetry “needs to be adapted” to the new data space $X_i$.
This leads to the definition of equivariant networks:
\(f_i \circ \rho_{i-1}(g) = \rho_i(g) \circ f_i,\) which means that applying the (corresponding) representation and the nonlinear map can be almost interchanged (note the different indices for $\rho$). First applying the representation of layer $i-1$ and then mapping through layer $i$ should give the same result as first mapping through layer $i$ then applying $\rho_i$.
When treating images, $X_1$ is $n\times n \times 3$ (RGB channels), and assume that the first layer $f_1$ is a convolution with 64 channels. This means that we need a different representation of, e.g., translations in this 64-dimensional space (if we translate with $\rho_0$ then map with the first layer, we should get the same as if we would first map then translate with $\rho_1$). The rationale is that it cannot be that a translation can be described in the exact same way in both 3 and 64 dimensions.
In the end, our goal is to generalize better: we want that all samples that map to the same feature will still map to the same feature after undergoing a transformation by the group representation $\rho_1(g)$ in the input space. For this, the notion of an orbit is a useful concept:
Orbit: the manifold of a sample undergoing a transformation by each element of a group (e.g. the manifold of all rotated digits, starting from a single one—these are the curved lines in the input space in the figure below).
Indeed, this is what equivariant nets are capable of (see, e.g., this paper on CNNs).
Wait a second! We were talking about transformed samples, can’t we achieve the same generalization properties with data augmentation (instead of the toil required to derive the theory and design for equivariant models)?
No, data augmentation is inferior to equivariant networks.
For example, data augmentation implements a constraint for the whole network (i.e., when augmenting the samples, we do not prescribe constraints for specific layers), but equivariance imposes a layer-wise constraint. Moreover, equivariance scales better to large groups.
We dived deep into geometric priors to describe symmetries in the domain (data), parametrization, and labels, with the goal of designing efficient models that exploit this inductive bias. In the end, we did exactly that with equivariant networks, and we now understand why equivariance is beneficial (a principled way to generalize).
This post was created from the AMMI Course on Geometric Deep Learning - Lecture 3 held by Taco Cohen. Mistakes are my own.
Machine learning operates on images, text, speech, and much more. We intuitively understand that they include structure, but for most of us, this is where our knowledge stops. With the emergence of geometric deep learning, there is an increased need to understand the invariances and symmetries in the data.
When I started my B.Sc. in electrical engineering at the Budapest University of Technology and Economics in 2015, the curriculum of our “Introduction to Computer Science” class was changed; it no longer included abstract algebra. Looking into the assigned book, I thought this to be a reasonable decision, as I could not imagine myself using groups, rings, or fields. I still think that this was a reasonable decision for most students, but I have realized that I missed a great opportunity to understand a deeper level of mathematics, and a way to connect it to the real world.
The name “abstract algebra” and connections to the real world seem to be controversial, but I think they are not. Though having a “rotation” is more abstract than a matrix in the mathematical sense, it is a concept we can easily relate to. When I think about rotation matrices, I always associate the physical rotation in three dimensions to have an easy-to-grasp mental concept for the mathematical description. You might object that this only works in 3D. Yes and no: though I only have access to the real-world meaning of rotations in 3D, through this I have a general idea about what rotations are, so I am not baffled when I hear about 100-dimensional objects being rotated.
Geoffrey Hinton’s sarcastic remark also highlights our brain’s capacity to handle complex scenarios:
To deal with a 14-dimensional space, visualize a 3-D space and say ‘fourteen’ to yourself very loudly. Everyone does it.
In abstract algebra, we deal with sets and equip them with different relationships and properties, which we will describe with the help of operations. Operations take a number of elements from a set and map them to another element in the set. We can distinguish unary (acts on one element), binary (acts on two elements), ternary, etc. operations.
This is the same concept you know from programming, so we can think of them as functions with a given number of inputs. Negating a boolean is a unary operator (as it changes the value of a single variable, it is a function with one input), but adding two numbers is binary (two inputs, but still one output).
In mathematics, we can describe such functions as operations mapping from a sequence of elements of a set $S$ (i.e., we take 2 elements from $S$ for a binary operator) to another element in $S$.
In the case of addition (for integers in this example), the Python type hints describe $S$, and we can see that our function maps $S\times S \to S$: the first parameter x comes from the set int, and so does y. As a result, we get another int with the value x+y.
def add(x:int, y:int)->int:
return x+y
An operation is a function, and as one, it can have different properties. These properties are the instructions for how we can apply the operators (functions), and they have a role in what symmetries are present in $S$.
* is used as a symbol for an arbitrary operator, it does not necessarily mean multiplication.
Abstract algebraic structures are defined by the set $S$, one or more operators, and their properties. They form a hierarchy, where we get more and more valid operations. This means that more complex structures admit more “exotic” functions. Intuitively, this is equivalent to having richer “representations” (in the machine learning sense).
Before advancing through this hierarchy, we should ponder the question: if we can have richer representations, why would we be satisfied with simpler ones? That is, why do we need simpler algebraic structures?
Complexity has its price. Thinking in terms of neural networks, a larger model is more powerful, but is harder to train, whereas the smaller one is faster, but less expressive. A convolutional network is invariant to translations, which is useful for images, but could be useless or harmful in other contexts. As we would not choose a huge model for MNIST (we would risk overfitting), we select the simplest algebraic structure that admits the properties we want. In the following, we describe our choices.
Groups are already special, as they have specific properties (see below). To provide some perspective, we start with a very simple structure comprising a set and a single associative operation. It is not much, though it deserves its own name:
A set $S$ is an associative semigroup if it has an associative operator *. If * is also commutative, we call it a commutative (or abelian) semigroup.
$n\times n$ matrices are a good example of associative semigroups. They can describe linear transformations, such as rotations and scalings, so even this simple structure is powerful. They are not abelian, though, as matrix multiplication is generally not commutative. This means that the matrices that do commute w.r.t. * form a smaller set.
Having defined an algebraic structure, we can interpret what we mean by the operator mapping $S\times S\to S$: multiplying two $n\times n$ matrices will also be an $n\times n$ matrix. That is, $S$ is closed w.r.t. *. This is not the case for all operators. If our operator is subtraction, then $S=\mathbb{N}$ is not closed w.r.t. $-$, as $(7-9)\not\in\mathbb{N}$, which violates the requirement that the result of the operator (the output of the function) should be in the same set. In this case, enlarging $S$ is the solution: if we choose $S=\mathbb{Z}$, then $S$ is closed w.r.t. subtraction.
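As a quick sanity check of these claims (the random $3\times 3$ matrices are my arbitrary choice): the product of two such matrices stays in the set, multiplication is associative, but it is not commutative.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))

# Closure: the product of two 3x3 matrices is again a 3x3 matrix.
assert (A @ B).shape == (3, 3)

# Associativity holds for matrix multiplication ...
assert np.allclose((A @ B) @ C, A @ (B @ C))

# ... but commutativity generally does not.
assert not np.allclose(A @ B, B @ A)
```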
If we demand that the $S$ contains a unit element (identity) w.r.t. the operator *, then we call $S$ a monoid. It is important that we speak about “identity w.r.t. the operator *”, as we will see that more complex algebraic structures can have multiple operators and so multiple identity elements. An example is the identity matrix $I_n$ for the set of $n\times n$ matrices and the matrix multiplication *. In this case, we can call $I_n$ a multiplicative unit element to differentiate it from other possible unit elements.
A set $S$ is a group with the operator * if:
- * is associative
- has an identity element $i$ s.t. $\forall x \in S : x*i = i*x = x$
- each element has a unique inverse in $S$, i.e., $\forall x\in S \; \exists! y \in S : x*y=y*x=i$
The leap from monoids to groups is the existence of the inverse element. Though this can seem like an unimportant feature, it is not. Namely, having the inverse in $S$ means that we can undo the operation (think of this as the Ctrl+Z/Cmd+Z option on your laptop). That is, we can answer the question: what was the starting point before applying a specific element? From the machine learning perspective, if we assume that our data is generated by latent factors, then we will be able to recover them. As for semigroups, we can define commutative/abelian groups if * is also commutative.
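A small concrete group where the axioms are easy to verify: the integers modulo $n$ under addition (the choice $n=12$ is arbitrary). The sketch below checks that every element has an inverse in the set that undoes the operation.

```python
n = 12  # integers modulo 12 under addition form a group

def op(a, b):
    return (a + b) % n

identity = 0

def inverse(a):
    return (-a) % n

for a in range(n):
    # the inverse is in the set ...
    assert 0 <= inverse(a) < n
    # ... and undoes the operation from both sides
    assert op(a, inverse(a)) == identity
    assert op(inverse(a), a) == identity
```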
When we have images with different shapes, positions, and colors it is useful to build categories like triangles, circles, and squares. These are subgroups of the original group. That is, if our group $G$ contains vectors with elements $[x; y; angle; shape, color]$ and the operator * combines the features (e.g., translates the object) then a subgroup is a set of elements where some coordinates of the feature vector are fixed. Triangles are elements where $[x; y; angle;shape=triangle, color]$ and so on.
So a subgroup is a subset $S$ of the elements with the same operator as the group $G$. This is like a specific 2D plane in 3D space.
Note that subgroups need to contain the identity of the group to have inverses.
Subgroups are useful as they correspond to how we would categorize objects. We think about triangles, circles, and squares as distinct objects. If we would like to cover the whole space of objects, i.e., to get a description of all 2D planes covering 3D space, we need the concept of cosets.
Cosets describe the shifted copies of a subgroup $S$ that together cover the original group $G$.
For 2D planes parallel to the $x$-$y$ plane of the Cartesian coordinate system, this means having all translations along the $z$ axis. This idea is generalized by taking an element $g\in G$ and applying the group operator * on $g$ and all $s\in S$. That is, we shift all points of the plane ($S$) with $g$. The concept is captured mathematically as $S*g$, meaning that we take all $s\in S$ and apply * with a single $g\in G$ (this yields a single 2D plane, shifted by the vector $g$); then we repeat this for every $g$ (to cover the whole 3D space). We can generate cosets both as $s*g$ (right coset) and $g*s$ (left coset), but we will only focus on cases where both are the same, which we will call normal subgroups. An intuitive way to think about this is to compare this property to commutativity.
How do we benefit from dividing groups into smaller entities besides having a more intuitive description?
This enables us to express certain symmetries. Take rotations of objects, for example. They describe a subset of objects with different orientations. So we write a function to render objects with all rotations. As our computational power is finite, we need to define a step size for the angle; make it $1^\circ$. In this case, the rotation element $R_1$ (“rotate by $1^\circ$”) generates a subgroup (as our group contains other features such as position, shape, etc.) containing $R_1, R_2, \dots, R_{359}, R_{360}$. You need all 360 elements, otherwise the group operator (multiplying the rotation matrices) would create elements not in the subgroup. Moreover, this subgroup is cyclic: when we apply $R_1$ consecutively, we will not get infinitely many different elements, as $R_{360}=R_0$ (the identity). The smallest number of times we apply the generating element $R_1$ to get back the identity is called the order of the subgroup.
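We can verify this cyclic structure numerically with 2D rotation matrices (a sketch; the tolerances are those of floating-point arithmetic): applying "rotate by $1^\circ$" 360 times gives back the identity, while 359 applications do not.

```python
import numpy as np

theta = np.deg2rad(1.0)
R1 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])

# Applying "rotate by 1 degree" 360 times yields the identity:
# the subgroup generated by R1 is cyclic of order 360.
R = np.linalg.matrix_power(R1, 360)
assert np.allclose(R, np.eye(2))

# 359 applications are not enough, so the order is really 360.
assert not np.allclose(np.linalg.matrix_power(R1, 359), np.eye(2))
```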
Normal subgroups can be used to define another group, called factor or quotient group, denoted by $G/S$. The name “quotient” comes from an analogy to division: as quotient groups are sets of cosets of $S$, this means that a quotient describes a “clustering” of $G$ according to $S$. Namely:
The last point illustrates the additional information conveyed by quotient groups compared to plain old division: division only gives the order (i.e., the quotient), but quotient groups provide the elements too. An example is taking the integers as $G$ with addition as the group operation and defining the normal subgroup $S$ as the multiples of, e.g., $7$. Then $G/S$ gives the integers modulo $7$, i.e., it divides all integers into $7$ clusters: those with remainder $0,1,2,3,4,5,6$ w.r.t. division by $7$.
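A sketch of this quotient "clustering" (the integer range is truncated only for illustration): the cosets are the remainder classes, and adding cosets by adding representatives is well-defined, whichever representatives we pick.

```python
# Quotient "clustering": the integers split into 7 cosets by remainder mod 7.
cosets = {r: [g for g in range(-21, 22) if g % 7 == r] for r in range(7)}

assert len(cosets) == 7          # the order of the quotient group
assert 10 in cosets[3]           # 10 = 7 + 3 lies in the coset with remainder 3

# The cosets inherit a group structure: adding representatives and reducing
# mod 7 gives the same coset, whichever representatives we pick
# (10 and 12 represent the same cosets as 3 and 5).
assert (10 + 12) % 7 == (3 + 5) % 7
```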
Regarding technical details, the group structure follows from the properties of normal subgroups, namely, that the left and right cosets are the same: $S*g = g*S$, which we can rewrite as $S=g^{-1}*S*g$. The coset $S$ itself acts as the unit element of the factor group, since $(g*S)*S = g*(S*S) = g*S$ (using $S*S=S$). We have inverses too: the inverse of the coset $g*S$ is $g^{-1}*S$, because $(g*S)*(g^{-1}*S) = g*g^{-1}*S*S = S$, where normality lets us swap $S*g^{-1}$ and $g^{-1}*S$.
From a machine learning perspective, we can see the merit of factor groups, as they can express how different elements in a (data) set are grouped together, e.g., this is what we want when clustering data.
We can describe the same group with different representations. If we have an image, it does not change its meaning whether we select the top left or the bottom right pixel as the origin of our coordinate system, because we can find a bijective mapping that transforms the coordinates from one frame to the other. This notion, which we call isomorphism, is important as it reduces the number of different sets we need to consider (we only need to take care of those that are not isomorphic to each other, e.g., we don’t need all coordinate systems for our images).
Formally, two groups $G_1, G_2$ are isomorphic if there is a bijective mapping $\phi: G_1 \to G_2$ such that $ \forall x,y \in G_1 : \phi(x)*\phi(y) = \phi(x*y)$, where * is the group operation.
The definition says that if we apply the group operation to two elements in $G_1$, then map the resulting group element to $G_2$, we get the same result as applying the group operation of $G_2$ to the elements that are mapped to $G_2$. Going back to our representation learning example, let’s assume that the operator in $G_1$ (the latent space) “combines the features” (similar to + for numbers, e.g., if $x$ describes a red triangle and $y$ a blue triangle, then $x*y$ is a purple triangle); and the operator of $G_2$ does the same for the images (in this example, we can think of adding the matrices representing the images). Translating the definition to this example means that if we combine the features “red triangle” and “blue triangle” (e.g., both are vectors with two elements, indicating RGB color and shape) and mapping this feature vector to an image is equivalent to combining the images of a blue and a red triangle.
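A simpler concrete instance than the triangle example (my choice, not from the post): $\phi(x)=e^x$ is an isomorphism from $(\mathbb{R},+)$ to $(\mathbb{R}_{>0},\cdot)$, since combining first and mapping afterwards agrees with mapping first and combining afterwards, and $\log$ inverts the mapping.

```python
import math

# phi(x) = exp(x) maps (R, +) to (R_{>0}, *):
# phi(x + y) = phi(x) * phi(y), so the group structure is preserved.
phi = math.exp

for x, y in [(0.5, 1.25), (-2.0, 3.0), (0.0, 7.0)]:
    assert math.isclose(phi(x + y), phi(x) * phi(y))

# It is bijective onto the positive reals (log inverts it),
# hence an isomorphism, not just a homomorphism.
assert math.isclose(math.log(phi(2.0)), 2.0)
```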
We already have defined isomorphisms that map between two groups with a bijective mapping, but this is a strong constraint as it requires that each element of $G_1$ is mapped to a single distinct element of $G_2$. Homomorphisms generalize these mappings by omitting the bijectivity constraint, leading to the definition:
The mapping between two groups $G_1, G_2$ is a homomorphism if there is a mapping $\phi: G_1 \to G_2$ defined for each $x\in G_1$ such that $ \forall x,y \in G_1 : \phi(x)*\phi(y) = \phi(x*y)$, where * is the group operation.
Although we cannot generally invert a homomorphism, the property $ \phi(x) * \phi(y) = \phi(x*y) $ still means that we preserve the group structure. In representation learning, we come across a similar concept when using Latent Variable Models (LVMs), where a low-dimensional latent vector describes high-dimensional observations (as in the example above, with factors such as color and shape as latents and the image as the high-dimensional observation).
When training Variational AutoEncoders (VAEs), it can happen that we experience posterior collapse, i.e., some elements in the latent space do not capture useful information, they are white noise. Intuitively, this relates to the concept of kernel (not the Linux one though):
The kernel of a homomorphism $\phi: G_1 \to G_2$ is the set of elements in $G_1$ that map to the unit element in $G_2$ and is denoted by $Ker(\phi)$.
We might think of the collapsed latents in VAEs as the kernel of the mapping to the observation space, since they do not contain information so they get mapped to a blurry image (whether that can be called a unit element is not trivial, but I am reasoning only on an intuitive level).
The image of $\phi$ is the set of elements in $G_2$ that can be produced by mapping elements of $G_1$ via the homomorphism $\phi(g_1) : g_1 \in G_1$ and is denoted by $Im(\phi)$.
In VAE language, these are the images you can generate.
Interestingly, the quotient group $G/Ker(\phi)$ is isomorphic to $Im(\phi)$. That is, if we divide $G$ into cosets according to $Ker(\phi)$ (these are the clusters that “behave” in the same way w.r.t. the kernel), then this grouping is equivalent to taking the elements of $Im(\phi)$: each coset is mapped to the same element of $Im(\phi)$. This means that the latent space can be divided into specific clusters yielding the same image; the importance of this is that it defines a sort of symmetry/invariance in the latent space, showing that some changes in the latents do not affect the generated image.
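To make the statement $G/Ker(\phi) \cong Im(\phi)$ tangible, here is a sketch with the homomorphism $g \mapsto g \bmod 4$ from $(\mathbb{Z},+)$ to $(\mathbb{Z}_4,+)$ (the modulus and the integer range are my arbitrary choices): each coset of the kernel maps to exactly one image element.

```python
# Homomorphism phi: (Z, +) -> (Z_4, + mod 4), phi(g) = g % 4.
def phi(g):
    return g % 4

G = range(-20, 20)
kernel = [g for g in G if phi(g) == 0]   # multiples of 4 map to the identity
image = sorted({phi(g) for g in G})      # everything phi can produce

assert all(g % 4 == 0 for g in kernel)
assert image == [0, 1, 2, 3]

# G / Ker(phi) ~ Im(phi): each coset of the kernel maps to a single
# image element, so the cosets and the image elements correspond 1:1.
for r in image:
    coset = [g for g in G if g % 4 == r]
    assert {phi(g) for g in coset} == {r}
```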
This post was quite heavy introducing a lot of mathematical concepts and notation, but hopefully it provided some intuition why abstract algebra is useful for the geometry of deep learning. Namely, it describes symmetries, and that is what we are after.
An academic paper is not just a messenger of hopefully ground-breaking results, but also a story and a visual manifesto. If it is badly formatted, with dangling words in almost-empty lines and inconsistent notation, readers might give up reading. On the other hand, if everything is nice but this results in a too-long article, then besides risking a desk rejection (or paying extra fees for exceeding the page limit), readers will be less enthusiastic about facing so much text. Your results can be fascinating, but if no one reads them, you have failed your goal.
There are several best practices to ensure that your submission looks professional, eases the reader’s task, and fits into the page limit. I will assume that you are using LaTeX (you should), and provide my two cents on what I found especially useful.
You want that your text looks great and saves space. This is how you do it.
When the last few words of a sentence start a new line, it looks awkward and wastes a lot of space. The solution is to instruct LaTeX to squeeze the words together a bit. This can be done with the \looseness-1 command, which you place in front of a paragraph; then enjoy the result.
Equations should be kept together, i.e., they should not be spread across lines when provided in-text. The easy fix is to put curly braces around them: $y=f(x)$ can be split, but ${y=f(x)}$ cannot.
In-text fractions can take up a lot of space and destroy the homogeneity of the paragraph by requiring more space between lines. A possible solution is using \usepackage{nicefrac} and the \nicefrac{}{} command, which will save you space.
Both the enumerate and itemize environments waste a lot of space between lines by default. So much space also suggests less coherence of the items in the list. With the nolistsep option (from the enumitem package), the spacing between lines is reduced—the leftmargin=* option will save some more space by starting the items right at the left (pun intended).
\usepackage{enumitem}
\begin{itemize}[nolistsep,leftmargin=*]
\item ...
\item ...
\end{itemize}
Conference submissions practically do not allow the inclusion of a table of contents due to the page limit, but it can be helpful for the appendix. This can be done in the following way:
% ----------------------
%% include in preamble
\usepackage[toc,page,header]{appendix}
\usepackage{minitoc}
% makes the "Part I" text invisible
\renewcommand \thepart{}
\renewcommand \partname{}
% ----------------------
%% include in the appendix
\addcontentsline{toc}{section}{Appendix} % Add the appendix text to the document TOC
\part{Appendix} % Start the appendix part
\parttoc % Insert the appendix TOC
LaTeX has commands such as \eqref{}, \autoref{}, and \ref{} that work fine, though what I have started to like recently is the cleveref package with its \cref{} command, as it enables redefining the name LaTeX uses when referencing a table, figure, or section. For example, if you would like references to sections to print out “Section” with an upper-case “S”, then use the following commands (the third set of curly braces defines the plural). All you need is to include this snippet:
\usepackage{cleveref}
\crefname{section}{Section}{Sections}
It can be annoying when clicking on a reference means that somehow we need to navigate back to the same spot. Loading hyperref as \usepackage[backref=page]{hyperref} will show, for each bibliography entry, the page numbers where it was cited. Since these are active links, going back to the original line could not be more straightforward.
When using environments for theorems, remarks, and co., it can be useful to restate them in the appendix to avoid the back-and-forth to the main text. Simply copy-pasting is not a good solution, as that way a different number will be assigned to the second appearance of the same claim. With the thmtools package, there is a solution for this:
\usepackage{amsthm} % to have theorem environments in the first place
\usepackage{thmtools,thm-restate}
\newtheorem{thm}{Theorem}
\begin{restatable}{thm}{nameofthm}
This is true.
\end{restatable}
\nameofthm* % this will repeat the theorem with the same number
If you need to reference an item in a list (a common use case is referring to, e.g., claims of a theorem), the enumitem package can help you:
\usepackage{enumitem}
\usepackage{cleveref}
\newlist{nameofenumeration}{enumerate}{2} % define enumeration type
\setlist[nameofenumeration]{label={\normalfont(\roman*)},ref=\thetheorem(\roman*)} % set up the label; it will include the number of the theorem
\crefname{nameofenumerationi}{property}{properties} % cleveref config
\begin{theorem}
Theorem comes here with claims:
\begin{nameofenumeration}
\item property one \label{prop:1}
\item property two \label{prop:2}
\end{nameofenumeration}
\end{theorem}
A single figure will waste a lot of space if put into, e.g., a figure environment. A possible solution is to use the wrapfig package, which lets LaTeX arrange text around the figure.
\usepackage{wrapfig}
\begin{wrapfigure}{r}{6.5cm}
\centering
\includegraphics[height=7em,width=\textwidth]{figures/fig.png}
\caption{Caption}
\label{figure:fig}
\end{wrapfigure}
It’s good practice to organize notation and abbreviation into a system. So if you need to change how you denote the input, you only need to do it at a central place.
I have a separate .tex file for all my abbreviations, including concepts such as KL divergence or the ELBO. All of these can be defined with the \newacronym{}{}{} command, where the three arguments are the label, the short form, and the long form:
\newacronym{ml}{ML}{Machine Learning}
From this point on, \gls{ml} will print out Machine Learning (ML) for the first use, then only ML. Variants include \glspl{ml}, \acrshort{ml}, \acrlong{ml}, and \acrfull{ml}. Besides enforcing consistency, a list of acronyms can be created with the \printacronyms command—as a bonus, the acronyms will be cross-referenced, so clicking on them will lead you to the list of acronyms. Handy, isn’t it?
Notation is a crucial tool to refer to concepts in a short form and to formalize ideas. Consistency is key here too. Fortunately, the glossaries package does this for us: we can organize notation into categories, then print them out such that the reader can click on a symbol and get reminded of what we use it for.
This is what you need to include:
\usepackage[acronym, automake, toc, nomain, nopostdot, style=tree, nonumberlist]{glossaries}
\usepackage{glossary-mcols} % to have multiple columns
\setglossarystyle{mcolindex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Glossary
\newglossary{abbrev}{abs}{abo}{Nomenclature} %abs and abo are file extensions LaTeX will use internally for this set of formulas -- different glossaries should have different ones
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Top-level glossary entries
\newglossaryentry{lr}{
name = \ensuremath{\alpha} ,
description = {learning rate} ,
type = abbrev,
}
% A separate category for mathematics, this will render all related notation
% under a "Maths"" header
\newglossaryentry{math}{type=abbrev,name=Maths,description={\nopostdesc}}
\newglossaryentry{cov}{
name = \ensuremath{\Sigma} ,
description = {covariance matrix} ,
type = abbrev,
parent = math,
}
For referencing the above entries, the same \gls{} command is used as for acronyms. The list of notation can be included by invoking the \printglossary[type=abbrev, style=tree] command (this will use a hierarchical style). When including \setglossarysection{subsection}, both the glossary and the acronyms will be at the subsection level.
The resulting structure will be:
Nomenclature
- $\alpha$ learning rate
Maths
- $\Sigma$ covariance matrix
If the glossaries are not showing up (especially if you are on Overleaf), check whether the files are in a folder (common when uploading a .zip). If yes, move everything out of the folder.
hyperref warnings

When using \gls{}, \glspl{}, \acrshort{}, \acrlong{}, or \acrfull{} in a caption, hyperref will warn about “Token not allowed in a PDF string”. To fix this, we can redefine these commands as
\pdfstringdefDisableCommands{\def\gls#1{<#1>}%
\def\glspl#1{<#1>}%
\def\acrshort#1{<#1>}%
\def\acrlong#1{<#1>}%
\def\acrfull#1{<#1>}%
}
to get rid of the warning and have more meaningful bookmarks in the pdf.
Working on a large project with a lot of files and figures can result in you sitting in front of your monitor and reading War and Peace before you can start handling the error messages. \includeonly comes to the rescue, as it restricts the files processed by the compiler to only the specified ones (no leading or trailing spaces allowed).
\includeonly{a.tex,b.tex}
I learned lot of the tricks in this post from Luigi Gresele and Julius von Kügelgen.
I am now ready to pay my debt with the next flashcard deck, which accompanies yet another statistics course, namely Bayesian Statistics: Techniques and Models by Matthew Heiner on Coursera.
The link for the cards can be found on ankiweb.net. After downloading the deck, you can import them into the free Anki software. You can also access the deck in my earlier resources post.
Happy learning and stay tuned for other decks to come!
P.S.: if you find any error, please contact me to help improve the material.
Last time we saw how we can adjust for direct causes by giving conditions for which variables we need to observe: for calculating $P(y|do(X=x))$, we need $Y, X, Pa_X$. This post gives two more general formulas that can be applied to DAGs to test whether the adjustment conditions are satisfied.
The main idea behind the generalization is the fact that not only $Pa_X$ can block the incoming paths to $X$. As these paths come from the non-descendants of $X$ and the edges point toward $X$, the whole concept is thought of as having (and, to screen off confounding, blocking) a back door.
A variable set $Z$ satisfies the Back-Door Criterion to an ordered pair of variables $(X, Y)$ in a DAG if:
- nodes in $Z$ are non-descendants of $X$
- $Z$ blocks every incoming path into $X$
The Back-Door Criterion makes a statement about an ordered pair; i.e., $Y$ is a descendant of $X$ (there is a path from $X$ to $Y$). The first condition generalizes the requirement of observing $Pa_X$: any non-descendants of $X$ suffice, provided they block the incoming paths into $X$ (the second condition).
That is, we are looking for a set of variables $Z$ such that every path $X \leftarrow \dots - Z - \dots - Y$ is blocked. Note that here $-$ stands for any of $\leftrightarrow, \rightarrow, \leftarrow$. The only constraint is that the path has an edge pointing into $X$. As we want to reason about the effect of $X$ on $Y$, we need to leave the paths from $X$ to $Y$ unblocked but all paths into $X$ blocked.
After understanding the Back-Door Criterion, we can apply this to calculate interventional distributions.
If a variable set $Z$ satisfies the Back-Door Criterion relative to $(X, Y)$, then the effect of $X$ on $Y$ is given by: \[P(y|do(X=x)) = \sum_z P(y|x,z)P(z)\]
This is the same formula we had for adjusting for direct causes. Nonetheless, the scenarios where we can apply it are more general.
The formula can be interpreted as dividing the data into categories by the values of $Z$ and $X$ (this is also called stratifying) and calculating the weighted average of the strata (the fancy plural form for data categories). By conditioning on these two variables, we make the strata independent of each other: as $Z$ blocks the back-door paths, conditioning on $X$ is the same as $do(X=x)$. Note that for a general $Z$ this would not be the case.
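A numerical sketch of the formula on a made-up binary model with $Z \to X$, $Z \to Y$, and $X \to Y$ (all probability tables are invented for illustration). Back-door adjustment weights the strata by $P(z)$, while naive conditioning weights them by $P(z \mid x)$, which is biased by confounding.

```python
# A made-up binary example with confounder Z: Z -> X, Z -> Y and X -> Y.
P_z = {0: 0.5, 1: 0.5}                 # P(Z = z)
P_x1_z = {0: 0.2, 1: 0.8}              # P(X = 1 | z)
P_y1_xz = {(0, 0): 0.1, (0, 1): 0.4,   # P(Y = 1 | x, z)
           (1, 0): 0.5, (1, 1): 0.9}

# Back-door adjustment: P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, z) P(z)
interventional = sum(P_y1_xz[(1, z)] * P_z[z] for z in (0, 1))

# Naive conditioning weights the strata by P(z | X=1) instead, so it is biased.
P_x1 = sum(P_x1_z[z] * P_z[z] for z in (0, 1))
observational = sum(P_y1_xz[(1, z)] * P_x1_z[z] * P_z[z] / P_x1
                    for z in (0, 1))

assert abs(interventional - 0.7) < 1e-9    # 0.5*0.5 + 0.9*0.5
assert abs(observational - 0.82) < 1e-9    # 0.5*0.2 + 0.9*0.8
```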
The Back-Door Adjustment formula is nice, but unfortunately it is sometimes not applicable. It can be quite a strong assumption that we can observe a sufficient set of variables that block all back-door paths.
The intuition for the more general formula of Front-Door Adjustment comes from the genius observation that houses usually have a front entrance, not just a back one.
A variable set $Z$ satisfies the Front-Door Criterion to an ordered pair of variables $(X, Y)$ in a DAG if:
- $Z$ blocks every directed path from $X$ to $Y$
- There is no back-door path from $X$ to $Z$
- All back-door paths from $Z$ to $Y$ are blocked by $X$
Let’s work through these three conditions.
These conditions result in a formula that applies Back-Door Adjustment twice: once for calculating the effect of $X$ on $Z$ and once for using $X$ as a Back-Door for estimating the effect of $Z$ on $Y$.
If a variable set $Z$ satisfies the Front-Door Criterion relative to $(X, Y)$ and if $P(x,z) >0$ then the effect of $X$ on $Y$ is given by: \(P(y|do(X=x)) = \sum_z P(z|x)\sum_{x'}P(y|x', z)P(x')\)
The outer sum is the effect of $X$ on $Z$; the second condition ensures that the conditional is the same as the interventional distribution. The inner sum is the effect of $Z$ on $Y$, calculated by the Back-Door Adjustment formula.
The requirement for a positive $P(x,z)$ distribution ensures that the conditional $P(y|x,z)$ is well-defined, meaning that all $x,z$ combinations yield meaningful strata.
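A sketch checking the formula on a made-up SEM with an unobserved confounder $U$ ($U \to X$, $U \to Y$, $X \to Z \to Y$; all probability tables are invented). The front-door estimate, computed purely from observational quantities (never touching $U$), recovers the true interventional probability.

```python
from itertools import product

# Made-up SEM with unobserved confounder U: U -> X, U -> Y, and X -> Z -> Y.
P_u = {0: 0.5, 1: 0.5}
P_x1_u = {0: 0.1, 1: 0.9}                    # P(X = 1 | u)
P_z1_x = {0: 0.2, 1: 0.8}                    # P(Z = 1 | x)
P_y1_zu = {(0, 0): 0.1, (0, 1): 0.4,         # P(Y = 1 | z, u)
           (1, 0): 0.6, (1, 1): 0.9}

def bern(p, v):                              # P(V = v) for V ~ Bernoulli(p)
    return p if v == 1 else 1 - p

joint = {(u, x, z, y): P_u[u] * bern(P_x1_u[u], x) * bern(P_z1_x[x], z)
         * bern(P_y1_zu[(z, u)], y)
         for u, x, z, y in product((0, 1), repeat=4)}

def P(pred):  # marginal probability of an event over the joint
    return sum(p for k, p in joint.items() if pred(*k))

# Observational quantities only: U never appears below (it is unobserved).
P_x = {v: P(lambda u, x, z, y: x == v) for v in (0, 1)}
P_z_given_x = {(zv, xv): P(lambda u, x, z, y: z == zv and x == xv) / P_x[xv]
               for zv in (0, 1) for xv in (0, 1)}
P_y1_given_xz = {(xv, zv): P(lambda u, x, z, y: y == 1 and x == xv and z == zv)
                 / P(lambda u, x, z, y: x == xv and z == zv)
                 for xv in (0, 1) for zv in (0, 1)}

# Front-door formula for P(Y=1 | do(X=1))
front_door = sum(P_z_given_x[(z, 1)]
                 * sum(P_y1_given_xz[(xp, z)] * P_x[xp] for xp in (0, 1))
                 for z in (0, 1))

# Ground truth from the SEM: set X = 1, keep all other mechanisms.
truth = sum(P_u[u] * bern(P_z1_x[1], z) * P_y1_zu[(z, u)]
            for u in (0, 1) for z in (0, 1))

assert abs(front_door - 0.65) < 1e-9
assert abs(truth - 0.65) < 1e-9
```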
Our endeavor to find ways to adjust for confounding resulted in two practical formulas. Now we can fight confounding. Of course, this requires that we know that confounding is present with a specific structure.
We will revisit interventions in this post. As discussed in PoC #4, interventions can provide more information than observational data alone. What does this "more information" look like? Recall that for interventions we need a DAG besides the joint distribution. When we intervene, we modify the DAG by removing the incoming edges of the intervened node. This has an effect on the Markov factorization, which can be expressed in multiple ways, each offering a different interpretation.
Before we start, let me share a quote with you from Jonas Peters's lecture on causality at MIT in 2017. He calls this MUTE, the Most Useful Tautology:
If we intervene only on $X$, we intervene only on $X$.
This will help us: MUTE means that all other conditional distributions do not change, so it is not hopeless to calculate interventional distributions from observational data. Believe me, you will see it soon.
First, let’s define a causal effect with do-calculus.
Given disjoint sets of variables $X$ and $Y$, the causal effect of $X$ on $Y$ is denoted by $P(y | do(X=x))$. For each value $x$, it gives the probability of $Y = y$ in the SEM obtained by deleting the equation $x = f(pa_x, u_x)$ (and thus all incoming edges of the node $X$) and setting $X = x$ in the remaining equations.
This definition contains nothing new; it uses the $do$-notation to express the probability of $Y$ when we intervene on $X$. There are multiple ways to calculate and to conceptualize this causal effect, as we will see in the next sections.
We can think of interventions $do(X_i = x_i')$ in a DAG with variables $X_1, \dots, X_n$ as if we flipped a switch to make $X_i := x_i'$. That is, there are two mechanisms to determine the value of $X_i$: the conditional $P(x_i|pa_i)$ and the intervention $do(X_i = x_i')$. Of course, the there-are-two-mechanisms view has the same effect; it only differs in interpretation. Its main advantage is that we can explicitly incorporate the intervention in a single DAG, i.e., no need to mess around with deleting edges.
To do this, we augment node $X_i$ in the DAG with an additional parent $F_i$, yielding $Pa_i' = Pa_i \cup \{F_i\}$, where $F_i \in \{do(X_i = x_i'), idle\}$ - meaning that $F_i$ is the "switch between the two mechanisms" determining $X_i$.
The intervention is encoded via the added edge $F_i \rightarrow X_i$, yielding the conditional
\[P\left(x_i | pa_i'\right) = \begin{cases}P(x_i | pa_i), \ \ \qquad\quad \mathrm{if} \ \ F_i =idle \\ 0, \qquad\qquad\qquad\quad \mathrm{if} \ \ F_i = do(X_i = x_i') \wedge x_i \neq x_i' \\ 1, \qquad\qquad\qquad\quad \mathrm{if} \ \ F_i = do(X_i = x_i') \wedge x_i = x_i' \end{cases}\]We need to differentiate between $x_i \neq x_i'$ and $x_i = x_i'$ under the intervention to remain consistent: if we set $X_i$ to $x_i'$, then all $x_i \neq x_i'$ must have 0 probability.
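The case distinction maps directly onto a small helper function; a minimal sketch, where the observational mechanism `p_obs` is a hypothetical placeholder for $P(x_i|pa_i)$:

```python
def switched_conditional(x_i, pa_i, F, p_obs, x_prime=None):
    """P(x_i | pa_i') with the augmented parent F_i acting as the switch.

    F is either "idle" (use the observational mechanism p_obs) or "do",
    in which case all probability mass is put on x_i = x_prime.
    """
    if F == "idle":
        return p_obs(x_i, pa_i)
    # F = "do": deterministic mechanism, consistent with do(X_i = x_prime)
    return 1.0 if x_i == x_prime else 0.0

# Usage: under do(X_i = 1), the value 1 has probability 1, everything else 0.
print(switched_conditional(1, (), "do", None, x_prime=1))  # 1.0
```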
Having discussed the effect of an intervention, we can now express the joint distribution in the case of $do(X_i = x_i')$. The straightforward way is to start from the Markov factorization $\prod_{j} P(x_j|pa_j)$ and leave out the factor $P(x_i | pa_i)$. We can do this because, by intervening on $X_i$, we have $P(x_i' | pa_i, do(X_i = x_i'))=1$.
\[P\left(x_1, \dots, x_n | do(X_i=x'_i)\right) = \begin{cases}\prod_{j\neq i} P(x_j|pa_j), \quad \mathrm{if} \ \ x_i = x_i' \\ 0, \qquad\qquad\qquad\quad \mathrm{if} \ \ x_i \neq x'_i \end{cases}\]This expression shows the ICM Principle at work: only the mechanism intervened on changes; everything else remains the same.
This notation can also handle compound interventions, i.e., when we intervene on multiple variables at the same time. If we denote the set of variables we intervene on with $S$, then we can write
\[P\left(x_1, \dots, x_n | do(S=s)\right) = \begin{cases}\prod_{i : X_i \not\in S}P(x_i|pa_i), \qquad\quad \mathrm{if} \ \ X \mathrm{\ consistent\ with\ } S\\ 0, \qquad\quad \ \ \mathrm{otherwise} \end{cases}\]It's also interesting to figure out the relationship between the interventional and the original (preinterventional) distribution. The expression follows from the truncated factorization by extending it with $\frac{P(x_i'|pa_i)}{P(x_i'|pa_i)}$ and then noticing that the numerator contains all factors of the joint. The numerator will be the joint distribution before the intervention, $P\left(x_1, \dots, x_n\right)$ with $x_i = x_i'$ - thus the name preinterventional distribution. The denominator will be the factor $P(x_i'|pa_i)$. Note that we need to tie $x_i = x_i'$, as otherwise the expression would be inconsistent with $do(X_i = x_i')$.
\[P\left(x_1, \dots, x_n | do(X_i=x'_i)\right) = \begin{cases}\dfrac{P\left(x_1, \dots, x_n\right)}{P(x_i'|pa_i)}, \ \qquad\quad \mathrm{if} \ \ x_i = x_i' \\ 0, \qquad\qquad\qquad\qquad\quad \mathrm{if} \ \ x_i \neq x'_i \end{cases}\]Besides satisfying our intrinsic striving for mathematical diversity and beauty, this expression makes the difference between interventions and conditioning clear. The exception is intervening on root nodes - i.e., when $Pa_i = \emptyset$, where $P(x_i'|pa_i=\emptyset) = P(x_i' | do(X_i = x_i'))=P(x_i')$ - in which case the two coincide (cf. Causal Bayesian Networks in PoC #4).
When conditioning on $X_i = x_i'$, what we do can be thought of as a two-step process:
1. We discard all points of the joint where $X_i \neq x_i'$ (they receive 0 probability).
2. We renormalize the remaining probabilities by the single constant $P(x_i')$.
This means that conditioning distributes the probability mass over all remaining values (i.e., where in the joint we have $X_i = x_i’$) equally in the sense that the same normalizing factor is applied in each case.
The situation could not be more different when we intervene on $X_i$.
In the interventional case, each excluded point (where $x_i \neq x_i’$) transfers its probability to a subset of points sharing the same value of $pa_i$. That is, depending on $pa_i$, different normalization constants are applied.
Assume that we have a graph $X\rightarrow Y\rightarrow Z$ with all variables being binary. To ensure that the conditional independencies hold in the joint distribution (i.e., that the Markov factorization is the one implied by the graph, or more precisely, that the graph is a perfect I-map), we need the marginal for $X$:
$X$ | $P(X)$ |
---|---|
$0$ | $0.6$ |
$1$ | $0.4$ |
The second ingredient is the conditional for $Y$ ensuring $P(Y|X) = P(Y|X,Z)$:
$X$ | $Y$ | $P(Y|X)$ |
---|---|---|
$0$ | $0$ | $0.8$ |
$0$ | $1$ | $0.2$ |
$1$ | $0$ | $0.5$ |
$1$ | $1$ | $0.5$ |
And the third one is the conditional for $Z$ ensuring $P(Z|Y) = P(Z|X,Y)$:
$Y$ | $Z$ | $P(Z|Y)$ |
---|---|---|
$0$ | $0$ | $0.9$ |
$0$ | $1$ | $0.1$ |
$1$ | $0$ | $0.3$ |
$1$ | $1$ | $0.7$ |
That is, the probabilities populate the following table with eight entries:
$X$ | $Y$ | $Z$ | $P(X,Y,Z)$ | $P(X,Y,Z|Y=1)$ | $P(X,Y,Z|do(Y=1))$ |
---|---|---|---|---|---|
$0$ | $0$ | $0$ | $0.432$ | $0$ | $0$ |
$0$ | $0$ | $1$ | $0.048$ | $0$ | $0$ |
$0$ | $1$ | $0$ | $0.036$ | $0.036/0.32=0.1125$ | $0.036/0.2=0.6*0.3=0.18$ |
$0$ | $1$ | $1$ | $0.084$ | $0.084/0.32=0.2625$ | $0.084/0.2=0.6*0.7=0.42$ |
$1$ | $0$ | $0$ | $0.18$ | $0$ | $0$ |
$1$ | $0$ | $1$ | $0.02$ | $0$ | $0$ |
$1$ | $1$ | $0$ | $0.06$ | $0.06/0.32=0.1875$ | $0.06/0.5=0.4*0.3=0.12$ |
$1$ | $1$ | $1$ | $0.14$ | $0.14/0.32=0.4375$ | $0.14/0.5=0.4*0.7=0.28$ |
We can calculate the marginal $P(Y=1) = 0.036+0.084+0.06+0.14 = 0.32$, which we will need for calculating interventions from the preinterventional distribution. I included the probabilities when conditioning on $Y=1$ in the fourth column (you thought a CS guy would start indexing from 1?), whereas the fifth column includes the interventional probabilities (calculated both from the preinterventional distribution and with the truncated factorization).
Notice that in the interventional case, we divide the probabilities by different values (depending on the parent of $Y$, i.e., on $X$). A sanity check: in both the conditioning and the interventional case, the probabilities add up to 1.
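The whole table can be reproduced in a few lines; a sketch using the numbers from the tables above:

```python
# Chain X -> Y -> Z with the probability tables from the example.
P_X = {0: 0.6, 1: 0.4}
P_Y_given_X = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.5, (1, 1): 0.5}  # key: (y, x)
P_Z_given_Y = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}  # key: (z, y)

# Markov factorization: P(x, y, z) = P(x) P(y|x) P(z|y)
joint = {(x, y, z): P_X[x] * P_Y_given_X[(y, x)] * P_Z_given_Y[(z, y)]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

P_Y1 = sum(p for (x, y, z), p in joint.items() if y == 1)  # marginal P(Y=1) = 0.32

# Conditioning on Y=1: one single, global normalizer P(Y=1).
cond = {k: (p / P_Y1 if k[1] == 1 else 0.0) for k, p in joint.items()}

# Intervening do(Y=1): the normalizer depends on the parent X, namely P(Y=1|X=x).
interv = {k: (p / P_Y_given_X[(1, k[0])] if k[1] == 1 else 0.0)
          for k, p in joint.items()}

print(interv[(0, 1, 0)])  # ~0.18, matching the table
```

Both `cond` and `interv` sum to 1, exactly as the sanity check above requires.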
Although generally intervening on $X_i$ is different from conditioning on $X_i$, we can use conditioning to express the intervention as well.
We start from the joint distribution, then by using the chain rule of Bayesian networks, we "extract" $P(x_i'|pa_i)$ and $P(pa_i)$. As the intervention makes $P(x_i'|pa_i) =1$, we can simplify the expression: \(\begin{align*} P\left(x_1, \dots, x_n | do(X_i=x'_i)\right) &= P\left(x_1, \dots, x_n|x_i',pa_i\right)P(x_i'|pa_i)P(pa_i) \\ &= P\left(x_1, \dots, x_n|x_i',pa_i\right)P(pa_i) \end{align*}\)
Our manipulation requires that $x_i = x_i’$, so the resulting expression includes two cases: \(P\left(x_1, \dots, x_n | do(X_i=x'_i)\right) = \begin{cases}P\left(x_1, \dots, x_n|x_i',pa_i\right)P(pa_i), \ \ \quad \mathrm{if} \ \ x_i = x_i' \\ 0, \qquad\qquad\qquad\qquad\qquad\qquad\quad \mathrm{if} \ \ x_i \neq x'_i \end{cases}\)
If you now focus on the formulas we came up with in this section, you will realize that they express interventions without using any interventional distribution. This means that in specific cases, we are able to calculate the effect of an intervention from observational distributions.
We will use the last formulation - interventions as conditioning - to calculate the effect of an intervention from observational data. This is called adjustment for direct causes.
Let $PA_i$ be the set of direct causes (parents) of $X_i$ and let $Y$ be any set of variables disjoint from $\{X_i\} \cup PA_i$. The effect of the intervention $do(X_i = x_i')$ on $Y$ is
\[P\left(y | do(X_i=x'_i)\right) = \sum_{pa_i}P(y|x_i',pa_i)P(pa_i)\]
Additionally, we need to marginalize over $Pa_i$ as we are interested in $P\left(y | do(X_i=x'_i)\right)$. We can use this formula because $Pa_i$ screens off any effect on $X_i$ coming from its other nondescendants - i.e., if we know the value $Pa_i = pa_i$, then the other variables do not matter; we have all the information needed to determine the value/distribution of $X_i$.
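A quick numeric sketch of adjustment for direct causes on a toy graph $W \rightarrow X \rightarrow Y$ with an extra edge $W \rightarrow Y$, so that $Pa_X = \{W\}$ confounds the effect of $X$ on $Y$. All numbers are hypothetical:

```python
# Adjustment for direct causes: P(y | do(X=x)) = sum_w P(y | x, w) P(w).
P_W = {0: 0.7, 1: 0.3}
P_Y_given_XW = {  # key: (y, x, w); made-up numbers
    (1, 0, 0): 0.2, (1, 0, 1): 0.5, (1, 1, 0): 0.6, (1, 1, 1): 0.9,
}
for (y, x, w), p in list(P_Y_given_XW.items()):
    P_Y_given_XW[(0, x, w)] = 1 - p  # fill in the complement for y = 0

def adjust_direct_causes(x, y):
    """Marginalize over the parent W of X, weighting by P(w)."""
    return sum(P_Y_given_XW[(y, x, w)] * P_W[w] for w in (0, 1))

print(adjust_direct_causes(1, 1))  # ~0.69 = 0.6*0.7 + 0.9*0.3
```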
We have seen that we can do some black magic with the observational distributions to get the effect of an intervention. However, this is not always possible. In this section, we will get acquainted with the formal notion of identifiability, then discuss conditions for causal effect identifiability.
Identifiability in a general sense states that some quantity (intervention, likelihood, mean, etc.) can be computed uniquely.
Let $Q(M)$ be any computable quantity of a model $M$. $Q$ is identifiable in a model class $\mathcal{M}$ if, for any pair of models $M_1, M_2 \in \mathcal{M}$, it holds that $Q(M_1) = Q(M_2)$ whenever $P_{M_1}(v) = P_{M_2}(v)$.
The definition implies that we have an “identifiability mapping” from probability distributions of models to the space of a quantity $h(M,v) : P_{M}(v) \rightarrow Q(M)$ where the same $P_M(v)$ values map to the same $Q(M)$.
The definition can be extended to the case when there are hidden variables; then we use the observed subset of $P_M(v)$.
For causal effects, identifiability is defined as follows:
The causal effect of $X$ on $Y$ is identifiable from a graph $G$ if $P(y | do(X=x))$ can be computed uniquely from any positive probability distribution of the observed variables, i.e., $P_{M_1}(y | do(X=x))=P_{M_2}(y | do(X=x))$ for every pair of models $M_1$ and $M_2$ with $P_{M_1}(v) = P_{M_2} (v) > 0$ and $G(M_1) = G(M_2) = G$.
Again, uniqueness is the key in the definition - the positivity assumption is required to exclude edge cases (e.g., dividing by 0). The identifiability of $P(y | do(X=x))$ ensures that we can infer the effect of $do(X = x)$ on $Y$ from:
- the observational distribution of the observed variables, and
- the causal graph $G$.
This definition mirrors the adjustment for direct causes, where we used knowledge both from observations and the graph.
Looking into a more specialized model class, namely, Markovian Models (where we have a DAG and the noises are jointly independent), we can state the following result:
In a Markovian causal model $M = \langle G, \theta_G\rangle$ with a subset $V$ of all variables being observed, the causal effect $P(y|do(X=x))$ is identifiable whenever $X\cup Y\cup Pa_X\subseteq V$.
We need observability of $X, Y, Pa_X$ to use the adjustment for direct causes. This is required to calculate the quantities in the adjustment formula above.
When all variables are measured (i.e., when we are extremely lucky), the causal effect can be calculated via the truncated factorization.
This post opened the door to the most intricate details of calculating interventions, discussing various formulations and interpretations. At the end, we also touched on the topic of identifiability.
However, we have not yet covered the case of confounding - an undoubtedly more realistic scenario. You can probably figure out what comes next then.