[This is an article, which is something I’ve optimized for readability and transmission of ideas, as opposed to notes, which are the seed of an idea and something I’ve written quickly.]
The definition of causality within physics is not a settled matter, perhaps surprisingly. My understanding is that this question is studied more by philosophers than physicists, as the field of physics tends to avoid interpretational problems. That is to say, theories like relativity or quantum mechanics are mathematically well defined and make predictions, so that’s all there is to it, right? I’m not a physicist, so I will proceed to ask such questions.
I suspect that causality and information are intimately related. To begin my pursuit of understanding physical information, I am starting with the role causality plays in physics. The SEP outlines some of the conversation and ideas around causality and physics. I haven’t read these ideas yet, but I want to take my own tabula rasa stab at the problem before reading about what other people have tried. I am familiar with Judea Pearl’s notion of causality in machine learning and statistics, which I will attempt to apply to physics below.
Causal Models
First, I’ll outline Pearl’s framework for causality. I used Causality (Pearl) and Elements of Causal Inference (Peters, Janzing, Schölkopf) to learn about this topic.
Pearl assumes the world (or some part of it) can be represented by a graph, where nodes represent potential observations and the directed edges between them represent causal links. For example (figure 1.2 from Pearl, the sprinkler network):
The core idea in Pearl’s causality is the intervention, which is a modification to the graph where a node is disconnected from all incoming arrows and held fixed at some value.
An example of an intervention (figure 1.4 from Pearl):
An intervention in this graph is a graph surgery (as Pearl calls it). Graph interventions correspond to real-world interventions. The intervention depicted above corresponds to someone forcing the sprinkler system to turn on (e.g. by switching the sprinkler system’s setting from auto to manual). The sprinkler state is now causally independent of everything else in the graph, because we, the experimenters, have directly determined its state (we would need to be careful to ensure our own actions are not causally linked to the system we are studying). By observing the downstream effects of this change to the graph, the causal effect of the particular node can be measured. That is, the effect of the sprinkler node $X_2$ itself, independent of other nodes like the season $X_1$.
Generally, Pearl places a probability distribution on graph node states, given by $P(X_1 = x_1, \ldots, X_n = x_n)$, or using shorthand, $P(x_1, \ldots, x_n)$. I’ll use capital letters, $X_i$, to denote graph nodes themselves (or random variables on graph nodes), and lowercase letters, $x_i$, to denote a specific value that the corresponding node takes on. So for example, node $X_2$, the sprinkler state, could take on the values “on” or “off”. In the abstract, $X_i$ takes on some value $x_i$. Sometimes I’ll introduce a “prime” tick, $x_i'$, to denote some other value that may be distinct from $x_i$.
There is an alternative functional perspective, where each node’s value is a deterministic function of the incoming values traveling along inward arrows, plus an auxiliary noise input not depicted in the graph. Those noise inputs can themselves be determined (i.e. held fixed), having been pulled from an algorithmically random stream. I will stick to the functional, deterministic perspective when I discuss physics, while recognizing that random physical processes can be viewed as deterministic but algorithmically random.
In its general form, a functional causal model consists of a set of equations of the form
$$x_i = f_i(pa_i, u_i), \qquad i = 1, \ldots, n,$$
where $pa_i$ (connoting parents) stands for the set of variables that directly determine the value of $X_i$ and where the $U_i$ represent errors (or “disturbances”) due to omitted factors.
That is to say, the parents $PA_i$ of node $X_i$ are the set of nodes with arrows pointing into $X_i$. So in the example, $PA_2 = \{X_1\}$ because $X_1$ is the only node pointing into $X_2$, and $PA_4 = \{X_2, X_3\}$ because both $X_2$ and $X_3$ point into $X_4$. Node $X_1$ is not a direct parent of $X_4$.
$U_i$ is an auxiliary input node to each $X_i$ which is not depicted in the graph, and which makes the output value random. In my view, each value $u_i$ is pulled from an algorithmically random stream. Given the set of parent values $pa_i$ and the value $u_i$, the output of the function $f_i$ is then able to be random.
A note about notation: it would not be correct to write $f_i(PA_i, U_i)$, which passes the nodes themselves into the function $f_i$. On the other hand, $f_i(pa_i, u_i)$ passes the values of the parent nodes and of the noise input node into the function.
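To make the functional perspective concrete, here is a minimal Python sketch of a functional causal model for a sprinkler-style graph. The graph shape mirrors the example above, but the specific functions $f_i$ and the numeric probabilities are made-up illustrative choices, not taken from Pearl.

```python
import random

# Functional causal model for a sprinkler-style graph:
# season -> sprinkler, season -> rain, (sprinkler, rain) -> wet, wet -> slippery.
# Each node value is x_i = f_i(pa_i, u_i), where u_i is an exogenous noise value.

def f_season(u):
    return "dry" if u < 0.5 else "rainy"

def f_sprinkler(season, u):
    p_on = 0.7 if season == "dry" else 0.1      # illustrative probabilities
    return "on" if u < p_on else "off"

def f_rain(season, u):
    p_rain = 0.1 if season == "dry" else 0.8
    return "rain" if u < p_rain else "no-rain"

def f_wet(sprinkler, rain, u):
    p_wet = 0.95 if (sprinkler == "on" or rain == "rain") else 0.05
    return "wet" if u < p_wet else "dry"

def f_slippery(wet, u):
    return "slippery" if (wet == "wet" and u < 0.9) else "not-slippery"

def sample():
    # Noise inputs u_i are drawn independently; given them, every node is deterministic.
    u = [random.random() for _ in range(5)]
    season = f_season(u[0])
    sprinkler = f_sprinkler(season, u[1])
    rain = f_rain(season, u[2])
    wet = f_wet(sprinkler, rain, u[3])
    slippery = f_slippery(wet, u[4])
    return dict(season=season, sprinkler=sprinkler, rain=rain, wet=wet, slippery=slippery)

print(sample())
```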
Do-Operator
If $P$ is the probability measure on the initial graph (e.g. figure 1.2 above), then what is the probability measure on the modified graph after taking an intervention (e.g. figure 1.4)? Pearl uses “do”-notation, which for the example above looks like this:
$$P(x_1, x_3, x_4, x_5 \mid do(X_2 = \text{on})) \,.$$
This is the probability of the vector of node values given that the intervention setting node $X_2$ to the constant value “on” was taken. Note the notational similarity to conditional probability: $P(x_1, x_3, x_4, x_5 \mid X_2 = \text{on})$. Conditionalization is a different operation on the measure than the “do”-operator, but they are mathematically related and their similar notation is justified.
For an arbitrary graph with nodes $X_1, \ldots, X_n$ and probability measure $P$ on node values, the conditional probability of value vector $(x_1, \ldots, x_n)$ given $X_i = x_i'$ is
$$P(x_1, \ldots, x_n \mid X_i = x_i') = \begin{cases} \dfrac{P(x_1, \ldots, x_n)}{P(x_i')} & \text{if } x_i = x_i' \\[4pt] 0 & \text{if } x_i \neq x_i', \end{cases}$$
whereas the probability of $(x_1, \ldots, x_n)$ given that intervention $do(X_i = x_i')$ was taken is (Causality, eq. 3.11)
$$P(x_1, \ldots, x_n \mid do(X_i = x_i')) = \begin{cases} \dfrac{P(x_1, \ldots, x_n)}{P(x_i' \mid pa_i)} & \text{if } x_i = x_i' \\[4pt] 0 & \text{if } x_i \neq x_i'. \end{cases}$$
Both operations are performing a domain restriction on $P$, in the sense that the resulting measure assigns 0 probability to all vectors where $x_i \neq x_i'$, for some constant $x_i'$. The difference between them is that conditionalization, $P(x_1, \ldots, x_n \mid x_i')$, simply rescales the resulting measure by $1 / P(x_i')$ after domain restriction, whereas intervention, $P(x_1, \ldots, x_n \mid do(x_i'))$, re-weights every single probability independently by $1 / P(x_i' \mid pa_i)$, where $pa_i$ is the set of values in $(x_1, \ldots, x_n)$ for the parent nodes of node $X_i$.
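As a sanity check on the two reweighting rules, here is a small Python sketch on a hypothetical three-node graph $X_1 \to X_2$, $X_1 \to X_3$, $X_2 \to X_3$ (my own example, with made-up probabilities). Conditioning rescales the restricted joint by $1/P(x_2')$, while intervening divides each vector’s probability by $P(x_2' \mid x_1)$, and the two give different downstream distributions because $X_1$ confounds $X_2$ and $X_3$.

```python
from itertools import product

# A hypothetical confounded graph: X1 -> X2, X1 -> X3, X2 -> X3.
# All probabilities below are made-up numbers for illustration.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}              # [x1][x2]
p_x3_given_x1x2 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
                   (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.2, 1: 0.8}}  # [(x1,x2)][x3]

# Joint distribution P(x1, x2, x3) from the factorization over the graph.
joint = {(x1, x2, x3): p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x1x2[(x1, x2)][x3]
         for x1, x2, x3 in product([0, 1], repeat=3)}

x2_star = 1  # the value we condition on / intervene with

# Conditioning: restrict to x2 == x2_star, rescale everything by 1 / P(X2 = x2_star).
p_x2_marginal = sum(p for (a, b, c), p in joint.items() if b == x2_star)
conditioned = {x: (p / p_x2_marginal if x[1] == x2_star else 0.0) for x, p in joint.items()}

# Intervention do(X2 = x2_star): restrict to x2 == x2_star, but reweight each vector
# by 1 / P(x2_star | pa_2), where pa_2 = x1 for that vector (truncated factorization).
intervened = {x: (p / p_x2_given_x1[x[0]][x2_star] if x[1] == x2_star else 0.0)
              for x, p in joint.items()}

def marginal_x3(dist):
    return {v: sum(p for (a, b, c), p in dist.items() if c == v) for v in [0, 1]}

print("P(x3 | X2=1)     =", marginal_x3(conditioned))
print("P(x3 | do(X2=1)) =", marginal_x3(intervened))  # differs: X1 confounds X2 and X3
```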
Rewriting $P(x_1, \ldots, x_n)$, we can see why multiplying by $1 / P(x_i' \mid pa_i)$ corresponds to an intervention:
$$P(x_1, \ldots, x_n) = \prod_{j=1}^{n} P(x_j \mid pa_j)$$
by the chain rule of probability, because the graph also encodes which nodes are statistically independent, i.e. if $X_j$ is conditionally independent of its other predecessors given its parents $PA_j$, then $P(x_j \mid x_1, \ldots, x_{j-1}) = P(x_j \mid pa_j)$ (taking the nodes in an order where parents come before children).
The operation of removing the connections going into $X_i$ from its parents is a matter of removing the term $P(x_i \mid pa_i)$ by dividing it out.
This formulation of an intervention can be generalized further. Instead of setting $X_i$ to a constant value $x_i'$, in general we can replace the node distribution $P(x_i \mid pa_i)$ with a new distribution $\tilde{P}(x_i \mid \widetilde{pa}_i)$, where $\widetilde{PA}_i$ is some new set of parents, which may or may not be the empty set, or equivalent to or overlapping with the old parents $PA_i$. If $\widetilde{PA}_i$ is empty, that is equivalent to making $X_i$ statistically independent of the other nodes, where $\tilde{P}(x_i \mid \widetilde{pa}_i) = \tilde{P}(x_i)$. We can get our constant-value intervention by choosing a delta distribution (one-hot for discrete $X_i$, and Dirac delta for continuous $X_i$) which is non-zero only if $x_i = x_i'$. Now this general-case intervention is replacing the term $P(x_i \mid pa_i)$ with $\tilde{P}(x_i \mid \widetilde{pa}_i)$, which looks like this:
$$\tilde{P}(x_1, \ldots, x_n) = \tilde{P}(x_i \mid \widetilde{pa}_i) \prod_{j \neq i} P(x_j \mid pa_j) = P(x_1, \ldots, x_n)\,\frac{\tilde{P}(x_i \mid \widetilde{pa}_i)}{P(x_i \mid pa_i)} \,.$$
When $\tilde{P}(x_i \mid \widetilde{pa}_i)$ is such a delta distribution, this expression reduces to the constant-value intervention defined above.
In the functional perspective, an intervention replaces $f_i$ with some other function $\tilde{f}_i$.
Causal Effect
In Causality, definition 3.2.1, Pearl defines causal effect as follows:
Let $X$ and $Y$ be two disjoint sets of graph nodes. The causal effect of $X$ on $Y$ is the function from the space of node values for $X$ to the space of probability measures on $Y$,
$$x \mapsto P(y \mid do(X = x)),$$
where $x$ is some chosen vector of values for the nodes $X$.
That is to say, the causal effect of nodes $X$ on nodes $Y$ is characterized by the set of all interventions obtained by setting $X$ to every possible value $x$, where each intervention is characterized by a change in the probability distribution on $Y$. In other words, the causal effect of $X$ on $Y$ is characterized by how $P(y \mid do(X = x))$ varies for different $x$, and compared to no intervention, $P(y)$.
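In code, this definition of causal effect is just a map from each value $x$ to an interventional distribution over $Y$. A minimal sketch on a hypothetical binary graph $X_1 \to X_2$, $X_1 \to X_3$, $X_2 \to X_3$ (all numbers invented), using the truncated factorization from earlier:

```python
# The causal effect of X2 on X3, tabulated as the function x2 |-> P(x3 | do(X2 = x2)).
# Hypothetical binary graph X1 -> X2, X1 -> X3, X2 -> X3; all numbers are made up.
P1 = {0: 0.6, 1: 0.4}                                        # P(x1)
P3 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.8}    # P(X3 = 1 | x1, x2)

def p_x3_given_do_x2(x2):
    # Truncated factorization: P(x3 | do(x2)) = sum_x1 P(x1) * P(x3 | x1, x2).
    # Note that P(x2 | x1) has been cut out of the product entirely.
    p_one = sum(P1[x1] * P3[(x1, x2)] for x1 in (0, 1))
    return {0: 1.0 - p_one, 1: p_one}

causal_effect = {x2: p_x3_given_do_x2(x2) for x2 in (0, 1)}  # the whole function, tabulated
print(causal_effect)
```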
When Interventions And Conditionalization Are Equivalent
It should be obvious that when node $X_i$ has no parents, then $P(x_1, \ldots, x_n \mid do(X_i = x_i')) = P(x_1, \ldots, x_n \mid X_i = x_i')$ for all node values $x_1, \ldots, x_n$, because $PA_i = \emptyset$ and so $P(x_i' \mid pa_i) = P(x_i')$.
Another case is when we are only considering the marginal distribution on a subset of variables. Then the conditional distribution and intervention distribution on the Markov blanket of that subset are equivalent.
To see what I mean, let’s consider the Markov chain $X_1 \to X_2 \to \cdots \to X_n$, where $PA_1 = \emptyset$ and $PA_i = \{X_{i-1}\}$ for all $i > 1$. Then we have
$$P(x_{i+1}, \ldots, x_n \mid do(X_i = x_i)) = P(x_{i+1}, \ldots, x_n \mid X_i = x_i) \,.$$
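A quick numerical check of this claim, using a hypothetical three-node chain with made-up transition tables: conditioning on $X_2$ and intervening on $X_2$ give the same distribution on the downstream node $X_3$.

```python
from itertools import product

# Hypothetical Markov chain X1 -> X2 -> X3 with made-up binary conditional tables.
P1 = {0: 0.7, 1: 0.3}                            # P(x1)
P2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # P(x2 | x1)
P3 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # P(x3 | x2)

joint = {(x1, x2, x3): P1[x1] * P2[x1][x2] * P3[x2][x3]
         for x1, x2, x3 in product([0, 1], repeat=3)}

x2_star = 1

# P(x3 | X2 = x2_star): restrict the joint to x2 == x2_star, rescale by 1 / P(X2 = x2_star).
p_x2_star = sum(p for (a, b, c), p in joint.items() if b == x2_star)
conditioned = {v: sum(p for (a, b, c), p in joint.items() if b == x2_star and c == v) / p_x2_star
               for v in [0, 1]}

# P(x3 | do(X2 = x2_star)): truncated factorization drops the factor P(x2 | x1) entirely.
intervened = {v: sum(P1[a] * P3[x2_star][v] for a in [0, 1]) for v in [0, 1]}

print(conditioned)
print(intervened)   # identical: X2 screens X1 off from X3, so conditioning = intervening downstream
```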
Causality For Physics
Pearl’s causality is based on the idea of the intervention, which is a kind of graph surgery.
To apply Pearl’s causality to physics, we’d need to define what an intervention does to physical processes. There are two immediate problems:
First, Pearl defines interventions for causal graphs, where node values are sampled i.i.d., and where the nodes represent stateless and otherwise isolated processes (aside from their arrows). Physics, on the other hand, allows for arbitrary interactions between systems, to the point where the boundaries between systems may be blurred or destroyed so that it does not even make sense to think about there being any independent components at all (think about a liquid or gas). Physical processes are not i.i.d. (the future depends on the past), and they have internal state which determines their future time evolution.
Second, classical physics is non-probabilistic (non-statistical Newtonian mechanics and relativity). If our notion of causality is to be suitable for all of physics, it needs to apply to Newtonian mechanics, which means causality must precede probability. Therefore we need to define interventions on deterministic systems.
Pearl generally considers a graph intervention to represent an intervention that can conceivably be taken, and ideally taken repeatedly, so that the causal effect of various interventions can be empirically estimated with histograms (empirically estimate $P(y \mid do(x))$ and $P(y)$).
I don’t think physically plausible interventions can generalize to arbitrary physical systems. I will instead consider what I call a counterfactual intervention, which is merely a modification to a mathematical model (i.e. representation) of physics. A counterfactual intervention is hypothetical, and produces a different time-line than the “factual” time-evolution of the system. A counterfactual intervention is the answer to the question, “what would have happened if the system were in state $s'$ rather than state $s$ at time $t$?”
If intuition serves, and the logical structure of causality lies within all theories of physics, then the purpose of the counterfactual intervention is to probe those theories and make their implicit causal structures mathematically explicit.
My objective here is to give an abstract definition of theories of physics in general, define what it means to take a counterfactual intervention on a physical system (whether probabilistic or non-probabilistic), and then show the equivalence of this type of intervention to Pearl’s graph intervention above.
Abstract Physics
In any theory of physics there is a state space $S$. In Newtonian mechanics, a state is a vector of the various components of the system, such as a vector of positions and momenta given by $s = (\mathbf{q}, \mathbf{p})$. In general, state can include other kinds of degrees of freedom, such as the orientations of solid bodies in 3D space. In quantum mechanics there is quantum state, and state spaces are Hilbert spaces.
A theory of physics specifies both the state space and how to solve for the time-evolution of the system given a particular state at time $t$. The result is a complete description of a system’s time evolution through state space, given as a state-valued function of time, $x : \mathbb{R} \to S$, which I’ll call a trajectory. To be clear, a single trajectory is a single possible time-evolution, e.g. $x(t) \in S$ where $t \in \mathbb{R}$.
The mathematical machinery that converts known information, e.g. the state of the system at time $t$, into a full trajectory varies between theories of physics and often makes use of a Lagrangian or Hamiltonian. These details can be abstracted away. In principle, for any theory of physics there is a family of time-evolution functions (also called propagators), $\phi_{\Delta t} : S \to S$ for every time interval $\Delta t$ (both positive and negative), which maps any state at time $t$ to the state at time $t + \Delta t$. Typically physics is time-symmetric, which means that $\phi_{\Delta t}$ is a bijection and thus invertible. Note also that $\phi_{\Delta t}$ does not depend on the absolute time $t$, and so we are implicitly assuming the given theory of physics is time-translationally invariant.
The set of all trajectories is $S^{\mathbb{R}}$, denoting the set of all functions from $\mathbb{R}$ to $S$. For a given time-evolution family $\phi$, there is a subset of trajectories $T_\phi \subseteq S^{\mathbb{R}}$ which are valid for $\phi$ (or $\phi$-valid),
$$T_\phi = \{\, x \in S^{\mathbb{R}} \mid x(t + \Delta t) = \phi_{\Delta t}(x(t)) \text{ for all } t, \Delta t \in \mathbb{R} \,\} \,.$$
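To make $S$, $\phi_{\Delta t}$, and $\phi$-validity concrete, here is a Python sketch that uses a unit-mass, unit-frequency harmonic oscillator as a stand-in theory of physics (my own toy choice): a state is a (position, momentum) pair, and $\phi_{\Delta t}$ is the exact propagator, a rotation in phase space.

```python
import math

# State space S = R^2: a state is (position q, momentum p) of a unit-mass,
# unit-frequency harmonic oscillator. phi(dt, .) is the exact propagator.
def phi(dt, state):
    q, p = state
    return (q * math.cos(dt) + p * math.sin(dt),
            -q * math.sin(dt) + p * math.cos(dt))

# A trajectory is a function from time to state. This one is phi-valid by construction:
def trajectory(t, initial_state=(1.0, 0.0), t0=0.0):
    return phi(t - t0, initial_state)

# Check phi-validity at a few sample times: x(t + dt) == phi_dt(x(t)).
for t in [0.0, 0.5, 2.0]:
    for dt in [-1.0, 0.3, 4.0]:
        lhs = trajectory(t + dt)
        rhs = phi(dt, trajectory(t))
        assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))

# phi_dt is a bijection: phi_{-dt} inverts it (time reversibility).
s = (0.3, -1.2)
assert all(abs(a - b) < 1e-9 for a, b in zip(phi(-0.7, phi(0.7, s)), s))
```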
Incorporating Probability
Suppose we want to work with some kind of statistical physics. Perhaps we are uncertain about which state the system is in, or the state is randomly chosen. We can just as easily put a probability measure on the set of trajectories.
Let $P$ be a probability measure on the set of all trajectories $S^{\mathbb{R}}$. Moreover, we want to require $P$ to obey the physics of $\phi$ and assign zero probability to physically impossible trajectories, i.e. $\phi$-invalid trajectories. Specifically, $P$ should assign 0 probability to any set comprised only of $\phi$-invalid trajectories, or equivalently, $P(T_\phi) = 1$ (if $P$ is a normalized measure).
This is not typically how statistical physics is conceived of. Normally, there is a probability measure on states at time $t$, and time evolution evolves that measure forward in time. Instead, I’ve put a static global measure on entire trajectories. However, these two views are equivalent.
Let $P_t$ be the marginal probability measure on the state space of the “system” at time $t$. Specifically, $P_t$ is the unique marginal distribution of $P$ on time $t$ only, given by
$$P_t(A) = P(\{\, x \in S^{\mathbb{R}} \mid x(t) \in A \,\})$$
for (measurable) state subsets $A \subseteq S$. Then $P_{t + \Delta t}$ is the time-evolution of the measure $P_t$, given by
$$P_{t + \Delta t}(A) = P_t(\phi_{\Delta t}^{-1}(A)) \,.$$
Proof that $P_{t + \Delta t}(A) = P_t(\phi_{\Delta t}^{-1}(A))$: $P_{t + \Delta t}(A) = P(\{x \mid x(t + \Delta t) \in A\})$ by the definition of $P_{t + \Delta t}$. Since $P$ assigns zero probability to $\phi$-invalid trajectories, $x(t + \Delta t) = \phi_{\Delta t}(x(t))$ almost surely, so this equals $P(\{x \mid x(t) \in \phi_{\Delta t}^{-1}(A)\}) = P_t(\phi_{\Delta t}^{-1}(A))$ by the definition of $P_t$.
Proof that $P$ is uniquely determined by $P_t$, so long as $\phi_{\Delta t}$ is a bijection and $P(T_\phi) = 1$.
At time $t$, for each state $s \in S$, there is a unique $\phi$-valid trajectory that passes through $s$ at time $t$, given by the mapping $t' \mapsto \phi_{t' - t}(s)$. Therefore, there is a family of bijections between the $\phi$-valid trajectories and state space $S$:
$$e_t : T_\phi \to S, \qquad e_t(x) = x(t)$$
for all $t \in \mathbb{R}$. So $e_t(x) = x(t) = s$, where $x \in T_\phi$ is $\phi$-valid. The inverse is then $e_t^{-1}(s) = (t' \mapsto \phi_{t' - t}(s))$. I haven’t defined proper measure spaces on trajectories and states, so I will just assume $e_t$ is a measurable function.
We can derive the following relation: $P_t(A) = P(e_t^{-1}(A))$.
Thus for any (measurable) subset of trajectories $B \subseteq T_\phi$, there is a corresponding (measurable) subset of states $e_t(B) \subseteq S$ so that $P(B) = P_t(e_t(B))$.
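Here is a small numerical illustration of the equivalence of the two views, reusing the toy oscillator propagator from above: a distribution over states at $t = 0$ picks out a measure over $\phi$-valid trajectories via $e_0^{-1}$, and the relation $P_{t + \Delta t}(A) = P_t(\phi_{\Delta t}^{-1}(A))$ can be checked by sampling (the Gaussian initial distribution and the set $A$ are arbitrary choices).

```python
import math, random

def phi(dt, state):                 # the same toy oscillator propagator as above
    q, p = state
    return (q * math.cos(dt) + p * math.sin(dt),
            -q * math.sin(dt) + p * math.cos(dt))

random.seed(0)
# P_0: an arbitrary distribution over states at time 0. Each sampled state s picks out
# the unique phi-valid trajectory t -> phi(t, s), i.e. e_0^{-1}(s).
states_at_0 = [(random.gauss(0.5, 1.0), random.gauss(0.0, 1.0)) for _ in range(100_000)]

in_A = lambda s: s[0] > 0.0         # a measurable set A of states: positive position
t, dt = 0.8, 0.6

# P_{t+dt}(A): fraction of trajectories whose state at time t + dt lies in A.
lhs = sum(in_A(phi(t + dt, s)) for s in states_at_0) / len(states_at_0)

# P_t(phi_dt^{-1}(A)): fraction of trajectories whose state at time t evolves into A after dt.
rhs = sum(in_A(phi(dt, phi(t, s))) for s in states_at_0) / len(states_at_0)

print(lhs, rhs)   # agree, since phi(t + dt, s) == phi(dt, phi(t, s)) up to floating point
```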
Interventions For Physics
Let $g : S \to S$ be a state-replacement function. Usually we want $g$ to be some kind of state projection function, where $g(S) \subset S$ is a strict subset.
I will define a “do”-operator on individual trajectories which performs a surgery and outputs a modified trajectory. Specifically, given state-replacement function $g$ and time $t$,
$$\mathrm{do}(g, t)[x] = x', \qquad x'(t') = \begin{cases} x(t') & t' < t \\ \phi_{t' - t}(g(x(t))) & t' \geq t. \end{cases}$$
The resulting trajectory $x'$ is identical to $x$ prior to time $t$, and discontinuously jumps at time $t$ to the alternative $\phi$-valid trajectory starting from state $g(x(t))$. In this way, $g$ determines which trajectory “tail” to attach to the given trajectory “head”. The resulting “Frankenstein” trajectory is usually not globally $\phi$-valid, but its head and tail are each $\phi$-valid, and thus it is locally $\phi$-valid everywhere except across time $t$.
Let’s overload this “do”-operator to apply element-wise to sets, so
$$\mathrm{do}(g, t)[B] = \{\, \mathrm{do}(g, t)[x] \mid x \in B \,\} \,.$$
Starting with the measure $P$ from above, applying $\mathrm{do}(g, t)$ to the set of all trajectories induces a transformed measure $P_{\mathrm{do}(g, t)}$, where for subsets $B \subseteq S^{\mathbb{R}}$,
$$P_{\mathrm{do}(g, t)}(B) = P(\{\, x \in S^{\mathbb{R}} \mid \mathrm{do}(g, t)[x] \in B \,\}) \,.$$
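A sketch of the trajectory surgery and its induced measure, again with the toy oscillator: the state-replacement function $g$ below zeroes out the momentum, and $\mathrm{do}(g, t)$ splices a fresh $\phi$-valid tail onto each trajectory’s head. Both $g$ and the sampling distribution are arbitrary choices for illustration.

```python
import math, random

def phi(dt, state):                       # toy oscillator propagator again
    q, p = state
    return (q * math.cos(dt) + p * math.sin(dt),
            -q * math.sin(dt) + p * math.cos(dt))

def g(state):                             # state-replacement function: force momentum to zero
    q, p = state
    return (q, 0.0)

def do(g, t_i, x):
    """Trajectory surgery: keep x before t_i, then follow the phi-valid
    trajectory started from g(x(t_i)) at time t_i."""
    def x_new(t):
        if t < t_i:
            return x(t)
        return phi(t - t_i, g(x(t_i)))
    return x_new

random.seed(1)
# A measure on phi-valid trajectories, specified by sampling states at time 0.
heads = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10_000)]
trajectories = [lambda t, s=s: phi(t, s) for s in heads]

t_i = 2.0
intervened = [do(g, t_i, x) for x in trajectories]   # push every trajectory through do(g, t_i)

# The induced measure P_do(g,t) is just the distribution of these modified trajectories,
# e.g. the marginal over momentum right after the intervention is a point mass at 0:
print(sum(abs(x(t_i)[1]) for x in intervened) / len(intervened))   # 0.0
```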
This conception of intervention is different from Pearl’s, which is a modification of the functions generating the behavior of the process in question. In my formulation, I am modifying the state of the process at some point in time, but keeping the behavior-generating functions, i.e. $\phi$, unchanged. That is to say, I am modifying systems, but not the physics, whereas Pearl is modifying the physics, so to speak.
Proof of Pearl-Equivalence
The question is whether my definition of an intervention is equivalent to Pearl’s. To prove this, I need to put my intervention in terms of Pearl’s setup.
The measure $P$ on trajectories corresponds to the measure on graph states, and the transformed measure $P_{\mathrm{do}(g, t)}$ corresponds to the transformed measure on the modified graph.
Recall that $P_{\mathrm{do}(g, t)}$ is defined by
$$P_{\mathrm{do}(g, t)}(B) = P(\{\, x \in S^{\mathbb{R}} \mid \mathrm{do}(g, t)[x] \in B \,\})$$
for subsets $B$ of trajectories. I need to show that $P_{\mathrm{do}(g, t)}$ has the same form as Pearl’s general intervention.
It will help to define the following notation on trajectories: $x|_{[a, b]}$ is the domain restriction of trajectory $x$ to the time interval $[a, b]$.
I will use the following short-hands:
where $\epsilon$ is an arbitrarily small positive real number.
The goal is to prove that
for some probability measure . This is a Lebesgue integral w.r.t. using measure , which for our purposes is just the expectation of the integrand w.r.t. measure . As a Riemann integral: .
It turns out that
Proof
We have, ,
where because is Markov w.r.t. time, which is guaranteed by the construction of from .
Expanding out the integrand, we have
The remainder of this proof consists of showing the following equivalences, which I’ll prove below as lemmas:
That allows us to rewrite:
Therefore
Proof of remaining lemmas:
The cases and are easy:
Because the trajectory before is unchanged, ,
Because time evolution from onward is deterministic and obeys , there is exactly one trajectory that is valid under s.t. . Thus .
Now to prove . Expanding out , we get
Since
then ,
because is the identity function.
Furthermore, ,
so .
Thus we can further expand out :
There is one problematic aspect to this equivalence. Taking another look at the first step in the proof,
we make the assumption that . Given the trajectory slice , there is only one -valid trajectory which shares the same slice, and so there is only one valid state at time to follow from . Since obeys ,
which is non-zero only if . If is itself not -valid, we can define to be an improper probability measure that is always .
I’d argue that interventions on deterministic trajectories are a limiting case of interventions on probabilistic trajectories, where the transition probabilities converge to delta distributions. Then, no matter what, the cancellation works.
Compatibility With Modern Physics
The generalized formulation of physics, using state space $S$ and time-evolution functions $\phi_{\Delta t}$, is compatible with classical physics and special relativity (for an arbitrary choice of Lorentz frame). Is it compatible with quantum mechanics, general relativity, and beyond?
It is compatible with QM if we are time-evolving quantum state and disregarding measurement. If we wanted to model stochastic measurement outcomes, or stochastic interactions in general, then we could do that using a non-deterministic time-evolution function, i.e. $\phi_{\Delta t}$ is not a proper function and assigns more than one output to a given input. Alternatively, the state could contain algorithmically random data which serves as a source of random inputs for $\phi_{\Delta t}$.
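One way to realize that last option is to fold a random tape into the state itself, so that $\phi$ is a deterministic function of the extended state. A discrete-time toy sketch (the pseudorandom tape here is only a stand-in for a genuinely algorithmically random stream):

```python
import random

# State = (physical value, tape index), where `tape` is a fixed stream of random bits
# generated once up front. Evolution is a deterministic function of this extended state,
# even though the physical value follows a (made-up) random walk.
random.seed(42)
tape = [random.choice([-1, +1]) for _ in range(1_000)]   # stand-in for the random inputs

def phi_step(state):
    value, k = state
    return (value + tape[k], k + 1)       # deterministic given the tape

def evolve(state, n_steps):
    for _ in range(n_steps):
        state = phi_step(state)
    return state

# Same initial extended state -> same trajectory, every time (determinism restored).
assert evolve((0, 0), 100) == evolve((0, 0), 100)
print(evolve((0, 0), 100))
```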
For special relativity, simultaneity is relative, but consistently holding to an arbitrary choice of Lorentz frame will work. Then there is a $\phi_{\Delta t}$ for every Lorentz frame, and one can transform between these time-evolution functions via Lorentz boosts.
For general relativity, I am not personally clear on whether there exist global reference frames in which there is a single simultaneous state of the universe, even if what is regarded as simultaneous is arbitrarily chosen. If there are not, my formulation may break down. However, there should still be a causal DAG. Is it possible to topologically sort that DAG and then organize it into something like time slices? Each such slice would then correspond to a state in $S$.