Causality For Physics

April 20, 2021

[This is an article, which is something I’ve optimized for readability and transmission of ideas, as opposed to notes, which are the seed of an idea and something I’ve written quickly.]

The definition of causality within physics is not a settled matter, perhaps surprisingly. My understanding is that this question is studied more by philosophers than physicists, as the field of physics tends to avoid interpretational problems. That is to say, theories like relativity or quantum mechanics are mathematically well defined and make predictions, so that’s all there is to it, right? I’m not a physicist, so I will proceed to ask such questions.

I suspect that causality and information are intimately related. To initiate my pursuit to understand physical information, I am starting by trying to understand the role causality plays in physics. The SEP outlines some of the conversation and ideas around causality and physics. I haven’t read these ideas yet, but I want to take my own tabula rasa stab at the problem before reading about what other people have tried. I am familiar with Judea Pearl’s notion of causality in machine learning and statistics, which I will attempt to apply to physics below.

Causal Models

First, I’ll outline Pearl’s framework for causality. I used Causality (Pearl) and Elements of Causal Inference (Peters, Janzing, Schölkopf) to learn about this topic.

Pearl assumes the world (or some part of it) can be represented by a graph, where nodes represent potential observations, and their directed edges represent causal links. For example (from Pearl):


The core idea in Pearl’s causality is the intervention, which is a modification to the graph where a node is disconnected from all incoming arrows and held fixed at some value.

An example of an intervention:


An intervention in this graph is a graph surgery (as Pearl calls it). Graph interventions correspond to real-world interventions. The intervention depicted above corresponds to someone forcing the sprinkler system to turn on (e.g. by switching the sprinkler system’s setting from auto to manual). The sprinkler state is now causally independent of everything else in the graph, because we, the experimenters, have directly determined its state (we would need to be careful to ensure our own actions are not causally linked to the system we are studying). By observing the downstream effects of this change to the graph, the causal effect of the particular node $X_3$ can be measured. That is, we measure the effect of $X_3$ independent of other nodes like $X_1$.

Generally Pearl places a probability distribution on graph node states, given by $P(X_1 = x_1, X_2 = x_2, X_3 = x_3, \dots)$, or using shorthand, $P(x_1, x_2, x_3, \dots)$. I’ll use capital letters, $X_i$, to denote graph nodes themselves (or random variables on graph nodes), and lowercase letters, $x_i$, to denote a specific value that the corresponding node takes on. So for example, node $X_3$, the sprinkler state, could take on the values ON or OFF. In the abstract, $X_3$ takes on some value $x_3$. Sometimes I’ll introduce a “prime” tick, $x_3'$, to denote some other value that may be distinct from $x_3$.

There is an alternative functional perspective, where each node’s value is a deterministic function of incoming values traveling along inward arrows, and an auxiliary noise input not depicted in the graph. Those noise inputs can themselves be determined (i.e. held fixed), but be pulled from an algorithmically random stream. I will stick to the deterministic perspective when I discuss physics, while recognizing that random physical processes can be viewed as deterministic but algorithmically random.

Quoting Causality, section 1.4.1, Structural Equations:

In its general form, a functional causal model consists of a set of equations of the form
$$x_i = f_i(pa_i, u_i), \quad i = 1, \dots, n,$$
where $pa_i$ (connoting parents) stands for the set of variables that directly determine the value of $X_i$ and where the $U_i$ represent errors (or “disturbances”) due to omitted factors.

That is to say, the parents $PA_i$ of node $X_i$ are the nodes with arrows pointing into $X_i$. So in the example, $PA_3 = \{X_1\}$ because $X_1$ is the only node pointing into $X_3$, and $PA_4 = \{X_2, X_3\}$ because both $X_2$ and $X_3$ point into $X_4$. Node $X_1$ is not a direct parent of $X_4$.

$U_i$ is an auxiliary input node to each $X_i$, not depicted in the graph, which makes the output value $x_i$ random. In my view, each value $u_i$ is pulled from an algorithmically random stream. Given the set of values $pa_i$ and value $u_i$, the output of the function $f_i$ is then able to be random.

A note about notation: It would not be correct to write $f_i(PA_i, U_i)$, which passes the nodes themselves into the function $f_i$. On the other hand, $f_i(pa_i, u_i)$ passes the values $pa_i$ of the parent nodes $PA_i$ and the value $u_i$ of the noise input node $U_i$ into the function.
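As a minimal sketch of the functional perspective, the sprinkler example can be written as five structural functions, each taking parent values and one noise value. The function bodies and thresholds below are invented for illustration, and the parent sets of $X_2$ and $X_5$ are my assumption about Pearl’s sprinkler graph (only $PA_3$ and $PA_4$ are stated above).

```python
import random

# Hypothetical structural functions; thresholds are made up, not Pearl's.
def f1(u):            # X1 (season): no parents
    return "dry" if u < 0.5 else "rainy"

def f2(x1, u):        # X2 (rain): assumed parent X1
    return u < (0.1 if x1 == "dry" else 0.7)

def f3(x1, u):        # X3 (sprinkler): parent X1
    return "ON" if u < (0.8 if x1 == "dry" else 0.1) else "OFF"

def f4(x2, x3, u):    # X4 (wet): parents X2 and X3
    return (x2 or x3 == "ON") and u < 0.95

def f5(x4, u):        # X5 (slippery): assumed parent X4
    return x4 and u < 0.9

def evaluate(us):
    """Map noise values (u1..u5) to node values (x1..x5)."""
    x1 = f1(us[0])
    x2 = f2(x1, us[1])
    x3 = f3(x1, us[2])
    x4 = f4(x2, x3, us[3])
    x5 = f5(x4, us[4])
    return x1, x2, x3, x4, x5

rng = random.Random(0)                 # stand-in for the noise stream
us = [rng.random() for _ in range(5)]
assert evaluate(us) == evaluate(us)    # fixed noise values give fixed outputs
```

Once the noise values are held fixed, the whole graph is a deterministic computation; all randomness lives in the stream that supplies `us`.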

Do-Operator

If P is the probability measure on the initial graph (e.g. figure 1.2 above), then what is the probability measure on the modified graph after taking an intervention (e.g. figure 1.4)? Pearl uses “do”-notation, which for the example above looks like this:

$$P(x_1, x_2, x_3, x_4, x_5 \mid \mathrm{do}(X_3 = \text{ON})).$$

This is the probability of the vector of node values $(x_1, x_2, x_3, x_4, x_5)$ given that the intervention setting node $X_3$ to the constant value ON was taken. Note the notational similarity to conditional probability: $P(x_1, x_2, x_3, x_4, x_5 \mid X_3 = \text{ON})$. Conditionalization is a different operation on the measure $P$ than the “do”-operator, but they are mathematically related and their similar notation is justified.

For an arbitrary graph with nodes $X_1, \dots, X_n$, and probability measure $P$ on node values, the conditional probability of value vector $(x_1, \dots, x_n)$ given $X_i = x_i'$ is

$$P(x_1, \dots, x_n \mid x_i') = \begin{cases} \dfrac{P(x_1, \dots, x_n)}{P(x_i')} & x_i = x_i' \\ 0 & x_i \neq x_i' \end{cases}$$

whereas the probability of $(x_1, \dots, x_n)$ given that intervention $\mathrm{do}(X_i = x_i')$ was taken is (Causality, eq 3.11)

$$P(x_1, \dots, x_n \mid \mathrm{do}(x_i')) = \begin{cases} \dfrac{P(x_1, \dots, x_n)}{P(x_i \mid pa_i)} & x_i = x_i' \\ 0 & x_i \neq x_i' \end{cases}$$

Both operations are performing a domain restriction on $P$, in the sense that the resulting measure assigns 0 probability to all vectors $(x_1, \dots, x_n)$ where $x_i \neq x_i'$, for some constant $x_i'$. The difference between them is that conditionalization, $P(x_1, \dots, x_n \mid x_i')$, simply rescales the resulting measure by $1/P(x_i')$ after domain restriction, whereas intervention, $P(x_1, \dots, x_n \mid \mathrm{do}(x_i'))$, re-weights every single probability independently by $1/P(x_i \mid pa_i)$, where $pa_i$ is the set of values in $(x_1, \dots, x_n)$ for the parent nodes $PA_i$ of node $X_i$.

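Rewriting $P(x_1, \dots, x_n)$, we can see why multiplying by $1/P(x_i \mid pa_i)$ corresponds to an intervention:

$$P(x_1, \dots, x_n) = \prod_{j=1}^{n} P(x_j \mid pa_j),$$

by the chain rule of probability, because the graph also encodes which nodes are statistically independent: each node is conditionally independent of its non-descendants given its parents, i.e. if $X_k$ is a non-descendant of $X_j$, then $P(x_j \mid pa_j, x_k) = P(x_j \mid pa_j)$.

The operation of removing the connections going into $X_i$ from the parents $PA_i$ is a matter of removing the term $P(x_i \mid pa_i)$ by dividing it out.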
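The difference between rescaling by $1/P(x_i')$ and re-weighting by $1/P(x_i \mid pa_i)$ can be checked numerically. The sketch below assumes a hypothetical three-node binary graph $Z \to X$, $Z \to Y$, $X \to Y$ with made-up probability tables; none of these numbers come from Pearl.

```python
from itertools import product

# Hypothetical binary model: Z -> X, Z -> Y, X -> Y (Z confounds X and Y).
P_z = {0: 0.5, 1: 0.5}
P_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # indexed [z][x]
P_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(Y=1 | x, z)

def joint(z, x, y):
    """Factorized joint: P(z) P(x | z) P(y | x, z)."""
    py1 = P_y1_given_xz[(x, z)]
    return P_z[z] * P_x_given_z[z][x] * (py1 if y == 1 else 1 - py1)

def conditional(z, x, y, x_fixed):
    """P(z, x, y | X = x_fixed): restrict, then rescale by 1 / P(x_fixed)."""
    if x != x_fixed:
        return 0.0
    p_x = sum(joint(zz, x_fixed, yy) for zz, yy in product((0, 1), repeat=2))
    return joint(z, x, y) / p_x

def interventional(z, x, y, x_fixed):
    """P(z, x, y | do(X = x_fixed)): restrict, then divide by P(x | pa_x)."""
    if x != x_fixed:
        return 0.0
    return joint(z, x, y) / P_x_given_z[z][x]

def prob_y1(dist, x_fixed):
    """Marginal probability of Y = 1 under the given transformed measure."""
    return sum(dist(z, x, 1, x_fixed) for z, x in product((0, 1), repeat=2))

p_cond = prob_y1(conditional, 1)     # 0.38 / 0.45, about 0.844
p_do = prob_y1(interventional, 1)    # 0.5 * 0.4 + 0.5 * 0.9 = 0.65
```

The two numbers differ because conditioning on $X = 1$ also shifts the distribution of the confounder $Z$, whereas the intervention leaves $P(z)$ untouched.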

This formulation of an intervention can be generalized further. Instead of setting $X_i$ to a constant value $x_i'$, in general, we can replace the node distribution $P(X_i \mid PA_i)$ with a new distribution $Q(X_i \mid PA_i')$ where $PA_i'$ is some new set of parents, which may be empty, or may equal or overlap with the old parents $PA_i$. If $PA_i'$ is empty, then $X_i$ becomes statistically independent of its former parents, with $Q(X_i \mid PA_i') = Q(X_i)$. We can get our constant-value intervention by choosing a delta distribution (one-hot for discrete $X_i$, and Dirac delta for continuous $X_i$) $Q(X_i) = \delta_{x_i'}$, which is non-zero only if $X_i = x_i'$. This general-case intervention replaces the term $P(X_i \mid PA_i)$ with $Q(X_i \mid PA_i')$, which looks like this:

$$P\big(x_1, \dots, x_n \,\big|\, \mathrm{do}\{P(x_i \mid pa_i) \to Q(x_i \mid pa_i')\}\big) = P(x_1, \dots, x_n)\, \frac{Q(x_i \mid pa_i')}{P(x_i \mid pa_i)}.$$

When $Q(x_i \mid pa_i') = Q(x_i) = \delta_{x_i'}$ this expression reduces to the constant-value intervention defined above.

In the functional perspective, an intervention replaces $f_i(pa_i, u_i)$ with some other function $f_i'(pa_i', u_i)$.
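A minimal sketch of intervention as function replacement, assuming a hypothetical three-node model $X_1 \to X_2$, $(X_1, X_2) \to X_3$ whose structural functions are invented for illustration:

```python
def run(fs, us):
    """Evaluate the model X1 -> X2, (X1, X2) -> X3 for noise values us."""
    x1 = fs["f1"](us[0])
    x2 = fs["f2"](x1, us[1])
    x3 = fs["f3"](x1, x2, us[2])
    return x1, x2, x3

# Hypothetical structural functions.
model = {
    "f1": lambda u: u > 0.5,
    "f2": lambda x1, u: x1 != (u > 0.9),   # copies x1, flipped when u > 0.9
    "f3": lambda x1, x2, u: x1 and x2,
}

def intervene(fs, name, new_f):
    """Return a copy of the model with one structural function replaced."""
    out = dict(fs)
    out[name] = new_f
    return out

# do(X2 = True): the replacement ignores both the former parent and the noise.
surgered = intervene(model, "f2", lambda x1, u: True)
```

The other functions are untouched, so the same noise values can be replayed through `model` and `surgered` to compare the factual and intervened outcomes.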

Causal Effect

In Causality, definition 3.2.1, Pearl defines causal effect as follows:

Let X and Y be two disjoint sets of graph nodes. The causal effect of X on Y is the function E from the space of node values for X to the space of probability measures on Y,

$$E(x) = P(Y \mid \mathrm{do}(X = x)),$$

where x is some chosen vector of values for the nodes X.

That is to say, the causal effect of nodes $X$ on nodes $Y$ is characterized by the set of all interventions obtained by setting $X$ to every possible value $x$, where each intervention is characterized by a change in probability distribution on $Y$. In other words, the causal effect of $X$ on $Y$ is characterized by how $P(Y \mid \mathrm{do}(X = x))$ varies across different $x$, and how it compares to the no-intervention distribution $P(Y)$.

When Interventions And Conditionalization Are Equivalent

It should be obvious that when node $X_i$ has no parents then $P(x_{1:n} \mid \mathrm{do}(x_i')) = P(x_{1:n} \mid x_i')$ for all node values $x_i'$, because $PA_i = \emptyset$ and so $P(x_i \mid pa_i) = P(x_i)$.

Another case is when we are only considering the marginal distribution on a subset of variables. Then the conditional distribution and intervention distribution on the Markov blanket of that subset are equivalent.

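To see what I mean, let’s consider the Markov chain $X_1, \dots, X_n$ where $P(x_1, \dots, x_n) = P(x_n \mid x_{n-1}) P(x_{n-1} \mid x_{n-2}) \cdots P(x_2 \mid x_1) P(x_1)$ and $PA_i = \{X_{i-1}\}$ for all $i > 1$. Then we have

$$\begin{aligned}
P(x_n, \dots, x_{i+1} \mid \mathrm{do}(x_i')) &= \sum_{x_{i-1}, \dots, x_1} P(x_n, \dots, x_1 \mid \mathrm{do}(x_i')) \\
&= \begin{cases} \displaystyle\sum_{x_{i-1}, \dots, x_1} \frac{P(x_n, \dots, x_{i+1} \mid x_i)\, P(x_i \mid x_{i-1})\, P(x_{i-1}, \dots, x_1)}{P(x_i \mid x_{i-1})} & x_i = x_i' \\ 0 & x_i \neq x_i' \end{cases} \\
&= \sum_{x_{i-1}, \dots, x_1} P(x_n, \dots, x_{i+1} \mid x_i')\, P(x_{i-1}, \dots, x_1) \\
&= P(x_n, \dots, x_{i+1} \mid x_i').
\end{aligned}$$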
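The Markov chain result can be spot-checked on a small case. The sketch below assumes a hypothetical binary chain $X_1 \to X_2 \to X_3$ with made-up transition tables, and compares conditioning on $X_2$ with intervening on $X_2$, both downstream (on $X_3$) and upstream (on $X_1$).

```python
from itertools import product

# Hypothetical binary Markov chain X1 -> X2 -> X3.
P1 = {0: 0.3, 1: 0.7}
P2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}     # indexed [x1][x2]
P3 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}   # indexed [x2][x3]

def joint(x1, x2, x3):
    return P1[x1] * P2[x1][x2] * P3[x2][x3]

def cond(x1, x2, x3, x2_fixed):
    """P(x1, x2, x3 | X2 = x2_fixed)."""
    if x2 != x2_fixed:
        return 0.0
    p = sum(joint(a, x2_fixed, c) for a, c in product((0, 1), repeat=2))
    return joint(x1, x2, x3) / p

def do(x1, x2, x3, x2_fixed):
    """P(x1, x2, x3 | do(X2 = x2_fixed)): divide out P(x2 | x1)."""
    if x2 != x2_fixed:
        return 0.0
    return joint(x1, x2, x3) / P2[x1][x2]

def marginal(dist, index, value, x2_fixed):
    """Probability that node number `index` (0-based) equals `value`."""
    return sum(dist(*xs, x2_fixed)
               for xs in product((0, 1), repeat=3) if xs[index] == value)

down_cond = marginal(cond, 2, 1, 1)   # P(X3 = 1 | X2 = 1)
down_do = marginal(do, 2, 1, 1)       # P(X3 = 1 | do(X2 = 1))
up_cond = marginal(cond, 0, 1, 1)     # P(X1 = 1 | X2 = 1)
up_do = marginal(do, 0, 1, 1)         # P(X1 = 1 | do(X2 = 1))
```

Downstream the two operations agree, as in the derivation above. Upstream they do not: conditioning on $X_2$ is evidence about $X_1$, while intervening on $X_2$ leaves $P(x_1)$ unchanged.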

Causality For Physics

Pearl’s causality is based on the idea of the intervention, which is a kind of graph surgery.

To apply Pearl’s causality to physics, we’d need to define what an intervention does to physical processes. There are two immediate problems:

  1. Pearl defines interventions for causal graphs, where node values are sampled i.i.d., and the nodes represent stateless and otherwise isolated processes (aside from their arrows). Physics, on the other hand, allows for arbitrary interactions between systems, to the point where the boundaries between systems may be blurred or destroyed so that it does not even make sense to think about there being any independent components at all (think about a liquid or gas). Physical processes are not i.i.d. (the future depends on the past), and they have internal state which determines their future time evolution.
  2. Classical physics is non-probabilistic (non-statistical Newtonian mechanics and relativity). If our notion of causality is to be suitable for all of physics, it needs to apply to Newtonian mechanics, which means causality must precede probability. Therefore we need to define interventions on deterministic systems.

Pearl generally considers a graph intervention to represent an intervention that can conceivably be taken, and ideally taken recently so that the causal effect of various interventions can be empirically estimated with histograms (empirically estimate $P(Y \mid \mathrm{do}(X))$ and $P(Y)$).

I don’t think physically plausible interventions can generalize to arbitrary physical systems. I will instead consider what I call a counterfactual intervention, which is merely a modification to a mathematical model (i.e. representation) of physics. A counterfactual intervention is hypothetical, and produces a different time-line than the “factual” time-evolution of the system. A counterfactual intervention is the answer to the question, “what would have happened if the system were in state $x'$ rather than state $x$ at time $t$?”

If intuition serves right and the logical structure of causality lies within all theories of physics, the purpose of the counterfactual intervention is to probe those theories to make their implicit causal structures mathematically explicit.

My objective here is to give an abstract definition for theories of physics in general, define what it means to take a counterfactual intervention on a physical system (probabilistic or non-probabilistic), and then to show the equivalence of this type of intervention to Pearl’s graph intervention above.

Abstract Physics

In any theory of physics there is a state space $\Omega$. In Newtonian mechanics, state is a vector of various components of the system, such as a vector of positions and momenta given by $\omega = (q, p) \in \Omega$. In general state can include other kinds of degrees of freedom such as the orientation of solid bodies in 3D space. In quantum mechanics there is quantum state, and state spaces are Hilbert spaces.

A theory of physics specifies both the state space $\Omega$ and how to solve for the time-evolution of the system given a particular state $\omega_t$ at time $t$. The result is a complete description of a system’s time evolution through state space given as a state-function of time, $\sigma : \mathbb{R} \to \Omega : t \mapsto \sigma(t)$, which I’ll call a trajectory. To be clear, a single trajectory $\sigma$ is a single possible time-evolution, e.g. where $\sigma(t) = \omega_t$.

The mathematical machinery that converts known information, e.g. the state of the system at time $t$, into a trajectory varies between theories of physics and often makes use of a Lagrangian or Hamiltonian. These details can be abstracted away. In principle, for any theory of physics there is a family of time-evolution functions $\tau_{\Delta t} : \Omega \to \Omega$ (also called propagators), one for every time interval $\Delta t \in \mathbb{R}$ (both positive and negative), which maps any state $\omega \in \Omega$ at time $t$ to the state at time $t + \Delta t$. Typically physics is time-symmetric, which means that $\tau_{\Delta t}$ is a bijection and thus invertible. Note also that $\tau_{\Delta t}$ does not depend on the absolute time $t$, and so we are implicitly assuming the given theory of physics is time-translationally invariant.

The set of all trajectories is $\mathbb{R} \to \Omega$, denoting the set of all functions from $\mathbb{R}$ to $\Omega$. For a given time-evolution family $\tau$, there is a subset of trajectories which are valid for $\tau$ (or $\tau$-valid),

$$\Sigma = \{\sigma : \mathbb{R} \to \Omega \mid \forall t, \Delta t \in \mathbb{R} : \sigma(t + \Delta t) = \tau_{\Delta t}(\sigma(t))\}.$$
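As a concrete instance of the abstract setup, the sketch below assumes a unit-mass, unit-frequency harmonic oscillator (my choice of example, not something the argument depends on), whose propagator is a rotation of $(q, p)$. Validity can only be spot-checked on a finite grid of times, so `is_tau_valid` is a numerical sanity check rather than a proof of membership in $\Sigma$.

```python
import math

def tau(dt, state):
    """Propagator of a unit-mass, unit-frequency harmonic oscillator."""
    q, p = state
    c, s = math.cos(dt), math.sin(dt)
    return (q * c + p * s, -q * s + p * c)

def trajectory_through(state, t0=0.0):
    """The trajectory sigma with sigma(t0) = state, built from tau."""
    return lambda t: tau(t - t0, state)

def is_tau_valid(sigma, times, steps, tol=1e-9):
    """Spot-check sigma(t + dt) == tau_dt(sigma(t)) on a finite grid."""
    return all(
        math.dist(sigma(t + dt), tau(dt, sigma(t))) < tol
        for t in times
        for dt in steps
    )

sigma = trajectory_through((1.0, 0.0))
bad = lambda t: (t, 0.0)   # position drifts while momentum stays zero
```

Here `sigma` passes the check for both positive and negative `dt`, while `bad` is a perfectly good function from time to state space that violates the dynamics.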

Incorporating Probability

Suppose we want to work with some kind of statistical physics. Perhaps we are uncertain about which state the system is in, or the state is randomly chosen. We can just as easily put a probability measure on the set of trajectories.

Let $M$ be a probability measure on the set of all trajectories $\mathbb{R} \to \Omega$. Moreover, we want to require $M$ to obey the physics of $\tau$ and assign zero probability to physically impossible trajectories, i.e. $\tau$-invalid trajectories. Specifically, $M$ should assign 0 probability to any set comprised only of $\tau$-invalid trajectories, or equivalently, $M(\Sigma) = 1$ (if $M$ is a normalized measure).

This is not typically how statistical physics is conceived of. Normally, there is a probability measure on states at time t, and time evolution time-evolves that measure. Instead, I’ve put a static global measure M on entire trajectories. However, these two views are equivalent.

Let $\mu_t$ be the marginal probability measure on state space $\Omega$ of the “system” at time $t$. Specifically, $\mu_t$ is the unique marginal distribution of $M$ on time $t$ only, given by

$$\mu_t(O) = M\{\sigma : \mathbb{R} \to \Omega \mid \sigma(t) \in O\},$$

for (measurable) state subsets $O \subseteq \Omega$. Then $\mu_{t + \Delta t}$ is the time-evolution of measure $\mu_t$, given by

$$\mu_{t + \Delta t}(O) = \mu_t(\tau_{\Delta t}^{-1}(O)).$$


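Proof that $\mu_t(\tau_{\Delta t}^{-1}(O)) = M\{\sigma : \mathbb{R} \to \Omega \mid \sigma(t + \Delta t) \in O\}$:

$$\begin{aligned}
M\{\sigma : \mathbb{R} \to \Omega \mid \sigma(t + \Delta t) \in O\}
&= M\{\sigma \in \Sigma \mid \sigma(t + \Delta t) \in O\} + M\{\sigma \notin \Sigma \mid \sigma(t + \Delta t) \in O\} \\
&= M\{\sigma \in \Sigma \mid \sigma(t + \Delta t) \in O\} + 0.
\end{aligned}$$

$\{\sigma \in \Sigma \mid \sigma(t + \Delta t) \in O\} = \{\sigma \in \Sigma \mid \sigma(t) \in \tau_{\Delta t}^{-1}(O)\}$ by the definition of $\Sigma$.

$$\begin{aligned}
M\{\sigma \in \Sigma \mid \sigma(t) \in \tau_{\Delta t}^{-1}(O)\}
&= M\{\sigma : \mathbb{R} \to \Omega \mid \sigma(t) \in \tau_{\Delta t}^{-1}(O)\} \\
&= \mu_t(\tau_{\Delta t}^{-1}(O))
\end{aligned}$$

by the definition of $\mu_t$.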
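The relation $\mu_{t+\Delta t}(O) = \mu_t(\tau_{\Delta t}^{-1}(O))$ can be illustrated on a toy system. The sketch below assumes a hypothetical five-element state space with integer time steps and a rotation as the dynamics; both are simplifications for illustration, since the text works with real-valued time.

```python
from fractions import Fraction

N = 5   # hypothetical finite state space {0, ..., 4}

def tau(dt, w):
    """Toy invertible dynamics: rotate the state by dt steps (integer dt)."""
    return (w + dt) % N

def preimage(dt, O):
    """tau_dt^{-1}(O): the states that land in O after dt steps."""
    return {w for w in range(N) if tau(dt, w) in O}

def evolve(mu, dt):
    """Return mu_{t+dt}, using mu_{t+dt}(O) = mu_t(tau_dt^{-1}(O)) on singletons."""
    return {w: sum(mu[v] for v in preimage(dt, {w})) for w in range(N)}

mu0 = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4),
       3: Fraction(0), 4: Fraction(0)}
mu2 = evolve(mu0, 2)   # the probability mass has moved two states along
```

Because `tau` is a bijection, evolving forward by `dt` and then by `-dt` recovers the original measure, and total probability is conserved at every step.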


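Proof that $M$ is uniquely determined by $\mu_t$, so long as $\tau_{\Delta t}$ is a bijection and $M(\Sigma) = 1$.

At time $t$, for each $\omega \in \Omega$, there is a unique $\tau$-valid trajectory $\sigma$ that passes through $\omega$, given by the mapping $t' \mapsto \tau_{t' - t}(\omega)$. Therefore, there is a family of bijections between the $\tau$-valid trajectories $\Sigma$ and state space $\Omega$:

$$\Gamma_t : \Omega \to \Sigma : \omega \mapsto (t' \mapsto \tau_{t' - t}(\omega)),$$

for all $t \in \mathbb{R}$. So $\Gamma_t(\omega) = \sigma$ where $\sigma(t) = \omega$ and $\sigma$ is $\tau$-valid. The inverse is then $\Gamma_t^{-1}(\sigma) = \sigma(t)$. I haven’t defined proper measure spaces on trajectories and states, so I will just assume $\Gamma_t$ is a measurable function.

We can derive the following relation:

$$\begin{aligned}
\mu_t(O) &= M\{\sigma : \mathbb{R} \to \Omega \mid \sigma(t) \in O\} \\
&= M\{\sigma \in \Sigma \mid \sigma(t) \in O\} \\
&= M(\Gamma_t O).
\end{aligned}$$

Thus for any (measurable) subset of trajectories $S \subseteq (\mathbb{R} \to \Omega)$, there is a corresponding (measurable) subset of states $O = \Gamma_t^{-1}(S \cap \Sigma)$ so that $M(S) = M(S \cap \Sigma) = M(\Gamma_t \Gamma_t^{-1}(S \cap \Sigma)) = M(\Gamma_t O) = \mu_t(O)$.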

Interventions For Physics

Let $I : \Omega \to \Omega$ be a state-replacement function. Usually we want $I$ to be some kind of state projection function where $I(\Omega) \subset \Omega$ is a strict subset.

I will define a “do”-operator on individual trajectories which performs a surgery and outputs a modified trajectory. Specifically, given state-replacement function $I$ and time $T$,

$$\mathrm{do}_I^T[\sigma](t) \stackrel{\text{def}}{=} \begin{cases} \sigma(t) & t < T \\ \tau_{t - T}(I(\sigma(T))) & t \geq T. \end{cases}$$

The resulting trajectory is identical to σ prior to time T, and discontinuously jumps at time T to the alternative τ-valid trajectory starting from state I(σ(T)). In this way, I determines which trajectory “tail” to attach to the given trajectory “head” σ. The resulting “Frankenstein”-trajectory is usually not globally τ-valid, but its head and tail are guaranteed to be τ-valid, and thus is locally τ-valid everywhere except across time T.
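A minimal sketch of this trajectory surgery, again assuming a unit harmonic oscillator as the dynamics and a hypothetical state-replacement function `stop` that zeroes the momentum (both are illustrative choices, not part of the definition):

```python
import math

def tau(dt, state):
    """Propagator of a unit-mass, unit-frequency harmonic oscillator."""
    q, p = state
    c, s = math.cos(dt), math.sin(dt)
    return (q * c + p * s, -q * s + p * c)

def do(I, T, sigma):
    """do_I^T[sigma]: keep sigma before T, restart from I(sigma(T)) at T."""
    restart = I(sigma(T))
    return lambda t: sigma(t) if t < T else tau(t - T, restart)

stop = lambda state: (state[0], 0.0)    # hypothetical I: zero the momentum
sigma = lambda t: tau(t, (0.0, 1.0))    # tau-valid trajectory through (0, 1) at t = 0
spliced = do(stop, 1.0, sigma)
```

Evaluating `spliced` before `T = 1.0` reproduces `sigma`; from `T` onward it follows the oscillator dynamics started from the replaced state, so propagating a point on the head across `T` no longer lands on the tail.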

Let’s overload this “do”-operator to apply element-wise to sets, so

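$$\Sigma' = \mathrm{do}_I^T \Sigma = \{\mathrm{do}_I^T[\sigma] \mid \sigma \in \Sigma\}.$$

Starting with the measure $M$ from above, applying $\mathrm{do}_I^T$ to the set of all trajectories $\mathbb{R} \to \Omega$ induces a transformed measure $M'$, where for subsets $S \subseteq (\mathbb{R} \to \Omega)$,

$$M'(S) = M\big((\mathrm{do}_I^T)^{-1} S\big) = M\{\sigma \in \Sigma \mid \mathrm{do}_I^T[\sigma] \in S\}.$$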
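The induced measure can be computed exactly on a toy system. The sketch below assumes a hypothetical five-state rotation with integer time, specifies $M$ through made-up weights on the state at time 0, and represents a set of trajectories $S$ as a predicate on trajectories.

```python
from fractions import Fraction

N = 5   # hypothetical finite state space {0, ..., 4}

def tau(dt, w):
    """Toy invertible dynamics: rotate the state by dt steps."""
    return (w + dt) % N

def do(I, T, sigma):
    """do_I^T[sigma] for the toy dynamics."""
    return lambda t: sigma(t) if t < T else tau(t - T, I(sigma(T)))

def valid_trajectory(t0, w):
    """The tau-valid trajectory passing through state w at time t0."""
    return lambda t: tau(t - t0, w)

# M is specified through its time-0 marginal (made-up weights).
mu0 = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8),
       3: Fraction(1, 8), 4: Fraction(0)}

def M_prime(S, I, T):
    """M'(S) = M{sigma in Sigma | do_I^T[sigma] in S}, with S a predicate."""
    return sum(p for w, p in mu0.items() if S(do(I, T, valid_trajectory(0, w))))

project_even = lambda w: w - (w % 2)   # hypothetical I: round down to an even state
in_state_1_at_2 = lambda sigma: sigma(2) == 1

p_intervened = M_prime(in_state_1_at_2, project_even, 1)
p_factual = M_prime(in_state_1_at_2, lambda w: w, 1)   # identity I leaves M unchanged
```

With the identity as $I$ the operator is a no-op and `M_prime` reduces to $M$, so the same function gives both the factual and the intervened probabilities of a set of trajectories.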

This conception of intervention is different from Pearl’s, which is a modification of the functions generating the behavior of the process in question. In my formulation, I am modifying the state of the process at some point in time, but keeping the behavior-generating-functions, i.e. τ, unchanged. That is to say, I am modifying systems, but not the physics, whereas Pearl is modifying the physics, so to speak.

Proof of Pearl-Equivalence

The question is whether my definition of an intervention is equivalent to Pearl’s. To prove this, I need to put my intervention in terms of Pearl’s setup.

The measure $M$ on trajectories corresponds to the measure $P$ on graph states, and the transformed measure $M'$ corresponds to the transformed measure $P(\cdot \mid \mathrm{do}(\cdot))$ on the modified graph.

Recall that $M'$ is defined by

$$M'(S) = M\{\sigma \in \Sigma \mid \mathrm{do}_I^T[\sigma] \in S\},$$

for subsets $S \subseteq (\mathbb{R} \to \Omega)$ of trajectories. I need to show that $M'$ has the same form as Pearl’s general intervention.

It will help to define the following notation on trajectories:
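$\sigma_{(a,b)}$ is the domain restriction of trajectory $\sigma$ to time interval $(a, b)$.

I will use the following short-hands:

  • $\sigma_{>T} = \sigma_{(T, \infty)}$
  • $\sigma_{<T} = \sigma_{(-\infty, T)}$
  • $\sigma_{\delta T} = \sigma_{(T - \Delta t, T)}$

where $\Delta t > 0$ is an arbitrarily small positive real number.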

The goal is to prove that

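$$M'(S) = \int_S \frac{Q(\sigma(T) \mid \sigma_{\delta T})}{M(\sigma(T) \mid \sigma_{\delta T})}\, dM(\sigma),$$

for some probability measure $Q$. This is a Lebesgue integral w.r.t. $\sigma$ using measure $M$, which for our purposes is just the expectation of the integrand w.r.t. measure $M$. As a Riemann integral: $\int_S M(\sigma)\, \frac{Q(\sigma(T) \mid \sigma_{\delta T})}{M(\sigma(T) \mid \sigma_{\delta T})}\, d\sigma$.

It turns out that

$$Q(\sigma(T) \mid \sigma_{\delta T}) = M(I^{-1}(\sigma(T)) \mid \sigma_{\delta T}).$$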


Proof

We have,
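$$M(\sigma) = M(\sigma_{>T} \mid \sigma(T))\, M(\sigma(T) \mid \sigma_{\delta T})\, M(\sigma_{<T}),$$
where $M(\sigma(T) \mid \sigma_{\delta T})\, M(\sigma_{<T}) = M(\sigma(T), \sigma_{<T})$ because $M$ is Markov w.r.t. time, which is guaranteed by the construction of $\Sigma$ from $\tau_{\Delta t}$.

Expanding out the integrand, we have

$$\begin{aligned}
M(\sigma)\, \frac{Q(\sigma(T) \mid \sigma_{\delta T})}{M(\sigma(T) \mid \sigma_{\delta T})}
&= M(\sigma_{>T} \mid \sigma(T))\, M(\sigma(T) \mid \sigma_{\delta T})\, M(\sigma_{<T})\, \frac{Q(\sigma(T) \mid \sigma_{\delta T})}{M(\sigma(T) \mid \sigma_{\delta T})} \\
&= M(\sigma_{>T} \mid \sigma(T))\, Q(\sigma(T) \mid \sigma_{\delta T})\, M(\sigma_{<T}) \\
&= M(\sigma_{>T} \mid \sigma(T))\, M(I^{-1}(\sigma(T)) \mid \sigma_{\delta T})\, M(\sigma_{<T}).
\end{aligned}$$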

The remainder of this proof consists of showing the following equivalences, which I’ll prove below as lemmas:

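  • $M(\sigma_{>T} \mid \sigma(T)) = M'(\sigma_{>T} \mid \sigma(T))$
  • $M(I^{-1}(\sigma(T)) \mid \sigma_{\delta T}) = M'(\sigma(T) \mid \sigma_{\delta T})$
  • $M(\sigma_{<T}) = M'(\sigma_{<T})$

That allows us to rewrite:

$$\begin{aligned}
M(\sigma_{>T} \mid \sigma(T))\, M(I^{-1}(\sigma(T)) \mid \sigma_{\delta T})\, M(\sigma_{<T})
&= M'(\sigma_{>T} \mid \sigma(T))\, M'(\sigma(T) \mid \sigma_{\delta T})\, M'(\sigma_{<T}) \\
&= M'(\sigma).
\end{aligned}$$

Therefore

$$\int_S \frac{Q(\sigma(T) \mid \sigma_{\delta T})}{M(\sigma(T) \mid \sigma_{\delta T})}\, dM(\sigma) = \int_S dM'(\sigma) = M'(S).$$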

  


Proof of remaining lemmas:

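The cases $M(\sigma_{>T} \mid \sigma(T)) = M'(\sigma_{>T} \mid \sigma(T))$ and $M(\sigma_{<T}) = M'(\sigma_{<T})$ are easy:

  • Because the trajectory before $T$ is unchanged,
    $M(\sigma_{<T}) = M'(\sigma_{<T})$,
  • Because time evolution from $T$ onward is deterministic and obeys $\tau_{\Delta t}$, there is exactly one trajectory $\sigma'$ that is valid under $\tau_{\Delta t}$ s.t. $\sigma'(T) = \sigma(T)$. Thus
    $M(\sigma_{>T} \mid \sigma(T)) = M'(\sigma_{>T} \mid \sigma(T)) = \begin{cases} 1 & \sigma_{>T} = \sigma'_{>T} \\ 0 & \text{otherwise.} \end{cases}$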

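Now to prove $M(I^{-1}(\sigma(T)) \mid \sigma_{\delta T}) = M'(\sigma(T) \mid \sigma_{\delta T})$. Expanding out $M(I^{-1}(\sigma(T)) \mid \sigma_{\delta T})$, we get

$$M(I^{-1}(\sigma(T)) \mid \sigma_{\delta T}) = M\{\zeta \in \Sigma : \zeta(T) \in I^{-1}(\sigma(T)) \mid \zeta_{(T - \Delta t, T)} = \sigma_{\delta T}\},$$

read as the probability under $M$ that $\zeta(T) \in I^{-1}(\sigma(T))$ given $\zeta_{(T - \Delta t, T)} = \sigma_{\delta T}$.

Since

$$\mathrm{do}_I^T[\zeta](t) = \begin{cases} \zeta(t) & t < T \\ \tau_{t - T}(I(\zeta(T))) & t \geq T \end{cases}$$

then $\zeta(T) \in I^{-1}(\sigma(T)) \iff I(\zeta(T)) = \sigma(T) \iff \mathrm{do}_I^T[\zeta](T) = \sigma(T)$,
because $\tau_0$ is the identity function.

Furthermore, $\mathrm{do}_I^T[\zeta]_{(-\infty, T)} = \zeta_{(-\infty, T)}$,
so $\zeta_{(T - \Delta t, T)} = \sigma_{\delta T} \iff \mathrm{do}_I^T[\zeta]_{(T - \Delta t, T)} = \sigma_{\delta T}$.

Thus we can further expand out $M(I^{-1}(\sigma(T)) \mid \sigma_{\delta T})$:

$$\begin{aligned}
M(I^{-1}(\sigma(T)) \mid \sigma_{\delta T})
&= M\{\zeta \in \Sigma : \zeta(T) \in I^{-1}(\sigma(T)) \mid \zeta_{(T - \Delta t, T)} = \sigma_{\delta T}\} \\
&= M\{\zeta \in \Sigma : \mathrm{do}_I^T[\zeta](T) = \sigma(T) \mid \mathrm{do}_I^T[\zeta]_{(T - \Delta t, T)} = \sigma_{\delta T}\} \\
&= M'(\sigma(T) \mid \sigma_{\delta T}).
\end{aligned}$$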

  


There is one problematic aspect to this equivalence. Taking another look at the first step in the proof,

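$$\begin{aligned}
M(\sigma)\, \frac{Q(\sigma(T) \mid \sigma_{\delta T})}{M(\sigma(T) \mid \sigma_{\delta T})}
&= M(\sigma_{>T} \mid \sigma(T))\, M(\sigma(T) \mid \sigma_{\delta T})\, M(\sigma_{<T})\, \frac{Q(\sigma(T) \mid \sigma_{\delta T})}{M(\sigma(T) \mid \sigma_{\delta T})} \\
&= M(\sigma_{>T} \mid \sigma(T))\, Q(\sigma(T) \mid \sigma_{\delta T})\, M(\sigma_{<T}),
\end{aligned}$$

we make the assumption that $M(\sigma(T) \mid \sigma_{\delta T}) \neq 0$. Given the trajectory slice $\sigma_{\delta T} = \sigma_{(T - \Delta t, T)}$, there is only one $\tau$-valid trajectory which shares the same slice, and so there is only one valid state $\omega_T$ at time $T$ to follow from $\sigma_{(T - \Delta t, T)}$. Since $M$ obeys $\tau$,

$$M(\sigma(T) \mid \sigma_{\delta T}) = \delta_{\omega_T},$$

which is non-zero only if $\sigma(T) = \omega_T$. If $\sigma_{(T - \Delta t, T)}$ is itself not $\tau$-valid, we can define $M(\sigma(T) \mid \sigma_{\delta T})$ to be an improper probability measure that is always 0.

I’d argue that interventions on deterministic trajectories are a limiting case of interventions on probabilistic trajectories where the transition probabilities converge to delta distributions. Then $M(\sigma(T) \mid \sigma_{\delta T}) / M(\sigma(T) \mid \sigma_{\delta T}) \to 1$ no matter what and the cancellation works.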

Compatibility with modern physics

The generalized formulation of physics, using state space $\Omega$ and time-evolution function $\tau_{\Delta t}$, is compatible with classical physics and special relativity (for arbitrary choice of Lorentz frame). Is it compatible with quantum mechanics, general relativity, and beyond?

It is compatible with QM if we are time-evolving quantum state and disregarding measurement. If we wanted to model stochastic measurement outcomes, or stochastic interactions in general, then we could do that using a non-deterministic time-evolution function, i.e. $\tau_{\Delta t}$ is not a proper function and assigns more than one output to a given input. Alternatively, each state in $\Omega$ could contain algorithmically random data which serves as a source of random inputs for $\tau_{\Delta t}$.

For special relativity, simultaneity is relative, but consistently holding to an arbitrary choice of Lorentz frame will work. Then, there is a $\tau_{\Delta t}$ for every Lorentz frame, and one can transform between these time-evolution functions via Lorentz boosts.

For general relativity, I am not personally clear on whether there exist global reference frames where there is a single simultaneous state of the universe, even if what is regarded as simultaneous is arbitrarily chosen. In that case, my formulation may break down. However, there should still be a causal DAG. Is it possible to topologically sort that DAG and then organize it into something like time slices? Each such slice would then correspond to a state in Ω.
