When you fit a curve to data, you would usually like to be able to use the result to make statements about the world, perhaps something like "there's a fifty percent chance the slope is between 1 and 2". But this is a bit peculiar from a philosophical point of view: if your data is a set of measurements of some real-world phenomenon, then it's a bit funny to talk about probabilities that the slope has certain values. The phenomenon has some fixed slope, so we can't somehow repeat the experiment many times and see how often the slope is between 1 and 2. But there is a way to make such a statement meaningful: Bayesian inference.
The basic idea is that you use probabilities to quantify the degree of uncertainty you have about the world; you are using a system of logic that uses probability and probability distributions rather than binary logic. This may sound fuzzy and esoteric, but Bayesian logic is used very successfully in, for example, automatic systems to evaluate whether email is spam.
When applied to data, Bayesian reasoning lets you make meaningful statements about probabilities of parameter values, at the cost of making some explicit assumptions going in, and also at the cost of some potentially substantial computation. I'll work through an example of fitting a light curve to a set of photon arrival times, using a Bayesian procedure.
First the problem setting: suppose we observe a pulsar, whose period we know exactly (perhaps from radio observations), with an X-ray or gamma-ray telescope. We see some (fairly small) number of photons, and we want to know whether the flux we see is modulated at the pulsar period. We "fold" the photon arrival times, recording the pulsar phase when each one arrives. So our data is a collection of some hundreds or thousands of numbers between zero and one.
The model we'll fit includes some fraction f of photons whose arrival times are distributed as a sinusoid with a peak at phase p; the remaining fraction (1-f) are independent of phase.
The key idea of Bayesian curve fitting is that if you have some collection of hypotheses Hi about the world, each having some probability P(Hi), and you make some observation, there's a simple procedure to update these probabilities to reflect your new knowledge.
From a philosophical point of view, it's a bit worrisome to have to supply probabilities for our hypotheses (called "priors") before we ever make any observations. It amounts to making assumptions in the absence of data. But in fact there are assumptions built into the usual "frequentist" methods of inference as well, and in Bayesian inference the ability to be explicit about the assumptions at least makes it clear what's going on.
What assumptions should we make for our pulsar? Well, there might or might not be pulsations, so we'll have two basic hypotheses: no pulsations and pulsations. Absent any information, we'll assume these are equally likely. Then, if there are pulsations, we need to specify prior distributions of phase and pulsed fraction. Since both these parameters are between zero and one, we'll just take a so-called "flat prior" that makes all values between zero and one equally likely.
Given these priors, we need to figure out how to use our observed photon arrival times to update the priors to give us "posteriors". The general formula is:
P(Hi|D) = P(D|Hi) P(Hi) / P(D)
That is, the probability of hypothesis i given the data, P(Hi|D), equals the probability of the data given Hi, P(D|Hi), times the prior probability of Hi, P(Hi), divided by P(D), the total probability of the data under all the hypotheses, each weighted by its prior.
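To make the update rule concrete, here is a minimal sketch for a discrete set of hypotheses. The coin example, its numbers, and the function name `bayes_update` are invented for illustration and have nothing to do with the pulsar problem:

```python
import numpy as np

def bayes_update(priors, likelihoods):
    """Return the posteriors P(Hi|D) for a discrete set of hypotheses.

    priors[i] is P(Hi); likelihoods[i] is P(D|Hi).
    """
    priors = np.asarray(priors, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    joint = likelihoods * priors      # P(D|Hi) P(Hi)
    evidence = joint.sum()            # P(D), summed over all hypotheses
    return joint / evidence

# Hypothetical example: H0 is a fair coin, H1 a coin that lands heads
# 80% of the time. The data D are three heads in three tosses.
priors = [0.5, 0.5]
likelihoods = [0.5**3, 0.8**3]
posterior = bayes_update(priors, likelihoods)
print(posterior)
```

Because the result is divided by the evidence, the posteriors always sum to one, and any constant factor common to all the likelihoods drops out.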
The first thing to note is that P(D|Hi) is just what we need to evaluate for a maximum-likelihood estimate: how likely we would be to see data like ours if the hypothesis were true. We only need to define it up to a constant (the same constant for every hypothesis), since it appears in both numerator and denominator. For our problem, the probability density for a photon arriving at phase t is p(f,p,t) = f(1+cos(2 pi (t-p)))+(1-f), which integrates to one over a full cycle. Since the photons are independent, P(D|Hi) is the product of p(f,p,ti) over all events ti.
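How do we form P(D)? Well, since we have two hypotheses, H0 (no pulsations) and H1 (pulsations), we can write P(D) = P(D|H0) P(H0) + P(D|H1) P(H1). With our equal priors of one half each, the factor of one half cancels between numerator and denominator, leaving P(D|H0)+P(D|H1) downstairs. Further, H1 is actually a family of hypotheses depending on two parameters, so to get P(D|H1) we need to integrate the likelihood over all possible values of the parameters, weighted by their (flat) priors.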
If we apply the above formula, then, we should get two things: a posterior probability that there are any pulsations at all, and a posterior probability distribution for the two parameters.
Let's look at Python code to implement this. First of all, we're going to need to be able to generate fake data sets:
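import numpy as np

def generate(fraction, phase, n):
    # Decide how many of the n photons come from the pulsed component
    m = np.random.binomial(n, fraction)
    pulsed = np.random.rand(m)
    # Reflect some phases (x -> -x) with probability sin(2 pi x)
    c = np.sin(2*np.pi*pulsed)>np.random.rand(m)
    pulsed[c] *= -1
    pulsed += 0.25+phase
    pulsed %= 1
    # Add the unpulsed photons and shuffle the two samples together
    r = np.concatenate((pulsed, np.random.rand(n-m)))
    np.random.shuffle(r)
    return r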
This routine generates the photons in two parts. First it decides randomly how many come from the pulsed component. Then the photons from the pulsed component are generated uniformly. To convert this to a sinusoidal distribution we select some of the photons in the lower half of the cycle, each with probability equal to the sine of its phase, and reflect them into the upper half; a shift then puts the peak at the requested phase. We then add in some uniformly-distributed photons, and shuffle the two samples together.
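It's easy to get a trick like this subtly wrong, so here is a rough sanity check of the reflection step on its own. This is only a sketch: the helper name `sample_pulsed`, the seeded `default_rng` generator, the sample size and the number of bins are all my choices for the illustration, not part of the routine above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the check is repeatable

def sample_pulsed(phase, m):
    """Draw m phases intended to follow 1 + cos(2 pi (x - phase)) on [0, 1)."""
    x = rng.random(m)
    # Reflect x -> -x with probability max(sin(2 pi x), 0)
    flip = np.sin(2*np.pi*x) > rng.random(m)
    x[flip] *= -1
    # After reflection the density is 1 - sin(2 pi x); shift the peak to `phase`
    return (x + 0.25 + phase) % 1

phase = 0.3
samples = sample_pulsed(phase, 200000)
counts, edges = np.histogram(samples, bins=20, range=(0, 1))
centers = 0.5*(edges[:-1] + edges[1:])
observed = counts / counts.mean()                 # empirical density, averages to 1
expected = 1 + np.cos(2*np.pi*(centers - phase))  # target density at bin centres
max_dev = np.max(np.abs(observed - expected))
print(max_dev)
```

With this many samples the largest deviation should be a few percent at most, coming from counting noise plus a small bias from comparing bin averages with values at bin centres.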
Now we write a routine to evaluate the probability density function:
def pdf_data_given_model(fraction, phase, x):
    return fraction*(1+np.cos(2*np.pi*(x-phase)))+(1-fraction)
Note that in spite of appearances, this routine can act on an array of values at once; this is important because Python is interpreted, so every line it executes is slow and an explicit loop over individual values would take quite a long time.
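To show what acting on an array at once looks like in practice, here is a small sketch using the density formula given earlier. The short name `pdf` and the tiny grid sizes are just for the illustration:

```python
import numpy as np

def pdf(fraction, phase, x):
    # f*(1 + cos(2 pi (x - p))) + (1 - f), the density given earlier
    return fraction*(1 + np.cos(2*np.pi*(x - phase))) + (1 - fraction)

fractions = np.linspace(0, 1, 5)                 # shape (5,)
phases = np.linspace(0, 1, 4, endpoint=False)    # shape (4,)

# A column of phases against a row of fractions broadcasts to a (4, 5) grid,
# so one call evaluates the density of a single event for every parameter pair
grid = pdf(fractions, phases[:, np.newaxis], 0.1)
print(grid.shape)

# The same function also accepts an array of event phases;
# averaged over a full cycle the density should come out close to 1
x = np.linspace(0, 1, 1000, endpoint=False)
mean_density = np.mean(pdf(0.3, 0.7, x))
print(mean_density)
```

The only thing that makes this work is that the function is written entirely in terms of NumPy operations, so NumPy's broadcasting rules decide the shape of the result.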
And now the fitting routine:
def infer(events, n_phase=200, n_frac=201):
    events = np.asarray(events)
    phases = np.linspace(0,1,n_phase,endpoint=False)
    fractions = np.linspace(0,1,n_frac)
    # Log-likelihood on a (phase, fraction) grid, summed over events
    lpdf = np.zeros((n_phase, n_frac))
    for e in events:
        lpdf += np.log(pdf_data_given_model(fractions, phases[:,np.newaxis], e))
    # This weird-looking hack avoids exponentiating very large numbers
    mx = np.amax(lpdf)
    w = np.exp(lpdf - mx)
    p = w/np.average(w)
    # log of S = P(D|H1)/P(D|H0), computed with the same rescaling
    log_S = mx + np.log(np.average(w))
    # S/(S+1), rewritten so that S itself is never formed
    return phases, fractions, p, 1/(1+np.exp(-log_S))
This uses one of the simplest approaches to calculating the distribution and its integral: just evaluate on a grid. Integration then becomes averaging. More sophisticated Bayesian problems usually involve high-dimensional integrals, and so a whole elaborate machinery has evolved for efficiently evaluating these (for example the Python package PyMC).
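The rescaling by the maximum in the fitting routine deserves a closer look, since averaging likelihoods on a grid only works if the exponentials don't overflow. Here is a sketch of the same idea packaged as a helper; the name `log_mean_exp` and the made-up log-likelihood values are mine:

```python
import numpy as np

def log_mean_exp(logw):
    """Return log(mean(exp(logw))) without overflowing for large logw."""
    logw = np.asarray(logw, dtype=float)
    mx = np.amax(logw)
    # exp(logw - mx) is at most 1, so it cannot overflow
    return mx + np.log(np.mean(np.exp(logw - mx)))

# Hypothetical log-likelihood grid, far too large to exponentiate directly
# (np.exp(1000.0) overflows to infinity in double precision)
lpdf = np.array([[1000.0, 1001.0],
                 [ 999.0, 1002.0]])
log_S = log_mean_exp(lpdf)

# With equal priors and log P(D|H0) = 0, the posterior probability of H1
# is S/(S+1), which can be written without ever forming S
P = 1.0/(1.0 + np.exp(-log_S))
print(log_S, P)
```

Working with the logarithm of the grid average all the way through means the answer stays finite however many photons go into the sum.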
Finally, some wrappers to generate a fake data set, call the fitting routine, and plot and print the results:
import scipy.stats
import pylab as pl

events = generate(0.2,0.5,200)
phases, fractions, r, P = infer(events)
print("Probability the signal is pulsed: %f" % P)
pl.contourf(fractions, phases, r)
pl.xlabel("Pulsed fraction")
pl.ylabel("Phase")
# Marginalize over phase to get the distribution of pulsed fraction
p = np.average(r,axis=0)
# Median and the central 68% ("one-sigma") interval
li, mi, ui = np.searchsorted(np.cumsum(p)/np.sum(p),
    [scipy.stats.norm.cdf(-1), 0.5, scipy.stats.norm.cdf(1)])
print("Pulsed fraction: %f [%f, %f]" %
    (fractions[mi], fractions[li], fractions[ui]))
pl.show()
One key point here is that when I want to know the distribution of pulsed fraction but don't care about the phase, I integrate (i.e. average) the joint distribution along the phase direction.
This gives us the following plot:
And the following output:
Probability the signal is pulsed: 0.450240
Pulsed fraction: 0.210000 [0.100000, 0.315000]
So it looks like the fitting routine is working: even with relatively few photons and a small pulsed fraction, it has recovered a pulsed fraction close to the true value of 0.2. The probability that the signal is actually pulsed seems a little low, but keep in mind that we have only two hundred photons, of which only about forty are actually pulsed (while the Poisson uncertainty on the number of photons is about 14, the square root of 200). Still, giving plausible results is not really enough: I want to systematically test this routine for correctness. That will be another post.