Lighthouse in the Sky: November 2009

Portraits

Here's a link to some charming portraits of astronomers (via Dynamics of Cats).

I've done a little photography myself - in this age of digital cameras, who hasn't? - and I've really enjoyed things like macro shots, landscapes, and time-lapse video. But I've never really been good at photographing people. So when I see good portrait photography, like these, or like a wonderful book of portraits by Yousuf Karsh I saw the other day, I'm always kind of amazed at how much character it's possible to capture in a single picture.

Full post

Kernel Density Estimators

Since high school science class, I've been making graphs that show one variable as a function of another - force as a function of extension, redshift as a function of distance, intensity as a function of wavelength, et cetera. But around that time I was also writing code - probably generating some kind of chaos or fractal - that needed to plot the distribution of data points in one dimension. Seems simple, right? Well, the standard solution is to produce a histogram - divide up the range into "bins", count how many fall into each bin, and plot that. But choosing the right number of bins is something of an art: too many and your plot is hopelessly spiky, too few and you can't see any features your data might have. So I wondered if there was something better. It turns out that there is (though it's not always clear when it actually is better); a tool called a kernel density estimator.

First, the problems with histograms. I generated a collection of 400(*) photon arrival phases corresponding to two equal peaks. Above is plotted a histogram of their arrival times. Below is a histogram of their arrival times shifted by half a bin. To me, at least, it's not at all obvious that the two histograms are of the same distribution, let alone the same sample.

Since the signal is periodic, one natural thing to try is to work with the Fourier series. It's not too hard to construct some coefficients of the Fourier series of the photon arrival times. If I used all (infinitely many) coefficients I'd get a collection of spikes, one at each photon arrival time, which isn't too useful. But if I discard all Fourier coefficients after a certain point, I smooth out those spikes into a more reasonable shape. Here's what I get using ten harmonics:

The two peaks look more similar now, which makes sense since the Fourier coefficients don't really care about the starting phase. There are those nasty ripples, though - in fact, they go negative, which is rather distressing for something that's supposed to be a probability distribution. (Incidentally, there's some Fourier magic that goes into that error band on the sinusoid but I don't want to get into it right now.)

One way to get rid of those ripples is to think of this as a problem in digital filtering, where they are an example of the "ringing" that can occur in digital filters with too-sharp cutoffs. In that context, the usual solution is to taper the coefficients to zero smoothly, and that's what I will do. But there is a statistical way to think about what's going on.

First of all, tapering or cutting off the Fourier coefficients can be thought of as multiplying the Fourier coefficients by another set of Fourier coefficients. By a wonderful theorem of Fourier analysis, this amounts to convolving the set of photon arrival times by a kernel. That is, we take the forest of delta-functions representing the photon arrival times, and replace each delta function with a copy of some smoother function, and add them all together. This process, when carried out on the real line, is known to statisticians as constructing a kernel density estimator (the "kernel" is the smoother function, and it is used to estimate a probability density).

Simply chopping off the Fourier coefficients corresponds to the kernel sin((n+1/2)x)/sin(x/2), the periodic "sinc" function. This has a peak at zero but oscillates from positive to negative, so it's not too surprising that we got ripples. So to get rid of the ripples, we just choose a kernel that is positive and whose Fourier coefficients we know. There is a natural (if somewhat obscure) candidate: the von Mises probability distribution. This is a circular analog of a Gaussian, both in appearance and in a technical sense: the Gaussian distribution is the distribution with maximum entropy for specified mean and standard deviation. The von Mises distribution has maximum entropy for specified "circular mean" (which includes information about spread as well as location). In any case, it's a positive smooth periodic distribution whose Fourier coefficients can be computed in terms of Bessel functions using your favourite scientific computing tool. So using it instead gives this:

This looks pretty good; it shows the distribution pretty cleanly, with no ringing. There's only one fly in the ointment: while I don't need to specify a starting phase, I do need to come up with a parameter - analogous to the number of harmonics - that determines the width of the kernel. If I choose too wide a kernel, it flattens out all the features in my data; if I choose too narrow a kernel I get something horrible like this:

So in a way I haven't solved the original problem with histograms that motivated me. Fortunately, one of the selling points of kernel density estimators is that the statistical literature is full of papers on how programs can automatically choose the degree of smoothing. None of their techniques (that I have found) are appropriate for periodic data, but I have my own ideas about that (to be written up in future).

(*) 400 is not actually a very small number of photons; Fermi often gets something like one photon a week from a source.

Full post

Ghetto apple crumble

I made a surprisingly successful apple crumble today. Apple crumble, as you may know, is basically apples with sugar and cinnamon, with a topping of butter, flour, oatmeal, and sugar, all in about equal quantities. Almost idiot-proof. So why "surprisingly successful"? Well, my oven died halfway through (in fact after preheating but before the crumble went in). It turns out you can microwave an apple crumble and it's fine. Who knew?

Recipe below.

Filling:

Five apples, peeled and diced. Granny Smith are good for a bit of tartness to balance the sweet crumble. (Peeling is technically optional but the peel makes for an awkward texture.)
5 mL cinnamon.
100 mL sugar.
a little butter.

Crumble:

125 mL brown sugar (white will do fine).
125 mL rolled oats.
125 mL flour.
100 mL butter.
a mL or two of vanilla.

Put the apples in a deep baking dish, sprinkle the sugar and cinnamon on top and shake a bit; scrape a little butter on top.

In a bowl, add the crumble ingredients. In the likely case that the butter is hard, microwave it for a few seconds to soften it up. Mix the ingredients together; you should get a crumbly mass. Pack it in on top of the filling.

Microwave on 50% for fifteen to twenty minutes; when it's ready the top layer will have sunk somewhat because the apples will have softened. As it cools the crumble on top will harden somewhat.

Serve hot or cold. Particularly good with a scoop of vanilla ice cream.

Full post

Alchemy

The alchemists' dream (one of them, anyway) was always to make gold. We now know there are very good reasons they couldn't: since gold is an element, making it from anything that doesn't contain gold(*) requires you to change the nuclei of the atoms involved, while all the alchemists had access to was the electron shells around the atom(**). So their efforts were basically hopeless. Now, though, we do have the ability to manipulate nuclei, and in fact we do so on industrial scales. So could we make gold? In fact, let's be more ambitious: could I make gold in my basement? The answer is, surprisingly, yes.

First of all, the easiest nuclear reaction to go for is to transmute mercury into gold. Mercury is commercially available (though somewhat encumbered by very sensible environmental concerns) and not actually very expensive - on the order of $18 per kg (***). Gold, by comparison is more like $40000 per kg. So there's room for some profit here.

How could I make the nuclear reaction happen? This particular reaction needs so-called "fast neutrons", that is, neutrons that are still zipping around at the high energies typical of nuclear reactions, as opposed to neutrons that are bouncing around at energies consistent with room temperature. I could stick the gold in a "fast breeder reactor", but I don't actually have one in my basement, and they're kind of hard to build. I could use a particle accelerator to generate some neutrons (basically by bashing nuclei around until some neutrons fall off) but while I do have a particle accelerator in my basement, it takes one a lot more serious than I can reasonably build to get neutrons out. Nuclear fusion reactions give out neutrons, though, so all I'd have to do would be to build a fusion reactor in my basement. Improbable as it sounds, this actually is feasible, provided I'm not trying to get any energy out.

The trick is that there's a fusion reactor, called the Farnsworth-Hirsch fusor, that is surprisingly simple to build. It is actually something of a cross between a fusion reactor and a particle accelerator: I'd set up an electrical potential in a spherical configuration, accelerating deuterium ions towards the center, where they'd crash into other deuterium ions, occasionally hard enough for fusion to happen. This fusion would release a fast neutron.

To make gold, then, all I'd have to do would be to build a fusor, surround it with a blanket of mercury, run it for a while, and then extract the gold from the mercury. Simple, really.

Let's look at the economics, though.

Suppose we want to make a kilogram of gold, giving us $40000. We need about a kilogram of mercury, costing $18. We also need about 5 g of deuterium (assuming perfect efficiency), which would cost about $30. Finally, we need the power to run the fusor. That's not going to be cheap. An optimistic number for the best fusor ever built is about 10^12 neutrons per second from about 4 kW input. That amounts to 9*10^14 neutrons per kilowatt-hour. Assuming perfect efficiency again, we need about 3*10^24 neutrons for our kilogram of gold, or 3 terawatt-hours, about the world's total energy production for an hour and a half. At $0.10 per kilowatt hour (I live in the land of cheap hydroelectricity) that's three hundred billion dollars.

There's a somewhat more disturbing possibility, though. Gold is easily obtained; you can just buy it. But as a global society, we try very hard to make sure you can't easily get plutonium, particularly plutonium-239, since that is well-suited to building atomic bombs. (You can make bombs out of uranium too, but that requires you to separate the different isotopes, which are very nearly chemically identical. Plutonium, on the other hand, can easily be separated from uranium since it is a different element.) Uranium isn't too hard to come by, especially "depleted uranium" (uranium with most of the uranium-235 removed) - armies fire the stuff at each other, for example. And if you had lots of U-238, a fusor would let you make plutonium out of it. The cost would be high, hopefully prohibitively so, but you could do it without doing anything that would put you on the radar of the IAEA. Fortunately, the power use is so outrageous we don't really need to worry about it.

So, in short, I could make gold in my basement, but not any appreciable quantity, and not for any kind of sensible price.

(*) Since gold is a noble metal, there aren't many chemical compounds that contain gold; unlike, say, iron, gold is often found on Earth as lumps of raw gold. So while in principle alchemists could have started from some gold compound and gotten the original gold back, this would not have been a very interesting accomplishment.

(**) There are actually situations where you can affect the nucleus by manipulating the electron shells. For example, if an isotope decays by electron capture, you can drastically slow down its decay by stripping away all its electrons. But stripping away all the electrons from a reasonably heavy element is one of those things that's virtually impossible under terrestrial conditions but not too rare astrophysically. In any case this has no effect on stable isotopes.

(***) Canadian dollars and American dollars are equivalent to astronomical accuracy.

Full post

Curve Fitting part 4: Validating Bayesian Code

In my previous post, I wrote a tool to use Bayesian inference to ask whether a collection of photons represented a pulsed signal, and if so, what its parameters were. It gave plausible answers, but knowing my own fallibility, I really want some more careful test before trusting its output.

I previously talked about how to implement such a correctness test in a frequentist setting. But in a Bayesian setting, what one gets are probability distributions describing our knowledge about the hypotheses - how can a program test those?

After racking my brains for a while trying to find a way to do this, I asked about it on the scipy-user mailing list, and received a few useful suggestions, the most valuable of which was to send a polite email to Professor Tom Loredo. I did, and he replied immediately with a very helpful email, and some links to other people's work on the subject.

It turns out that, in the context of Bayesian fitting, the issue is called "calibration". The key realization is that if you select a model according to your prior distribution, generate a data set, and then do your fitting to generate posterior distributions, then your "true" parameters that you used to generate the data set look just as if they had been drawn from the posterior distribution.

This makes a certain kind of sense: after all, your posterior distribution is supposed to describe the distribution of models that might have generated your data set.

So how do you turn this into a test? After all, if you just get one sample from a distribution, it's pretty hard to say anything very definite, especially when the tails - the most unlikely extreme parameters - are not necessarily very well modeled. If you try to generate another sample, you pick a different model and must then generate a different data set, so you get a new point but a totally different posterior distribution. So now you have two points, drawn from two different distributions, and your situation has not necessarily improved.

There's a trick to let you work with a large sample: you transform them to all have the same distribution. You can do this because you know the posterior distribution you're working with; in my case I have its values sampled evenly. So I can construct the cumulative distribution function and use its inverse to get the posterior probability of a value less than the true model. This should be distributed uniformly no matter the shape of the posterior.

In fact, I'd rather use a slightly different trick: instead of looking at the probability of a value less than the true model, I'll use the probability of a value more extreme than the true model. Essentially I'll combine both tails. This has a more direct bearing on the question of confidence intervals, and still results in a uniform distribution as I repeat the process many times.

Once I have a long list of many values that should be distributed uniformly, there are any number of tests I can use; I'll use the handy but not particularly good Kolmogorov-Smirnov test. My unit testing code now looks like this:


def test_credible_interval():

    M = 100
    N = 50

    ps = []
    for i in range(M):
        f, p = np.random.rand(2)

        events = bayespf.generate(f, p, N)

        phases, fractions, r, P = bayespf.infer(events)

        frac_pdf = np.average(r,axis=0)
        fi = np.searchsorted(fractions, f)
        p = np.sum(frac_pdf[:fi])/np.sum(frac_pdf)
        p = 2*min(p, 1-p)

        ps.append(p)

    assert scipy.stats.kstest(ps,lambda x: x) > 0.01

Note that I don't bother testing the posterior for pulse phase. This is just laziness.

In a real Bayesian problem, there would usually be many more parameters, and I would probably not be able to evaluate the posterior on a grid like this. I'd probably be using some sort of Monte Carlo method, which would return a list of samples drawn from the posterior instead. But there are reasonable approaches in this setting too.

So far so good. But what about the probability that the signal is pulsed? In some sense hypothesis testing is just a special case of parameter fitting, but the discreteness of the "parameter" poses a problem: there's no way we can hope for a nice uniform distribution when only two values are possible. Professor Loredo very kindly sent me an unpublished note in which he addresses this problem, showing that if your code accepts a hypothesis whose posterior probability is P_crit, then a correct piece of code will be right - have chosen the correct hypothesis - with probability at least P_crit. Unfortunately P_crit is only a lower limit on the probability of getting things right; but there isn't really a way around this: suppose my code were working with a million photons a run. Then it would, in almost every case, give a probability very close to zero or one. There would be very, very few errors, no matter what value P_crit you set.

The fact that P_crit is only a lower limit does mean that this doesn't allow you to catch code that is too conservative: if your code makes fewer errors than its probabilities claim it should, this test has no way to tell that that's what's happening.

One must also choose P_crit carefully. In my example, if I choose P_crit=0.5, then code that simply chose randomly, or returned the same value all the time, would pass: after all, my priors specify equal probabilities for each model, and with P_crit=0.5 the code only needs to be right half the time. On the other hand, with P_crit really high, the code will almost never be wrong, although it only takes a few errors to fail, but it will almost never actually come to a conclusion, so you'll wait forever for evidence. There should be some fairly well-defined optimum value of P_crit, and I need to think more about what it is.

In any case, having selected P_crit and the other parameters, I can write a unit test for the hypothesis testing component:


def test_pulsed_probability():
    np.random.seed(0)
    p_accept = 0.75
    M = 50
    N = 30

    accepted = 0
    correct = 0
    total = 0
    while accepted<M:
        if np.random.rand()>0.5:
            f, p = np.random.rand(2)
            pulsed = True
        else:
            f, p = 0., 0.
            pulsed = False

        events = bayespf.generate(f, p, N)

        phases, fractions, r, P = bayespf.infer(events, n_phase=100, n_frac=101)

        if P>=p_accept:
            accepted += 1
            if pulsed:
                correct += 1
        if P<1-p_accept:
            accepted += 1
            if not pulsed:
                correct += 1
        total += 1
    assert 0.01<scipy.stats.binom(M,p_accept).cdf(correct)

As a final note, I'd like to thank Professor Loredo for his help, but note that any errors in the above code or description are entirely mine.

Full post

Curve Fitting part 3: Bayesian fitting

When you fit a curve to data, you would usually like to be able to use the result to make statements about the world, perhaps something like "there's a fifty percent chance the slope is between 1 and 2". But this is a bit peculiar from a philosophical point of view: if your data is a set of measurements of some real-world phenomenon, then it's a bit funny to talk about probabilities that the slope has certain values. The phenomenon has some fixed slope, so we can't somehow repeat the experiment many times and see how often the slope is between 1 and 2. But there is a way to make such a statement meaningful: Bayesian inference.

The basic idea is that you use probabilities to quantify the degree of uncertainty you have about the world; you are using a system of logic that uses probability and probability distributions rather than binary logic. This may sound fuzzy and esoteric, but Bayesian logic is used very successfully in, for example, automatic systems to evaluate whether email is spam.

When applied to data, Bayesian reasoning lets you make meaningful statements about probabilities of parameter values, at the cost of making some explicit assumptions going in, and also at the cost of some potentially substantial computation. I'll work through an example of fitting a light curve to a set of photon arrival times, using a Bayesian procedure.

First the problem setting: suppose we observe a pulsar, whose period we know exactly (perhaps from radio observations), with an X-ray or gamma-ray telescope. We see some (fairly small) number of photons, and we want to know whether the flux we see is modulated at the pulsar period. We "fold" the photon arrival times, recording the pulsar phase when each one arrives. So our data is a collection of some hundreds or thousands of numbers between zero and one.

The model we'll fit includes some fraction f of photons whose arrival times are distributed as a sinusoid with a peak at phase p; the remaining fraction (1-f) are independent of phase.

The key idea of Bayesian curve fitting is that if you have some collection of hypotheses Hi about the world, each having some probability P(Hi), and you make some observation, there's a simple procedure to update these probabilities to reflect your new knowledge.

From a philosophical point of view, it's a bit worrisome to have to supply hypotheses (called "priors") about what the world is like before we ever make any observations. It amounts to making assumptions in the absence of data. But in fact there are assumptions built into the usual "frequentist" methods of inference as well, and in Bayesian inference the ability be explicit about the hypotheses at least makes it clear what's going on.

What assumptions should we make for our pulsar? Well, there might or might not be pulsations, so we'll have two basic hypotheses: no pulsations and pulsations. Absent any information, we'll assume these are equally likely. Then, if there are pulsations, we need to specify prior distributions of phase and pulsed fraction. Since both these parameters are between zero and one, we'll just take a so-called "flat prior" that makes all values between zero and one equally likely.

Given these priors, we need to figure out how to use our observed photon arrival times to update the priors to give us "posteriors". The general formula is:

P(Hi|D) = P(D|Hi) P(Hi) / P(D)

That is, the probability of hypothesis i given the data, P(Hi|D), equals the probability of the data given Hi, P(D|Hi), times the prior probability of Hi, P(Hi), divided by the probability of the data given any hypothesis.

The first thing to note is that P(D|Hi) is just what we need to evaluate for a maximum-likelihood estimate: how likely data like what we observe is to arrive given some hypothesis. We only need to define it up to a constant, since it appears in both numerator and denominator. For our problem, the probability density for pulse arrival times is p(f,p,t) = f(1+cos(2 pi (t-p)))+(1-f). So P(D|Hi) is the product of p(f,p,ti) for all events ti.

How do we form P(D)? Well, since we have two hypotheses, H0 (no pulsations) and H1 (pulsations), we can write P(D) = P(D|H0)+P(D|H1). Further, H1 is actually a family of hypotheses depending on two parameters, so we need to integrate P(D|H1) over all possible values of the parameters.

If we apply the above formula, then, we should get two things: a posterior probability that there are any pulsations at all, and a posterior probability distribution for the two parameters.

Let's look at python code to implement this. First of all, we're going to need to be able to generate fake data sets:


def generate(fraction, phase, n):
 m = np.random.binomial(n, fraction)
 pulsed = np.random.rand(m)
 c = np.sin(2*np.pi*pulsed)>np.random.rand(m)
 pulsed[c] *= -1
 pulsed += 0.25+phase
 pulsed %= 1

 r = np.concatenate((pulsed, np.random.rand(n-m)))
 np.random.shuffle(r)
 return r

This routine generates the photons in two parts. First it decides randomly how many come from the pulsed component. Then the photons from the pulsed component are generated uniformly. To convert this to a sinusoidal distribution we select some of the photons in the lower part and move them to the upper part. We then add in some uniformly-distributed photons, and shuffle the two samples together.

Now we write a routine to evaluate the probability density function:


def pdf_data_given_model(fraction, phase, x):
 return fraction*(1+np.cos(2*np.pi*(x-phase)))+(1-fraction)

Note that in spite of appearances, this routine can act on an array of values at once; this is important since python's interpreted nature makes each line of python take quite a long time.

And now the fitting routine:


def infer(events, n_phase=200, n_frac=201):

 events = np.asarray(events)
 phases = np.linspace(0,1,n_phase,endpoint=False)
 fractions = np.linspace(0,1,n_frac)

 lpdf = np.zeros((n_phase, n_frac))
 for e in events:
     lpdf += np.log(pdf_data_given_model(fractions, phases[:,np.newaxis], e))

 # This weird-looking hack avoids exponentiating very large numbers
 mx = np.amax(lpdf)
 p = np.exp(lpdf - mx)/np.average(np.exp(lpdf-mx))

 S = np.average(np.exp(lpdf))

 return phases, fractions, p, (S/(S+1))

This uses one of the simplest approaches to calculating the distribution and its integral: just evaluate on a grid. Integration then becomes averaging. More sophisticated Bayesian problems usually involve high-dimensional integrals, and so a whole elaborate machinery has evolved for efficiently evaluating these (for example the python package pymc).

Finally, some wrappers to generate a fake data set, call the fitting routine, and plot and print the results:


if __name__=='__main__':
 import pylab as pl

 events = generate(0.2,0.5,200)
 phases, fractions, r, P = infer(events)
 print "Probability the signal is pulsed: %f" % P

 pl.subplot(211)
 pl.contourf(fractions, phases, r)
 pl.xlabel("Pulsed fraction")
 pl.ylabel("Phase")
 pl.xlim(0,1)
 pl.ylim(0,1)

 pl.subplot(212)
 p = np.average(r,axis=0)
 li, mi, ui = np.searchsorted(np.cumsum(p)/np.sum(p),
         [scipy.stats.norm.cdf(-1), 0.5, scipy.stats.norm.cdf(1)])
 pl.plot(fractions, p)

 pl.xlabel("Pulsed fraction")
 pl.ylabel("Probability")
 pl.axvline(fractions[li])
 pl.axvline(fractions[mi])
 pl.axvline(fractions[ui])
 print ("Pulsed fraction: %f [%f, %f]" %
         (fractions[mi], fractions[li], fractions[ui]))

 pl.show()

One key point here is that when I want to know the distribution of pulsed fraction but don't care about the phase, I integrate (i.e. average) the joint distribution along the phase direction.

This gives us the following plot:

And the following output:


Probability the signal is pulsed: 0.450240
Pulsed fraction: 0.210000 [0.100000, 0.315000]

So it looks like the fitting routine is working: even with relatively few photons and a small pulsed fraction, it has selected quite good best-fit values. The probability that the signal is actually pulsed seems a little low, but keep in mind that we have only two hundred photons, and only forty of these are actually pulsed (while a Poisson uncertainty on the number of photons would be something like 14). But giving plausible results is not really enough: I want to systematically test this routine for correctness. But that will be another post.

Full post

Testing statistical tests

Often one wants to ask something like "is there a periodic signal in this data?" or "is there a correlation between these two data sets?". Often there is some way to calculate a number that represents how much signal there is or how much correlation there is. But of course there is always noise and uncertainty in the data, and so it's possible that the apparent signal or correlation is actually just noise. So for such a number to be useful, it must also come with an estimate of how likely it would be to get such a value if there were no signal, only noise. Such an estimate is often called a p-value.

I should say, before I go on, that there are a number of things a p-value isn't: it's not (one minus) a probability that the signal is real. It's not a measure of the strength or importance of the signal. It's not even enough to tell you whether you should believe the signal is real: if you look at a hundred thousand data sets, you should expect to find many with p<0.01 even if there's nothing there. But this has been better discussed elsewhere, and it's not my topic today.

Today I'm addressing a specific issue: suppose that you have implemented some new way of quantifying the presence of periodic signals, or correlations, or whatever, and your implementation returns a p-value. How do you make sure that p-value is actually correct?

Of course, you can (and should!) go over your code carefully, with a critical eye. But evaluating the p-value often involves evaluating some arcane special function. How can you be sure that you implemented it correctly, or for that matter, that you got the math right in the first place? It turns out there's a nice way to test your code as a black box.

For our example test, let's take a well-known test: the chi-squared test. As I'm going to use it here, we have m measurements, each with measurement uncertainty one, and we want to know whether the underlying quantity is the same in all measurements. The chi-squared test works by constructing the quantity chi-squared, which is the sum of the squares of the differences of the measurements from the mean:


def chi_squared(values):
  return np.sum((values-np.mean(values))**2)

This quantity has a known distribution, naturally enough called the chi-squared distribution. It is parameterized by a number called the "degrees of freedom", which in our case is m-1, the number of measurements minus one because we subtracted the mean. For my p-value, I will just ask what the probability is that random noise would produce a chi-squared value larger than what we observed. (Note that we might well want to consider unusual any data set for which the chi-squared value is unnaturally small as well as ones where the chi-squared is unnaturally large, but that can only happen if we have overestimated the uncertainties on our measurements, so I'll leave it aside.)


def chi_squared_p(m,c):
  return scipy.stats.chi2(m-1).sf(c)

For a distribution in scipy.stats, the "sf" method evaluates the "survival function", that is, what fraction of the distribution's values are larger than the given value.

So now we have a test and a p-value calculation. How are we to check that we implemented it correctly? Well, let's pick an m and a p-value we're interested in, say p0 = 0.05. Let's also pick a number of repetitions, N. We will generate N fake data sets in which there is nothing but noise, and see how many times our statistic reports p<p0. This should be something close to p0*N. How close? Well, the statistic should behave like flipping an unfair coin with probability p0, so we can use the binomial distribution to find limits that should contain the number of false positives 98% of the time.


def test_p(N,p0,m):
  false_positives = 0
  for i in range(N):
      if chi_squared_p(m,chi_squared(np.random.randn(m)))<p0:
          false_positives += 1

  assert scipy.stats.binom(N,p0).cdf(false_positives)>0.005
  assert scipy.stats.binom(N,p0).sf(false_positives)>0.005

There we have a test; the more trials you let it run (N), the tighter constraints it puts on the p-value. Unfortunately, it fails 2% of the time even if everything's fine, and it's random, so if it fails, you can't restart running it in the debugger (since the next run will get different values). There's a way around these problems, too. The first problem can be avoided by running the test once, then if it fails, rerunning it. Then the test only reports a failure 0.04% of the time if all is well, but a real problem in the algorithm will probably still show up. The second problem you can solve by seeding the random number generator every time you run the test. In python, decorators provide a handy way to avoid having to write boilerplate code to do both these things for each test; I have two decorators, seed() and double_check, that do this. I'll omit their code here because they have some nastiness to work around limitations in nosetests (but you can find them, along with a more detailed example of the techniques in this post here).

This test is nice, but a little limited: it only tests one p value, p0. Since every time the statistical test runs it returns a p value, surely we can do better?

Well, since the p value is supposed to be the probability that a random data set will exceed the value obtained in a particular test, if we generate lots of noise samples, we should have the fraction whose value is less than p0 roughly equal to p0 for every p0 - in other words, the p-values returned should be uniformly distributed. So we can use a statistical test for uniform distribution to check whether they're coming out right! One such test, present in scipy, is the Kolmogorov-Smirnov test. That gives the following scheme:


def test_ks(N,m):
  pvalues = [chi_squared_p(m,chi_squared(np.random.randn(m))) for i in range(N)]

  D, p = scipy.stats.kstest(pvalues, lambda x: x)
  assert p>0.01

This does have the drawback that the K-S test is known to be most sensitive in the middle of the distribution, while what we care about is actually the tail. A minor modification can help, at the cost of some speed:


def test_ks_tail(N,m,p0):
  pvalues = []
  while len(pvalues)<N:
      pv = chi_squared_p(m,chi_squared(np.random.randn(m)))
      if pv<p0:
          pvalues.append(pv/p0)
  D, p = scipy.stats.kstest(pvalues, lambda x: x)
  assert p>0.01

This has the embarrassing problem that it's too good: it turns out that while this works for the chi-squared statistic I describe, for the K-S test itself, the p-value returned is actually rather approximate, so that this test tends to fail.

Finally, there's one essential thing to check: how stringent are these tests? If we return the wrong p-value, do they pick it up? A quick check can be done simply by modifying the chi-squred calculator to jigger the value a bit, then checking that the tests fail. With a thousand trials, the tests pass just fine if I return 1.01*chi-squared; I have to increase it to something like 1.05*chi-squared to start getting failures.

This brings us to the elephant in the room: the power of a statistical test. The p-value is really only half the answer; it tells us how likely we are to get an apparent signal when there's nothing there. It tells us nothing about whether we would actually see a signal if it were there; for that, you need a different quantity, the power of the test.

Full post

Cosmos

In honor of Carl Sagan day, I'd like to link to Cosmos:

Nearly thirty years later, and in spite of tremendous advances in astronomy, Cosmos is still a wonderful introduction. Rather than focus on the science, Carl Sagan does a great job of sharing the wonder of discovery.

I am kicking myself now: when I was much younger, I was visiting a cousin who works at Cornell, and he pointed to Carl Sagan's office and asked if I wanted to go up and say hello. Out of shyness, I guess, I demurred, but it would have been a fascinating visit.

Full post

Lighthouse in the Sky

Portraits

Kernel Density Estimators

Ghetto apple crumble

Alchemy

Curve Fitting part 4: Validating Bayesian Code

Curve Fitting part 3: Bayesian fitting

Testing statistical tests

Cosmos

Search This Blog

Popular Posts

Labels

Blog Archive

Other Blogs

Search This Blog

Popular Posts

Labels

Blog Archive

Other Blogs

Subscribe To