Not for the first time in my life I find myself a bit of a laughing stock, after blowing my top during a seminar at Cardiff yesterday by retired Professor Mike Disney. In fact I got so angry that, much to the amusement of my colleagues, I stormed out. I don’t often lose my temper, and am not proud of having done so, but I reached a point when the red mist descended. What caused it was bad science and, in particular, bad statistics. It was all a big pity because what could have been an interesting discussion of an interesting result was ruined by too many unjustified assertions and too little attention to the underlying basis of the science. I still believe that no matter how interesting the *results* are, it’s the *method* that really matters.

The interesting result that Mike Disney talked about emerges from a Principal Component Analysis (PCA) of data for a sample of about 200 galaxies; it was actually published in *Nature* a couple of years ago, and the arXiv version is here. It was the misleading way this was discussed in the seminar that got me so agitated, so now that I’ve calmed down I’ll give my own take on what I think is going on.

In fact, Principal Component Analysis is a very simple technique and shouldn’t really be controversial at all. It is a way of simplifying the representation of multivariate data by looking for the correlations present within it. To illustrate how it works, consider the following two-dimensional (i.e. bivariate) example I took from a nice tutorial on the method.

In this example the measured variables are Pressure and Temperature. When you plot them against each other you find they are correlated, i.e. the pressure tends to increase with temperature (or vice-versa). When you do a PCA of this type of dataset you first construct the covariance matrix (or, more precisely, its normalized form, the correlation matrix). Such matrices are always symmetric and square (i.e. N×N, where N is the number of variables measured at each point; in this case N=2). What the PCA does is to determine the eigenvalues and eigenvectors of the correlation matrix.
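For anyone who wants to try this, here is a minimal sketch in Python with NumPy. The pressure and temperature values are synthetic numbers I have made up for illustration (they are not the data from the tutorial), and the variable names are my own choices.

```python
import numpy as np

# Synthetic stand-in for the pressure/temperature example.
# These numbers are invented for illustration only.
rng = np.random.default_rng(0)
temperature = rng.normal(loc=300.0, scale=10.0, size=500)
pressure = 0.5 * temperature + rng.normal(scale=2.0, size=500)

# One row per measurement, one column per variable: shape (500, 2).
data = np.column_stack([pressure, temperature])

# Correlation matrix: the covariance matrix of the standardized variables.
corr = np.corrcoef(data, rowvar=False)  # shape (2, 2), symmetric

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order, with the eigenvectors as the columns of the second array.
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Reorder so that the direction of largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print("correlation matrix:\n", corr)
print("eigenvalues:", eigenvalues)
print("first principal direction:", eigenvectors[:, 0])
```

For a 2×2 correlation matrix with off-diagonal element r the eigenvalues are exactly 1+r and 1−r, so the stronger the correlation, the more lopsided the split between the two components.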

The eigenvectors for the example above are shown in the diagram – they are basically the major and minor axes of an ellipse drawn to fit the scatter plot; these two eigenvectors (and their associated eigenvalues) define the principal components as linear combinations of the original variables. Notice that along one principal direction (v_{1}) there is much more variation than the other (v_{2}). This means that most of the variance in the data set is along the direction indicated by the vector v_{1}, and relatively little in the orthogonal direction v_{2}; the eigenvalue for the first vector is consequently larger than that for the second.

The upshot of this is that the description of this (very simple) dataset can be compressed by using the first principal component rather than the original variables. By switching from the original two variables (pressure and temperature) to one variable (the coordinate along v_{1}) we have compressed our description without losing much information (only the little bit that is involved in the scatter in the v_{2} direction).
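To make "without losing much information" a bit more concrete, here is a rough sketch of the compression step. The function name `compress_to_first_component` is hypothetical, the data are again synthetic, and I am assuming the variables are standardized first so that the analysis is of the correlation matrix.

```python
import numpy as np

def compress_to_first_component(data):
    """Replace each data point by its coordinate along the first principal direction."""
    # Standardize each column; the covariance of z is the correlation matrix of data.
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    corr = np.corrcoef(z, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending eigenvalues
    v1 = eigenvectors[:, -1]                          # direction of largest variance
    scores = z @ v1                                   # one number per data point
    # Map the single coordinate back into the original (standardized) space
    # and measure how much of the total scatter was thrown away.
    reconstructed = np.outer(scores, v1)
    lost_fraction = np.sum((z - reconstructed) ** 2) / np.sum(z ** 2)
    return scores, lost_fraction, eigenvalues

# Invented pressure/temperature data, for illustration only.
rng = np.random.default_rng(1)
temperature = rng.normal(300.0, 10.0, size=500)
pressure = 0.5 * temperature + rng.normal(0.0, 2.0, size=500)

scores, lost_fraction, eigenvalues = compress_to_first_component(
    np.column_stack([pressure, temperature])
)
print("fraction of variance lost:", lost_fraction)
```

In the two-variable case the fraction lost is just the smaller eigenvalue divided by the sum of the two, which is the scatter along v_{2} expressed as a share of the total.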

In the more general case of N observables there will be N principal components, corresponding to vectors in an N-dimensional space, but nothing changes qualitatively. What the PCA does is to rank the eigenvectors according to their eigenvalue (i.e. the variance associated with the direction of the eigenvector). The first principal component is the one with the largest variance, and so on down the ordered list.

Where PCA is useful with large data sets is when the variance associated with the first (or first few) principal components is very much larger than the rest. In that case one can dispense with the N variables and just use one or two.
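Here is a sketch of how one might rank the components and decide how many to keep. The helper names `explained_variance_fractions` and `components_needed` are hypothetical, and the six-variable toy data set (all pairs sharing a correlation of 0.7) is an arbitrary choice of mine, not anything taken from real galaxies.

```python
import numpy as np

def explained_variance_fractions(data):
    """Fraction of the total variance carried by each principal component, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    return eigenvalues / eigenvalues.sum()

def components_needed(fractions, threshold):
    """Smallest number of leading components whose fractions add up to the threshold."""
    cumulative = np.cumsum(fractions)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Toy data: six variables, every pair correlated at 0.7 (an arbitrary choice).
rng = np.random.default_rng(2)
n_variables = 6
true_corr = np.full((n_variables, n_variables), 0.7)
np.fill_diagonal(true_corr, 1.0)
data = rng.multivariate_normal(np.zeros(n_variables), true_corr, size=2000)

fractions = explained_variance_fractions(data)
print("variance fractions:", np.round(fractions, 3))
print("components needed for 70%:", components_needed(fractions, 0.7))
```

For this toy correlation matrix the leading eigenvalue is 1 + 5×0.7 = 4.5 out of a total of 6, so the first component should carry roughly three quarters of the variance, with the remainder shared among the other five.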

In the case discussed by Professor Disney yesterday the data involved six measurable parameters of each galaxy: (1) a dynamical mass estimate; (2) the mass inferred from HI emission (21cm); (3) the total luminosity; (4) radius; (5) a measure of the central concentration of the galaxy; and (6) a measure of its colour. The PCA of these data reveals that about 80% of the variance in the data set is associated with the first principal component, so there is clearly a significant correlation present in the data. To be honest, though, I have seen many such analyses with much stronger concentrations of variance in the first eigenvector, so it doesn’t strike me as being particularly strong.

However, thinking as a physicist rather than a statistician, I can see that there is clearly something very interesting going on. From a theoretical point of view one would imagine that the properties of an individual galaxy might be controlled by as many as six independent parameters including mass, angular momentum, baryon fraction, age and size, as well as by the accidents of its recent haphazard merger history.

Disney et al. argue that for gaseous galaxies to appear as a one-parameter set, as observed here, the theory of galaxy formation and evolution must supply at least five independent constraint equations in order to collapse everything into a single parameter.

This is all vaguely reminiscent of the Hertzsprung-Russell diagram, or at least the main sequence thereof:

You can see here that there’s a correlation between temperature and luminosity which constrains this particular bivariate data set to lie along a (nearly) one-dimensional track in the diagram. In fact these properties correlate with each other because there is a single parameter model relating all properties of main sequence stars to their mass. In other words, once you fix the mass of a main sequence star, it has a fixed luminosity, temperature, and radius (apart from variations caused by age, metallicity, etc). Of course the problem is that masses of stars are difficult to determine so this parameter is largely hidden from the observer. What is really happening is that luminosity and temperature correlate with each other, because they both depend on the hidden parameter *mass*.
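The hidden-parameter idea is easy to play with numerically. In the sketch below a single unobserved quantity (which I call `log_mass`) drives three observables through power laws with some scatter. The exponents and the amount of scatter are illustrative assumptions, not fitted stellar or galactic scaling relations, and the hidden quantity is never given to the PCA.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# The hidden parameter: used to generate the data, never passed to the PCA.
log_mass = rng.normal(0.0, 0.3, size=n)

# Hypothetical observables: each is a power law in mass (a straight line in
# the log) plus independent scatter. The exponents are purely illustrative.
exponents = np.array([3.5, 0.8, 0.5])
scatter = 0.05
log_obs = log_mass[:, None] * exponents[None, :] + rng.normal(0.0, scatter, size=(n, 3))

# PCA on the correlation matrix of the observables alone.
corr = np.corrcoef(log_obs, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending eigenvalues
v1 = eigenvectors[:, -1]                          # principal eigenvector
fraction_pc1 = eigenvalues[-1] / eigenvalues.sum()

# Coordinate of each object along the first principal direction.
z = (log_obs - log_obs.mean(axis=0)) / log_obs.std(axis=0)
scores = z @ v1

# How well does that coordinate track the parameter we hid?
hidden_corr = abs(np.corrcoef(scores, log_mass)[0, 1])

print("principal eigenvector:", v1)
print("variance fraction in first component:", fraction_pc1)
print("|correlation| of first component with hidden mass:", hidden_corr)
```

The principal eigenvector comes out as a perfectly definite combination of the observables, and it tracks the hidden parameter closely, yet nothing in the PCA output tells you that the underlying quantity is mass rather than something else that happens to correlate with it.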

I don’t think that the PCA result disproves the current theory of hierarchical galaxy formation (which is what Disney claims) but it will definitely be a challenge for theorists to provide a satisfactory explanation of the result! My own guess for the physical parameter that accounts for most of the variation in this data set is the mass of the dark halo within which the galaxy is embedded. In other words, it might *really* be just like the Hertzsprung-Russell diagram…

But back to my argument with Mike Disney. I asked what is the first principal component of the galaxy data, i.e. what does the principal eigenvector look like? He refused to answer, saying that it was impossible to tell. Of course it isn’t, as the PCA method actually requires it to be determined. Further questioning seemed to reveal a basic misunderstanding of the whole idea of PCA which made the assertion that all of modern cosmology would need to be revised somewhat difficult to swallow. At that point of deadlock, I got very angry and stormed out.

I realise that behind the confusion was a reasonable point. The first principal component is well defined: v_{1} is completely determined in the first figure. However, along the line defined by that vector, P and T are proportional to each other, so in a sense only one of them is needed to specify a position along this line. But you can’t say on the basis of this analysis alone that the fundamental variable is either pressure or temperature; they might be correlated through a third quantity you don’t know about.

Anyway, as a postscript I’ll say I did go and apologize to Mike Disney afterwards for losing my rag. He was very forgiving, although I probably now have a reputation for being a grumpy old bastard. Which I suppose I am. He also said one other thing, that he didn’t mind me getting angry because it showed I cared about the truth. Which I suppose I do.