A Main Sequence for Galaxies?
Not for the first time in my life I find myself a bit of a laughing stock, after blowing my top during a seminar at Cardiff yesterday by retired Professor Mike Disney. In fact I got so angry that, much to the amusement of my colleagues, I stormed out. I don’t often lose my temper, and am not proud of having done so, but I reached a point when the red mist descended. What caused it was bad science and, in particular, bad statistics. It was all a big pity because what could have been an interesting discussion of an interesting result was ruined by too many unjustified assertions and too little attention to the underlying basis of the science. I still believe that no matter how interesting the results are, it’s the method that really matters.
The interesting result that Mike Disney talked about emerges from a Principal Components Analysis (PCA) of the data relating to a sample of about 200 galaxies; it was actually published in Nature a couple of years ago; the arXiv version is here. It was the misleading way this was discussed in the seminar that got me so agitated, so now that I’ve calmed down I’ll give my own take on what I think is going on.
In fact, Principal Component Analysis is a very simple technique and shouldn’t really be controversial at all. It is a way of simplifying the representation of multivariate data by looking for the correlations present within it. To illustrate how it works, consider the following two-dimensional (i.e. bivariate) example I took from a nice tutorial on the method.
In this example the measured variables are Pressure and Temperature. When you plot them against each other you find they are correlated, i.e. the pressure tends to increase with temperature (or vice-versa). When you do a PCA of this type of dataset you first construct the covariance matrix (or, more precisely, its normalized form, the correlation matrix). Such matrices are always symmetric and square (i.e. N×N, where N is the number of variables measured at each point; in this case N=2). What the PCA does is to determine the eigenvalues and eigenvectors of the correlation matrix.
The eigenvectors for the example above are shown in the diagram – they are basically the major and minor axes of an ellipse drawn to fit the scatter plot; these two eigenvectors (and their associated eigenvalues) define the principal components as linear combinations of the original variables. Notice that along one principal direction (v1) there is much more variation than the other (v2). This means that most of the variance in the data set is along the direction indicated by the vector v1, and relatively little in the orthogonal direction v2; the eigenvalue for the first vector is consequently larger than that for the second.
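For anyone who wants to see the recipe in action, here is a minimal sketch in Python with NumPy. The pressure and temperature values are synthetic numbers invented purely for illustration (they are not the data in the figure), and the function name pca_correlation is my own.

```python
import numpy as np

def pca_correlation(data):
    """PCA via the correlation matrix; rows are points, columns are variables."""
    corr = np.corrcoef(data, rowvar=False)    # symmetric N x N matrix
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # re-rank: largest variance first
    return eigvals[order], eigvecs[:, order]  # columns of eigvecs are v1, v2, ...

# Synthetic, made-up data: pressure rises with temperature, plus some scatter.
rng = np.random.default_rng(0)
temperature = rng.normal(300.0, 10.0, size=500)
pressure = 0.5 * temperature + rng.normal(0.0, 2.0, size=500)
data = np.column_stack([pressure, temperature])

eigvals, eigvecs = pca_correlation(data)
v1, v2 = eigvecs[:, 0], eigvecs[:, 1]
print("eigenvalues:", eigvals)
print("v1:", v1, "v2:", v2)
```

Because the correlation matrix works in standardized units, for two positively correlated variables v1 comes out along the diagonal, (1, 1)/√2 up to an overall sign, and the eigenvalues are 1 + r and 1 − r, where r is the correlation coefficient.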
The upshot of this is that the description of this (very simple) dataset can be compressed by using the first principal component rather than the original variables, i.e. by switching from the original two variables (pressure and temperature) to one variable (v1) we have compressed our description without losing much information (only the little bit that is involved in the scatter in the v2 direction).
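The compression can be made concrete with a short sketch, again using made-up pressure and temperature numbers rather than any real measurements. Each point is replaced by a single coordinate along v1, and the fraction of the (standardized) variance thrown away is exactly the share carried by v2.

```python
import numpy as np

# Synthetic, made-up data standing in for the pressure-temperature example.
rng = np.random.default_rng(1)
temperature = rng.normal(300.0, 10.0, size=500)
pressure = 0.5 * temperature + rng.normal(0.0, 2.0, size=500)
data = np.column_stack([pressure, temperature])

# Standardize each variable (zero mean, unit variance).
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Eigen-decomposition of the correlation matrix; eigh sorts eigenvalues ascending.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
v1 = eigvecs[:, -1]              # direction of largest variance

scores = z @ v1                  # one number per point instead of two
z_approx = np.outer(scores, v1)  # reconstruction from that single number

lost = np.sum((z - z_approx) ** 2) / np.sum(z ** 2)
print("fraction of variance lost:", lost)
```

The quantity lost equals the smaller eigenvalue divided by the sum of the eigenvalues, which is the precise sense in which "not much information" is discarded.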
In the more general case of N observables there will be N principal components, corresponding to vectors in an N-dimensional space, but nothing changes qualitatively. What the PCA does is to rank the eigenvectors according to their eigenvalue (i.e. the variance associated with the direction of the eigenvector). The first principal component is the one with the largest variance, and so on down the ordered list.
Where PCA is useful with large data sets is when the variance associated with the first (or first few) principal components is very much larger than the rest. In that case one can dispense with the N variables and just use one or two.
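One way to decide how many components to keep is to look at the fraction of the total variance each one carries. The sketch below assumes you already have the eigenvalues; the six numbers in example are invented for illustration, and the helper names are hypothetical.

```python
import numpy as np

def variance_fractions(eigvals):
    """Fraction of the total variance carried by each component, largest first."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return eigvals / eigvals.sum()

def components_needed(eigvals, threshold):
    """Smallest number of leading components whose cumulative fraction
    reaches the threshold (which should be below 1)."""
    cumulative = np.cumsum(variance_fractions(eigvals))
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))

# Made-up eigenvalue spectrum for N = 6 standardized variables (sums to 6).
example = [4.8, 0.6, 0.3, 0.15, 0.1, 0.05]
fractions = variance_fractions(example)
print("fractions:", fractions)
print("components for 85% of the variance:", components_needed(example, 0.85))
```

With a spectrum like this one, a single component already carries most of the variance, and each further component adds progressively less.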
In the cases discussed by Professor Disney yesterday the data involved six measurable parameters of each galaxy: (1) a dynamical mass estimate; (2) the mass inferred from HI emission (21cm); (3) the total luminosity; (4) radius; (5) a measure of the central concentration of the galaxy; and (6) a measure of its colour. The PCA of these data reveals that about 80% of the variance in the data set is associated with the first principal component, so there is clearly a significant correlation present in the data. To be honest, though, I have seen many PCAs with much stronger concentrations of variance in the first eigenvector, so this one doesn’t strike me as being particularly strong.
However, thinking as a physicist rather than a statistician there is clearly something very interesting going on. From a theoretical point of view one would imagine that the properties of an individual galaxy might be controlled by as many as six independent parameters including mass, angular momentum, baryon fraction, age and size, as well as by the accidents of its recent haphazard merger history.
Disney et al. argue that for gaseous galaxies to appear as a one-parameter set, as observed here, the theory of galaxy formation and evolution must supply at least five independent constraint equations in order to collapse everything into a single parameter.
This is all vaguely reminiscent of the Hertzsprung-Russell diagram, or at least the main sequence thereof:
You can see here that there’s a correlation between temperature and luminosity which constrains this particular bivariate data set to lie along a (nearly) one-dimensional track in the diagram. In fact these properties correlate with each other because there is a single parameter model relating all properties of main sequence stars to their mass. In other words, once you fix the mass of a main sequence star, it has a fixed luminosity, temperature, and radius (apart from variations caused by age, metallicity, etc). Of course the problem is that masses of stars are difficult to determine so this parameter is largely hidden from the observer. What is really happening is that luminosity and temperature correlate with each other, because they both depend on the hidden parameter mass.
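The hidden-parameter idea is easy to simulate. In the toy model below, every observable is a power law in mass with some independent scatter; the exponents are rough illustrative assumptions of mine, not fitted values, and the mass itself is never shown to the PCA.

```python
import numpy as np

rng = np.random.default_rng(42)
n_stars = 1000

# Hidden parameter: log10(mass). It is NOT passed to the PCA.
log_mass = rng.uniform(-1.0, 1.0, size=n_stars)

# Observables as power laws in mass plus independent scatter
# (exponents and scatter are assumed, illustrative values).
log_lum = 3.5 * log_mass + rng.normal(0.0, 0.10, size=n_stars)
log_temp = 0.5 * log_mass + rng.normal(0.0, 0.02, size=n_stars)
log_rad = 0.8 * log_mass + rng.normal(0.0, 0.05, size=n_stars)

observed = np.column_stack([log_lum, log_temp, log_rad])
z = (observed - observed.mean(axis=0)) / observed.std(axis=0)

# eigh sorts eigenvalues ascending, so the last one is the largest.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
first_fraction = eigvals[-1] / eigvals.sum()

# Position of each star along the first principal component.
scores = z @ eigvecs[:, -1]
corr_with_mass = abs(np.corrcoef(scores, log_mass)[0, 1])

print("variance in first component:", first_fraction)
print("|correlation| of first component with hidden log mass:", corr_with_mass)
```

The PCA finds that one number is enough to place a star on the track, and that number tracks the hidden mass closely, but nothing in the analysis itself tells you that the underlying variable is mass.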
I don’t think that the PCA result disproves the current theory of hierarchical galaxy formation (which is what Disney claims) but it will definitely be a challenge for theorists to provide a satisfactory explanation of the result! My own guess for the physical parameter that accounts for most of the variation in this data set is the mass of the dark halo within which the galaxy is embedded. In other words, it might really be just like the Hertzsprung-Russell diagram…
But back to my argument with Mike Disney. I asked what is the first principal component of the galaxy data, i.e. what does the principal eigenvector look like? He refused to answer, saying that it was impossible to tell. Of course it isn’t, as the PCA method actually requires it to be determined. Further questioning seemed to reveal a basic misunderstanding of the whole idea of PCA which made the assertion that all of modern cosmology would need to be revised somewhat difficult to swallow. At that point of deadlock, I got very angry and stormed out.
I realise that behind the confusion was a reasonable point. The first principal component is well-defined, i.e. v1 is completely well defined in the first figure. However, along the line defined by that vector, P and T are proportional to each other so in a sense only one of them is needed to specify a position along this line. But you can’t say on the basis of this analysis alone that the fundamental variable is either pressure or temperature; they might be correlated through a third quantity you don’t know about.
Anyway, as a postscript I’ll say I did go and apologize to Mike Disney afterwards for losing my rag. He was very forgiving, although I probably now have a reputation for being a grumpy old bastard. Which I suppose I am. He also said one other thing, that he didn’t mind me getting angry because it showed I cared about the truth. Which I suppose I do.
December 2, 2010 at 9:45 pm
Having known both people involved in this incident for many years, I can only respond with: ha ha, ho ho, hee hee.
(On the science, without looking at the paper, let’s see if I can guess what the principal component analysis found. The colour will be strongly correlated with the HI mass, because gas-poor galaxies will not have formed stars recently and will contain only old stars (mostly red). Giant elliptical galaxies will have large dynamical masses, large luminosities and high central concentrations, compared with most spirals and irregulars. These parameters will therefore be slightly anticorrelated with those properties associated with spirals, namely HI masses and blue colours. These are mostly issues of the outcomes of galaxy formation and evolution, not particular formation theories. But it is always very dangerous to comment on science without having read the original paper.)
December 2, 2010 at 9:48 pm
If you’d been there you would probably have found it even more amusing!
P.S. It’s a blind HI-selected survey so there aren’t any ellipticals in there…
December 2, 2010 at 10:11 pm
A blind, HI-selected survey? Well, that makes my point that it’s dangerous to comment on science without reading the paper!
December 3, 2010 at 1:38 pm
I’m still struggling with your Bayesian probability posts but until I saw your neat little diagram I had never really seen a convincing illustration of exactly what an eigenvector was. Suddenly the light dawned after many many years – well done and thanks.
December 3, 2010 at 6:17 pm
I have read his paper several times and keep on changing my mind whether it is saying anything new. I don’t really think it is, because the first four parameters are all ones you would expect to be correlated: galaxies with higher dynamical masses generally contain more gas and stars and have larger radii. The correlation of the fifth parameter, the concentration, just shows that early-type spirals (the ones with big bulges) are the more massive ones – which we knew anyway. And the final parameter, colour, is not correlated with the others. Also, unlike the fundamental plane for ellipticals, the correlations aren’t very tight.
December 3, 2010 at 6:50 pm
I looked at the paper last night and since then I have been trying to suppress the urge to post the comment “Doesn’t it just show that big galaxies are big and small galaxies are small?”
December 3, 2010 at 8:44 pm
That’s basically what I said in the post. I think all these parameters correlate with the halo mass.
December 3, 2010 at 9:30 pm
Using a sample chosen on the basis of HI observations restricts the sample somewhat.
Did anybody ask Prof. Disney if he’d considered selection effects? ….
December 4, 2010 at 11:30 am
Steve,
I think you’d expect all these things to correlate with halo mass, but possibly rather loosely. It all boils down to the tightness of the observed correlation. I have to say I don’t think it’s all that tight, so although it’s interesting I don’t think it’s at all earth-shattering…
Peter
December 4, 2010 at 10:33 pm
Peter,
I’d argue that it is the mass in the visible part of the galaxy that is important, and much of that will be baryonic. That determines the dynamical mass. If there are limits on the central surface densities of stars and gas (intrinsic or observational), we would expect a correlation between most of the other main parameters and size. I agree that it is an interesting set of results, though not wholly unexpected.
Bryn.
February 10, 2013 at 5:39 pm
[…] I think anyone who has worked in scientific research will recognize elements of the stories discussed in the Observer piece. On the positive side, cracking a challenging research problem can lead to a wonderful sense of euphoria. Even much smaller technical successes lead to a kind of inner contentment which is most agreeable. On the other hand, failure can lead to frustration and even anger. I’ve certainly shouted in rage at inanimate objects, but have never actually put my fist through a monitor but I’ve been close to it when my code wouldn’t do what it’s supposed to. There are times in that sort of state when working relationships get a bit strained too. I don’t think I’ve ever really exploded in front of a close collaborator of mine, but have to admit that on one memorable occasion I completely lost it during a seminar…. […]