After blogging a few days ago about the possibility that our entire Universe might be asymmetric, I found out today that a short comment of mine about a completely different form of asymmetry has been published in the Proceedings of the National Academy of Sciences of the United States of America.
Earlier this summer a paper by Ivanka Savic & Per Lindstrom concerning gender and sexuality differences in brain structure received widespread press coverage and the odd blog comment. They had analysed a group of 90 volunteers divided into four classes based on gender and sexual orientation: male heterosexual, male homosexual, female heterosexual and female homosexual.
They studied the brain structure of these volunteers using Magnetic Resonance Imaging and used their data to look for differences between the different classes. In particular they measured the asymmetry between left and right hemispheres for their samples. The right side of the brain for heterosexual men was found to be typically about 2% larger than the left; homosexual women also had an asymmetry, but slightly smaller than this at about 1%. Gay men and heterosexual women showed no discernible cerebral asymmetry. These claims are obviously very interesting and potentially important if they turn out to be true. It is in the nature of the scientific method that such results should be subjected to rigorous scrutiny in order to check their credibility.
As someone who knows nothing about neurobiology but one or two things about statistics, I dug out the research paper by Savic & Lindstrom and looked at the analysis it presents. I very quickly began to suspect there might be a problem. For each volunteer, the authors obtain measurements of the left and right cerebral volumes (call these L and R respectively). Each pair of measurements is then combined to form an asymmetry index (AI) as (L-R)/(L+R). There is then a set of values for AI, one for each volunteer. The claim is that these are systematically different for the different gender and orientation groups, based on a battery of tests including Analysis of Variance (ANOVA) and t-tests based on sample means.
Of course, it would be better to do this using a consistent, Bayesian, approach because this would make explicit the dependence of the results on an underlying model of the data. Sadly, the statistical methodology available off-the-shelf is of inferior frequentist type and this is what researchers tend to do when they don’t really know what they’re doing. They also don’t bother to read the health warnings that state the assumptions behind the results.
The problem in this case is that the tests done by Savic & Lindstrom all depend on the quantity being analysed (AI) having a normal (Gaussian) distribution. This is very often a reasonable hypothesis for biometric data, but unfortunately in this case the construction of the asymmetry index is such that it is expected to have a very non-Gaussian shape, as is commonly the case for distributions of variables formed as ratios. In fact, the ratio of two normal variates has a peculiar distribution with very long tails. Many statistical analyses appeal to the Central Limit Theorem to justify the assumption of normality, but distributions with very long tails (such as the Cauchy distribution) violate the conditions of this Theorem, namely that the distribution must have finite variance. The asymmetry index is therefore probably an inappropriate choice of variable for the tests that Savic & Lindstrom perform. In particular the significance levels (or p-values) quoted in their paper are very low (of order 0.0008, for example, in the ANOVA test), which is surprising for such small samples. These probabilities are obtained by assuming the observations have Gaussian statistics, and they would be much higher for a distribution with longer tails.
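To see how long the tails of a ratio can get, here is a minimal sketch of the textbook extreme case: the ratio of two independent, zero-mean, unit-variance normal variates, which follows a standard Cauchy distribution. The function names, seed and threshold are my own illustrative choices, and the zero-mean setup is an assumption made purely for illustration. Real hemisphere volumes have a denominator far from zero, so how heavy the tails of the actual AI values are can only be settled from the data.

```python
import random

def ratio_of_normals(rng):
    """Draw X/Y for independent standard normal X and Y (a standard Cauchy variate)."""
    y = 0.0
    while y == 0.0:  # redraw in the (practically impossible) case of an exact zero
        y = rng.gauss(0.0, 1.0)
    return rng.gauss(0.0, 1.0) / y

def tail_fraction(draw, threshold, n_draws, rng):
    """Fraction of n_draws samples whose absolute value exceeds threshold."""
    hits = sum(1 for _ in range(n_draws) if abs(draw(rng)) > threshold)
    return hits / n_draws

rng = random.Random(2008)  # fixed seed so the run is repeatable
ratio_tail = tail_fraction(ratio_of_normals, 5.0, 100_000, rng)
normal_tail = tail_fraction(lambda r: r.gauss(0.0, 1.0), 5.0, 100_000, rng)

print(f"P(|ratio of normals| > 5) ~ {ratio_tail:.4f}")
print(f"P(|normal| > 5)           ~ {normal_tail:.6f}")
```

For a standard Cauchy variate the exact tail probability is 1 - (2/π) arctan(5), roughly one in eight, whereas for a unit normal it is well below one in a million. A test calibrated on the second figure will badly understate how often extreme values arise under the first.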
Being a friendly chap I emailed Dr Savic drawing this problem to her attention and asking if she knew about this problem and the possible implications it might have for the analysis she had presented. If not, I offered to do an independent (private) check on the data to see how reliable the claimed statistical results actually were. I never received a reply.
Worried that the world might be jumping to all kinds of far-reaching conclusions about gay genes based on these questionable statistics, I wrote instead to the editor of the journal Proceedings of the National Academy of Sciences of the United States of America, Randy Schekman, who suggested I submit a written comment to the journal. I did, it was accepted by the editorial committee, and it came out in the 11th November issue. What I didn’t realise was that Savic & Lindstrom had actually prepared a reply and that this was published alongside my comment. I find it strange that I wasn’t told about this before publication but, that aside, it is in principle quite reasonable to let the authors respond to criticisms like mine. Their response reveals that they completely missed the point about the danger of long-tailed distributions I mentioned above. They state that “when the sample size n is big the sampling distribution of the mean becomes approximately normal regardless of the distribution of the original variable”. Not if the distribution of the original variable has such a long tail it doesn’t! In fact, if the observations have a Cauchy distribution then so does the sampling distribution of the mean, whatever the size of sample. You can find this caveat spelled out in many places. Savic & Lindstrom seem oblivious to this pitfall, even after I specifically pointed it out to them.
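The point about the Cauchy distribution is easy to check numerically. The sketch below (my own illustration, with hypothetical function names and an arbitrary seed) draws many samples of size n, takes the mean of each, and measures the spread of those means using the interquartile range, since the variance of a Cauchy variate does not exist. It uses synthetic draws only and says nothing about the actual brain data.

```python
import math
import random
import statistics

def cauchy(rng):
    """Standard Cauchy variate via the inverse-CDF method."""
    return math.tan(math.pi * (rng.random() - 0.5))

def gaussian(rng):
    """Standard normal variate, for comparison."""
    return rng.gauss(0.0, 1.0)

def iqr_of_sample_means(draw, n, trials, rng):
    """Interquartile range of the means of `trials` independent samples of size n."""
    means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

rng = random.Random(12345)  # fixed seed so the run is repeatable
for n in (1, 25, 100):
    c = iqr_of_sample_means(cauchy, n, 2000, rng)
    g = iqr_of_sample_means(gaussian, n, 2000, rng)
    print(f"n={n:3d}  IQR of Cauchy means ~ {c:.2f}   IQR of Gaussian means ~ {g:.2f}")
```

The Gaussian column shrinks like one over the square root of n, as the Central Limit Theorem promises. The Cauchy column stays close to 2 (the interquartile range of a single standard Cauchy variate) however large n becomes, so averaging more volunteers buys nothing.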
They also claim that a group size of n=30 is sufficient to be confident that the central limit theorem holds. A pity, then, that none of their groups is of that size. The overall sample is 90, but it is broken down into two groups of 20 and two of 25.
They also say that the measured AI distribution is actually normal anyway and give a plot (above). This shows all the AI values binned into one histogram. Since they don’t give any quantitative measures of goodness of fit, it’s hard to tell whether this has a normal distribution or not. One can, however, easily identify a group of five or six individuals that seem to form a separate group with larger AI values (the small peak to the right of the large peak). Since they don’t give histograms broken down by group it is impossible to be sure, but I would hazard a guess that these few individuals might be responsible for the entire result; remember that the entire sample has n of only 90.
More alarmingly, Savic & Lindstrom state in their reply that “one outlier” is omitted from this graph. Really? On what basis was the outlier rejected? The existence of outliers could be evidence of exactly the sort of problem I am worried about! Unless there was a known mistake in the measurement, this outlier should never have been omitted. They claim that the “recalculation of the data excluding this outlier does not change the results”. I find it difficult to believe that the removal of an outlier from such a small sample could not change the p-values!
In my note I made a few constructive suggestions as to how the difficulty might be circumvented, but Savic & Lindstrom have not followed any of them. Instead they report (without details of the p-values) having done some alternative, non-parametric, tests. These are all very well, but they don’t add very much if their p-values also assume Gaussian statistics. A better way to do this sort of thing robustly would be using Monte Carlo simulations.
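One simple Monte Carlo approach of this kind is a permutation test, sketched below. This is not necessarily the procedure suggested in the published comment, just a standard example of the idea: shuffle the group labels many times and count how often a difference in group means at least as large as the observed one arises by chance, with no assumption about the shape of the distribution. The function name and the AI-like numbers are hypothetical, generated from a random number generator purely for illustration. They are not the Savic & Lindstrom data.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

# Synthetic, made-up asymmetry indices for two groups of 25 and 20 volunteers.
data_rng = random.Random(42)
group_a = [data_rng.gauss(0.01, 0.01) for _ in range(25)]
group_b = [data_rng.gauss(0.00, 0.01) for _ in range(20)]

print(f"permutation p-value ~ {permutation_p_value(group_a, group_b):.4f}")
```

Because the reference distribution is built from the data themselves, a handful of extreme values inflates the spread of the shuffled differences too, rather than silently breaking a Gaussian assumption. The same idea extends to comparing four groups by permuting labels and recomputing an F-like statistic.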
The bottom line is that after this exchange of comments we haven’t really got anywhere and I still don’t know if the result is significant. I don’t really think it’s useful to go backwards and forwards through the journal, so I’ve emailed Dr Savic again asking for access to the numbers so I can check the statistics privately. In astronomy it is quite normal for people to make their data sets publicly available, but that doesn’t seem to be the case in neurobiology. I’m not hopeful that they will reply, especially since they branded my comments “harsh” and “inappropriate”. Scientists should know how to take constructive criticism.
Their conclusion may eventually turn out to be right, but the analysis done so far is certainly not robust and it needs further checking. In the meantime I don’t just have doubts about the claimed significance of this specific result, which merely serves to illustrate the extremely poor level of statistical understanding displayed by large numbers of professional researchers. This was one of the things I wrote about in my book From Cosmos to Chaos. I’m very confident that a large fraction of claimed results in biosciences are based on bogus analyses.
I’ve long thought that scientific journals that deal with subjects like this should employ panels of statisticians to do the analysis independently of the authors and also that publication of the paper should require publication of the raw data. Science advances when results are subject to open criticism and independent analysis. I sincerely hope that Savic & Lindstrom will release their data in order for their conclusions to be checked in this way.
It’s no wonder that there is so much public distrust of science, when such important claims are rushed into the public domain without proper scrutiny.