Reblogged from http://innge.net/?q=node/491
Patrick Meirmans is an assistant professor in the experimental plant systematics group at the University of Amsterdam. He recently published a paper titled ‘Seven common mistakes in population genetics and how to avoid them’. After reading the paper, I decided to ask Patrick a few questions about why he wrote it, why he thinks these mistakes are so common, and what he hopes the paper will do for the field of population genetics.
To whet your appetite for the interview that follows, here are the seven mistakes:
- Giving more attention to genotyping than sampling
- Failing to perform or report experimental randomization in the laboratory
- Equating geopolitical borders with biological borders
- Testing significance of clustering output
- Only interpreting a single value of k [number of clusters]
- Misinterpreting Mantel’s r statistic
- Forgetting that only a small portion of the genome will be associated with climate
(Questions emailed to Patrick on 28th September 2015; Patrick emailed back with his answers on 30th September 2015.)
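The second mistake on the list, failing to perform or report experimental randomization in the laboratory, lends itself to a concrete illustration. The sketch below is hypothetical (the sample names, population count, and batch sizes are invented): it shuffles samples across genotyping batches so that population membership is not confounded with batch effects.

```python
import random

# Hypothetical setup: 96 samples from 4 populations, genotyped in two
# batches of 48. Processing populations in order would confound
# population with batch.
samples = [f"pop{p}_ind{i}" for p in range(1, 5) for i in range(1, 25)]

random.seed(42)        # record the seed so the layout can be reported and reproduced
layout = samples[:]    # copy before shuffling, keeping the original order intact
random.shuffle(layout) # randomize sample order across batches

batch1, batch2 = layout[:48], layout[48:]
print(len(batch1), len(batch2))  # 48 48
# Each batch now contains a mix of populations, so a batch effect cannot
# masquerade as population differentiation.
```

Without the shuffle, any systematic difference between batches (reagent lot, run day, machine) would look exactly like a difference between populations.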
Hari: What was your motivation to write this paper?
Patrick: Over the years, I came across many instances where people made mistakes, often quite simple ones, in their population genetic analyses. Some of these I encountered quite often and I always thought I should write a paper about them. However, no single mistake carried enough weight to merit a paper on its own, so I decided to put them all together into a single paper.
H: You make the point that the easy availability of large genetic datasets has contributed to this problem. Can you tell us why you think so?
P: Nowadays, there are a lot of people without any formal population genetic background who use genetic data [also see Karl et al. 2012]. This is because it is increasingly easy to get genetic data, since so much of the genotyping can be outsourced. On the one hand this is great, because a lot of interesting questions that are outside the realm of classic population genetics can be answered using genetic data. On the other hand it is dangerous, because those researchers may not be aware of the limitations of genetic data analysis and of the assumptions and biases of the methods used.
H: Your paper lists seven mistakes – are these the most important, according to you? Are there others you have left out of the paper?
P: I wouldn’t say that my list is exhaustive or that the listed mistakes are the most important ones. These were just several that I thought were worth pointing out, because they had received relatively little attention so far. I would argue that the most important problem in population genetic analyses nowadays stems from the large number of false positives that many methods suffer from (though I do also mention that in the paper). However, that is a problem that has received quite a bit of attention, for example in the great series of papers by Lotterhos & Whitlock [e.g. see Lotterhos 2012; Lotterhos & Whitlock 2015].
H: One of the mistakes you discuss is ‘Giving more attention to genotyping than to sampling’. You say: ‘faced with limited financial resources, researchers often prefer to spend their money on additional genotyping than on sampling. This is unfortunate as a failure to invest in robust sampling may completely waste the investment in genotyping.’ Why do you think this is so, i.e. why is sampling given so little importance in genetics studies as compared to, say ecology studies?
P: To generalise: while the typical ecologist likes to be in the field taking samples, a stereotypical geneticist likes to be in the lab. Sampling takes time, effort and skills, and so does genotyping and data analysis. So there is a trade-off here as well. In ecology, the data collection often starts and ends in the field. So the quality of your data is mostly dependent on your sampling strategy. In population genetics, sampling is only the first of many steps in the data collection.
H: You say that where it is not feasible, one should be willing to ‘skip analyses all together’. This topic has attracted some debate in science – some people feel that it is better not to do something at all than to do it poorly, while others feel that we should do the best we can given current knowledge. Are you an advocate of the former?
P: I am mostly an advocate of the former and I realise that may make my own genotyping papers a bit boring since I do not throw every possible statistical technique at my data. A more nuanced view would be that it definitely depends on whether a method has a large error or whether it has a bias. When it is mostly a question of error, it is no problem to do things poorly, since when taken over many studies the results will be approximately correct. When there is a bias, it is very risky to do things anyway, since when taken over many studies, we will be falsely confident in the wrong direction.
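Patrick's distinction between error and bias can be illustrated with a toy simulation (all numbers here are invented for illustration): many noisy but unbiased studies average out to the truth, while many precise but biased studies converge confidently on the wrong value.

```python
import random

random.seed(0)
TRUE_VALUE = 0.5   # hypothetical true parameter value
N_STUDIES = 1000   # imagine 1000 independent studies estimating it

# Method A: unbiased but noisy — each estimate scatters widely around the truth.
noisy = [TRUE_VALUE + random.gauss(0, 0.2) for _ in range(N_STUDIES)]

# Method B: precise but biased — each estimate is off by the same systematic amount.
biased = [TRUE_VALUE + 0.15 + random.gauss(0, 0.02) for _ in range(N_STUDIES)]

mean_noisy = sum(noisy) / N_STUDIES    # close to 0.5: the errors average out
mean_biased = sum(biased) / N_STUDIES  # close to 0.65: the bias persists
print(round(mean_noisy, 2), round(mean_biased, 2))
```

Averaged over many studies, the noisy method recovers the truth, while the biased method yields a tight consensus around the wrong answer — which is exactly why, in Patrick's view, bias is the riskier failure mode.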
H: Some of the mistakes you highlight, when illustrated with the unicorn examples, seem ridiculously obvious. Yet, they seem to be very commonly made. Why do you think that is?
P: There may be different reasons for this. One reason may be that people put too much trust in genotyping data. Biologists learn in their first year that every individual has a certain genotype that is unique and will not change throughout its life. So they get the idea that genotypes are fixed and therefore that genetic data represents “the” genotype of an individual. However, it is not; rather, it is an estimate of the genotype of the individual, with all kinds of possibilities for error. Once you have accepted that, it is obvious that we need to allow for this in our sampling design and in all other aspects of the genotyping.
A second reason may be that people have trouble with the transition from univariate to multivariate data. I have talked to people about these issues, and often they agreed that the problem was there for univariate data, but somehow thought that it would not be the case for multivariate data. The complexity of multivariate data can be so great that we miss very simple issues.
H: If I asked you for the 2-3 main takeaways from this paper, what would you say?
P: The first is to approach genotyping studies really as an experiment. This means that at all steps, one has to keep in mind that there may be biases and therefore one has to take proper precautions against them. Another main takeaway is to make a proper distinction between confirmatory and exploratory analyses, since this really impacts the strength of the inference that you can make. Nowadays these things are often not made explicit, which makes it very difficult to judge the merits of certain conclusions that are derived from population genetic analyses. Finally, start by not believing the outcome of your analyses, as there’s a fair chance that they are wrong. Only accept results after a large amount of scrutiny.
H: The entry of modern genomic tools into molecular ecology is fairly recent. Do you think what you describe are teething problems which will go away as the field matures, or do you think a more proactive effort is required?
P: I mostly positioned the paper in the light of modern genomic tools because their sheer size and power exacerbate the problems. However, many of the problems have been around for a long time. So it is definitely not the case that these are teething problems that will go away. I hope that my paper actually helps to fix this.
There are other problems that will certainly disappear as technology progresses. For example, the sample sizes that are used for whole-genome analyses are often rather small. Often, only a handful of individuals from a couple of populations are sequenced. These sample sizes are orders of magnitude lower than a typical study that uses microsatellites. This limited sampling introduces a large number of issues, because the demographic and genealogical stochasticity is often so high that this cannot be captured in so few samples. Obviously, this problem will go away very soon as technology proceeds even further.
H: I notice, in your reference list, that there have been a few other papers highlighting problems with population genetics methods (e.g. Karl et al. 2012; Lotterhos 2012). You have also written other papers in the past on this topic (e.g. Meirmans 2012). How have these papers been received within the community of population geneticists? Do you think they will have an impact on the issues you are addressing?
P: I get quite a lot of emails from researchers about these issues, so I think it is being picked up very well. I also often hear from people that they have been specifically asked by the reviewers of their manuscripts to discuss these issues. It is also good to see things change in the literature as well. For example, a couple of years ago, researchers did not seem very much aware that neutral processes like isolation by distance can create patterns that closely match those that result from selection along an environmental gradient. Several papers then came out that commented upon this (including my own 2012 paper on isolation by distance). Since then it has become standard to take this into account in the analyses, and several new methods have been developed for this.
H: Can you name a few of your favourite population genetics studies/papers, i.e. which you think are based on appropriate design and analyses and careful interpretation?
P: There are several studies that I cite in my own paper that I like very much. The Science paper by Tishkoff et al. from 2009 is really great. It’s about the population structure of humans, so it’s no surprise that they have excellent genotyping. But what I especially like is the fact that they put an incredible amount of effort in sampling. I also really like the paper by Audzijonyte & Vrijenhoek (2010), since it is a great example of how to use simulations in a genotyping study. They realized that there might have been a bias in their analyses, and then used simulations to see whether this was really the case.