Category: Research Methods

Dealing with old scientific paper PDFs

Tired of your favourite reference software not being able to retrieve metadata from old research articles? Tired of not being able to highlight or search those old papers published way before you were born? Tired of reading scientific papers in general? Does your toothpaste have salt in it?

Well, the answers to the last two questions are beyond the scope of this blog. The first question is handled better by some software than others (Papers by Mekentosj, which I use, is pretty bad at it). The second question is what we’ll tackle here.

Old research articles are essentially scanned photographs and hence image files. Some software and websites convert these images of text into text files using Optical Character Recognition, or OCR. Rittik (from my lab) recently told me about the easiest way of managing this conversion – Google Drive. All you need to do is save and open the PDF in Google Drive (formerly Google Docs), and a text version of each page follows that page. Here’s an example –

[Image: Screen Shot 2014-07-17 at 5.22.21 pm]

[Image: Screen Shot 2014-07-18 at 3.53.51 pm]

This allows you to search for keywords in the document, which the image file wouldn’t have allowed. In the absence of a highlighting option, you can also take notes by copying and pasting parts of the text into your favourite ‘notepad’.

Sadly, Google Drive scans only the first 10 pages of a PDF, so you might have to look elsewhere for long articles. Some other websites do the same job, but most of them again have constraints on the size of the file you can upload; e.g. one allows a maximum file size of 5 MB.

Hope this mitigates some of the problems of dealing with old research articles. If you have an easier way of doing the same, do suggest it in the comments section. And don’t forget to add salt to your toothpaste.

Mathematical Biostatistics Boot Camp on Coursera

Coursera is hosting a Mathematical Biostatistics Boot Camp by the Johns Hopkins School of Public Health. Here are some excerpts from the course web page:


This class presents the fundamental probability and statistical concepts used in elementary data analysis. It will be taught at an introductory level for students with junior or senior college-level mathematical training including a working knowledge of calculus. A small amount of linear algebra and programming are useful for the class, but not required.

After completing this course, students will have a basic level of understanding of the goals, assumptions, benefits and negatives of probability modeling in the medical sciences. This understanding will be invaluable when approaching new statistical topics and will provide students with a framework and foundation for future self learning. Topics include probability, random variables, distributions, expectations, variances, independence, conditional probabilities, likelihood and some basic inferences based on confidence intervals.


7 weeks, starting 16th April 2013. Workload of 3-5 hrs per week.


In his charming and entertaining way, Dr. Jeremy Fox very nonchalantly ends his ‘Advice: tips on stats’ (in the ‘Advice’ series of posts on Dynamic Ecology) with: “Don’t just blindly follow the “rules”. Rules are not a substitute for thought.” Having given up Math as a subject after 12th standard, I found confronting statistics out of the blue at the PhD level quite a rude shock. To make matters worse, most of my classmates seemed to know the fundamentals of stats, why one did statistical tests and what statistical software to use. Unfortunately, I also attended a statistics course (ah, those credits are a bane) in a non-Biological Sciences department and (no prizes for guessing) got thoroughly overwhelmed. But the worst was yet to come – my own data were to give me the most vivid nightmares of them all (I have been grappling with very naughty data sets for a very long time without being able to convince reviewers that they are in fact very well behaved).

As a naive user of statistics, I wonder if it is really that simple to ‘not blindly follow rules’. I intend to use statistics merely as a tool to state with conviction that what I observe in nature is reasonably true (hopefully at p < 0.05, or 0.1 whenever ‘ecologically relevant’; but let’s put that discussion off for another time). And unfortunately, the only way I can ensure that I have done the statistics bit correctly is if I have followed the rules. For the inexperienced, a set of rules to follow is often the only way forward in the confusing maze of multiple ways of achieving the same thing.

Dr. Fox uses subtle metaphors (in a way only experienced scientists can) to make the task of doing good stats sound less daunting: “Statistics is like cooking – there are recipes to follow, but not all of them are good ones, and even the good ones are treated more like guidelines by the best cooks, who can judge when, why, and how to deviate from the recipe”. How does one become a ‘good cook’? I guess the best cooks start off following the rules before they put creativity and thought into the process. When a subject is alien and unintuitive, is it really simpler to start with a ‘thought’ than with a tried and tested ‘rule’? Does everyone feel the same way, or am I the only one here who completely missed the boat? (I exclude the physicist and the mathematician amongst us from this discussion; their practical advice would be to stop doing statistics altogether. I second that motion.)

Can I believe my own data?

I want to discuss something I have often wondered about when collecting field data. How can I be sure that my data collection is not biased by my own expectations?

Let me explain this with an example. Imagine I’m interested in comparing the abundance of a bird species at different altitudes. Based on a particular ecological theory, I expect this species to be more abundant at lower altitudes. To test my hypothesis, I walk transects at low and high altitudes and count all the individuals that I see of that species. However, it is likely that detectability is not perfect; the further away an individual is from a transect, the more likely I am to miss it. Therefore, I also visually estimate the distance of each individual from the transect. Using the distance measurements, I can examine how the number of individuals seen drops off away from the transect line. I can then use this information to calculate how many individuals I missed seeing and adjust my estimates of abundance accordingly (I’ve described it in very simplistic terms; the actual process is a bit more complicated (pdf)).
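To make the adjustment a little more concrete, here is a minimal sketch in Python. It is not the method from the linked pdf. It rests on assumptions of my own: detection falls off with distance as a half-normal curve, every individual on the transect line itself is seen, and distances are not truncated. The distances and the transect length are made-up numbers for illustration.

```python
import math

def half_normal_sigma(distances):
    """Maximum-likelihood scale of a half-normal detection curve (no truncation)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def effective_strip_half_width(sigma):
    """Area under g(x) = exp(-x^2 / (2 sigma^2)) from 0 to infinity."""
    return sigma * math.sqrt(math.pi / 2)

def density_estimate(distances, transect_length):
    """Individuals per unit area: n / (2 * L * mu), with mu the effective half-width."""
    sigma = half_normal_sigma(distances)
    mu = effective_strip_half_width(sigma)
    return len(distances) / (2 * transect_length * mu)

# Hypothetical perpendicular distances (metres) recorded along a 1000 m transect
distances_m = [2.0, 5.0, 8.0, 12.0, 15.0, 20.0, 26.0, 35.0]
density_per_m2 = density_estimate(distances_m, 1000.0)
density_per_ha = density_per_m2 * 10_000
print(f"Estimated density: {density_per_ha:.2f} individuals per hectare")
```

The idea is that the count is divided by an ‘effective’ strip area rather than the full area walked, so the faster sightings drop off with distance, the narrower the effective strip and the larger the correction.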

You will notice that the accuracy of the abundance estimates depends on two factors: my ability to accurately identify the species-of-interest (i.e. to be able to tell it apart from other similar-looking species) and to accurately estimate distances of individuals to the transects (underestimates of distance will inflate abundances and vice-versa). Given that both (species identification and distance estimation) are done visually, there will be some measurement error associated with my estimates. What we generally assume is that this error is equally likely in our different treatments (low and high altitudes in our example) and therefore will not bias our results in any direction. But is that assumption really valid? I went into the study expecting to find more birds at lower altitudes. Is it possible that my desire to find this result biases my data collection without me being aware of it? For example, am I more likely to classify an individual of uncertain identity as belonging to the species-of-interest at lower altitudes? Am I more likely to underestimate distances of individuals at lower altitudes? I think what I describe will be an issue whenever there is strong motivation to obtain results in a particular direction. Given this, it is likely to be particularly problematic in research areas such as conservation science where researchers are even more strongly wedded to their hypotheses (e.g. forests better than plantations; protected seas better than trawled seas).
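A small numerical sketch shows how much this could matter. The assumptions are mine, not results from any study: the same simplified half-normal estimator as a stand-in for the real analysis, identical true distances at both altitudes, and an observer who unknowingly records only 80% of the true distance at low altitude. All numbers are hypothetical.

```python
import math

def density(distances, transect_length):
    """Half-normal line-transect density estimate (untruncated, illustrative only)."""
    n = len(distances)
    sigma = math.sqrt(sum(d * d for d in distances) / n)
    effective_half_width = sigma * math.sqrt(math.pi / 2)
    return n / (2 * transect_length * effective_half_width)

# Hypothetical: the birds are truly at the same distances at both altitudes
true_distances_m = [2.0, 5.0, 8.0, 12.0, 15.0, 20.0, 26.0, 35.0]
transect_length_m = 1000.0

# Assumed bias: at low altitude, distances are recorded at 80% of their true value
bias_factor = 0.8
recorded_low = [d * bias_factor for d in true_distances_m]
recorded_high = list(true_distances_m)  # no bias at high altitude

inflation = density(recorded_low, transect_length_m) / density(recorded_high, transect_length_m)
print(f"Low-altitude estimate is {inflation:.2f} times the high-altitude one")
```

Under these assumptions the estimate scales with one over the recorded distances, so a 20% underestimate of distance produces an apparent 25% difference in abundance between altitudes where none exists.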

In lab-based experimental research, this bias is avoided by doing ‘blind’ experiments, i.e. where the person recording observations is unaware of which group is the control and which is the experimental group. (In experiments involving human subjects there is an additional complication: the subjects also need to be kept unaware of the treatment groups, and therefore ‘double-blind’ experiments are carried out. Fortunately we don’t have that problem with animal or plant subjects... or do we?)

Unfortunately, in field research there is no way to hide the identities of ‘control’ and ‘experimental’ groups. One can’t blindfold and airdrop researchers into a site and hope that they don’t figure out where they are! More generally, this problem is likely to arise in any situation where there are cues that give away the identities of the different treatments, so it is not unique to field observational research. What is the way out then? One possibility is to get someone who is unaware of the hypothesis being tested to collect the data. But is this feasible? Is it even ethical to hide details of a project from the people working on it? Another way out is to try to reduce subjectivity in data collection as far as possible. In other words, to collect data like a robot would! But isn’t this easier said than done? Finally, might it help to have multiple alternative hypotheses, instead of just one? What do you think?

This problem is not restricted to the data-gathering stage of research. It can creep into our choice of study sites, our data entry and analysis, our interpretation of results, and what we choose to write up as papers (more on this in a future post maybe).