Dealing with old scientific paper PDFs

Tired of your favourite reference software not being able to retrieve metadata from old research articles? Tired of not being able to highlight or search those old papers published way before you were born? Tired of reading scientific papers in general? Does your toothpaste have salt in it?

Well, the answers to the last two questions are beyond the scope of this blog. The first question is handled better by some software than others (Papers by Mekentosj, which I use, is pretty bad at it). The second question is what we’ll tackle here.

Old research articles are essentially scanned photographs and hence image files. Some software and websites can convert these images of text into actual text using Optical Character Recognition, or OCR. Rittik (from my lab) recently told me about the easiest way of managing this conversion: Google Drive. All you need to do is save and open the PDF in Google Drive (formerly Google Docs), and a text version of each page follows the image of that page. Here’s an example:

Screen Shot 2014-07-17 at 5.22.21 pm

Screen Shot 2014-07-18 at 3.53.51 pm

This allows you to search for keywords in the document, which the image file wouldn’t have allowed. In the absence of a highlighting option, you can also take notes by copying and pasting parts of the text into your favourite ‘notepad’.
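If you end up pasting the OCR text into a plain text file, a few lines of Python can do the keyword search for you. This is only a rough sketch: the function name `find_keyword`, the `context` parameter and the sample text are all made up for illustration, not something Google Drive provides.

```python
import re


def find_keyword(text, keyword, context=40):
    """Return a snippet of text around each case-insensitive match of keyword."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        # Collapse line breaks so each snippet prints on a single line.
        snippets.append(" ".join(text[start:end].split()))
    return snippets


# Hypothetical OCR output, standing in for text copied out of Google Drive.
ocr_text = """The enzyme was purified from rat liver.
Activity of the Enzyme was measured at 37 C."""

for snippet in find_keyword(ocr_text, "enzyme", context=15):
    print(snippet)
```

The search ignores case because OCR output often gets capitalisation wrong, and the surrounding context makes it easier to find the passage again in the scanned page.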

Sadly, Google Drive scans only the first 10 pages of a PDF, so you might have to look elsewhere for long articles. Some other websites do the same job, but most of them also limit the size of the file you can upload; one, for example, allows a maximum file size of 5 MB.
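One possible workaround for long articles is to cut the PDF into pieces of at most 10 pages and upload each piece separately. The sketch below assumes the 10-page limit applies per uploaded file, and that the third-party `pypdf` package is installed; the function names `chunk_page_ranges` and `split_pdf` and the `_partN.pdf` naming are my own choices for illustration.

```python
def chunk_page_ranges(total_pages, chunk_size=10):
    """Return (start, end) page index pairs, end exclusive, covering all pages."""
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(path, chunk_size=10):
    """Write the PDF at path out as numbered chunk files and return their names."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that the page arithmetic above works without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    stem = path[:-4] if path.lower().endswith(".pdf") else path
    outputs = []
    ranges = chunk_page_ranges(len(reader.pages), chunk_size)
    for number, (start, end) in enumerate(ranges, 1):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        out_name = f"{stem}_part{number}.pdf"
        with open(out_name, "wb") as handle:
            writer.write(handle)
        outputs.append(out_name)
    return outputs


# A 23-page article would be cut into three pieces.
print(chunk_page_ranges(23))  # prints [(0, 10), (10, 20), (20, 23)]
```

Calling `split_pdf("old_paper.pdf")` (a hypothetical file name) would then write `old_paper_part1.pdf`, `old_paper_part2.pdf` and so on next to the original, each small enough to upload on its own.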

Hope this mitigates some of the problems of dealing with old research articles. If you have an easier way of doing the same, do suggest it in the comments section. And don’t forget to add salt to your toothpaste.


