The growth in the number of practicing scientists has led to an increase in the number of journals, and publication output continues to rise as technology advances. Peer-reviewed scientific literature is still the most important source of reliable data on genes, proteins and small molecules. The sheer number of scientific publications, and the fact that they are not written in a machine-readable format, has necessitated the development of systematic protocols to extract this information into a machine-readable format that computational algorithms can use. We call this process Reverse Informatics.


Gathering, Identifying, Separating and Extracting Information from Primary Literature


Our curation process begins by gathering the literature relevant to a given theme. Papers of interest for a research topic are identified through periodic searches of the primary literature (published patents and journal articles).


The number of papers reviewed for the curation process varies from one theme to another. Identifying the papers that contain data relevant to a particular theme involves a paper-by-paper review by our curation team. The team reviews anywhere from 100 to 1,000 papers a month.


The types of data presented in the biological literature are extremely diverse, and choosing which ones to capture for any given theme is a challenge. Primary data is the basic information collected, and the complexity of the reported data varies widely. Our curators keep an eagle eye on the details of expression, patterns, pathways and interaction networks in order to build a complete knowledge base for the topic of interest.


Our curators have the analytical skills to understand a paper and identify its key terms. They then read the entire paper in detail. The data can be presented in several forms; the curator decides how to arrange the information based on an understanding of the experimental concept and the reported results. The process is generally straightforward, but ambiguities sometimes arise about what should be extracted. These are resolved only by multiple readings of the paper, followed by discussions with our domain experts.
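
As an illustration of what a machine-readable result of this step might look like, the following is a minimal sketch in Python. It is not our actual schema: the class name, every field name, and the identifiers in the sample data (PAPER-001, GENE_X and so on) are hypothetical placeholders chosen for the example. The sketch assumes that each extracted fact is stored as one record that points back to its source paper, and that a curator can flag a record as ambiguous so it is routed to a domain expert.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CuratedRecord:
    """One fact extracted from a paper. All field names are hypothetical."""
    source_id: str      # identifier of the journal article or patent
    source_type: str    # "journal_article" or "patent"
    theme: str          # research theme the record belongs to
    entity_a: str       # e.g. a gene, protein or small molecule
    relation: str       # how entity_a relates to entity_b
    entity_b: str
    evidence: str       # short note on the reported experimental support
    needs_expert_review: bool = False  # set when the curator finds the paper ambiguous


def to_json_lines(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def pending_review(records):
    """Return the records a curator has flagged for discussion with domain experts."""
    return [r for r in records if r.needs_expert_review]


# Placeholder data only; these are not real papers or findings.
example_records = [
    CuratedRecord("PAPER-001", "journal_article", "THEME_A",
                  "GENE_X", "interacts_with", "PROTEIN_Y",
                  "reported in the results section"),
    CuratedRecord("PATENT-002", "patent", "THEME_A",
                  "MOLECULE_Z", "inhibits", "PROTEIN_Y",
                  "unclear which assay was used",
                  needs_expert_review=True),
]

if __name__ == "__main__":
    print(to_json_lines(example_records))
    print(len(pending_review(example_records)), "record(s) awaiting expert review")
```

Keeping the source identifier on every record means each extracted fact can be traced back to the paper it came from, and the review flag mirrors the step in which ambiguous cases are discussed with domain experts.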


