Text mining for chemical cancer risk assessment



About CRAB

CRAB 3.0 is a fully integrated Text Mining tool aimed at supporting the entire process of literature review in chemical cancer risk assessment.

CRAB 3.0 supports the gathering of risk assessment literature via PubMed, semantic classification of the literature for cancer risk assessment, statistical analysis of the classified literature, and efficient reading and study of the literature.

Given the double exponential growth rate of biomedical literature over recent years, there is a pressing need to develop technology that can make information in published literature more accessible and useful for scientists. Such technology can be based on text mining. Drawing on techniques from natural language processing, information retrieval and data mining, text mining can automatically retrieve, extract and discover novel information even in huge collections of written text.

The CRAB project develops text mining technology to support one of the most literature-dependent areas of biomedicine: chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of existing scientific data on a particular chemical. The data cover human, animal, cellular and other mechanistic evidence from various fields of biomedicine, so they are highly varied and difficult to harvest manually from literature databases.

The project has developed a tool that automates literature review and analysis in chemical risk assessment by extracting relevant scientific data in published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows navigating the classified dataset in various ways and sharing the data with other users.

Currently applicable to cancer, the tool could be straightforwardly adapted to support the assessment and study of other important health risks related to chemicals (e.g. allergy, asthma, reproductive disorders, among many others).

CRAB 3.0 is a joint project between the University of Cambridge, UK and the Karolinska Institutet, Sweden.


January 2018, CRAB 3.0 released

CRAB 3.0 is a fully redesigned application offering instant classification results and 'just-in-time' argument zoning.

  • Fast search results using a local copy of PubMed database, updated daily
  • Download comprehensive Word document report containing retrieved abstracts
  • Mobile responsive for accessing results on the move

April 2014, CRAB 2.0 released

CRAB 2.0 is an extended and fully integrated text mining tool aimed at supporting the entire process of literature review in real-life chemical cancer risk assessment (CRA). CRAB 2.0 not only supports the semantic classification of CRA literature on the basis of scientific evidence, like CRAB, but also the gathering of the relevant literature via the PubMed query interface, the statistical analysis of the classified literature, and efficient reading and information extraction from the classified literature.

In addition, it introduces new features including:

  • A control panel that allows users to register and log in for easy access to the system (instead of emailing the system administrator to request access)
  • Private access to the output of the classifier so that users can review their own data only
  • A new widget that enables users to download classified literature, including both a navigation page and the abstracts assigned to each category, as a .zip file
  • A revamp of the CRAB user interface from layout to font, from logo to icon, and from color to texture.

November 2011, CRAB released

CRAB is the first text mining tool for assisting literature review in chemical cancer risk assessment (CRA). This tool enables classifying PubMed abstracts for a given chemical semantically according to the scientific evidence used for CRA and visualizing the results in a taxonomy-like structure.


Journal publications

Imran Ali, Yufan Guo, Ilona Silins, Johan Högberg, Ulla Stenius, Anna Korhonen. 2016. Grouping chemicals for health risk assessment: A text mining-based case study of polychlorinated biphenyls (PCBs). Toxicol Lett. 2016 Jan 22; 241:32-7. doi: 10.1016/j.toxlet.

Ilona Silins, Anna Korhonen, Ulla Stenius. 2014. Evaluation of carcinogenic modes of action for pesticides in fruit on the Swedish market using a text-mining tool. Front Pharmacol. 2014 Jun 23;5:145. doi: 10.3389/fphar.

Yufan Guo, Ilona Silins, Ulla Stenius, Anna Korhonen. 2013. Active learning-based information structure analysis of full scientific articles and two applications for biomedical literature review. Bioinformatics. 2013 Jun 1;29(11):1440-7. doi: 10.1093/bioinformatics/btt163.

Sandeep Kadekar, Ilona Silins, Anna Korhonen, Kristian Dreij, Lauy Al-Anati, Johan Högberg and Ulla Stenius. 2012. Exocrine pancreatic carcinogenesis and autotaxin expression. PLoS One. 2012;7(8):e43209. doi: 10.1371/journal.pone.0043209.

Anna Korhonen, Diarmuid O'Séaghdha, Ilona Silins, Lin Sun, Johan Högberg and Ulla Stenius. 2012. Text mining for literature review and knowledge discovery in cancer risk assessment and research. PLoS One. 2012;7(4):e33427.

Ilona Silins, Anna Korhonen, Johan Högberg and Ulla Stenius. 2012. Data and literature gathering in chemical cancer risk assessment. Integrated Environmental Assessment and Management. 2012 Jan 3. doi: 10.1002/ieam.1278.

Yufan Guo, Anna Korhonen, Ilona Silins and Ulla Stenius. 2011. Weakly-supervised learning of information structure of scientific abstracts - is it accurate enough to benefit real-world tasks in biomedicine? Bioinformatics 2011; doi: 10.1093/bioinformatics/btr536.

Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins, Johan Hogberg and Ulla Stenius. 2011. A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment. BMC Bioinformatics 2011, 12:69.

Anna Korhonen, Lin Sun, Ilona Silins, and Ulla Stenius. 2009. The First Step in the Development of Text Mining Technology for Cancer Risk Assessment: Identifying and Organizing Scientific Evidence in Risk Assessment Literature. In BMC Bioinformatics 10:303.

Conference publications

Yufan Guo, Diarmuid Ó Séaghdha, Ilona Silins, Lin Sun, Johan Högberg, Ulla Stenius and Anna Korhonen. 2014. CRAB: A text mining tool for supporting literature review in chemical cancer risk assessment. To appear in Proceedings of COLING 2014. Dublin, Ireland.

Anna Korhonen, Yufan Guo, Meliha Yetisgen-Yildiz, Ulla Stenius, Masashi Narita, Pietro Lio. 2014. Improving Literature-Based Discovery with Text Mining. In Proceedings of CIBB 2014. Cambridge, UK.

Yufan Guo, Ilona Silins, Roi Reichart and Anna Korhonen. 2012. CRAB Reader: A Tool for Analysis and Visualization of Argumentative Zones in Scientific Literature. In Proceedings of COLING 2012. Mumbai, India.

Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins, Lin Sun and Ulla Stenius. 2010. Identifying the Information Structure of Scientific Abstracts: An Investigation of Three Different Schemes. In Proceedings of the BioNLP 2010. Uppsala, Sweden.

Lin Sun, Anna Korhonen, Ilona Silins, and Ulla Stenius. 2009. User-Driven Development of Text Mining Resources for Cancer Risk Assessment. In Proceedings of the BioNLP 2009. Boulder, Colorado.

Ian Lewin, Ilona Silins, Anna Korhonen, Johan Hogberg, and Ulla Stenius. 2008. A New Challenge for Text Mining: Cancer Risk Assessment. In Proceedings of the ISMB BioLINK Special Interest Group on Text Data Mining. Toronto, Canada.

Conference Presentations and Posters

Ilona Silins, Anna Korhonen, Yufan Guo, Ulla Stenius. 2014. A text mining approach for chemical risk assessment and cancer research. Eurotox 2014, Edinburgh, UK.

Ilona Silins, Anna Korhonen, Johan Högberg, Ulla Stenius. 2011. A text mining approach to identify chemicals' modes of action in risk assessment of combined exposures. Toxicology of Mixtures Conference, Arlington, VA, USA.

Sandeep Kadekar, Ilona Silins, Anna Korhonen, Johan Hogberg, Kristian Dreij, and Ulla Stenius. 2010. Carcinogen-induced inflammation and pancreatic cancer. 101st Annual Meeting of the American Association for Cancer Research. Washington D.C., USA.

Ilona Silins, Anna Korhonen, Lin Sun, Johan Högberg, and Ulla Stenius. 2010. Chemical Carcinogenesis and Biomedical Text Mining. Karolinska Institutet Cancer Conference, Stockholm, Sweden.

Ilona Silins, Anna Korhonen, Johan Hogberg, Lin Sun, and Ulla Stenius. 2009. Improved Cancer Risk Assessment Using Text Mining. Proceedings of the 100th Annual Meeting of the American Association for Cancer Research. Denver, Colorado, USA.

Emma Westerholm, Jordi Boix, Hanna Miettinen, Robert Roos, Elsa Antunes-Fernandes, Remco Westerink, Majorie van Duursen, Mia Stenberg, Sara Carreira, Miroslav Machala, Ilona Silins, Ulla Stenius, Krister Halldin, Annika Hanberg, and Helen Hakansson. 2009. ATHON NDL-PCB effect database - a tool to facilitate the cumulative risk assessment of NDL-PCBs. In Toxicology Letters, Volume 189, Supplement 1, 13 September 2009. Abstracts of the 46th Congress of the European Societies of Toxicology.

Anna Korhonen, Ian Lewin, Ilona Silins, Johan Hogberg, and Ulla Stenius. 2008. CRAB - Cancer Risk Assessment and Biomedical Text Mining. European Conference on Computational Biology. Sardinia, Italy.

Help guide


Enter a search term into the main search box, for example benzyl chloride.

Then press RETURN on your keyboard or click on the search icon. The results of the search will be displayed below the search box.

The red circle next to each tag indicates the number of abstracts found for the search term that are categorised with that tag.

If no abstracts have been found for a particular tag, the tag will be greyed out.
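
The per-tag counts can be pictured as a simple tally over the classifier output. The following is only a minimal sketch: the tag names, the `classified` mapping and both function names are invented for illustration and do not describe how CRAB computes its counts.

```python
from collections import Counter

# Hypothetical subset of tags; the real CRAB taxonomy is larger.
ALL_TAGS = ["Carcinogenic activity", "Genotoxic", "Nongenotoxic", "Toxicokinetics"]

def count_abstracts_per_tag(classified):
    """Return {tag: number of abstracts carrying that tag} for every known tag."""
    # set(tags) ensures an abstract is counted once per tag even if listed twice.
    counts = Counter(tag for tags in classified.values() for tag in set(tags))
    return {tag: counts.get(tag, 0) for tag in ALL_TAGS}

def greyed_out(counts):
    """Tags with no abstracts, i.e. the ones the interface would grey out."""
    return [tag for tag, n in counts.items() if n == 0]

# Invented classifier output: abstract ID -> tags assigned to it.
classified = {
    "pmid-1": ["Carcinogenic activity", "Genotoxic"],
    "pmid-2": ["Genotoxic"],
    "pmid-3": ["Carcinogenic activity"],
}
counts = count_abstracts_per_tag(classified)
```

Here `counts["Genotoxic"]` is the number shown in the red circle for that tag, and `greyed_out(counts)` lists the tags with nothing to display.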

To view the abstracts for a particular tag, click on the red circle next to the tag.

This will load the top 50 abstracts for the tag below the results area.

To load the next 50 abstracts (if available), scroll to the bottom of the abstracts and click the View more button.
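
The "View more" behaviour amounts to paging through the result list 50 abstracts at a time. A minimal sketch, assuming the results are held as a plain list of IDs (the function name and the list are hypothetical, not part of CRAB):

```python
PAGE_SIZE = 50  # the guide states that abstracts load 50 at a time

def load_page(abstract_ids, page):
    """Return the abstract IDs for a zero-based page and whether more remain."""
    start = page * PAGE_SIZE
    batch = abstract_ids[start:start + PAGE_SIZE]
    has_more = start + PAGE_SIZE < len(abstract_ids)
    return batch, has_more

ids = list(range(120))             # stand-in for 120 retrieved abstracts
first, more = load_page(ids, 0)    # the initial "top 50"
last, more_after_last = load_page(ids, 2)
```

The `has_more` flag corresponds to whether a View more button would still be offered.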

Argument Zones

The Argument Zones for each abstract - the parts of the abstract corresponding to "Background", "Objective", "Method", "Result", "Conclusion", "Related work" and "Future work" - will be calculated in the background.

Before an abstract's Argument Zones have been calculated, the abstract will be greyed out.

Once the Argument Zones for an abstract have been calculated, they will be highlighted in the abstract.

Show or hide particular Argument Zones by clicking on the Argument Zone buttons at the top of the abstracts section.
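
One way to think about zoned abstracts is as a list of (zone, sentence) pairs, with the zone buttons controlling which zones are marked up. The sketch below rests on two assumptions: that hiding a zone removes only its highlighting, and that a bracketed label stands in for the coloured highlight. The sentences are invented.

```python
ZONES = ["Background", "Objective", "Method", "Result",
         "Conclusion", "Related work", "Future work"]

def render(zoned_sentences, visible):
    """Join sentences, marking up only those whose zone is switched on."""
    parts = []
    for zone, sentence in zoned_sentences:
        if zone in visible:
            parts.append(f"[{zone}: {sentence}]")  # stand-in for highlighting
        else:
            parts.append(sentence)
    return " ".join(parts)

# Invented example abstract, already split into zoned sentences.
zoned = [
    ("Background", "We studied chemical X."),
    ("Method", "Cells were exposed for 24 h."),
    ("Result", "Viability decreased."),
]
only_results = render(zoned, {"Result"})
```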

Graphs and downloads

To view a graph of the results, click on the blue graph button.

Three different graphs will be displayed, one for each of the top-level sections, i.e. "Scientific Evidence", "Mode of Action", "Toxicokinetics".

Clicking the bar or label of a graph will show the associated abstracts.

To download a graph as an image, click the red download button.

To download the raw data as a CSV spreadsheet, click the red Excel button.

To download the data as a Word document, click the red Word button.
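
For readers who want to post-process the CSV download, the raw data can be thought of as one row per tag. The layout below is an assumption made for illustration (the column names `tag` and `abstracts` are invented); the actual file produced by CRAB may be structured differently.

```python
import csv
import io

def counts_to_csv(counts):
    """Serialise {tag: count} as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["tag", "abstracts"])  # hypothetical column names
    for tag, n in counts.items():
        writer.writerow([tag, n])
    return buffer.getvalue()

csv_text = counts_to_csv({"Genotoxic": 2, "Nongenotoxic": 0})
```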

Interface details

Limit search to date range

To limit the search to a particular year date range, add the following text to your search:

AND ("START_YEAR"[Date - Publication] : "END_YEAR"[Date - Publication])

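For example, to limit a search for benzyl chloride to the years 2000 to 2010:

benzyl chloride AND ("2000"[Date - Publication] : "2010"[Date - Publication])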
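
The date filter can also be assembled programmatically before pasting it into the search box. This is a small sketch built only on the template given above; the function name `with_date_range` is hypothetical.

```python
def with_date_range(term, start_year, end_year):
    """Append the publication-date filter from the guide to a search term."""
    if start_year > end_year:
        raise ValueError("start_year must not be later than end_year")
    return (f'{term} AND ("{start_year}"[Date - Publication] : '
            f'"{end_year}"[Date - Publication])')

query = with_date_range("benzyl chloride", 2000, 2010)
```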

Expanding or collapsing elements

You can expand or collapse particular parent tags by clicking on the orange icon to the left of the tag.

You can also expand or collapse every tag by clicking on the green expand icon or the orange collapse icon to the left of the results area.

The content of each abstract can be hidden by clicking on the title of the abstract.

The content of all abstracts can be expanded or contracted by clicking on the green expand icon or the orange collapse icon to the left of the abstracts area.

View original PubMed article

You can view the original PubMed article by clicking on the icon to the left of the abstract's title.

View status of Argument Zoning process

It takes 1-2 seconds to calculate the Argument Zones for a particular abstract from scratch, though this data will be cached once it has been calculated for a particular abstract.
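
The effect of caching can be illustrated with a memoised function: the first request for an abstract pays the full cost and later requests are answered immediately. This is a sketch only. It uses Python's `functools.lru_cache` as a stand-in, since the guide does not say how CRAB stores its cache, and the zoning step itself is replaced by a dummy.

```python
import functools

computed = []  # records which abstracts were actually zoned, for illustration

@functools.lru_cache(maxsize=None)
def argument_zones(pmid):
    """Return zone labels for an abstract, computing them at most once."""
    computed.append(pmid)
    # Stand-in for the slow (1-2 second) zoning step; returns dummy labels.
    return ("Background", "Method", "Result", "Conclusion")

argument_zones("12345")  # computed from scratch
argument_zones("12345")  # served from the cache
```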

The status of the Argument Zoning process will appear on the bottom-right of the screen in a grey box.

Sharing searches

You can share searches by clicking on the share button on the top-right of the screen and copying the URL.
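
Sharing by URL works because the searches can be encoded in the address itself. The sketch below shows the general idea with Python's standard `urllib.parse`; the base address and the `q` parameter are placeholders, as the real format of CRAB's shared URLs is not documented here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://crab.example.org/"  # placeholder, not the real CRAB address

def share_url(searches):
    """Encode a list of search strings as repeated 'q' query parameters."""
    return BASE_URL + "?" + urlencode({"q": searches}, doseq=True)

def searches_from_url(url):
    """Recover the list of search strings from a shared URL."""
    return parse_qs(urlsplit(url).query).get("q", [])

url = share_url(["benzyl chloride", "*"])
restored = searches_from_url(url)
```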

Multiple searches

Multiple searches can be created, allowing data analysis to be carried out across several datasets.

To create an additional search, click on the Add tab button.

Repeat your search as before.

The name of the search in the tab will reflect the text content of the search. To change the name of the search tab, click on the tab's text box directly.

To create a dataset representing all records, enter the wildcard character * into the search box. This can be used to run a significance test of a particular term against the entire database (see Data analysis, below).
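
To make the idea of testing a term against the entire database concrete, here is a sketch of one common choice, a chi-squared test on a 2x2 table with one degree of freedom. The guide does not state which test CRAB applies, so the method, the function name and the numbers below are all assumptions for illustration.

```python
import math

def chi_squared_2x2(tagged_in_term, total_in_term, tagged_in_all, total_in_all):
    """Compare a tag's frequency in one search against the rest of the database.

    Returns (statistic, p_value) for a chi-squared test with 1 degree of freedom.
    """
    a = tagged_in_term                      # term abstracts with the tag
    b = total_in_term - tagged_in_term      # term abstracts without it
    c = tagged_in_all - tagged_in_term      # other abstracts with the tag
    d = (total_in_all - total_in_term) - c  # other abstracts without it
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Invented counts: 30 of 100 abstracts for the term carry the tag,
# against 1000 of 10000 abstracts in the whole (wildcard) dataset.
statistic, p_value = chi_squared_2x2(30, 100, 1000, 10000)
```

A small p-value suggests the tag is over- or under-represented for the term relative to the database as a whole.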

To delete a search, click on the delete icon.

Data analysis

To analyze one or more datasets, click on the data analysis icon.

This will take you to the Analyze data screen.

Select the data analysis function from the Select function dropdown.

Select the parent node from the Select parent node dropdown to select a subset of tags to carry out data analysis on.
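
Choosing a parent node restricts the analysis to the tags beneath it in the taxonomy. A minimal sketch of that selection follows; the top-level names come from this guide, while the child tags and the `descendants` helper are invented for illustration.

```python
# Hypothetical fragment of the taxonomy: parent tag -> child tags.
TAXONOMY = {
    "Mode of Action": ["Genotoxic", "Nongenotoxic"],
    "Genotoxic": ["Mutations", "DNA adducts"],
    "Nongenotoxic": ["Oxidative stress"],
    "Toxicokinetics": ["Absorption", "Metabolism"],
}

def descendants(parent):
    """Return every tag below a parent node, in depth-first order."""
    found = []
    stack = list(reversed(TAXONOMY.get(parent, [])))
    while stack:
        tag = stack.pop()
        found.append(tag)
        # Push children in reverse so they are visited in listed order.
        stack.extend(reversed(TAXONOMY.get(tag, [])))
    return found

subset = descendants("Mode of Action")
```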

Select the datasets from the searches that have been completed.

Click the blue plus icon to run the analysis. The graph/table will be added to the bottom of the page.

To download the image of a graph, click the download icon.

To download the CSV data for a table, click the Excel image icon.

To delete a data analysis element, click the delete icon.

Note: The state of the data analysis page is not saved in the URL and is not recreated when a URL is shared.
