Corpora for Drug-Drug Interaction Extraction
The DDI Corpus
DDI Corpus (dataset 2013): a corpus annotated with drug-drug interactions. Texts were collected from the DrugBank database and MedLine. This version of the corpus was used in the SemEval-2013 Task 9 Drug-Drug Interaction extraction task (http://www.cs.york.ac.uk/semeval-2013/task9/).
DDI Corpus (dataset 2011): this version of the corpus was used as the training and test dataset in the DDIExtraction 2011 shared task (http://hulat.inf.uc3m.es/DDIExtraction2011/).
In any work that uses the DDI Corpus, please acknowledge the authors, as follows:
María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, Thierry Declerck, The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions, Journal of Biomedical Informatics, Volume 46, Issue 5, October 2013, Pages 914-920, http://dx.doi.org/10.1016/j.jbi.2013.07.011.
Isabel Segura-Bedmar, Paloma Martínez, María Herrero-Zazo, Lessons learnt from the DDIExtraction-2013 shared task, Journal of Biomedical Informatics, Volume 51, 2014, Pages 152-164.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The management of drug-drug interactions (DDIs) is a critical issue because of the overwhelming amount of information available on them. Natural Language Processing (NLP) techniques can reduce the time healthcare professionals spend reviewing the biomedical literature. However, the shortage of annotated corpora for DDI extraction is the main bottleneck in the development of NLP systems for this area of pharmacovigilance.
The DDI corpus is made up of 792 texts selected from the DrugBank database and a further 233 MedLine abstracts on the subject of DDIs. The corpus was annotated with a total of 18,502 pharmacological substances and 5,028 DDIs, including both pharmacokinetic (PK) and pharmacodynamic (PD) interactions. Earlier corpora annotated with DDIs focused on PK interactions and did not cover PD interactions.
Annotation guidelines were developed by domain experts in order to ensure a high-quality, reliable and accurate annotation of the corpus. Pharmacological substances were classified according to four entity types: drug (for generic drugs), brand (for trade drugs), group (for drug classes) and drug_n (for active substances not approved for human use). DDIs were also classified into four types: mechanism (for DDIs describing the way the interaction occurs), effect (for DDIs describing the consequence of the interaction), advice (for DDIs described by a recommendation or advice) and int (for DDIs without any additional information). Inter-Annotator Agreement (IAA) was measured to assess the consistency and quality of the corpus. The agreement was almost perfect (Kappa up to 0.96 and generally over 0.80), except for the DDIs in the MedLine texts (0.55–0.72).
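As an illustration of how agreement on the four DDI types can be quantified, the following is a minimal sketch of Cohen's kappa for two annotators. It assumes Cohen's variant; the text above only says "Kappa" and does not state which variant was used. The two label lists are made up for the example and are not taken from the corpus.

```python
from collections import Counter

# The four DDI types named in the annotation guidelines.
DDI_TYPES = ("mechanism", "effect", "advice", "int")


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six DDI pairs (illustrative only).
annotator_a = ["effect", "effect", "mechanism", "advice", "int", "effect"]
annotator_b = ["effect", "mechanism", "mechanism", "advice", "int", "effect"]
print(round(cohen_kappa(annotator_a, annotator_b), 3))  # 0.769
```

Here the annotators agree on five of six pairs (observed agreement 0.833), but the chance-corrected value is lower because some of that agreement would be expected from the label distributions alone.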
The DDI corpus was developed for the SemEval-2013 DDIExtraction 2013 challenge, whose main goal was to provide a common framework for the evaluation of information extraction techniques applied to the recognition and classification of pharmacological substances (DrugNER subtask) and the detection and classification of drug-drug interactions (DDIExtraction subtask) in biomedical texts. The DDI corpus is a valuable gold standard for research groups interested in the recognition of pharmacologically active substances (drugs, groups of drugs, toxins, etc.) and for those working specifically on DDI relation extraction.
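For readers who want to load the annotations for either subtask, the following is a minimal sketch of reading entities and interacting pairs. It assumes the XML layout of the 2013 release (document, sentence, entity and pair elements with the attributes shown); this page does not describe the file format, so check the element and attribute names against the downloaded files. The sample document, drug names and IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed layout; not a real corpus document.
SAMPLE = """<document id="DDI-DrugBank.d0">
  <sentence id="DDI-DrugBank.d0.s0" text="DrugA may increase the effect of DrugB.">
    <entity id="DDI-DrugBank.d0.s0.e0" charOffset="0-4" type="drug" text="DrugA"/>
    <entity id="DDI-DrugBank.d0.s0.e1" charOffset="33-37" type="drug" text="DrugB"/>
    <pair id="DDI-DrugBank.d0.s0.p0" e1="DDI-DrugBank.d0.s0.e0"
          e2="DDI-DrugBank.d0.s0.e1" ddi="true" type="effect"/>
  </sentence>
</document>"""


def extract(xml_text):
    """Return ({entity_id: (text, type)}, [(pair_id, drug1, drug2, ddi_type)])."""
    root = ET.fromstring(xml_text)
    entities = {}
    interactions = []
    for sentence in root.iter("sentence"):
        # DrugNER view: every annotated substance with its entity type.
        for entity in sentence.iter("entity"):
            entities[entity.get("id")] = (entity.get("text"), entity.get("type"))
        # DDIExtraction view: keep only pairs marked as interacting.
        for pair in sentence.iter("pair"):
            if pair.get("ddi") == "true":
                interactions.append((
                    pair.get("id"),
                    entities[pair.get("e1")][0],
                    entities[pair.get("e2")][0],
                    pair.get("type"),
                ))
    return entities, interactions


entities, interactions = extract(SAMPLE)
print(interactions)
```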
The DDI corpus is divided into two datasets: training and test. The training dataset is the same for both subtasks and contains gold-standard annotations of pharmacological substances and their interactions. It consists of 714 texts (572 from DrugBank and 142 MedLine abstracts) annotated with a total of 13029 pharmacological substances (13029 from DrugBank and 1826 from MedLine) and 4037 DDIs (3805 from DrugBank and 232 from MedLine). The test dataset for the DrugNER subtask consists of 52 DrugBank texts (annotated with 303 pharmacological substances) and 58 MedLine abstracts (with 382 pharmacological substances). The test dataset for the DDIExtraction subtask consists of 158 DrugBank texts (annotated with 889 DDIs) and 33 MedLine abstracts (with 95 DDIs).
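The text and DDI counts quoted above can be tabulated for a quick sanity check when loading the splits. This is only a sketch: the dictionary layout and the helper name totals are assumptions made for the example, and it covers text and DDI counts only.

```python
# Counts copied from the description above, keyed by source.
TRAINING = {
    "DrugBank": {"texts": 572, "ddis": 3805},
    "MedLine": {"texts": 142, "ddis": 232},
}
DDI_TEST = {
    "DrugBank": {"texts": 158, "ddis": 889},
    "MedLine": {"texts": 33, "ddis": 95},
}


def totals(split):
    """Sum each count across the sources of one split."""
    keys = ("texts", "ddis")
    return {key: sum(source[key] for source in split.values()) for key in keys}


print(totals(TRAINING))  # {'texts': 714, 'ddis': 4037}
print(totals(DDI_TEST))  # {'texts': 191, 'ddis': 984}
```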
We hope that the release of this dataset will encourage further research on the DDI problem.
The DDIExtraction gold standard is now available via EvALL. EvALL is an evaluation web service being launched by the NLP&IR research group at UNED. It allows researchers to evaluate their system outputs according to several metrics. EvALL also allows researchers to publish their system outputs against a gold standard already stored in the EvALL repository (a "benchmark" in EvALL). Including your outputs in EvALL will allow future researchers working with this dataset to see your approaches, compare against them, explore your papers, and cite them, among other things. You can find more information about EvALL in the following video.
In order to submit your system outputs you only need to:
1.- Prepare the system output. EvALL uses a TSV file format with three columns. The first is the TEST CASE (for this dataset, “DDI2013”), the second is the ID of the element, and the third is the class assigned to that ID:
"DDI2013" "DDI-DrugBank.d712.s1.p0" "2"
2.- Register and login. In the top right corner of the screen you can see the “Sign in” button. Click it and follow the steps.
3.- In the home menu, click on the option “Publish a new system output”.
4.- Select the DDIExtraction benchmark and follow the steps. With just 3 clicks you can upload the output.
5.- Fill in the form with the information about your output: the authors, the link to the PDF, the BibTeX entry, etc.
6.- Save and go to the “Browse repository” option to check that everything is all right.
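The output file described in step 1 could be produced with a sketch like the one below. It assumes that the columns are tab-separated and that every field is double-quoted, as the example row suggests; the function name write_evall_tsv is hypothetical. Class values are passed through unchanged, since the meaning of each class code is not specified here.

```python
import csv
import io

TEST_CASE = "DDI2013"  # test-case name given in step 1


def write_evall_tsv(predictions, handle, test_case=TEST_CASE):
    """Write (element_id, assigned_class) pairs as three-column TSV rows."""
    writer = csv.writer(
        handle,
        delimiter="\t",         # assumption: columns are tab-separated
        quoting=csv.QUOTE_ALL,  # assumption: every field is double-quoted
        lineterminator="\n",
    )
    for element_id, assigned_class in predictions:
        writer.writerow([test_case, element_id, str(assigned_class)])


# Usage: the single row from the example in step 1.
buffer = io.StringIO()
write_evall_tsv([("DDI-DrugBank.d712.s1.p0", "2")], buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead, open the target file with open(path, "w", newline="", encoding="utf-8") and pass the file object as handle.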