Coreference annotation

Overview

Identifying coreferential expressions, that is, expressions in a text that refer to the same thing, is important for many applications that rely on analyzing the meaning of statements in text. The GENIA Coreference corpus provides coreference annotations covering all 1,999 abstracts of the primary GENIA corpus.

The coreference annotation was produced by the MedCo Annotation Project. The conversion into the GENIA format and minor bug fixes were made by the GENIA Project.

Example

Corpus format

The coreference corpus is distributed in the XML format described in the GENIA Corpus Manual. A selected subset of the revised corpus annotations is also available in a standoff format as part of the BioNLP Shared Task 2011 CO task corpus.
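As a rough illustration of working with inline XML coreference markup, the following Python sketch groups mentions into coreference chains. The element name `coref`, the attributes `id` and `ref`, and the sample text are assumptions made for this example only; the actual tag set is defined in the GENIA Corpus Manual and should be checked there before adapting the code.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample. The element name "coref" and the attributes
# "id" and "ref" are assumptions for illustration, not taken from
# the GENIA Corpus Manual.
SAMPLE = """<article>
<sentence id="S1"><coref id="C1">The <coref id="C2">IL-2</coref> gene</coref> is induced in T cells.</sentence>
<sentence id="S2"><coref id="C3" ref="C1">It</coref> requires NF-kappa B.</sentence>
<sentence id="S3"><coref id="C4" ref="C3">This gene</coref> is well studied.</sentence>
</article>"""


def coreference_chains(xml_text):
    """Group mention texts by the first mention of their chain."""
    root = ET.fromstring(xml_text)
    text = {}    # mention id -> whitespace-normalized surface text
    parent = {}  # mention id -> id of its antecedent, or None
    for node in root.iter("coref"):
        mention_id = node.get("id")
        text[mention_id] = " ".join("".join(node.itertext()).split())
        parent[mention_id] = node.get("ref")

    def chain_head(mention_id):
        # Follow antecedent links until a mention without a known
        # antecedent is reached; "seen" guards against cyclic links.
        seen = set()
        while parent.get(mention_id) in text and mention_id not in seen:
            seen.add(mention_id)
            mention_id = parent[mention_id]
        return mention_id

    chains = defaultdict(list)
    for mention_id in text:
        chains[chain_head(mention_id)].append(text[mention_id])
    # Keep only chains with at least two mentions.
    return {head: mentions for head, mentions in chains.items() if len(mentions) > 1}


if __name__ == "__main__":
    for head, mentions in coreference_chains(SAMPLE).items():
        print(head, mentions)
```

Following antecedent links back to the first mention keeps the sketch short. For a full corpus, a union-find structure may be a better fit if the links do not form simple chains.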

Major applications

- The GENIA Coreference corpus annotations served as the initial source data for the BioNLP Shared Task 2011 CO task.

Documentation

Encoding scheme

Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi and Jun'ichi Tsujii. GENIA Corpus Manual - Encoding schemes for the corpus and annotation. Technical Report (TR-NLP-UT-2006-1). Tsujii Laboratory, University of Tokyo, 2006.

Publications

Su, Jian, Xiaofeng Yang, Huaqing Hong, Yuka Tateisi and Jun'ichi Tsujii. Coreference Resolution in Biomedical Texts: a Machine Learning Approach. Dagstuhl Seminar Proceedings 08131 - Ontologies and Text Mining for Life Sciences: Current Status and Future Perspectives. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2008.

Download

- GENIA-MedCo coreference corpus version 1.0: GENIA_MedCo_coreference_corpus_1.0.tar.gz

Acknowledgments

The coreference corpus annotations were produced by the MedCo Annotation Project.
