The identification of coreferential expressions, that is, expressions in text that refer to the same thing, is important for many applications that rely on analyzing the meaning of statements in text. The GENIA Coreference corpus provides coreference annotations covering all 1,999 abstracts of the primary GENIA corpus.
The coreference annotation was produced by the MedCo Annotation Project. The conversion into the GENIA format and minor bug fixes were made by the GENIA Project.
The coreference corpus is distributed in the XML format described in the GENIA Corpus Manual. A selected subset of revised corpus annotations is also available in a standoff format as part of the BioNLP Shared Task 2011 CO task corpus.
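As a rough illustration of working with inline XML coreference markup, the following Python sketch collects anaphor/antecedent pairs from a small fragment. The element name `coref`, the attributes `id` and `ref`, and the sample text are all assumptions made for this example; the actual tag set is the one defined in the GENIA Corpus Manual and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element name "coref" and the attributes
# "id" and "ref" are assumptions for illustration only, not taken from
# the GENIA Corpus Manual.
SAMPLE = """<article>
<sentence><coref id="C1">The IL-2 gene</coref> is expressed in T cells.</sentence>
<sentence><coref id="C2" ref="C1">It</coref> is regulated by NF-kappa B.</sentence>
</article>"""


def coreference_links(xml_text):
    """Return (anaphor_text, antecedent_text) pairs found in xml_text."""
    root = ET.fromstring(xml_text)

    # First pass: index the text of every marked expression by its id.
    text_by_id = {}
    for element in root.iter("coref"):
        expression_id = element.get("id")
        if expression_id is not None:
            text_by_id[expression_id] = "".join(element.itertext())

    # Second pass: resolve each "ref" attribute to the expression it points at.
    links = []
    for element in root.iter("coref"):
        ref = element.get("ref")
        if ref is not None and ref in text_by_id:
            links.append(("".join(element.itertext()), text_by_id[ref]))
    return links


if __name__ == "__main__":
    for anaphor, antecedent in coreference_links(SAMPLE):
        print(f"{anaphor!r} refers to {antecedent!r}")
```

Two passes are used so that a reference can be resolved regardless of where in the document the expression it points at appears.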
The GENIA Coreference corpus annotations served as the initial source data for the BioNLP Shared Task 2011 CO task.
Encoding scheme
Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi and Jun'ichi Tsujii. GENIA Corpus Manual - Encoding schemes for the corpus and annotation. Technical Report (TR-NLP-UT-2006-1). Tsujii Laboratory, University of Tokyo, 2006.
Publications
Su, Jian, Xiaofeng Yang, Huaqing Hong, Yuka Tateisi and Jun'ichi Tsujii. Coreference Resolution in Biomedical Texts: a Machine Learning Approach. Dagstuhl Seminar Proceedings 08131 - Ontologies and Text Mining for Life Sciences: Current Status and Future Perspectives. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2008.
GENIA-MedCO coreference corpus version 1.0: GENIA_MedCo_coreference_corpus_1.0.tar.gz
The coreference corpus annotations were produced by the MedCo Annotation Project.