The identification of linguistic expressions referring to entities of interest in molecular biology, such as proteins, genes and cells, is a fundamental task in biomolecular text mining. The GENIA technical term annotation covers the identification of physical biological entities as well as other important terms. The annotation covers all 1,999 abstracts of the primary GENIA corpus.
The GENIA Term corpus is available in an XML format described in the GENIA Corpus Manual.
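As a rough illustration of working with the XML format, the sketch below extracts annotated terms from a single sentence. It assumes that terms are marked with cons elements carrying a sem attribute for the semantic class, and that cons elements may be nested; the sample sentence is made up for this example, and the GENIA Corpus Manual remains the authoritative description of the encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample sentence, written for illustration only.
# Assumption: terms are wrapped in <cons> elements with a "sem" attribute,
# and one <cons> may contain another (nested terms).
SAMPLE = (
    '<sentence>'
    '<cons lex="IL-2_gene_expression" sem="G#other_name">'
    '<cons lex="IL-2_gene" sem="G#DNA_domain_or_region">IL-2 gene</cons>'
    ' expression</cons> requires '
    '<cons lex="NF-kappa_B" sem="G#protein_molecule">NF-kappa B</cons>.'
    '</sentence>'
)


def extract_terms(xml_text):
    """Return (surface text, semantic class) pairs for every <cons> element.

    Elements are visited in document order, so an enclosing term is
    listed before the terms nested inside it.
    """
    root = ET.fromstring(xml_text)
    terms = []
    for cons in root.iter("cons"):
        # itertext() joins the text of nested elements, so the full
        # surface form of an enclosing term is recovered.
        surface = "".join(cons.itertext())
        # "sem" may be absent on some elements; get() then returns None.
        terms.append((surface, cons.get("sem")))
    return terms


if __name__ == "__main__":
    for surface, sem in extract_terms(SAMPLE):
        print(f"{sem}\t{surface}")
```

To process the distributed corpus file instead of the sample string, the same loop could be run over the root returned by ET.parse(path).getroot(), where path is the location of the unpacked XML file.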
The GENIA Term corpus annotations served as the initial source data for the BioNLP / JNLPBA 2004 Shared Task on Bio-Entity Recognition corpus, which has been used as training material for numerous domain entity mention taggers such as the GENIA Tagger.
The GENIA Term corpus annotations form the basis of the GENIA Event corpus annotations, which have in turn served as the initial source data for the BioNLP Shared Task 2009 and the BioNLP Shared Task 2011 GE task.
Encoding scheme
Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi and Jun'ichi Tsujii. GENIA Corpus Manual - Encoding schemes for the corpus and annotation. Technical Report (TR-NLP-UT-2006-1). Tsujii Laboratory, University of Tokyo, 2006.
Annotation guidelines
Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi and Jun'ichi Tsujii. GENIA Ontology. Technical Report (TR-NLP-UT-2006-2). Tsujii Laboratory, University of Tokyo, 2006.
Kim, Jin-Dong and Jun'ichi Tsujii. GENIA Corpus Curation Framework. Technical Report (TR-NLP-UT-2006-3). Tsujii Laboratory, University of Tokyo, 2006.
Publications
Ohta, Tomoko, Yuka Tateisi, Hideki Mima and Jun'ichi Tsujii. The GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain. In the Proceedings of the Human Language Technology Conference (HLT 2002). San Diego, USA, March 2002.
Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi and Jun'ichi Tsujii. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics. 19(suppl. 1). pp. i180-i182, Oxford University Press, 2003. ISSN 1367-4803.
Kim, Jin-Dong, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi and Nigel Collier. Introduction to the Bio-Entity Recognition Task at JNLPBA. In the Proceedings of the International Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-04). Geneva, Switzerland, pp. 70-75, 2004.
GENIA Term corpus version 3.02: GENIAcorpus3.02.tgz (1.6M)
Tomoko Ohta: GENIA term corpus annotation coordinator
See also GENIA Project acknowledgments page