Part-of-speech and syntactic (phrase structure) annotation has been created for all of the 1999 abstracts of the primary GENIA corpus. The annotation scheme of the GENIA Treebank is based on the Penn Treebank II (PTB) bracketing guidelines (Bies et al., 1995).
The primary GENIA treebank distribution is in XML format. Conversions into the Penn Treebank format have been created by a number of researchers not directly affiliated with the GENIA project. One recent version of the GENIA Treebank in PTB format was created by David McClosky.
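For readers unfamiliar with the PTB format mentioned above, the following is a minimal sketch of reading one bracketed tree in Python. It assumes the standard PTB conventions (nested parentheses, one label per bracket, literal parentheses in the text escaped as -LRB- and -RRB-). The function names `parse_ptb` and `leaves` are hypothetical, and the example sentence is invented for illustration rather than taken from the GENIA Treebank. The sketch does not cover the XML distribution, whose encoding is described in the corpus manual listed below.

```python
def parse_ptb(text):
    """Parse one PTB-style bracketed tree into nested (label, children) tuples."""
    # Split brackets off as separate tokens; labels and words contain no spaces.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # Some PTB files wrap each tree in an unlabeled bracket: "( (S ...))".
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the end of the tree")
    return tree


def leaves(tree):
    """Return (word, POS tag) pairs in sentence order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves(child))
        else:
            pairs.append((child, label))
    return pairs


# Invented example sentence, not taken from the corpus.
example = ("(S (NP (NN IL-2) (NN gene) (NN expression)) "
           "(VP (VBZ requires) (NP (NN activation))))")
tree = parse_ptb(example)
tagged = leaves(tree)
```

Here `tagged` holds the words paired with their part-of-speech tags, and `tree` keeps the full phrase structure for anything that needs the constituents.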
The GENIA Treebank is the most widely applied corpus for training and adapting parsers to biomedical domain texts and has been applied to parsers including:
Stanford parser (combined "english" model)
McClosky-Charniak-Johnson parser (biomedical domain model)
Enju parser (biomedical domain model)
GDep: the GENIA dependency parser
Encoding scheme
Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi and Jun'ichi Tsujii. GENIA Corpus Manual - Encoding schemes for the corpus and annotation. Technical Report (TR-NLP-UT-2006-1). Tsujii Laboratory, University of Tokyo, 2006.
Annotation guidelines
Tateisi, Yuka, Akane Yakushiji, Tomoko Ohta and Jun'ichi Tsujii. GENIA Annotation Guidelines for Treebanking. Technical Report (TR-NLP-UT-2006-5). Tsujii Laboratory, University of Tokyo, 2006.
Publications
Tateisi, Yuka, Akane Yakushiji, Tomoko Ohta and Jun'ichi Tsujii. Syntax Annotation for the GENIA corpus. In the Proceedings of IJCNLP'05. Jeju Island, Korea, pp. 222-227, October 2005.
GENIA Treebank version 1.0: GENIA_treebank_v1.tar.gz (2.3M)
Yuka Tateisi: GENIA Treebank annotation coordinator
See also the GENIA Project acknowledgments page.