Treebank

Overview

Part-of-speech and syntactic (phrase structure) annotation has been created for all of the 1,999 abstracts of the primary GENIA corpus. The annotation scheme of the GENIA Treebank is based on the Penn Treebank II (PTB) bracketing guidelines (Bies et al., 1995).

Example

Corpus format

The primary GENIA treebank distribution is in XML format. Conversions into the Penn Treebank format have been created by a number of researchers not directly affiliated with the GENIA project. One recent version of the GENIA Treebank in PTB format was created by David McClosky.
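
For readers working with one of the PTB-format conversions, the following is a minimal sketch of how a bracketed tree can be read into a nested structure and its part-of-speech tags listed. It assumes standard PTB-style bracketing only. The function names `parse_ptb` and `tagged_words` and the sample sentence are hypothetical illustrations, not part of any GENIA distribution, and the sketch does not cover the primary XML format.

```python
def parse_ptb(text):
    """Parse one PTB-style bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # PTB files often wrap each tree in an unlabeled outer bracket.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def tagged_words(tree):
    """Yield (word, pos_tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    for child in children:
        if isinstance(child, tuple):
            yield from tagged_words(child)
        else:
            yield (child, label)


# Hypothetical sentence for illustration; not taken from the corpus.
sample = "(S (NP (NN IL-2)) (VP (VBZ activates) (NP (NN NF-kappaB))) (. .))"
tree = parse_ptb(sample)
pairs = list(tagged_words(tree))
```

The nested-tuple representation is chosen only to keep the sketch free of dependencies. Unbalanced input raises an `IndexError` rather than being repaired.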

Major applications

The GENIA Treebank is the most widely applied corpus for training and adapting parsers to biomedical domain texts. It has been applied to the following parsers:

- Stanford parser (combined "english" model)
- McClosky-Charniak-Johnson parser (biomedical domain model)
- Enju parser (biomedical domain model)
- GDep: the GENIA dependency parser

Documentation

Encoding scheme

- Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi and Jun'ichi Tsujii. GENIA Corpus Manual - Encoding schemes for the corpus and annotation. Technical Report (TR-NLP-UT-2006-1). Tsujii Laboratory, University of Tokyo, 2006.

Annotation guidelines

- Tateisi, Yuka, Akane Yakushiji, Tomoko Ohta and Jun'ichi Tsujii. GENIA Annotation Guidelines for Treebanking. Technical Report (TR-NLP-UT-2006-5). Tsujii Laboratory, University of Tokyo, 2006.

Publications

- Tateisi, Yuka, Akane Yakushiji, Tomoko Ohta and Jun'ichi Tsujii. Syntax Annotation for the GENIA corpus. In the Proceedings of IJCNLP'05. Jeju Island, Korea, pp. 222-227, October 2005.

Download

- GENIA Treebank version 1.0: GENIA_treebank_v1.tar.gz (2.3M)

Acknowledgments

Yuka Tateisi: GENIA Treebank annotation coordinator
