Part-of-speech (POS) tagging is an early step in natural language processing, often performed right after or together with tokenization. Each token is assigned a POS label. The GENIA POS annotation generally follows the Penn Treebank POS tagging scheme, with the following modifications:
The NNP and NNPS (proper name) tags are used only for the names of journals, authors, and research institutes, and for the initials of patients. In particular, names (such as discoverers' names) that appear in technical terms (e.g. Epstein-Barr virus, Southern blotting) are not tagged with NNP tags.
We tried to eliminate SYM tags as much as possible.
See the annotation guidelines for details. The abstracts are first tagged by the JunK tagger and then corrected by human annotators.
The corpus is available in two formats, both included in the package available for download below.
PTB-like format: The file contains one token/POS pair per line, and a "==========" line (ten equal signs) separates sentences.
"Merged" gpml format: The POS information is merged into GENIA corpus ver 3.02 using the <w> tag, which surrounds each token; the POS is given as the value of the "c" attribute.
In the merged format, but not in the PTB-like format, some tokens are assigned "*" as their POS. This occurs when a token is split by <term> tags assigned by the annotators of the original GENIA corpus. In such cases, the last fragment of the split token keeps the POS tag assigned by the POS annotators, and the other fragments are assigned "*", e.g. <w c="*">anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>.
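The two formats described above could be read into (token, POS) pairs along the following lines. This is only a minimal sketch, not part of the corpus distribution: the function names read_ptb_like and read_merged, the <sentence> wrapper element, and the sample data are all illustrative assumptions. It also assumes that the last "/" on a PTB-like line separates the token from its tag, and that the fragments of a split token appear consecutively in the merged file.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Iterator, List, Tuple

TaggedToken = Tuple[str, str]
SENTENCE_SEPARATOR = "=" * 10  # ten equal signs, as described above


def read_ptb_like(lines: Iterable[str]) -> Iterator[List[TaggedToken]]:
    """Yield one list of (token, POS) pairs per sentence."""
    sentence: List[TaggedToken] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == SENTENCE_SEPARATOR:
            if sentence:
                yield sentence
                sentence = []
            continue
        # Assumption: the LAST "/" separates token from tag, so a token
        # that itself contains "/" stays intact.
        token, slash, pos = line.rpartition("/")
        if not slash:
            raise ValueError(f"no token/POS separator in line: {line!r}")
        sentence.append((token, pos))
    if sentence:
        yield sentence


def read_merged(xml_fragment: str) -> List[TaggedToken]:
    """Rejoin fragments tagged "*" with the fragment that carries the POS."""
    root = ET.fromstring(xml_fragment)
    tokens: List[TaggedToken] = []
    pending = ""  # text of "*" fragments waiting for their final fragment
    for w in root.iter("w"):  # document order, including <w> inside <term>
        text = w.text or ""
        pos = w.get("c", "")
        if pos == "*":
            pending += text
        else:
            tokens.append((pending + text, pos))
            pending = ""
    if pending:  # trailing "*" fragment with no tagged fragment after it
        tokens.append((pending, "*"))
    return tokens


# Invented sample data for illustration only (not taken from the corpus files).
ptb_sample = ["anti-IgM/JJ", "antibody/NN", "==========", "and/or/CC"]
merged_sample = (
    '<sentence><w c="*">anti-</w><term sem="#003"><w c="JJ">IgM</w></term>'
    ' <w c="NN">antibody</w></sentence>'
)

print(list(read_ptb_like(ptb_sample)))
print(read_merged(merged_sample))
```

Rejoining the "*" fragments in this way recovers the same token/POS pairs that the PTB-like file lists directly, which may be convenient when the term annotation is not needed.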
Annotation guidelines
Tateisi, Yuka and Jun'ichi Tsujii. GENIA Annotation Guidelines for Tokenization and POS tagging. Technical Report (TR-NLP-UT-2006-4). Tsujii Laboratory, University of Tokyo, 2006.
Publications
Tateisi, Yuka and Jun'ichi Tsujii. Part-of-Speech Annotation of Biology Research Abstracts. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Vol. IV. Lisbon, Portugal, pp. 1267-1270, May 2004.
GENIA corpus version 3.02 POS annotation: GENIAcorpus3.02p.tgz (4.6M)
Yuka Tateisi: GENIA part-of-speech corpus annotation coordinator
See also GENIA Project acknowledgments page