
Part-of-speech annotation

Overview

Part-of-speech (POS) tagging is an early step in natural language processing, usually performed immediately after, or together with, tokenization: every token is assigned a POS label. The GENIA POS annotation generally follows the Penn Treebank POS tagging scheme, with the following modifications:
  • The NNP and NNPS (proper noun) tags are used only for the names of journals, authors, and research institutes, and for the initials of patients. In particular, personal names (such as those of discoverers) that form part of technical terms (e.g. Epstein-Barr virus, Southern blotting) are not tagged NNP.
  • SYM tags were eliminated as far as possible.
See the annotation guidelines for details. The abstracts were first tagged automatically by the JunK tagger and then corrected by human annotators.

Examples

Corpus format

The corpus is available in two formats, both included in the package available for download below.
  • PTB-like format: The file contains one token/POS pair per line, with a "==========" line (ten equal signs) separating sentences.
  • "Merged" gpml format: The POS information is merged into GENIA corpus ver 3.02 by wrapping each token in a <w> tag, with the POS given as the value of the "c" attribute.
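As an illustration of the PTB-like format, the following is a minimal reading sketch rather than an official tool. It assumes that the token and the POS are separated by a "/" and that the last "/" on a line is the separator (so that tokens which themselves contain "/" stay intact). The function name `read_sentences` and the sample lines are hypothetical and are not taken from the corpus.

```python
from typing import Iterable, Iterator, List, Tuple

# The sentence separator described above: a line of ten equal signs.
SENTENCE_BREAK = "=" * 10


def read_sentences(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield each sentence as a list of (token, POS) pairs."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line == SENTENCE_BREAK:
            if sentence:
                yield sentence
                sentence = []
            continue
        # Assumption: the LAST "/" separates the token from its POS tag.
        token, sep, pos = line.rpartition("/")
        if not sep:
            raise ValueError(f"no token/POS separator in line: {line!r}")
        sentence.append((token, pos))
    if sentence:
        yield sentence  # final sentence without a trailing separator


# Made-up sample lines for illustration only.
sample = """\
IL-2/NN
expression/NN
./.
==========
CD4+/CD8+/NN
cells/NNS
""".splitlines()

for sent in read_sentences(sample):
    print(sent)
```

With real data, the same function could be given an open file object instead of the in-memory sample.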
In the merged format, but not in the PTB-like format, some tokens are assigned "*" as their POS. This occurs when a token is split by the <term> tags assigned by the annotators of the original GENIA corpus. In such cases, the last fragment of the split token carries the POS tag assigned by the POS annotators, and every other fragment is assigned "*", e.g. <w c="*">anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>.
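The "*" convention means that split tokens can be rejoined mechanically. The sketch below is one possible way to do so, not part of the GENIA distribution: it walks the <w> elements in document order, buffers the text of "*" fragments, and attaches it to the next fragment that carries a real POS. The function name `rejoin_tokens` and the wrapping <root> element are assumptions made so that the fragment parses on its own; a real file would be parsed with its own root element.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def rejoin_tokens(fragment: str) -> List[Tuple[str, str]]:
    """Return (token, POS) pairs, merging "*" fragments into the next token."""
    # Assumption: wrap the fragment in a dummy root so it is well-formed XML.
    root = ET.fromstring(f"<root>{fragment}</root>")
    tokens: List[Tuple[str, str]] = []
    pending = ""  # text of "*" fragments waiting for the final fragment
    for w in root.iter("w"):  # document order, regardless of <term> nesting
        text = "".join(w.itertext())
        pos = w.get("c")
        if pos == "*":
            pending += text
        else:
            tokens.append((pending + text, pos))
            pending = ""
    if pending:
        raise ValueError(f"dangling split fragment: {pending!r}")
    return tokens


# The first two <w> elements are the example from the text above;
# the trailing "antibody" token is made up for illustration.
example = (
    "<w c=\"*\">anti-</w><term sem=\"#003\"><w c='JJ'>IgM</w></term>"
    "<w c=\"NN\">antibody</w>"
)
print(rejoin_tokens(example))
```

Here the fragments "anti-" and "IgM" are rejoined into a single token "anti-IgM" tagged JJ, which matches what the PTB-like format would contain for the same token.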

Documentation

Annotation guidelines

  • Tateisi, Yuka and Jun'ichi Tsujii. GENIA Annotation Guidelines for Tokenization and POS tagging. Technical Report (TR-NLP-UT-2006-4). Tsujii Laboratory, University of Tokyo, 2006.

Publications

Download

Acknowledgments

Yuka Tateisi: GENIA part-of-speech corpus annotation coordinator
Tomoko OHTA,
Dec 8, 2011, 9:45 PM