Overview
Part-of-speech (POS) tagging is an early step in natural language processing, often performed right after or together with tokenization. After tokenization, every token is assigned a POS label. The GENIA POS annotation generally follows the Penn Treebank POS tagging scheme. The following modifications of this scheme were introduced for the GENIA part-of-speech annotation:
See the annotation guidelines for details. The abstracts are first tagged by the JunK tagger and then corrected by human annotators.
Examples
Corpus format
The corpus is available in two formats, both included in the package available for download below.
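For readers who want to process the merged format, the following is a minimal sketch of how token fragments tagged "*" (the convention explained in the next paragraph) might be rejoined into whole tokens. It relies on several assumptions: the `<w>` element and its `c` attribute are taken from the example on this page, the wrapping `<root>` element is hypothetical and only serves to make the fragment parse, and fragments of one token are assumed to be adjacent in document order. The function name `merge_split_tokens` is illustrative and not part of any GENIA tool.

```python
import xml.etree.ElementTree as ET


def merge_split_tokens(xml_fragment):
    """Return (token, POS) pairs, joining "*" fragments onto the next tagged fragment."""
    # Wrap in a hypothetical root element so the fragment is well-formed XML.
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    tokens = []
    pending = ""  # text of "*" fragments still waiting for their final fragment
    # iter() walks <w> elements in document order, including those inside <term>.
    for w in root.iter("w"):
        text = w.text or ""
        if w.get("c") == "*":
            pending += text
        else:
            # The last fragment carries the original POS tag for the whole token.
            tokens.append((pending + text, w.get("c")))
            pending = ""
    return tokens


example = """<w c="*">anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>"""
print(merge_split_tokens(example))  # [('anti-IgM', 'JJ')]
```

Real corpus files may wrap tokens in additional elements, so the traversal would need adjusting to the actual file layout.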
In the merged format, but not in the PTB-like format, some tokens are assigned "*" as their POS tag. This occurs when a token is split by <term> tags assigned by the annotators of the original GENIA corpus. In such cases, the last fragment of the split token is assigned the original POS tag given by the POS annotators, and the other fragments are assigned "*", e.g. <w c="*">anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>.
Documentation
Annotation guidelines
Publications
Download
Acknowledgments
Yuka Tateisi: GENIA part-of-speech corpus annotation coordinator
See also GENIA Project acknowledgments page