GENIA corpus

The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.

The corpus contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms "human", "blood cells", and "transcription factors". The corpus has been annotated with various levels of linguistic and semantic information.

The primary categories of annotation in the GENIA corpus and the corresponding subcorpora are

Part-of-Speech annotation
Constituency (phrase structure) syntactic annotation
Term annotation
Event annotation
Relation annotation
Coreference annotation

We refer to these pages for descriptions of each of the subcorpora.

Page updated

Report abuse