Exhaustive PTM corpus


Event extraction approaches to biomolecular information extraction have demonstrated high reliability for the extraction of protein post-translational modification (PTM) events, with the best extraction performance for the single PTM type (phosphorylation) considered in the BioNLP Shared Task (ST) 2009 exceeding 80% F-score and the best performance for the 12 PTM types considered in the BioNLP ST 2011 EPI task approaching 70% F-score.

Nevertheless, the full set of PTM types is much larger than that considered in these efforts: by some estimates, as many as 300. To build on the successes of event extraction technology for PTM extraction and move towards systems capable of effectively exhaustive extraction of PTM mentions, we annotated this targeted corpus for nearly 40 of the most frequently discussed PTM types using the GENIA / BioNLP ST'11 EPI event representation. The included PTM types are estimated to cover between 97.5% and 99.6% of PTM mention instances.


Corpus format

The corpus is distributed in the BioNLP Shared Task-flavored standoff format.
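
As a rough illustration of how annotations in this style of standoff format can be read, the following Python sketch parses text-bound ("T") and event ("E") lines. It assumes the usual BioNLP Shared Task conventions: tab-separated fields, continuous character offsets, and event lines of the form "Type:TriggerID Role:ArgID ...". The names parse_standoff, TextBound and Event and the sample lines are hypothetical and are not taken from the corpus or its tools. Other line types (for example modifications or equivalences) are simply skipped.

```python
from dataclasses import dataclass, field


@dataclass
class TextBound:
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Event:
    id: str
    type: str
    trigger: str
    # (role, target_id) pairs, e.g. ("Theme", "T1")
    args: list = field(default_factory=list)


def parse_standoff(lines):
    """Parse T (text-bound) and E (event) lines; skip everything else.

    Assumes continuous spans ("Type start end"); discontinuous spans
    are not handled in this sketch.
    """
    textbounds, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T") and len(fields) >= 2:
            ann_type, start, end = fields[1].split(" ")
            text = fields[2] if len(fields) > 2 else ""
            textbounds[ann_id] = TextBound(ann_id, ann_type, int(start), int(end), text)
        elif ann_id.startswith("E") and len(fields) >= 2:
            parts = fields[1].split()
            ev_type, trigger = parts[0].split(":", 1)
            args = [tuple(p.split(":", 1)) for p in parts[1:]]
            events[ann_id] = Event(ann_id, ev_type, trigger, args)
    return textbounds, events


# Made-up sample for the sentence "histone H3 is acetylated" (not corpus data).
sample = [
    "T1\tProtein 0 10\thistone H3",
    "T2\tAcetylation 14 24\tacetylated",
    "E1\tAcetylation:T2 Theme:T1",
]

textbounds, events = parse_standoff(sample)
for ev in events.values():
    themes = [textbounds[target].text for role, target in ev.args if role == "Theme"]
    print(ev.type, textbounds[ev.trigger].text, themes)
```

In practice the entity and event annotations are typically split across separate files alongside the document text, so the lines from those files would be concatenated before being passed to a function like this one.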

Annotation guidelines

The corpus is annotated following the GENIA Event corpus annotation guidelines (cited below), adapted as described in "Towards Exhaustive Protein Modification Event Extraction".
  • Tomoko Ohta, Jin-Dong Kim and Jun’ichi Tsujii, Guidelines for event annotation, University of Tokyo Technical Report, 2007.