Genia event extraction (GE) task, 2016


Reference data sets

As a reference data set, 20 full-paper articles with relevant annotations are provided to the participants. The 20 articles are sourced from the PubMed Central Open Access subset (hereafter PMCOA).

The following layers of annotation are initially provided to the participants:

1. Event annotation is Genia-style event annotation. It is a revised version of the GE 2016 event annotation data set.

  • It is a full manual annotation.
  • In total, 7721 spans are annotated as protein names.
  • Among them, 6551 spans are also annotated with UniProt IDs by UniProt annotation (see below).
  • There are 700 unique spans annotated as protein names.

2. UniProt annotation links protein names in text to UniProt IDs.

  • It is an automatic annotation based on a dictionary and simple string similarity computation.
  • The dictionary is manually compiled for the 20 full-paper articles.
  • In total, 8334 spans are annotated with 220 UniProt IDs.
  • Among them, 6651 spans are also annotated as protein names by the event annotation.

3. Coreference annotation is a kind of linguistic annotation for anaphora resolution. Anaphors bound by protein or event references are annotated.

  • It is a semi-automatic annotation.
  • In total, 338 anaphora structures are annotated for 218 anaphor expressions, which are bound by 320 antecedents.
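The overlap figures quoted for the event and UniProt layers (spans carrying both a protein-name annotation and a UniProt ID) could be recomputed along the following lines. This is only a minimal sketch: it assumes that "also annotated" means an exact match of character offsets, which the task description does not state, and the span lists are toy values rather than data from the reference set. The function name `shared_spans` is hypothetical.

```python
def shared_spans(layer_a, layer_b):
    """Return the (begin, end) spans that occur in both annotation layers.

    Assumption: two annotations refer to the same mention only when their
    character offsets match exactly.
    """
    return set(layer_a) & set(layer_b)


# Toy spans for illustration only; not taken from the GE data set.
protein_spans = [(0, 4), (10, 15), (20, 26)]
uniprot_spans = [(10, 15), (20, 26), (30, 35)]

common = shared_spans(protein_spans, uniprot_spans)
print(len(common), "spans are annotated in both layers")
```

A looser matching rule (for example, partial overlap) would yield different counts, so the rule should be checked against the data before comparing with the published numbers.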

Supporting resources

Some supporting resources will also be provided. These will include additional (automatic) annotations, dictionaries, and so on, all of which are in the public domain.

Currently, one set of supporting resources has been prepared, and more will be added.

1. Enju predicate-argument structure annotation consists of syntactic parsing results produced by Enju.


The GE 2016 data sets are primarily provided in PubAnnotation JSON format.

Its conversion to BioNLP-ST annotation format will also be provided.
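To give a feel for the relationship between the two formats, the sketch below reads a PubAnnotation-style JSON document and prints its text-bound annotations as BioNLP-ST-style standoff lines. It is not the organizers' converter: it handles only the `denotations` layer (not relations or events), the sample document is invented for illustration, and the helper name `denotations_to_standoff` is hypothetical.

```python
import json


def denotations_to_standoff(doc):
    """Render PubAnnotation denotations as 'ID<TAB>TYPE BEGIN END<TAB>TEXT' lines."""
    text = doc["text"]
    lines = []
    for d in doc.get("denotations", []):
        begin, end = d["span"]["begin"], d["span"]["end"]
        # The covered text is recovered from the document text via the offsets.
        lines.append(f'{d["id"]}\t{d["obj"]} {begin} {end}\t{text[begin:end]}')
    return lines


# Invented sample document, used only to exercise the function.
sample = json.loads('''{
  "text": "IL-2 activates STAT5.",
  "denotations": [
    {"id": "T1", "span": {"begin": 0, "end": 4}, "obj": "Protein"},
    {"id": "T2", "span": {"begin": 15, "end": 20}, "obj": "Protein"}
  ]
}''')

for line in denotations_to_standoff(sample):
    print(line)
```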

Data set

Benchmark reference data set:
Benchmark test data set:

Initial Knowledge Base

The initial KB is populated with the above two resources. From the front page of the KB, SPARQL queries can be issued to search it. Some example queries are also provided for reference.


The knowledge base will be evaluated by counting correct answers to a set of queries.
The base set of queries will include (but is not limited to) the following:

  • Find protein XX
  • Find event YY on protein XX
  • Find regulations on protein XX
  • Find regulations on YY event on protein XX
  • Find the proteins that regulate protein XX
  • Find the proteins that protein XX regulates
  • Find the proteins that regulate YY event of protein XX
  • Find the proteins that bind to protein XX

In the queries, XX can be replaced with any of the annotated UniProt IDs, and YY can be replaced with any of the event types.
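As an illustration of how the XX and YY placeholders expand into concrete queries, the sketch below instantiates the eight base templates for a given list of UniProt IDs and event types. The IDs and the event type in the usage example are placeholders chosen for illustration, not values drawn from the annotation, and the function name `instantiate` is hypothetical. The output is the natural-language form of each query; translating it into SPARQL depends on the KB schema, which is not described here.

```python
from itertools import product

# The base query templates, with XX and YY written as format fields.
TEMPLATES = [
    "Find protein {xx}",
    "Find event {yy} on protein {xx}",
    "Find regulations on protein {xx}",
    "Find regulations on {yy} event on protein {xx}",
    "Find the proteins that regulate protein {xx}",
    "Find the proteins that protein {xx} regulates",
    "Find the proteins that regulate {yy} event of protein {xx}",
    "Find the proteins that bind to protein {xx}",
]


def instantiate(templates, uniprot_ids, event_types):
    """Expand every template over all UniProt IDs (and event types where used)."""
    queries = []
    for template in templates:
        if "{yy}" in template:
            # Templates mentioning an event type need every (ID, type) pair.
            for xx, yy in product(uniprot_ids, event_types):
                queries.append(template.format(xx=xx, yy=yy))
        else:
            for xx in uniprot_ids:
                queries.append(template.format(xx=xx))
    return queries


# Placeholder inputs for illustration only.
queries = instantiate(TEMPLATES, ["P60568", "P42229"], ["Phosphorylation"])
for q in queries:
    print(q)
```

With two IDs and one event type this produces 16 query instances; the number grows with the product of IDs and event types for the three templates that mention YY.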

The queries for evaluation are not limited to the ones above. Because the GE 2016 task pursues "open evaluation", participants may propose other useful queries and evaluate their systems on them. The overall goal of the GE task is to demonstrate the usefulness of text mining for KB construction.