Starting the manual curation process for your genome

The i5k Workspace@NAL offers the Apollo manual curation tool to genome projects that want to improve the quality of their genome project through manual curation. Because manual curation is a substantial effort on your part, we’d like to discuss how you plan to organize and complete the curation effort before we set up Apollo for you.
1) Decide whether and how you would like to generate an "Official Gene Set" (OGS).
If you plan on manually curating one or more genomes, you should consider how your community plans on handling the manual annotation process. Specifically, have you identified an individual who will serve as coordinator for the manual curation process? Is the end goal of the curation process to obtain an official gene set (OGS)? If so, have you already identified how the OGS will be generated? How many computationally predicted gene sets will be used in the OGS?
We have seen three general ‘models’ for manual curation at the i5k Workspace:

  1. In the first case, the genome or community coordinator wishes to generate an OGS from the manually curated genes and a single, ‘primary’ set of computationally predicted genes. (This does not mean that curators can’t use evidence from other gene sets to curate their models; it refers only to how OGS generation is handled at the end of the manual curation process.)
  2. In the second case, the community prefers not to designate a ‘primary’ set of computationally predicted genes, and instead chooses to computationally integrate all gene predictions and manual curations into one OGS at the end of the manual curation period.
  3. In the third case, the group chooses not to go through the effort of generating an OGS, for example because the community is small enough that a single ‘gold standard’ gene set is unnecessary, or because the group wants to keep manual curations separate from computationally predicted gene models.

We provide some support for the first model (annotating a ‘primary’ gene set), and are working with the Apollo development team on a streamlined method to automatically generate an OGS from a primary gene prediction track and manual curations. While this is being developed, we have implemented a ‘Replaced Models’ field in Apollo that users will need to fill out if their group wants our help in generating an OGS. See https://i5k.nal.usda.gov/web-apollo-replaced-models-field-explanations-and-examples for documentation, and https://i5k.nal.usda.gov/i5k-workspaceweb-apollo-replaced-models-field-faq for background on why we think this is necessary, as opposed to using a simple computational approach based on overlaps. Briefly, the annotator explicitly states which gene prediction, if any, should be removed from the primary predictions when the predictions and manual curations are merged. We are developing an OGS pipeline that uses this information for the i5k pilot, and we can tentatively offer this service to your group, with the large caveat that the process is still under development and fairly time-consuming, so we can’t guarantee completion dates.
Please consider the three ‘models’ listed above and let us know which one you would like to follow (or, alternatively, suggest an entirely different model to us).
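To make the first model concrete, the merge step can be sketched as follows. This is a minimal illustration under stated assumptions, not the i5k Workspace's actual OGS pipeline: the `Curation` class, the `merge_into_ogs` function, and all gene IDs are hypothetical names invented for this example. It assumes each manual curation lists the IDs of the primary predictions it replaces (an empty list for a new gene with no predicted counterpart).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Curation:
    """A manually curated gene model (hypothetical structure)."""
    curated_id: str
    # IDs of primary predictions that the annotator says this curation
    # replaces; empty when the curation has no predicted counterpart.
    replaced_models: List[str] = field(default_factory=list)


def merge_into_ogs(primary_ids: List[str],
                   curations: List[Curation]) -> Tuple[List[str], List[str]]:
    """Merge a primary prediction set with manual curations.

    Returns (ogs_ids, unknown_ids). unknown_ids holds replaced-model IDs
    that do not exist in the primary set, which usually signals a typo.
    """
    primary = set(primary_ids)
    replaced = set()
    for curation in curations:
        replaced.update(curation.replaced_models)

    unknown = sorted(replaced - primary)
    # Keep every primary prediction that no annotator marked as replaced.
    kept = [pid for pid in primary_ids if pid not in replaced]
    ogs = kept + [curation.curated_id for curation in curations]
    return ogs, unknown


# Hypothetical IDs, for illustration only.
primary = ["GENE001-RA", "GENE002-RA", "GENE003-RA", "GENE004-RA"]
curations = [
    Curation("curated_A", ["GENE001-RA"]),                # replaces one model
    Curation("curated_B", ["GENE002-RA", "GENE003-RA"]),  # merges two models
    Curation("curated_C"),                                # new gene
]

ogs, unknown = merge_into_ogs(primary, curations)
print(ogs)
print(unknown)
```

Because the annotator states the replacements explicitly, the merge needs no coordinate-overlap heuristics: a prediction is dropped only when a curation names it, and untouched predictions pass through to the OGS unchanged.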
2) Organize your manual curation community.

  • Identify members of your community who would be willing to annotate individual genes or gene families. It often helps to find communities working on related species that are willing to 'share' the annotation process: a single group of annotators can then curate the same genes or gene families for all of the species.
  • We can announce to our annotator mailing list (now over 300 annotators) that your genomes are open for manual curation and that you are recruiting new annotators. Let us know if you would be interested in this!
  • Consider setting a fixed annotation deadline – this may work well to organize larger annotation groups.

3) Register to access the Apollo manual curation tool for your genome.
Once you have recruited your annotators, they can register to annotate here. We will contact you with each registration request; only annotators approved by you are given login credentials. Alternatively, you can send us a list of annotators (names and email addresses) to approve immediately. Once we have received your approval, each annotator will be sent login credentials and relevant annotation information via email. We will also sign them up for our mailing list.
4) Start curating!

  • The Apollo development team at LBL has put together a comprehensive guide on the manual curation process.
  • Chris Childers at the National Agricultural Library (NAL) has put together an example workflow for manual curation.
  • In collaboration with the i5k pilot project, we have put together a list of ‘Annotation rules’ specific to the i5k pilot that your curation group may want to follow. However, note that we are in the process of re-evaluating these rules based on their success with i5k pilot genomes that are now in the finishing stages of manual curation.

Finally, feel free to contact us with any questions!