Accepted data types: Annotations/gene predictions

  1. Accepted file formats

    1. GFF3 (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)

    2. BED (https://genome.ucsc.edu/FAQ/FAQformat.html#format1)

  2. Constraints

    1. All annotations must be submitted to us on the coordinate system of the assembly that we host (e.g. GenBank). You can find the assembly in our data downloads section, or contact us.

    2. If you did not use this assembly, there may be problems displaying the mapped data in the JBrowse genome browser. In this case, we may provide guidance or assistance on how to transfer coordinates or re-map your annotations.
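Before submitting, a quick self-check along the following lines may help confirm that a GFF3 file is on the same coordinate system as the hosted assembly. This is only a minimal sketch, not the check we run: the function names `read_fasta_lengths` and `check_gff3_coordinates` are hypothetical, and it assumes the assembly is available as a FASTA file whose sequence names match the first column of the GFF3 file. It does not handle circular sequences or non-numeric coordinates.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence name: length} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first word after '>'.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def check_gff3_coordinates(gff3_lines: Iterable[str],
                           lengths: Dict[str, int]) -> List[str]:
    """List features whose sequence or coordinates do not fit the assembly."""
    problems: List[str] = []
    for number, line in enumerate(gff3_lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence section; no more features follow
        if not line or line.startswith("#"):
            continue  # blank lines, comments, and directives
        columns = line.split("\t")
        if len(columns) != 9:
            problems.append(f"line {number}: expected 9 tab-separated columns")
            continue
        seqid, start, end = columns[0], int(columns[3]), int(columns[4])
        if seqid not in lengths:
            problems.append(
                f"line {number}: sequence '{seqid}' is not in the assembly")
        elif start < 1 or start > end or end > lengths[seqid]:
            problems.append(
                f"line {number}: {start}-{end} does not fit '{seqid}' "
                f"(length {lengths[seqid]})")
    return problems
```

Both functions accept any iterable of lines, so open file handles can be passed directly. An empty result only shows that sequence names and coordinate ranges are plausible; it does not prove that the annotations were produced on this exact assembly version.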

  3. Acknowledgement of data source

    1. We list the annotation data source in the annotation analysis page (e.g. https://i5k.nal.usda.gov/annotations/85466).

    2. The data source is also listed in the ‘About this track’ section in the genome browser track display.

    3. We require and display submitter contact information (name and affiliation).

  4. Annotation Categories

    1. Official Gene Sets (OGS)

      1. Definition: Official Gene Sets (OGS) are designated by the project coordinator as the definitive “best” gene set for this species. There are no requirements other than coordinator approval. However, Official Gene Sets are often a synthesis of several different gene prediction programs and manual curation. The i5k Workspace creates gene pages for all genes in an OGS, and the information for these genes is searchable.

      2. Official Gene Sets generated by the NAL result from a pipeline (under development) that performs QC and merges manual curations from Apollo with a single additional gene set. See the GitHub page for more information: https://github.com/NAL-i5K/I5KNAL_OGS.

      3. We strive to create searchable gene pages for each gene in an OGS (e.g. https://i5k.nal.usda.gov/AGLA025319).

      4. OGS Gene identifier policy

        1. We follow the Sequence Ontology definition of the ID attribute.

        2. We generally do not provide stable gene identifiers. We also do not currently map IDs between assembly or annotation versions.

        3. We aim to maintain the IDs provided to us, as long as the GFF3 file that contains them complies with the Sequence Ontology GFF3 specification. In cases where IDs conflict with other IDs or are otherwise problematic, we may assign new IDs (after consultation with the file submitter and the project contact).

        4. GFF3 files obtained from GenBank will be re-formatted as follows:

          1. Gene IDs are replaced with the value of the “gene” attribute.

          2. Transcript IDs are replaced with the value of the “transcript” attribute.

          3. If identifiers for child features are needed but not present, we will generate new IDs derived from the parent identifiers in a consistent manner.
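The GenBank re-formatting rules above can be illustrated with a short sketch. This is not our production code: the function names, the sets of feature types treated as genes and transcripts, and the `parent-type-number` pattern for generated child IDs are all assumptions made for illustration. The sketch also updates Parent attributes so that they keep pointing at the renamed features, and it does not detect cases where two features would end up with the same ID.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed feature types; real files may use more (e.g. pseudogene, ncRNA).
GENE_TYPES = {"gene"}
TRANSCRIPT_TYPES = {"mRNA", "transcript"}


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attributes column into an ordered {tag: value} mapping."""
    fields = (f.split("=", 1) for f in column9.split(";") if "=" in f)
    return {tag: value for tag, value in fields}


def format_attributes(attributes: Dict[str, str]) -> str:
    return ";".join(f"{tag}={value}" for tag, value in attributes.items())


def reformat_ids(gff3_lines: Iterable[str]) -> List[str]:
    rows = [line.rstrip("\n").split("\t") for line in gff3_lines]

    # Pass 1: map each original gene/transcript ID to its replacement.
    id_map: Dict[str, str] = {}
    for columns in rows:
        if len(columns) != 9:
            continue
        attributes = parse_attributes(columns[8])
        if columns[2] in GENE_TYPES:
            tag = "gene"
        elif columns[2] in TRANSCRIPT_TYPES:
            tag = "transcript"
        else:
            continue
        if "ID" in attributes and tag in attributes:
            id_map[attributes["ID"]] = attributes[tag]

    # Pass 2: rewrite ID and Parent; generate IDs for children that lack one.
    counters: Dict[Tuple[str, str], int] = {}
    output: List[str] = []
    for columns in rows:
        if len(columns) != 9:
            output.append("\t".join(columns))  # directives and comments
            continue
        attributes = parse_attributes(columns[8])
        parents: List[str] = []
        if "Parent" in attributes:
            parents = [id_map.get(p, p) for p in attributes["Parent"].split(",")]
            attributes["Parent"] = ",".join(parents)
        if "ID" in attributes:
            attributes["ID"] = id_map.get(attributes["ID"], attributes["ID"])
        elif parents:
            # Hypothetical naming scheme: <first parent>-<type>-<running number>.
            key = (parents[0], columns[2])
            counters[key] = counters.get(key, 0) + 1
            new_id = f"{parents[0]}-{columns[2]}-{counters[key]}"
            attributes = {"ID": new_id, **attributes}
        output.append("\t".join(columns[:8] + [format_attributes(attributes)]))
    return output
```

The two-pass design is used because a child feature can only be re-parented once the new ID of its parent is known, regardless of the order in which features appear in the file.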

      5. Additional file modifications

        1. Occasionally, we will need to modify the original files provided to us in order to meet the requirements of our databases and applications. In this case, we will document the changes made and provide them in a readme file in our data downloads section.

    2. Primary Gene Sets

      1. Definition: Primary Gene Sets are designated by the project coordinator as the gene set that should be curated in Apollo.

      2. Primary Gene Sets are only visualized in the genome browser; we do not import primary gene sets into our database for longer-term storage. As such, we do not change the formatting or content of the file unless there are problems with the JBrowse display, or we anticipate problems during the manual curation effort of the primary gene set. Changes are reported to the file provider, and are listed in a readme file in our ‘data downloads’ section.

    3. Additional Gene Sets and Annotation Projects

      1. Definition: Any gene set that is not a primary or official gene set.

      2. Additional Gene Sets and Annotation Projects are only visualized in the genome browser; we do not import them into our database for longer-term storage. As such, we do not change the formatting or content of the file unless there are problems with the JBrowse display, or we anticipate problems during the manual curation effort of the primary gene set. Changes are reported to the file provider, and are listed in a readme file in our ‘data downloads’ section.

    4. Manually curated annotations derived from Apollo

      1. Current Policy

        1. Active projects

          1. Active projects are open for curation under an active community curation team.

          2. The NAL creates regular backups of the annotation files. All curators have full access to the curated genes.

        2. Finished projects

          1. A project is ‘finished’ when it is deemed as such by the project coordinator.

          2. Official Gene Set development at the NAL: We are actively developing a pipeline that performs QC on Apollo GFF3 output and merges the Apollo output with the additional gene set that was designated for curation. We aim to provide this service when requested, but because the pipeline is still under development, we cannot make any guarantees as to the completion date or quality of the resulting gene set.

          3. For all finished projects, we:

            1. Provide the project coordinator with the final GFF3 from Apollo;

            2. Store the final GFF3 from Apollo in our records; and

            3. Clear out all annotations from the user-created annotations track in Apollo.

        3. Orphaned projects

          1. “Orphaned” projects are Apollo curation projects that are open for curation but that do not have an active project coordinator.

          2. Annotations in Apollo will be maintained by the NAL, but there are no guarantees that the annotations will be quality-controlled, integrated into an Official Gene Set, or deposited into an official repository.