i5k Workspace gene and protein naming guidelines

Rationale. This document outlines the i5k Workspace’s guidelines for gene and protein names and symbols. Our current philosophy is to 1) follow existing, well-established guidelines to the extent possible, while 2) streamlining the process of archiving community-curated annotations at NCBI’s GenBank database. We have chosen the UniProt guidelines (http://www.uniprot.org/docs/nameprot) as a starting point, acknowledging that there are community naming standards for other taxa that may also be worth emulating (a list is available here: http://www.uniprot.org/docs/nomlist.txt). We are open to discussing these guidelines, and to collaborating on i5k-specific naming guidelines in the future if needed.

The i5k Workspace uses the Apollo annotation software (http://genomearchitect.github.io/) for community annotation. Therefore, these guidelines were developed with the Apollo user interface in mind, and contain some instructions that are specific to this platform.

Here, we outline the UniProt naming guidelines in broader strokes (in order to not scare you off), and sometimes quote text from other standards where applicable. Feel free to delve into the source documents for further details. This document will likely change over time as we receive more feedback from the community. Feel free to provide comments on this document (comments will be moderated and visible to the public).

Guidelines for gene symbols:

Symbols are short-form representations (or abbreviations) of the descriptive gene name (see https://www.genenames.org/about/guidelines). They are typically applied only to a gene (rather than to a protein or to isoforms of a gene). We do not recommend coining new symbols for newly named genes. However, if you adopted the name of an orthologous gene, you may also use that gene’s symbol.

Guidelines for gene and protein (mRNA) names:

We highly recommend the UniProt naming guidelines: http://www.uniprot.org/docs/nameprot. The guidelines below are primarily derived from the UniProt guidelines, and are tailored towards typical community annotation at the i5k Workspace.

  • Are you adopting a name from a homolog?

    • We recommend adopting an existing gene name if you have good evidence that the gene is a one-to-one ortholog of a named reference gene. If your gene is not a one-to-one ortholog, but you have good evidence that the gene is a homolog, then we recommend appending the suffix '-like' to the gene name (cf. https://www.genenames.org/about/guidelines). We recommend using OrthoDB (http://www.orthodb.org/) as a starting point for checking homology.
    • If you are uncertain about the gene/mRNA Name, do not add a gene/mRNA Name. You may use the ‘Description’ field instead (e.g. use the description ‘ultraspiracle-like’ instead of the name ‘ultraspiracle-like’).
    • Don’t use ‘similar to’ in the name – this may cause difficulties during submission to NCBI. If you’re not sure of the name, use the ‘Description’ field.
    • Do not add a species prefix (e.g. use ‘ultraspiracle’ instead of ‘Clec-ultraspiracle’). You’re certainly welcome to use prefixes when describing orthologs of multiple species in a publication, if it is necessary to avoid confusion.
    • Please double-check your spelling, and attempt to use the Name exactly as used in the ortholog.
    • Documentation of your justification for the name in the Comments field is welcome (e.g. “88% sequence similarity via blastp to D. melanogaster pepck P20007”).
  • Are you coining a new name for a gene that has not been named yet?

    • Try to choose a name that could be propagated to all orthologous proteins. From UniProt:

      • The protein naming guidelines are based on the premise that a good and stable recommended name (RN) for a protein is a name that is as neutral as possible.
      • An RN should be, as far as possible, unique and attributed to all orthologs.
      • One reason for this is that it should be possible to propagate a protein name to all orthologous proteins, from various organisms. This is why, ideally, the protein name should not contain a specific characteristic of the protein, and in particular it should not reflect the function or role of the protein, nor its subcellular location, its domain structure, its tissue specificity, its molecular weight or its species of origin.
    • From the HGNC: Tissue specificity and molecular weight designations should be avoided as they have only limited use as a description and may in time and across species prove inaccurate; however, they may be incorporated into the gene name if absolutely necessary.
  • Are you naming a gene from a gene family?

    • If a naming convention exists, use it. See http://www.uniprot.org/docs/nomlist.txt for references.
    • If there is no naming convention, UniProt recommends: “For proteins that belong to a multigene family, it is recommended that you choose a coherent nomenclature with numbers to specify the different members of the family.”
  • Are you naming an isoform?

    • Please use the suffix “isoform A”, “isoform B”, etc. NCBI prefers this to “-RA”, etc. The i5k Workspace will still use the “-RA” suffix convention for IDs.
  • Ideally, the protein (mRNA) name is the same as the gene name.
  • Anything else?

    • Don’t hesitate to get in touch with us if you have any questions about naming!
    • Feel free to provide comments on this document below (comments will be moderated and visible to the public).
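
If you prepare names in bulk before entering them in Apollo, a few of the rules above can be checked mechanically. The following is only a minimal Python sketch under stated assumptions, not an official i5k Workspace or Apollo tool: the function names (name_from_homolog, isoform_name, lint_name) and the SPECIES_PREFIXES list are hypothetical, and the checks cover only the ‘-like’ suffix, the “isoform A” convention, the ‘similar to’ rule and the species-prefix rule. It cannot judge orthology or spelling for you.

```python
import re

# Hypothetical list of species prefixes to flag; replace with your own codes.
SPECIES_PREFIXES = ("Clec",)


def name_from_homolog(reference_name: str, one_to_one_ortholog: bool) -> str:
    """Adopt the reference name as-is for a one-to-one ortholog;
    otherwise append the '-like' suffix recommended for other homologs."""
    name = reference_name.strip()
    return name if one_to_one_ortholog else f"{name}-like"


def isoform_name(gene_name: str, index: int) -> str:
    """Build an isoform name such as 'ultraspiracle isoform A' (index 0)."""
    if not 0 <= index < 26:
        raise ValueError("this sketch only covers isoforms A to Z")
    return f"{gene_name} isoform {chr(ord('A') + index)}"


def lint_name(name: str, species_prefixes=SPECIES_PREFIXES) -> list[str]:
    """Return warnings for patterns the guidelines advise against."""
    warnings = []
    # 'similar to' may cause difficulties during submission to NCBI.
    if re.search(r"\bsimilar to\b", name, flags=re.IGNORECASE):
        warnings.append("contains 'similar to'; use the Description field instead")
    # Species prefixes such as 'Clec-' should not be part of the name.
    for prefix in species_prefixes:
        if name.startswith(prefix + "-"):
            warnings.append(f"starts with species prefix '{prefix}-'")
    return warnings


if __name__ == "__main__":
    print(name_from_homolog("ultraspiracle", one_to_one_ortholog=False))
    print(isoform_name("ultraspiracle", 1))
    print(lint_name("Clec-ultraspiracle"))
```

The species-prefix check uses an explicit list rather than a pattern match, because a pattern such as "four letters followed by a hyphen" would also flag legitimate names. An empty list from lint_name only means that none of these particular patterns were found, not that the name is correct.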