VariationDescriptor

Variation Descriptors are part of the VRSATILE framework, a set of conventions extending the GA4GH Variation Representation Specification (VRS). Descriptors allow for the complemetary use of human-readable labels, descriptions, alternate contexts, and identifier cross-references alongside the precise computable representation of variation concepts provided by VRS.

Consequently, many forms and formats of variation can be used in variation descriptors, including HGVS descriptions, VCF Records, and SPDI alleles. We recommend the use of VRS Variation objects for representing variants when possible.

The Variation Descriptor should be used to describe candidate variants or diagnosed causative variants. The VariationDescriptor element itself is an element of a VariantInterpretation. If it is present, the Phenopacket standard has the following requirements.

A variation can refer to an external source, for example the ClinGen allele registry, ClinVar, dbSNP, dbVAR etc. using the id field. It is RECOMMENDED to use a CURIE identifier and corresponding Resource. When indicating multiple alternate ids for a variation, use the alternate_ids field.

Multiple alleles in-cis can be modeled as a VRS Haplotype.

The zygosity of the variant as determined in all of the samples represented in this Phenopacket is represented using a list of terms taken from the Genotype Ontology (GENO). For instance, if a variant affects one of two alleles at a certain locus, we could record the zygosity using the term heterozygous (GENO:0000135). This value is stored in the Variation Descriptor alleleic_state field.

Data model

Field Type Multiplicity Description
id string 1..1 Descriptor ID; MUST be unique within document. REQUIRED.
variation Variation 0..1 The VRS Variation object
label string 0..1 A primary label for the variation
description string 0..1 A free-text description of the variation
gene_context GeneDescriptor 0..1 A specific gene context that applies to this variant
expressions Expression 0..* HGVS, SPDI, and gnomAD-style strings should be represented as Expressions
vcf_record VcfRecord 0..1 A VCF Record of the variant. This SHOULD be a single allele, the VCF genotype (GT) field should be represented in the allelic_state
xrefs string 0..* List of CURIEs representing associated concepts. Allele registry, ClinVar, or other related IDs should be included as xrefs
alternate_labels string 0..* Common aliases for a variant, e.g. EGFR vIII, are alternate labels
extensions Extension 0..* List of resource-specific Extensions needed to describe the variation
molecule_context MoleculeContext 1..1 The molecular context of the vrs variation.
structural_type OntologyClass 0..1 The structural variant type associated with this variant, such as a substitution, deletion, or fusion. We RECOMMEND using a descendent term of SO:0001537.
vrs_ref_allele_seq string 0..1 A Sequence corresponding to a “ref allele”, describing the sequence expected at a SequenceLocation reference.
allelic_state OntologyClass 0..1 See allelic_state below. RECOMMENDED.

Variation

VRS is a GA4GH standard which provides a computable representation of variation, be it a genomic, transcript or protein variation. VRS also provides mechanisms for representing haplotypes and systemic variation such as Copy Number Variants (CNVs).

VcfRecord

This element is used to describe variants using the Variant Call Format, which is in near universal use for exome, genome, and other Next-Generation-Sequencing-based variant calling. It is an appropriate option to use for variants reported according to their chromosomal location as derived from a VCF file.

In the Phenopacket format, it is expected that one VcfRecord message described a single allele (in contrast to the actual VCF format that allows multiple alleles at the same position to be reported on the same line; to report these in Phenopacket format, two VariantDescriptor messages would be required). In general the VcfRecord should be used only for the purposes of reporting variants of specific interest, such as in the VariantInterpretation, for cases requiring larger numbers of variants in VCF format, the File should be used.

For structural variation the INFO field should contain the relevant information . In general, the info field should only be used to report structural variants and it is not expected that the Phenopacket will report the contents of the info field for single nucleotide and other small variants.

Field Type Multiplicity Description
genome_assembly string 1..1 Identifier for the genome assembly used to call the allele. REQUIRED.
chrom string 1..1 Chromosome or contig identifier. REQUIRED.
pos int 1..1 The reference position, with the 1st base having position 1. REQUIRED.
id string 0..1 Identifier: Semicolon-separated list of unique identifiers where available. If this is a dbSNP variant thers number(s) should be used.
ref string 1..1 Reference base. REQUIRED.
alt string 1..1 Alternate base. REQUIRED.
qual string 0..1 Quality: Phred-scaled quality score for the assertion made in ALT.
filter string 0..1 Filter status: PASS if this position has passed all filters.
info string 0..1 Additional information: Semicolon-separated series of additional information fields

Extension

The Extension class provides a means to extend descriptions with other attributes unique to a content provider. These extensions are not expected to be natively understood by all users, but may be used for pre-negotiated exchange of message attributes when needed.

Field Type Multiplicity Description
name string 1..1 A name for the Extension. REQUIRED.
value string 1..1 A string representation of the user-defined object. REQUIRED.

Expression

The Expression class is designed to enable descriptions based on a specified nomenclature or syntax for representing an object. Common examples of expressions for the description of molecular variation include the HGVS and ISCN nomenclatures.

We RECOMMEND the use one of the following values in the syntax field: hgvs, iscn, spdi

Field Type Multiplicity Description
syntax string 1..1 A name for the expression syntax. REQUIRED.
value string 1..1 The concept expression as a string. REQUIRED.
version string 0..1 An optional version of the expression syntax.

MoleculeContext

The molecular context of the variant. Default is unspecified_molecule_context.

Examples

In these examples we will show how the ClinVar allele 13294 can be represented using a VariationDescriptor. While it is possible to combine all these in a single message, we have separated them for clarity.

VRS

Here we’re representing the genomic variation using VRS, however VRS is capable of representing the variation in genomic, transcript or protein coordinates.

Example

variationDescriptor:
  id: "clinvar:13294"
  variation:
    allele:
      sequenceLocation:
        sequenceId: "NC_000010.11"
        sequenceInterval:
          startNumber:
            value: "121496700"
          endNumber:
            value: "121496701"
      literalSequenceExpression:
        sequence: "G"
  moleculeContext: "genomic"
  vrsRefAlleleSeq: "T"
  allelicState:
    id: "GENO:0000135"
    label: "heterozygous"

HGVS

Variants can be represented using the HGVS nomenclature as follows.

For example, the Human Genome Variation Society (HGVS) expression NM_000226.3:c.470T>G indicates that a T at position 470 of the sequence represented by version 3 of NM_000226 (which is the mRNA of the human keratin 9 gene KRT9).

We recommend using a tool such as VariantValidator or Mutalyzer to validate the HGVS string. See the HGVS recommendations for details about the HGVS nomenclature.

Example

variationDescriptor:
  id: "clinvar:13294"
  expressions:
  - syntax: "hgvs"
    value: "NM_000226.3:c.470T>G"
  allelicState:
    id: "GENO:0000135"
    label: "heterozygous"

VCF

Example

variationDescriptor:
    id: "clinvar:13294"
    vcfRecord:
        genomeAssembly: "GRCh38"
        chrom: "10"
        pos: 121496701
        id: "rs121918506"
        ref: "T"
        alt: "G"
        qual: "."
        filter: "."
        info: "."
    zygosity:
        id: "GENO:0000135"
        label: "heterozygous"

SPDI

The Sequence Position Deletion Insertion (SPDI) notation is a relatively new notation which uses the same normalisation protocol as VRS. We recommend that users familiarize themselves with this relatively new notation, which differs in important ways from other standards such as VCF and HGVS.

Tools for interconversion between SPDI, HGVS and VCF exist at the NCBI.

SPDI stands for

  1. S = SequenceId
  2. P = Position , a 0-based coordinate for where the Deleted Sequence starts
  3. D = DeletedSequence , sequence for the deletion, can be empty
  4. I = InsertedSequence , sequence for the insertion, can be empty

For instance, Seq1:4:A:G refers to a single nucleotide variant at the fifth nucleotide ( nucleotide 4 according to zero-based numbering) from an A to a G. See the SPDI webpage for more examples.

The SPDI notation represents variation as deletion of a sequence (D) at a given position (P) in reference sequence (S) followed by insertion of a replacement sequence (I) at that same position. Position 0 indicates a deletion that starts immediately before the first nucleotide, and position 1 represents a deletion interval that starts between the first and second residues, and so on. Either the deleted or the inserted interval can be empty, resulting in a pure insertion or deletion.

Note that the deleted and inserted sequences in SPDI are all written on the positive strand for two-stranded molecules.

Example

variationDescriptor:
  id: "clinvar:13294"
  expressions:
  - syntax: "spdi"
    value: "NC_000010.11:121496700:T:G"
  allelicState:
    id: "GENO:0000135"
    label: "heterozygous"

ISCN

The International System for Human Cytogenetic Nomenclature (ISCN), an international standard for human chromosome nomenclature, which includes band names, symbols and abbreviated terms used in the description of human chromosome and chromosome abnormalities.

For example del(6)(q23q24) describes a deletion from band q23 to q24 on chromosome 6.

Example

variationDescriptor:
  id: "id:A"
  expressions:
  - syntax: "iscn"
    value: "t(8;9;11)(q12;p24;p12)"

allelic_state

The zygosity of the variant as determined in all of the samples represented in this Phenopacket is represented using a list of terms taken from the Genotype Ontology (GENO). For instance, if a variant affects one of two alleles at a certain locus, we could record the zygosity using the term heterozygous (GENO:0000135).