VariationDescriptor

Variation Descriptors are part of the VRSATILE framework, a set of conventions extending the GA4GH Variation Representation Specification (VRS). Descriptors allow for the complementary use of human-readable labels, descriptions, alternate contexts, and identifier cross-references alongside the precise computable representation of variation concepts provided by VRS.

Consequently, many forms and formats of variation can be used in variation descriptors, including HGVS descriptions, VCF Records, and SPDI alleles. We recommend the use of VRS Variation objects for representing variants when possible.

The Variation Descriptor should be used to describe candidate variants or diagnosed causative variants. The VariationDescriptor element itself is an element of a VariantInterpretation. If it is present, the Phenopacket standard has the following requirements.

A variation can refer to an external source, for example the ClinGen allele registry, ClinVar, dbSNP, or dbVar, using the id field. It is RECOMMENDED to use a CURIE identifier and corresponding Resource. When indicating multiple alternate ids for a variation, use the xrefs field.
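
The CURIE recommendation above can be checked mechanically. The following is a minimal sketch, not part of the standard: the helper names are hypothetical, and it assumes that each Resource is available as a dictionary with a namespacePrefix key, as in the JSON form of the Phenopacket Resource element.

```python
from typing import Optional


def curie_prefix(identifier: str) -> Optional[str]:
    """Return the prefix of a CURIE such as 'clinvar:13294', or None."""
    prefix, separator, reference = identifier.partition(":")
    if not separator or not prefix or not reference:
        return None  # not of the form prefix:reference
    return prefix


def has_matching_resource(descriptor_id: str, resources: list) -> bool:
    """Check that the id is a CURIE whose prefix is declared by a Resource.

    Assumption: each resource is a dict carrying a 'namespacePrefix' key.
    """
    prefix = curie_prefix(descriptor_id)
    if prefix is None:
        return False
    declared = {resource.get("namespacePrefix") for resource in resources}
    return prefix in declared
```

With a resource list that declares only the clinvar prefix, the id clinvar:13294 passes the check, whereas a bare rs number or an id with an undeclared prefix does not.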

Multiple alleles in-cis can be modeled as a VRS Haplotype.

The zygosity of the variant as determined in all of the samples represented in this Phenopacket is represented using a list of terms taken from the Genotype Ontology (GENO). For instance, if a variant affects one of two alleles at a certain locus, we could record the zygosity using the term heterozygous (GENO:0000135). This value is stored in the Variation Descriptor allelic_state field.

Data model

Field	Type	Multiplicity	Description
id	string	1..1	Descriptor ID; MUST be unique within document. REQUIRED.
variation	Variation	0..1	The VRS `Variation` object
label	string	0..1	A primary label for the variation
description	string	0..1	A free-text description of the variation
gene_context	GeneDescriptor	0..1	A specific gene context that applies to this variant
expressions	Expression	0..*	HGVS, SPDI, and gnomAD-style strings should be represented as Expressions
vcf_record	VcfRecord	0..1	A VCF Record of the variant. This SHOULD be a single allele; the VCF genotype (GT) field should be represented in the allelic_state field
xrefs	string	0..*	List of CURIEs representing associated concepts. Allele registry, ClinVar, or other related IDs should be included as xrefs
alternate_labels	string	0..*	Common aliases for a variant, e.g. EGFR vIII, are alternate labels
extensions	Extension	0..*	List of resource-specific Extensions needed to describe the variation
molecule_context	MoleculeContext	1..1	The molecular context of the VRS variation.
structural_type	OntologyClass	0..1	The structural variant type associated with this variant, such as a substitution, deletion, or fusion. We RECOMMEND using a descendant term of SO:0001537.
vrs_ref_allele_seq	string	0..1	A Sequence corresponding to a “ref allele”, describing the sequence expected at a SequenceLocation reference.
allelic_state	OntologyClass	0..1	See allelic_state below. RECOMMENDED.

Variation

VRS is a GA4GH standard which provides a computable representation of variation, be it a genomic, transcript or protein variation. VRS also provides mechanisms for representing haplotypes and systemic variation such as Copy Number Variants (CNVs).

Note

When introduced in Phenopacket Schema v2, a protobuf version of VRS (github.com/ga4gh/vrs-protobuf) was derived from the source VRS representation in JSON schema and used for phenopackets. The vrs-protobuf message structure is losslessly transformable but syntactically distinct from the native VRS JSON schema.

VcfRecord

This element is used to describe variants using the Variant Call Format, which is in near universal use for exome, genome, and other Next-Generation-Sequencing-based variant calling. It is an appropriate option to use for variants reported according to their chromosomal location as derived from a VCF file.

In the Phenopacket format, one VcfRecord message is expected to describe a single allele. This contrasts with the VCF format itself, which allows multiple alleles at the same position to be reported on the same line; reporting two such alleles in Phenopacket format requires two VariationDescriptor messages. In general, the VcfRecord should be used only for reporting variants of specific interest, such as in the VariantInterpretation. For cases requiring larger numbers of variants in VCF format, the File element should be used.
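
As an illustration of the one-allele-per-message rule, the sketch below splits a multi-allelic VCF data line into one descriptor dictionary per ALT allele. It is a simplified example under stated assumptions: the function name and the generated "variant-N" ids are hypothetical, only the first five VCF columns are read, and no per-allele normalisation (such as trimming bases shared by REF and ALT) is attempted.

```python
def split_vcf_alleles(genome_assembly: str, vcf_line: str) -> list:
    """Return one VariationDescriptor-like dict per ALT allele of a VCF line."""
    # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    chrom, pos, _vcf_id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    descriptors = []
    for index, alt_allele in enumerate(alt.split(","), start=1):
        descriptors.append({
            "id": f"variant-{index}",  # hypothetical document-local id
            "vcfRecord": {
                "genomeAssembly": genome_assembly,
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt_allele,  # exactly one allele per record
            },
        })
    return descriptors
```

A line whose ALT column is "G,C" therefore yields two descriptors that share chrom, pos, and ref but differ in alt.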

For structural variation, the INFO field should contain the relevant information. In general, the info field should only be used to report structural variants; the Phenopacket is not expected to report the contents of the info field for single nucleotide and other small variants.

Field	Type	Multiplicity	Description
genome_assembly	string	1..1	Identifier for the genome assembly used to call the allele. REQUIRED.
chrom	string	1..1	Chromosome or contig identifier. REQUIRED.
pos	int	1..1	The reference position, with the 1st base having position 1. REQUIRED.
id	string	0..1	Identifier: Semicolon-separated list of unique identifiers where available. If this is a dbSNP variant, the rs number(s) should be used.
ref	string	1..1	Reference base. REQUIRED.
alt	string	1..1	Alternate base. REQUIRED.
qual	string	0..1	Quality: Phred-scaled quality score for the assertion made in ALT.
filter	string	0..1	Filter status: PASS if this position has passed all filters.
info	string	0..1	Additional information: Semicolon-separated series of additional information fields

Extension

The Extension class provides a means to extend descriptions with other attributes unique to a content provider. These extensions are not expected to be natively understood by all users, but may be used for pre-negotiated exchange of message attributes when needed.

Field	Type	Multiplicity	Description
name	string	1..1	A name for the Extension. REQUIRED.
value	string	1..1	A string representation of the user-defined object. REQUIRED.

Expression

The Expression class is designed to enable descriptions based on a specified nomenclature or syntax for representing an object. Common examples of expressions for the description of molecular variation include the HGVS and ISCN nomenclatures.

We RECOMMEND the use of one of the following values in the syntax field: hgvs, iscn, spdi.

Field	Type	Multiplicity	Description
syntax	string	1..1	A name for the expression syntax. REQUIRED.
value	string	1..1	The concept expression as a string. REQUIRED.
version	string	0..1	An optional version of the expression syntax.
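
The table above can be mirrored in a small helper. This is an illustrative sketch only: the function names are hypothetical, and the check against the recommended syntax values is advisory, since the recommendation does not forbid other syntaxes.

```python
RECOMMENDED_SYNTAXES = ("hgvs", "iscn", "spdi")


def make_expression(syntax: str, value: str, version=None) -> dict:
    """Build an Expression dict; syntax and value are REQUIRED."""
    if not syntax or not value:
        raise ValueError("syntax and value are REQUIRED")
    expression = {"syntax": syntax, "value": value}
    if version is not None:
        expression["version"] = version  # optional field, omitted when unset
    return expression


def uses_recommended_syntax(expression: dict) -> bool:
    """True if the syntax is one of the RECOMMENDED values."""
    return expression["syntax"] in RECOMMENDED_SYNTAXES
```

For instance, make_expression("hgvs", "NM_000226.3:c.470T>G") produces a dictionary with only the two required keys.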

MoleculeContext

The molecular context of the variant. Default is unspecified_molecule_context.

Examples

In these examples we will show how the ClinVar allele 13294 can be represented using a VariationDescriptor. While it is possible to combine all these in a single message, we have separated them for clarity.

VRS

Here we represent the genomic variation using VRS. VRS is also capable of representing the variation in transcript or protein coordinates.

Example

variationDescriptor:
  id: "clinvar:13294"
  variation:
    allele:
      sequenceLocation:
        sequenceId: "NC_000010.11"
        sequenceInterval:
          startNumber:
            value: "121496700"
          endNumber:
            value: "121496701"
      literalSequenceExpression:
        sequence: "G"
  moleculeContext: "genomic"
  vrsRefAlleleSeq: "T"
  allelicState:
    id: "GENO:0000135"
    label: "heterozygous"

HGVS

Variants can be represented using the HGVS nomenclature as follows.

For example, the Human Genome Variation Society (HGVS) expression NM_000226.3:c.470T>G indicates that a T at position 470 of the sequence represented by version 3 of NM_000226 (the mRNA of the human keratin 9 gene, KRT9) is replaced by a G.

We recommend using a tool such as VariantValidator or Mutalyzer to validate the HGVS string. See the HGVS recommendations for details about the HGVS nomenclature.
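
To make the structure of such an expression concrete, the following sketch splits a simple coding substitution into its parts with a regular expression. It is deliberately narrow and is not a validator: it assumes a RefSeq-style accession and a single-base c. substitution, and it checks nothing against the actual transcript sequence. The pattern and function name are illustrative, and the tools named above remain the appropriate choice for validation.

```python
import re

# Matches only expressions shaped like NM_000226.3:c.470T>G
HGVS_CODING_SUBSTITUTION = re.compile(
    r"(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+):c\."
    r"(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])"
)


def parse_coding_substitution(expression: str):
    """Return the parts of a simple c. substitution, or None if no match."""
    match = HGVS_CODING_SUBSTITUTION.fullmatch(expression)
    if match is None:
        return None
    return {
        "accession": match.group("accession"),
        "version": int(match.group("version")),
        "position": int(match.group("position")),
        "ref": match.group("ref"),
        "alt": match.group("alt"),
    }
```

Expressions outside this narrow shape, such as deletions or protein-level descriptions, return None rather than raising an error.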

Example

variationDescriptor:
  id: "clinvar:13294"
  expressions:
  - syntax: "hgvs"
    value: "NM_000226.3:c.470T>G"
  allelicState:
    id: "GENO:0000135"
    label: "heterozygous"

VCF

Example

variationDescriptor:
    id: "clinvar:13294"
    vcfRecord:
        genomeAssembly: "GRCh38"
        chrom: "10"
        pos: 121496701
        id: "rs121918506"
        ref: "T"
        alt: "G"
        qual: "."
        filter: "."
        info: "."
    allelicState:
        id: "GENO:0000135"
        label: "heterozygous"

SPDI

The Sequence Position Deletion Insertion (SPDI) notation is a relatively new notation which uses the same normalisation protocol as VRS. We recommend that users familiarize themselves with it, because it differs in important ways from other standards such as VCF and HGVS.

Tools for interconversion between SPDI, HGVS and VCF exist at the NCBI.

SPDI stands for

S = SequenceId
P = Position, a 0-based coordinate for where the Deleted Sequence starts
D = DeletedSequence, sequence for the deletion, can be empty
I = InsertedSequence, sequence for the insertion, can be empty

For instance, Seq1:4:A:G refers to a single nucleotide variant at the fifth nucleotide (nucleotide 4 according to zero-based numbering) from an A to a G. See the SPDI webpage for more examples.

The SPDI notation represents variation as deletion of a sequence (D) at a given position (P) in reference sequence (S) followed by insertion of a replacement sequence (I) at that same position. Position 0 indicates a deletion that starts immediately before the first nucleotide, and position 1 represents a deletion interval that starts between the first and second residues, and so on. Either the deleted or the inserted interval can be empty, resulting in a pure insertion or deletion.

Note that the deleted and inserted sequences in SPDI are all written on the positive strand for two-stranded molecules.
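
The delete-then-insert reading of SPDI described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names are hypothetical, the deleted sequence is assumed to be spelled out in full, and the VCF conversion covers only simple substitutions, where no normalisation is needed.

```python
from typing import NamedTuple


class Spdi(NamedTuple):
    sequence_id: str
    position: int  # 0-based coordinate where the deleted sequence starts
    deleted: str   # may be empty (pure insertion)
    inserted: str  # may be empty (pure deletion)


def parse_spdi(expression: str) -> Spdi:
    """Split an SPDI string into its four colon-separated parts."""
    parts = expression.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    sequence_id, position, deleted, inserted = parts
    return Spdi(sequence_id, int(position), deleted, inserted)


def apply_spdi(reference: str, spdi: Spdi) -> str:
    """Delete spdi.deleted at spdi.position, then insert spdi.inserted."""
    end = spdi.position + len(spdi.deleted)
    if reference[spdi.position:end] != spdi.deleted:
        raise ValueError("deleted sequence does not match the reference")
    return reference[:spdi.position] + spdi.inserted + reference[end:]


def vcf_substitution_to_spdi(sequence_id: str, pos: int, ref: str, alt: str) -> str:
    """Convert a simple VCF substitution (1-based POS) to SPDI (0-based)."""
    return f"{sequence_id}:{pos - 1}:{ref}:{alt}"
```

Applied to the toy reference CCCCATTT, Seq1:4:A:G replaces the fifth nucleotide and gives CCCCGTTT. The conversion helper shows why the VCF position 121496701 used elsewhere on this page corresponds to the SPDI position 121496700.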

Example

variationDescriptor:
  id: "clinvar:13294"
  expressions:
  - syntax: "spdi"
    value: "NC_000010.11:121496700:T:G"
  allelicState:
    id: "GENO:0000135"
    label: "heterozygous"

ISCN

The International System for Human Cytogenetic Nomenclature (ISCN) is an international standard for human chromosome nomenclature, which includes band names, symbols, and abbreviated terms used in the description of human chromosomes and chromosome abnormalities.

For example, del(6)(q23q24) describes a deletion from band q23 to q24 on chromosome 6.

Example

variationDescriptor:
  id: "id:A"
  expressions:
  - syntax: "iscn"
    value: "t(8;9;11)(q12;p24;p12)"

allelic_state

The zygosity of the variant as determined in all of the samples represented in this Phenopacket is represented using a list of terms taken from the Genotype Ontology (GENO). For instance, if a variant affects one of two alleles at a certain locus, we could record the zygosity using the term heterozygous (GENO:0000135).
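
Because the VCF genotype (GT) field is meant to be carried in allelic_state, a mapping from GT strings to GENO terms can be sketched as follows. This is an illustrative simplification: the function name is hypothetical, and treating a haploid call as hemizygous (GENO:0000134) is an assumption that will not hold for every haploid context. Missing and reference-only genotypes return None because no variant allele is asserted.

```python
import re

GENO_TERMS = {
    "hemizygous": {"id": "GENO:0000134", "label": "hemizygous"},
    "heterozygous": {"id": "GENO:0000135", "label": "heterozygous"},
    "homozygous": {"id": "GENO:0000136", "label": "homozygous"},
}


def allelic_state_from_gt(gt: str):
    """Map a VCF GT value such as '0/1' or '1|1' to a GENO term dict."""
    alleles = re.split(r"[/|]", gt)  # '/' is unphased, '|' is phased
    if "." in alleles:
        return None  # missing call
    if all(allele == "0" for allele in alleles):
        return None  # reference only, no variant allele present
    if len(alleles) == 1:
        return GENO_TERMS["hemizygous"]  # assumption: haploid call
    if len(alleles) == 2:
        if alleles[0] == alleles[1]:
            return GENO_TERMS["homozygous"]
        return GENO_TERMS["heterozygous"]
    return None  # higher ploidy is out of scope for this sketch
```

A genotype such as 1/2 maps to heterozygous here, which fits the one-allele-per-descriptor convention: each of the two ALT alleles is described separately and is present on one chromosome.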