Phenopacket Schema

The goal of the phenopacket-schema is to define a machine-readable phenotypic description of a patient/sample in the context of rare disease, common/complex disease, or cancer. It aims to provide sufficient, shareable information outside of the EHR (Electronic Health Record), so that a clinician or clinical geneticist can capture structured data at the point of care and share it with other labs or use it for computational analysis in clinical or research environments.

This work has been produced as part of the GA4GH Clinical Phenotype Data Capture Workstream and is designed to be compatible with GA4GH metadata-schemas.

The phenopacket schema defines a common, limited set of data types which may be composed into more specialised types for data sharing between resources using an agreed upon common schema.

This common schema has been used to define the ‘Phenopacket’, which is a catch-all collection of data types specifically focused on representing disease data for both initial data capture and analysis. The phenopacket schema is designed to be both human- and machine-readable, and to interoperate with standards being developed by organizations such as the ISO TC215 committee, and with the HL7 Fast Healthcare Interoperability Resources Specification (aka FHIR®).

The structure of the schema is defined in protobuf. You can find out more in the section A short introduction to protobuf.

Version 1.0

The diagram below shows an overview of the schema elements.

_images/phenopacket-schema-v1-overview.svg

Overview of v1.0 of the schema. Lines between elements indicate composition. Note that the OntologyClass and TimeElement links have been omitted for legibility. The colour scheme represents: base classes, interpretation classes, genomic classes, pedigree classes, top-level classes

_images/phenopacket-schema-v1.svg

Detailed view of v1.0 of the schema. Lines between elements indicate composition. Note that the OntologyClass links have been omitted for legibility.

Version 2.0

_images/phenopacket-schema-v2-overview.svg

Overview of v2.0 of the schema. Lines between elements indicate composition. Note that the OntologyClass and TimeElement links have been omitted for legibility. The colour scheme represents: base classes, interpretation classes, measurement classes, genomic/vrs classes, pedigree classes, top-level classes, medical-action classes

_images/phenopacket-schema-v2.svg

Detailed view of v2.0 of the schema. Lines between elements indicate composition. Note that the OntologyClass and TimeElement links have been omitted for legibility.

Version 2.0 includes significant changes and additions to the model to enable better representation of cancer and common disease, while still catering for the original rare-disease use case.

Additions

The following elements and their sub-elements were added to the 2.0 schema. Other additional fields have been added throughout the schema.

Measurements

Added a new Measurement message for capturing quantitative, ordinal (e.g., absent/present), or categorical measurements. This element is available as a repeated field in the Phenopacket and Biosample top-level elements.
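As a rough illustration of the shape described above, the sketch below models a quantitative measurement as a plain Python dictionary and appends it to a repeated field of a phenopacket-like record. The key names (`measurements`, `assay`, `value`, `quantity`, `unit`) and the `EXAMPLE:` identifiers are assumptions made for this example; the authoritative definitions are in the protobuf files.

```python
import json


def make_quantity_measurement(assay_id, assay_label, value, unit_id, unit_label):
    """Build a dictionary shaped like a quantitative Measurement.

    The key names are assumptions for illustration, not the schema itself.
    """
    return {
        "assay": {"id": assay_id, "label": assay_label},
        "value": {
            "quantity": {
                "unit": {"id": unit_id, "label": unit_label},
                "value": value,
            }
        },
    }


# A phenopacket-like record; "measurements" stands in for the repeated field.
phenopacket = {"id": "example-phenopacket", "measurements": []}

phenopacket["measurements"].append(
    make_quantity_measurement(
        "EXAMPLE:assay-1", "platelet count",
        24000.0,
        "EXAMPLE:unit-1", "cells per microliter",
    )
)

# Round-trip through JSON to show the record is plain, shareable data.
serialized = json.dumps(phenopacket, indent=2)
restored = json.loads(serialized)
```

Because the field is repeated, further measurements can be appended to the same list; a Biosample-like record could carry its own list in the same way.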

Medical actions

The MedicalAction was added to capture medications, procedures, and other actions taken for clinical management. This element is available as a repeated field in the Phenopacket.

Time element

The TimeElement was added to collect the various ways of expressing time or age throughout the schema. In general, where there was an onset or start time, a corresponding resolution or end TimeElement has been added.
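One way to picture this is as a tagged union in which exactly one way of expressing time is populated, paired with an onset/resolution pattern. The following is a minimal sketch in Python, not the protobuf definition: the three alternatives shown (`age`, `timestamp`, `ontology_class`) are a simplified, assumed subset, and `TimedFeature` is a hypothetical container invented for the example.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class TimeElement:
    """Exactly one of the alternatives below must be set (a 'oneof')."""

    age: Optional[str] = None             # ISO 8601 duration such as "P6Y3M"
    timestamp: Optional[str] = None       # ISO 8601 date-time string
    ontology_class: Optional[str] = None  # an ontology term identifier

    def __post_init__(self):
        if len(self._set_fields()) != 1:
            raise ValueError("exactly one TimeElement alternative must be set")

    def _set_fields(self):
        # Names of the alternatives that carry a value.
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]

    @property
    def kind(self):
        """Name of the alternative that is populated."""
        return self._set_fields()[0]


@dataclass(frozen=True)
class TimedFeature:
    """Hypothetical element with both a start and an end TimeElement."""

    label: str
    onset: Optional[TimeElement] = None
    resolution: Optional[TimeElement] = None


feature = TimedFeature(
    label="example feature",
    onset=TimeElement(age="P2Y"),
    resolution=TimeElement(timestamp="2021-03-01T00:00:00Z"),
)
```

The onset and resolution need not use the same alternative: here the onset is given as an age and the resolution as a timestamp.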

VRS / VRSATILE

The GeneDescriptor and VariationDescriptor replace the v1.0 Gene and Variant messages. The new messages are based on the VRS and VRSATILE schemas defined by the GA4GH GKS group.

Non-breaking Changes

The .proto files in the schema have been re-organised into more self-contained logical groups extracted from the base.proto file. These files are all organised into a v2 package which lives alongside the v1 package. For some language bindings it may be required to fix import paths for code created with the previous version to compile against the latest release, but otherwise code using v1.0 of the schema should work identically.

Breaking Changes

Time in Individual, Biosample, Disease, Phenotypic Feature

The TimeElement replaces the onset oneof in PhenotypicFeature and Disease. The Individual age field has been replaced with a time_at_encounter TimeElement, and the Biosample individual_age_at_collection has been replaced with a time_of_collection TimeElement. The PhenotypicFeature ‘negated’ field was renamed to ‘excluded’, in line with Disease, for indicating an absent phenotype.
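For code that handles records as dictionaries, the renames listed above could be applied with a small lookup table. This is only a sketch under stated assumptions: the `RENAMES` table and `rename_fields` helper are hypothetical, the keys are written exactly as they appear in the text (a JSON serialisation may use different casing), and converting the old values into the TimeElement structure is deliberately left out.

```python
# Field renames taken from the text above, grouped by element type.
RENAMES = {
    "PhenotypicFeature": {"negated": "excluded"},
    "Individual": {"age": "time_at_encounter"},
    "Biosample": {"individual_age_at_collection": "time_of_collection"},
}


def rename_fields(element_type, record):
    """Return a copy of `record` with v1.0 field names replaced by v2.0 names.

    Only keys are renamed; values are passed through unchanged, so any
    restructuring of the value itself is out of scope for this sketch.
    """
    mapping = RENAMES.get(element_type, {})
    return {mapping.get(key, key): value for key, value in record.items()}


v1_feature = {"type": {"id": "EXAMPLE:0001", "label": "example term"}, "negated": True}
v2_feature = rename_fields("PhenotypicFeature", v1_feature)
```

The input record is not modified, which makes it straightforward to compare the before and after forms while migrating.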

Gene and Variant contexts

In Phenopacket and Biosample the genes and variants fields have been removed. In the case of the Phenopacket these have been replaced with the updated Interpretation.

Interpretation

The v2.0 Interpretation is now a sub-element of a phenopacket, rather than an enclosing element. The change allows for better semantics on the Gene (now replaced by GeneDescriptor) and Variant (now replaced by VariationDescriptor) types and their relationship to an Individual or Biosample in the context of a Diagnosis based on a GenomicInterpretation.
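The nesting described here can be sketched with plain dictionaries: the phenopacket holds interpretations, each carries a diagnosis, and each genomic interpretation within it points at an individual or biosample. All key names below (`interpretations`, `diagnosis`, `genomic_interpretations`, `subject_or_biosample_id`, `gene_descriptor`, `variation_descriptor`) and the helper `descriptors_for` are assumptions chosen for illustration and flatten the real message structure.

```python
def descriptors_for(phenopacket, subject_or_biosample_id):
    """Collect gene/variation descriptors linked to one individual or biosample."""
    found = []
    for interpretation in phenopacket.get("interpretations", []):
        diagnosis = interpretation.get("diagnosis", {})
        for genomic in diagnosis.get("genomic_interpretations", []):
            if genomic.get("subject_or_biosample_id") != subject_or_biosample_id:
                continue
            # Either kind of descriptor may be attached (hypothetical keys).
            for key in ("gene_descriptor", "variation_descriptor"):
                if key in genomic:
                    found.append((key, genomic[key]))
    return found


# The Interpretation is nested inside the phenopacket rather than wrapping it.
phenopacket = {
    "id": "example-phenopacket",
    "subject": {"id": "patient-1"},
    "interpretations": [
        {
            "id": "interpretation-1",
            "diagnosis": {
                "genomic_interpretations": [
                    {
                        "subject_or_biosample_id": "patient-1",
                        "variation_descriptor": {"id": "variant-1"},
                    },
                    {
                        "subject_or_biosample_id": "biosample-7",
                        "gene_descriptor": {"value_id": "EXAMPLE:gene-1"},
                    },
                ]
            },
        }
    ],
}

patient_calls = descriptors_for(phenopacket, "patient-1")
```

Keeping the subject or biosample identifier on each genomic interpretation is what lets a single phenopacket relate findings to either the individual or one of their samples.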