Welcome to the documentation for phenopackets-schema!

The goal of the phenopacket-schema is to define the phenotypic description of a patient/sample in the context of rare disease or cancer genomic diagnosis. The schema as well as source code in Java, C++, and Python is available from the phenopacket-schema GitHub repository.

What is a Phenopacket?

Phenopackets represent an open standard for sharing disease and phenotype information will improve our ability to understand, diagnose, and treat both rare and common diseases. A Phenopacket links detailed phenotype descriptions with disease, patient, and genetic information, enabling clinicians, biologists, and disease and drug researchers to build more complete models of disease. The standard is designed to encourage wide adoption and synergy between the people, organizations and systems that comprise the joint effort to address human disease and biological understanding.

While great strides have been made in exchange formats for sequence and variation data (e.g. Variant Call Format), complementary standards for phenotypes and environment are urgently needed. For individuals with rare and undiagnosed diseases, such standards could improve the speed and accuracy of diagnosis. The development of a clinical phenotype data exchange standard is both necessary and timely. It is necessary because study sizes of well over 100,000 patients are thought to be required to effectively assess the role of rare variation in common disease or to discover the genomic basis for a substantial portion of Mendelian diseases. It is timely because studies of this power are now becoming financially and technologically tractable.

Phenotypic abnormalities of individuals are currently described in diverse places in diverse formats: publications, databases, health records, and even in social media. We propose that these descriptions a) contain a minimum set of fields and b) get transmitted alongside genomic sequence data, such as in VCF, between clinics, authors, journals, and data repositories. The structure of the data in the exchange standard will be optimized for integration from these distributed contexts. The implementation of such a system will allow the sharing of phenotype data prospectively, as well as retrospectively. Increasing the volume of computable data across a diversity of systems will support large-scale computational disease analysis using the combined genotype and phenotype data.

The terms ‘disease’ and ‘phenotype’ are often conflated. The Phenopackets standard uses phenotypic feature to refer to a phenotypic feature, such as Arachnodactyly, that is the component of a disease, such as Marfan syndrome. The Phenopacket proposed here is designed to support deep phenotyping, a process wherein individual components of each phenotype are observed and documented. The phenoptypes may be constitutional or those related to a sample (such as from a biopsy).

Phenopackets require the use of a common ontology, a logically defined hierarchy of terms, that allows sophisticated algorithmic analysis over medically relevant abnormalities. The National Cancer institute’s Thesaurus (NCIt) is used for cancer biosamples, and is the de facto standard for cancer knowledge representation and regulatory submission. The Human Phenotype Ontology (HPO) was built for this purpose and has been used for genomic diagnostics, translational research, genomic matchmaking, and systems biology applications. The HPO is developed in the context of the Monarch Initiative, an international team of computer scientists, clinicians, and biologists in the United States, Europe, and Australia; HPO is being translated into multiple languages to support international interoperability. Due to its extensive phenotypic coverage beyond other terminologies, HPO has recently been integrated into the Unified Medical Language System (UMLS) to support deep phenotyping in a variety of mainstream health care IT systems.

What is the Phenopacket Schema?

The goal of the phenopacket-schema is to define the phenotypic description of a patient/sample in the context of rare disease or cancer genomic diagnosis. It aims to provide sufficient and shareable information of the data outside of the EHR (Electronic Health Record) with the aim of enabling capturing of sufficient structured data at the point of care by a clinician or clinical geneticist for sharing with other labs or computational analysis of the data in clinical or research environments.

This work has been produced as part of the GA4GH Clinical Phenotype Data Capture Workstream and is designed to be compatible with GA4GH metadata-schemas.

The phenopacket schema defines a common, limited set of data types which may be composed into more specialised types for data sharing between resources using an agreed upon common schema.

This common schema has been used to define the ‘Phenopacket’ which is a catch-all collection of data types, specifically focused on representing rare-disease or cancer samples for both initial data capture and analysis. The phenopacket is designed to be both human and machine-readable, and to inter-operate with the HL7 Fast Healthcare Interoperability Resources Specification (aka FHIR®).