HtsFile

Phenopackets can be used to hold phenotypic information that can inform the analysis of sequencing data in VCF format as well as other high-throughput sequencing (HTS) or other data types. The HtsFile message allows a Phenopacket to link HTS files with data.

Given that HtsFile elements are listed in various locations such as the Phenopacket, Biosample, Family etc. which can in turn be nested, individual HTS files MUST be contained within their appropriate scope. For example within a Phenopacket for germline samples of an individual or within the scope of the Phenopacket.Biosample in the case of genomic data derived from sequencing that biosample. Aggregate data types such as Cohort and Family MUST contain aggregate HTS file data i.e. merged/multi-sample VCF at the level of the Family/Cohort, but each member Phenopacket can contain its own locally-scope HTS files pertaining to that individual/biosample(s).

HtsFile

Field Type Status Description
uri string required A valid URI e.g. file://data/file1.vcf.gz or https://opensnp.org/data/60.23andme-exome-vcf.231?1341012444
description string optional arbitrary description of the file
hts_format HtsFormat required VCF
genome_assembly string required e.g. GRCh38
individual_to_sample_identifiers a map of string key: value recommended The mapping between the Individual.id or Biosample.id to the sample identifier in the HTS file

HtsFormat

This message is used for a file in one of the HTS formats.

Field Description
UNKNOWN An HTS file of unknown type.
SAM A SAM format file
BAM A BAM format file
CRAM A CRAM format file
VCF A VCF format file
BCF A BCF format file
GVCF A GVCF format file
FASTQ A FASTQ format file

Example

{
    "uri": "file://data/genomes/germline_wgs.vcf.gz",
    "description": "Matched normal germline sample",
    "htsFormat": "VCF",
    "genomeAssembly": "GRCh38",
    "individualToSampleIdentifiers": {
      "patient23456": "NA12345"
    }
}

uri

URI for the file e.g. file://data/genomes/file1.vcf.gz or https://opensnp.org/data/60.23andme-exome-vcf.231?1341012444.

description

An arbitrary description of the file contents.

The File message MUST have at least one of path and uri and usually should just have one of the two (in exceptional cases the same file might be referenced on a local file system and on the network).

hts_format

This indicates which format the file has.

genome_assembly

The genome assembly the contents of this file was called against. We recommend using the Genome Reference Consortium nomenclature e.g. GRCh37, GRCh38.

individual_to_sample_identifiers

A map of identifiers mapping an individual refered to in the Phenopacket to a sample in the file. The key values must correspond to the Individual::id for the individuals in the message or Biosample::id for biosamples, the values must map to the samples in the HTS file.