Output Files

The following section describes the outputs produced by DRAGEN Array.

CNV VCF File

DRAGEN Array produces one CNV variant call file (VCF) (*.cnv.vcf) per sample to report the CN status at the gene and sub-gene level, along with the CN events for PGx targets.

The CNV VCF output file follows the standard VCF format. The QUAL field in the VCF file reports the CNV call quality as a Phred-scaled score that ranges from 0 to 60. Low-quality calls (QUAL < 7) are flagged by the Q7 filter. Low-quality samples, those with a LogRDev greater than the threshold of 0.2, are flagged by the SampleQuality filter.
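
As a rough illustration of the quality fields described above, the sketch below converts a Phred-scaled QUAL value into an error probability and applies the two thresholds. It assumes the conventional Phred definition (Q = -10 * log10(P)), which this section does not spell out, and the function names are hypothetical rather than part of DRAGEN Array.

```python
Q7_THRESHOLD = 7           # QUAL below this value is flagged by the Q7 filter
LOG_R_DEV_THRESHOLD = 0.2  # LogRDev above this value flags the sample


def phred_to_error_probability(qual: float) -> float:
    """Convert a Phred-scaled score to an error probability (assumed convention)."""
    return 10 ** (-qual / 10)


def expected_filters(qual: float, log_r_dev: float) -> list:
    """Return the filter names this sketch would expect for one CNV call."""
    filters = []
    if qual < Q7_THRESHOLD:
        filters.append("Q7")
    if log_r_dev > LOG_R_DEV_THRESHOLD:
        filters.append("SampleQuality")
    return filters or ["PASS"]
```

Under this convention, a QUAL of 7 corresponds to an error probability of roughly 0.2, and the cap of 60 to one in a million.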

The CNV VCF files are by default bgzipped (Block GZIP) and have the “.gz” extension. The compression saves storage space and facilitates efficient lookup when indexed with the TBI Index File. To view these files as plain text, they can be uncompressed with bgzip from Samtools or other third-party tools. The CNV VCF must be bgzipped and indexed to be used in downstream DRAGEN Array commands, such as star allele calling.
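
Because the Block GZIP format is a series of standard gzip members, a bgzipped VCF can usually also be read as text without decompressing it to disk first. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the approach does not use the TBI index.

```python
import gzip


def read_vcf_header(path: str) -> list:
    """Return the header lines (those starting with '#') of a gzip- or bgzip-compressed VCF."""
    header = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # first data line reached; the header is complete
            header.append(line.rstrip("\n"))
    return header
```

This reads the file sequentially, so it suits quick inspection rather than region lookups, which are what the index is for.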

The CNV VCF output file includes the following content.

##fileformat=VCFv4.1

##source=dragena 1.0.0

##genomeBuild=38

##reference=file:///hg38_with_alt/hg38_nochr_MT.fa

##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events. CN=5 indicates 5 or 5+">

##FORMAT=<ID=NR,Number=1,Type=Float,Description="Aggregated normalized intensity">

##ALT=<ID=CNV,Description="Copy number variant region">

##FILTER=<ID=Q7,Description="Quality below 7">

##FILTER=<ID=SampleQuality,Description="Sample was flagged as potentially low-quality due to high noise levels.">

##INFO=<ID=CNVLEN,Number=1,Type=Integer,Description="Number of bases in CNV hotspot">

##INFO=<ID=PROBE,Number=1,Type=Integer,Description="Number of probes assayed for CNV hotspot">

##INFO=<ID=END,Number=1,Type=Integer,Description="End position of CNV hotspot">

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Structural Variant Type">

##CNVOverallPloidy=1.8

##CNVGCCorrect=True

##contig=<ID=1,length=248956422>

##contig=<ID=4,length=190214555>

##contig=<ID=10,length=133797422>

##contig=<ID=16,length=90338345>

##contig=<ID=19,length=58617616>

##contig=<ID=22,length=50818468>

##contig=<ID=22_KI270879v1_alt,length=304135>

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 204619760001_R01C01

1 109687842 CNV:GSTM1:chr1:109687842:109693526 N <CNV> 60 PASS CNVLEN=5685;PROBE=124;END=109693526;SVTYPE=CNV CN:NR 2:0.966631132771593

4 68537222 CNV:UGT2B17:chr4:68537222:68568499 N <CNV> 60 PASS CNVLEN=31278;PROBE=383;END=68568499;SVTYPE=CNV CN:NR 0:0.376696837881692

10 133527374 CNV:CYP2E1:chr10:133527374:133539096 N <CNV> 60 PASS CNVLEN=11723;PROBE=194;END=133539096;SVTYPE=CNV CN:NR 2:0.980059731860893

16 28615068 CNV:SULT1A1:chr16:28615068:28623382 N <CNV> 57 PASS CNVLEN=8315;PROBE=164;END=28623382;SVTYPE=CNV CN:NR 2:0.980552325552963

19 40844791 CNV:CYP2A6.intron.7:chr19:40844791:40845293 N <CNV> 60 PASS CNVLEN=503;PROBE=38;END=40845293;SVTYPE=CNV CN:NR 2:0.9663775484762

19 40850267 CNV:CYP2A6.exon.1:chr19:40850267:40850414 N <CNV> 60 PASS CNVLEN=148;PROBE=21;END=40850414;SVTYPE=CNV CN:NR 2:0.9663775484762

22 42126498 CNV:CYP2D6.exon.9:chr22:42126498:42126752 N <CNV> 48 PASS CNVLEN=255;PROBE=370;END=42126752;SVTYPE=CNV CN:NR 2:0.981703411438716

22 42129188 CNV:CYP2D6.intron.2:chr22:42129188:42129734 N <CNV> 10 PASS CNVLEN=547;PROBE=333;END=42129734;SVTYPE=CNV CN:NR 2:0.965498002434641

22 42130886 CNV:CYP2D6.p5:chr22:42130886:42131379 N <CNV> 60 PASS CNVLEN=494;PROBE=172;END=42131379;SVTYPE=CNV CN:NR 2:0.970341562236357

22_KI270879v1_alt 270316 CNV:GSTT1:chr22_KI270879v1_alt:270316:278477 N <CNV> 60 PASS CNVLEN=8162;PROBE=91;END=278477;SVTYPE=CNV CN:NR 2:1.01191145130511
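
The data lines above can be unpacked with a few string operations. The sketch below is one possible way to do it, not a DRAGEN Array utility: the function name and output keys are hypothetical, and it splits on any whitespace so that it works both on the example as printed here and on a tab-delimited file.

```python
def parse_cnv_record(line: str) -> dict:
    """Parse one CNV VCF data line into a flat dictionary."""
    chrom, pos, record_id, _ref, _alt, qual, filt, info, fmt, sample = line.split()
    info_fields = dict(item.split("=", 1) for item in info.split(";"))
    sample_fields = dict(zip(fmt.split(":"), sample.split(":")))
    return {
        "chrom": chrom,
        "start": int(pos),
        "end": int(info_fields["END"]),
        "region": record_id.split(":")[1],  # e.g. "UGT2B17" or "CYP2D6.exon.9"
        "qual": float(qual),
        "filters": filt.split(";"),
        "probes": int(info_fields["PROBE"]),
        "copy_number": int(sample_fields["CN"]),
        "normalized_intensity": float(sample_fields["NR"]),
    }


# The UGT2B17 record from the example above (CN=0).
record = parse_cnv_record(
    "4 68537222 CNV:UGT2B17:chr4:68537222:68568499 N <CNV> 60 PASS "
    "CNVLEN=31278;PROBE=383;END=68568499;SVTYPE=CNV CN:NR 0:0.376696837881692"
)
```

In the example records, CNVLEN equals END minus POS plus one, which gives a simple consistency check when parsing.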

SNV VCF File

The software produces one genotyping variant call file (*.snv.vcf) per sample, covering single nucleotide variants (SNV) and indels for the sample. It reports GenCall score (GS), B Allele Frequency (BAF), and Log R Ratio (LRR) per variant.

Certain SNV and indel calls can be skipped and are then not reported in the VCF. Skipped data can include unmapped loci, intensity-only probes used for CNV identification, and indels that do not map back to the genome. See Warning/Error Messages and Logs for messages that may be seen with DRAGEN Array Local related to the skipped data.

The BAF and LRR are oriented with Ref as A and Alt as B relative to the reference genome, while GS is agnostic to the reference genome. Users familiar with GenomeStudio may observe BAF and LRR reported in the VCF as 1 minus the value reported in GenomeStudio, depending on the Ref/Alt allele orientation relative to the reference genome. GenomeStudio reports these values based on the information in the manifest without knowledge of the reference genome.
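
The orientation difference can be pictured with a short sketch for BAF only. It assumes the caller already knows whether the manifest's A allele is the reference allele; that flag and the function name are hypothetical, and this is not how DRAGEN Array determines the orientation internally.

```python
def baf_relative_to_reference(genomestudio_baf: float, manifest_a_is_ref: bool) -> float:
    """Re-express a manifest-oriented BAF with Ref as A and Alt as B.

    If the manifest's A allele is already the reference allele, the value is
    unchanged; otherwise the alleles are swapped and BAF becomes 1 - BAF.
    """
    if manifest_a_is_ref:
        return genomestudio_baf
    return 1.0 - genomestudio_baf
```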

The SNV VCF files are by default bgzipped (Block GZIP) and have the “.gz” extension. The compression saves storage space and facilitates efficient lookup when indexed with the TBI Index File. To view these files as plain text, they can be uncompressed with bgzip from Samtools or other third-party tools. The SNV VCF must be bgzipped and indexed to be used in downstream DRAGEN Array commands, such as star allele calling.

The SNV VCF output file includes the following content. The last row shows an example of a variant call.

##fileformat=VCFv4.1

##source=dragena 1.0.0

##genomeBuild=38

##reference=file:///genomes/38/genome.fa

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GS,Number=1,Type=Float,Description="GenCall score. For merged multi-assay or multi-allelic records, min GenCall score is reported.">

##FORMAT=<ID=BAF,Number=1,Type=Float,Description="B Allele Frequency">

##FORMAT=<ID=LRR,Number=1,Type=Float,Description="LogR ratio">

##contig=<ID=1,length=248956422>

##contig=<ID=2,length=242193529>

##contig=<ID=3,length=198295559>

##contig=<ID=4,length=190214555>

##contig=<ID=5,length=181538259>

##contig=<ID=6,length=170805979>

##contig=<ID=7,length=159345973>

##contig=<ID=8,length=145138636>

##contig=<ID=9,length=138394717>

##contig=<ID=10,length=133797422>

##contig=<ID=11,length=135086622>

##contig=<ID=12,length=133275309>

##contig=<ID=13,length=114364328>

##contig=<ID=14,length=107043718>

##contig=<ID=15,length=101991189>

##contig=<ID=16,length=90338345>

##contig=<ID=17,length=83257441>

##contig=<ID=18,length=80373285>

##contig=<ID=19,length=58617616>

##contig=<ID=20,length=64444167>

##contig=<ID=21,length=46709983>

##contig=<ID=22,length=50818468>

##contig=<ID=MT,length=16569>

##contig=<ID=X,length=156040895>

##contig=<ID=Y,length=57227415>

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 202937470021_R06C01

1 2290399 rs878093 G A . PASS . GT:GS:BAF:LRR 0/1:0.7923:0.50724137:0.14730307
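
The per-sample values in the last column are keyed by the FORMAT column. A minimal sketch of pairing the two follows; the function names are hypothetical, and treating "." as a missing value follows the general VCF convention rather than anything stated in this section.

```python
def to_float(value: str):
    """Convert a VCF field to float, mapping the VCF missing marker '.' to None."""
    return None if value == "." else float(value)


def parse_sample_fields(format_column: str, sample_column: str) -> dict:
    """Pair FORMAT keys with the sample's values for one SNV VCF record."""
    values = dict(zip(format_column.split(":"), sample_column.split(":")))
    return {
        "GT": values["GT"],              # genotype, e.g. "0/1"
        "GS": to_float(values["GS"]),    # GenCall score
        "BAF": to_float(values["BAF"]),  # B Allele Frequency
        "LRR": to_float(values["LRR"]),  # Log R Ratio
    }


# The rs878093 call from the example above.
call = parse_sample_fields("GT:GS:BAF:LRR", "0/1:0.7923:0.50724137:0.14730307")
```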

Genotype Call (GTC) File

The genotype call algorithm produces one genotype call file (.gtc) per sample analyzed. The Genotype Call (GTC) file contains the small variant (SNV and indel) genotype for each marker specified by the product, along with sample quality metrics. The marker locations are not included and must be extracted from the manifest file. The GTC file uses a proprietary binary format that can be parsed using the Illumina open-source tool BeadArray Library File Parser.

BedGraph File

The BedGraph file contains the log R ratios from the genotyping algorithm for use in visual tools.
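
For readers who want to inspect the values outside a genome browser, the sketch below parses lines in the general bedGraph layout (chromosome, start, end, value). It assumes the DRAGEN Array file follows that standard four-column layout; the function name and the sample lines are purely illustrative and not taken from a real output file.

```python
def parse_bedgraph(lines):
    """Yield (chrom, start, end, value) tuples, skipping track, browser, and comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)


# Illustrative lines only; the values are made up.
example_lines = [
    "track type=bedGraph",
    "1\t2290398\t2290399\t0.1473",
]
intervals = list(parse_bedgraph(example_lines))
```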

Star Allele CSV File

The Star Allele CSV file is an intermediate file generated by the star-allele call command and serves as the input to the star-allele annotate command. It contains all the star allele calls for all samples in a run. Each row in the file provides either a star allele diplotype or simple variant call for a PGx-related gene. Star allele diplotype calls for a sample and a gene may span multiple lines where alternative solutions can be listed.

The Star Allele CSV file also contains meta information marked by # at the top of the file for the genome build and PGx database used for the star allele calling.

The star_allele.csv file contains the following details per sample:

Below is an example of the first 5 columns from a star allele CSV file:

Sample,Rank,Gene or Variant,Type,Solution

204650490282_R02C01,1,CYP2C9,Haplotype,*9/*11

204650490282_R02C01,1,CYP2C19,Haplotype,*2/*10
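
Since a sample and gene can span several rows when alternative solutions are listed, it can help to group rows by sample and gene when reading the file. The sketch below does this with the standard csv module, skipping the metadata lines marked by #. The function name is hypothetical, and the metadata line in the example is a placeholder rather than real file content.

```python
import csv
from collections import defaultdict


def read_star_alleles(text: str) -> dict:
    """Group star allele rows by (sample, gene or variant), skipping '#' metadata lines."""
    data_lines = [
        line for line in text.splitlines()
        if line and not line.startswith("#")
    ]
    calls = defaultdict(list)
    for row in csv.DictReader(data_lines):
        key = (row["Sample"], row["Gene or Variant"])
        calls[key].append((int(row["Rank"]), row["Solution"]))
    return dict(calls)


EXAMPLE = """# placeholder metadata line
Sample,Rank,Gene or Variant,Type,Solution
204650490282_R02C01,1,CYP2C9,Haplotype,*9/*11
204650490282_R02C01,1,CYP2C19,Haplotype,*2/*10
"""
calls = read_star_alleles(EXAMPLE)
```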

Genotype Summary Files

The software produces genotype summary files (gt_sample_summary.csv and gt_sample_summary.json) that contain the following details per sample:

  • Sample ID

  • Sample Name

  • Sample Folder

  • Autosomal Call Rate

  • Call Rate

  • Log R Ratio Std Dev

  • Sex Estimate

  • TGA_Ctrl_5716 Norm R

The TGA_Ctrl_5716 Norm R field is specific to PGx products (e.g., Global Diversity Array with enhanced PGx). The field value is the Normalized R value of one probe and is meant as an assay control, where a value below 1 indicates that the sample failed in the TGA (Targeted Gene Amplification) process. If the product does not have this probe, the field is not included in the gt_sample_summary files.
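
The TGA control check described above could be applied to the CSV summary along the following lines. This is a sketch under assumptions: it supposes that the CSV column headers match the field names listed above, and the function name and example rows are hypothetical.

```python
import csv
import io

TGA_COLUMN = "TGA_Ctrl_5716 Norm R"  # assumed to match the CSV header exactly


def samples_failing_tga(csv_text: str) -> list:
    """Return the Sample IDs whose TGA control Norm R value is below 1."""
    failing = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row.get(TGA_COLUMN)
        if value in (None, ""):
            continue  # product without this probe: the column is absent
        if float(value) < 1:
            failing.append(row["Sample ID"])
    return failing


# Illustrative rows only; the values are made up.
EXAMPLE = """Sample ID,Call Rate,TGA_Ctrl_5716 Norm R
sample_a,0.99,1.8
sample_b,0.98,0.4
"""
failed = samples_failing_tga(EXAMPLE)
```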

Final Report

DRAGEN Array Cloud produces a Final Report (gtc_final_report.csv) per analysis batch similar to the one available in GenomeStudio. It contains the following details per locus per sample:

Note: Analyses on products with large numbers of loci (>1 million) and large numbers of samples (>100) yield a large (50+ gigabyte) Final Report that is difficult to download and review. For large batches, it is recommended to create analysis configurations that do not produce this report.

For more information on interpreting DNA strand and allele information, see the Illumina Knowledge article "How to interpret DNA strand and allele information for Infinium genotyping array data".

Locus Summary

DRAGEN Array Cloud produces a Locus Summary (locus_summary.csv) per analysis batch similar to the one available in GenomeStudio. It contains the following details per locus:

CN Summary File

The CN summary file contains key statistics for each sample in a batch, including the following details per sample:

  • Sample ID

  • Sample Name

  • Sample Folder

Copy Number Batch File

The copy number batch summary file (cn_batch_summary.csv) shows the total copy number gain, loss, and neutral (CN=2) values for each target region across all the samples in the analysis.

Example copy number batch summary file content:

Target Region,Total CN gain,Total CN loss,Total CN neutral

CYP2A6.exon.1,0,1,47

CYP2A6.intron.7,0,1,47

CYP2D6.exon.9,2,4,42

CYP2D6.intron.2,7,2,39

CYP2D6.p5,13,2,33

CYP2E1,2,0,46

GSTM1,0,42,6

GSTT1,0,33,15

SULT1A1,0,0,48

UGT2B17,0,34,14

All Target Regions,24,119,337
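
In the example above, the All Target Regions row is the column-wise sum of the individual target regions. The sketch below reads the example with the standard csv module and checks that relationship; the function name is hypothetical, and the check reflects what the example shows rather than a documented guarantee.

```python
import csv
import io

BATCH_SUMMARY = """Target Region,Total CN gain,Total CN loss,Total CN neutral
CYP2A6.exon.1,0,1,47
CYP2A6.intron.7,0,1,47
CYP2D6.exon.9,2,4,42
CYP2D6.intron.2,7,2,39
CYP2D6.p5,13,2,33
CYP2E1,2,0,46
GSTM1,0,42,6
GSTT1,0,33,15
SULT1A1,0,0,48
UGT2B17,0,34,14
All Target Regions,24,119,337
"""

COUNT_COLUMNS = ["Total CN gain", "Total CN loss", "Total CN neutral"]


def totals_are_consistent(csv_text: str) -> bool:
    """Check that the 'All Target Regions' row equals the sum of the region rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    regions = [r for r in rows if r["Target Region"] != "All Target Regions"]
    total = next(r for r in rows if r["Target Region"] == "All Target Regions")
    return all(
        sum(int(r[col]) for r in regions) == int(total[col])
        for col in COUNT_COLUMNS
    )
```

Each region row in this example also sums to 48, the number of samples with a call for that region.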

Warning/Error Messages and Logs

The following scenarios result in a warning or error message:

  • Manifest file used to generate GTC is not the same as the manifest file used to generate the CN model.

  • FASTA files and FASTA index files do not match.

For the following scenarios, the software reports messages to the terminal output (as either a warning or an error):

  • Indel processing for GTC to VCF conversion failed.

  • The input folder does not contain the required input files.

  • An input file is corrupt.

Examples of such notifications can include the following:

Star allele JSON File

The star allele JSON file is produced per sample. It contains the fields present in the star allele CSV file as well as additional metadata and annotations.

Fields included in the star allele JSON header are described below.

Fields included in the star allele call (locusAnnotations) information are described below.

Example of JSON file content:

{
  "softwareVersion": "dragena 1.0.0",
  "genomeBuild": "hg38",
  "databaseSources": "PharmVar Version: 6.0.5, PharmGKB Database Version: Snapshot-2023.08.30, CPIC Database Version: 1.30.0",
  "mappingFile": "gda_mapping_53e0931.zip",
  "pgxGuideline": "CPIC",
  "sampleId": "204619760027_R01C01",
  "locusAnnotations": [
    {
      "gene": "CYP2C9",
      "callType": "Star Allele",
      "genotype": "*1/*1",
      "activityScore": "2",
      "phenotype": "Normal Metabolizer",
      "qualityScore": "0.9999",
      "rawScore": "0.9999",
      "supportingVariants": "Complete: *1 ( )",
      "candidateSolutions": [
        {
          "rank": 1,
          "genotype": "*1/*1",
          "activityScore": "2",
          "phenotype": "Normal Metabolizer",
          "qualityScore": 0.9999,
          "rawScore": 0.9999,
          "alleles": [
            {
              "solutionLong": "Complete: *1",
              "supportingVariants": "Complete: *1 ( )",
              "missingVariants": "Complete: *1 ( )",
              "collapsedAlleles": "Complete: *1 ( )"
            }
          ],
          "copyNumberRegions": "p5,exon.1,intron.1,exon.2,intron.2,exon.3,intron.3,exon.4,intron.4,exon.5,intron.5,exon.6,intron.6,exon.7,intron.7,exon.8,intron.8,exon.9,p3",
          "copyNumberSolution": "2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2"
        }
      ],
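
In the example above, copyNumberRegions and copyNumberSolution are comma-separated strings of equal length. The sketch below pairs them position by position; that the two lists correspond one-to-one and that every value is an integer are assumptions drawn from the example, and the function name is hypothetical.

```python
def copy_number_by_region(candidate: dict) -> dict:
    """Map each region name to its copy number for one candidate solution."""
    regions = candidate["copyNumberRegions"].split(",")
    values = [int(v) for v in candidate["copyNumberSolution"].split(",")]
    if len(regions) != len(values):
        raise ValueError("region and copy number lists differ in length")
    return dict(zip(regions, values))


# The two fields as they appear in the candidate solution above.
candidate = {
    "copyNumberRegions": (
        "p5,exon.1,intron.1,exon.2,intron.2,exon.3,intron.3,exon.4,intron.4,"
        "exon.5,intron.5,exon.6,intron.6,exon.7,intron.7,exon.8,intron.8,exon.9,p3"
    ),
    "copyNumberSolution": "2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2",
}
per_region = copy_number_by_region(candidate)
```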

TBI Index File

The TBI (TABIX) index file is associated with the bgzipped VCF files. It allows for data line lookup in VCF files for quick data retrieval. The format is a tab-delimited genome index file developed by Samtools as part of the HTSlib utilities. For more information, visit the Samtools website.

Methylation Control Probe Output File

The software produces a control probe output file ({BeadChipBarcode}_{Position}_ctrl.tsv.gz) per sample that includes the raw methylated and unmethylated values for each control probe.

Each control probe has an address, type, color channel, name, and probe ID. The file also provides the raw signal for methylated green (MG), methylated red (MR), unmethylated green (UG), and unmethylated red (UR).

The file can help identify which probes are available on a given BeadChip.

Methylation CG Output File

The software produces a CG output file ({BeadChipBarcode}_{Position}_cgs.tsv.gz) per sample that includes beta values, m-values and detection p-values for each CG site.

Beta values measure methylation levels in a linear fashion for easy interpretation. Unmethylated probes are close to zero and methylated probes are close to 1.

M-values are log-transformed beta values, which provide a more representative measure of methylation.

Detection p-values measure the likelihood that the signal is background noise. It is recommended that probes with a detection p-value > 0.05 be excluded from analysis, as their signal is likely background noise.
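
The relationship between the two measures, and the detection p-value cutoff, can be sketched as follows. The formula used here is the commonly cited logit transform, M = log2(beta / (1 - beta)); this section does not state the exact transform DRAGEN Array applies, so treat it as an assumption. The function names are hypothetical.

```python
import math

DETECTION_P_THRESHOLD = 0.05  # sites with p-values above this are excluded


def m_value(beta: float) -> float:
    """Convert a beta value (strictly between 0 and 1) to an M-value.

    Assumes the common logit transform; not confirmed by this documentation.
    """
    return math.log2(beta / (1.0 - beta))


def passes_detection(p_value: float) -> bool:
    """Keep a CG site only if its detection p-value is at or below the threshold."""
    return p_value <= DETECTION_P_THRESHOLD
```

Under this transform, a beta value of 0.5 maps to an M-value of 0, and beta values near 0 or 1 map to large negative or positive M-values.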

Methylation Sample QC Summary Files

The software produces a methylation sample QC summary in .xlsx and .tsv file formats (sample_qc_summary.xlsx and sample_qc_summary.tsv) per analysis batch, which provides per-sample QC data for all samples in the batch.

The QC summary provides details on 21 control metrics (see tables below), which are computed in the same way as in the BeadArray Control Reporter software from Illumina. In addition, it provides average red and green raw and normalized signals, time of scanning, proportion of probes passing, overall sample pass/fail status, and the failure codes for control metrics that did not pass. A sample passes only if all 21 control metrics pass. The QC summary .xlsx file further highlights failing parameters for easy viewing.

The QC summary files contain the following fields:

The control metrics in the QC summary files are calculated as follows. The default background correction offset (x) of 3,000 can be modified and applies to all background calculations indicated with (bkg + x). Note that the table uses the default thresholds for EPIC arrays as an example; the default thresholds vary by methylation array. See the Threshold Adjustment section for additional details.

Methylation Sample QC Summary Plots

The software produces a methylation sample QC summary plots file (sample_qc_summary.pdf) per analysis batch, which contains two QC summary plots for quick visual review.

The file contains the following control plots:

Methylation Principal Component Summary

The software produces a methylation principal component summary file (pcs.tsv.gz) per analysis batch, which provides principal component data for each sample within the batch. This can be used to identify the specific samples associated with points on the PCA control plot within the Methylation Sample QC Summary Plots output file.

The files contain the following fields:

Methylation Manifest Files

The software produces two methylation manifest files:

  1. Manifest in Sesame format (probes.csv)

  2. Additional information for control probes (controls.csv)

The probes.csv file has the following columns:

The controls.csv file has the following columns:

Methylation Warning/Error Messages and Logs

The following scenarios result in a warning or error message:

  • Missing IDATs or manifest

  • Incorrect sample sheet formatting

  • Duplicate BeadChip Barcode and Position within the sample sheet

  • Missing control or assay probes

  • Missing required columns in the manifest

  • Unable to compute certain metrics

Examples of such notifications can include the following:

For Research Use Only. Not for use in diagnostic procedures. © 2024 Illumina, Inc. All rights reserved.