Input Files
The following section describes the input files required by DRAGEN Array. Product files (anything other than the IDATs) can be found on the support site.
IDAT Files
For each sample, a pair of raw intensity files (.idat) is generated by the iScan System or NextSeq550 (for select arrays). The files provide intensities in the red and green channels for each probe on the Infinium array. More information on which arrays can be used with NextSeq550 can be found on the Illumina Knowledge page on NextSeq550.
An IDAT file is identified by the BeadChip Barcode (a unique 12-digit Sentrix ID, e.g., 123456789101), the BeadChip Position (row and column of the sample, e.g., R01C01), and Grn (Green) or Red for the specific channel.
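As a minimal sketch, the three identifying components can be extracted from an IDAT file name. This assumes the components are joined with underscores as `<barcode>_<position>_<channel>.idat`, which is not stated above; `parse_idat_name` is a hypothetical helper and not part of DRAGEN Array.

```python
import re
from typing import NamedTuple

# Assumed naming pattern: <12-digit barcode>_<RxxCyy position>_<Grn|Red>.idat
IDAT_NAME = re.compile(r"^(\d{12})_(R\d{2}C\d{2})_(Grn|Red)\.idat$")


class IdatName(NamedTuple):
    barcode: str   # BeadChip Barcode (Sentrix ID)
    position: str  # BeadChip Position, e.g. R01C01
    channel: str   # "Grn" or "Red"


def parse_idat_name(filename: str) -> IdatName:
    """Split an IDAT file name into barcode, position, and channel."""
    match = IDAT_NAME.match(filename)
    if match is None:
        raise ValueError(f"Not a recognized IDAT file name: {filename}")
    return IdatName(*match.groups())


print(parse_idat_name("123456789101_R01C01_Grn.idat"))
```

A sample is complete only when both the Grn and Red files exist for the same barcode and position, so a parser like this could be used to pair files before analysis.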
Manifest Files
The CSV and BPM manifest files can be found on the Illumina Support Site for all commercial Infinium BeadChips or on MyIllumina for custom and semi-custom designs. DRAGEN Array only supports manifest files from the Illumina Support Site. For instructions on obtaining manifest files from MyIllumina, see the Illumina Knowledge article "How to access custom array product files (manifest and product definition files) in MyIllumina".
The CSV manifest file (.csv) provides complementary data to the BPM manifest file in a human-readable format. It is a required input to the genotype gtc-to-vcf command to enable VCF generation for insertion/deletion variants. gtc-to-vcf depends on the presence of accurate mapping information within the manifest and may produce inaccurate results if the mapping information is incorrect. Mapping information follows the implicit dbSNP standard, where:
Positions are reported with 1-based indexing.
Positions in the pseudoautosomal region (PAR) are reported with the mapping position on the X chromosome.
For an insertion relative to the reference, the position of the base immediately 5' to the insertion (on the plus strand) is given.
For a deletion relative to the reference, the position of the most 5' deleted base (on the plus strand) is given.
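The insertion and deletion conventions above can be illustrated on a toy sequence. This is only a sketch of how the reported 1-based positions are interpreted; the reference string and helper names are made up for illustration and do not reflect how DRAGEN Array is implemented.

```python
# Toy plus-strand reference; Python index 0 corresponds to position 1.
REFERENCE = "ACGTACGT"


def apply_insertion(position: int, inserted: str) -> str:
    """Insert bases immediately 3' of the base at the reported 1-based position."""
    return REFERENCE[:position] + inserted + REFERENCE[position:]


def apply_deletion(position: int, length: int) -> str:
    """Delete `length` bases, the first of which sits at the reported 1-based position."""
    start = position - 1
    return REFERENCE[:start] + REFERENCE[start + length:]


# Insertion of "TT" reported at position 4: the T at position 4 is the base
# immediately 5' to the inserted bases.
print(apply_insertion(4, "TT"))  # ACGTTTACGT

# Deletion of 2 bases reported at position 5: positions 5 and 6 (A, C) are removed.
print(apply_deletion(5, 2))  # ACGTGT
```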
Cluster File
The cluster file (.egt) is a standard product file provided by Illumina for commercial genotyping products and it is a required input for the genotype call command in DRAGEN Array. Custom cluster files may be required for optimal genotyping performance. See section Optimizing cluster files and copy number models for additional details.
PGx CN Model File
The PGx CN (Copy Number) model file (.dat) is a required input to the pgx copy-number call command to enable accurate copy number calling for pharmacogenomics. Illumina provides a standard CN model file for each PGx array product. See section Optimizing cluster files and copy number models for additional details.
Cytogenetics Model File
The cytogenetics CN (Copy Number) model file (.dat) is a required input to the cyto call command to enable accurate Cytogenetics analysis. Illumina provides a standard CN model file for each supported array product. For custom or other products, please contact Tech Support to request a CN model file and include the product BPM manifest.
Note: The CN model file needs to be updated upon manifest revisions, since probes can be added or removed during manifest revisions. A mismatch between the CN model file and the manifest will cause an error during pgx copy-number call and cyto call.
Mask File
The mask file (.msk) is a required input to the pgx copy-number train command to enable accurate pgx copy number training for pharmacogenomics. It does not need to be provided as an explicit input to the command line interface but should reside in the same folder as the BPM manifest. It should have the same base name as the manifest for the product. Illumina provides a mask file for each PGx array product and these can be found on the product files support page.
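Because the mask file is located implicitly, a quick pre-flight check can confirm it is where the command expects it. The sketch below only encodes the rule stated above (same folder and same base name as the BPM manifest, with a .msk extension); the function names and the example manifest name are hypothetical.

```python
from pathlib import Path


def expected_mask_path(bpm_manifest: str) -> Path:
    """Return the expected mask path: same folder and base name as the manifest, .msk extension."""
    return Path(bpm_manifest).with_suffix(".msk")


def mask_file_present(bpm_manifest: str) -> bool:
    """Return True if the mask file sits next to the BPM manifest."""
    return expected_mask_path(bpm_manifest).is_file()


# Hypothetical manifest name, used only for illustration.
print(expected_mask_path("product_files/ExampleArray_A1.bpm").name)
```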
PGx Database File
The PGx database file (.zip) contains the variant mapping information from Infinium PGx arrays to PGx variants. Each line in this file represents a single probe ID mapping to a variant's HGVS (Human Genome Variation Society) tag. This creates a map of many probes to one variant. DRAGEN Array cross references this map with SNV VCF IDs during runtime to do star allele calling. It works across all supported PGx products, even though the probes and variant coverage differ across them.
Cytogenetics Database File
The cytogenetics database file (.zip) contains information from the Ensembl and RefSeq data sources that is used to generate the Cytogenetics Annotation JSON File. This file can be used across products (BeadChip/manifest types and versions). It is only needed as an input for local analysis (i.e., cyto annotate), as it is already stored in the cloud for cloud analysis. It may be updated in the future to accommodate changes in the underlying Ensembl and RefSeq data sources.
Genome FASTA Files
The genome FASTA file (.fa) is a text file containing the reference genome sequences. The FASTA index file (.fai) contains metadata about how the chromosomes are organized within the FASTA file for a particular species. DRAGEN Array PGx calling supports human genome builds 37 and 38. The genome FASTA file and FASTA index file are both provided by Illumina for human and should be stored together in the same input folder. For custom reference genomes, the contig identifiers in the provided genome FASTA file must exactly match the chromosome identifiers specified in the provided manifest. For a standard human product manifest, this means that the contig headers should read ">1" rather than ">chr1". Note: The genome FASTA file is only required for the dragen-array-local-analysis workflow. If you are using dragen-array-cloud-analysis, you do not need to provide this file.
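For a custom reference genome, the contig-naming requirement can be checked before running an analysis. The following is a rough sketch: it assumes the index is named by appending .fai to the FASTA file name, and it takes the manifest chromosome identifiers as a plain list supplied by the caller, since reading them from a manifest is outside the scope of the example. The function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def fasta_contigs(fasta_path: str) -> List[str]:
    """Return contig identifiers: the first whitespace-delimited token of each '>' header."""
    contigs = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    contigs.append(fields[0])
    return contigs


def check_reference(fasta_path: str, manifest_chromosomes: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    # Assumption: the index file is named <fasta file name>.fai in the same folder.
    if not Path(str(fasta_path) + ".fai").is_file():
        problems.append("FASTA index (.fai) not found next to the FASTA file")
    missing = sorted(set(manifest_chromosomes) - set(fasta_contigs(fasta_path)))
    if missing:
        problems.append("Chromosomes missing from FASTA: " + ", ".join(missing))
    return problems
```

With a manifest that names chromosomes "1", "2", and so on, a FASTA whose headers read ">chr1" would be reported as missing every chromosome, which is the mismatch described above.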
Sample Sheet
The sample sheet is a CSV-formatted input file that uses two required fields for sample lookup (SentrixBarcode_A and SentrixPosition_A for local; beadChipName and sampleSectionName for cloud) to enable adding optional metadata and analyzing a filtered list of samples within a folder. It is intended to be flexible, and the local version should be backwards compatible with most GenomeStudio samplesheets.
The root folder in which DRAGEN Array searches for the sample files can be set either by providing the --idat-folder or --gtc-folder option (where applicable) or by setting the RootFolder field in the [Header] section. RootFolder should be the full absolute path to the sample files, e.g.,
[Header]
RootFolder,/test/samples
[Data]
....
Note: In the case of conflict between RootFolder and the CLI options (--idat-folder or --gtc-folder), the CLI options take precedence.
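The precedence rule can be sketched as follows. This is an illustration only, not DRAGEN Array's own parser: `read_root_folder` and `resolve_root_folder` are hypothetical helpers, and matching the RootFolder key exactly as spelled above is an assumption.

```python
import csv
import io
from typing import Optional


def read_root_folder(sample_sheet_text: str) -> Optional[str]:
    """Return the RootFolder value from the [Header] section, or None if absent."""
    section = None
    for row in csv.reader(io.StringIO(sample_sheet_text)):
        if not row or not row[0].strip():
            continue  # skip blank lines
        first = row[0].strip()
        if first.startswith("[") and first.endswith("]"):
            section = first  # entering a new section such as [Header] or [Data]
        elif section == "[Header]" and first == "RootFolder" and len(row) > 1:
            return row[1].strip()
    return None


def resolve_root_folder(cli_folder: Optional[str], header_root: Optional[str]) -> Optional[str]:
    """Pick the folder to search: a CLI option, when given, wins over RootFolder."""
    return cli_folder if cli_folder else header_root


sheet = "[Header]\nRootFolder,/test/samples\n[Data]\nSentrixBarcode_A,SentrixPosition_A\n"
print(resolve_root_folder(None, read_root_folder(sheet)))         # /test/samples
print(resolve_root_folder("/cli/idats", read_root_folder(sheet)))  # /cli/idats
```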
The following are all examples of valid samplesheets:
Most basic (no sections, one sample)
SentrixBarcode_A,SentrixPosition_A
204753010023,R02C01
Medium complexity (no sections, multiple samples, optional data)
SentrixBarcode_A,SentrixPosition_A,Sample_ID,Sample_Group,MetaData1
204753010023,R01C01,NA1231,Group1,F
204753010024,R01C01,NA1233,Group2,M
High complexity (sections, multiple samples, optional data)
[Header]
RootFolder,/tests/samples
Date,1/1/2025
[Data]
SentrixBarcode_A,SentrixPosition_A,Sample_ID,Sample_Group,MetaData1
204753010023,R01C01,NA1231,Group1,F
204753010024,R01C01,NA1233,Group2,M
Notes:
The column names are case-insensitive. For example, the columns Sample_Name and sample_name would be considered the same, and the software would produce an error like this: "Duplicate column sample_name found. Column names are case-insensitive. Please remove or rename the column from the samplesheet and re-process."
Because user-provided fields are output in the Genotype Summary File, the column names cannot conflict with the fields already in that file. For example, if the user provides a column named Sex Estimate in their samplesheet, DRAGEN Array will produce the following error: "Sex Estimate is a reserved keyword. Please remove or rename the column from the samplesheet and re-process."
The optional fields (i.e., those other than SentrixBarcode_A and SentrixPosition_A) will be output as-is in the genotype summary files for the genotype call command.
The [Manifests] section (used by GenomeStudio to delineate manifests in multi-manifest analyses) is currently ignored in DRAGEN Array.
A known issue regarding empty columns is described in the v1.3 Release Notes.
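The first two notes can be approximated with a small pre-check on the header row. This is a sketch under stated assumptions: the reserved set contains only the one name mentioned above (the full list of Genotype Summary File fields is not given here), matching reserved names case-insensitively is an assumption, and the messages are shortened versions of the errors quoted above.

```python
from typing import Iterable, List

# Only the reserved name mentioned above; the complete list is not given in this section.
RESERVED = {"sex estimate"}


def check_columns(columns: Iterable[str]) -> List[str]:
    """Return error messages for case-insensitive duplicates and reserved column names."""
    errors = []
    seen = set()
    for column in columns:
        key = column.strip().lower()
        if key in seen:
            errors.append(f"Duplicate column {key} found.")
        seen.add(key)
        if key in RESERVED:
            errors.append(f"{column} is a reserved keyword.")
    return errors


print(check_columns(["SentrixBarcode_A", "SentrixPosition_A", "Sample_Name", "sample_name"]))
print(check_columns(["SentrixBarcode_A", "SentrixPosition_A", "Sex Estimate"]))
```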
For cloud analyses (i.e., for sample selection when running cloud analyses), the samplesheet does not currently support sections such as [Header] and [Data]. Instead of using the SentrixBarcode_A and SentrixPosition_A columns as the sample's keys, it uses beadChipName and sampleSectionName. For example, a valid cloud samplesheet could look like this:
beadChipName,sampleSectionName
204753010023,R01C01
204753010023,R02C01
204753010024,R01C01
204753010024,R02C01
A template is also available on the sample selection interface in BaseSpace.
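Given the differences between the two layouts, a local samplesheet could be converted to the cloud layout roughly as follows. This is a hypothetical convenience script rather than an Illumina tool: it drops everything up to and including the [Data] line (if present) and renames the two key columns, leaving all other columns untouched.

```python
import csv
import io

# Local key columns (lowercased) mapped to their cloud equivalents.
COLUMN_MAP = {
    "sentrixbarcode_a": "beadChipName",
    "sentrixposition_a": "sampleSectionName",
}


def local_to_cloud(local_text: str) -> str:
    """Convert a local samplesheet (with or without sections) to the cloud layout."""
    rows = [
        row for row in csv.reader(io.StringIO(local_text))
        if any(cell.strip() for cell in row)  # ignore blank lines
    ]
    # If the sheet has sections, keep only the rows after the [Data] line.
    for index, row in enumerate(rows):
        if row[0].strip() == "[Data]":
            rows = rows[index + 1:]
            break
    if not rows:
        raise ValueError("No column header row found in samplesheet")
    header = [COLUMN_MAP.get(name.strip().lower(), name.strip()) for name in rows[0]]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])
    return out.getvalue()


local = "[Header]\nRootFolder,/tests/samples\n[Data]\nSentrixBarcode_A,SentrixPosition_A\n204753010023,R01C01\n"
print(local_to_cloud(local))
```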
Methylation QC sample sheet
For DRAGEN Array Methylation QC on cloud, additional optional sample sheet fields are used in the analysis.
Following Sample_Group, any number of additional columns can be added to include metadata fields such as sex, sample type, plate and well information, etc. Columns added after the Sample_Group column may have user-defined column headers. The Sample_ID field and any additional metadata will be replicated in the Sample QC Summary output files.
The Sample_Group field will be used to populate the PCA Control Plot within the Sample QC Summary Plots file and the Principal Component Summary file. For the PCA Control Plot, each sample group will be assigned a unique color. Samples assigned to the same Sample_Group value will be the same color in the PCA Control Plot. e.g.,
beadChipName,sampleSectionName,Sample_ID,Sample_Group,MetaData1
204753010023,R01C01,NA1231,Group1,F
204753010023,R02C01,NA1232,Group2,F
204753010024,R01C01,NA1233,Group2,M
204753010024,R02C01,NA1234,Group1,M
Cytogenetics analysis + Emedgene interpretation sample sheet
For Cytogenetics analysis + Emedgene interpretation on cloud, an additional column, demographicSex, is compared against the Sex Estimate output from the DRAGEN Array genotyping module and displayed in Emedgene. The allowed values for this field are M (Male), F (Female), or U (Unknown).
Example:
beadChipName,sampleSectionName,demographicSex
204753010023,R01C01,F
204753010023,R02C01,F
204753010024,R01C01,M
204753010024,R02C01,M
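A samplesheet of this shape can be checked for disallowed demographicSex values before upload. The sketch below uses a hypothetical helper and treats the allowed values as case-sensitive, which is an assumption.

```python
import csv
import io
from typing import List, Tuple

ALLOWED_SEX = {"M", "F", "U"}  # Male, Female, Unknown


def invalid_demographic_sex(sheet_text: str) -> List[Tuple[str, str, str]]:
    """Return (beadChipName, sampleSectionName, value) for rows with a disallowed value."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    return [
        (row["beadChipName"], row["sampleSectionName"], row["demographicSex"])
        for row in reader
        if row["demographicSex"] not in ALLOWED_SEX
    ]


sheet = (
    "beadChipName,sampleSectionName,demographicSex\n"
    "204753010023,R01C01,F\n"
    "204753010023,R02C01,Female\n"
)
print(invalid_demographic_sex(sheet))
```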
Input File Summary Table
In addition to the input files, there is a set of intermediate files, including GTC, SNV VCF, CNV VCF, and PGx CSV, which are outputs of some DRAGEN Array Local commands and inputs to other commands.
The table below summarizes the input and intermediate files, their sources, and the associated DRAGEN Array Local commands and options.
| File | Source | Commands | Option |
| --- | --- | --- | --- |
| IDAT | User provided from scanning instrument | genotype call | --idat-folder |
| CSV Manifest | Product file from Illumina | genotype gtc-to-vcf | --csv-manifest |
| BPM Manifest | Product file from Illumina | pgx copy-number train, genotype call, genotype gtc-to-bedgraph, genotype gtc-to-vcf | --bpm-manifest |
| Cluster File | Product file from Illumina or user created using GenomeStudio | genotype call | --cluster-file |
| PGx CN Model | Product file from Illumina or user created using DRAGEN Array Local | pgx copy-number call | --cn-model |
| Cytogenetics CN Model | Product file from Illumina | cyto call | --cn-model |
| PGx Database | Product file from Illumina | pgx star-allele call | --database |
| Cytogenetics Database | Product file from Illumina | cyto annotate | --database |
| Genome FASTA | Product file from Illumina | genotype gtc-to-vcf, pgx copy-number train | --genome-fasta-file |
| Sample Sheet | User provided | genotype call, genotype gtc-to-bedgraph, genotype gtc-to-vcf, pgx copy-number call, pgx copy-number train | --sample-sheet |
| GTC | DRAGEN Array output from genotype call | genotype gtc-to-bedgraph, genotype gtc-to-vcf, pgx copy-number call, pgx copy-number train | --gtc-folder |
| SNV and PGx CNV VCF | DRAGEN Array output from genotype gtc-to-vcf and pgx copy-number call | pgx star-allele call | --vcf-folder |
| PGx CSV | DRAGEN Array output from pgx star-allele call | pgx star-allele annotate | --star-alleles |
| Cytogenetics CNV VCF | DRAGEN Array output from cyto call | cyto annotate | --vcf-folder |