Analysis Pipeline Details
Illumina fastq generation
Generation of fastq files from Illumina run folders. This is run on all samples from Illumina run folders generated through the Translational Genomics Laboratory, and data return is included as part of the sequencing costs.
Analysis : Basecalling and Demultiplexing using Illumina bcl2fastq software.
Workflows : bcl2fastq
Deliverables : Raw sequence data (fastq format)
Software versions :
- bcl2fastq
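Downstream pipelines work on fastq pairs, so it can help to see how demultiplexed files might be grouped. The sketch below assumes the default bcl2fastq file naming (`<sample>_S<n>_L<lane>_R1_001.fastq.gz`); the function name `group_fastq_pairs` is hypothetical and not part of the pipeline.

```python
import re
from collections import defaultdict

# Assumed default bcl2fastq naming: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d{3})_(?P<read>[RI][12])_001\.fastq\.gz$"
)


def group_fastq_pairs(filenames):
    """Group fastq files into {(sample, lane): {"R1": name, "R2": name}}."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        key = (match["sample"], int(match["lane"]))
        pairs[key][match["read"]] = name
    return dict(pairs)
```

Each (sample, lane) entry then corresponds to one fastq pair of the kind the alignment pipelines take as input.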
Alignment Only Pipelines
Alignment of raw sequence data to the hg38 genomic reference.
DNASeq
Analysis : Trimming of Illumina adapter sequence with cutadapt, followed by alignment to the reference genome with bwa mem. This is run against a single fastq pair. Merging of data from multiple fastq pairs is part of the Alignment, Merging and Preprocessing pipeline.
Workflows : bwa
Deliverables : aligned sequence data, coordinate sorted (bam format) + indices.
Software versions :
- cutadapt
- bwa
- samtools
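As a rough illustration of the DNASeq steps above, the following sketch builds the command lines for one fastq pair. It is not the production workflow: the adapter sequence, read group fields, thread count and file names are assumptions chosen for the example, and `dnaseq_alignment_commands` is a hypothetical helper.

```python
def dnaseq_alignment_commands(r1, r2, reference, sample, out_bam,
                              threads=4, adapter="AGATCGGAAGAGC"):
    """Return the trim, align, sort and index commands for one fastq pair."""
    # Assumed names for the trimmed intermediate files
    trimmed_r1 = f"{sample}.trimmed.R1.fastq.gz"
    trimmed_r2 = f"{sample}.trimmed.R2.fastq.gz"
    # -a / -A give the 3' adapter for read 1 / read 2 (assumed adapter prefix)
    trim = ["cutadapt", "-a", adapter, "-A", adapter,
            "-o", trimmed_r1, "-p", trimmed_r2, r1, r2]
    # bwa expects a literal backslash-t between read group fields
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    align = ["bwa", "mem", "-t", str(threads), "-R", read_group,
             reference, trimmed_r1, trimmed_r2]
    # The SAM stream from bwa mem is meant to be piped into samtools sort ("-" = stdin)
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    index = ["samtools", "index", out_bam]
    return {"trim": trim, "align": align, "sort": sort, "index": index}
```

The commands are returned as argument lists so that they can be inspected, logged, or passed to `subprocess` without shell quoting issues.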
RNASeq
Analysis : Alignment to the reference genome with STAR. Multiple fastq or fastq pairs from the same sample are merged.
Workflows : star
Deliverables : Aligned sequence data to the genomic reference, coordinate sorted (bam format). Aligned sequence data to the transcriptome (bam format) using gencode v31.
Software versions :
- star
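To show how several fastq pairs from one sample can be merged at alignment time, here is a minimal sketch of a STAR invocation. STAR accepts comma-separated file lists for each mate; the thread count, index directory and output prefix are assumptions, and `star_command` is a hypothetical helper rather than the workflow's actual code.

```python
def star_command(fastq_pairs, genome_dir, out_prefix, threads=8):
    """Build a STAR command that aligns all fastq pairs of one sample together."""
    # Comma-separated lists: all read 1 files, then all read 2 files, in matching order
    r1 = ",".join(pair[0] for pair in fastq_pairs)
    r2 = ",".join(pair[1] for pair in fastq_pairs)
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,            # assumed pre-built hg38 + gencode index
        "--readFilesIn", r1, r2,
        "--readFilesCommand", "zcat",         # inputs are assumed to be gzipped
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--quantMode", "TranscriptomeSAM",    # also emit transcriptome alignments
        "--outFileNamePrefix", out_prefix,
    ]
```

The `--quantMode TranscriptomeSAM` option is what produces the second, transcriptome-coordinate bam listed in the deliverables.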
Alignment, Merging and Preprocessing (DNASeq)
Generation of Call-Ready alignments suitable for downstream analysis.
Analysis : Alignment of raw sequence data to the hg38 genomic reference as described in the Alignment-only pipeline. This is followed by merging of multiple alignments from the same sample. A variety of optional steps can then be run, to prepare the alignments for downstream variant calling. This includes:
- filtering to remove secondary reads with samtools
- duplicate marking with picard
- local realignment of indels with GATK IndelRealigner
- base quality score recalibration with GATK BQSR
Workflows : bwa, bam_merge_preprocessing
Deliverables : aligned sequence data, coordinate sorted (bam format) + indices (1 per fastq file); merged preprocessed “call-ready” alignments (bam format) + indices
Software versions :
- samtools
- picard
- gatk
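The secondary-read filter and duplicate marking both rely on bits in the SAM flag field. The sketch below shows the bit test behind a filter such as `samtools view -F 256`; the tuple layout of the records is an assumption made for the example.

```python
# SAM flag bits defined by the SAM specification
SECONDARY = 0x100   # 256: secondary alignment
DUPLICATE = 0x400   # 1024: PCR or optical duplicate


def is_secondary(flag):
    """True when the secondary-alignment bit is set in a SAM flag."""
    return bool(flag & SECONDARY)


def keep_primary(records):
    """Drop secondary alignments from (read_name, flag) tuples."""
    return [record for record in records if not is_secondary(record[1])]
```

Note that duplicate marking only sets the 0x400 bit; marked reads stay in the file unless a later step filters on that bit.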
Somatic Variant Calling (DNASeq)
Tumour samples are processed through a variety of tools to generate variant calls of different types. Somatic calling requires that each tumour sample has a matched normal (generally blood or tumour-adjacent tissue). These tools can be configured to run on Whole Genome libraries (mutations, structural variants, copy number variation) or Whole Exome libraries (mutations, copy number variation). The input is aligned sequence data in a call-ready state, as generated by the Alignment, Merging and Preprocessing Pipeline.
Mutations (snvs + short indels)
Analysis : somatic snvs and short indels are called using GATK Mutect2. Variants are annotated with Variant Effect Predictor
Workflows : bwa, bam_merge_preprocessing, mutect2, variant effect predictor
Deliverables : annotated somatic calls (vcf files + tbi index)
Software versions :
- mutect2
- variant effect predictor
Structural Variants
Analysis : somatic structural variants are called with delly. Calls are then annotated and validated with mavis.
Workflows : bwa, bam_merge_preprocessing, delly, mavis
Deliverables : delly structural variant table, mavis structural variant table + structural variant drawings
Software versions :
- delly
- mavis
Note : if the same samples have undergone WT sequencing and fusion calling, mavis can be set up to assess both the WG and WT libraries, and merge structural variant calls.
Copy Number Variation
Analysis : varscan is used for mutation and copy number calls. This is input to sequenza for generation of copy number profiles.
Workflows : bwa, bam_merge_preprocessing, varscan, sequenza
Deliverables : sequenza segmentation files etc. These are generated across a series of width parameters. Review of the profiles is key in identifying an optimal output.
Software versions :
- varscan
- sequenza
Whole Transcriptome Analysis Pipeline
Whole transcriptome libraries are aligned to the hg38 reference, from which expression and fusion calls are generated.
Analysis : Alignment to the reference genome with STAR. Expression calls are generated from the alignment with RSEM. Fusion calls are generated with STAR-fusion. Fusion calls are further validated and annotated with mavis.
Workflows : star, rsem, star-fusion, mavis
Deliverables : aligned sequence data to the genomic reference, coordinate sorted (bam format), and to the transcriptome (bam format) using gencode v31. Expression call tables from RSEM using gencode v31. Fusion call tables from star-fusion.
Software versions :
- star
- rsem
- star-fusion
- mavis
Note : if the same samples have undergone WG sequencing and structural variant calling, mavis can be set up to assess both the WG and WT libraries, and merge structural variant calls.
WGTS Analysis Pipeline
This is an accredited pipeline that requires Whole Genome (WG, tumour with matched normal) and Whole Transcriptome (TS, tumour only) data from the same donor, and processes the data through the Somatic Variant Calling and Whole Transcriptome Pipelines.
Analysis : see the Somatic Variant and Whole Transcriptome Pipelines for analysis details, workflows and deliverables. Structural variants (delly) and fusion events (star-fusion) are processed together through Mavis, which will identify and merge similar events from each for validation and annotation.
Notes : This pipeline is used for generation of our accredited clinical reports. It has been validated using data with a minimum whole genome depth (80X tumour, 40X normal) and whole transcriptome readcount (n reads).
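A simple pre-flight check against the validated whole genome depths could look like the sketch below. The genome size used to estimate depth is an approximation for hg38, the function names are hypothetical, and the transcriptome readcount threshold is left out because it is not specified here.

```python
# Validated minimum mean depths for the whole genome libraries
VALIDATED_MIN_DEPTH = {"tumour": 80, "normal": 40}

# Assumption: approximate hg38 size in bases, used only to estimate mean depth
APPROX_GENOME_SIZE = 3_100_000_000


def mean_depth(aligned_bases, genome_size=APPROX_GENOME_SIZE):
    """Estimate mean depth as aligned bases divided by genome size."""
    return aligned_bases / genome_size


def meets_validated_depth(tumour_depth, normal_depth):
    """True when both libraries reach the validated minimum depth."""
    return (tumour_depth >= VALIDATED_MIN_DEPTH["tumour"]
            and normal_depth >= VALIDATED_MIN_DEPTH["normal"])
```

In practice the depth would come from a coverage metrics tool rather than this estimate; the check itself is the part that mirrors the validation limits.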
Single Sample Variant Calling Pipeline (DNASeq)
Call-ready aligned sequence data from tumour samples are processed through a variety of tools to generate variant calls of different types. There is no matched normal sample available, so the calls will be a mix of germline and somatic calls. Additional filtering will be required to remove germline calls and identify possible somatic calls.
This workflow can also be run on normal samples to identify variants for a panel of normals (PON), which can then be used for generation of somatic calls (available as a standard analysis protocol).
Mutations (snvs + short indels)
Analysis : snvs and short indels are called using GATK Mutect2 in tumour-only mode. Variants are annotated with Variant Effect Predictor.
Workflows : mutect2, variant effect predictor
Deliverables : annotated variant calls (vcf files + tbi index)
Software versions :
- mutect2
- variant effect predictor
Copy Number Variation
Analysis : varscan is used for mutation and copy number calls. This is input to sequenza for generation of copy number profiles.
Deliverables : sequenza segmentation files etc. These are generated across a series of width parameters. Review of the profiles is key in identifying an optimal output.
Software versions :
- varscan
- sequenza
Germline Variant Calling
Identification of germline mutations (snv and short indels) in single samples.
Analysis : GATK HaplotypeCaller is used to generate genome-wide variant calls, with a call at each genomic position. A final set of calls with genotypes is then identified for each sample using GATK GenotypeGVCFs. The genome-wide calls can also be used for joint variant calling across a set of samples (available as a standard analysis protocol).
Workflows : haplotypeCaller, genotypeGVCFs
Deliverables : genome wide variant calls (gvcf), filtered variant calls with genotypes (vcf) + indices (tbi)
Software versions :
- gatk
- tabix (for indexing)
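The two-step germline calling described above can be sketched as a short command sequence. This assumes GATK4-style command lines and the `GVCF` reference-confidence mode; the exact GATK version and mode used by the workflow are not stated here, and `germline_calling_commands` and the output names are hypothetical.

```python
def germline_calling_commands(bam, reference, sample):
    """Return HaplotypeCaller, GenotypeGVCFs and tabix commands for one sample."""
    gvcf = f"{sample}.g.vcf.gz"   # genome-wide calls (assumed output name)
    vcf = f"{sample}.vcf.gz"      # genotyped calls (assumed output name)
    return [
        # Step 1: genome-wide calls with reference confidence (assumed GVCF mode)
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam,
         "-O", gvcf, "-ERC", "GVCF"],
        # Step 2: final genotypes for this sample from its gvcf
        ["gatk", "GenotypeGVCFs", "-R", reference, "-V", gvcf, "-O", vcf],
        # Step 3: tabix index for the bgzipped vcf
        ["tabix", "-p", "vcf", vcf],
    ]
```

Keeping the gvcf as a deliverable is what makes later joint calling across samples possible without rerunning the first step.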
Alignment and UMI-Based collapse pipeline
Our targeted sequencing assays and cfMeDIP assays both incorporate Unique Molecular Identifiers prior to PCR amplification. Sequencing is generally done to very high depth, and the UMIs allow more accurate deduplication of the mapped reads.
Analysis : UMI extraction from the fastq files using barcodex, followed by alignment to the reference genome (hg38). Duplicates can be simply marked, or the final set of reads can be collapsed to remove duplicates with UMI-tools, retaining the read with the highest quality.
Workflows : umicollapse
Deliverables : umi-extracted raw sequence (fastq), aligned uncollapsed reads (bam), aligned collapsed reads (bam)
Software versions :
- barcodex
- bwa mem
- umi-tools
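The idea behind UMI-based collapse can be illustrated with a small sketch: reads sharing a mapping position, strand and UMI are treated as duplicates, and the one with the highest quality is kept. This is a simplification that uses exact UMI matches only, whereas UMI-tools can also group UMIs that differ by sequencing errors; the `Read` fields are assumptions for the example.

```python
from collections import namedtuple

# Hypothetical minimal read record for illustration
Read = namedtuple("Read", "name chrom pos strand umi quality")


def collapse_by_umi(reads):
    """Keep the highest-quality read per (chrom, pos, strand, UMI) group."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.umi)
        # Replace the stored read only if this one has strictly higher quality
        if key not in best or read.quality > best[key].quality:
            best[key] = read
    return list(best.values())
```

Two reads at the same position with different UMIs are kept as separate molecules, which is what position-only duplicate marking would get wrong at very high depth.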
Targeted Sequencing UMI-Based Consensus generation and variant calling
Our targeted sequencing assays, which use a variety of targeted-sequencing panels, incorporate Unique Molecular Identifiers. Sequencing is generally done to very high depth, and the UMIs allow more accurate deduplication of the mapped reads. Assessment of variability in the primary sequence across duplicate reads allows generation of an error-corrected consensus sequence.
Analysis : UMI extraction from the fastq files using consensus cruncher, followed by alignment to the reference genome (hg38). The aligned sequence is then partitioned into several subsets, based on the ability to form a duplex consensus (both strands), a single-strand consensus, or no consensus (singletons). Singletons that support the consensus are retained. A combination of duplex and single-strand consensus + supporting singletons is provided to Mutect2 for variant calling in tumour-only mode, followed by annotation with Variant Effect Predictor.
Workflows : consensusCruncherWorkflow
Deliverables : partitioned aligned reads in various subsets (bam), annotated variant calls (vcf)
Software versions :
- consensusCruncher
- bwa mem
- mutect2
- variantEffectPredictor
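To make the consensus idea concrete, the sketch below takes a family of equal-length reads sharing a UMI, builds a per-position majority-vote consensus, and labels the family by strand support. It is a simplified illustration and not the consensusCruncher implementation; the 0.7 agreement cutoff, the minimum family size of two and both function names are assumptions.

```python
from collections import Counter


def consensus_sequence(reads, min_fraction=0.7):
    """Per-position majority vote; 'N' where no base reaches min_fraction."""
    consensus = []
    for column in zip(*reads):  # one column per position across all reads
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)


def classify_family(plus_reads, minus_reads):
    """Label a UMI family by how much strand support it has."""
    if len(plus_reads) >= 2 and len(minus_reads) >= 2:
        return "duplex"          # consensus can be formed on both strands
    if len(plus_reads) >= 2 or len(minus_reads) >= 2:
        return "single-strand"   # consensus on one strand only
    return "singleton"           # too few reads to form a consensus
```

A base seen in only one of three duplicate reads is outvoted or masked, which is how sequencing and PCR errors are suppressed before variant calling.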
Shallow Whole Genome Copy Number Pipeline
Whole genome sequencing to ultra-low depth can be used for generation of copy number profiles and determination of tumour ploidy and purity.
Analysis : ichorCNA is used for assessment of copy number and ploidy + purity metrics
Workflows : ichorCNA
Deliverables : aligned sequence, metrics, profiles
Software versions :
- ichorCNA
Whole Transcriptome Immune Analysis Pipeline
Whole transcriptome libraries are processed through a variety of immune inference tools.
Analysis : Current tools include HLAMiner (HLA predictions) and Trust4 (immune repertoire)
Workflows : HLAMiner, Trust4
Deliverables :
Software versions :
- HLAMiner
- Trust4