Analysis Pipeline Details

Illumina fastq generation

Generation of fastq files from Illumina run folders. This is run on all samples from Illumina run folders generated through the Translational Genomics Laboratory; data return is included in the sequencing costs.

Analysis : Basecalling and Demultiplexing using Illumina bcl2fastq software.

Workflows : bcl2fastq

Deliverables : Raw sequence data (fastq format)

Software versions :

  • bcl2fastq
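
As an illustration of the basecalling and demultiplexing step, the sketch below builds a bcl2fastq command line without running it. The paths are placeholders, and the options the production workflow actually passes are not stated in this document, so treat the flag selection as an assumption.

```python
import shlex
from typing import List


def bcl2fastq_command(run_folder: str, output_dir: str, sample_sheet: str) -> List[str]:
    """Build (but do not execute) a bcl2fastq command for one run folder."""
    return [
        "bcl2fastq",
        "--runfolder-dir", run_folder,   # Illumina run folder holding the BCL files
        "--output-dir", output_dir,      # where the demultiplexed fastq files are written
        "--sample-sheet", sample_sheet,  # maps index sequences to sample names
    ]


# Placeholder paths, for illustration only.
cmd = bcl2fastq_command("/path/to/run_folder", "/path/to/fastq_out", "/path/to/SampleSheet.csv")
print(shlex.join(cmd))
```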

Alignment Only Pipelines

Alignment of raw sequence data to the hg38 genomic reference.

DNASeq

Analysis : Trimming of Illumina adapter sequence with cutadapt, followed by alignment to the reference genome with bwa mem. This is run against a single fastq pair. Merging of data from multiple fastq pairs is part of the Alignment, Merging and Preprocessing pipeline.

Workflows : bwa

Deliverables : aligned sequence data, coordinate sorted (bam format) + indices.

Software versions :

  • cutadapt
  • bwa
  • samtools
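
A minimal sketch of the trim, align, sort and index steps for one fastq pair is shown below. It only assembles the command lines. The adapter sequence, thread count, read group and file names are assumptions for illustration and are not taken from the bwa workflow configuration.

```python
import shlex
from typing import List

# Assumption: the commonly used Illumina adapter prefix; the library kit may differ.
ADAPTER = "AGATCGGAAGAGC"


def trim_command(r1: str, r2: str, out1: str, out2: str, adapter: str = ADAPTER) -> List[str]:
    """cutadapt on a read pair: -a trims read 1, -A trims read 2."""
    return ["cutadapt", "-a", adapter, "-A", adapter, "-o", out1, "-p", out2, r1, r2]


def align_pipeline(reference: str, r1: str, r2: str, out_bam: str,
                   read_group: str, threads: int = 4) -> str:
    """Shell pipeline: bwa mem piped into samtools sort (coordinate sorted bam)."""
    bwa = ["bwa", "mem", "-t", str(threads), "-R", read_group, reference, r1, r2]
    sort = ["samtools", "sort", "-o", out_bam, "-"]
    return shlex.join(bwa) + " | " + shlex.join(sort)


def index_command(bam: str) -> List[str]:
    """samtools index writes the index next to the bam file."""
    return ["samtools", "index", bam]


# Hypothetical file names and read group.
print(shlex.join(trim_command("s1_R1.fastq.gz", "s1_R2.fastq.gz",
                              "s1_R1.trim.fastq.gz", "s1_R2.trim.fastq.gz")))
print(align_pipeline("hg38.fa", "s1_R1.trim.fastq.gz", "s1_R2.trim.fastq.gz",
                     "s1.bam", r"@RG\tID:lane1\tSM:s1"))
print(shlex.join(index_command("s1.bam")))
```

The read group is passed with a literal `\t`, which bwa mem converts to tab characters itself.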

RNASeq

Analysis : Alignment to the reference genome with STAR. Multiple fastq or fastq pairs from the same sample are merged.

Workflows : star

Deliverables : Sequence data aligned to the genomic reference, coordinate sorted (bam format). Sequence data aligned to the transcriptome (bam format) using gencode v31.

Software versions :

  • star
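
The sketch below shows one way a STAR command could be assembled so that several fastq files (or pairs) from the same sample are aligned together and both genomic and transcriptome alignments are written. The index path, thread count and file names are hypothetical, and the options used by the star workflow itself are not documented here.

```python
import shlex
from typing import List, Optional


def star_command(genome_dir: str, r1_files: List[str], r2_files: Optional[List[str]],
                 out_prefix: str, threads: int = 8) -> List[str]:
    """Build a STAR command; multiple fastq files per mate are comma-joined."""
    reads = [",".join(r1_files)]
    if r2_files:
        reads.append(",".join(r2_files))
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,                  # STAR index built from the reference + annotation
        "--readFilesIn", *reads,
        "--readFilesCommand", "zcat",               # input fastq files are gzipped
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--quantMode", "TranscriptomeSAM",          # also emit transcriptome-space alignments
        "--outFileNamePrefix", out_prefix,
    ]


cmd = star_command("/path/to/star_index",
                   ["L1_R1.fastq.gz", "L2_R1.fastq.gz"],
                   ["L1_R2.fastq.gz", "L2_R2.fastq.gz"],
                   "sample1.")
print(shlex.join(cmd))
```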

Alignment, Merging and Preprocessing (DNASeq)

Generation of Call-Ready alignments suitable for downstream analysis.

Analysis : Alignment of raw sequence data to the hg38 genomic reference as described in the Alignment-only pipeline. This is followed by merging of multiple alignments from the same sample. A variety of optional steps can then be run to prepare the alignments for downstream variant calling. These include:

  • filtering to remove secondary reads with samtools
  • duplicate marking with picard
  • local realignment of indels with GATK IndelRealigner
  • base quality score recalibration with GATK bqsr

Workflows : bwa, bam_merge_preprocessing

Deliverables : aligned sequence data, coordinate sorted (bam format) + indices (1 per fastq file); merged preprocessed “call-ready” alignments (bam format) + indices

Software versions :

  • samtools
  • picard
  • gatk
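
To make the secondary-read filtering step listed above concrete, the sketch below applies the same test that `samtools view -F 256` performs: a record is dropped when bit 0x100 ("not primary alignment") is set in its SAM flag. It works on plain SAM text lines for illustration; the workflow itself uses samtools.

```python
from typing import Iterable, Iterator

SECONDARY = 0x100  # SAM flag bit 256: secondary (not primary) alignment


def is_secondary(flag: int) -> bool:
    """True when the secondary-alignment bit is set in a SAM flag."""
    return bool(flag & SECONDARY)


def drop_secondary(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and every alignment record that is not secondary."""
    for line in sam_lines:
        if line.startswith("@"):           # header lines pass through unchanged
            yield line
            continue
        flag = int(line.split("\t")[1])    # FLAG is the second SAM column
        if not is_secondary(flag):
            yield line


# Truncated, made-up records: flag 99 is a primary paired read, 355 = 99 + 256.
records = ["@HD\tVN:1.6", "r1\t99\tchr1\t100", "r1\t355\tchr2\t500"]
for kept in drop_secondary(records):
    print(kept)
```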

Somatic Variant Calling (DNASeq)

Tumour samples are processed through a variety of tools to generate variant calls of different types. Somatic calling requires that each tumour sample has a matched normal (generally blood or tumour-adjacent tissue). These tools can be configured to run on Whole Genome libraries (mutations, structural variants, copy number variation) or Whole Exome libraries (mutations, copy number variation). The input is aligned sequence data in a call-ready state, as generated by the Alignment, Merging and Preprocessing pipeline.

Mutations (snvs + short indels)

Analysis : somatic snvs and short indels are called using GATK Mutect2. Variants are annotated with Variant Effect Predictor.

Workflows : bwa, bam_merge_preprocessing, mutect2, variant effect predictor

Deliverables : annotated somatic calls (vcf files + tbi index)

Software versions :

  • mutect2
  • variant effect predictor
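
A sketch of how the matched tumour/normal call and the annotation step could be expressed as command lines is given below. File names and the normal sample name are hypothetical, and the exact arguments (for example any filtering between calling and annotation) used by the mutect2 and variant effect predictor workflows are not stated here.

```python
import shlex
from typing import List


def mutect2_command(reference: str, tumour_bam: str, normal_bam: str,
                    normal_sample: str, out_vcf: str) -> List[str]:
    """GATK4 Mutect2 in tumour/normal mode; -normal is the SM name of the normal."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", tumour_bam,
        "-I", normal_bam,
        "-normal", normal_sample,
        "-O", out_vcf,
    ]


def vep_command(in_vcf: str, out_vcf: str, assembly: str = "GRCh38") -> List[str]:
    """Variant Effect Predictor with a local cache, writing VCF output."""
    return ["vep", "-i", in_vcf, "-o", out_vcf, "--vcf", "--cache", "--assembly", assembly]


print(shlex.join(mutect2_command("hg38.fa", "tumour.bam", "normal.bam",
                                 "NORMAL_SAMPLE", "somatic.vcf.gz")))
print(shlex.join(vep_command("somatic.vcf.gz", "somatic.vep.vcf")))
```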

Structural Variants

Analysis : somatic structural variants are called with delly. Calls are then annotated and validated with mavis.

Workflows : bwa, bam_merge_preprocessing, delly, mavis

Deliverables : delly structural variant table, mavis structural variant table + structural variant drawings

Software versions :

  • delly
  • mavis

Note : if the same samples have undergone WT sequencing and fusion calling, mavis can be set up to assess both the WG and WT libraries, and merge structural variant calls.
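
For the delly step, a minimal sketch of a tumour/normal call followed by delly's somatic filter is shown below. Sample identifiers and file names are hypothetical, the mavis step is not shown, and the parameters used by the delly workflow may differ.

```python
import shlex
from typing import List


def delly_call_command(reference: str, tumour_bam: str, normal_bam: str, out_bcf: str) -> List[str]:
    """Joint structural variant discovery on the tumour and its matched normal."""
    return ["delly", "call", "-g", reference, "-o", out_bcf, tumour_bam, normal_bam]


def delly_samples_tsv(tumour_id: str, normal_id: str) -> str:
    """Sample description consumed by 'delly filter': one 'id<TAB>tumor|control' per line."""
    return f"{tumour_id}\ttumor\n{normal_id}\tcontrol\n"


def delly_somatic_filter_command(in_bcf: str, samples_tsv: str, out_bcf: str) -> List[str]:
    """Keep calls supported in the tumour and absent from the control."""
    return ["delly", "filter", "-f", "somatic", "-o", out_bcf, "-s", samples_tsv, in_bcf]


print(shlex.join(delly_call_command("hg38.fa", "tumour.bam", "normal.bam", "sv.bcf")))
print(delly_samples_tsv("TUMOUR_SAMPLE", "NORMAL_SAMPLE"), end="")
print(shlex.join(delly_somatic_filter_command("sv.bcf", "samples.tsv", "sv.somatic.bcf")))
```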

Copy Number Variation

Analysis : varscan is used for mutation and copy number calls. This is input to sequenza for generation of copy number profiles.

Workflows : bwa, bam_merge_preprocessing, varscan, sequenza

Deliverables : sequenza segmentation files etc. These are generated across a series of width parameters. Review of the profiles is key in identifying an optimal output.

Software versions :

  • varscan
  • sequenza

Whole Transcriptome Analysis Pipeline

Whole transcriptome libraries are aligned to the hg38 reference, from which expression and fusion calls are generated.

Analysis : Alignment to the reference genome with STAR. Expression calls are generated from the alignment with RSEM. Fusion calls are generated with STAR-fusion, then further validated and annotated with mavis.

Workflows : star, rsem, star-fusion, mavis

Deliverables : aligned sequence data to the genomic reference, coordinate sorted (bam format), and to the transcriptome (bam format) using gencode v31. Expression call tables from RSEM using gencode v31. Fusion call tables from star-fusion.

Software versions :

  • star
  • rsem
  • star-fusion
  • mavis

Note : if the same samples have undergone WG sequencing and structural variant calling, mavis can be set up to assess both the WG and WT libraries, and merge structural variant calls.


WGTS Analysis Pipeline

This is an accredited pipeline that requires Whole Genome (WG, tumour with matched normal) and Whole Transcriptome (TS, tumour only) data from the same donor, and processes the data through the Somatic Variant Calling and Whole Transcriptome Pipelines.

Analysis : see the Somatic Variant and Whole Transcriptome Pipelines for analysis details, workflows and deliverables. Structural variants (delly) and fusion events (star-fusion) are processed together through Mavis, which will identify and merge similar events from each for validation and annotation.

Notes : This pipeline is used for generation of our accredited clinical reports. It has been validated using data with a minimum whole genome depth (80X tumour, 40X normal) and whole transcriptome readcount (n reads).
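
The depth thresholds in the note above can be expressed as a simple pre-flight check. This is a sketch only: the function name and inputs are hypothetical, and the whole transcriptome read-count threshold is left as an optional parameter because its value is not given in this document.

```python
from typing import List, Optional

MIN_TUMOUR_DEPTH = 80.0  # whole genome, tumour (from the validation note)
MIN_NORMAL_DEPTH = 40.0  # whole genome, matched normal (from the validation note)


def wgts_qc_failures(tumour_depth: float, normal_depth: float,
                     wt_reads: Optional[int] = None,
                     min_wt_reads: Optional[int] = None) -> List[str]:
    """Return a list of reasons a case falls below the validated minimums."""
    failures = []
    if tumour_depth < MIN_TUMOUR_DEPTH:
        failures.append(f"tumour depth {tumour_depth:.1f}X < {MIN_TUMOUR_DEPTH:.0f}X")
    if normal_depth < MIN_NORMAL_DEPTH:
        failures.append(f"normal depth {normal_depth:.1f}X < {MIN_NORMAL_DEPTH:.0f}X")
    # The read-count check only runs when both values are supplied by the caller.
    if wt_reads is not None and min_wt_reads is not None and wt_reads < min_wt_reads:
        failures.append(f"WT read count {wt_reads} < {min_wt_reads}")
    return failures


print(wgts_qc_failures(85.2, 41.0))
print(wgts_qc_failures(60.0, 41.0))
```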


Single Sample Variant Calling Pipeline (DNASeq)

Call-ready aligned sequence data from tumour samples are processed through a variety of tools to generate variant calls of different types. There is no matched normal sample available, so the calls will be a mix of germline and somatic calls. Additional filtering will be required to remove germline calls and identify possible somatic calls.
This workflow can also be run on normal samples to identify variants for a panel of normals (PON), which can then be used for generation of somatic calls (available as a standard analysis protocol).

Mutations (snvs + short indels)

Analysis : snvs and short indels are called using GATK Mutect2 in tumour-only mode (no matched normal). Variants are annotated with Variant Effect Predictor.

Workflows : mutect2, variant effect predictor

Deliverables : annotated somatic calls (vcf files + tbi index)

Software versions :

  • mutect2
  • variant effect predictor

Copy Number Variation

Analysis : varscan is used for mutation and copy number calls. This is input to sequenza for generation of copy number profiles.

Workflows : varscan, sequenza

Deliverables : sequenza segmentation files etc. These are generated across a series of width parameters. Review of the profiles is key in identifying an optimal output.

Software versions :

  • varscan
  • sequenza

Germline Variant Calling

Identification of germline mutations (snv and short indels) in single samples.

Analysis : GATK HaplotypeCaller is used to generate genome-wide variant calls, with a call at each genomic position. A final set of calls with genotypes is identified for each sample using GATK GenotypeGVCFs. The genome-wide calls can also be used for joint variant calling across a set of samples (available as a standard analysis protocol).

Workflows : haplotypeCaller, genotypeGVCFs

Deliverables : genome wide variant calls (gvcf), filtered variant calls with genotypes (vcf) + indices (tbi)

Software versions :

  • gatk
  • tabix (for indexing)
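
The two-step germline calling described above can be sketched as a pair of GATK4 command lines: HaplotypeCaller in GVCF mode, then GenotypeGVCFs on the resulting gvcf. File names are hypothetical and any additional arguments used by the haplotypeCaller and genotypeGVCFs workflows (intervals, filtering) are not shown.

```python
import shlex
from typing import List


def haplotypecaller_command(reference: str, bam: str, out_gvcf: str) -> List[str]:
    """HaplotypeCaller in GVCF mode, which emits a record or block for every position."""
    return ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam,
            "-O", out_gvcf, "-ERC", "GVCF"]


def genotypegvcfs_command(reference: str, gvcf: str, out_vcf: str) -> List[str]:
    """GenotypeGVCFs turns the per-sample gvcf into final genotyped calls."""
    return ["gatk", "GenotypeGVCFs", "-R", reference, "-V", gvcf, "-O", out_vcf]


print(shlex.join(haplotypecaller_command("hg38.fa", "sample1.bam", "sample1.g.vcf.gz")))
print(shlex.join(genotypegvcfs_command("hg38.fa", "sample1.g.vcf.gz", "sample1.vcf.gz")))
```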

Alignment and UMI-Based collapse pipeline

Our targeted sequencing assays and cfMeDIP assays both incorporate Unique Molecular Identifiers prior to PCR amplification. Sequencing is generally done to very high depth, and the UMIs allow more accurate deduplication of the mapped reads.

Analysis : UMI extraction from the fastq files using barcodex, followed by alignment to the reference genome (hg38). Duplicates can simply be marked, or the final set of reads can be collapsed with UMI-tools to remove duplicates, retaining the read with the highest quality.

Workflows : umicollapse

Deliverables : umi-extracted raw sequence (fastq), aligned uncollapsed reads (bam), aligned collapsed reads (bam)

Software versions :

  • barcodex
  • bwa mem
  • umi-tools
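
The idea behind UMI-based collapsing can be illustrated with a small, deliberately simplified sketch: reads that share a mapping position, strand and UMI are treated as duplicates, and the one with the highest quality is retained. The `Read` record and its `quality` field are hypothetical, and grouping here is by exact UMI match only; UMI-tools also offers methods that tolerate sequencing errors in the UMI, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Read:
    name: str
    chrom: str
    pos: int
    strand: str
    umi: str
    quality: float  # hypothetical per-read quality summary


def collapse_by_umi(reads: Iterable[Read]) -> List[Read]:
    """Keep the highest-quality read per (chrom, pos, strand, umi) group."""
    best: Dict[Tuple[str, int, str, str], Read] = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.umi)
        current = best.get(key)
        if current is None or read.quality > current.quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.pos, r.strand, r.umi))


reads = [
    Read("r1", "chr1", 100, "+", "ACGT", 30.0),
    Read("r2", "chr1", 100, "+", "ACGT", 35.0),  # duplicate of r1, higher quality
    Read("r3", "chr1", 100, "+", "TTTT", 20.0),  # same position, different molecule
]
print([r.name for r in collapse_by_umi(reads)])
```

Without the UMI, r3 would be indistinguishable from r1 and r2 by position alone and would be removed as a duplicate.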

Targeted Sequencing UMI-Based Consensus generation and variant calling

Our targeted sequencing assays, which use a variety of targeted-sequencing panels, incorporate Unique Molecular Identifiers. Sequencing is generally done to very high depth, and the UMIs allow more accurate deduplication of the mapped reads. Assessment of variability in the primary sequence across duplicate reads allows generation of an error-corrected consensus sequence.

Analysis : UMI extraction from the fastq files using consensus cruncher, followed by alignment to the reference genome (hg38). The aligned sequence is then partitioned into several subsets based on the ability to form a duplex consensus (both strands), a single-strand consensus, or no consensus (singletons). Singletons that support the consensus are retained. A combination of duplex and single-strand consensus + supporting singletons is provided to Mutect2 for variant calling in tumour-only mode, followed by annotation with Variant Effect Predictor.

Workflows : consensusCruncherWorkflow

Deliverables : partitioned aligned reads in various subsets (bam), annotated variant calls (vcf)

Software versions :

  • consensusCruncher
  • bwa mem
  • mutect2
  • variantEffectPredictor
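
Error-corrected consensus generation can be illustrated with a per-position majority vote across the reads of one UMI family. This is a simplified sketch, not ConsensusCruncher's implementation: it assumes equal-length, already aligned read sequences, ignores base qualities, and the 0.7 agreement cutoff is an assumed value chosen for the example.

```python
from collections import Counter
from typing import List


def family_consensus(reads: List[str], min_fraction: float = 0.7) -> str:
    """Majority-vote consensus; positions without enough agreement become 'N'."""
    if not reads:
        raise ValueError("a UMI family needs at least one read")
    if len({len(r) for r in reads}) != 1:
        raise ValueError("reads in a family must have equal length")
    consensus = []
    for column in zip(*reads):                    # iterate position by position
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(reads) >= min_fraction else "N")
    return "".join(consensus)


# Three of four reads agree at the third position, so the error is voted out.
print(family_consensus(["ACGT", "ACGT", "ACGT", "ACTT"]))
# With only two reads a disagreement cannot be resolved and is masked.
print(family_consensus(["ACGT", "ACTT"]))
```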

Shallow Whole Genome Copy Number Pipeline

Whole genome sequencing to ultra-low depth can be used for generation of copy number profiles and determination of tumour ploidy and purity.

Analysis : ichorCNA is used for assessment of copy number and ploidy + purity metrics

Workflows : ichorCNA

Deliverables : aligned sequence, metrics, profiles

Software versions :

  • ichorCNA

Whole Transcriptome Immune Analysis Pipeline

Whole transcriptome libraries are processed through a variety of Immune inference tools

Analysis : Current tools include HLAMiner (HLA predictions) and Trust4 (immune repertoire)

Workflows : HLAMiner, Trust4

Deliverables :

Software versions :

  • HLAMiner
  • Trust4