The quest for robust, rapid, scalable & integrated genomics analysis
In today’s world of bioinformatics, data have become so broad and complex that they form intricate silos, which prevent researchers from unlocking their true value. Existing systems can no longer cope and instead become hurdles, giving rise to scalability, management and maintenance issues. Integrated Analysis (IA) is a holistic, FAIR (Findable, Accessible, Interoperable, Reusable) and federated approach to data analysis. It requires uniting disconnected, diverse data and transcending system, silo and collaboration barriers, and it results in impactful insights.
A central tenet of the Integrated Analysis methodology, and of bioinformatics in general, is data standardisation, which enables researchers to transform raw, siloed data into impactful insights. Every robust, multi-step bioinformatics workflow includes three analytical stages: primary, secondary and tertiary analysis.
- Primary Analysis (sequence generation & quality control/assurance)
- Secondary Analysis (sequence processing)
- Tertiary Analysis (results interpretation)
The challenges of standardising & scaling secondary analysis
The lack of standardisation in secondary analysis is widely recognised in the field and can be attributed to the overwhelming number of constantly evolving open-source solutions available to researchers. As software development for omics analyses accelerates (over 3,000 bioinformatics tools were developed in 2018 alone), standardisation becomes of utmost importance to ensure harmonised, easily reproducible analyses. Furthermore, bioinformatics software is often implemented as single-purpose packages that must be strung together to build a custom or ad-hoc secondary analysis pipeline. Stringing software together manually, however, is tedious and time-consuming. Lastly, secondary analysis is resource-intensive, as it runs a set of time-consuming algorithms on a per-sample basis.
When it comes to genome analysis, the BWA-Genome Analysis Toolkit (BWA-GATK), part of the Broad Institute’s Best Practices analysis pipeline, allows researchers to map and align reads against reference genomes, and to perform variant discovery on the quality-scored sequence data produced by primary analysis. This is, by far, the most widely used analysis pipeline, enabling researchers to go from raw reads (FASTQ or unaligned BAM files) to VCF files.
BWA-GATK is not a cloud-native application, which makes the pipeline slow, memory-intensive and hard to scale. The end user is directly responsible for optimising system utilisation and improving the scalability of the process.
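To make the idea of stringing single-purpose tools together concrete, here is a minimal Python sketch that only builds (and prints) the per-sample commands for a stripped-down BWA-GATK run. It assumes paired-end FASTQ input and omits the duplicate-marking and base-recalibration steps of the full Best Practices workflow. The function name, sample names, file-naming scheme and directory layout are hypothetical and chosen purely for illustration.

```python
def secondary_analysis_commands(sample, ref, fastq_dir, out_dir, threads=4):
    """Return the ordered commands for one sample. Nothing is executed here."""
    # Hypothetical naming scheme: <sample>_R1/_R2.fastq.gz for paired-end reads.
    fq1 = f"{fastq_dir}/{sample}_R1.fastq.gz"
    fq2 = f"{fastq_dir}/{sample}_R2.fastq.gz"
    sam = f"{out_dir}/{sample}.sam"
    bam = f"{out_dir}/{sample}.sorted.bam"
    vcf = f"{out_dir}/{sample}.vcf.gz"
    # GATK expects read-group information; bwa expands the literal "\t" itself.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # 1. Map and align reads against the reference genome.
        ["bwa", "mem", "-t", str(threads), "-R", read_group, "-o", sam, ref, fq1, fq2],
        # 2. Coordinate-sort the alignments and write a BAM file.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # 3. Index the sorted BAM so the variant caller can seek by region.
        ["samtools", "index", bam],
        # 4. Call variants and write a compressed VCF.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", vcf],
    ]


# The whole chain is repeated for every sample, which is why cost grows per sample.
for sample in ["sampleA", "sampleB"]:
    for cmd in secondary_analysis_commands(sample, "ref.fa", "fastq", "out"):
        print(" ".join(cmd))
```

Even in this reduced form, every sample needs four tool invocations whose inputs and outputs have to be wired together by hand. That wiring is the tedious, error-prone part that standardised pipelines aim to remove.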
DRAGEN: an ultra-rapid cloud-native implementation of BWA-GATK
DRAGEN is an accelerated and improved cloud-native implementation of the BWA-GATK standard that can be accessed and run directly on the AWS cloud through the AWS Marketplace. It resolves the issue of lengthy compute times for the secondary analysis of NGS data: users can perform the secondary analysis of a whole human genome with 30x coverage in 25 minutes on-premises, a task that would previously have taken close to 15 hours on a traditional CPU-based system. When running secondary analysis on the AWS cloud, DRAGEN provides the same speed and accuracy as running on-premises, while also delivering the flexibility and scalability of the cloud.
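As a quick back-of-the-envelope check on those figures, the short Python sketch below computes the implied speed-up and the total compute time for a cohort. The cohort size of 100 genomes and the assumption that genomes are processed one after another on a single system are illustrative only.

```python
def speedup(baseline_hours, accelerated_minutes):
    """Ratio of baseline runtime to accelerated runtime."""
    return baseline_hours * 60 / accelerated_minutes


def cohort_hours(n_genomes, minutes_per_genome):
    """Total hours if genomes are processed sequentially on one system."""
    return n_genomes * minutes_per_genome / 60


# Figures quoted in the text: about 15 hours on CPU versus 25 minutes with DRAGEN.
print(speedup(15, 25))                    # 36.0

# Hypothetical cohort of 100 genomes, processed one after another.
print(cohort_hours(100, 15 * 60))         # 1500.0
print(round(cohort_hours(100, 25), 1))    # 41.7
```

Under these assumptions the per-genome speed-up is roughly 36-fold, which turns about two months of continuous CPU compute for 100 genomes into under two days.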
Besides speed, DRAGEN has also proven to be highly sensitive and specific in detecting small variants (it was named in silico Variant Catcher of the PrecisionFDA Hidden Treasures 2018). This has led to its adoption for ultra-rapid secondary analysis by some of the biggest names in genomics research, including Genomics England.
DRAGEN analyses data from different sequencing experiments, such as whole genomes, targeted panels, germline/somatic datasets and RNA sequencing. It can also be used in a variety of research applications, including population sequencing, rapid genomic testing in newborn intensive care units, and clinical and translational research, among others.
Although DRAGEN is far superior to the standard BWA-GATK implementation in terms of speed and accuracy, it still lacks the analysis auditing and tracking needed to achieve truly standardised and reproducible secondary NGS analyses.
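To illustrate what auditing and tracking could mean in practice, here is a minimal Python sketch of a run record. It is not DRAGEN's or Deploit's data model: the `RunRecord` class, its fields, the version string and the parameter names are all hypothetical. The idea is simply to store who ran what, with which parameters and inputs, and to derive a fingerprint from everything that determines the result.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes (for example, the contents of an input file)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class RunRecord:
    """Hypothetical audit record for a single secondary-analysis run."""
    pipeline: str
    version: str
    owner: str
    parameters: dict
    input_checksums: dict
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Hash of what determines the result, ignoring who ran it and when."""
        payload = {
            "pipeline": self.pipeline,
            "version": self.version,
            "parameters": self.parameters,
            "input_checksums": self.input_checksums,
        }
        # sort_keys makes the serialisation independent of dict ordering.
        return sha256_of(json.dumps(payload, sort_keys=True).encode())


# Two users running the same analysis on the same inputs (all values illustrative).
inputs = {"sampleA_R1.fastq.gz": sha256_of(b"example read data")}
run_1 = RunRecord("dragen-germline", "x.y.z", "alice", {"coverage": 30}, inputs)
run_2 = RunRecord("dragen-germline", "x.y.z", "bob", {"coverage": 30}, inputs)
print(run_1.fingerprint() == run_2.fingerprint())  # True
```

Matching fingerprints indicate that two runs used the same pipeline version, parameters and inputs, so their outputs should be comparable. A change in any of those produces a different fingerprint, which makes silent drift between analyses easier to spot.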
Deploit brings the power of DRAGEN to your data, in your environment
At Lifebit, we understand the importance of standardisation, reproducibility, auditability and making all stages of analysis FAIR. That is why we prioritise adding features to the Deploit platform that give users access to best-in-class tools for standardising every aspect of their analysis. Deploit makes DRAGEN easily accessible to anyone through the Deploit Marketplace at no extra cost: users only need to cover the standard DRAGEN-AWS pay-as-you-go fees.
The biggest advantages of running DRAGEN through Deploit, instead of natively over AWS, are:
- Getting easy access to and deploying DRAGEN over your own AWS cloud
- Managing all data, environment and DRAGEN in one place
- Getting out-of-the-box versioning, 1-click cloning/reproducing and sharing of any analysis performed with DRAGEN
- Enabling real-time monitoring of the analysis progress, owner, resources it uses, its cost, inputs and outputs
- Reducing the cost of running DRAGEN over AWS
When using Deploit to run DRAGEN, you significantly decrease costs and turnaround times, while at the same time improving the accuracy, trackability and reproducibility of your secondary analyses.