In this two-part blog series, we delve into the challenges of secondary analysis in today’s world of bioinformatics and how DRAGEN and Deploit come together to tackle these issues (read Part 1 here). DRAGEN, an accelerated and improved cloud-native (AWS) implementation of the standard BWA/GATK pipeline, addresses the lengthy compute times and massive data volumes involved in the secondary analysis of human genomes. In this second blog post, we give you a step-by-step walkthrough of how easy it is to run DRAGEN through Deploit on your data, in your own AWS cloud.
Deploit brings the power of DRAGEN to your data, in your environment.
At Lifebit, we understand the importance of standardisation, reproducibility, auditability and making all stages of analysis FAIR. That is why we always prioritise adding features to the Deploit platform that will enable users to access the best-in-class tools to standardise all aspects of their analysis. Deploit makes DRAGEN easily accessible to anyone through the Deploit Marketplace, at no extra cost, as users only need to cover the standard DRAGEN-AWS pay-as-you-go fees.
The advantages of running DRAGEN through Deploit, instead of natively over AWS, are:
- Getting easy access to and deploying DRAGEN over your own AWS cloud
- Managing all data, environment and DRAGEN in ONE place
- Getting out-of-the-box versioning, 1-click cloning/reproducing and sharing of any analysis performed with DRAGEN
- Enabling real-time monitoring of the analysis progress, its owner, its resource utilisation, its cost, inputs and outputs
- Reducing the cost of running DRAGEN over AWS
- Enabling collaboration and better management of data analysis projects that involve DRAGEN alongside other types of analysis
- Bringing DRAGEN to your data, instead of moving your data to DRAGEN
By using Deploit, you significantly decrease costs and turnaround times, while vastly improving the accuracy, versioning/trackability and reproducibility of your secondary analyses.
How to set up DRAGEN in Deploit
Here, we will walk you through the steps we took to run DRAGEN as a Deploit module. In this case, the module runs the current industry-standard variant calling pipeline, which uses BWA-MEM for read mapping and GATK HaplotypeCaller for variant calling in a sample. The pipeline takes FASTQ files as input.
Pipeline cost breakdown
Since the Deploit platform operates using a federated approach over your AWS cloud account, the costs of executing the pipeline come down to two elements:
- EC2 Instance type cost
- Software license cost
Both elements are pay-as-you-go and are billed directly to you through the AWS billing system. For instance, if you run DRAGEN on an f1.16xlarge instance for a single 30x dataset with ~30 minutes of compute, the total cost would be around $18, with the following breakdown:
- $11 EC2 costs for f1.16xlarge
- $7 DRAGEN Software license
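The breakdown above can be reproduced with a small calculation. The following is a minimal sketch, assuming both charges scale linearly with runtime; the function name and the hourly rates are hypothetical, with the rates back-calculated from the ~$11 and ~$7 figures for a ~30-minute run rather than taken from any AWS or DRAGEN price list.

```python
def estimate_run_cost(runtime_hours, ec2_rate_per_hour, license_rate_per_hour):
    """Return (ec2_cost, license_cost, total) in USD for one pay-as-you-go run.

    Assumes both the instance and the software license are charged in
    proportion to the runtime, which is a simplification.
    """
    ec2_cost = runtime_hours * ec2_rate_per_hour
    license_cost = runtime_hours * license_rate_per_hour
    return ec2_cost, license_cost, ec2_cost + license_cost


# Hypothetical hourly rates, back-calculated from the example above
# (~$11 EC2 and ~$7 license for ~30 minutes); not quoted prices.
ec2_cost, license_cost, total = estimate_run_cost(
    runtime_hours=0.5,
    ec2_rate_per_hour=22.0,
    license_rate_per_hour=14.0,
)
print(f"EC2: ${ec2_cost:.2f}, license: ${license_cost:.2f}, total: ${total:.2f}")
# prints "EC2: $11.00, license: $7.00, total: $18.00"
```

To estimate your own runs, replace the rates with the current prices shown in your AWS account and on the DRAGEN listing in the AWS Marketplace.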
Getting started with DRAGEN in Deploit
In order to run the DRAGEN pipeline, you only need to cover the following two requirements:
- Having an AWS cloud account that you can use to set up your account in Deploit
- Subscribing to the DRAGEN software in your AWS Marketplace
1. Setting up access in Deploit
Before running any pipeline, you can follow the general guidelines on how to register and synchronise credentials from an AWS account into the Deploit platform.
To connect an existing cloud to Deploit, you simply add your AWS user credentials, which give Deploit permission to manage cloud resources on your behalf and orchestrate DRAGEN analyses. For more information, and a step-by-step guide on how to correctly set up cloud credentials, refer to this documentation.
2. AWS Marketplace subscription
Before trying out DRAGEN, make sure you have correctly connected your AWS cloud to Deploit. To start using DRAGEN, you have to subscribe to it, which allows programmatic access to deploy DRAGEN within your own cloud resources. The DRAGEN software is accessible to any AWS user through the AWS Marketplace.
For more information on the usage of DRAGEN within AWS cloud resources, refer to the following guide.
After successfully completing the two previous steps, you will be able to run a fully cloud-managed DRAGEN module in Deploit from the Marketplace section. Deploit takes care of managing the following aspects related to the deployment and execution of DRAGEN software:
- Allocation of compute resources within your AWS cloud
- Setup of network & compute resources
- Scheduling of DRAGEN analysis
- Monitoring of analysis progress and costs
- Versioning of the analysis
- 1-click cloning and sharing of the analysis
- Federated data access
- Data management
The integration of DRAGEN and Deploit within your cloud environment makes it possible to automate and run federated data analysis in only a few clicks. This makes it ideal both for data scientists who have no experience deploying such analyses in the cloud and for experts who want to optimise and automate their deployment to save time and money.
Running DRAGEN in Deploit
Deploit provides a Marketplace similar to the AWS Marketplace, accessible to any Deploit user, and includes several DRAGEN pipeline modules. Each DRAGEN module in Deploit comes with real-time monitoring and reports, which provide insights on alignment and variant calling metrics for each sample processed, including interactive charts.
For the germline variant calling module, you will be prompted to provide the basic input options for running an analysis:
- raw FASTQ files containing the read sequences for each sample, and
- the reference genome against which alignment and germline variant calling will be performed.
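Before uploading FASTQ files as input, it can be worth confirming that they are well formed. The sketch below is an illustration only and is not part of Deploit or DRAGEN: the helper names are hypothetical, and it assumes the common four-line FASTQ layout (header starting with `@`, sequence, separator starting with `+`, quality string of the same length as the sequence), optionally gzip-compressed.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def check_fastq(path, max_records=1000):
    """Check up to max_records four-line FASTQ records.

    Returns the number of records checked and raises ValueError on the
    first malformed record. Multi-line FASTQ records are not supported.
    """
    checked = 0
    with open_text(path) as handle:
        while checked < max_records:
            header = handle.readline()
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\r\n")
            separator = handle.readline()
            quality = handle.readline().rstrip("\r\n")
            record_number = checked + 1
            if not header.startswith("@"):
                raise ValueError(f"record {record_number}: header must start with '@'")
            if not separator.startswith("+"):
                raise ValueError(f"record {record_number}: separator must start with '+'")
            if len(sequence) != len(quality):
                raise ValueError(f"record {record_number}: sequence/quality length mismatch")
            checked += 1
    return checked
```

Calling `check_fastq("sample_R1.fastq.gz")` (a hypothetical file name) on each input file catches truncated or corrupted uploads before any compute is paid for.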
Once the analysis has been launched, Deploit will track the execution in real time and provide metrics on cloud consumption and cost estimates according to the resources being used in your AWS account and the DRAGEN software license you subscribed to in the AWS Marketplace.
After an analysis is completed, the sample variant calls in VCF format, the metrics and logs of the DRAGEN analysis execution are stored and directly available in your cloud storage account. Deploit always manages the data transfer within your cloud account, as Deploit runs all analysis in a federated manner, bringing compute to data, never allowing the data to leave your cloud environment or control.
Additionally, once the analysis has finished, you can directly visualise a MultiQC report that summarises metrics of the sample variant calls produced by DRAGEN. The metrics in the report serve as a direct quality control (QC) step, and include variant call quality, distribution of variant types and depth across all samples.
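To give a feel for what a "distribution of variant types" metric means, the sketch below tallies variant types from VCF data lines. It is an illustration under simple assumptions and not how DRAGEN or MultiQC compute their metrics: the function names are hypothetical, and each ALT allele is classified purely by comparing its length with REF, with symbolic alleles, breakends, `*` and `.` counted as "other".

```python
from collections import Counter


def classify_variant(ref, alt):
    """Classify one REF/ALT pair by allele length."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def count_variant_types(vcf_lines):
    """Count variant types over an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\r\n").split("\t")
        ref, alts = fields[3], fields[4]  # REF and ALT columns
        for alt in alts.split(","):
            is_symbolic = alt.startswith("<") or "[" in alt or "]" in alt
            if alt in (".", "*") or is_symbolic:
                counts["other"] += 1
            else:
                counts[classify_variant(ref, alt)] += 1
    return counts


# Made-up example records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr1\t300\t.\tC\tCTT,T\t50\tPASS\t.",
]
counts = count_variant_types(example)
print(dict(counts))
# prints "{'SNV': 2, 'deletion': 1, 'insertion': 1}"
```

Because the function accepts any iterable of lines, it can be pointed at an open file handle for a VCF downloaded from your cloud storage.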
As you can see from this step-by-step tutorial, running DRAGEN on Deploit is straightforward and requires minimal setup on your part. By using Deploit to run DRAGEN, you get end-to-end versioning and real-time monitoring of how your analysis is progressing, in addition to tracking which resources are used and the costs incurred during the run. This is a significant improvement on running DRAGEN natively through your AWS account, as this essential auditing is not available otherwise.
Furthermore, if you are part of a team and need to share results from your DRAGEN run, you can easily do so with the 1-click sharing button on the analysis job page. This effectively removes silos between researchers in the same team, or collaborators from different organisations. Moreover, if the run needs repeating, you can simply clone the analysis by using the 1-click cloning feature on the analysis job page, ensuring true reproducibility between runs.
Lastly, by running DRAGEN directly in your own environment in a federated manner, your data never leaves your cloud environment or your control: instead of moving your data to DRAGEN, you bring DRAGEN to your data.
We would like to know what you think! Please fill out the following form or contact us at email@example.com. We welcome your comments and suggestions!