How to DRAGEN in CloudOS: ultra-rapid & highly accurate secondary NGS analysis at your fingertips

In this two-part blog series, we delve into the challenges of secondary analysis in today’s world of bioinformatics and how DRAGEN and CloudOS come together to tackle these issues (read Part 1 here). DRAGEN, an accelerated and improved cloud-native (AWS) implementation of the standard BWA/GATK pipeline, resolves the issue of lengthy compute times and massive data volumes for the secondary analysis of human genomes. In this second blog post, we give you a step-by-step walkthrough of how easy it is to run DRAGEN through CloudOS on your data, in your own AWS cloud.

CloudOS brings the power of DRAGEN to your data, in your environment.

At Lifebit, we understand the importance of standardisation, reproducibility, auditability and making all stages of analysis FAIR. That is why we always prioritise adding features to the CloudOS platform that will enable users to access the best-in-class tools to standardise all aspects of their analysis. CloudOS makes DRAGEN easily accessible to anyone through the CloudOS Marketplace, at no extra cost, as users only need to cover the standard DRAGEN-AWS pay-as-you-go fees.

The advantages of running DRAGEN through CloudOS, instead of natively over AWS, are:

  1. Getting easy access to and deploying DRAGEN over your own AWS cloud
  2. Managing all data, environment and DRAGEN in ONE place
  3. Getting out-of-the-box versioning, 1-click cloning/reproducing and sharing of any analysis performed with DRAGEN
  4. Enabling real-time monitoring of the analysis progress, its owner, its resources utilisation, its cost, inputs and outputs
  5. Reducing the cost of running DRAGEN over AWS
  6. Enabling collaboration and better management of data analysis projects that involve DRAGEN alongside other types of analysis
  7. Bringing DRAGEN to your data, instead of moving your data to DRAGEN

With CloudOS, you significantly decrease costs and turnaround times while improving the accuracy, versioning, trackability and reproducibility of your secondary analyses.

How to set up DRAGEN in CloudOS

Here, we walk you through the steps we took to run DRAGEN as a CloudOS module. In this case, the module runs the current industry-standard variant calling pipeline, which follows BWA-MEM for read mapping and GATK HaplotypeCaller for variant calling in a sample. The pipeline takes FASTQ files as input.

Pipeline cost breakdown

Since the CloudOS platform operates using a federated approach over your AWS cloud account, the costs of executing the pipeline come from just two elements:

  • EC2 Instance type cost
  • Software license cost

Both elements are pay-as-you-go and are billed directly to you through the AWS billing system. For instance, if you run DRAGEN on an f1.16xlarge instance for a single 30x dataset, with ~30 minutes of compute, the total cost would be around $18, broken down as follows:

  • $11 EC2 costs for f1.16xlarge
  • $7 DRAGEN Software license
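
The breakdown above can be written as a small calculation. This is only a sketch: the `RunCost` class and `project_cost` function are hypothetical names used for illustration, the dollar figures are the ones quoted above, and the linear scaling to many samples is an assumption rather than a published pricing rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCost:
    """Pay-as-you-go cost of one DRAGEN run, in US dollars."""
    ec2_usd: float      # EC2 instance cost for the run
    license_usd: float  # DRAGEN software license cost for the run

    @property
    def total_usd(self) -> float:
        return self.ec2_usd + self.license_usd

    def license_share(self) -> float:
        """Fraction of the total cost that goes to the software license."""
        return self.license_usd / self.total_usd


def project_cost(per_sample: RunCost, n_samples: int) -> float:
    """Total cost for n_samples, ASSUMING cost scales linearly per sample."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return per_sample.total_usd * n_samples


# Figures quoted above for one 30x dataset on f1.16xlarge (~30 minutes).
example = RunCost(ec2_usd=11.0, license_usd=7.0)
print(f"Total per run: ${example.total_usd:.2f}")
print(f"License share: {example.license_share():.0%}")
print(f"100 samples (linear assumption): ${project_cost(example, 100):.2f}")
```

Actual charges depend on the instance pricing and run time in your own AWS account, so treat the numbers as an order-of-magnitude guide.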

Getting started with DRAGEN in CloudOS

To run the DRAGEN pipeline, you only need to meet the following two requirements:

  1. Having an AWS cloud account to set up your account in CloudOS
  2. Subscribing to the DRAGEN software in the AWS Marketplace

1. Setting up access in CloudOS

Before running any pipeline, you can follow the general guidelines on how to register and synchronise credentials from an AWS account into the CloudOS platform.

To connect an existing cloud to CloudOS, you simply add your AWS user credentials, which give CloudOS permission to manage cloud resources on your behalf and orchestrate DRAGEN analyses. For more information, and a step-by-step guide on how to correctly set up cloud credentials, refer to this documentation.

2. AWS Marketplace subscription

Before trying out DRAGEN, make sure you have correctly connected your AWS cloud to CloudOS. To start using DRAGEN, you first have to subscribe to it in the AWS Marketplace; the subscription allows programmatic access to deploy DRAGEN within your own cloud resources. The DRAGEN software is available to any AWS user through the AWS Marketplace.

For more information on the usage of DRAGEN within AWS cloud resources, refer to the following guide.

After successfully completing the two previous steps, you will be able to run a fully cloud-managed DRAGEN module in CloudOS from the Marketplace section. CloudOS takes care of managing the following aspects related to the deployment and execution of DRAGEN software:

  1. Allocation of compute resources within your AWS cloud
  2. Setup of network & compute resources
  3. Scheduling of DRAGEN analysis
  4. Monitoring of analysis progress and costs
  5. Versioning of the analysis
  6. 1-click cloning and sharing of the analysis
  7. Federated data access
  8. Data management

The integration of DRAGEN and CloudOS within your cloud environment makes it possible to automate and run federated data analysis in only a few clicks. This makes it ideal both for data scientists with no experience of deploying such analyses in the cloud and for experts who want to optimise and automate deployment to save time and money.

Running DRAGEN in CloudOS

CloudOS provides a Marketplace similar to the AWS Marketplace, accessible to any CloudOS user, and includes several DRAGEN pipeline modules. Each DRAGEN module in CloudOS comes with real-time monitoring and reports, which provide insights on alignment and variant calling metrics for each sample processed, including interactive charts.

Figure 1: Accessing the CloudOS Marketplace via the ‘Pipelines’ tab on the left-hand side of the screen.
Figure 2: CloudOS marketplace – select the DRAGEN module to proceed.

For the germline variant calling module, you will be prompted to provide the basic input options for running an analysis: 

  • RAW FASTQ files containing the read sequences for each sample, and 
  • the reference genome against which alignment and germline variant calling will be performed.
Figure 3: DRAGEN module – Before running a new job using DRAGEN, users need to subscribe to DRAGEN in the AWS marketplace. Once subscribed, users select the dataset they want to work with by clicking the icon on the right of the empty field. To date, we only support one reference genome – GRCh37.
Figure 4: Select dataset – The user’s selected dataset will appear in the previously empty field. To proceed, click ‘Next’ in the upper right-hand corner of the screen.
Figure 5: Select project – Users can either select an existing project from a drop-down menu to add their DRAGEN analysis results to, or create a new project by clicking on ‘New’. Once the project has been chosen, users can click ‘Run Job’ in the top right-hand corner of the screen to start their DRAGEN analysis.
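
Because the module takes raw FASTQ files as input, it can be worth sanity-checking them before selecting a dataset. The sketch below is not part of CloudOS or DRAGEN: `read_fastq` and `count_reads` are hypothetical helper names, and the code assumes the common four-lines-per-record FASTQ layout with unwrapped sequences.

```python
import gzip
import io
from pathlib import Path
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from an open FASTQ text handle.

    Assumes the common 4-line-per-record layout (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"record header does not start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"missing '+' separator after read {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:].strip(), seq, qual


def count_reads(path: str) -> int:
    """Count records in a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if Path(path).suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in read_fastq(handle))


# Tiny in-memory example with two well-formed records.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\nII\n")
records = list(read_fastq(demo))
print(len(records), "records parsed")
```

For paired-end data, comparing `count_reads` on the two files of a pair is a quick way to catch a truncated upload before any compute is spent.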

Once the analysis has been launched, CloudOS will track, in real-time, the execution and provide metrics on cloud consumption and cost estimates according to the resources being used in your AWS account and the DRAGEN software license you subscribed to in the AWS marketplace.

Figure 6: Once the analysis has been deployed, users will be redirected to this screen, where they will get information about the execution time, costs, and machines used to run the analysis. If desired, users can abort the analysis by clicking on the red icon at the right-hand side of the screen.

After an analysis is completed, the sample variant calls in VCF format, the metrics and logs of the DRAGEN analysis execution are stored and directly available in your cloud storage account. CloudOS always manages the data transfer within your cloud account, as CloudOS runs all analysis in a federated manner, bringing compute to data, never allowing the data to leave your cloud environment or control.

Figure 7: Users can click on the job in the jobs table to be redirected to the ‘Analysis Job Page’ where they can gain an overall view of how their job has run, both in graphical and table formats. For reproducibility purposes, we have included a ‘Clone’ feature, at the upper right-hand side of the screen, which makes it easy to clone all aspects of a job.
Figure 8: By clicking on the ‘Results’ tab, users can access their results folder and download them for further analysis.

Additionally, once the analysis has finished, you can directly visualise a MultiQC report that summarises metrics for the variant calls DRAGEN produced for each sample. The metrics in the report serve as a direct quality control (QC) step, and include variant call quality, distribution of variant types and depth across all samples.

Figure 9: Users can further explore more details of their job by accessing the MultiQC report to view general quality control metrics & summaries for their output VCF file.
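
If you download the output VCF, you can reproduce a rough version of the "distribution of variant types" metric yourself. The following is a minimal sketch, not the method MultiQC or DRAGEN uses: `classify` and `variant_type_counts` are hypothetical names, and alleles are classified purely by REF/ALT length, which is a simplification.

```python
import io
from collections import Counter
from typing import Iterable


def classify(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by allele length (a simplified scheme)."""
    if alt.startswith("<") or alt == "*" or "[" in alt or "]" in alt:
        return "other"  # symbolic, spanning-deletion or breakend alleles
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def variant_type_counts(lines: Iterable[str]) -> Counter:
    """Count variant types over the data lines of a VCF."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]  # REF and ALT columns
        for alt in alts.split(","):
            if alt != ".":  # "." means no alternate allele
                counts[classify(ref, alt)] += 1
    return counts


# Made-up records for illustration only; not real DRAGEN output.
demo_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "1\t200\t.\tAT\tA\t40\tPASS\t.\n"
    "1\t300\t.\tC\tCGG,T\t60\tPASS\t.\n"
)
counts = variant_type_counts(demo_vcf)
for variant_type, n in sorted(counts.items()):
    print(variant_type, n)
```

For a real file, pass an open text handle instead of `demo_vcf`; compressed `.vcf.gz` files can be opened with `gzip.open(path, "rt")`.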

Conclusions

As you can see from this step-by-step tutorial, running DRAGEN on CloudOS is straightforward and requires minimal setup on your part. By using CloudOS to run DRAGEN, you get end-to-end versioning and real-time monitoring of how your analysis is progressing, in addition to tracking which resources are used and the costs incurred during the run. This is a significant improvement on running DRAGEN natively through your AWS account, as this essential auditing is not available otherwise. 

Furthermore, if you are part of a team and need to share results from your DRAGEN run, you can easily do so with the 1-click sharing button on the analysis job page. This effectively removes silos between researchers in the same team, or collaborators from different organisations. Moreover, if the run needs repeating, you can simply clone the analysis using the 1-click cloning feature on the analysis job page, ensuring true reproducibility between runs.

Lastly, by running DRAGEN directly in your own environment in a federated manner, your data’s safety is guaranteed: instead of moving your data to DRAGEN, you bring DRAGEN to your data. 

Try DRAGEN today!


We would like to know what you think! Please fill out the following form or contact us at hello@lifebit.ai. We welcome your comments and suggestions!

Pablo Prieto, PhD
