The beginning of the 21st century has been shaped by the fast-growing amount of biological data produced by sequencing technologies and, in particular, by high-throughput next-generation sequencing (NGS). Genetic variations in human genomes, especially the very common single nucleotide polymorphisms (SNPs), are correlated with many diseases [1], associated with individuality [2] and relevant to many other fields, such as nutrition [3]. In recent years, interest in genomic variations, their effects on phenotype and their impact on disease has been growing. The first step in harnessing the genotype-phenotype connection is the ability, given an individual, to identify the variations in his or her genome.
Variant calling tools identify genetic variations in newly sequenced genomes by comparison to the corresponding reference genome. In practice, the output of NGS technologies (reads) is mapped to the reference genome and variations are identified. In recent years, many variant callers have been released in the quest for accurate and fast results. In December 2017, the Google Brain team joined the effort by releasing an open-source, deep-learning-based variant caller: DeepVariant. DeepVariant outperforms its competitors in accuracy – it even won the accuracy award at the precisionFDA Truth Challenge.
As we started using DeepVariant, we realized that running the binaries provided by Google over our datasets was not efficient enough. Parallelization over multiple files is not handled by default, input files need some preprocessing since not all of them are compatible with DeepVariant, and manually running the binaries, or writing some temporary script on top of them, was not sustainable. Moreover, we needed a way to seamlessly run DeepVariant on the cloud in the easiest way possible. That’s when we decided to write a Nextflow-based pipeline and use it on Deploit to solve our problems.
DeepVariant as a Nextflow-based pipeline lets users run DeepVariant in an easy, fast and reproducible manner, with full control over configurations. It helps the user start the pipeline correctly and allows multiple variant calling processes to run in parallel to maximize efficiency. Every step is bundled and computed in a Docker container. Full support for easily running and scaling DeepVariant analyses on the AWS and Azure clouds is provided by Deploit.
DeepVariant: an overview
DeepVariant performs variant calling through image recognition. Images representing the mapping of the reads to the reference genome are produced from BAM files. The picture below, taken from the official DeepVariant Google blog post, shows examples of such images. These images are then fed to the trained machine learning model to identify the variants.
The make_examples step uses the reference genome and the BAM file to produce the pileup images needed for the prediction. To speed things up, the make_examples step can be parallelized to take advantage of a machine’s multiple cores.
The call_variants step performs the actual variant calling: it uses the trained prediction model to identify genomic variants.
The postprocess_variants step simply converts the output of call_variants into the standard and well-known VCF format.
A more detailed description of the steps can be found here.
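To make the three steps above concrete, here is a rough sketch of what running the binaries by hand looks like. The flag names follow the DeepVariant documentation, but all file paths are placeholders, and the `run` wrapper only echoes the commands instead of executing them, so treat this as an illustration rather than a copy-paste recipe:

```shell
# Sketch of the three DeepVariant steps run by hand. Paths are placeholders
# and 'run' is a dry-run wrapper that echoes instead of executing, so the
# script can be read without DeepVariant installed.
REF=reference.fasta
BAM=sample.bam

run() { echo "+ $*"; }   # replace 'echo' with the real binary calls to execute

# 1) make_examples: turn read pileups around candidate sites into tensor "images"
run make_examples --mode calling --ref "$REF" --reads "$BAM" \
    --examples examples.tfrecord.gz

# 2) call_variants: score every example with the trained deep learning model
run call_variants --examples examples.tfrecord.gz \
    --checkpoint model.ckpt --outfile call_variants_output.tfrecord.gz

# 3) postprocess_variants: convert the raw calls into a standard VCF
run postprocess_variants --ref "$REF" \
    --infile call_variants_output.tfrecord.gz --outfile output.vcf
```

The Nextflow pipeline described below wraps exactly this chain, plus the preprocessing needed to generate the auxiliary input files.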
DeepVariant in Nextflow
As shown in the following picture, the Nextflow version of the DeepVariant workflow contains not only the variant calling steps described above (darker blue) but also some preprocessing steps (light blue).
This allows the whole analysis to run seamlessly and reproducibly, instead of manually running three binaries with all the irreproducibility, inefficiency and frustration that this may cause.
The workflow takes one reference genome and multiple BAM files as input. The variant calling for the several input BAM files is processed completely independently and produces independent VCF result files. The minimum set of input parameters consists only of the version of the reference genome (available on Lifebit) and a folder where your BAM files are stored. When the pipeline is run in this way, the other needed files are created automatically and the latest release of Google’s trained model for whole-genome variant calling is used.
It looks as simple as this:
nextflow run main.nf --hg19 --bam_folder path/to/folder/with/bam/files
Despite the possibility of keeping things this simple, complete control over all the configurations is ensured. Calling the pipeline can be as flexible as this:
nextflow run main.nf --fasta path/to/my/genome.fasta \
    --fai path/to/my/genome.fasta.fai \
    --fastagz path/to/my/genome.fasta.gz \
    --gzfai path/to/my/genome.fasta.gz.fai \
    --gzi path/to/my/genome.fasta.gz.gzi \
    --bam_folder path/to/folder/with/bam/files \
    --getBai true \
    --j 64 \
    --modelFolder path/to/the/folder/with/my/model/files \
    --modelName model.ckpt
It allows users to pass all the input files (compressed and indexed), define how many cores should be used for parallelization and even change the model to be used. In this way, users can choose their level of control.
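For reference, the auxiliary index files passed above can also be produced by hand with samtools and bgzip (the pipeline generates them automatically when they are omitted). This is a hedged, dry-run sketch: the wrapper only echoes the commands, and the exact flags should be double-checked against the samtools/htslib versions you have installed:

```shell
# Generating the index files for the flags above by hand (illustrative only).
# 'run' only echoes the commands; drop the wrapper to actually execute them.
run() { echo "+ $*"; }

run samtools faidx genome.fasta      # -> genome.fasta.fai         (--fai)
run bgzip --index genome.fasta       # -> genome.fasta.gz + .gzi   (--fastagz, --gzi)
run samtools faidx genome.fasta.gz   # -> genome.fasta.gz.fai      (--gzfai)
run samtools index sample.bam        # -> sample.bam.bai           (--getBai)
```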
Finally, the advantage of this approach is that the variant calling of the different BAM files is parallelized internally by Nextflow, taking advantage of all the cores of the machine to reach the results faster.
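Conceptually, the fan-out that Nextflow performs over the BAM folder looks like the following plain-shell sketch, where each BAM file gets its own independent background job (here simulated with echo and placeholder files, since the real scheduling is handled by Nextflow itself):

```shell
# Simulate the per-BAM fan-out: one independent job per input file.
# Placeholder inputs so the sketch is self-contained; echo stands in
# for launching the actual variant calling steps.
mkdir -p bam_folder
touch bam_folder/a.bam bam_folder/b.bam

for bam in bam_folder/*.bam; do
    echo "variant calling job for $bam" &   # each job runs independently
done
wait   # one VCF per BAM is collected once all jobs finish
```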
DeepVariant at its best: Nextflow & Lifebit Deploit
Running pipelines on the cloud can be a struggle and cost you precious time if you are not an experienced cloud user: it begins with knowing how to fire up an instance and pick a suitable one, then how to transfer your data, clone your repositories, install all the dependencies and, finally, remember to terminate the machines when the job is done to stay on budget. Honestly, this is not fun! Which made us wonder in despair: why can’t the cloud be a simple compute tool that anyone can use without being a cloud expert? We should not be losing time figuring out how the cloud works but rather concentrating on the results we are trying to obtain.
We built Deploit to help relieve this pain – it saves time and helps you pick the fastest and most cost-effective configuration to run and scale your analysis on the cloud.
It allows you, in a matter of a couple of clicks, to synchronize your cloud account as well as your GitHub or Bitbucket account. You can then run your pipeline on your data in a scalable manner over the cloud and still get some sleep at night!
Deploit brings the ease of running DeepVariant to a whole new level by providing a clean user interface that guides the user through successfully and efficiently deploying and running variant calling analyses with DeepVariant on the cloud.
Epilogue on runtimes and costs
Using DeepVariant through Deploit not only makes life easier, it actually reduces running times and automatically makes the most of the given resources.
We benchmarked DeepVariant running times over 10 different BAM files from UCSC on 3 different AWS machines with different resources (m4.16xlarge, cr1.8xlarge, c3.4xlarge).
We ran the binaries manually as described in the Google DeepVariant documentation, taking care of preprocessing, data transfer and cloud configuration by hand (labelled “Bash Script” below), and DeepVariant as a Nextflow-based pipeline on Deploit (labelled “Resource optimized with Deploit” below).
Results show that DeepVariant in Nextflow on Deploit outperforms the manual (bash script) version of DeepVariant in both time and cost. When looking at these results, one should also take into account that, as a user, running DeepVariant on Deploit took almost no time: you simply add input parameters and you are done! On the other hand, running it manually takes time and effort: generating all the needed input files (handled automatically in the Nextflow pipeline), manually moving data back and forth between Docker containers and having to trigger the computation multiple times in order to run DeepVariant on multiple BAM files. (Note: the time needed by these steps is NOT included in the benchmark, where only runtimes are shown.) Not only are DeepVariant runtimes much lower with Deploit, the setup time is eliminated as well.
As can be observed in the benchmarking graphs, Deploit noticeably reduces both time and cost. The time decrease is achieved by careful use of auto-scaling and parallelization, while low costs are ensured by Deploit’s resource optimization based on spot instances.
All data and material used for this benchmark can be found in this publicly available S3 bucket: s3://lifebit-deepvariant-benchmark-2018. (The prices used for the benchmarking are based on instance prices as of 16 June 2018.)
1. Shastry BS et al. SNP alleles in human disease and evolution. J Hum Genet. 2002;47(11):561.
2. Bromberg Y et al. Neutral and weakly non-neutral sequence variants may define individuality. Proc Natl Acad Sci U S A. 2013;110(35):14255–14260.
3. Mathers J et al. The Biological Revolution: Understanding the Impact of SNPs on Diet-Cancer Interrelationships. The Journal of Nutrition. 2007;253S–258S.
We are still only beginning and we have big plans for the future! Don’t miss a thing and let us keep you updated on our upcoming posts by signing up to our Newsletter or by visiting us on Twitter, LinkedIn, and Facebook.
We’re also actively looking for great engineers and bioinformaticians who want to help us shape Lifebit – if this is something you’d be interested in, we’d love to hear from you: firstname.lastname@example.org.