How to run any nf-core analysis over the Cloud: an example using the nf-core/rnaseq pipeline

 

nf-core is a community-driven effort that standardises bioinformatics pipelines and makes them simple to run (Ewels et al., 2019). (Check out our previous blog post, which delves into what nf-core is really about.) You can run these pipelines with ease, and rest assured that you are following community best practices.

However, for any given project, you still have to install all the required software (Nextflow & Docker/Singularity), manage all of your data, provide the necessary compute resources & sit through long queue times if you submit to a computing cluster…

What if you don’t have the resources or are tired of waiting? In this blog post, we will show you how to run any of the stable release nf-core pipelines over the Cloud using the Deploit platform. We have used the RNA-seq pipeline as an example because it is the most popular of all the nf-core pipelines. The same steps apply to any of the other nf-core pipelines.

The RNA-seq workflow processes raw FastQ inputs, aligns the reads and generates gene counts before performing extensive quality control on the results. See the output documentation for more details.

How to import a pipeline

Before starting, make sure you have already created your free Deploit account. You can then navigate to the pipelines page on Deploit.


Once on the pipelines page, you can create a new pipeline. To do this, follow the steps below:

  1. Click the green “New” button
  2. Select the GitHub logo to import the RNA-seq pipeline, which is hosted on GitHub
  3. Paste the URL of the repository from GitHub: https://github.com/nf-core/rnaseq
  4. Name the pipeline, e.g. “rnaseq”
  5. Optionally: give the pipeline a description
  6. Finally, click “Next”

 

(Optional) Select a pipeline

This step is optional because finishing the previous step takes you straight to the page for selecting data & parameters for the newly imported pipeline. If this is the case, you don’t need to do anything for this step. 🕺

Your imported pipelines can be found on the pipelines page under the “MY PIPELINES & TOOLS” tab.


Selecting data & parameters

We have provided example data within the S3 bucket s3://lifebit-featured-datasets/pipelines/rnaseq-data. Alternatively, you can select your own input S3 bucket/data, provided you have the correct input files.

To select input data & parameters:

Import the dataset

  1. Click the blue add data button
  2. Click the green plus to add a new dataset
  3. Optional: enter a name for your new dataset, e.g. “rnaseq_test” & hit enter
  4. Click “Add files & folders” & “Import”
  5. Double-click the lifebit-featured S3 bucket & navigate to the folder “lifebit-featured-datasets/pipelines/rnaseq-data”


Add & set the following parameters/data:

For any of the nf-core pipelines, you can see a well-documented list of all available parameters. For the RNA-seq pipeline, we will add the following:

  1. reads – Select the folder “rnaseq_test/rnaseq-data/reads” & add the regex “*” to select all FastQ files within the folder
  2. singleEnd – Set this to indicate that the reads are single-end
  3. fasta – Select the file “rnaseq_test/rnaseq-data/reference/genome.fa”
  4. gtf – Select the file “rnaseq_test/rnaseq-data/reference/genes.gtf”
  5. max_memory – Type “60.GB” to prevent the pipeline from using too much memory
  6. Click “Next”
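
If you are curious how these selections relate to a regular Nextflow run, here is a minimal Python sketch that assembles a roughly equivalent command line. It is an illustration rather than what Deploit does internally. The parameter names and values come from the list above. The S3 sub-paths are assumed from the folder names shown in the dataset browser, and the “docker” profile and the `build_command` helper are our own choices for this example.

```python
import shlex

# Example data location given earlier in this post.
DATA_ROOT = "s3://lifebit-featured-datasets/pipelines/rnaseq-data"

# Parameter names and values mirror the list above. The sub-paths under
# DATA_ROOT are an assumption based on the folder names in the dataset.
params = {
    "reads": f"{DATA_ROOT}/reads/*",
    "singleEnd": True,  # boolean flag, passed without a value
    "fasta": f"{DATA_ROOT}/reference/genome.fa",
    "gtf": f"{DATA_ROOT}/reference/genes.gtf",
    "max_memory": "60.GB",
}


def build_command(pipeline, params, profile="docker"):
    """Return the argument list for a `nextflow run` call (hypothetical helper)."""
    cmd = ["nextflow", "run", pipeline, "-profile", profile]
    for name, value in params.items():
        if value is True:
            # Flags such as singleEnd are switched on by their presence alone.
            cmd.append(f"--{name}")
        elif value is not False and value is not None:
            cmd.extend([f"--{name}", str(value)])
    return cmd


command = build_command("nf-core/rnaseq", params)

# shlex.join quotes the "*" wildcard so the shell passes it to Nextflow intact.
print(shlex.join(command))
```

Running the printed command yourself would require Nextflow, Docker and access to the data, which is exactly the setup that Deploit takes care of for you.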


Running an analysis

You’re almost done! Just three steps remain, after which you will have successfully scheduled and deployed your first job on the Deploit platform!

  1. Select a project
    • This is to group analyses together
    • For example, you can select the existing “Demo” project
  2. Select an instance
    • This is to set the compute resources available for running the analysis
    • For example, you can select the instance “m2.2xlarge”
  3. Finally, click “Run job” 🎉


Monitoring an analysis

After you click “Run job”, the job takes ~5 mins to initialise while the AWS instance is scheduled. Until then, you can navigate to the jobs page dashboard to view all jobs (both completed & running). Once the job has finished initialising, you can click on it to view the Job Analysis page. Here, you can view the resource consumption, results & MultiQC HTML quality control report.

View an example completed job

This tutorial has shown you how to import and run the nf-core/rnaseq pipeline on Deploit. We’re pleased to say that the released & stable nf-core pipelines are already available on the Deploit platform with example data and parameters. This means that they are even easier to run!

Thanks for reading; we hope you enjoyed the blog post. Now that you’ve learned how to run any of the nf-core pipelines over Deploit, be sure to check out all of the nf-core pipelines so that you can go out and…


We would like to know what you think! Please fill out the following form or contact us at hello@lifebit.ai. We welcome your comments and suggestions!

Phil Palmer
