What is the best imputation pipeline?

Comparison of three different imputation pipelines: The good, the bad & the ugly

What is imputation?

Genome imputation enables researchers to use markers that are not directly available after genotyping, and include them in Genome-Wide Association Studies (GWAS). This can be useful for three main applications:

  1. Finding new risk alleles prior to GWAS,
  2. Increasing resolution to identify causal variants, and
  3. Integrating multiple samples and platforms for meta-analysis.

What did we do?

Here, we compared three imputation pipelines in terms of imputation quality (as measured by R2), runtime and cost. The pipelines were Beagle5, the Michigan Imputation Server and Impute2.

Impute2 gave the best quality imputation, Beagle5 was the fastest pipeline and the Michigan Imputation Server was the cheapest. So which pipeline should you choose? We would argue Beagle5. Find out why below.

Which gave the best quality imputation?

As shown by the figure above, at very low R2 values the Michigan Imputation Server and Impute2 have far more SNPs than Beagle5. At an R2 value of 0.2, the three pipelines have approximately the same number of SNPs. At higher R2 values (e.g. 0.9), Impute2 has the most SNPs, at over 20,000, followed by Beagle5 and then the Michigan Imputation Server, which both have around 15,000 SNPs. This matters because these are the SNPs for which we have the highest confidence in the imputation.
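The comparison above amounts to counting, for each pipeline, how many imputed SNPs sit at or above a given R2 threshold. Below is a minimal sketch of that count in Python. The function name `snps_at_or_above` and the example values are hypothetical and are not taken from our scripts or data; reading the per-SNP R2 values out of each tool's output is left out.

```python
from bisect import bisect_left


def snps_at_or_above(r2_values, thresholds):
    """Return {threshold: number of SNPs with R2 >= threshold}."""
    ordered = sorted(r2_values)
    total = len(ordered)
    # bisect_left gives the number of values strictly below the threshold.
    return {t: total - bisect_left(ordered, t) for t in thresholds}


# Made-up R2 values, for illustration only (not the data from this comparison).
example_r2 = [0.05, 0.18, 0.35, 0.62, 0.91, 0.95, 0.99]
counts = snps_at_or_above(example_r2, [0.0, 0.2, 0.9])
print(counts)
```

Running the same function over each pipeline's R2 values with a shared list of thresholds gives one curve per pipeline, which is the kind of plot the figure shows.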

Which was the fastest?

*Runtime & costs are shown for jobs run on an m4.2xlarge instance on Deploit

The fastest pipeline was Beagle5, which took 15mins 6s. The next fastest was the Michigan Imputation Server, which took 17mins. The slowest was the Impute2 pipeline, which took 33mins 4s.
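To put those runtimes on a common scale, the short Python sketch below converts them to seconds and expresses each one relative to the fastest pipeline. The runtimes are the ones quoted above; the variable names are only illustrative.

```python
# Runtimes quoted above, as (minutes, seconds).
runtimes = {
    "Beagle5": (15, 6),
    "Michigan Imputation Server": (17, 0),
    "Impute2": (33, 4),
}

# Convert each runtime to seconds.
seconds = {name: m * 60 + s for name, (m, s) in runtimes.items()}

# Express each runtime as a multiple of the fastest one.
fastest = min(seconds, key=seconds.get)
relative = {name: round(t / seconds[fastest], 2) for name, t in seconds.items()}

for name, ratio in relative.items():
    print(f"{name}: {seconds[name]} s ({ratio}x the fastest)")
```

On these numbers Impute2 took roughly 2.2 times as long as Beagle5, while the Michigan Imputation Server was only about 13% slower.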

For the pipelines run on Deploit, job sharing links can be found here for Beagle5 and Impute2.

Which was the cheapest?

The cheapest pipeline was the Michigan Imputation Server, which is free. Importantly, unlike the other two pipelines, it is not run on Deploit, and it allows only three jobs to be submitted at any one time, which restricts throughput. The next cheapest was the Beagle5 pipeline, which cost approximately $0.11 per 23andMe file. Finally, by far the most expensive was the Impute2 pipeline, which cost $1.06 per 23andMe file, despite using spot instances.
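The per-file costs above can be scaled up to get a rough feel for a larger batch. The Python sketch below assumes cost grows linearly with the number of files, which we did not measure, and uses a hypothetical batch size of 1,000 files. It also ignores the Michigan Imputation Server's three-job submission limit.

```python
# Per-file costs quoted above, in US dollars.
cost_per_file = {
    "Michigan Imputation Server": 0.00,
    "Beagle5": 0.11,
    "Impute2": 1.06,
}

# Hypothetical batch size; linear scaling is an assumption, not a measurement.
n_files = 1000

projected = {name: round(cost * n_files, 2) for name, cost in cost_per_file.items()}

# How many times more expensive Impute2 was than Beagle5 per file.
impute2_vs_beagle5 = round(cost_per_file["Impute2"] / cost_per_file["Beagle5"], 1)

for name, total in projected.items():
    print(f"{name}: ${total:.2f} for {n_files} files")
print(f"Impute2 / Beagle5 cost ratio: {impute2_vs_beagle5}")
```

Per file, Impute2 cost a little under ten times as much as Beagle5.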

Which was the best?

Overall, Impute2 gave the best quality imputation, Beagle5 was the fastest pipeline and the Michigan Imputation Server was the cheapest. If you are unsure which imputation pipeline to use, we would recommend Beagle5, as it gave the best trade-off between these metrics. Beagle5 gave the second-best quality of imputation, was the fastest and was significantly cheaper than Impute2, all while being more flexible than the Michigan Imputation Server.

Disagree with us? Let us know on Twitter. You can read more about the methods below.

You can see the scripts used to calculate this on GitHub and view the raw data. Documentation for each of the tools can be found here for Michigan Imputation Server, Beagle5 and Impute2.

We are still only beginning and we have big plans for the future! Don’t miss a thing and let us keep you updated on our upcoming posts by signing up to our Newsletter or by visiting us on Twitter, LinkedIn, Facebook & Instagram.

We’re also actively looking for great engineers and bioinformaticians who want to help us shape Lifebit – if this is something you’d be interested in, we’d love to hear from you: hello@lifebit.ai.

Phil Palmer
