All humans are about 99.9% genetically identical. Yet the 0.1% difference hidden within our genomes is enough to inform us about our origins. Ancestry analysis has undeniably become the most widely performed genetic analysis in the world. If consumer ancestry testing continues at today’s pace, over 100M individuals will have been tested within the next 24 months.
Let’s cut to the chase and understand why there is so much demand for Ancestry testing and, at the same time, discover some of its applications.
1. Innate human curiosity about our origins
It’s human nature to want to know about our family history and where our ancestors came from. Before DNA sequencing technologies were within consumers’ reach, individuals had to track their origins through genealogy, which involves scouring many public databases and manually piecing data together. And unless you are part of a royal family that has meticulously archived hundreds of years’ worth of private records, tracking down public documents is tricky because consistent and standardised record-taking was implemented relatively recently.
As such, genetic ancestry testing came as a relief to those intrigued by their genealogy: it’s as easy as purchasing a kit, spitting into a tube and delving into your genetic history!
Besides providing an overview of our origins, Ancestry testing has also led to countless family reunions, from long-lost siblings finding each other to adopted individuals discovering their direct family through Ancestry databases.
Interestingly, Ancestry testing is no longer limited to a report in exchange for your spit: companies have been rolling out eye-catching offers, such as the Heritage Travel feature announced by 23andMe and Airbnb, in which 23andMe results now include travel recommendations based on your ancestry.
2. Solving cold cases – Genetic forensics
Ancestry testing has also had a more controversial effect on society, as it has reopened many criminal cases that went cold decades ago – most notoriously, the Golden State Killer was identified with the help of a genetic genealogy database.
3. Drug discovery
The wealth of information gathered from consumer ancestry testing greatly surpasses any current population study efforts undertaken by public consortia. The sheer number of individuals (over 26M) included in direct-to-consumer (DTC) genetic databases significantly increases the power of genealogical algorithms to infer matches between individuals, a phenomenon commonly referred to as the network effect. The power of these massive data networks generated by DTC genetic companies has also extended into groundbreaking research.
Case in point: the partnership announced by 23andMe and GlaxoSmithKline (GSK) is expected to leverage the 6 billion DNA base pairs amassed by the leading DTC company to power drug discovery research.
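To get a feel for the network effect mentioned above, here is a minimal Python sketch. It only counts how many pairs of individuals a database could, in principle, compare; the function name is made up for illustration and says nothing about how any DTC company actually runs its matching.

```python
def pairwise_comparisons(n_individuals: int) -> int:
    """Return the number of distinct pairs that can be formed from n individuals."""
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    # n choose 2: every individual can be compared with every other one once.
    return n_individuals * (n_individuals - 1) // 2


# Database sizes are illustrative; 26M is the figure quoted in the text.
for size in (1_000, 1_000_000, 26_000_000):
    pairs = pairwise_comparisons(size)
    print(f"{size:>12,} individuals -> {pairs:,} possible pairs")
```

The number of candidate pairs grows roughly with the square of the database size, which is why every new customer makes the database more useful for everyone already in it.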
4. Patient stratification and cohort selection for better clinical trial outcomes and research
Besides drug discovery, such treasure troves of genetic information also contribute to improved patient stratification – the data-driven matching of patients to the appropriate clinical trials in early drug development initiatives. However, as a community, we must be wary of Eurocentric biases in genome-wide association studies (GWAS), as not all populations will systematically respond in the same way to clinical biomarkers and novel drugs in clinical trials.
The remaining challenges of performing Ancestry analysis
Ancestry testing has revolutionised the way we approach genealogy and our individual ‘stories’, while at the same time, it has made DNA sequencing and genetics popular with the masses. There are still major challenges, however, that impede the seamless integration of such technology at production scale.
1. Population bias – the lack of inclusive reference populations
The heavy skew of population cohorts towards individuals of European ancestry limits how well findings translate to other populations. To date, about 78% of individuals included in GWAS are of European descent, 10% Asian, 2% African, 1% Hispanic, with all other ethnicities representing less than 1%. Individuals of an underrepresented ethnicity are more likely to receive inaccurate reporting from genetic tests, which can lead to serious negative repercussions.
The community is well aware of this disparity when it comes to population genetic studies, and it’s being addressed with GWAS focusing on underrepresented ethnicities. These collaborative efforts are necessary to ensure that genomic research does not perpetuate historical inequalities or diminish the contribution that genomics could make to humanity as a whole, and they help to bring down racial barriers.
2. Complexity in establishing the right analysis workflow to deliver accurate results
The complex nature of Ancestry workflows represents another stumbling block for DTC companies. In order to implement this kind of workflow, providers need to oversee the processing of raw data to yield ethnicity and matching predictions, while at the same time, delivering stellar performance – who wants to hand out inaccurate ancestry reports? Just last year, it was estimated that 40% of variants identified by DTC genetic testing companies were false positives.
Best-in-class Ancestry workflows should implement the most up-to-date reference populations and statistical algorithms for sample imputation (brush up on imputation methods by reading this blog post).
Currently, this is not the case as most commercial applications are a bit outdated and rely on algorithms from a few years ago (remember, this space moves extremely fast!). To be competitive in this space, it is essential to keep up and implement the latest statistical and scientific methods.
3. Scaling Ancestry analysis over millions of individuals’ data
Due to the scale of human genetic data, the ultimate challenge in delivering state-of-the-art Ancestry testing services is figuring out how to scale analysis to meet customer demand, which can sometimes be in the hundreds, thousands or even millions of individuals.
Of course, this is a good problem to have, as it means that you’re in business! However, an analysis that cannot keep up with demand might hinder the success of your DTC business offering or severely delay your research. It is therefore essential to plan the scalability of the analysis carefully. You need to take into account all of the engineering work that will go into your backend: scaling in the cloud, on HPC or both; data management; monitoring of analysis progress and cost; auditing; and support (both technical and scientific, since it’s essential to keep your Ancestry algorithms up to date, as previously mentioned).
After having gone through the main challenges associated with offering Ancestry testing, I hope that I haven’t scared you too much! But it is always good to know what you will be facing before rolling up your sleeves.
The best piece of advice I can offer on this subject is to avoid reinventing the wheel at all costs! Building everything from scratch will cost you time and money, and your offering will most probably be either equivalent to or worse than what is already available out there.
Lifebit’s high-performance Ancestry pipeline powers DTC companies & research
This is why, at Lifebit, we have developed a new computationally efficient method that infers ancestry by utilising existing information about allele frequencies associated with different human populations and, of course, the most relevant reference genomes to date.
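To make the idea of using population allele frequencies a little more concrete, here is a small Python sketch of a textbook approach: scoring a genotype against each reference population under Hardy-Weinberg proportions and picking the most likely one. This is not Lifebit’s method, which the post does not describe in detail. The population labels, SNP frequencies and function names are all hypothetical, and a real pipeline would estimate mixed ancestry proportions across many thousands of markers rather than assign a single label.

```python
import math

# Hypothetical reference panel: population -> alternate-allele frequency at
# each of four made-up SNPs. The values are illustrative only.
REFERENCE_FREQS = {
    "POP_A": [0.10, 0.80, 0.30, 0.95],
    "POP_B": [0.60, 0.20, 0.70, 0.40],
    "POP_C": [0.90, 0.50, 0.05, 0.10],
}


def genotype_log_likelihood(genotype: int, freq: float) -> float:
    """Log P(genotype | allele frequency) under Hardy-Weinberg proportions.

    genotype is the number of alternate alleles carried (0, 1 or 2).
    """
    p = min(max(freq, 1e-6), 1 - 1e-6)  # clamp to avoid log(0)
    return (
        math.log(math.comb(2, genotype))
        + genotype * math.log(p)
        + (2 - genotype) * math.log(1 - p)
    )


def population_log_likelihoods(genotypes, reference_freqs):
    """Sum per-SNP log-likelihoods for each population (SNPs assumed independent)."""
    return {
        pop: sum(genotype_log_likelihood(g, f) for g, f in zip(genotypes, freqs))
        for pop, freqs in reference_freqs.items()
    }


def posterior_probabilities(log_likelihoods):
    """Turn log-likelihoods into posterior probabilities, assuming a uniform prior."""
    top = max(log_likelihoods.values())
    weights = {pop: math.exp(ll - top) for pop, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {pop: w / total for pop, w in weights.items()}


sample_genotypes = [0, 2, 1, 2]  # alternate-allele counts at the four SNPs
log_liks = population_log_likelihoods(sample_genotypes, REFERENCE_FREQS)
posteriors = posterior_probabilities(log_liks)
best_population = max(posteriors, key=posteriors.get)
print(best_population, {pop: round(p, 4) for pop, p in posteriors.items()})
```

Working in log space keeps the calculation numerically stable once the number of markers grows, which is the main design choice worth carrying over to anything larger than this toy example.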
Lifebit’s Ancestry pipeline, used by a number of DTC and biotech companies, utilises reference population data from 27 different sub-populations. Finding good reference populations with many sub-populations can be very difficult (as established above), which is why we introduced CloudOS’ new Ancestry pipeline. 😉
Our highly accurate Ancestry pipeline extracts haplogroups, aggregates ancestry estimates and uses Python’s Basemap module to create an output map as a PNG file annotated with the countries and reference files. Furthermore, it outputs a pie chart of ancestry populations and the raw data in JSON format for further analysis and visualisation.
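As a rough illustration of the aggregation and JSON steps described above, the sketch below rolls hypothetical sub-population estimates up into broader regions and serialises the result. The labels, field names and report layout are assumptions made for this example and are not the pipeline’s actual output schema; the map and pie chart steps are left out.

```python
import json
from collections import defaultdict

# Hypothetical mapping from sub-population labels to broader regions.
SUBPOP_TO_REGION = {
    "SUBPOP_1": "REGION_X",
    "SUBPOP_2": "REGION_X",
    "SUBPOP_3": "REGION_Y",
}


def aggregate_by_region(subpop_estimates, subpop_to_region):
    """Sum sub-population ancestry fractions into per-region totals."""
    totals = defaultdict(float)
    for subpop, fraction in subpop_estimates.items():
        totals[subpop_to_region[subpop]] += fraction
    return dict(totals)


def build_report(sample_id, subpop_estimates, subpop_to_region):
    """Assemble a JSON-serialisable report for one sample (illustrative layout)."""
    regions = aggregate_by_region(subpop_estimates, subpop_to_region)
    return {
        "sample_id": sample_id,
        "sub_populations": subpop_estimates,
        "regions": {region: round(total, 4) for region, total in regions.items()},
    }


# Made-up ancestry fractions for a single sample.
estimates = {"SUBPOP_1": 0.45, "SUBPOP_2": 0.25, "SUBPOP_3": 0.30}
report = build_report("sample-001", estimates, SUBPOP_TO_REGION)
report_json = json.dumps(report, indent=2, sort_keys=True)
print(report_json)
```

The regional totals are exactly the kind of values a pie chart would be drawn from, and keeping the raw sub-population fractions alongside them lets downstream tools build their own visualisations.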
Besides offering the best-in-class pipeline through the Lifebit CloudOS Marketplace, we also deliver the necessary tools in order to make sure you can sustain your customer base in the most time- and cost-efficient way. For this, we provide:
If you are already using Lifebit’s Ancestry pipeline on CloudOS, we would love to know what you think! If you’re interested in running our Ancestry pipeline, contact our Customer Success team below. They would love to help you out!