Ever wonder how much computer processing goes into relative matching? Probably not, but there’s some fun math behind it, so sit with me a spell.
The more individuals in a database, the more pairwise comparisons must be done to fully integrate new test results. At MyHeritage, each new kit must be compared to more than 1 million other people already in the database. At 23andMe, it’s more than 5 million. And AncestryDNA must compare new test results to more than 7 million existing ones. (I track the database sizes here: https://thednageek.com/dna-tests/)
How many comparisons is that total? Let’s think about this a bit. If there are only two people in a database, one comparison is needed, shown by the grey line between the individual dots in the figure below.
For three people, there are a total of three pairwise comparisons.
For four people, there are six.
For five, there are ten.
This is not a simple linear or even exponential increase. It turns out that a formula to describe how many pairwise combinations exist in a sample uses factorials. Factorials are fun, if for no other reason, because they’re written with an exclamation point, like this: n! Recall that the factorial of the number n is equal to n times all of the integers below it. If n = 4, then n! = 4 × 3 × 2 × 1 = 24.
A single factorial to describe the number of pairwise combinations would be too easy, though. The formula we need actually involves two factorials:
n!
2 (n – 2)!
Applying the formula to our simple examples above gives us:
- When n = 2, there is 2! / 2 (2 – 2)!
= (2 × 1) / (2 × 0!)
= 2 / 2 = 1 comparison* - When n = 3, there are 3! / 2 (3 – 2)!
= (3 × 2 × 1) / (2 × 1!)
= 6 / 2 = 3 comparisons - When n = 4, there are 4! / 2 (4 – 2)!
= (4 × 3 × 2 × 1) / (2 × 2!)
= 24 / 4 = 6 combinations - When n = 5, there are 5! / 2 ( 5 – 2)!
= (5 × 4 × 3 × 2 × 1) / (2 × 3!)
= 120 / 12 = 10 combinations
*Fun fact: the factorial of zero is 1. (Don’t ask.)
Now that we’ve assured ourselves that our formula works with our visual examples, let’s consider what happened when MyHeritage recently overhauled their entire matching algorithm. That means they had to re-run every single one-to-one comparison in their database. At the time, they had about 1 million testers.
That’s <scribble, scribble, scribble>
1,000,000!
2 (1,000,000 – 2)!
= 499,999,500,000 comparisons
O! M! G! (← Actual enthusiasm; not math.)
That’s nearly half a trillion calculations. At one comparison per second, that would take nearly 16,000 years. If —if— their computers can do 100,000 such comparisons per second, it would take “only” 2 months!
Those numbers really put into context the investment that MyHeritage and Family Tree DNA have put into offering free autosomal DNA transfers from other companies.
The numbers are even more staggering for 23andMe and AncestryDNA.
5,000,000!
2 (5,000,000 – 2)!
= 12,499,997,500,000 comparisons
= 12.5 trillion comparisons
7,000,000!
2 (7,000,000 – 2)!
= 24,499,996,500,000 comparisons
= 24.5 trillion comparisons
Let’s hope those two never have to revise their matching algorithms!
Sorry, but the rate of increase is not factorial, or even exponential, it is quadratic, proportional to the square of the number of tests. You have two factorials in your equation, but they mostly cancel out. n! / (n-1)! = n * (n-1), so the number of comparisons between n tests is n * (n-1) / 2. Doubling the number of tests means four times the amount of work.
The general formula is n!/r!(n–r)!, where r is the number compared each time. For pairwise comparisons, r = 2, which gives the formula I used in the post.
Jim just had a small typo in his comment. The first time that the expression “n-1” appeared he meant “n-2”.
In generally though, I agree with the spirit of Jim’s comment because this sentence is a bit misleading.
> It turns out that the formula to describe how many pairwise combinations exist in a sample uses factorials.
The issue with this sentence is the use of the definite article “the”. The formula given in the post is not unique. There are other formulas that also express the number of pairwise comparisons among n things. In particular, there are ones that do not use the factorial function. For example, Jim gave one, namely n * (n-1) / 2.
There is something unique about the number of pairwise comparisons among n things though. Jim called it the “rate of increase”. It is also known as its “asymptotic behavior” [1]. Jim also accurately stated this when he said “it is quadratic, proportional to the square of the number of tests”. Roughly speaking (and to quote Jim one last time), this intuitively means that “Doubling the number of tests means four times the amount of work.”
[1] https://en.wikipedia.org/wiki/Big_O_notation#Big_Omega_notation
Thanks Tyson. I changed the article from “the” to “a”. I prefer to leave in the factorial equation because (a) factorials are fun and (b) either way, the point remains that the companies are doing massive amounts of data processing, often for free.
Regardless of the math (I’ll let you and Jim Foley hash that out), it’s not as bad as it seems. Assuming Ancestry has a decently sized server farm to do the comparisons, it’s far more reasonable to assume that they can do something in the neighborhood of 30 billion comparisons per second (which is probably on the low side, but I don’t know how many “instructions” are required for each comparison, so I went conservative). This gets them done in less than 10 days. That’s also assuming that they actually compare each test to each other test. There are sorting and sampling algorithms that could cut down the number of actual comparisons required considerably. All that said, I would guess that for a full re-compare, they rent additional server capacity and get it all done in a few days.
I honestly don’t know how long each comparison takes. It would be interesting to find out. We do know that AncestryDNA is doing somewhere upward of 70 billion per day, just for new matches.
I really enjoy your blog posts. I have a fairly simple question about DNA (I think it is simple) and wonder if you ever answer questions on your blog?
Ask away!
I have an identical twin. Our ethnicity differs by a few percentage points, but someone has explained that to me, saying that sometimes segments of DNA are unreadable on some tests, and that would skew the data. What I do not understand is why quite a few (at least a dozen or more) of my distant cousin matches on Ancestry do not also match her. I don’t always look at distant cousins because there are so many, but often I share no matches with those that I do check out. How is it possible that they would not also match my twin? Is it also due to unreadable segments n her test of mine? While we are at it, Ancestry shows 3489 centimorgans matching for us over 25 segments. I am guessing that if everything matches that far, they don’t bother to compare the rest? Obviously we share DNA in more than 25 segments. My daughter and I share 59 segments. My twin and I should share all of them.
Sorry– on her test or mine.
Actually, 24,499,996,500,000 = 24.5 trillion, not 24.5 quadrillion. You need 3 more zeros on the end for quadrillion. But I still love your article. I’ve never seen anyone look at the match comparisons this way before.
Thanks for catching that, Louis! I’ve corrected it.