DNA Databases Continue to Grow Rapidly

Well, it’s been a whirlwind couple of weeks in DNA World, what with AncestryDNA reporting that they’d surpassed 6 million testers in their database by November 1, new information from MyHeritage that they have more than 200,000 people in their database, and now 23andMe announcing that they have broken 3 million. We also have an updated estimate for Family Tree DNA, which does not state how large their autosomal database is.

I’ve had to update the Autosomal Testing Growth graph twice already this month, and now I’m doing it a third time!

I’ve also decided to “smooth” the lines a bit. Because of a weirdness in how Excel graphs the data, the lines for some companies have horizontal sections that give the false impression that their databases aren’t growing for long periods of time.

Like so:

Obviously, it doesn’t make sense that 23andMe’s database had a huge jump in size from February to April 2017 after not growing at all for a year prior. That’s just an artifact of the fact that Excel won’t bend to my will.

To correct for the artifact, I’ve guesstimated the database sizes between “official” data points to give a better impression of how quickly the databases are growing. I’ll be using this format on my DNA Tests page and in future graphs.
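
One simple way to produce in-between values like these is straight-line (linear) interpolation between consecutive official data points. The Python sketch below illustrates that assumption only. The post does not say how the guesstimates were actually made, and the function name `interpolate_sizes` and all numbers are hypothetical placeholders, not real database sizes.

```python
from datetime import date


def interpolate_sizes(points, query_dates):
    """Estimate database sizes on query_dates by linear interpolation.

    points: list of (date, size) tuples for the known "official" counts.
            Dates are assumed to be distinct.
    query_dates: dates that fall between the first and last known dates.
    """
    if len(points) < 2:
        raise ValueError("need at least two known data points")
    points = sorted(points)  # order the known points by date
    estimates = []
    for q in query_dates:
        if q < points[0][0] or q > points[-1][0]:
            raise ValueError(f"{q} is outside the range of known data")
        # Find the pair of known points that brackets the query date.
        for (d0, s0), (d1, s1) in zip(points, points[1:]):
            if d0 <= q <= d1:
                fraction = (q - d0).days / (d1 - d0).days
                estimates.append(round(s0 + fraction * (s1 - s0)))
                break
    return estimates


# Hypothetical placeholder values, NOT real company figures.
known = [
    (date(2017, 1, 1), 1_000_000),
    (date(2017, 1, 11), 2_000_000),
]
midpoint = interpolate_sizes(known, [date(2017, 1, 6)])
print(midpoint)
```

A straight line between two announcements assumes steady growth in between, which is unlikely to be exactly true (sales tend to spike around holidays), but it avoids the flat-then-jump artifact described above.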

As always, feel free to use the graph in presentations to genealogy societies and DNA interest groups.


18 thoughts on “DNA Databases Continue to Grow Rapidly”

  1. Great table but I think your estimate for FTDNA is far too low. In the normal numbering system they have approx 800,000 kits (say 600,000 of these are family finder, probably an underestimate). Then there are the B (transfer) kits which are all family finder and there are 260,000 of them. Then N (National Geographic kits), A (African), E (IGenea), H (DNA Heritage), M (Middle Eastern) and U (DNA Worldwide). If you estimate about 140,000 more across these platforms then that makes it:
    600,000 + 260,000 + 140,000 = 1,000,000

    1. Tim Janzen and I both arrived at similar estimates for the Family Finder database using different methods. In the absence of a formal announcement from Family Tree DNA, I’m hesitant to change the estimate in the table. I’ll be thrilled to update it the moment they give us an official count.

  2. The information on DNA database sizes versus time, which has been interesting and helpful, may not be quite so useful in the future if information on the size of the database that is actually available for DNA matching is not also included. This is because Ancestry has recently initiated a major change in procedures (see their Nov. 2, 2017 Blog post entitled “Continued Commitment to Customer Privacy and Control” along with Comments). As the company with the largest and fastest growing DNA database, what Ancestry does can have a large effect on genetic genealogy. Unfortunately, it seems from my personal experience over the last few weeks that Ancestry’s change in procedures has resulted in blocking of access to DNA matching from a majority of their new DNA tested customers.

    What is needed now more than ever are the best methods to estimate the size of the actual DNA databases that are available for matching from the various companies since it is doubtful that those DNA testing companies that partially block access will make known how much of their database they have blocked. The part of a DNA database that is not available for DNA matching is of no value to genetic genealogy, and estimates of total database size that include this part could in the future become of negative value by providing misleading information.

    1. AncestryDNA is not “blocking” access to matching; they are simply giving their customers the same choice (to opt out of matching) that they already have at FTDNA, 23andMe, and MyHeritage.

      You’re correct that AncestryDNA may not distinguish how many people are in their database overall from how many are participating in matching. 23andMe certainly doesn’t, and FTDNA has never announced how large their atDNA database is. That’s why in April I conducted this study to estimate the “effective database sizes” of the different companies.

      I will have to modify my assumptions if I want to repeat the analysis.

      1. I hope you do repeat the analysis with modified assumptions in the future as I have found your blog posts on the subject useful. In addition to the DNA database size, the rate of growth of the database size that is available for matching is also of interest, as your graphs imply. However, if only the rate of growth of the total database size at Ancestry were to be shown in the future, it would theoretically be possible to see an impressive rate of growth while the actual size of the database available for matching did not change much or even became smaller.

        As an FTDNA customer (and an Ancestry customer), I have confirmed that they also have procedures for opting out of DNA matching, but the procedures used by FTDNA are not the same as those used by Ancestry, and the existence of procedures from the four DNA testing companies you named does not mean that these procedures have been optimized for the combined considerations of customer privacy and the important goals of genetic genealogy. DNA analysis of this type is a relatively new activity. The Comments at the end of the above described Nov. 2nd blog at Ancestry are mostly negative. Regardless of the terminology used to describe the new procedures at Ancestry, a key point is that there are significantly fewer new matches because of the new procedures.

        1. Customer privacy always takes precedence over someone else’s genealogy. No one should be expected to participate in matching unless they want to.

          I will be interested to see what happens to the growth trajectory at Ancestry. I’ve been tracking my own matches for a while, so I should be able to see a change if a significant number of their customers opt out. So far, my numbers are still climbing.

        2. I agree that customer privacy is necessary, but customers have never been required to identify themselves or provide personal information. I do not believe an anonymous DNA match is personal information. The issue is more complicated and in my opinion not that well framed by suggesting that it is just about expecting someone to participate in matching. Some of the comments at the Ancestry blog support this.

          I am glad to hear that you have been tracking your matches at Ancestry and I look forward to seeing the results. The rates of opting out for previously existing customers and for new customers are both of interest. We can learn what the new customers have been doing from the change in rate of obtaining new matches, and the new matches are arguably the more important once you have been doing this for a while. My observation for my own matches is that the rate of obtaining new matches has been much lower in recent weeks. Although I do not have quantitative data, I look at my new matches daily or more often, so I think I have a pretty good overall take on it. Granted there could have been a slowdown in anticipation of the Holiday sales, but I think the rate of obtaining new matches is so much lower that the change in procedures is the main thing.

        3. It’s up to the customer to decide whether anonymous DNA matching is personal information. If they don’t want to participate, they shouldn’t have to.

  3. I think that what constitutes anonymous information is something that can be reasonably evaluated and adjustments can be made to improve privacy without taking extreme measures. Of course if the customer disagrees they are free to do what they want, as has always been the case. However, the rapid growth seen at Ancestry prior to the procedural change does not seem to indicate that this had been so big a problem that a major change was needed.

    1. I know of people who have deleted their test results at AncestryDNA because they didn’t want what was revealed to be known. Better to give them the option to opt out.

      1. It is very important that people can remove their results. In terms of designing improved procedures it would be of interest to know specifically and in detail what the concerns of these individuals were, since adjustments to procedures might be possible to accommodate the concerns without making a major change to procedures that may have a negative effect on genetic genealogy. Got to go.

    1. Two independent estimates using different methods (one by me, one by Tim Janzen) came to very similar conclusions about FTDNA’s database, so in the absence of an official count from them, we must assume the estimate is reasonable. If FTDNA’s database is indeed larger than MyHeritage’s, it would be in their best interests to make that fact public.

      1. No offense, but you and Tim Janzen were spectacularly wrong with your My Heritage prediction (200,000-300,000 and it turns out to be 670,000…) Why should the FTDNA prediction be viewed with any confidence?

        A person could easily conclude that these lowball estimates are being made to encourage (or force) My Heritage and FTDNA to make their numbers public.

        1. We were wrong about MyHeritage precisely because the matching algorithm is so bad. Do you think the matching algorithm at FTDNA is equally bad?

          Are you suggesting that Tim Janzen and I are intentionally spreading false information about FTDNA? Tim has been estimating their database size since at least 2012, and I’ve explicitly described how I came to my numbers. If you disagree, feel free to critique specific aspects of our methods or to perform and publish your own study. It’s certainly possible that we underestimated the database, but so far your argument amounts to “You’re wrong because I don’t like your conclusions”, which is not scientifically justifiable.

  4. Everything I’ve seen suggests that there are a fair number of false positive matches on My Heritage. Many blogs and forum users have mentioned this… You are the only person that I’ve seen who claimed that a lot of matches are being missed by My Heritage (and I believe that this claim is based largely on your desire to rationalize your claim that your FTDNA estimate is still correct).

    1. Feel free to read the review I wrote about MyHeritage matching back in July, when I described the problem of false positives in some detail:

      False negatives are harder to gauge, but it would be silly to assume they don’t exist simply because they don’t fit your narrative. If we assume that the number of matches is proportional to the size of the database, either MyHeritage is lying about how many people they’ve tested or their matching algorithm has a lot of false negatives. Given what a stupid move it would be to lie to the genealogy community and how unrefined their matching algorithm currently is, the latter seems much more likely.
