Genealogical Database Growth Slows

The past year has seen a chilling in the genetic genealogy industry:  DNA kit sales are down drastically since April 2018.

I don’t work for any of the testing companies, nor do I have any special insight into their official sales figures.  However, I have been tracking their database sizes for a couple of years now (with data retroactive to 2013), and the decline in growth rate is obvious.  Yes, the databases are still growing, but they’re growing more slowly than before.

Consider the graph below.  The slope (steepness) of each line indicates how fast the database is growing at that point in time.  Notice that the slopes for AncestryDNA (green) and 23andMe (purple) got steeper and steeper until April 2018, after which growth of both databases slowed (the region in the grey box).

Growth at the other databases (with the possible exception of MyHeritage; see below) has also slowed since April 2018, although it’s harder to see from the graph because the scale on the y-axis is set to the larger companies.

How much has growth slowed?

 

Curve Fitting

We can predict how large the databases would be had they continued to grow at the rates prior to April 2018 using curve fitting.  Curve fitting is a mathematical process in which an equation is found that “fits” the real-life data as best as possible.  Once a good equation is found, it can be used to extrapolate the expected values beyond the range of the existing data.

We can gauge how well the equation fits the data using a metric called R² (pronounced R-squared).  For a reasonable fit, the R² value falls between zero and one.  The closer it is to one, the better the equation fits the real data.

I used an online curve fitting tool called MyCurveFit to fit exponential equations to each company’s growth trajectory through April 2018.  In exponential growth, the rate of change increases over time, which is what we see in the graph above prior to April 2018.
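For readers who want to try something similar, here is a minimal Python sketch of exponential curve fitting, R², and extrapolation. It is not what MyCurveFit does internally: as a simplifying assumption, it fits a straight line to the logarithm of the values, which can give slightly different coefficients than a direct nonlinear fit. The data points and the function names (fit_exponential, r_squared) are made up for illustration and are not the figures used in this article.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by ordinary least squares on ln(y)."""
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_ly = sum(log_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys))
    b = sxy / sxx                        # growth rate per unit of x
    a = math.exp(mean_ly - b * mean_x)   # value of the curve at x = 0
    return a, b

def r_squared(ys, predicted):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: years since the first report, database size in millions.
years = [0, 1, 2, 3, 4]
sizes = [0.5, 1.0, 2.1, 3.9, 8.2]

a, b = fit_exponential(years, sizes)
fitted = [a * math.exp(b * x) for x in years]
r2 = r_squared(sizes, fitted)

# Extrapolate one year beyond the last observation.
projected = a * math.exp(b * 5)

print(f"a = {a:.3f}, b = {b:.3f}, R² = {r2:.4f}")
print(f"Projected size at year 5: {projected:.1f} million")
```

The projection is only as trustworthy as the assumption that the earlier trend would have continued unchanged.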

For each database, I plotted the database sizes as currently known on the same graph as the values calculated from the fitted equation.

 

AncestryDNA

AncestryDNA has the largest database of the genealogical testing companies, larger than the others combined.  In May 2019, they announced that their database contained more than 15 million people.  Previously, they’ve announced growth milestones three or four times per year, giving me 19 data points prior to April 2018 and two after that date.

The graph below compares the actual reported values from Ancestry with the values projected by the equation.

The two lines overlap nearly perfectly prior to April 2018.  In fact, the R² value is 0.9970, or almost one.  However, the lines diverge sharply after that date.  Had AncestryDNA’s database continued to grow at the previous rate, the equation projects it would have had more than 21 million people in May 2019 rather than the reported 15 million.

Put another way, from April 2018 to May 2019, the database added 6 million people, when it was predicted to add more than 12 million. That’s a decline in growth of 51%.
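The arithmetic behind a "decline in growth" figure can be sketched in a few lines of Python. The function name growth_decline is hypothetical, and the inputs are the rounded numbers quoted in the text, so the result is approximate: with exactly 12 million projected the decline comes to 50%, and the 51% quoted above reflects a projection of somewhat more than 12 million.

```python
def growth_decline(actual_added, projected_added):
    """Fractional shortfall of actual growth relative to projected growth."""
    if projected_added <= 0:
        raise ValueError("projected_added must be positive")
    return 1 - actual_added / projected_added

# AncestryDNA, April 2018 to May 2019, rounded figures in millions of people.
ancestry_decline = growth_decline(6, 12)
print(f"AncestryDNA decline in growth: about {ancestry_decline:.0%}")
```

The same calculation applies to each of the other databases, using the number of people actually added and the number the fitted equation projected.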

 

23andMe

23andMe is the second largest genealogical database, with more than 10 million people as of April 2019.  The company reports their database size once or twice a year, giving 14 data points prior to April 2018 and one after.

The graph below compares the actual reported values from 23andMe with the values calculated from the equation.

The two lines overlap well prior to April 2018, with an R² value of 0.9612.  As with AncestryDNA, the lines then follow different trajectories.  Had 23andMe’s database continued to grow according to the equation, it would have had nearly 14 million people rather than the 10 million reported in April 2019.

From February 2018 to April 2019, the database added 5 million people.  It was projected to add nearly 9 million, a decline in growth of 43%.

 

FamilyTreeDNA

FamilyTreeDNA is the smallest of the databases discussed here, and they have never officially announced how many autosomal DNA testers they have.  The values used here were estimated by Tim Janzen, a long-time customer of the company, and published on the ISOGG wiki. There were 20 data points prior to April 2018 and three after.

The graph below compares Tim Janzen’s estimates for FamilyTreeDNA’s autosomal database with the values projected by the equation.

The R² value prior to April 2018 is 0.9901 and, again, we see a decline in growth after that point.  Had FamilyTreeDNA’s autosomal database continued to grow according to the equation, it would have had about 1.5 million people rather than the 1 million estimated in February 2019.

Assuming the estimated numbers are correct, FamilyTreeDNA added 200,000 people from March 2018 to February 2019, when it was projected to add 700,000.  In other words, growth declined 71%.

 

GEDmatch

The owners of GEDmatch have kindly reported their database size to me personally at intervals since January 2016, giving seven data points prior to April 2018.  Either directly from GEDmatch or from media reports, I had five data points after April 2018.

The graph below compares the actual values with those projected by the best-fit equation for GEDmatch.

The two lines are almost identical before April 2018, with an R² value of 0.9977.  After that point, the GEDmatch database initially grows faster than expected, then declines below the values predicted by the equation.  Had GEDmatch continued to grow as projected, it would have had more than 1.4 million people in May 2019 rather than 1.2 million.

GEDmatch added 387,000 people from February 2018 to May 2019, when it was projected to add nearly 650,000.  Growth declined 40%.

 

MyHeritage

MyHeritage is the most recent entrant to the DNA testing market discussed here.  As a result, there were only four data points prior to April 2018, not enough to fit a reliable curve. (The projected database size based on those four points was 79 million, which is simply not credible.)  For MyHeritage, and only for MyHeritage, I therefore included one data point from May 2018.

The graph below compares the actual growth trajectory for MyHeritage with that projected based on those five points.

The R² value prior to May 2018 is 0.9891.  Like the other databases, growth was slower than expected after that point.  Had MyHeritage’s database continued to grow according to the equation, it would have had nearly 3.8 million people rather than the 3 million reported in May 2019.

Between May 2018 and May 2019, MyHeritage added 1.6 million people. If the projections are correct, it was expected to add nearly 2.4 million, a decline in growth of 32%.

 

The Obvious Question

The pattern is clear:  something happened early in 2018 to cause database growth to slow across the board, from 32% at MyHeritage to as much as 71% at FamilyTreeDNA.  The question is:  What?  What caused the decline?

One possibility is market saturation.  Perhaps genetic genealogy is approaching its natural consumption level, where those who are inclined to purchase a test already have.  The counter to that argument is that 23andMe is not a genealogy company; it’s a biomedical one.  Theirs is a different market, yet the company’s growth declined along with those of the genealogy companies.

What’s more, one might reasonably argue that the market in the United States is approaching saturation, but relatively few people in Europe have tested, meaning there’s still ample room for growth there.  Yet MyHeritage, which is based in Israel and sells most of their DNA kits in Europe, also experienced a decline in growth.

It’s also possible that my numbers for past growth are wrong because the testing companies don’t report the exact database size on a specific date.  Rather, they usually report that their database is “larger than X”, and the precise date it hit that threshold is not publicly known.  However, the numbers I have for GEDmatch are largely date-specific, and GEDmatch’s growth slowed, as well.

The elephant in the room, of course, is the use of some genealogy databases, specifically GEDmatch and FamilyTreeDNA, by law enforcement.  That fact first became public knowledge on April 25, 2018, when the Golden State Killer story broke.  And April 2018 is precisely when we see a decline in growth across the board.

Public concern over law enforcement using genetic databases seems the most likely explanation for the cooling of the market.  In fact, Anne Wojcicki, the CEO of 23andMe, publicly speculated that law enforcement and privacy concerns were indeed behind their decline in growth.  And she actually does have inside information on their sales figures!

This explanation fits a few observations well.  First, the company showing the smallest decline, MyHeritage (32%), is also the company whose market is furthest removed from the American judicial system and thus either unaware of or unthreatened by US law enforcement using genealogy databases.

Second, the company most welcoming of law enforcement, FamilyTreeDNA, showed the largest decline in growth (71%).

Third, the graph for GEDmatch shows an increase in slope immediately after the Golden State Killer arrest—a change widely attributed to the positive press that GEDmatch received at the time—followed by a sustained decline.  If the GSK case could have caused the short-term increase, fallout from that case could also have caused the long-term decline.

Whether public concerns over law enforcement truly are the explanation for the market slow-down is still an open question.  The community should be aware that a decline in growth is occurring and discuss rationally and maturely the possible reasons behind it.

 

42 thoughts on “Genealogical Database Growth Slows”

  1. I have not fully examined this issue, but another possibility is that with the prices of full genome sequencing coming down, people are waiting to test everything in one shebang rather than piecemeal (i.e. mtDNA then atDNA then yDNA then individual SNPs then Big Y, etc…..). If I were these companies I would be sending the prices for the piecemeal tests (such as Y111, mtDNA, and base level yDNA haplogroup panels) way down to get more people in the door and profit from the volume rather than the price.

    1. Hardcore testers who are interested in WGS with mtDNA and yDNA are a tiny fraction of the market. More than 25 million people have done an atDNA test, while FTDNA reports only about 716,000 yDNA records and 336,000 mtDNA records in their database.

      1. I did the Y-DNA 37 with FTDNA, and have been rather disappointed in the extremely limited matching capabilities. Not only that, I do not think FTDNA is reporting accurately on the shrinkage in their database from people who have voided their kits out of “mistrust”. I have noticed what appears to be shrinkage. FTDNA’s lack of PR strategy left them looking greedy and corrupt, willing to take blatant advantage of their customer base. This alone could have destroyed the hopes for a usable Y-DNA database in the future.

        I have admired GEDmatch for their honesty.
        RR.

  2. I feel like there’s more nervousness about “privacy” now – without any real defined idea of what that means. It doesn’t seem to be directly related to law enforcement uses, in that the folks I’ve spoken to are very much in favor of using DNA to catch criminals, but it seems like the GSK case opened their eyes to the fact that these companies have their DNA data. What they are concerned the companies will do… I’m not sure. They’re not sure either, because for the most part DNA mystifies them, so trying to think what the faceless corporations might do with theirs is totally beyond them. But it could be bad!

    1. I’ve spoken with many people who are expressly concerned about law enforcement and have refused to test as a result. I’m sure the bigger companies have done market analysis, but we’re not privy to that data.

  3. One factor might be the date when a company opened up its database to receive raw data from one of the other companies (or specifically when they advertised the ability to do so). Another possibility could be discounted Christmas season sales in December 2017.
    The Ancestry line could be affected by the second of these — and the curve looks as if it’d be smooth if the April 2018 datum were removed.

    1. Neither Ancestry nor 23andMe accepts raw data transfers, yet all of the databases showed similar rate decreases at about the same time. It’s hard to come up with an explanation other than concerns over law enforcement.

  4. In my experience, yDNA can be an extremely interesting research path, just as enticing as atDNA, and more people, both men and women (women, using their male relatives as subjects for their own research) should be lured into this area with lower pricing. For example, at FTDNA the atDNA test is $79, the mtDNA is $89, but the intro yDNA test (Y37) is $170 (Y37 is really not adequate however and really the Y111 is what you need, which is $359!). Meanwhile, FTDNA offers a few individual SNPs for $40 while a company such as YSEQ offers a ton of SNPs for $18 apiece. There is more and more competition from outside the big three and maybe more people are going there?

  5. I have noticed the same thing, specifically on AncestryDNA. I have been tracking the number of new 4th cousin or closer matches since Dec 2017. That number dropped sharply between May and June 2018. Only one month since, Feb 2019, has exceeded any of the months between Dec 2017 and May 2018.

  6. I don’t really understand this article. You say that the sale of DNA kits is down and that the growth of databases has slowed. But the graphs do not show this at all. Growth is much stronger than before 2017 and seems very steady to me. The projected increase in the rate of growth that was based on an upturn in 2017 was unrealistic.

    1. Yes, growth has slowed, apparently across the board. The databases are still growing, they’re just growing more slowly. It’s analogous to driving 75 miles per hour on a major highway then slowing to 35 mph on a back road. You’re still moving, but you’re not moving nearly as fast.

  7. Certainly explains the apparent proliferation of discount kit sales in the last year or so. Almost every event or occasion being highlighted with a kit sale by Ancestry or MyHeritage.

    I have also noticed some “cousins” on Ancestry and GEDmatch have removed their data, so some of the change in slope may be attributed to this. Growth may be nearly the same (as I have not noticed a drop off in new connections) but more people are deleting their kits. It would be interesting to know what proportion of the change in slope is due to fewer testers versus deletions from databases. Most people seem to test for ethnicity and then don’t want to be involved in gengen discussions.

    1. I’m not sure that sales are more frequent than before. I’ve been tracking prices since October 2017, and the daily average is actually higher in 2019 than in 2018 or 2017 for most companies. (That could be because I only have a full year of data for 2018, though.)

      I agree that it would be interesting to know how many kits have been deleted.

    1. You’re right that sales are seasonally cyclical. Most kits are sold before Christmas and show up in the databases from January to April. The graphs, though, extend back 3–6 years (depending on the company), so seasonality is already factored in.

  8. Possible reason: consumer debt in the US now higher than it was just before the 2008 recession. Wages are not rising but the cost of food and housing is. People are beginning to realize they don’t have the discretionary income for a hobby like this, especially with subscription costs to some of the sites being quite high.

    1. Oh, that’s an interesting hypothesis! If you’re right, then consumer spending in the US (the lion’s share of the DNA testing markets) should have started tapering off around April 2018 and continued through today. I found a graph on the website of the Federal Reserve Bank of St Louis (https://fred.stlouisfed.org/series/PCEC96#0) that shows a dip from Dec 2017 to Feb 2018, when kit sales at Ancestry were robust, followed by solid growth in spending from Feb–Nov 2018, when kit sales across the board were down.

      I’m not an economist. Am I interpreting the FRED graph correctly?

      1. I do have an undergraduate degree in econ, but I don’t think the FRED graph has enough info to know (and that degree was back when George Washington counted on an abacus!). It’s possible that debt is still growing as people pay for essentials and “keeping up with the Joneses” stuff, and DNA kits are just one extra too many. Very hard to know. The reality with most economic analyses is that the truth is much more complex than can be shown in a chart.

  9. Thank you for taking your time to put this together and sharing. I have been curious as to when there would be a slow-down.

  10. Or, as Vince Gill said in his country song decades ago, people are just moving on to the “next big thing” whatever that is.

    There is not only a dollar commitment involved in genealogy, but also a bigger time commitment; and maybe “word of mouth” advertising has slowed down……

    1. Most people who test aren’t genealogists. They’re either interested in their ethnicity estimates (Ancestry, etc.) or in health reports (23andMe), which are separate markets. That growth in both markets in both the US and Europe (MyHeritage) would slow at the exact same time the GSK story broke seems more than a coincidence.

  11. I downloaded my match list at FTDNA the morning after BuzzFeed broke the story on January 31st about the company cooperating with the FBI:

    https://www.buzzfeednews.com/article/salvadorhernandez/family-tree-dna-fbi-investigative-genealogy-privacy

    Nineteen out of the original 4374 are no longer visible (0.4%). I have no way of knowing if they just opted out of matches or removed their kits entirely. There are a handful that seem to pop in and out at various times.

    I did a similar exercise for GEDmatch starting in May 2018. Over the course of the year, about 2% were no longer visible. I could tell by checking their kit numbers that roughly half set their kits to Research and half deleted their kits.

  12. Likewise, my new matches at GEDmatch have slowed considerably.

    But, I still have 3000 matches, all over 15.2 cMs, largest segment, with much Colonial American ancestry.

  13. Instead of looking at this from a differential calculus POV (i.e., focusing on the rate of change in the curves), what if we look at the area under ALL the curves for ALL the companies? We are still seeing lots of growth in the testing population, and that growth is now happening at more companies (MyHeritage being the most significant recent development). Without a well-designed study, it’s impossible to know if the recent decreases in what are nonetheless still extremely positive rates of change in the tested population are due to law enforcement concerns, other privacy concerns, or market saturation. And it’s extremely likely that there are multiple factors involved. As an amateur genetic genealogist, I am more interested in the “quality” of my matches (i.e., their knowledge about their ancestry and level of engagement in genealogy) than in amassing ever greater numbers of clueless 6-7 cM matches.

    1. I agree that we don’t have the information we need to say with certainty what caused the decline, but it’s sharp and worrisome, and the circumstantial evidence that concerns over law enforcement played a part is striking. That Ancestry, 23andMe, and Helix partnered recently to lobby Congress on best practices in the industry also suggests that they’re concerned about the impact.
      https://thehill.com/regulation/lobbying/450124-dna-testing-companies-launch-new-privacy-coalition

    2. I have an interest in some 6-7 cMs matches only because I am sometimes 2 generations older than my match; and I find quite a few tree matches in lower cMs matches.

      My logic is that if the parent of that 6 cMs match had tested, depending on recombination, the parent and I could have potentially had a 12 cMs match. And, if the grandparent of my match had tested, then that would be the level playing field, generationally, with the possibility of even more shared cMs. Just a “what-if” to factor in to the equation…

      1. Unfortunately, you can’t assume that a 6 cM match to you was 12 cM in their parent nor 24 cM in their grandparent. A 24-cM segment is more likely to be passed down in its entirety (or not at all) than to be passed on in part.

  14. Hi
    If it is concern over law enforcement then that will dissipate with time and growth will return. If it is saturation then growth will never return to what it was. It could be a combination of both of course. In my own case when I started out I had many questions on my maternal side. Those questions have been largely answered because sufficiently close matches eventually showed up. I don’t need to do more autosomal tests at this time. On my paternal side the issue remains as to my gg grandfather Hicks lineage. Autosomal data can’t be as helpful there since the matches I would be using are 4th cousin and up. I did just purchase an upgrade from 67 to 111 markers on the Y. I doubt I will ever do the big Y because I have read that will not be more refined than 111, it’s just more data to consider. That is stated on the FTDNA site.

    Dennis Hicks

    1. I am dubious that concerns over LE will dissipate on their own. The new lobbying coalition among Ancestry, 23andMe, and Helix suggests that they agree.

  15. Super useful information Dr. Larkin!
    When I look at the Ancestry curve, I see one deviation in Jan–Jun ’18. Can you re-run the curve fit and share the projection(s) for 5 and 10 years out? Also, can you add dots at the sample dates?

    This doesn’t really look like an inflection or decreasing 2nd derivative to me. It kinda looks like Ancestry got a short-term bump. I remember seeing gobs of advertisements. I’m thinking (guessing) that Ancestry saw their curve was doing just fine all by itself … and possibly outpacing their compute capacity. For every new kit, they have to run 15 million match cycles, each comparing 1 million SNPs between each pair. Not to mention, their web servers have to grow linearly with the number of accounts, and they have to run the Hints, Common Ancestors, and Shared Matches for each new account’s whole set of DNA matches.
    Finding out how they do that, would be really cool.

    Thanks

    1. I disagree that Ancestry had a short-term bump. Their growth fits an exponential curve almost perfectly through the first quarter of 2018, when the rate declines sharply. If it were just them, your arguments about advertising spend or computing capacity might be compelling, but we see the same trend across the board.

      1. I am accustomed to getting 50-100 or more matches per day. In the last 7 days, I have received only a small handful of matches, with some days without any matches at all.

        Just a small sampling, I realize.

        1. Others have reported not getting new matches at Ancestry in the past few days, as well. I suspect that has more to do with programming updates they’re making in the background than with the slow-down in sales, which wouldn’t give an abrupt change like you’re seeing. We’ve seen similar periods without new matches before when they were updating their system.

      2. Good point. : ) The Ancestry bump is possibly not isolated.
        But what you’re showing is that the rate of change is decreasing. So, if you plot the rate-of-change derivative of those curves, we could see it all sort of normalized.

        And it would be very, very interesting if you could re-generate the fitting to show where the new trend takes us in 5-10 years. I’m just eyeballing it, and it looks like
        Aug 16 – Aug 17: 3M
        Aug 17 – Aug 18: 4M
        Aug 18 – Jul 19: 5M
        Or, in general, sum N.
        So, if that is remotely close, we can expect n(n+1)/2.
        If the USA adult population is 240M, then 21.6 years will sum to ~240M. Or, given that 5 of those 21.6 years are already past, in another 16.6 years at this rate we will reach full saturation (of today’s living adults).
        It’s kind of unlikely 100% of adults will do this. About half of my relatives have. I’d kind of expect an S-curve in about the mid-point between saturation (50%?) and the start.

        What do you think?

        1. I agree that the market will saturate for the genealogy companies, and it’s possible that’s what we’re seeing. After all, there are only so many genealogists, and we don’t need to test every single relative to achieve our research goals. The reason I’m not inclined to accept saturation as an explanation for the widespread slow-down is that 23andMe’s market isn’t genealogy, it’s health. The cap on their market is every person alive (except maybe identical twins).
