Estimating the Sizes of the Genealogical atDNA Databases

The past few months have brought some exciting news from two of the main autosomal DNA (atDNA) companies: AncestryDNA exceeded 3 million testers in January 2017, and 23andMe broke 2 million in April. The larger the database, the more likely we are to find DNA relatives with information that might enhance our family trees, so this information is an important consideration in deciding where to spend our testing dollars. However, Family Tree DNA (FTDNA) does not publicize how large their atDNA database is, and the newest player in the market, MyHeritage, hasn’t released figures either.  Another wrinkle is that not everyone at 23andMe participates in DNA Relatives, the matching tool there. Thus, there is a difference between the total size of a database and what I’ll call the effective size, that is, the number of people available for cousin matching.

When the same person has tested at multiple companies, we can get a sense of the relative sizes of the databases by comparing how many matches they have at each site. If they have ten times as many matches at Company A as at Company B, it’s not unreasonable to assume that Company A’s database is roughly ten times larger. Of course, there are other considerations. For example, if the person lives in a country where Company A does not sell tests but Company B does, their results may not represent the true sizes of the databases. If dozens or hundreds of people contribute data, though, individual quirks like these should largely average out, and the estimates should be reasonably accurate.

To that end, I collected self-reported information from genetic genealogists using an online survey form. The data were used to estimate the effective sizes of the databases at FTDNA, 23andMe, and MyHeritage relative to that at AncestryDNA.

 

Methods

From 10–17 April, 2017, I offered a Google Forms survey to members of various genetic genealogy groups on Facebook. The final version of the questionnaire is shown below. (The survey is currently closed to new submissions.)

 

To make the survey easy to answer, I asked respondents to report how many pages of AncestryDNA matches they had rather than the total number of matches.  AncestryDNA reports 50 matches per page. I estimated the number of AncestryDNA matches as (p-1) x 50 + 25, where p is the number of pages of matches. For example, if someone had 101 pages of matches, I knew that they had at least 100 x 50 = 5000 matches, and the last page could have between 1 and 50 additional relatives, which averages to about 25. The formula would estimate (101 – 1) x 50 + 25 = 5025 matches.
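The page-count formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the text; the function name and the constant are my own labels, not part of any AncestryDNA tool.

```python
MATCHES_PER_PAGE = 50  # page size stated in the text


def estimate_ancestry_matches(pages: int) -> int:
    """Estimate total matches from the number of pages of matches.

    Every page except the last is assumed to be full; the last page
    holds between 1 and 50 matches, so it is counted as 25.
    """
    if pages < 1:
        raise ValueError("pages must be at least 1")
    return (pages - 1) * MATCHES_PER_PAGE + MATCHES_PER_PAGE // 2


print(estimate_ancestry_matches(101))  # 5025, as in the worked example
```

The estimate can be off by at most 25 matches in either direction, which is negligible for anyone with more than a few pages of matches.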

For each of the respondents who had tested at AncestryDNA and at least one other company, I calculated a point estimate of the total effective database size of that other company as follows.  I assumed that AncestryDNA’s database has 3 million participants and calculated a simple proportion as (number of matches at Company X) / (number of matches at AncestryDNA) x 3,000,000.  For example, if someone had 10,000 matches at AncestryDNA and 1,000 at 23andMe, the point estimate of the effective database size at 23andMe would be 1000 / 10000 x 3000000 = 300,000.

I then averaged all of the point estimates for that particular database to determine the effective database size.
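As a minimal sketch of the two steps just described (a point estimate per respondent, then a simple average), assuming the 3 million benchmark for AncestryDNA. The function names and the sample numbers are hypothetical and are not survey data.

```python
from statistics import mean

ANCESTRY_DB_SIZE = 3_000_000  # benchmark assumed in the text


def point_estimate(matches_x, matches_ancestry, benchmark=ANCESTRY_DB_SIZE):
    """Effective size of Company X implied by one respondent's match counts."""
    if matches_ancestry <= 0:
        raise ValueError("respondent needs at least one AncestryDNA match")
    return matches_x * benchmark / matches_ancestry


def effective_size(pairs, benchmark=ANCESTRY_DB_SIZE):
    """Average the point estimates over (matches_x, matches_ancestry) pairs."""
    return mean([point_estimate(x, a, benchmark) for x, a in pairs])


# The worked example from the text: 1,000 matches at 23andMe, 10,000 at AncestryDNA.
print(point_estimate(1000, 10000))  # 300000.0

# Two made-up respondents, purely to show the averaging step.
print(effective_size([(1000, 10000), (500, 4000)]))
```

Note that this is an unweighted mean, so a respondent with very few AncestryDNA matches influences the result as much as one with many.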

 

Results

As of 17 April, 243 people had responded to the survey, providing 216 data points for AncestryDNA, 126 for 23andMe, and 100 for MyHeritage. Of the 235 reports for FTDNA tests, 80 were originally run by FTDNA’s labs, 44 were transfers of either Ancestry v1 or 23andMe v3 results (v1/v3), 17 were transfers of either Ancestry v2 or 23andMe v4 results, and the remainder did not specify. Only the 80 FTDNA-run results and the 44 v1/v3 transfers were used in the following analyses, for a total of 124 FTDNA data points. Not every respondent had tested at all of the companies, and 27 either had not tested at AncestryDNA or incorrectly reported their matches there, so they could not be used to calculate point estimates.

A few reported numbers were omitted from calculations because they appeared to be errors. For example, one respondent reported 1424 matches at MyHeritage, when the next highest number was 667. That respondent did not report a value for 23andMe at all, and I assumed that 1424 had been entered in the wrong field. Another respondent filled out the form six times; only one was counted in the calculations. A third reported 366 pages of matches at AncestryDNA, 2087 matches at FTDNA, and zero matches at 23andMe. The latter value was omitted.

Most respondents (164) reported being from the USA. An additional 21 were from the British Isles; 15 from Canada; 10 from Australia; three each from the Netherlands, Sweden, and New Zealand; and one each from France, Germany, Guatemala, and Saudi Arabia. The remainder did not report a country of origin.

This table summarizes the numbers of self-reported matches at each of the main testing companies and the estimated effective database sizes.

 

 

 

Discussion

My initial plan was to estimate the absolute database sizes of the main atDNA companies relative to that of a known benchmark (AncestryDNA). However, that goal proved illusory because I could think of no way to adjust for the different algorithms and policies of each company. (See below.) Instead, I decided to consider the effective database size as a gauge of the potential to find matches. After all, that’s what a genealogist cares about. If a hypothetical database will only match you to people who share enough DNA to be 2nd cousins or better, you won’t get very many matches, no matter how many people have tested there.

 

The Effective Database Sizes of the Main DNA Testing Companies

With the assumption that AncestryDNA’s effective database holds 3 million people, the analysis indicated that FTDNA’s has about 440,000, 23andMe’s about 397,000, and MyHeritage’s about 12,000. AncestryDNA’s database is undoubtedly larger now (the 3 million figure was announced in January), so all of the other databases are proportionally larger, as well.  The calculations can be redone easily when AncestryDNA reports their current database size.
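Because every point estimate is directly proportional to the assumed AncestryDNA size, redoing the calculation for a new benchmark amounts to multiplying each estimate by the ratio of the new figure to the old one. The sketch below illustrates that scaling. The function name is mine, and the 4 million value is used only as an example of a new benchmark.

```python
def rescale(estimate, old_benchmark, new_benchmark):
    """Rescale an effective-size estimate to a different AncestryDNA benchmark."""
    if old_benchmark <= 0:
        raise ValueError("old_benchmark must be positive")
    return estimate * new_benchmark / old_benchmark


# Estimates quoted in the text, all relative to a 3 million benchmark.
estimates_at_3m = {"FTDNA": 440_000, "23andMe": 397_000, "MyHeritage": 12_000}

for company, size in estimates_at_3m.items():
    # Illustrative only: what the same survey data would imply at 4 million.
    print(company, round(rescale(size, 3_000_000, 4_000_000)))
```

The relative ranking of the companies does not change under rescaling; only the absolute numbers do.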

23andMe has an arbitrary cap of 2000 matches in their DNA Relatives feature, although it does not seem to have affected the results of this analysis, as only one respondent had reached the cap.  That person reported 2312 matches and was able to exceed the cap by making contact with the matches low on their list.

 

Effective Database Size Versus Actual Database Size

A number of factors will cause the actual database sizes to differ from the effective database sizes. First, anything that lowers the proportion of matches at AncestryDNA relative to the total database size there will artificially inflate the estimates of effective database size at the other companies.  Two such factors are AncestryDNA’s use of statistical phasing to reduce the number of false positive segments and their Timber tool to remove excess IBD (also known as pile-ups). Timber was reported to have removed 75–80% of matches when it was introduced in 2014, a drop so drastic it earned the nickname AutosomalGeddon.

FTDNA’s higher segment size threshold should lower the proportion of matches relative to the full database, but their inclusion of small segments will have the opposite effect.  An unknown percentage of 23andMe customers don’t participate in DNA Relatives at all. And MyHeritage’s imputation algorithm still needs a lot of refinement. Precisely how these factors affect the relationship between actual and effective databases is an unanswered question.

 

Country of Origin

Both FTDNA and 23andMe have been selling DNA tests in countries other than the USA for longer than AncestryDNA has. Thus, one might expect that non-US testers would have more DNA matches at the former two companies than at AncestryDNA. However, only one respondent reported similar numbers at the three companies: the Saudi Arabian had 375 matches at AncestryDNA, 340 at FTDNA, 331 at 23andMe, and none at MyHeritage. Everyone else who answered the survey and who had tested at AncestryDNA plus at least one other company had many more matches at AncestryDNA, regardless of their country of origin. The survey did not address how closely related those matches are.

 

Summary and Recommendations

AncestryDNA remains the largest atDNA genealogical database by far, reporting 3 million testers in January of this year. As measured by effective database size (i.e., as a gauge of the expected number of matches), FTDNA’s is the next largest, with approximately 440,000 people. 23andMe recently reported that their database had more than 2 million customers, but based on this analysis, fewer than 20% of them opt in to family matching. Their effective database size was less than 400,000. Finally, MyHeritage is a new entrant in the field and still has a very small database of around 12,000 people.
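The “fewer than 20%” figure is simply the estimated effective size divided by the reported total. A quick check, with a helper name of my own and treating 2 million as a lower bound on 23andMe’s total (the company reported “more than” that number):

```python
def effective_share(effective_size, total_size):
    """Fraction of a company's total database represented by its effective size."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    return effective_size / total_size


share = effective_share(397_000, 2_000_000)
print(f"{share:.2%} of the reported total")
print("Below 20%:", share < 0.20)
```

Since the true total is above 2 million, the real share would be somewhat lower than this ratio. As discussed earlier, the ratio also reflects matching thresholds and the match cap, not opt-in rates alone.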

For someone in the US whose primary interest is genetic genealogy, and especially for someone searching for biological family, I recommend testing at AncestryDNA, then transferring the raw data to FTDNA and MyHeritage to get onto those other databases for free. For citizens of other countries, AncestryDNA will probably still give you more matching relatives than the other databases, although your matches at FTDNA and/or 23andMe may be more closely related. Regardless of where you live, if your primary interest is health reports, 23andMe is your best option.

 

Future Analyses and Predictions

The survey for data collection is currently closed. When AncestryDNA announces an update to their database size, I will re-open it and collect another round of data. I expect to see FTDNA’s database grow rapidly now that they are able to accept transfers of raw data from the newer versions of both AncestryDNA and 23andMe tests. Conversely, I predict that 23andMe’s effective database will grow more slowly, or possibly even shrink, as inactive customers log in and opt out of DNA Relatives. (I have lost more than 100 matches there since November.) I am interested to see how MyHeritage fares as the newest player in an established market.

 

UPDATE: On 22 April, AncestryDNA announced via their Twitter feed that their database had reached 4 million. I have updated the estimates accordingly.

19 thoughts on “Estimating the Sizes of the Genealogical atDNA Databases”

  1. I believe that there has been an increase in matches at 23andMe since they started selling ancestry-only kits for 99. It is my understanding that the US is the only country with the option of 99 for ancestry only or 199 for health and ancestry. In all other countries, 199 for health and ancestry is the only option.

    1. I would have expected to see an increase at 23andMe, too, but I had 1,583 matches there on 13 Nov and today I have 1477. One example is just anecdotal evidence, of course.

    2. I am in Australia and 23andMe is $149USD for us, ancestry only. It never goes on sale. I don’t think there are many Aussies who use 23andMe.

  2. The 23andMe number you published is highly inaccurate. This method cannot be used to estimate the number of people participating in DNA Relatives because of the cap on matches there and the effect this has on the number shown on the match lists.

    1. The effective database size is not a measure of the total number of people in the matching database; it’s an indicator of how many matches you’re likely to get. The cap lowers that number, so 23andMe has a relatively low effective database size.

      1. The amount of DNA shared with the matches at 23andMe will be much more significant, however. For example, my lowest new matches are 20 cM, as is the case with most of my accounts there. Since it is typically a waste of time to investigate the multitude of unphased matches between 7 cM and about 15 cM, this means the quality of 23andMe’s effective database is higher for the genealogist and, therefore, a direct comparison is not adequately conveyed by your method.

  3. Ancestry seems more interested in selling their product (as in their ads) as a way to discover your “Ethnic Origin”. A very large number of my matches have no family tree attached, and if you contact them they do not reply. They appear not to be interested in making DNA connections. A friend of mine bought the kits as gifts for her parents. They were surprised when people started contacting them and have not replied.

    So they may have a larger database, but what good is it when this is the case?

    If Ancestry were truly interested in the DNA angle of family heritage, they would make it more clear in their advertising and they would accept FTDNA raw data.

    1. My experience is different. Far more of my own top matches at AncestryDNA have trees (52%, not counting trees that aren’t attached to the DNA) than at FTDNA (32%), and I solve a lot more adoptee cases at Ancestry than elsewhere (although I would never recommend that an adoptee skip a database if they can afford to be in it).

  4. FTDNA matches a higher percentage than it should because it scores cmg for matches down to 500 SNP. When you examine shared child-parent matches and ask how many of the minor matches listed for the child also match the parent, you get a sense of how big a problem this is. I administer 23 FTDNA accounts, and as of today the total match numbers range from 231 for my mother (25% Austrian, 25% German, 50% Hungarian) to more than 10 times that for people with large percentages of DNA from the British Isles. Some of the distant relatives I have tested have ancestries I haven’t investigated fully, but my uncle is 1/8 Irish, 3/8 Scottish, 1/4 Swiss and 1/4 Bohemian and currently has 1948 matches. I disagree that low cmg matches are a waste of time, since I have low cmg matches with relatives I have tested and known relatives tested by others. However, to the extent possible, I do grandparent and great-grandparent mapping of my DNA and analyze match groups geographically to achieve higher efficiency. I also look for matches that cross probable inherited crossovers.

    1. FTDNA’s matching algorithm has two problems: (1) it has a higher matching threshold than the other companies, reducing the total number of matches, and (2) it includes segments down to 1 cM, which are more likely than not to be false positives, so all but the closest cousins are estimated to be nearer than they truly are.

      Their matching thresholds are nicely summarized here: http://thegeneticgenealogist.com/2016/05/24/family-tree-dna-updates-matching-thresholds/

  5. I have not worked very hard on which combination of cmg and SNP thresholds best reduces false match likelihood, but the mechanism by which crossover probability (cmg) affects false match likelihood is unclear to me. False match likelihood would seem to be more directly affected by sample size, but I have noticed some correlation of cmg to false match likelihood independent of sample size. Using FTDNA’s shared match report allows me to resolve some issues, but I agree that FTDNA’s high threshold is an issue. I also wish the test companies would use and report effective sample size, which excludes positions where either matching party is heterozygous ambiguous or a no test.

  6. There is an updated plot of “Number of people in DNA registries” using data attributed to thednageek.com in Science 18 May 2018 Vol. 360 Issue 6390 p.691

    Shows AncestryDNA over 8,000,000.

    Is this new data anywhere in this blog?

    Thanks for doing this survey and publishing results….

    1. I keep an updated graph of database sizes, along with other quick-reference information on the available tests, here: https://thednageek.com/dna-tests. Ancestry currently says they’re at “almost 10 million”. I rounded down to 9 million for the current version of the graph. I’d rather under-estimate than over-estimate.

  7. I do not think you should have eliminated the data for the person who reported 1424 matches at MyHeritage. I currently have 9851 matches at MyHeritage. People who are Jewish, or who come from other endogamous populations, will always have much higher numbers. You might want to re-think your methodology.

    1. That MyHeritage number was omitted because it was disproportionately large compared to the number of matches in other databases for the same person. It was clearly an error. In any case, MyHeritage now reports its database size, so this approach is no longer needed (except maybe to estimate the database size at FTDNA).
