The past few months have brought some exciting news from two of the main autosomal DNA (atDNA) companies: AncestryDNA exceeded 3 million testers in January 2017, and 23andMe broke 2 million in April. The larger the database, the more likely we are to find DNA relatives with information that might enhance our family trees, so this information is an important consideration in deciding where to spend our testing dollars. However, Family Tree DNA (FTDNA) does not publicize how large their atDNA database is, and the newest player in the market, MyHeritage, hasn’t released figures either. Another wrinkle is that not everyone at 23andMe participates in DNA Relatives, the matching tool there. Thus, there is a difference between the total size of a database and what I’ll call the effective size, that is, the number of people available for cousin matching.
When the same person has tested at multiple companies, we can get a sense of the relative sizes of the databases by comparing how many matches they have at each site. If they have ten times as many matches at Company A as at Company B, it’s not unreasonable to assume that Company A’s database is roughly ten times larger. Of course, there are other considerations. For example, if the person lives in a country where Company A does not sell tests but Company B does, their results may not represent the true sizes of the databases. If we can get dozens or hundreds of people to all contribute data, though, our estimates should be pretty accurate.
To that end, I collected self-reported information from genetic genealogists using an online survey form. The data were used to estimate the effective sizes of the databases at FTDNA, 23andMe, and MyHeritage relative to that at AncestryDNA.
From 10 to 17 April 2017, I offered a Google Forms survey to members of various genetic genealogy groups on Facebook. The final version of the questionnaire is shown below. (The survey is currently closed to new submissions.)
To make the survey easy to answer, I asked respondents to report how many pages of AncestryDNA matches they had rather than the total number of matches. AncestryDNA reports 50 matches per page. I estimated the number of AncestryDNA matches as (p-1) x 50 + 25, where p is the number of pages of matches. For example, if someone had 101 pages of matches, I knew that they had at least 100 x 50 = 5000 matches, and the last page could have between 1 and 50 additional relatives, which averages to about 25. The formula would estimate (101 – 1) x 50 + 25 = 5025 matches.
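The page-count conversion can be sketched in a few lines of Python. This is only an illustration of the formula above; the function and constant names are my own and were not part of the original analysis.

```python
MATCHES_PER_PAGE = 50  # AncestryDNA page size, as stated in the text


def estimate_ancestry_matches(pages: int) -> int:
    """Estimate total matches from a page count.

    Every page except the last is assumed full; the last page is
    credited with 25 matches (the text's "averages to about 25").
    """
    if pages < 1:
        raise ValueError("pages must be at least 1")
    full_pages = pages - 1
    return full_pages * MATCHES_PER_PAGE + MATCHES_PER_PAGE // 2


# The worked example from the text: 101 pages of matches.
print(estimate_ancestry_matches(101))  # 5025
```

The estimate can be off by up to about 25 matches in either direction, which is negligible for anyone with more than a few pages of matches.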
For each of the respondents who had tested at AncestryDNA and at least one other company, I calculated a point estimate of the total effective database size of that other company as follows. I assumed that AncestryDNA’s database has 3 million participants and calculated a simple proportion as (number of matches at Company X) / (number of matches at AncestryDNA) x 3,000,000. For example, if someone had 10,000 matches at AncestryDNA and 1,000 at 23andMe, the point estimate of the effective database size at 23andMe would be 1000 / 10000 x 3000000 = 300,000.
I then averaged all of the point estimates for that particular database to determine the effective database size.
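As a rough sketch, the point estimate and the averaging step might look like this in Python. The function names and the sample responses are hypothetical and serve only to illustrate the arithmetic; they are not survey data.

```python
from statistics import mean

ANCESTRY_DB_SIZE = 3_000_000  # benchmark assumed in the text


def point_estimate(matches_other: int, matches_ancestry: int,
                   benchmark: int = ANCESTRY_DB_SIZE) -> float:
    """Scale the benchmark by the ratio of matches at the two companies."""
    if matches_ancestry <= 0:
        raise ValueError("need a positive AncestryDNA match count")
    return matches_other * benchmark / matches_ancestry


def effective_size(responses, benchmark: int = ANCESTRY_DB_SIZE) -> float:
    """Average the per-respondent point estimates for one company.

    `responses` is an iterable of (matches_other, matches_ancestry) pairs.
    """
    return mean(point_estimate(other, ancestry, benchmark)
                for other, ancestry in responses)


# The worked example from the text: 1,000 matches at 23andMe
# against 10,000 at AncestryDNA.
print(point_estimate(1000, 10000))  # 300000.0

# Made-up responses, for illustration only.
sample = [(1000, 10000), (500, 4000), (300, 2500)]
print(effective_size(sample))
```

Note that averaging the individual ratios gives every respondent equal weight, regardless of how many matches they have. Dividing the summed match counts instead would weight heavily matched respondents more strongly; the sketch follows the averaging approach described above.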
As of 17 April, 243 people had responded to the survey, providing 216 data points for AncestryDNA, 126 for 23andMe, and 100 for MyHeritage. Of the 235 reports for FTDNA tests, 80 were originally run by FTDNA’s labs, 44 were transfers of either Ancestry v1 or 23andMe v3 results (v1/v3), 17 were transfers of either Ancestry v2 or 23andMe v4 results, and the remainder did not specify. Only the 80 FTDNA-run results and the 44 v1/v3 transfers were used in the following analyses, for a total of 124 FTDNA data points. Not every respondent had tested at all of the companies, and 27 either had not tested at AncestryDNA or incorrectly reported their matches there, so they could not be used to calculate point estimates.
A few reported numbers were omitted from calculations because they appeared to be errors. For example, one respondent reported 1424 matches at MyHeritage, when the next highest number was 667. That respondent did not report a value for 23andMe at all, and I assumed that 1424 had been entered in the wrong field. Another respondent filled out the form six times; only one submission was counted in the calculations. A third reported 366 pages of matches at AncestryDNA, 2087 matches at FTDNA, and zero matches at 23andMe. The latter value was omitted.
Most respondents (164) reported being from the USA. An additional 21 were from the British Isles; 15 from Canada; 10 from Australia; three each from the Netherlands, Sweden, and New Zealand; and one each from France, Germany, Guatemala, and Saudi Arabia. The remainder did not report a country of origin.
This table summarizes the numbers of self-reported matches at each of the main testing companies and the estimated effective database sizes.
My initial plan was to estimate the absolute database sizes of the main atDNA companies relative to that of a known benchmark (AncestryDNA). However, that goal proved illusory because I could think of no way to adjust for the different algorithms and policies of each company. (See below.) Instead, I decided to consider the effective database size as a gauge of the potential to find matches. After all, that’s what a genealogist cares about. If a hypothetical database will only match you to people who share enough DNA to be 2nd cousins or better, you won’t get very many matches, no matter how many people have tested there.
The Effective Database Sizes of the Main DNA Testing Companies
With the assumption that AncestryDNA’s effective database holds 3 million people, the analysis indicated that FTDNA’s has about 440,000, 23andMe’s about 397,000, and MyHeritage’s about 12,000. AncestryDNA’s database is undoubtedly larger now (the 3 million figure was announced in January), so all of the other databases are proportionally larger, as well. The calculations can be redone easily when AncestryDNA reports their current database size.
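Because every estimate is a simple proportion of the benchmark, redoing the calculation for a new AncestryDNA figure amounts to a rescaling. The Python sketch below assumes the rounded figures quoted above are tied to the 3 million benchmark, and uses the 4 million figure from the update at the end of this post as the new benchmark. The function name is my own.

```python
def rescale(estimate: float, old_benchmark: int, new_benchmark: int) -> float:
    """Rescale an effective-size estimate to a new benchmark size."""
    return estimate * new_benchmark / old_benchmark


# Rounded estimates quoted in the text, computed against 3 million.
estimates_at_3m = {"FTDNA": 440_000, "23andMe": 397_000, "MyHeritage": 12_000}

rescaled = {
    company: round(rescale(size, 3_000_000, 4_000_000))
    for company, size in estimates_at_3m.items()
}

for company, size in rescaled.items():
    print(f"{company}: about {size:,}")
```

This only holds if the ratio of matches between companies stays the same as the databases grow, which a fresh round of survey data would be needed to confirm.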
23andMe has an arbitrary cap of 2000 matches in their DNA Relatives feature, although it does not seem to have affected the results of this analysis, as only one respondent had reached the cap. That person reported 2312 matches and was able to exceed the cap by making contact with the matches low on their list.
Effective Database Size Versus Actual Database Size
A number of factors will cause the actual database sizes to differ from the effective database sizes. First, anything that lowers the proportion of matches at AncestryDNA relative to the total database size there will artificially inflate the estimates of effective database size at the other companies. Two such factors are AncestryDNA’s use of statistical phasing to reduce the number of false positive segments and their Timber tool to remove excess IBD (also known as pile-ups). Timber was reported to have removed 75–80% of matches when it was introduced in 2014, a drop so drastic it earned the nickname AutosomalGeddon.
FTDNA’s higher segment size threshold should lower the proportion of matches relative to the full database, but their inclusion of small segments will have the opposite effect. An unknown proportion of 23andMe customers don’t participate in DNA Relatives at all. And MyHeritage’s imputation algorithm still needs a lot of refinement. Precisely how these factors affect the relationship between actual and effective databases is an unanswered question.
Country of Origin
Both FTDNA and 23andMe have been selling DNA tests in countries other than the USA for longer than AncestryDNA has. Thus, one might expect that non-US testers would have more DNA matches at the former two companies than at AncestryDNA. However, only one respondent reported similar numbers at the three companies: the respondent from Saudi Arabia had 375 matches at AncestryDNA, 340 at FTDNA, 331 at 23andMe, and none at MyHeritage. Everyone else who answered the survey and who had tested at AncestryDNA plus at least one other company had many more matches at AncestryDNA, regardless of their country of origin. The survey did not address how closely related those matches are.
Summary and Recommendations
AncestryDNA remains the largest atDNA genealogical database by far, reporting 3 million testers in January of this year. As measured by effective database size (i.e., the expected number of matches), FTDNA’s is the next largest, with approximately 440,000 people. 23andMe recently reported that their database had more than 2 million customers, but based on this analysis, fewer than 20% of them opt in to family matching. Their effective database size was less than 400,000. Finally, MyHeritage is a new entrant in the field and still has a very small database of around 12,000 people.
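The "fewer than 20%" figure for 23andMe follows from dividing the effective size by the reported customer count. A minimal check in Python, treating the effective size as a stand-in for the number of opted-in customers (a simplification, since differences in matching thresholds also play a role) and using my own function name:

```python
def implied_share(effective_size: int, total_customers: int) -> float:
    """Effective database size as a fraction of reported customers."""
    return effective_size / total_customers


# About 397,000 effective vs. the 2 million customers 23andMe reported.
share = implied_share(397_000, 2_000_000)
print(f"{share:.2%}")
```

Since 23andMe reported more than 2 million customers, the true share would be somewhat lower than this figure.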
For someone in the US whose primary interest is genetic genealogy, and especially for someone searching for biological family, I recommend testing at AncestryDNA, then transferring the raw data to FTDNA and MyHeritage to get onto those other databases for free. For citizens of other countries, AncestryDNA will probably still give you more matching relatives than the other databases, although your matches at FTDNA and/or 23andMe may be more closely related. Regardless of where you live, if your primary interest is health reports, 23andMe is your best option.
Future Analyses and Predictions
The survey for data collection is currently closed. When AncestryDNA announces an update to their database size, I will re-open it and collect another round of data. I expect to see FTDNA’s database grow rapidly now that they are able to accept transfers of raw data from the newer versions of both AncestryDNA and 23andMe tests. Conversely, I predict that 23andMe’s effective database will grow more slowly, or possibly even shrink, as inactive customers log in and opt out of DNA Relatives. (I have lost more than 100 matches there since November.) I am interested to see how MyHeritage fares as the newest player in an established market.
UPDATE: On 22 April, AncestryDNA announced via their Twitter feed that their database had reached 4 million. I have updated the estimates accordingly.