The size of the autosomal database at a DNA testing company is an important consideration when deciding where to test. The larger the database, the more likely you are to have genealogically relevant matches. (There are exceptions, of course, for example when one company has better representation in the geographic region of your ancestors than its competitors.) However, not all of the companies release this information. The sizes of specific databases are frequent topics of debate within the genetic genealogy community.
I wondered whether I could adapt a method used in ecology to estimate the number of species in a region. Imagine that you manage a small city park and you need to know how many different tree species grow there. Trees are big and they don’t run away, so you could simply walk the paths with a clipboard and write down the name of every new species you saw.
Now imagine that you manage a large nature reserve in Costa Rica, and you need to determine how many species of insects live there … the task is no longer a walk in the park. (Sorry). Insects are small, they move, and most cannot be identified while they’re still alive. (Although killing all the bugs might make the tourists happy, everything else in your nature reserve would go extinct pretty quickly, and then you’d have bigger problems than squeamish tourists.) What you need is an estimate of the number of insect species without having to examine every square meter of the reserve and without having to identify every individual insect.
Think back to the trees in the city park. At the beginning of your inventory, you’d be adding lots of names to the list, but by the end, most or all of the species you encountered would already be on the list. That is, the rate at which you added new species to the inventory would decline the closer you got to the final tally. If you graphed how many species you’d found versus your sampling effort (how far you’d walked), the curve would look something like this:
This type of graph is called a species discovery curve. (Note that the “growth” in this curve does not mean that the absolute number of trees in your park is increasing in real time, just that the number of trees you’ve counted increases during the sampling period.)
Turns out, ecologists construct similar graphs when they need to estimate totals that are difficult or impossible to count, like how many small critters are in a large space. You don’t need to count every insect in the entire forest; you just need to sample enough random spots (called plots) in the forest to get a good feel for the shape of the curve. In this case, effort (the x-axis on the graph) might be the number of plots or person-hours of labor. Then, you can apply a mathematical curve fit to the data and estimate where the rate of increase (the slope) is zero. That point is known as the asymptote in geek language.
With me so far?
In theory, we should be able to apply the same logic to estimating the size of a genealogical database. In this case, the “number of insect species” becomes “number of people in the database” and a “plot” is “one person’s match list”. Consider GEDmatch. A “One-to-many DNA comparison” gives you up to 2000 unique matching kit numbers, analogous to species in a rainforest. (Let’s ignore that some may be duplicate uploads for the same person.) If I started my species discovery curve with my own kit, I would do a one-to-many and plot the point x = 1, y = 2000, where 1 is my sampling effort (one kit examined) and 2000 is the total number of unique kits found so far. In geek language, the point is written (1, 2000). If I did another one-to-many with another kit number, I might find that 50 kits in the second match list were also in the first one, meaning only 1950 were new. My grand total of unique kits would be 2000 + 1950 = 3950, and I would plot the point (2, 3950). A third one-to-many might overlap the first match list by 100 and the second by 50, giving 1850 new “species” to add to the running total. This point on the graph would be (3, 5800), where 3 is for the third sample and 5800 is the sum of my previous total (3950) and the 1850 new, unique kits I just found.
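The bookkeeping behind this curve is just a running union of sets. Here’s a minimal Python sketch; the match lists are toy kit IDs constructed to reproduce the overlaps in the example above, not real GEDmatch data:

```python
# Build a species discovery curve from a series of match lists.
# Each match list is a set of kit IDs; the curve records the
# cumulative number of unique kits after each sample.

def discovery_curve(match_lists):
    seen = set()    # every unique kit found so far
    curve = []      # (sample number, cumulative unique kits)
    for i, matches in enumerate(match_lists, start=1):
        seen |= set(matches)
        curve.append((i, len(seen)))
    return curve

# Toy data mirroring the example in the text: each match list has
# 2000 kits; the second repeats 50 already-seen kits, the third
# repeats 150 already-seen kits.
first  = {f"kit{n}" for n in range(2000)}
second = {f"kit{n}" for n in range(1950, 3950)}   # 50 repeats
third  = {f"kit{n}" for n in range(3800, 5800)}   # 150 repeats

print(discovery_curve([first, second, third]))
# [(1, 2000), (2, 3950), (3, 5800)]
```

Each point matches the running totals worked out above: 2000, then 3950, then 5800 unique kits.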
If you’re wondering who the heck would do something like this, my answer is … Hello, my name is Leah.
I solicited 100 GEDmatch kit numbers from volunteers with the goal of representing as much human genetic diversity as I could. The volunteers represented African, Ashkenazi, European (Eastern and Western), Korean, Middle Eastern, and Polynesian ancestry. For each GEDmatch number, I did a “one-to-many” with the default threshold of 7 cM. When a kit had fewer than 2000 matches (the maximum), I lowered the threshold to 3 cM to get more matches. (Remember, the goal here was not to identify IBD segments but to sample as many unique kit numbers as possible.) After each sample, I added to my species discovery curve as described in the theoretical example above. To get more data, I haphazardly pulled kit numbers from my growing list and analyzed them as well. The final dataset comprised 300 samples.
The graph is shown below. A negative exponential function was fitted to the curve, and the asymptote was calculated.
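For anyone who wants to try this on their own data: a negative exponential of the form S(x) = a(1 − e^(−bx)) levels off at a, so the fitted value of a is the asymptote, i.e., the estimated database size. Below is an illustrative Python sketch using scipy’s curve_fit; the data points are simulated from a known asymptote, not my actual GEDmatch counts:

```python
import numpy as np
from scipy.optimize import curve_fit

# Negative exponential: S(x) = a * (1 - exp(-b * x)).
# The parameter 'a' is the asymptote, i.e., the estimated total
# number of unique kits in the database.
def neg_exp(x, a, b):
    return a * (1.0 - np.exp(-b * x))

# Simulated discovery curve: 300 samples drawn from a known
# asymptote of 230,650 with a little random noise added.
rng = np.random.default_rng(0)
x = np.arange(1, 301)
y = neg_exp(x, 230650, 0.02) + rng.normal(0, 500, x.size)

# Fit the curve and read off the asymptote.
params, _ = curve_fit(neg_exp, x, y, p0=[200000, 0.01])
a_hat, b_hat = params
print(f"Estimated asymptote (database size): {a_hat:.0f}")
```

With real data, the quality of the estimate depends on how well the samples cover the database, which, as described below, turned out to be the weak point of my analysis.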
The predicted asymptote was 230,650, suggesting that GEDmatch had roughly that many unique, public kits available for matching when I did the analysis in June 2016. The actual public database size at the time was greater than 300,000 (personal communication, Curtis Rogers; note that the database is larger now), meaning that I underestimated the size of the GEDmatch database by about 23%. Looked at another way, the true size of GEDmatch’s database was 30% larger than my estimate.
Having the estimate that far off is humbling, but the scientific approach is to try to understand why the method didn’t work. My best guess for what went wrong is that my sampling strategy covered some of the genetic diversity in the GEDmatch database too well and other ethnicities not well enough, so that the curve started to level off although there were large “pockets” of kits that didn’t get sampled. Going back to the insects-in-the-rainforest analogy, it would be as if I’d sampled bugs by the beach, in the clearings, and in the old-growth forest, but neglected to sample along the river and on the mountainsides, where I’d expect different species to live.
The solution to this problem (if it is indeed why my estimate was too low) would be to randomly generate kit numbers rather than use those of volunteers. I tried doing this, but too few of the random kit numbers turned out to be real.
I meant this project to serve as a proof of concept that species discovery curves can be used to estimate the size of a genetic genealogy database. I chose to do this trial run at GEDmatch because one person could do all of the work, whereas a similar analysis at one of the testing companies would require a large-scale collaboration of citizen scientists. Put another way, I wanted to show that it worked before I recruited scores of people to help.
While I’m disappointed that my estimate was not more accurate, I’m sharing this story to spur the genetic genealogy community to improve this method or to come up with new strategies to gauge how big the databases are. This method would not work for newer databases — like DNA.Land or MyHeritage — where participants do not yet have large numbers of matches, nor would it tell us the total database size at 23andMe, where many of their customers remain anonymous. However, it might give general insight into how many people at 23andMe are actively sharing their DNA results, and it could offer a rough gauge of the size of the autosomal database at Family Tree DNA to complement the existing estimates.
Acknowledgments
I thank several volunteers from the International Society of Genetic Genealogy Facebook group for providing kit numbers. Curtis Rogers of GEDmatch kindly told me the size of their database after I’d done the analysis.
Thanks for a very interesting blog, and for doing this patient work, which kind of boggles the mind. You found 200,000 of the 300,000 kits, which is kind of amazing.
Getting back to the start of your post, a person is sometimes interested in the size of the database relevant to their genealogy. For this purpose, counting the number of matches with cM > a certain size (like 20 cM or 15 cM, small enough to be dominated by single segments but not so small as to have noise) might help in estimating RELATIVE SIZE.
But your technique estimates the ‘real size’ which is kind of neat.
Thanks! A colleague suggested a similar method that seems to be giving a better estimate, so stay tuned as we work out the kinks.
I think you might have been right, actually! I have found that many, many of my matches are doubles of the same person and DNA, but with two different testing companies, or slightly edited GEDCOMs, or just uploaded twice for no reason. I don’t quite understand why, though, and it is certainly eating up space in the 2000 matches.
Some people upload from different test companies to see if they get slightly different matches. (The tests aren’t identical.) Others upload multiple times because they don’t think it worked the first time. I agree, it’s messy.