Relationship Predictor Survey
There are now several tools that predict which genealogical relationships are most likely for given amounts of shared DNA. With the recent introduction of the cM Explainer™ at MyHeritage, Blaine Bettinger and I launched a citizen science initiative to evaluate them. Volunteers have been contributing match data for known relationships, then running each match through each of the tools in the study. The goal is to determine which predictor, if any, is best. You can read more about the project here.
To date, the survey has more than 6,000 entries for 50 unique relationships ranging from parent–child to 8th cousins twice removed. We’re putting the finishing touches on the analyses and are excited to share them with the genealogy community soon. In the meantime, it helps to understand how relationship predictors work.
What Could Go Wrong?
Biology likes to throw curveballs, so even the best DNA-based predictions will be off sometimes. On average, 1st cousins (1Cs) share roughly 875 cM, but on rare occasions a 1st cousin once removed (1C1R) can share that much. If you happen to be that rare 1C1R, all of the known predictive tools will peg you as a 1C rather than a 1C1R. That doesn’t mean the tools are faulty (nor that you are!), just that this is a particularly tough case.
Of course, not all predictive tools are equal. They are based on computer models, and those models make assumptions. If the assumptions are wrong, the predictions can be off as well.
By understanding those assumptions, we can get a better sense of how much credence to give a predictive tool. Let’s consider some of them.
The genealogy companies don’t all analyze the same subset of the human genome. For example, a parent–child match at AncestryDNA is about 3,470 cM while the same match at FamilyTreeDNA is roughly 3,560 cM (see table). A predictive tool that assumes a genome of 3,470 cM might give slightly less accurate predictions for a match from a different company.
For simplicity, in this post I will assume that a parent–child match shares 3500 cM and that the amount is halved each generation.
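The halving assumption is easy to write down. The snippet below is only a minimal sketch: the function name `expected_cm` and the idea of counting "halvings" from a parent–child match are illustrative choices of mine, not something taken from any of the tools in the study.

```python
PARENT_CHILD_CM = 3500  # simplifying assumption used in this post


def expected_cm(halvings, parent_child_cm=PARENT_CHILD_CM):
    """Average shared cM after halving a parent-child match `halvings` times."""
    return parent_child_cm / 2 ** halvings


# Relationships mentioned in this post, keyed by number of halvings (illustrative).
examples = {
    "parent-child": 0,
    "grandparent, aunt/uncle, or half sibling": 1,
    "1C": 2,
    "1C1R": 3,
}

for name, halvings in examples.items():
    print(f"{name}: {expected_cm(halvings):g} cM")
```

Passing a different `parent_child_cm` (say, 3470 or 3560) shows how the averages shift when a company analyzes a slightly different subset of the genome.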
Sex-specific Inheritance Patterns
We inherit exactly 50% (3500 cM) of our autosomal DNA from each parent but not exactly 25% (1750 cM) from each grandparent. That’s because a process called “crossing over” in a parent’s body divvies up the grandparents’ DNA before passing it on, and that division is rarely equitable. Although the average is ≈1750 cM, some grandparent–grandchild matches will share more and some will share less.
It turns out that egg production involves more crossover events than sperm production. Just as you’re more likely to get close to a 50-50 split of heads versus tails if you flip a coin 20 times than 10 times, you’re more likely to share close to the average of 1750 cM with your maternal grandparents than your paternal ones; the crossover events are analogous to coin flips. That means relationship prediction for your maternal relatives should be slightly more accurate, because those matches will tend to fall closer to the average.
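The coin-flip intuition can be checked with exact binomial arithmetic. This is a sketch of the analogy only: the 10 and 20 flips come from the example above, not from real crossover counts, and the function name `prob_near_even` and the 10-percentage-point tolerance are assumptions made for illustration.

```python
from math import comb


def prob_near_even(flips, tolerance=0.10):
    """Exact chance that the share of heads lands within `tolerance` of 50%."""
    # The small epsilon guards against floating-point error at the boundaries.
    hits = sum(
        comb(flips, heads)
        for heads in range(flips + 1)
        if abs(heads / flips - 0.5) <= tolerance + 1e-12
    )
    return hits / 2 ** flips


for flips in (10, 20):
    print(f"{flips} flips: {prob_near_even(flips):.0%} chance of a 40-60% share of heads")
```

With these settings, the chance of landing between 40% and 60% heads is about 66% for 10 flips and about 74% for 20 flips: more flips means results cluster more tightly around the average.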
How likely a DNA match is to be any given relationship depends, in part, on how many such relatives you have. For example, a match of 1750 cM could be a grandparent, grandchild, aunt or uncle, niece or nephew, or half sibling, and which is more likely will depend on how many of each you have.
On paper, I have four grandparents, one aunt, one uncle, and one half sister. Based on those numbers, a 1750-cM match to me has a 57% chance of being a grandparent, a 29% chance of being an aunt or uncle, and a 14% chance of being a half sibling. A younger cousin of mine has four grandparents, six aunts/uncles, and two half siblings, so for her the probabilities are 33%, 50%, and 17%, respectively. Those probabilities are quite different.
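Those percentages are just each group’s share of the total head count. Here is a minimal sketch of the arithmetic; the function name `relationship_odds` is hypothetical, and the calculation assumes every relative is equally likely to show up as a match, which is a simplification.

```python
def relationship_odds(counts):
    """Turn a count of relatives per relationship into probabilities."""
    total = sum(counts.values())
    return {relationship: n / total for relationship, n in counts.items()}


# Counts taken from the two families described above.
mine = relationship_odds({"grandparent": 4, "aunt/uncle": 2, "half sibling": 1})
cousin = relationship_odds({"grandparent": 4, "aunt/uncle": 6, "half sibling": 2})

for label, odds in (("me", mine), ("my cousin", cousin)):
    print(label, {rel: f"{p:.0%}" for rel, p in odds.items()})
```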
Any tool that tries to differentiate those relationships must make assumptions about the family structure, usually based on what’s typical for the population. If your family structure doesn’t fit those assumptions, the predictions can be off for you yet be quite good for someone else.
Population growth also plays a key role here. Consider an unknown match who shares somewhere between the averages for 1C and 1C1R. Which of those two relationships is more likely depends, in part, on how many 1Cs and 1C1Rs you have. If each of your 1Cs had one child, then you should have an equal number of 1Cs and 1C1Rs. (Let’s ignore our parents’ first cousins for now, just to get the point across.) The distributions will look like this:
The break-even point, where a match is equally likely to be either relationship, is around 625 cM.
However, if each of those family members had two kids, you would have twice as many 1C1Rs as 1Cs, and the distributions would look like this:
In this case, the break-even point is closer to 650 cM and a 625-cM match is more likely to be a 1C1R. If your family averages four kids per couple, the probabilities are shifted even more.
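To see why the break-even point moves, here is a toy model. It is a sketch under loud assumptions: both relationships are modeled as bell curves with the same made-up spread (`SPREAD`), and the function name `break_even` is hypothetical. Real shared-cM distributions are not shaped like this, so the numbers it produces will not match the ones quoted above; only the direction of the shift is meaningful.

```python
from statistics import NormalDist

MEAN_1C, MEAN_1C1R = 875.0, 437.5  # averages under this post's halving assumption
SPREAD = 150.0                     # hypothetical standard deviation, illustration only


def break_even(n_1c, n_1c1r, spread=SPREAD):
    """Shared cM at which a match is equally likely to be a 1C or a 1C1R."""
    dist_1c = NormalDist(MEAN_1C, spread)
    dist_1c1r = NormalDist(MEAN_1C1R, spread)

    def excess_1c(cm):
        # Positive when 1C is the more likely relationship at this cM value.
        return n_1c * dist_1c.pdf(cm) - n_1c1r * dist_1c1r.pdf(cm)

    # Bisection between the two averages; assumes the head counts are not so
    # lopsided that one relationship dominates across the whole interval.
    low, high = MEAN_1C1R, MEAN_1C
    for _ in range(60):
        mid = (low + high) / 2
        if excess_1c(mid) > 0:
            high = mid
        else:
            low = mid
    return (low + high) / 2


equal = break_even(n_1c=1, n_1c1r=1)
doubled = break_even(n_1c=1, n_1c1r=2)
print(f"equal numbers: {equal:.0f} cM; twice as many 1C1Rs: {doubled:.0f} cM")
```

In this toy model, doubling the number of 1C1Rs pushes the break-even point upward, which is the same direction of shift described above.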
The most popular predictive tools are based on proprietary data from either AncestryDNA or MyHeritage, neither of which has publicly stated the population growth factor it uses. However, it’s safe to assume it’s around 2.5, which is a fairly standard fertility rate for developed countries over the past century. Again, if your family doesn’t align well with that assumption, the predictors may not work as well for you.
Perhaps the biggest challenge for relationship predictors is endogamy, the practice of marrying within a cultural or geographic group. Endogamy is common around the world and causes people to be related in more than one way. Those “extra” connections can increase the amount of shared DNA and throw off predictor tools that assume each match is related only once. Thus, those of us from endogamous populations will get less reliable relationship predictions.
The Best Tool
The ideal relationship predictor would be tailored to your particular family structure, your population’s growth rate, your level of endogamy, and so on. Such a tool does not exist yet. (But stay tuned!) In its absence, we would like to know which of the available tools gives most people the correct relationships most of the time.
The next post in this series will reveal how the tools did!
Whether you love math or hate it, you’ll get more out of genetic genealogy if you understand some basic concepts and how they apply to our DNA. Join me for my upcoming class “No One Told Me There Would Be Math! DNA Numbers Made Easy.” It’s meant to be accessible to everyone, even if you haven’t done math since high school. The same lecture is offered on two dates/times, so sign up for the one that best fits your schedule.
Posts in This Series
- Relationship Prediction Tools: Which Is Best?
- The Relationship Predictor Comparison: A First Peek
- In Which Citizen Science Finds an Error, and the Error Is Fixed
- The Science Behind Relationship Predictors