The Science Behind Relationship Predictors

Relationship Predictor Survey

There are now several tools that predict which genealogical relationships are most likely for given amounts of shared DNA.  With the recent introduction of the cM Explainer™ at MyHeritage, Blaine Bettinger and I launched a citizen science initiative to evaluate them.  Volunteers have been contributing match data for known relationships, then running each match through each of the tools in the study.  The goal is to determine which predictor, if any, is best.  You can read more about the project here.

To date, the survey has more than 6,000 entries for 50 unique relationships ranging from parent–child to 8th cousins twice removed.  We’re putting the finishing touches on the analyses and are excited to share them with the genealogy community soon.  In the meantime, it helps to understand how relationship predictors work.

What Could Go Wrong?

Biology likes to throw curveballs, so even the best DNA-based predictions will be off sometimes.  On average, 1st cousins (1Cs) share roughly 875 cM, but a 1st cousin once removed (1C1R) can share that much on the rare occasion.  If you happen to be that rare 1C1R, all of the known predictive tools will peg you as a 1C over a 1C1R.  That doesn’t mean the tools are faulty—nor that you are!—just that this is a particularly tough case.

Of course, not all predictive tools are equal.  They are based on computer models, and those models make assumptions.  If the assumptions are wrong, the predictions can be off as well.

The Assumptions

By understanding those assumptions, we can get a better sense of how much credence to give a predictive tool.  Let’s consider some of them.

Genome Size

The genealogy companies don’t all analyze the same subset of the human genome.  For example, a parent–child match at AncestryDNA is about 3,470 cM while the same match at FamilyTreeDNA is roughly 3,560 cM (see table).  A predictive tool that assumes a genome of 3,470 cM might give slightly less accurate predictions for a match from a different company.

For simplicity, in this post I will assume that a parent–child match shares 3500 cM and that the amount is halved each generation.
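As a rough sketch of that simplification, the snippet below computes the expected shared amount from the number of times the parent–child figure has been halved.  The function name `expected_cm` and the `halvings` parameter are illustrative only; they are not taken from any of the tools in the study.

```python
PARENT_CHILD_CM = 3500  # simplifying assumption used in this post

def expected_cm(halvings):
    """Expected shared cM after halving the parent-child amount `halvings` times."""
    return PARENT_CHILD_CM / 2 ** halvings

# Under this assumption: 0 halvings = parent-child, 1 = grandparent,
# 2 = 1st cousin (1C), 3 = 1st cousin once removed (1C1R).
for label, halvings in [("parent-child", 0), ("grandparent", 1), ("1C", 2), ("1C1R", 3)]:
    print(f"{label}: {expected_cm(halvings):g} cM")
```

These are averages only; as the following sections explain, real matches scatter around them.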

Sex-specific Inheritance Patterns

We inherit exactly 50% (3500 cM) of our autosomal DNA from each parent but not exactly 25% (1750 cM) from each grandparent.  That’s because a process called “crossing over” in a parent’s body divvies up the grandparents’ DNA before passing it on, and that division is rarely equitable.  Although the average is ≈1750 cM, some grandparent–grandchild matches will share more and some will share less.

Distributions of atDNA shared with maternal (pink) and paternal (blue) grandparents. The higher the bar, the more likely that centimorgan amount is for the relationship.

It turns out that egg production involves more crossover events than sperm production.  Just as the share of heads tends to land closer to 50% if you flip a coin 20 times than 10 times, you’re more likely to share close to the average of 1750 cM with your maternal grandparents than your paternal ones; the crossover events are analogous to coin flips.  That means relationship prediction for your maternal relatives should be slightly more accurate, because those matches will be closer to the average, on average.
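The coin-flip analogy can be checked with a few lines of arithmetic.  This sketch computes the exact chance that the share of heads falls between 40% and 60%; that band is an arbitrary choice made here for illustration, and the function name `prob_near_even` is hypothetical.

```python
from math import comb

def prob_near_even(flips):
    """Chance that the share of heads lands between 40% and 60% inclusive."""
    # Integer comparison (4*flips <= 10*heads <= 6*flips) avoids float rounding.
    favorable = sum(
        comb(flips, heads)
        for heads in range(flips + 1)
        if 4 * flips <= 10 * heads <= 6 * flips
    )
    return favorable / 2 ** flips

for flips in (10, 20):
    print(flips, "flips:", round(prob_near_even(flips), 3))
```

With that band, the chance is about 66% for 10 flips and about 74% for 20, so more flips (or more crossovers) pull results closer to the average.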

Family Structure

How likely a DNA match is to be any given relationship depends, in part, on how many such relatives you have.  For example, a match of 1750 cM could be a grandparent, grandchild, aunt or uncle, niece or nephew, or half sibling, and which is more likely will depend on how many of each you have.

On paper, I have four grandparents, one aunt, one uncle, and one half sister.  Based on those numbers, a 1750-cM match to me has a 57% chance of being a grandparent, a 29% chance of being an aunt or uncle, and a 14% chance of being a half sibling.  A younger cousin of mine has four grandparents, six aunts/uncles, and two half siblings, so for her the probabilities are 33%, 50%, and 17%, respectively.  Those probabilities are quite different.
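The arithmetic behind those percentages is just each category's head count divided by the total.  A minimal sketch, using the counts from the paragraph above (the function name `relationship_odds` is made up for this example):

```python
def relationship_odds(counts):
    """Turn head counts per relationship into probabilities that sum to 1."""
    total = sum(counts.values())
    return {relationship: n / total for relationship, n in counts.items()}

# Counts taken from the text: one aunt plus one uncle gives 2 aunts/uncles.
mine = relationship_odds({"grandparent": 4, "aunt/uncle": 2, "half sibling": 1})
cousin = relationship_odds({"grandparent": 4, "aunt/uncle": 6, "half sibling": 2})

for name, odds in (("me", mine), ("cousin", cousin)):
    print(name, {rel: f"{p:.0%}" for rel, p in odds.items()})
```

This treats every relative in the group as equally likely to be the match, which is the same simplification the paragraph makes.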

Any tool that tries to differentiate those relationships must make assumptions about the family structure, usually based on what’s typical for the population.  If your family structure doesn’t fit those assumptions, the predictions can be off for you yet be quite good for someone else.

Population Growth

Population growth also plays a key role here.  Consider an unknown match who shares somewhere between the averages for 1C and 1C1R.  Which of those two relationships is more likely depends, in part, on how many 1Cs and 1C1Rs you have.  If each of your 1Cs had one child, then you should have an equal number of 1Cs and 1C1Rs.  (Let’s ignore our parents’ first cousins for now, just to get the point across.)  The distributions will look like this:

Histograms for 1C and 1C1R when you have the same numbers of each. The two are equally likely around 625 cM.

The break-even point, where a match is equally likely to be either relationship, is around 625 cM.

However, if each of your 1Cs had two kids, you would have twice as many 1C1Rs as 1Cs, and the distributions would look like this:

Histograms for 1C and 1C1R when you have twice as many 1C1Rs. The two are equally likely around 650 cM.

In this case, the break-even point is closer to 650 cM and a 625-cM match is more likely to be a 1C1R.  If your family averages four kids per couple, the probabilities are shifted even more.
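To see why the break-even point moves, here is a toy model, not the method used by any of the tools.  It stands in for the histograms with bell curves centered on the simplified averages (875 cM and 437.5 cM) and weights each curve by how many such relatives you have.  The spreads (`sd_1c`, `sd_1c1r`) are placeholder values invented for illustration, so the output should not be expected to reproduce the 625 and 650 cM figures above; only the direction of the shift is the point.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal (bell) curve at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def break_even(n_1c, n_1c1r, mean_1c=875.0, mean_1c1r=437.5,
               sd_1c=120.0, sd_1c1r=90.0):  # spreads are assumed, not measured
    """cM value at which a match is equally likely to be a 1C or a 1C1R."""
    def diff(x):
        return (n_1c * normal_pdf(x, mean_1c, sd_1c)
                - n_1c1r * normal_pdf(x, mean_1c1r, sd_1c1r))
    lo, hi = mean_1c1r, mean_1c  # 1C1R wins at lo, 1C wins at hi
    for _ in range(60):          # bisection search for the crossover
        mid = (lo + hi) / 2
        if diff(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

print("equal numbers:      ", round(break_even(1, 1)))
print("twice as many 1C1Rs:", round(break_even(1, 2)))
```

Doubling the number of 1C1Rs pushes the crossover toward higher cM values, which matches the direction of the shift described above.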

The most popular predictive tools are based on proprietary data from either AncestryDNA or MyHeritage, neither of which has publicly stated the population growth factor they use.  However, it’s safe to assume it’s around 2.5, which is a fairly standard fertility rate for developed countries over the past century.  Again, if your family doesn’t align well with that assumption, the predictors may not work as well for you.

Endogamy

Perhaps the biggest challenge for relationship predictors is endogamy, the practice of marrying within a cultural or geographic group.  Endogamy is common around the world and causes people to be related in more than one way.  Those “extra” connections can increase the amount of shared DNA and throw off predictor tools that assume each match is related only once.  Thus, those of us from endogamous populations will get less reliable relationship predictions.
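A simple way to picture the effect is to add up the expected amounts for each way two people are related.  The sketch below reuses this post's simplification (3500 cM for parent–child, halved at each step) and assumes the separate connections simply add, which is itself a simplification.  The function names are hypothetical.

```python
PARENT_CHILD_CM = 3500  # simplifying assumption used in this post

def expected_cm(halvings):
    """Expected shared cM after halving the parent-child amount `halvings` times."""
    return PARENT_CHILD_CM / 2 ** halvings

def expected_total(paths):
    """Sum the expected cM over each separate way two people are related."""
    return sum(expected_cm(halvings) for halvings in paths)

single_1c = expected_total([2])     # related as 1st cousins once
double_1c = expected_total([2, 2])  # related as 1st cousins twice over

print("1C:", single_1c, "cM")
print("double 1C:", double_1c, "cM")
print("grandparent average:", expected_cm(1), "cM")
```

Under these assumptions, double first cousins are expected to share as much as a grandparent and grandchild, so a tool that assumes a single connection would likely suggest a closer relationship than the true one.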

The Best Tool

The ideal relationship predictor would be tailored to your particular family structure, your population’s growth rate, your level of endogamy, and so on.  Such a tool does not exist yet.  (But stay tuned!)  In its absence, we would like to know which of the available tools gives most people the correct relationships most of the time.

The next post in this series will reveal how the tools did!

Learn More!

Whether you love math or hate it, you’ll get more out of genetic genealogy if you understand some basic concepts and how they apply to our DNA.  Join me for my upcoming class “No One Told Me There Would Be Math!  DNA Numbers Made Easy.”  It’s meant to be accessible to everyone, even if you haven’t done math since high school.  The same lecture is offered on two dates/times, so sign up for the one that best fits your schedule.

Title slide for the upcoming talk on math in genealogy


8 thoughts on “The Science Behind Relationship Predictors”

  1. Thank you for the elegant explanation. Family structure would seem to be the major factor making us misfits in the current models. I would love to contribute to research in that direction.

  2. You say “your parent’s 1Cs had one child each” would count in the total number of 1C1R’s. However, the children of one of my parent’s 1st cousins are the same number of generations removed from my grandparents and are therefore my 2nd cousins, correct? The only way to get a 1C1R is the children of your 1st cousins, although technically, if we were to treat 1st cousins the same way we do other cousins, we could say we could refer to their parents as 1C1R as well, but we more commonly refer to them as aunts and uncles. 😉

    1. Good catch! And a reminder that relationship prediction is often more complicated than a simplified description can capture. The children of your 1Cs are 1C1Rs, but so are your parent’s 1Cs. I’ll edit the text to be a bit more clear.

  3. > The most popular predictive tools are based on proprietary data from either AncestryDNA or MyHeritage, neither of which has publicly stated the population growth factor they use. However, it’s safe to assume it’s around 2.5, which is a fairly standard fertility rate for developed countries over the past century.

    You should be able to calculate this from public trees, given estimated birthdates for currently-living people, and maybe given family, regional / cultural, and generational variations.

  4. A small pedantic correction: I think you are more likely to get a(n exact) 50-50 split from 10 coin tosses than 20 (~17.6% against ~12.5%, if I’ve done the sums right). With 2 coin tosses it’s 50% of course. But we know what you mean: the distribution is “tighter” around 50-50 with increasing n.

  5. Differentiating between the 3 bundled relationship categories within that 2nd (25%) degree of relatedness is what needs to be done to figure out how the testers are related. It can obviously be done successfully the way you describe, either with advanced calculation methods or with one of the good 3rd party tools available. Another way to differentiate the relationship categories that belong to the 25% degree is by matching the generational separation between the two testers to the generational separation that the relationship categories describe. Grandparent/Grandchild is a 2 generation separation category, Uncle-Aunt/Niece-Nephew is a 1 generation separation category and half sibling is a 0 generation separation category. In each degree of relatedness there won’t be more than one category per number of generations separated.

    The calculators rank probability based on the number of people related in that category: the more people, the more likely the testers are to be related in that category, which is a likelihood based on statistical probability. The alternate approach requires knowledge or an estimated guess at generational separation by the user to select the category that aligns to their situation with the other tester. I think the latter approach has a better chance of turning out to be correct on the first try just because it involves the user and makes them look at their specific situation in a way that no tool maker can anticipate. However, as assisted reproduction becomes more and more prevalent (there goes your volume), our perceptions of generational separation based on age alone are less and less likely to be correct.

    I think at the moment the user has a better chance of picking the correct category by knowing or estimating how many generations separate the two testers than by looking at the chances based on the number of people possible in the category. But our ability to guess at how many generations separate testers based on their ages is going to keep declining until, at some point in the future, statistical probability based on an estimated head count overtakes the simplicity of eliminating the grandparent/grandchild category completely for two 20 year old testers and rolling the dice that the two 20 year old testers are 0 generations rather than 1 generation apart (because of course both are possible, but 0 generations of separation would be more likely if we go back to estimating by a % of the population method). A hybrid of the two methodologies might work well now and in the future.

    The one thing that troubled me about the analogy of counting my own relatives in those categories to get a percentage of likelihood is that it does not take into account the fact that the other tester calculating from their side might have completely different percentages of likelihood, which means counting known relatives is the wrong thing to be counting. The known relatives need to be excluded from the equation completely so that the focus is only on the unknown relationship vs those population factors you mention and of course the maximum path count (number of ways two people can be related in those categories). Anyway there are a lot of ways to go about the same problem. Thanks for explaining how most of the tools work!

    1. If you have a priori reasons to gauge which generation(s) the matches are in, that’s a great way of narrowing the field. That can be hard to tell, though. For example, my grandfather was one of 14 children, with a spread of 25 years between the youngest and oldest. The spread in the next generation down is even bigger. An alternate way of going about it is to use multiple DNA matches to start to hone in on a generation. There’s no one-size approach. We need to use all the tools in our toolbox!
