The Limits of Predicting Relationships Using DNA  

The one thing we genealogists probably want most from our autosomal DNA matches is something they can’t give us: an exact relationship prediction based on shared DNA alone. Unfortunately, with the exceptions of identical-twin, parent–child and full-sibling matches, that’s simply not possible.

Why not? One reason is that multiple different relationships can give the same patterns of shared DNA. For example, a woman who shares 1750 cM with you could be your grandmother, granddaughter, aunt, or half sister. Those relationships are indistinguishable based solely on the amount of shared DNA. (In this case, you can narrow the possibilities using age.) Someone sharing 950 cM with you could be a great-grandparent/grandchild, first cousin, great-uncle/aunt/nephew/niece, or half-uncle/aunt/nephew/niece.

The DNA Detectives Facebook team has designed a nifty chart that categorizes relationships into groups based on the expected amounts of shared DNA. In the two examples above, grandparent/grandchild, aunt/uncle, and half sibling would be Group B, and great-grandparent/grandchild, first cousin, great-uncle/aunt/nephew/niece, or half-uncle/aunt/nephew/niece would be Group C. I will use the DNA Detectives group names in the rest of this post for ease of reference.

Shared centimorgan ranges for different relationship groups. The original chart is available in the files of the DNA Detectives Facebook group.

 

To complicate matters, each group is defined not so much by an average or “expected” amount of shared DNA but by a range. That is, someone in Group B might share 1750 cM with you, but they could also share as little as 1300 cM or as much as 2300 cM, according to the DNA Detectives chart. Group C can range from 575 cM to 1330 cM.

Notice another problem? The low end of the Group B range overlaps the high end of the Group C range. Put another way, someone who shares 1315 cM with you could be in either group (and remember that each group includes multiple possible relationships). Worse, the more distantly related the group, the broader the range of shared centimorgans relative to the average and the more overlap there is with other groups. Someone who shares 1315 cM with you can only fall into Groups B or C, but someone who shares 100 cM could belong to Group E, F, or G, according to the DNA Detectives chart.
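To make the overlap concrete, here is a minimal Python sketch that lists every group whose range contains a given centimorgan value. It uses only the two ranges quoted above (Group B and Group C); the dictionary and function names are made up for illustration, and a fuller version would need the remaining ranges from the DNA Detectives chart itself.

```python
# Ranges quoted in this post from the DNA Detectives chart (in cM, inclusive).
# Only the two groups whose ranges are stated in the text are included.
GROUP_RANGES = {
    "B": (1300, 2300),
    "C": (575, 1330),
}


def candidate_groups(shared_cm, ranges=GROUP_RANGES):
    """Return every group whose range contains shared_cm."""
    return [group for group, (low, high) in ranges.items() if low <= shared_cm <= high]


print(candidate_groups(1750))  # ['B']
print(candidate_groups(1315))  # ['B', 'C']  (the overlap zone)
print(candidate_groups(950))   # ['C']
```

A value in an overlap zone simply comes back with more than one candidate group, which is exactly the ambiguity described above.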

When you have a match in an overlap zone, the best approach is to consider the most likely group first. AncestryDNA’s Matching White Paper (31 March 2016) presents an informative graph (their Figure 5.2) that shows the likelihood of each group (the x axis) given the amount of shared DNA (the y axis). Their graph is based on simulated data, rather than empirical (real) data, but as long as the model they used to do the simulations is reasonable, the data should be reliable.

Distributions of shared centimorgans for different relationship categories based on simulated data. This graph was taken from the AncestryDNA Matching White Paper published 31 March 2016 (their Figure 5.2).

 

Unfortunately, they used a logarithmic scale, which is a great space saver but is intuitive to precisely no one. They also misuse the word “meioses”, confusing people who aren’t familiar with the term as well as those who are. To make the information easier to understand, I edited the image labels to use the groups from the DNA Detectives chart. Here’s what the modified figure looks like.

Figure 5.2 from the AncestryDNA Matching White Paper edited to use the groups defined by the DNA Detectives chart. Note that the numbered ranges to the right of the graph mark regions where that group is the most likely one, not the full range for that group.  For example, between 200 cM and 340 cM, the most probable relationship is Group E, but the full range for that group is 65–600 cM (see below).

 

The figure gives you a visual sense of how broad the ranges are for each relationship group and how much overlap there is. It also shows us which centimorgan values represent only one possible group; those are the zones along the vertical y axis that only have one colored line crossing them. Between about 2400 cM and 3200 cM, the only line is the medium blue one for Group A, and between about 1550 cM and 2000 cM, the only line is the forest green one for Group B. There’s a short interval around 1000 cM that can only be Group C, but for all other centimorgan values, more than one group of relationships could apply.

Because of the log scale, the graph is hard to interpret if you’re interested in a specific centimorgan amount. To get around the problem, I approximated x and y values for each curve using an online plot digitizer. Geek power!
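For readers curious why a log scale is so hard to read by eye, the following Python sketch shows the kind of conversion a plot digitizer has to perform. It is only an illustration of the general idea, not the actual tool used for this post; the function name and the calibration positions are hypothetical.

```python
import math


def read_log_axis(pos, pos_a, value_a, pos_b, value_b):
    """Convert a position along a log-scaled axis into a data value.

    (pos_a, value_a) and (pos_b, value_b) are two calibration points,
    e.g. the pixel positions of two labeled tick marks.
    """
    fraction = (pos - pos_a) / (pos_b - pos_a)
    log_a = math.log10(value_a)
    log_b = math.log10(value_b)
    return 10 ** (log_a + fraction * (log_b - log_a))


# Hypothetical calibration: the 100 cM tick sits at position 0
# and the 1000 cM tick sits at position 200.
print(round(read_log_axis(100, 0, 100, 200, 1000)))  # 316
```

The point halfway between the 100 cM and 1000 cM ticks corresponds to about 316 cM, not 550 cM, which is why eyeballing a specific value on this kind of graph is unreliable.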

 

What does this tell us? It gives us an indication of which group of relationships is most likely to apply to a match who shares a specific amount of DNA. For example, a match sharing 750 cM with you is in an overlap zone, but they are far more likely to be in Group C (probability p = 0.85, or 85% chance) than in the overlapping Group D (p = 0.15, or 15% chance). Of course, the numbers don’t guarantee that the match is in Group C, but that’s where I’d start looking for the connection.

The probabilities can be more complicated. Consider a match who shares 110 cM with you. That person could belong to Group E (p = 0.08, 8% chance), Group F (p = 0.39, 39% chance), Group G (p = 0.30, 30% chance), Group H (p = 0.20, 20% chance), or Group I (p = 0.06, 6% chance). Again, the best approach would be to look for a shared ancestor in the most likely relationship range first, so Group F > Group G > Group H > Group E > Group I.
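The search order for the 110 cM example can be reproduced with a few lines of Python. This is just a sketch: the probabilities are the approximate values quoted above (as quoted, they add up to slightly more than 1, so the code only ranks them and does not treat them as exact), and the variable and function names are illustrative.

```python
# Approximate probabilities quoted above for a match sharing 110 cM.
probs_110 = {"E": 0.08, "F": 0.39, "G": 0.30, "H": 0.20, "I": 0.06}


def search_order(probs):
    """Return group names sorted from most to least likely."""
    return sorted(probs, key=probs.get, reverse=True)


print(" > ".join(f"Group {g}" for g in search_order(probs_110)))
# Group F > Group G > Group H > Group E > Group I
```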

You may also be familiar with the Shared cM Project by Blaine Bettinger. This project compiles self-reported data from the genetic genealogy community for different relationships. Thus, it gives us both the extremes (maximum and minimum values) as well as histograms (bar graphs showing how common given centimorgan values are for each relationship). The histograms are comparable to the colored lines on the AncestryDNA graph.

For comparison, I’ve aligned the ranges from the three datasets below. For the Shared cM Project, I’ve combined data for relationships that belong to the same group (e.g., first cousins once removed and second cousins both belong to Group E, so they were treated together).

 

The ranges given by the DNA Detectives are consistently narrower than those from the other two sources. That is mainly due to the fact that the DNA Detectives chart intentionally omits extreme outliers, which are especially challenging to deal with in the unknown parentage searches for which the chart was created. Their dataset is also the smallest, although it has the advantage that each datapoint has been carefully vetted by an expert. The Shared cM Project ranges are similar to those of AncestryDNA, but not exactly the same. Differences between the two could result from errors in the self-reported data of the former, the relative sizes of the datasets (the simulated dataset is almost certainly much larger than the empirical data), or assumptions made by AncestryDNA’s scientists in designing the simulations. Regardless of which source of information you prefer to use in your own genealogical work, keeping in mind the strengths and weaknesses of each dataset is wise.

Note: The probabilities and cM ranges discussed in this post assume little or no endogamy. Endogamy is the practice of members of a population marrying within the same group over multiple generations. If practiced for enough time, the present-day members of the population will all be related to one another in multiple different ways.

Acknowledgements: Thanks to Dr. Tracy Vogler for alerting me to the online plot digitizer. CeCe Moore and Christa Stalcup kindly agreed to let me reproduce the DNA Detectives chart here.

 

31 thoughts on “The Limits of Predicting Relationships Using DNA”

  1. Dear DNAgeek,

    Nice story! Perhaps it’s an idea to convert this information to an online tool or even a phone app? Let users fill in the largest cM, total cM etc etc and the tool gives a nice visual explanation what would be the most probable connection.

    Best

    EJ

  2. Excellent article. Is the table you constructed with the online digitizer available in a spreadsheet? I think I could use it to assist in an effort to help a distant DNA match to identify her birth parents. It would save me from having to key in your data from the graphic.

  3. In this area we should also be including the number of matching segments as a further determinant. For example, grandmother-matches and niece-matches should have the same expected percentage of DNA but the niece match is expected to have the matching DNA broken up into more segments.

    1. That’s an excellent point. The number (and size) of matching segments can help distinguish between grandparent and avuncular relationships, but not other relationships. Scientists from 23andMe published a paper in 2012 that includes simulation data showing the distinction. I’ve digitized that data as well, but it was too much to tackle in this blog post.

      It’s Figure 3A in this open-access paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034267

  4. Thank you for the table of probabilities. I am currently working on my own DNA search for a biologic parent and this will help guide me a bit more…This is actually one of the most understandable charts for the lay person who understands some basic stats that I have seen.

    1. I hope it helps. If you haven’t already, join the “DNA Detectives” group on Facebook for free advice and moral support in your search. Good luck!

  5. Nicely done. I do wonder why you feel that the AncestryDNA white paper authors “also misuse the word “meioses”, confusing people who aren’t familiar with the term as well as those who are.”

    While it is a bit startling to see the term “meiosis” or its plural “meioses” without any introduction or explanation, I do not see what constitutes misuse, or why someone who understands the process would find the use of the word confusing. Isn’t the number of meiotic divisions at the base of all theoretical tables or formulae showing expected shared DNA? With every transfer of DNA from parent to child, the child receives half of the parent’s DNA, divided through the process known as meiosis. Because of crossing over during that meiotic division (as well as upstream meiotic divisions), however, the child does not receive equal amounts of DNA of each line above the parents, and the accumulated “error” with each meiosis explains the increased range in expected or actual shared DNA.

    1. These are good questions. I thought about addressing them in the post, but the explanation would have distracted from the main points I wanted to make here. I used the word meiosis/meioses, because AncestryDNA’s Figure 5.2 uses it. I then switched to the DNA Detectives’ term “group”, because it is both more accurate and less intimidating to the non-biologist.

      As you know, meiosis is the process of forming the egg or sperm in the parent’s body. It results in the egg/sperm getting half of the parent’s DNA. When the mother’s egg fuses with the father’s sperm, the offspring is restored to a full complement of DNA (half from each parent).

      The relationship between a parent and a child involves a single meiosis. That between a grandparent and grandchild involves two meioses (one in the grandparent, one in the parent). Similarly, half siblings are separated by two meioses, one in the shared parent to produce the first child, and a second in that same parent to produce the second child. This is where AncestryDNA misuses the term. In their figure, they label the group that includes half siblings and grandparents/grandchildren (forest green in the figure, Group B per the DNA Detectives) as three meioses, not two.

      AncestryDNA labels full siblings as being separated by two meioses, but that’s not the right way to look at it. They *are* separated by two meioses, but they’re also related twice over: once through their mother and once through their father. Essentially, they are double half-sibs, which isn’t quite the same as two meioses. (I’m sure AncestryDNA made this decision to try to make the concept easier for the novice to understand rather than out of ignorance. Unfortunately, in doing so, they’ve used the term incorrectly.)

      Interestingly, although a full aunt/uncle is expected to share the same amount of DNA as a grandparent or half sibling, the aunt/uncle is separated by three meioses, not two. The reason they share in that closer range is because they’re a double relative (i.e., double half aunt/uncle). This fact is potentially useful for relationship predictions; although a full aunt/uncle is in the same group as a half sib or grandparent, the extra round of meiosis means that the shared segments will be smaller, on average, so in most cases, we should be able to distinguish an aunt/uncle from those other two possibilities. (See Figure 3A in this paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034267)
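      The arithmetic behind these statements can be sketched in Python. The sketch assumes the standard rule of thumb that each path through a shared ancestor contributes (1/2) raised to the number of meioses on that path, and that a "double" relationship is simply two such paths added together. The function name and the relationship table are illustrative, not taken from AncestryDNA or the DNA Detectives.

```python
from fractions import Fraction


def expected_share(meioses_per_path):
    """Expected fraction of shared autosomal DNA.

    meioses_per_path holds one entry per path through a shared ancestor;
    each path contributes (1/2) ** (number of meioses on that path).
    """
    return sum(Fraction(1, 2) ** m for m in meioses_per_path)


relationships = {
    "parent/child": [1],
    "grandparent/grandchild": [2],
    "half siblings": [2],
    "full siblings": [2, 2],    # "double half-sibs"
    "full aunt/uncle": [3, 3],  # "double half aunt/uncle"
}

for name, paths in relationships.items():
    print(f"{name}: {float(expected_share(paths)):.0%}")
# parent/child: 50%
# grandparent/grandchild: 25%
# half siblings: 25%
# full siblings: 50%
# full aunt/uncle: 25%
```

      The full aunt/uncle lands at the same expected 25% as a grandparent or half sibling, but only because two three-meiosis paths are added together, which is why the shared segments tend to be smaller.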

      This is obviously a topic worthy of its own post!

  6. Another thing to take into account is that in earlier times it was common for siblings from one family to marry siblings of another family, all descendants now likely sharing more DNA than typically expected, and impacting accuracy of estimations, depending on how far back this happened.

  7. There is a typo in the asterisk note in your final table comparing the three sources of data. I don’t know if this is correctable here. “* The DNA Detectives… and is not comparable to the other to sources.” In the final “to”, the “w” went missing. I realize this is not particularly pertinent to the blog, but it bugs me.

  8. I really enjoyed your blog. My mother and I recently tested on 23andMe and we were lucky enough to find 2 really close matches. My mother matched 15.1%, 1124 cM to a male, 32 segments, and no match on the X. She also matched to a female (the sister to the male above) at 15.9%, 1181 cM, 33 segments, and 3 of those segments are on the X. 23andMe lists both of them to my mother as 1st cousins. What gets me is that she is in the overlap of all of the charts that I looked at, or borderline from one to the other. I know you said that when there is an overlap, one group has a higher probability than the other, but it is still hard to feel 100% sure, which leaves doubt. Would you happen to have any suggestions as to what we should focus on? Where we should look? Please forgive any typos as it is hard to focus when your daughter is climbing on you while typing. 😉

    On a positive note, My mom and I have reached out to them, shared photos, and received a family tree of their known relatives. The comparisons between my mom and their family are scary due to how much they resemble each other. We just don’t know who in the family to focus on at this point.

    What do you think?

    1. For both of them, the most likely relationship is one in Group C, with a (much) lower chance of being in Group B. Group C includes first cousins, great-grandparent/grandchild, half aunt/uncle/niece/nephew, and great aunt/uncle/niece/nephew. You can probably rule some possibilities out based on their ages.

      1. Thanks. My mom is 20 yrs older than her predicted 1st cousin match. Her 1st cousin match’s father is the youngest of five and has two male uncles and two female aunts. I’m thinking that one of her two uncles is my mom’s dad. One of the uncles was 19 yrs old and the other was 17 in 1943 when my mom was born. They both registered in the military in 1942. The oldest uncle was sterilized due to the Eugenics program in California, never married, and doesn’t have any known children but could have had relations before joining the military in 1942. The younger uncle did marry and had two daughters (who are still alive); one is 70 and the other is in her 60s. My mom is 73, so I’m thinking he could have also had relations before he joined the military in 1942. Is it possible that the grandfather of my mom’s predicted 1st cousin could also be her dad due to the 1181 cM match? He and his wife were also in the Eugenics program. They let him out but his wife supposedly lived out the rest of her life in the state hospital.

  9. Nice work! This is very interesting.

    I’d like to know if there’s more info anywhere on probabilities for the more distant relations. It seems 23andme considers any match in the range of (about) 15 to 42 cM (0.20% to 0.57%) on a single segment as a predicted 4th cousin with a “range” of “3rd to 6th” or “3rd to Distant”. Is there any info on what this really means probabilistically?

    I’d love to see a table like the above that goes down several more rows to 10 or 15 cM and has more columns to show 5th and 6th cousins.

    Is there a “standard” definition of “distant” cousin? It seems that 23andMe uses it to mean beyond 6th cousin, while the table here seems to mean beyond 4th cousin. Obviously the more distant the relation the more different possibilities there are, but if someone’s got a 25 cM overlap, they can’t be only, say, 10th cousins, can they? I can see why 23andMe doesn’t show more beyond a certain point, but I want to see all the geeky details!

    1. The challenge with distant cousins is that we’re not likely to share DNA with them at all. For 4th cousins, estimates of the chance that they won’t match range from 25% to 50%. For 5th cousins, it’s 70–85%, and for 6th it’s 90% or more. Of course, we typically have enough 4th and 5th and 6th cousins that we’ll find a few who match, but by that point they’re statistically likely to share only one segment with us. Basically, there’s no way to distinguish a more-likely 4th cousin from a rare 6th cousin or even rarer 8th cousin based on a single segment.

      Here are some sources for those estimates:
      https://isogg.org/wiki/Cousin_statistics
      https://gcbias.org/2013/12/02/how-many-genomic-blocks-do-you-share-with-a-cousin/

      Another complication that doesn’t get addressed much is that a match who shares one 25 cM segment is likely to be closer than one who shares three 8.3 cM segments, even though the total cM shared is the same. In the latter case, I’d suspect that there are, in fact, multiple connections between the two DNA testers, possibly quite distant.

  10. Can you extrapolate on what the relationships may look like for endogamous populations and for those who then marry outside the endogamous population? Does the count of shared cM revert to the standard population or will endogamy play a part for many generations?

    1. That’s an excellent question. And, of course, there’s no easy answer, for a few reasons. First, even in non-endogamous populations, there is a range of shared DNA that’s normal for any given relationship (other than parent–child, which is always 50%). We expect the *average* to be higher in an endogamous group, but any two people could still be on the low end of that range and therefore not look like the endogamy affected their shared amount of DNA. Second, different populations have different amounts of endogamy. Cajuns and Polynesians and Ashkenazim and Puerto Ricans are all endogamous, but we wouldn’t necessarily expect them all to have the same outcome, because they have different overall population sizes and have been endogamous for different lengths of time. Third, the expected amount of shared DNA is affected by how many relationships there are, and most of our matches won’t have 100% complete trees past about 2nd-great grandparents. As a result, there will be connections that can’t be accounted for.

      Ultimately, we’ll need a combination of simulated data and crowdsourced information from well-studied populations to tease these issues apart.

      As for those who marry outside the population, the effects of endogamy do taper off, but you can still find yourself matched to very distant cousins.

      Like I said, no easy answers. I wish there were.
