The Limits of Predicting Relationships Using DNA  

The one thing we genealogists probably want most from our autosomal DNA matches is something they can’t give us: an exact relationship prediction based on shared DNA alone. Unfortunately, with the exceptions of identical-twin, parent–child and full-sibling matches, that’s simply not possible.

Why not? One reason is that multiple different relationships can give the same patterns of shared DNA. For example, a woman who shares 1750 cM with you could be your grandmother, granddaughter, aunt, or half sister. Those relationships are indistinguishable based solely on the amount of shared DNA. (In this case, you can narrow the possibilities using age.) Someone sharing 950 cM with you could be a great-grandparent/grandchild, first cousin, great-uncle/aunt/nephew/niece, or half-uncle/aunt/nephew/niece.

The DNA Detectives Facebook team has designed a nifty chart that categorizes relationships into groups based on the expected amounts of shared DNA. In the two examples above, grandparent/grandchild, aunt/uncle, and half sibling would be Group B, and great-grandparent/grandchild, first cousin, great-uncle/aunt/nephew/niece, or half-uncle/aunt/nephew/niece would be Group C. I will use the DNA Detectives group names in the rest of this post for ease of reference.

Shared centimorgan ranges for different relationship groups. The original chart is available in the files of the DNA Detectives Facebook group.

 

To complicate matters, each group is defined not so much by an average or “expected” amount of shared DNA but by a range. That is, someone in Group B might share 1750 cM with you, but they could also share as little as 1300 cM or as much as 2300 cM, according to the DNA Detectives chart. Group C can range from 575 cM to 1330 cM.

Notice another problem? The low end of the Group B range overlaps the high end of the Group C range. Put another way, someone who shares 1315 cM with you could be in either group (and remember that each group includes multiple possible relationships). Worse, the more distantly related the group, the broader the range of shared centimorgans relative to the average and the more overlap there is with other groups. Someone who shares 1315 cM with you can only fall into Groups B or C, but someone who shares 100 cM could belong to Group E, F, or G, according to the DNA Detectives chart.
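To make the overlap concrete, here is a minimal sketch that checks which groups are possible for a given amount of shared DNA. It uses only the two ranges quoted above (Groups B and C); the name `GROUP_RANGES` and the function `candidate_groups` are made up for illustration, and the other rows would have to be filled in from the DNA Detectives chart itself.

```python
# Ranges in cM as quoted in this post from the DNA Detectives chart.
# Only Groups B and C are given in the text; the other groups are omitted here.
GROUP_RANGES = {
    "B": (1300, 2300),
    "C": (575, 1330),
}


def candidate_groups(shared_cm):
    """Return every group whose range contains the shared cM value."""
    return [
        group
        for group, (low, high) in GROUP_RANGES.items()
        if low <= shared_cm <= high
    ]


print(candidate_groups(1750))  # only Group B
print(candidate_groups(1315))  # overlap zone: Groups B and C
print(candidate_groups(950))   # only Group C
```

A match that comes back with more than one group is in an overlap zone, and shared DNA alone can't settle which group applies.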

When you have a match in an overlap zone, the best approach is to consider the most likely group first. AncestryDNA’s Matching White Paper (31 March 2016) presents an informative graph (their Figure 5.2) that shows the likelihood of each group (the x axis) given the amount of shared DNA (the y axis). Their graph is based on simulated data, rather than empirical (real) data, but as long as the model they used to do the simulations is reasonable, the data should be reliable.

Distributions of shared centimorgans for different relationship categories based on simulated data. This graph was taken from the AncestryDNA Matching White Paper published 31 March 2016 (their Figure 5.2).

 

Unfortunately, they used a logarithmic scale, which is a great space saver but is intuitive to precisely no one. They also misuse the word “meioses”, confusing people who aren’t familiar with the term as well as those who are. To make the information easier to understand, I edited the image labels to use the groups from the DNA Detectives chart. Here’s what the modified figure looks like.

Figure 5.2 from the AncestryDNA Matching White Paper edited to use the groups defined by the DNA Detectives chart. Note that the numbered ranges to the right of the graph mark regions where that group is the most likely one, not the full range for that group.  For example, between 200 cM and 340 cM, the most probable relationship is Group E, but the full range for that group is 65–600 cM (see below).

 

The figure gives you a visual sense of how broad the ranges are for each relationship group and how much overlap there is. It also shows us which centimorgan values represent only one possible group; those are the zones along the vertical y axis that only have one colored line crossing them. Between about 2400 cM and 3200 cM, the only line is the medium blue one for Group A, and between about 1550 cM and 2000 cM, the only line is the forest green one for Group B. There’s a short interval around 1000 cM that can only be Group C, but for all other centimorgan values, more than one group of relationships could apply.

Because of the log scale, the graph is hard to interpret if you’re interested in a specific centimorgan amount. To get around the problem, I approximated x and y values for each curve using an online plot digitizer. Geek power!
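For the curious, the core arithmetic behind reading values off a log-scaled axis looks roughly like the sketch below. This is not the digitizer's actual code; the function name `pixel_to_value` and the calibration numbers are hypothetical. The idea is that equal pixel distances on a log axis correspond to equal steps in the logarithm of the value, so you interpolate in log space between two known tick marks.

```python
import math


def pixel_to_value(pixel, pixel_a, value_a, pixel_b, value_b):
    """Convert a pixel position on a log-scaled axis to a data value.

    (pixel_a, value_a) and (pixel_b, value_b) are two calibration points,
    for example two labeled tick marks on the axis.
    """
    fraction = (pixel - pixel_a) / (pixel_b - pixel_a)
    log_a = math.log10(value_a)
    log_b = math.log10(value_b)
    return 10 ** (log_a + fraction * (log_b - log_a))


# Hypothetical calibration: pixel 500 is the 10 cM tick, pixel 100 is the 1000 cM tick.
# A point halfway between them in pixels is 100 cM, not 505 cM.
print(pixel_to_value(300, 500, 10, 100, 1000))
```

That halfway point is the reason log scales are hard to read by eye: the visual midpoint between 10 and 1000 is 100.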

 

What does this tell us? It gives us an indication of which group of relationships is most likely to apply to a match who shares a specific amount of DNA. For example, a match sharing 750 cM with you is in an overlap zone, but they are far more likely to be in Group C (probability p = 0.85, or 85% chance) than in the overlapping Group D (p = 0.15, or 15% chance). Of course, the numbers don’t guarantee that the match is in Group C, but that’s where I’d start looking for the connection.

The probabilities can be more complicated. Consider a match who shares 110 cM with you. That person could belong to Group E (p = 0.08, 8% chance), Group F (p = 0.39, 39% chance), Group G (p = 0.30, 30% chance), Group H (p = 0.20, 20% chance), or Group I (p = 0.06, 6% chance). Again, the best approach would be to look for a shared ancestor in the most likely relationship range first, so Group F > Group G > Group H > Group E > Group I.
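The "most likely group first" approach is just a sort. Here is a small sketch using the two sets of probabilities quoted above; the names `PROBABILITIES` and `search_order` are made up for illustration, and any other cM value would need its own row from the probability table.

```python
# Probabilities quoted in this post for two example matches, keyed by shared cM.
PROBABILITIES = {
    750: {"C": 0.85, "D": 0.15},
    110: {"E": 0.08, "F": 0.39, "G": 0.30, "H": 0.20, "I": 0.06},
}


def search_order(shared_cm):
    """Return the groups for a shared cM value, most probable first."""
    group_probs = PROBABILITIES[shared_cm]
    return sorted(group_probs, key=group_probs.get, reverse=True)


print(search_order(750))  # ['C', 'D']
print(search_order(110))  # ['F', 'G', 'H', 'E', 'I']
```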

You may also be familiar with the Shared cM Project by Blaine Bettinger. This project compiles self-reported data from the genetic genealogy community for different relationships. Thus, it gives us both the extremes (maximum and minimum values) as well as histograms (bar graphs showing how common given centimorgan values are for each relationship). The histograms are comparable to the colored lines on the AncestryDNA graph.

For comparison, I’ve aligned the ranges from the three datasets below. For the Shared cM Project, I’ve combined data for relationships that belong to the same group (e.g., first cousins once removed and second cousins both belong to Group E, so they were treated together).

 

The ranges given by the DNA Detectives are consistently narrower than those from the other two sources. That is mainly due to the fact that the DNA Detectives chart intentionally omits extreme outliers, which are especially challenging to deal with in the unknown parentage searches for which the chart was created. Their dataset is also the smallest, although it has the advantage that each datapoint has been carefully vetted by an expert. The Shared cM Project ranges are similar to those of AncestryDNA, but not exactly the same. Differences between the two could result from errors in the self-reported data of the former, the relative sizes of the datasets (the simulated dataset is almost certainly much larger than the empirical data), or assumptions made by AncestryDNA’s scientists in designing the simulations. Regardless of which source of information you prefer to use in your own genealogical work, keeping in mind the strengths and weaknesses of each dataset is wise.

Note: The probabilities and cM ranges discussed in this post assume little or no endogamy. Endogamy is the practice of members of a population marrying within the same group over multiple generations. If it is practiced for long enough, the present-day members of the population will all be related to one another in multiple different ways.

Acknowledgements: Thanks to Dr. Tracy Vogler for alerting me to the online plot digitizer. CeCe Moore and Christa Stalcup kindly agreed to let me reproduce the DNA Detectives chart here.

 

42 thoughts on “The Limits of Predicting Relationships Using DNA  ”

  1. Dear DNAgeek,

    Nice story! Perhaps it’s an idea to convert this information to an online tool or even a phone app? Let users fill in the largest cM, total cM etc etc and the tool gives a nice visual explanation what would be the most probable connection.

    Best

    EJ

  2. Excellent article. Is the table you constructed with the online digitizer available in a spreadsheet? I think I could use it to assist in an effort to help a distant DNA match to identify her birth parents. It would save me from having to key in your data from the graphic.

  3. In this area we should also be including the number of matching segments as a further determinant. For example, grandmother-matches and niece-matches should have the same expected percentage of DNA but the niece match is expected to have the matching DNA broken up into more segments.

    1. That’s an excellent point. The number (and size) of matching segments can help distinguish between grandparent and avuncular relationships, but not other relationships. Scientists from 23andMe published a paper in 2012 that includes simulation data showing the distinction. I’ve digitized that data as well, but it was too much to tackle in this blog post.

      It’s Figure 3A in this open-access paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034267

  4. Thank you for the table of probabilities. I am currently working on my own DNA search for a biologic parent and this will help guide me a bit more…This is actually one of the most understandable charts for the lay person who understands some basic stats that I have seen.

    1. I hope it helps. If you haven’t already, join the “DNA Detectives” group on Facebook for free advice and moral support in your search. Good luck!

  5. Nicely done. I do wonder why you feel that the AncestryDNA white paper authors “also misuse the word “meioses”, confusing people who aren’t familiar with the term as well as those who are.”

    While it is a bit startling to see the term “meiosis” or its plural “meioses” without any introduction or explanation, I do not see what constitutes misuse, or why someone who understands the process would find the use of the word confusing. Isn’t the number of meiotic divisions at the base of all theoretical tables or formulae showing expected shared DNA? With every transfer of DNA from parent to child, the child receives half of the parent’s DNA, divided through the process known as meiosis. Because of crossing over during that meiotic division (as well as upstream meiotic divisions), however, the child does not receive equal amounts of DNA of each line above the parents, and the accumulated “error” with each meiosis explains the increased range in expected or actual shared DNA.

    1. These are good questions. I thought about addressing them in the post, but the explanation would have distracted from the main points I wanted to make here. I used the word meiosis/meioses, because AncestryDNA’s Figure 5.2 uses it. I then switched to the DNA Detectives’ term “group”, because it is both more accurate and less intimidating to the non-biologist.

      As you know, meiosis is the process of forming the egg or sperm in the parent’s body. It results in the egg/sperm getting half of the parent’s DNA. When the mother’s egg fuses with the father’s sperm, the offspring is restored to a full complement of DNA (half from each parent).

      The relationship between a parent and a child involves a single meiosis. That between a grandparent and grandchild involves two meioses (one in the grandparent, one in the parent). Similarly, half siblings are separated by two meioses, one in the shared parent to produce the first child, and a second in that same parent to produce the second child. This is where AncestryDNA misuses the term. In their figure, they label the group that includes half siblings and grandparents/grandchildren (forest green in the figure, Group B per the DNA Detectives) as three meioses, not two.

      AncestryDNA labels full siblings as being separated by two meioses, but that’s not the right way to look at it. They *are* separated by two meioses, but they’re also related twice over: once through their mother and once through their father. Essentially, they are double half-sibs, which isn’t quite the same as two meioses. (I’m sure AncestryDNA made this decision to try to make the concept easier for the novice to understand rather than out of ignorance. Unfortunately, in doing so, they’ve used the term incorrectly.)

      Interestingly, although a full aunt/uncle is expected to share the same amount of DNA as a grandparent or half sibling, the aunt/uncle is separated by three meioses, not two. The reason they share in that closer range is that they’re a double relative (i.e., double half aunt/uncle). This fact is potentially useful for relationship predictions; although a full aunt/uncle is in the same group as a half sib or grandparent, the extra round of meiosis means that the shared segments will be smaller, on average, so in most cases, we should be able to distinguish an aunt/uncle from those other two possibilities. (See Figure 3A in this paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034267)

      This is obviously a topic worthy of its own post!
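
      The meiosis counts in this reply can be turned into expected sharing with a small calculation. This is a sketch of the standard averages only, not of anything AncestryDNA does: each meiosis halves the expected shared DNA along one line of descent, and a "double" relative has two such lines. The function name `expected_shared` and the table are made up for illustration.

```python
from fractions import Fraction


def expected_shared(meioses, paths=1):
    """Expected fraction of shared autosomal DNA.

    meioses: number of meioses separating the two people along one line
    paths:   number of independent lines connecting them (2 for "double" relatives)
    """
    return paths * Fraction(1, 2) ** meioses


# (meioses, paths) for the relationships discussed in this reply
RELATIONSHIPS = {
    "parent/child": (1, 1),
    "grandparent/grandchild": (2, 1),
    "half sibling": (2, 1),
    "full sibling": (2, 2),     # double half-sibs
    "full aunt/uncle": (3, 2),  # double half aunt/uncle
}

for name, (meioses, paths) in RELATIONSHIPS.items():
    print(f"{name}: {float(expected_shared(meioses, paths)):.0%}")
```

      Grandparents, half siblings, and full aunts/uncles all come out at 25% on average, which is why they land in the same group even though the aunt/uncle involves three meioses rather than two.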

  6. Another thing to take into account is that in earlier times it was common for siblings from one family to marry siblings of another family, all descendants now likely sharing more DNA than typically expected, and impacting accuracy of estimations, depending on how far back this happened.

  7. There is a typo in the asterisk note in your final table comparing the three sources of data. I don’t know if this correctable here. “* The DNA Detectives… and is not comparable to the other to sources.” In the final “to”, the “w” went missing. I realize this is not particularly pertinent to the Blog, but it bugs me.

  8. I really enjoyed your blog. My mother and I recently tested on 23andMe and we were lucky enough to find 2 really close matches. My mother matched a male at 15.1%, 1124 cM, 32 segments, with no match on the X. She also matched a female (the sister of the male above) at 15.9%, 1181 cM, 33 segments, 3 of which are on the X. 23andMe lists both of them as 1st cousins to my mother. What gets me is that she is in the overlap of all of the charts that I looked at, or borderline from one to the other. I know you said that when there is an overlap, one group has a higher probability than the other, but it is still hard to feel 100% sure, and that leaves doubt. Would you happen to have any suggestions as to what we should focus on? Where should we look? Please forgive any typos as it is hard to focus when your daughter is climbing on you while typing. 😉

    On a positive note, My mom and I have reached out to them, shared photos, and received a family tree of their known relatives. The comparisons between my mom and their family are scary due to how much they resemble each other. We just don’t know who in the family to focus on at this point.

    What do you think?

    1. For both of them, the most likely relationship is one in Group C, with a (much) lower chance of being in Group B. Group C includes first cousins, great-grandparent/grandchild, half aunt/uncle/niece/nephew, and great-aunt/uncle/niece/nephew. You can probably rule some possibilities out based on their ages.

      1. Thanks. My mom is 20 yrs older than her predicted 1st cousin match. Her 1st cousin match’s father is the youngest of five, so the match has two uncles and two aunts. I’m thinking that one of her two uncles is my mom’s dad. One of the uncles was 19 yrs old and the other was 17 in 1943 when my mom was born. They both registered in the military in 1942. The oldest uncle was sterilized due to the Eugenics program in California, never married, and doesn’t have any known children but could have had relations before joining the military in 1942. The younger uncle did marry and had two daughters (who are still alive); one is 70 and the other is in her 60’s. My mom is 73, so I’m thinking he could also have had relations before he joined the military in 1942. Is it possible that the grandfather of my mom’s predicted 1st cousin could also be her dad, given the 1181 cM match? He and his wife were also in the Eugenics program. They let him out but his wife supposedly lived out the rest of her life in the state hospital.

  9. Nice work! This is very interesting.

    I’d like to know if there’s more info anywhere on probabilities for the more distant relations. It seems 23andMe considers any match in the range of (about) 15 to 42 cM (0.20% to 0.57%) on a single segment as a predicted 4th cousin with a “range” of “3rd to 6th” or “3rd to Distant”. Is there any info on what this really means probabilistically?

    I’d love to see a table like the above that goes down several more rows to 10 or 15 cM and has more columns to show 5th and 6th cousins.

    Is there a “standard” definition of “distant” cousin? It seems that 23andMe uses it to mean beyond 6th cousin, while the table here seems to mean beyond 4th cousin. Obviously the more distant the relation the more different possibilities there are, but if someone’s got a 25 cM overlap, they can’t be only, say, 10th cousins, can they? I can see why 23andMe doesn’t show more beyond a certain point, but I want to see all the geeky details!

    1. The challenge with distant cousins is that we’re not likely to share DNA with them at all. For 4th cousins, estimates range from a 25% to a 50% chance that they won’t match. For 5th cousins, it’s 70–85%, and for 6th it’s 90% or more. Of course, we typically have enough 4th and 5th and 6th cousins that we’ll find a few who match, but by that point they’re statistically likely to share only one segment with us. Basically, there’s no way to distinguish a more-likely 4th cousin from a rare 6th cousin or even rarer 8th cousin based on a single segment.

      Here are some sources for those estimates:
      https://isogg.org/wiki/Cousin_statistics
      https://gcbias.org/2013/12/02/how-many-genomic-blocks-do-you-share-with-a-cousin/

      Another complication that doesn’t get addressed much is that a match who shares one 25 cM segment is likely to be closer than one who shares three 8.3 cM segments, even though the total cM shared is the same. In the latter case, I’d suspect that there are, in fact, multiple connections between the two DNA testers, possibly quite distant.

      1. I can see your point about the complication of matching on multiple segments. It’s not obvious to me whether having multiple segments totaling a certain share amount would be closer or more distant than a single segment share of the same amount, but it seems that 23andMe does interpret this issue oppositely from your assertion.

        If I download my DNA relatives from their site and sort by % of shared DNA, I can see several cases where they predict a closer relationship when there are multiple, short segments shared vs one longer one. For example, I have a predicted 3rd cousin there with whom I share 3 segments which total 37.5 cM. There is another person there with whom I share a single segment of 38.2 cM, and they predict 4th cousin.

        Almost everyone out of 1000+ DNA relatives I have there are predicted 4th cousins, but they also provide a “range” where they say “3rd to 5th”, “3rd to 6th”, or “3rd to Distant”. I can see many cases of predicted 4th cousins of mine there where the range moves up by one (closer) when there are multiple segments shared for a given total shared DNA amount. For example, a 27.2 cM share on 1 segment is predicted as (4th cousin with a range of) 3rd to 6th, but a total 27.2 cM share on 2 segments is a (4th cousin with a range of) 3rd to 5th.

        I tried to figure out what their ranges are for mapping shared DNA % (or cM) to particular predictions, and I couldn’t find any exact thresholds, even when I separate the groups by the number of shared segments. Among my predicted 4th cousins with only a single segment shared, there are several with a predicted range of “3rd to Distant” which have a higher cM amount than many others which are predicted as “3rd to 6th”. I wonder if there is some error in their algorithms here or if there really is some legitimate reason for this. I have not found any explanations on their site.

  10. Can you extrapolate on what the relationships may look like for endogamous populations and for those who then marry outside the endogamous population? Does the count of shared cM revert to that of the standard population, or will endogamy play a part for many generations?

    1. That’s an excellent question. And, of course, there’s no easy answer, for a few reasons. First, even in non-endogamous populations, there is a range of shared DNA that’s normal for any given relationship (other than parent–child, which is always 50%). We expect the *average* to be higher in an endogamous group, but any two people could still be on the low end of that range and therefore not look like the endogamy affected their shared amount of DNA. Second, different populations have different amounts of endogamy. Cajuns and Polynesians and Ashkenazim and Puerto Ricans are all endogamous, but we wouldn’t necessarily expect them all to have the same outcome, because they have different overall population sizes and have been endogamous for different lengths of time. Third, the expected amount of shared DNA is affected by how many relationships there are, and most of our matches won’t have 100% complete trees past about 2nd-great grandparents. As a result, there will be connections that can’t be accounted for.

      Ultimately, we’ll need a combination of simulated data and crowdsourced information from well-studied populations to tease these issues apart.

      As for those who marry outside the population, the effects of endogamy do taper off, but you can still find yourself matched to very distant cousins.

      Like I said, no easy answers. I wish there were.

  11. Great article; but one should speak of DNA in terms of quantity not “amount”.

    1. I think it’s perfectly acceptable to speak of the “amount” of shared DNA. I suspect you’re suggesting that the word “quantity” is more accurate because many aspects of DNA are discrete and countable. While it is true that base pairs, SNPs, STRs, and physical distance on a chromosome are all countable, shared DNA is measured in centimorgans, which is a calculation based on many factors. Shared DNA in cM is not a discrete quantity.

  12. Mea culpa! On further reflection, I think either amount or quantity could be correct, depending on the context.

    1. If you are on Facebook, a great group to share images and get feedback is “Genetic Genealogy Tips & Techniques”. If you’re not on Facebook, you can email me at theDNAgeek (at) gmail.com.

  13. Hello, I have some very specific needs, and maybe you could help me with choosing
    the correct kit, and what possibilities / expectations to have.
    (Maternal mitochondrial line is not so important)
    Father’s Line, (main interests)
    Pennsylvania, most likely before 1700 (traced to 1771 on paper) apparently cousins
    to politically active Rush. Wish to locate Rush relationships from this Pennsylvania
    era, to confirm where I fit in. (6 and 7 generations back).
    ALSO looking to find how this family fits in with the Rush’s in the Tudor Era England.
    I have done fairly extensive research in this area, but there are some holes, and the
    information on Ancestry (from members) is horribly corrupted.
    (side note, Benjamin Rush used a Coat of Arms that tied him to one specific family
    branch, but seemed unaware in his writings of this tie-in.)

    Mother’s side, I have one branch solidly back to 1770, which is most interesting.
    Apparently a heavily intermarried cluster in Lancashire where I lose the paper trail.
    The families are documented to the 1200’s, but I cannot make the exact fit. Is there
    anything that can give hints that far back (roughly 1600’s mother side), does the
    extensive intermarriage help or hinder???

    ALSO, on the Father’s side, Father’s Grandmother, born 1879, Bohemia, has four names
    on Catholic Baptism papers that tie into Ashkenazi Jewish / Catholic Conversion /
    (Frankism) I would like to get some idea if her bloodlines were Jewish, as it appears.

    For any help directing me to which testing can help with these specific questions,
    I am most Thankful. If you guide me to one of the kits on your website, I will purchase
    it here.

    Thank You.
    R. Rush

    1. To investigate your Rush surname line, take the Y-37 DNA test at Family Tree DNA. There’s no guarantee you’ll find matches, of course, but if you do, they can be quite valuable. The test is currently on sale for $129, and you can save an additional $10 with a coupon code.

      I don’t sell DNA tests myself, but if you use this referral link to make a purchase, I will get a small commission. The cost is the same for you:
      https://affiliate.familytreedna.com/idevaffiliate.php?id=1830
      Use this coupon code: R23SGIZZUZY5. It expires today (11/19/17), so if it doesn’t work for you, let me know and I’ll get you an updated one tomorrow.

      If you’re interested in your mother’s maiden surname line, you could ask someone from her family to do the Y-37 test (i.e., her brother, her brother’s son, or maybe a cousin with her maiden surname). I can get you another coupon code if you go that route.

      The Y-DNA tests can be upgraded later if you have too many matches and want to refine the results.

      Y-DNA will only track the direct paternal lineages (the ones usually associated with surname). It can’t help you with your father’s mother’s line, for example. For that, you need what’s called autosomal DNA. With autosomal DNA, the further back in generations you go, the harder it is to find evidence for your ancestors. For that reason, your best bet is to test members of the oldest generation in your family still living. Testing your two parents, if you can, would be better than testing yourself. Testing your four grandparents (if possible) would be better than testing your two parents.

      For autosomal DNA, I recommend starting with the AncestryDNA test. You can transfer those results to other sites, usually for free, to get more bang for your buck. Right now, it’s $79 for the first test, $69 for subsequent ones. They should go on sale for even less over Black Friday weekend. You can buy them through this link (again, I will get a commission, no extra cost for you): http://www.tkqlhce.com/mt80hz74z6MVQQSRPVMONRSTVVS

      1. OK Thank You,
        I got the answers that I had expected. Ancestry the
        best with autosomal, Family Tree for Y.
        I think that with the combination of generations, plus
        distant past intermarriage, much of the autosomal
        will turn into a fog, but in a way, that alone is a positive
        answer. It seems like it might take some time just to
        sort out the data and match it up to info in archives.
        I’ll try to get back by email in the future if this comes out
        interesting.
        And Sadly, I am sixty, and there is no one older to go to,
        except my first cousins.
