The Limits of Predicting Relationships Using DNA  

The one thing we genealogists probably want most from our autosomal DNA matches is something they can’t give us: an exact relationship prediction based on shared DNA alone. Unfortunately, with the exceptions of identical-twin, parent–child and full-sibling matches, that’s simply not possible.

Why not? One reason is that multiple different relationships can give the same patterns of shared DNA. For example, a woman who shares 1750 cM with you could be your grandmother, granddaughter, aunt, or half sister. Those relationships are indistinguishable based solely on the amount of shared DNA. (In this case, you can narrow the possibilities using age.) Someone sharing 950 cM with you could be a great-grandparent/grandchild, first cousin, great-uncle/aunt/nephew/niece, or half-uncle/aunt/nephew/niece.

The DNA Detectives Facebook team has designed a nifty chart that categorizes relationships into groups based on the expected amounts of shared DNA. In the two examples above, grandparent/grandchild, aunt/uncle, and half sibling would be Group B, and great-grandparent/grandchild, first cousin, great-uncle/aunt/nephew/niece, and half-uncle/aunt/nephew/niece would be Group C. I will use the DNA Detectives group names in the rest of this post for ease of reference.

Shared centimorgan ranges for different relationship groups. The original chart is available in the files of the DNA Detectives Facebook group.

 

To complicate matters, each group is defined not so much by an average or “expected” amount of shared DNA but by a range. That is, someone in Group B might share 1750 cM with you, but they could also share as little as 1300 cM or as much as 2300 cM, according to the DNA Detectives chart. Group C can range from 575 cM to 1330 cM.

Notice another problem? The low end of the Group B range overlaps the high end of the Group C range. Put another way, someone who shares 1315 cM with you could be in either group (and remember that each group includes multiple possible relationships). Worse, the more distantly related the group, the broader the range of shared centimorgans relative to the average and the more overlap there is with other groups. Someone who shares 1315 cM with you can only fall into Groups B or C, but someone who shares 100 cM could belong to Group E, F, or G, according to the DNA Detectives chart.
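To make the overlap concrete, here is a minimal Python sketch that lists every group whose range contains a given amount of shared DNA. It uses only the two DNA Detectives ranges quoted above (Group B: 1300 to 2300 cM; Group C: 575 to 1330 cM). The function name `possible_groups` is made up for this example, and the remaining groups would need to be filled in from the chart itself.

```python
# Ranges (in cM) quoted in this post from the DNA Detectives chart.
# Only Groups B and C are included; add the others from the chart itself.
GROUP_RANGES_CM = {
    "B": (1300, 2300),
    "C": (575, 1330),
}


def possible_groups(shared_cm):
    """Return every group whose range contains shared_cm."""
    return [
        group
        for group, (low, high) in GROUP_RANGES_CM.items()
        if low <= shared_cm <= high
    ]


for cm in (1750, 1315, 950):
    print(cm, possible_groups(cm))
```

A match at 1315 cM comes back with both groups, which is exactly the overlap zone described above.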

When you have a match in an overlap zone, the best approach is to consider the most likely group first. AncestryDNA’s Matching White Paper (31 March 2016) presents an informative graph (their Figure 5.2) that shows the likelihood of each group (the x axis) given the amount of shared DNA (the y axis). Their graph is based on simulated data, rather than empirical (real) data, but as long as the model they used to do the simulations is reasonable, the data should be reliable.

Distributions of shared centimorgans for different relationship categories based on simulated data. This graph was taken from the AncestryDNA Matching White Paper published 31 March 2016 (their Figure 5.2).

 

Unfortunately, they used a logarithmic scale, which is a great space saver but is intuitive to precisely no one. They also misuse the word “meioses”, confusing people who aren’t familiar with the term as well as those who are. To make the information easier to understand, I edited the image labels to use the groups from the DNA Detectives chart. Here’s what the modified figure looks like.

Figure 5.2 from the AncestryDNA Matching White Paper edited to use the groups defined by the DNA Detectives chart. Note that the numbered ranges to the right of the graph mark regions where that group is the most likely one, not the full range for that group.  For example, between 200 cM and 340 cM, the most probable relationship is Group E, but the full range for that group is 65–600 cM (see below).

 

The figure gives you a visual sense of how broad the ranges are for each relationship group and how much overlap there is. It also shows us which centimorgan values represent only one possible group; those are the zones along the vertical y axis that only have one colored line crossing them. Between about 2400 cM and 3200 cM, the only line is the medium blue one for Group A, and between about 1550 cM and 2000 cM, the only line is the forest green one for Group B. There’s a short interval around 1000 cM that can only be Group C, but for all other centimorgan values, more than one group of relationships could apply.
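As a rough sketch, the single-group zones read off the figure can be written as a simple lookup in Python. The boundaries are the approximate values given above, and the short Group C interval around 1000 cM is left out because no exact limits are stated for it. The function name `only_possible_group` is hypothetical.

```python
# Approximate zones (in cM) where only one curve crosses the graph,
# as read off the edited AncestryDNA figure.
SINGLE_GROUP_ZONES_CM = [
    ("A", 2400, 3200),
    ("B", 1550, 2000),
]


def only_possible_group(shared_cm):
    """Return the group if shared_cm lies in a single-group zone, else None."""
    for group, low, high in SINGLE_GROUP_ZONES_CM:
        if low <= shared_cm <= high:
            return group
    return None
```

Any value that returns `None` here should be treated as having more than one candidate group.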

Because of the log scale, the graph is hard to interpret if you’re interested in a specific centimorgan amount. To get around the problem, I approximated x and y values for each curve using an online plot digitizer. Geek power!

 

What does this tell us? It gives us an indication of which group of relationships is most likely to apply to a match who shares a specific amount of DNA. For example, a match sharing 750 cM with you is in an overlap zone, but they are far more likely to be in Group C (probability p = 0.85, or 85% chance) than in the overlapping Group D (p = 0.15, or 15% chance). Of course, the numbers don’t guarantee that the match is in Group C, but that’s where I’d start looking for the connection.

The probabilities can be more complicated. Consider a match who shares 110 cM with you. That person could belong to Group E (p = 0.08, 8% chance), Group F (p = 0.39, 39% chance), Group G (p = 0.30, 30% chance), Group H (p = 0.20, 20% chance), or Group I (p = 0.06, 6% chance). Again, the best approach would be to look for a shared ancestor in the most likely relationship range first, so Group F > Group G > Group H > Group E > Group I.
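The search order above is just the groups sorted by probability. Here is a minimal Python sketch using the values quoted for a 110 cM match; the function name `search_order` is made up for illustration. The probabilities are approximations read from the graph, so they need not add up to exactly 1.

```python
# Approximate probabilities quoted above for a match sharing 110 cM.
PROBS_110_CM = {"E": 0.08, "F": 0.39, "G": 0.30, "H": 0.20, "I": 0.06}


def search_order(probabilities):
    """Sort group labels from most to least likely."""
    return sorted(probabilities, key=probabilities.get, reverse=True)


print(" > ".join(search_order(PROBS_110_CM)))  # F > G > H > E > I
```

The same function works for the 750 cM example: `search_order({"C": 0.85, "D": 0.15})` puts Group C first.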

You may also be familiar with the Shared cM Project by Blaine Bettinger. This project compiles self-reported data from the genetic genealogy community for different relationships. Thus, it gives us both the extremes (maximum and minimum values) as well as histograms (bar graphs showing how common given centimorgan values are for each relationship). The histograms are comparable to the colored lines on the AncestryDNA graph.

For comparison, I’ve aligned the ranges from the three datasets below. For the Shared cM Project, I’ve combined data for relationships that belong to the same group (e.g., first cousins once removed and second cousins both belong to Group E, so they were treated together).

 

The ranges given by the DNA Detectives are consistently narrower than those from the other two sources. That is mainly due to the fact that the DNA Detectives chart intentionally omits extreme outliers, which are especially challenging to deal with in the unknown parentage searches for which the chart was created. Their dataset is also the smallest, although it has the advantage that each datapoint has been carefully vetted by an expert. The Shared cM Project ranges are similar to those of AncestryDNA, but not exactly the same. Differences between the two could result from errors in the self-reported data of the former, the relative sizes of the datasets (the simulated dataset is almost certainly much larger than the empirical data), or assumptions made by AncestryDNA’s scientists in designing the simulations. Regardless of which source of information you prefer to use in your own genealogical work, keeping in mind the strengths and weaknesses of each dataset is wise.

Note: The probabilities and cM ranges discussed in this post assume little or no endogamy. Endogamy is the practice of members of a population marrying within the same group over multiple generations. If it is practiced for long enough, the present-day members of the population will all be related to one another in multiple different ways.

Acknowledgements: Thanks to Dr. Tracy Vogler for alerting me to the online plot digitizer. CeCe Moore and Christa Stalcup kindly agreed to let me reproduce the DNA Detectives chart here.

 

99 thoughts on “The Limits of Predicting Relationships Using DNA  ”

  1. Dear DNAgeek,

    Nice story! Perhaps it’s an idea to convert this information to an online tool or even a phone app? Let users fill in the largest cM, total cM etc etc and the tool gives a nice visual explanation what would be the most probable connection.

    Best

    EJ

      1. Pretty wrong results. Entered my info and results showed Parent/ Child result when the actual results should have been Half-brother

        1. Are you sure you entered the correct information? There is no overlap between the amount of DNA a parent–child share and the amount half siblings share.

  2. Excellent article. Is the table you constructed with the online digitizer available in a spreadsheet? I think I could use it to assist in an effort to help a distant DNA match to identify her birth parents. It would save me from having to key in your data from the graphic.

  3. In this area we should also be including the number of matching segments as a further determinant. For example, grandmother-matches and niece-matches should have the same expected percentage of DNA but the niece match is expected to have the matching DNA broken up into more segments.

    1. That’s an excellent point. The number (and size) of matching segments can help distinguish between grandparent and avuncular relationships, but not other relationships. Scientists from 23andMe published a paper in 2012 that includes simulation data showing the distinction. I’ve digitized that data as well, but it was too much to tackle in this blog post.

      It’s Figure 3A in this open-access paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034267

  4. Thank you for the table of probabilities. I am currently working on my own DNA search for a biologic parent and this will help guide me a bit more…This is actually one of the most understandable charts for the lay person who understands some basic stats that I have seen.

    1. I hope it helps. If you haven’t already, join the “DNA Detectives” group on Facebook for free advice and moral support in your search. Good luck!

  5. Nicely done. I do wonder why you feel that the AncestryDNA white paper authors “also misuse the word “meioses”, confusing people who aren’t familiar with the term as well as those who are.”

    While it is a bit startling to see the term “meiosis” or its plural “meioses” without any introduction or explanation, I do not see what constitutes misuse, or why someone who understands the process would find the use of the word confusing. Isn’t the number of meiotic divisions at the base of all theoretical tables or formulae showing expected shared DNA? With every transfer of DNA from parent to child, the child receives half of the parent’s DNA, divided through the process known as meiosis. Because of crossing over during that meiotic division (as well as upstream meiotic divisions), however, the child does not receive equal amounts of DNA of each line above the parents, and the accumulated “error” with each meiosis explains the increased range in expected or actual shared DNA.

    1. These are good questions. I thought about addressing them in the post, but the explanation would have distracted from the main points I wanted to make here. I used the word meiosis/meioses, because AncestryDNA’s Figure 5.2 uses it. I then switched to the DNA Detectives’ term “group”, because it is both more accurate and less intimidating to the non-biologist.

      As you know, meiosis is the process of forming the egg or sperm in the parent’s body. It results in the egg/sperm getting half of the parent’s DNA. When the mother’s egg fuses with the father’s sperm, the offspring is restored to a full complement of DNA (half from each parent).

      The relationship between a parent and a child involves a single meiosis. That between a grandparent and grandchild involves two meioses (one in the grandparent, one in the parent). Similarly, half siblings are separated by two meioses, one in the shared parent to produce the first child, and a second in that same parent to produce the second child. This is where AncestryDNA misuses the term. In their figure, they label the group that includes half siblings and grandparents/grandchildren (forest green in the figure, Group B per the DNA Detectives) as three meioses, not two.

      AncestryDNA labels full siblings as being separated by two meioses, but that’s not the right way to look at it. They *are* separated by two meioses, but they’re also related twice over: once through their mother and once through their father. Essentially, they are double half-sibs, which isn’t quite the same as two meioses. (I’m sure AncestryDNA made this decision to try to make the concept easier for the novice to understand rather than out of ignorance. Unfortunately, in doing so, they’ve used the term incorrectly.)

      Interestingly, although a full aunt/uncle is expected to share the same amount of DNA as a grandparent or half sibling, the aunt/uncle is separated by three meioses, not two. The reason they share in that closer range is because they’re a double relative (i.e., double half aunt/uncle). This fact is potentially useful for relationship predictions; although a full aunt/uncle is in the same group as a half sib or grandparent, the extra round of meiosis means that the shared segments will be smaller, on average, so in most cases, we should be able to distinguish an aunt/uncle from those other two possibilities. (See Figure 3A in this paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034267)
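      For readers who like to see the arithmetic, here is a small Python sketch of that reasoning. It assumes the usual rule of thumb that each meiosis halves the expected shared DNA and that separate paths through different shared ancestors add together. The function name `expected_shared_fraction` is made up for this example.

```python
def expected_shared_fraction(meioses, paths=1):
    """Expected fraction of autosomal DNA shared by two relatives.

    meioses: number of meioses separating them along one path.
    paths: number of such paths (2 for "double" relatives such as
    full siblings or a full aunt/uncle).
    """
    return paths * 0.5 ** meioses


# (meioses, paths) for the relationships discussed above.
relationships = {
    "parent/child": (1, 1),
    "grandparent/grandchild": (2, 1),
    "half siblings": (2, 1),
    "full siblings": (2, 2),
    "full aunt/uncle": (3, 2),
}

for name, (meioses, paths) in relationships.items():
    print(f"{name}: {expected_shared_fraction(meioses, paths):.1%}")
```

      A full aunt/uncle and a grandparent both come out at 25%, even though the aunt/uncle path involves three meioses rather than two.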

      This is obviously a topic worthy of its own post!

      1. I’m sorry that I did not reply for almost a year! I came back to this blog post today to see the stats underlying an online calculator and read through the comments until I found your response. Your response to the comment about number of segments brought my response into clearer focus.

        I think Ancestry did use the term meiosis correctly if one is counting back to a shared ancestral couple as the MRCA. This is parallel to the use of “gen” by GEDmatch. Thinking about it this way also helps explain why grandparent/grandchild and avuncular relationships can be differentiated by number of segments, I suspect, if you count the number of meioses back to a shared couple.

        Counting meioses back to the common ancestral couple, in my opinion, makes it clear that we are looking at matching segments on shared lines, not on total DNA. The furthest back we can go between two matches is the ancestral couple they share. For half-siblings sharing a father, for instance, that would be the paternal grandparents, not the father. Recognizing this also should help in trying to use more distant relationships in searching for an unknown biological line.

        Thinking about it this way, and recognizing that segments with full matches are not double-counted (except in 23andMe, apparently), clarifies the confusion people have with “gen” with close relationships. I still have no kits on 23andMe so am not sure whether one can adjust for the double-counting in close relationships such as full sibs in tables and calculators.

        I think the term “meiosis,” including as used in the context of a shared ancestral couple, clarifies so much that it would be worth neophytes learning one more term. And it might jog memories of learning about the two major forms of recombination that occur during meiosis (crossing over and independent assortment) and therefore set the stage for a more advanced discussion of other aspects of inheritance of matching segments that puzzle people, for neophytes who want to go further.

        Thanks for the extended discussion; it has clarified a lot for me.

      2. The meiosis topic is one that I’ve only recently been giving thought to, now that I’m two years into my “DNA Deep Dive” to find my birth parents. And yes, it would be an excellent subject for a post of its own.

        While pondering how to sort my matches into family lines it simply jumped out at me that there should be a way to discriminate between removed relationships of the same degree (or meioses) because of exactly what you mention about aunt/uncle relationships. For example, a 1C1R+ (forward one generation from self) and a 1C1R- (back one generation from self) have two different lineages and DNA inheritance patterns from each other as well as from yourself, even though the number of meioses (and some of the people) are the same. Seems like we need a better and more discriminating term for the individual DNA passed by each parent during meiosis. Also, as is unfortunately the case with removed cousins, knowing if there actually IS an easily detected difference (segment totals or segment sizes?) is impossible on a large scale, because I have yet to see a study or survey that uses terminology that would discriminate between the two, and the question therefore hasn’t been studied. (I’m hoping you know of one and can refer me to it if it exists 🙂 ).

        1. A 1C1R will have the same inheritance pattern whether you’re the “up” generation or the “down” one. There’s a chance that a half 1C could be different, but I think there’s too much overlap between total cM shared and segment size for any clear differentiation.

          We can sometimes differentiate grandparent from aunt/uncle from half sibling, but the latter two are only distinguishable from one another when they’re on your father’s side.
          http://thednageek.com/escape-from-the-overlap-zone/

        2. Thank you for your reply. I read the link and I understand the differences between the 3 relatives mentioned. Can’t quite get my head around why paternal is different, but I believe the results and will ponder it, lol.

          Back to the removed relationships, it would be interesting if Dr. Millard has done (or would do) simulations on say, 1C1R relationships. All of the surveys/studies I’ve run across have not provided a means for respondents to differentiate between the two, and I understand why.

          But, here is my logic. I’m sure I may be overlooking something, but here it is anyway.

          The MRCA between these two and I would be Gr Grandparents in the case of 1C1R-, passed via Gr A/U and both Gr Grandparents as well as Grandparents in the case of 1C1R+, passed via A/U. In addition to the “double dose” of DNA from the avuncular connection, I understand that my 1C1R- would inherit ~25% from Gr Grandparent only, while my 1C1R+ inherits ~3.125% from Gr Grandparent as well as ~12.5% from Grandparents.

          I understand that I would share 6.25% with either 1C1R, but it would be differing portions of these MRCA, thus it just seems to me that it might manifest itself somehow other than through 1 to 1 chromosome comparison.

          Perhaps you can elaborate in a future blog post?

        3. The paternal side is different because women have a higher crossover rate than men. Fewer crossovers means fewer (but larger on average) segments passed down.

  6. Another thing to take into account is that in earlier times it was common for siblings from one family to marry siblings of another family, all descendants now likely sharing more DNA than typically expected, and impacting accuracy of estimations, depending on how far back this happened.

  7. There is a typo in the asterisk note in your final table comparing the three sources of data. I don’t know if this correctable here. “* The DNA Detectives… and is not comparable to the other to sources.” In the final “to”, the “w” went missing. I realize this is not particularly pertinent to the Blog, but it bugs me.

  8. I really enjoyed your blog. My mother and I recently tested on 23andMe and we were lucky enough to find 2 really close matches. My mother matched 15.1%, 1124 cM, to a male, with 32 segments and no match on the X. She also matched to a female (the sister of the male above) at 15.9%, 1181 cM, with 33 segments, 3 of which are on the X. 23andMe lists both of them as 1st cousins to my mother. What gets me is that she is in the overlap of all of the charts that I looked at, or borderline from one to the other. I know you said that when there is an overlap, one group has a higher probability than the other, but it is still hard to not feel 100% sure and to be left with doubt. Would you happen to have any suggestions as to what we should focus on? Where we should look? Please forgive any typos as it is hard to focus when your daughter is climbing on you while typing. 😉

    On a positive note, My mom and I have reached out to them, shared photos, and received a family tree of their known relatives. The comparisons between my mom and their family are scary due to how much they resemble each other. We just don’t know who in the family to focus on at this point.

    What do you think?

    1. For both of them, the most likely relationship is one in Group C, with a (much) lower chance of being in Group B. Group C includes first cousins, great grandparent/child, half aunt/uncle/niece/nephew, and great aunt/uncle/niece/nephew. You can probably rule some possibilities out based on their ages.

      1. Thanks. My mom is 20 yrs older than her predicted 1st cousin match. Her 1st cousin matches’ father is the youngest of five and has two male uncles and two female aunts. I’m thinking that one of her two uncles is my mom’s dad. One of the uncles was 19 yrs old and the other was 17 in 1943 when my mom was born. They both registered in the military in 1942. The oldest uncle was sterilized due to the Eugenics program in California, never married, and doesn’t have any known children but could have had relations before joining the military in 1942. The younger uncle did marry and had two daughters (who are still alive); one is 70 and the other is in her 60’s. My mom is 73, so I’m thinking he could also have had relations before he joined the military in 1942. Is it possible that the grandfather of my mom’s predicted 1st cousin could also be her dad due to the 1181 cM match? He and his wife were also in the Eugenics program. They let him out but his wife supposedly lived out the rest of her life in the state hospital.

  9. Nice work! This is very interesting.

    I’d like to know if there’s more info anywhere on probabilities for the more distant relations. It seems 23andMe considers any match in the range of (about) 15 to 42 cM (0.20% to 0.57%) on a single segment as a predicted 4th cousin with a “range” of “3rd to 6th” or “3rd to Distant”. Is there any info on what this really means probabilistically?

    I’d love to see a table like the above that goes down several more rows to 10 or 15 cM and has more columns to show 5th and 6th cousins.

    Is there a “standard” definition of “distant” cousin? It seems that 23andMe uses it to mean beyond 6th cousin, while the table here seems to mean beyond 4th cousin. Obviously the more distant the relation the more different possibilities there are, but if someone’s got a 25 cM overlap, they can’t be only, say, 10th cousins, can they? I can see why 23andMe doesn’t show more beyond a certain point, but I want to see all the geeky details!

    1. The challenge with distant cousins is that we’re not likely to share DNA with them at all. For 4th cousins, estimates of the chance that they won’t match range from 25% to 50%. For 5th cousins, it’s 70–85%, and for 6th cousins it’s 90% or more. Of course, we typically have enough 4th and 5th and 6th cousins that we’ll find a few who match, but by that point they’re statistically likely to share only one segment with us. Basically, there’s no way to distinguish a more-likely 4th cousin from a rare 6th cousin or even rarer 8th cousin based on a single segment.

      Here are some sources for those estimates:
      https://isogg.org/wiki/Cousin_statistics
      https://gcbias.org/2013/12/02/how-many-genomic-blocks-do-you-share-with-a-cousin/

      Another complication that doesn’t get addressed much is that a match who shares one 25 cM segment is likely to be closer than one who shares three 8.3 cM segments, even though the total cM shared is the same. In the latter case, I’d suspect that there are, in fact, multiple connections between the two DNA testers, possibly quite distant.

      1. I can see your point about the complication of matching on multiple segments. It’s not obvious to me whether having multiple segments totaling a certain shared amount would be closer or more distant than a single shared segment of the same amount, but it seems that 23andMe does interpret this issue oppositely from your assertion.

        If I download my DNA relatives from their site and sort by % of shared DNA, I can see several cases where they predict a closer relationship when there are multiple, short segments shared vs one longer one. For example, I have a predicted 3rd cousin there with whom I share 3 segments which total 37.5 cM. There is another person there with whom I share a single segment of 38.2 cM, and they predict 4th cousin.

        Almost everyone out of 1000+ DNA relatives I have there are predicted 4th cousins, but they also provide a “range” where they say “3rd to 5th”, “3rd to 6th”, or “3rd to Distant”. I can see many cases of predicted 4th cousins of mine there where the range moves up by one (closer) when there are multiple segments shared for a given total shared DNA amount. For example, a 27.2 cM share on 1 segment is predicted as (4th cousin with a range of) 3rd to 6th, but a total 27.2 cM share on 2 segments is a (4th cousin with a range of) 3rd to 5th.

        I tried to figure out what their ranges are for mapping shared DNA % (or cM) to particular predictions, and I couldn’t find any exact thresholds, even when I separate the groups by the number of shared segments. Among my predicted 4th cousins with only a single segment shared, there are several with a predicted range of “3rd to Distant” which have a higher cM amount than many others which are predicted as “3rd to 6th”. I wonder if there is some error in their algorithms here or if there really is some legitimate reason for this. I have not found any explanations on their site.

  10. Can you extrapolate on what the relationships may look like for endogamous populations and for those who then marry outside the endogamous population? Does the amount of shared cM revert to that of the standard population, or will endogamy play a part for many generations?

    1. That’s an excellent question. And, of course, there’s no easy answer, for a few reasons. First, even in non-endogamous populations, there is a range of shared DNA that’s normal for any given relationship (other than parent–child, which is always 50%). We expect the *average* to be higher in an endogamous group, but any two people could still be on the low end of that range and therefore not look like the endogamy affected their shared amount of DNA. Second, different populations have different amounts of endogamy. Cajuns and Polynesians and Ashkenazim and Puerto Ricans are all endogamous, but we wouldn’t necessarily expect them all to have the same outcome, because they have different overall population sizes and have been endogamous for different lengths of time. Third, the expected amount of shared DNA is affected by how many relationships there are, and most of our matches won’t have 100% complete trees past about 2nd-great grandparents. As a result, there will be connections that can’t be accounted for.

      Ultimately, we’ll need a combination of simulated data and crowdsourced information from well-studied populations to tease these issues apart.

      As for those who marry outside the population, the effects of endogamy do taper off, but you can still find yourself matched to very distant cousins.

      Like I said, no easy answers. I wish there were.

  11. Great article; but one should speak of DNA in terms of quantity not “amount”.

    1. I think it’s perfectly acceptable to speak of the “amount” of shared DNA. I suspect you’re suggesting that the word “quantity” is more accurate because many aspects of DNA are discrete and countable. While it is true that base pairs, SNPs, STRs, and physical distance on a chromosome are all countable, shared DNA is measured in centimorgans, which is a calculation based on many factors. Shared DNA in cM is not a discrete quantity.

  12. Mea culpa! On further reflection, I think either amount or quantity could be correct, depending on the context.

    1. If you are on Facebook, a great group to share images and get feedback is “Genetic Genealogy Tips & Techniques”. If you’re not on Facebook, you can email me at theDNAgeek (at) gmail.com.

  13. Hello, I have some very specific needs, and maybe you could help me with choosing
    the correct kit, and what possibilities / expectations to have.
    (Maternal mitochondrial line is not so important)
    Father’s Line, (main interests)
    Pennsylvania, most likely before 1700 (traced to 1771 on paper) apparently cousins
    to politically active Rush. Wish to locate Rush relationships from this Pennsylvania
    era, to confirm where I fit in. (6 and 7 generations back).
    ALSO looking to find how this family fits in with the Rush’s in the Tudor Era England.
    I have done fairly extensive research in this area, but there are some holes, and the
    information on Ancestry (from members) is horribly corrupted.
    (side note, Benjamin Rush used a Coat of Arms that tied him to one specific family
    branch, but seemed unaware in his writings of this tie-in.)

    Mother’s side, I have one branch solidly back to 1770, which is most interesting.
    Apparently a heavily intermarried cluster in Lancashire where I lose the paper trail.
    The families are documented to the 1200’s, but I cannot make the exact fit. Is there
    anything that can give hints that far back (roughly 1600’s mother side), does the
    extensive intermarriage help or hinder???

    ALSO, on the Father’s side, Father’s Grandmother, born 1879, Bohemia, has four names
    on Catholic Baptism papers that tie into Ashkenazi Jewish / Catholic Conversion /
    (Frankism) I would like to get some idea if her bloodlines were Jewish, as it appears.

    For any help directing me to which testing can help with these specific questions,
    I am most Thankful. If you guide me to one of the kits on your website, I will purchase
    it here.

    Thank You.
    R. Rush

    1. To investigate your Rush surname line, take the Y-37 DNA test at Family Tree DNA. There’s no guarantee you’ll find matches, of course, but if you do, they can be quite valuable. The test is currently on sale for $129, and you can save an additional $10 with a coupon code.

      I don’t sell DNA tests myself, but if you use this referral link to make a purchase, I will get a small commission. The cost is the same for you:
      https://affiliate.familytreedna.com/idevaffiliate.php?id=1830
      Use this coupon code: R23SGIZZUZY5. It expires today (11/19/17), so if it doesn’t work for you, let me know and I’ll get you an updated one tomorrow.

      If you’re interested in your mother’s maiden surname line, you could ask someone from her family to do the Y-37 test (i.e., her brother, her brother’s son, or maybe a cousin with her maiden surname). I can get you another coupon code if you go that route.

      The Y-DNA tests can be upgraded later if you have too many matches and want to refine the results.

      Y-DNA will only track the direct paternal lineages (the ones usually associated with surname). It can’t help you with your father’s mother’s line, for example. For that, you need what’s called autosomal DNA. With autosomal DNA, the further back in generations you go, the harder it is to find evidence for your ancestors. For that reason, your best bet is to test members of the oldest generation in your family still living. Testing your two parents, if you can, would be better than testing yourself. Testing your four grandparents (if possible) would be better than testing your two parents.

      For autosomal DNA, I recommend starting with the AncestryDNA test. You can transfer those results to other sites, usually for free, to get more bang for your buck. Right now, it’s $79 for the first test, $69 for subsequent ones. They should go on sale for even less over Black Friday weekend. You can buy them through this link (again, I will get a commission, no extra cost for you): http://www.tkqlhce.com/mt80hz74z6MVQQSRPVMONRSTVVS

      1. OK Thank You,
        I got the answers that I had expected. Ancestry the
        best with autosomal, Family Tree for Y.
        I think that with the combination of generations, plus
        distant past intermarriage, much of the autosomal
        will turn into a fog, but in a way, that alone is a positive
        answer. It seems like it might take some time just to
        sort out the data and match it up to info in archives.
        I’ll try to get back by email in the future if this comes out
        interesting.
        And Sadly, I am sixty, and there is no one older to go to,
        except my first cousins.

  14. I’ve got a crazy puzzle dealing with DNA, dealing with the unknown parents of my great grandfather, John Crumpton (1871-1935). DNA testing has shown me who his grandfather was (William Crumpton, Jr. (1805-1860)). However, there are descendants from three different children of William Crumpton who all match me at higher values than should be likely. They can’t all be my 2nd great grandparent, but the DNA makes them all look like they should be.

    As an example, one DNA cousin (R.A.M.) is the 2nd-great grandchild of William Crumpton through his son Lemuel. William Crumpton is my mother’s 2nd-great grandfather, as well. (I use my mother’s DNA results for comparison, as she is a generation closer to the source.) If R.A.M. and my mother descended from different children of William Crumpton, they would be third cousins. R.A.M. and my mother share 171 cM (probability value 0.13).

    Second, I have two matches, R.T. and A.O., who are 3rd-great grandchildren of William Crumpton, through his daughter Mary Ann Crumpton. Mary Ann is definitely not a candidate as my 2nd-great grandmother, as she was busy having a legitimate child with her husband the same year my great grandfather was born, and had three more after that. R.T. and A.O. should therefore be my mother’s third cousins once removed, and yet they match at 157 cM and 113 cM (probabilities of about 0.04 and 0.16).

    The closest relationship found with DNA is to R.E.C., a great-grandson of William Crumpton through his son Jonathan. He could, therefore, be my mother’s second cousin once removed. Their match is 303 cM, for a probability of 0.10 of being 2C1R.

    There is one line of Crumpton matches which is complicated by the fact that I’m doubly related to them—my great-grandfather’s mother was an Emerson (based on DNA matches), and one Crumpton son (Marion) married an Emerson daughter. My great-grandfather is not one of their children, but was at least a double first cousin to their children, skewing the DNA results. Marion is also a good candidate as my 2nd-great grandfather, if he was cheating on his wife with his sister-in-law, which would make John Crumpton about a 3/4 sibling to his legitimate children. The double DNA connection makes it hard to figure out. One match, who is a 4th great grandson of William Crumpton, shares a DNA match with my mother at 174 cM. (No idea how to calculate the relationship probabilities here.)

    I lean towards the R.E.C. match, since there are fewer generations between him and my mother, so the variation in DNA match values ought to be less. The probability of their match being Group E (half 1C1R) is 0.57. (Not to mention that Jonathan was single at the time of John Crumpton’s conception, though he did marry not long after, and not to an Emerson.)

    1. Sounds like you’re making great progress with this mystery! It’s difficult when you’re dealing with double relationships, because the probability table does not apply. At I4GG this weekend, I’ll be presenting a new approach for working with cases like this, and I hope to blog about it soon after. Stay tuned!

  15. I have a question on total segment length associated with cutoff point and what it means. FTDNA and GEDmatch typically use a cutoff of 7 cM. I was “playing” around and set the GEDmatch cutoff point to 2 cM. I went from no match to 44 matching segments, longest 4.8 cM, total 119.9 cM. This was with a person with whom I have a Y-DNA match. This is a person with whom I have a high probability of absolutely no relation of any kind for at least 225 – 250 years and likely longer than that.
    Further, I looked at several people who have the same ancestral surname. On a few of the segments there was some segment overlap for some of the people (not all).
    What am I seeing or imagining? Thanks

    1. Small segments (smaller than 7 cM) are statistically more likely to be false positives than real IBD matches. If you’ve tested your parents and can phase your DNA — and better yet, if your match can also phase — you will see that most, if not all, of those segments will no longer match.
      There are some good blogs on small segments here:
      https://thegeneticgenealogist.com/2014/12/02/small-matching-segments-friend-foe/
      http://www.yourgeneticgenealogist.com/2014/12/the-folly-of-using-small-segments-as.html
      https://thegeneticgenealogist.com/2017/09/03/sharing-large-segments-with-a-match-does-not-validate-small-segments-shared-with-that-match/

  16. First cousins share, on average, about 900 cM, with a range of around 575 to 1300 cM. I share 1513 cM with my first cousin and 1305 cM with her brother.

    We know why it’s so high. Our mutual grandparents were full siblings.

    My question is: what would the expected shared range be given that scenario?

    1. That’s an interesting question technically and mathematically (not to mention all the other ways!). I thought about this for a while and this is what I came up with, purely theoretically:

      Note that I’m ignoring the X chromosome difference between the sexes and using the round number of 3600 cM for each copy of the genome for a total of 7200 cM in each person.

      I would first consider how related two children of full siblings would be.

      For two full siblings whose parents are not related, the answer would be 50% on average, or 3600 cM. Of course for such close relatedness, you must consider full DNA matches and half matches separately. I think ignoring that distinction is when you get people saying siblings have 2700 cM shared on average. That 2700 cM is really 1800 cM of half match and 900 cM of full match (and the other 900 cM of no match). You have to double the 900 cM of full match and add to the 1800 cM of half match to get 3600 out of 7200 or 50%.

      I won’t show all the details here, but I think using those averages, you can predict that two full siblings of two full siblings would have 1125 cM of full match sharing and 2025 cM of half match. Doubling the 1125 and adding to 2025 gives 4275 cM, or 59.375% (of 7200) shared DNA.

      It makes intuitive sense that the amount of shared DNA between such siblings would be higher than 50%, so this number seems reasonable to me. So this means that such siblings would share 18.75% more DNA than siblings whose parents are not related, on average.

      In the situation described above, the 59.4% (on average) related siblings are the children of the mutual grandparents. I think it’s valid to just scale up the typical cousinship relatedness by that 18.75%, so the 900 cM “standard” for 1st cousins becomes 1069 cM and the 575 to 1300 cM range becomes 683 to 1544 cM. The two relationships listed are indeed both in this range, so maybe this does make sense.

      1. Sorry, I just realized I left out part of a sentence in my last post. When I said “two full siblings of two full siblings” I meant to say “two full siblings who are children of two full siblings”.

        1. I would think that would make your cousins “double first cousins” (group B in first chart)

        2. Hi Tanya, to clarify: double first cousins are cousins who share all 4 grandparents (and are not otherwise more closely related). Being double cousins doesn’t require any marriage between related people. If your mom’s sister and your dad’s brother were married to each other, their children would be double cousins to you. The situation Mick presented here is very different, where a brother and sister had children together. Those children would have only 2 grandparents. Their children would still only be regular (not double) first cousins, but they would be more closely related than typical first cousins because of their grandparents being siblings to each other.
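
The arithmetic in Nick's reply to comment 16 can be checked with a short Python sketch. The function and variable names below are illustrative only, and scaling the whole first-cousin range by a single factor is the reply's own simplifying assumption rather than an established method.

```python
GENOME_CM = 7200  # both copies of the genome, the round figure used in the reply


def total_shared_cm(full_match_cm, half_match_cm):
    """Fully identical regions count twice because both copies match."""
    return 2 * full_match_cm + half_match_cm


# Full siblings with unrelated parents: 900 cM full match, 1800 cM half match.
unrelated = total_shared_cm(900, 1800)
# Full siblings whose parents are themselves full siblings (figures from the reply).
related = total_shared_cm(1125, 2025)

# How much more DNA the second pair of siblings shares, on average.
scale = related / unrelated


def scaled_first_cousin(cm):
    """Scale an ordinary first-cousin figure by the extra sibling relatedness."""
    return cm * scale


if __name__ == "__main__":
    print(unrelated / GENOME_CM, related / GENOME_CM, scale)
    print([scaled_first_cousin(x) for x in (575, 900, 1300)])
```

Run as a script, this gives 50% and 59.375% sharing, a scale factor of 1.1875, and scaled first-cousin values of 682.8125, 1068.75, and 1543.75 cM. Both 1513 cM and 1305 cM fall inside that scaled range.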

  17. Thanks, Nick, for your thoughts, and for expressing them so clearly for someone who is fairly new to this game.

    Of course, the situation I describe will be fairly rare and, probably, many people with this sort of ancestry will be unaware of it anyway. Over the generations, though, unusual unions will have occurred in many families. It makes me wonder to what extent such unions contribute to the range extremities shown on, eg the Shared cM Project. Had I submitted my data, without qualification, to that project, would we have a different range for first cousins?

  18. Hi,
    Is there any info as to how much fully-identical by chance we might share, on average, with someone else? What follows may be impressionistic, but I don’t think so. With some of my more distant matches, eg, a third cousin once removed, I seem to have quite a few fully-identical regions, quite, or very, short it’s true. Often they seem to cluster quite close together and comprise 30-50% of a 7+ cM block which GedMatch shows with its blue marker. If these are really identical by chance (and I suppose they must be) then are they distorting the ‘longest segment’ calculation, and also the shared cM calculation?

    My background is UK, so endogamy ought not to be an issue.

    1. A fully identical region (FIR) will be solid green. If you’re seeing a large number of thin vertical green lines in an otherwise yellow block in the one-to-one comparison, that’s not an FIR. That’s just someone who coincidentally shares the same SNP(s) at those spots. It’s not unusual if you both share deep roots (thousands of years) in the same area.

      1. Thanks. That’s clarified it somewhat. One further question, if I may, how wide does the thin green line need to be to be considered a FIR?

  19. For the sake of correctness…
    It would seem you have a typo in the fifth paragraph… “Someone who shares 3015 cM with you can only fall into Groups B or C, but someone who shares 100 cM could belong to Group E, F, or G, according to the DNA Detectives chart.”
    – Easy to see that ‘1315 cM’ (rather than 3015) would correctly continue the subject of the paragraph. Perhaps an error in speech recognition?

  20. My older daughter Sarah recently tested and no surprise that we share 3,384cMs w/longest block 267cMs. Sarah’s half aunt transferred her raw results from myheritage to FTDNA, and matches Sarah 886cMs on chromosomes between 1-22, but nothing on chromosome 23 aka “x”. Sarah’s mother & Sarah’s half aunt have the same mother(different fathers), but the same grandmother of Sarah, so is it possible that Sarah & her half aunt do not match any “x” chromosome dna?

    1. Yes, it’s possible for Sarah and her aunt to not share any DNA on the X. There are probably other chromosomes on which they don’t match, as well.

  21. I am 75 years old and am so confused about my dna test. I found out recently that my father may not be my father. My sibling and I had a test done. the segments say we are matched at 42 and the cM is 2449. BUT, my sister is 56% English, I have none. I am Irish, she has none. Can you please help me. I need to know before I die.

    1. That amount (2449 cM) is in range for a full sibling. I wouldn’t fret about the different ethnicity estimates. One reason for the differences is that you share both parents, but you didn’t necessarily inherit the exact same bits of DNA from each parent. Another reason is that ethnicity estimates are still a developing science. Which company did you test with?

  22. This is very handy, as is the probability calculator on the DNAPainter site.

    Here’s my issue. I share (according to Ancestry) 64 cM in 5 segments with a bloke somewhere in N. America. He descends from Alice Dixon, a sister of my gt. grandmother, so, assuming we are of the same generation (and I think we are) we are 3rd cousins.

    Problem is that I don’t know for sure who was the father of his ancestor. The mother had been married before. Alice’s birth certificate shows her surname as Cole, and the father as William John Cole, the actual husband at the time. Five weeks after the birth registration, Alice was baptised with the surname Dixon, and the father as Samuel Dixon. My gt. grandmother came along later so was definitely a Dixon.

    So, depending on which is right – the birth certificate or the baptism, my match is either my 3rd cousin, or my half-3rd cousin.

    The probability calculator gives a 32% probability that we are half-3rd cousins, and 22% that we are full-3rd cousins.

    But the Shared cM Project Table 3 states that Ancestry averages/medians are 64/53 for a 3rd cousin and only 33/39 for a 3C1R (same as a half-3C in DNA terms). This seems to reverse the likelihoods shown by the probability calculator.

    So, my question is – given the substantial differences between the different companies in calculating shared cMs, is it really viable to have averages that derive from the aggregation of various companies’ results?

  23. The amount you share with your sister is for a FULL SIBLING. Unless you are identical twins you won’t have the same ethnic make up.

  24. I have a surprise atDNA match who is either my 1C1R or 2C1R, depending on which of two brothers was my biological grandfather. Our shared ancestors were French-Canadian, with a single recent instance of cousin marriage, which also makes us 3C2R.

    We share 303 cM across 16 segments on AncestryDNA. Being somewhat familiar with the Timber effect when endogamous populations are involved, I asked my match to upload his results to GEDmatch, where our One-to-Many match is 337 cM and our One-to-One match at the default settings (7 cM and 500 SNPs) is 326 cM across 13 segments.

    Based on these numbers, it appears we are far more likely to be 1C1R than 2C1R, even when the potential 3C2R “contribution” is taken into account. How would you go about quantifying that likelihood?

    1. That is such a great question, and unfortunately, I don’t have an easy answer. The best way to go about it would be to simulate each combination of relationships (1C1R + 3C2R and 2C1R + 3C2R) enough times to create distributions of expected shared DNA amounts, and then compare the distributions to the real numbers.

      I have a fudge that I use in cases like this, but I’m not ready to make it public because I haven’t validated it. I ran it on your scenario and it only very slightly favored the 1C1R hypothesis. Definitely not by enough to have any confidence in the result. Is there someone else you can test?

      1. Thanks. Fortunately, there is another person and she just agreed to test. Same relationship (either 1C1R or 2C1R) but a different grandfather than the other.

        That said, I am mystified by your conclusion that the existing results are close to a toss-up.

        1. It’s close to a toss-up because 303 cM is low for a 1C1R and even lower given that some of the shared DNA could have come through the 3C2R connection. Then again, it’s high for 2C1R + 3C2R. (Also, I make no claims to how well my “fudge” works, so maybe that’s where the problem lies.)

          Please update me when the new tester’s results come in. I’d love to see whether my tentative prediction holds or is rejected by more evidence.
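
To give a feel for the simulation approach mentioned in the reply to comment 24, here is a rough Python sketch. It is not the "fudge" described there, and it rests on loud assumptions: 22 equal chromosomes of 160 cM, crossovers spaced exponentially with a mean of 100 cM, no minimum segment size, and only a single shared ancestral couple. It therefore covers plain 1C1R and 2C1R only; the extra 3C2R path and any endogamy would have to be added to the pedigree before the output could be compared with a real match. All names are hypothetical.

```python
import random
import statistics

# Assumption: 22 autosomes of 160 cM each (3520 cM), tracked in 1 cM bins.
CHROMS = [160] * 22


def founder(tag):
    """An unrelated founder: two haplotypes, each with its own unique label."""
    return tuple([[(tag, h)] * n for n in CHROMS] for h in (0, 1))


def gamete(person, rng):
    """One meiosis: copy along each chromosome, switching haplotype at crossovers."""
    out = []
    for c, n in enumerate(CHROMS):
        src = rng.randrange(2)
        chrom, pos = [], 0
        while pos < n:
            # Assumed crossover spacing: roughly exponential, mean about 100 cM.
            nxt = min(n, pos + int(rng.expovariate(0.01)) + 1)
            chrom.extend(person[src][c][pos:nxt])
            pos, src = nxt, 1 - src
        out.append(chrom)
    return out


def child(p1, p2, rng):
    return (gamete(p1, rng), gamete(p2, rng))


def descend(person, generations, rng, tag):
    """Follow one line of descent; every spouse is a fresh unrelated founder."""
    for g in range(generations):
        person = child(person, founder((tag, g)), rng)
    return person


def shared_cm(a, b):
    """Count 1 cM bins where a and b carry at least one identical founder label."""
    total = 0
    for c, n in enumerate(CHROMS):
        for i in range(n):
            if {a[0][c][i], a[1][c][i]} & {b[0][c][i], b[1][c][i]}:
                total += 1
    return total


def simulate(depth_a, depth_b, trials, seed=1):
    """Shared cM for two descendants of one couple, depth_a and depth_b generations down."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        g1, g2 = founder("G1"), founder("G2")
        a = descend(child(g1, g2, rng), depth_a - 1, rng, "a")
        b = descend(child(g1, g2, rng), depth_b - 1, rng, "b")
        results.append(shared_cm(a, b))
    return results


if __name__ == "__main__":
    for name, (da, db) in {"1C1R": (2, 3), "2C1R": (3, 4)}.items():
        sims = simulate(da, db, trials=200)
        print(name, round(statistics.mean(sims)), min(sims), max(sims))
```

Under these assumptions the averages should land near one-eighth and one-thirty-second of the 3520 cM genome (about 440 cM and 110 cM). The spread around each average is the point of the exercise, since the observed match is judged against the whole distribution and not just its mean.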

  25. Thanks again. I will definitely let you know when we get the new results. At least 6-8 weeks, possibly more. How best to contact you with the results at that time?

    One more question for now. Why did you go with the 303 cM number from Ancestry instead of the 326 cM number from GEDmatch? (Perhaps the difference would be insignificant.)

    1. You can email me at theDNAgeek —a— gmail.com.

      I use AncestryDNA numbers over other estimates because Timber is, in theory, removing segments that are pileups. Pileups are unlikely to reflect recent shared ancestry, so we actually want to downweight them.

      1. One more thing (sorry). In my own crude analysis of this, I assumed that if the total match was 70% of average, then each of the two components (the 2C1R and the 3C2R) would be 70% of average. I gather that’s not right?

  26. Hi,
    Can you please help me…If two brothers each have a child with the same woman, what relationship will those children have? Will their centimorgan values be on the high side of half siblings? Or are they just regular half siblings?

    1. We call those “three-quarter” siblings. They’d be expected to share an amount of DNA between that of half sibs and full sibs.

  27. Help….I have a mystery guest that is either a 1/2 first cousin, or a second cousin on my paternal side. How can you determine a more precise match? I match my known Paternal Uncle at 1,825 centimorgans shared across 55 DNA segments. I match his daughter my known first cousin(female) at 1,042 centimorgans shared across 33 DNA segments. I match the mystery guest at 542 centimorgans shared across 22 DNA segments. My Uncle matches the mystery guest at 497 Centimorgans and his daughter(my known first cousin) at 287. I see how the numbers seem to indicate a 1st cousin once removed for my uncle and a second cousin relationship for my first cousin, but my number is higher than both theirs and indicates a 1/2 first cousin. Mystery guest father is the same age as my paternal uncle and mystery guest is close to my age. Unfortunately all the parents/grandparents and my Uncle are deceased so there is no way to get a DNA sample closer up the line.

      1. Thank you. I still had strange results.. probably due to only having 5 DNA sample cm numbers to compare. If all DNA cm’s are correct from all 5 sources, my Uncle, my 1C and my 2C, then the mystery guest is likely my 1/2 1C.

        I am going to have a test done at 23 and me, to confirm or deny my ancestry dna results.

        Removing my uncles Cm gives a score of 89 and points to the mystery guest being a 1/2 1C.

  28. Many thanks for the Utility and Blog.

    I’ve been somewhat perplexed over Siblings being given as sharing 50% of their DNA, but 2629cM rather than 3487cM… after reference to Blaine’s 2 books, I found a description on p104 of the first (as also given above) intimating that 25% is shared with both parents (FIR), 50% with either parent (HIR) and 25% with neither (on average!). This is then given as (25% + 50%) of 3487cM i.e. 2616cM – I can see you make the distinction between the testing companies in the DNA Detectives table above. AncestryDNA have then needed to qualify close matches above 1300cM as given in Figure 5.3 of their White Paper, identifying full siblings as 25% FIR and Identical Twins as 100% FIR to distinguish from Parent-Child and Half-Sibling relationships.

    It’s taken me a day to resolve the confusion, so I thought it was worth a mention; I’ve always been somewhat perplexed that at GEDmatch & Ancestry I match about 3500cM with either parent, but no more with myself!

    1. Technically, we each have about 7000 cM of DNA, once you account for both copies of each chromosome. Because we pass along only one copy of each chromosome to our children, they match us at 3500 cM (give or take … it varies slightly by company) in what we call half-identical regions. Those are the ones that show in yellow when you do a one-to-one comparison at GEDmatch with the graphics on. An identical twin would match on both copies (7000 cM total) and would show as fully-identical at GEDmatch (green instead of yellow), but most of the companies don’t count the fully identical regions twice. That’s why your match to yourself (or to an identical twin) would only be 3500 cM instead of 7000 cM. It’s also why the total for full siblings seems off.
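
The counting difference discussed in comment 28 and its reply can be made concrete with a small Python sketch. The function name and the `count_fir_twice` flag are illustrative; the 3487 cM genome length and the 25% / 50% / 25% split for full siblings are the average figures quoted in the comment, not measurements.

```python
def reported_cm(fir_fraction, hir_fraction, genome_cm, count_fir_twice=False):
    """Total shared cM for given fractions of fully and half-identical regions.

    Most companies count a fully identical region (FIR) once, like a
    half-identical region (HIR); counting it twice reflects that both copies match.
    """
    fir_weight = 2 if count_fir_twice else 1
    return (fir_weight * fir_fraction + hir_fraction) * genome_cm


GENOME_CM = 3487  # one copy of the genome, as quoted in the comment

if __name__ == "__main__":
    # Average full siblings: 25% FIR, 50% HIR, 25% no match.
    print(reported_cm(0.25, 0.50, GENOME_CM))                        # FIR counted once
    print(reported_cm(0.25, 0.50, GENOME_CM, count_fir_twice=True))  # FIR counted twice
    # Parent and child: half-identical everywhere.
    print(reported_cm(0.0, 1.0, GENOME_CM))
    # Identical twins: fully identical everywhere.
    print(reported_cm(1.0, 0.0, GENOME_CM))
```

Counting each FIR once gives about 2615 cM for average full siblings, close to the figure quoted in the comment. Counting FIRs twice gives 3487 cM, which is 50% of the two-copy total. With single counting, a parent, a child, and an identical twin all come out at the same 3487 cM.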

  29. Yes, that makes those upper level figures a bit tricky to interpret – it seems apparent that AncestryDNA do actually determine the amount that is FIR (as described in Figure 5.3 of the White Paper) and could presumably include the information with the match data. I think that would then lead to a bell-shaped curve at that top level, extending down from 7000cM. The lower curves then derive from that fundamental (i.e. comparison of two siblings is a comparison of two instances of the procreation process (avoiding the term meiosis!), which then repeats at each generation).

    I’ve also realised that the FIR/HIR mix inherited from parents must be 50%/50% for the mix between Siblings given (i.e. 25% FIR, 50% HIR, 25% no match)? i.e. 50% of the 50% FIR is shared between siblings (on average); the other percentages then follow.

    1. Sorry, I got a bit confused with the last paragraph (not difficult I guess!); I’ve been consulting Blaine’s first book again – p100 gives a nice diagram of the inheritance pattern from grandparents to grandchild. I think that 25% FIR content for a full siblings match derives from their sharing 50% of their father’s DNA on the paternal chromosomes and 50% of their mother’s on the maternal side; an expected 50% overlap between those would then lead to 25% FIR.

      The source of the variability (depicted in Figure 5.2 of Ancestry’s White Paper given above) must then come from the way in which the father’s paternal & maternal chromosomes are combined to provide the child’s paternal chromosomes, and likewise on the mother’s side. If that aspect can be modelled that might then reveal the background Endogamy in the more distant matches (i.e. they’re actually multiple matches)?

      Also wondering if the degree of FIR in a siblings match imparts some additional information not present in more distant matches.

      Think I’d better move over to the Facebook group…

    1. So-called false positive segments become more likely as the segment size decreases. They are rare for segments of 15 cM and higher, whereas most segments below 7 cM are false positives.

  30. Thank you for the article, Still trying to figure all this out.
    Would someone with 1123 Cm across 40 DNA segments possibly be a half sister?
    Her daughter shares 860 Cm across 37 DNA segments with me.
