Science the Heck Out of Your DNA — Part 1

Scroll down for links to other posts in this series.

I presented a talk on this method at the i4GG conference in December 2017. The video is available for purchase here, either individually or as part of the all-conference package.

Last year, I wrote a post entitled “The Limits of Predicting Relationships Using DNA” in which I shared a statistical approach for deciding which relationship to an unknown DNA relative is most likely. For example, if I have a new match who shares 260 cM with me, the probability table in that post would tell me that there’s a 62% chance that they’re a 2nd cousin (or equivalent), a 23% chance they’re a second cousin once removed (or equivalent), and a 15% chance that they’re a first cousin once removed (or equivalent).

I can use this information to guide where to start looking in my family tree for the connection.

Unknown parentage cases have a different challenge. There may be several DNA matches who all descend from a certain couple, but we don’t know where the searcher belongs in that family tree. Consider the example of Ruth, who has an unknown birth father:


Based on how much DNA Ruth shares with Katie (140 cM) and Marie (136 cM), she could fit into the tree in several different places. We need to test other descendants to narrow down the possibilities. Because the costs of DNA testing add up quickly and because each test can take weeks or months to process, we want our decisions to be guided by the best evidence available. How can we use the existing DNA matches to decide which scenario is most likely and, therefore, which branch of the tree to target first?

I am very excited about a new tool that I’ve been working on with Dr Andrew Millard and Jonny Perl to address that exact question. I had the inchoate idea of applying probabilities to situations more complex than one-to-one comparisons, Andrew created a working model that calculates the odds for specific scenarios, and Jonny has been creating an online tool to make the process easily accessible. I am going to describe that tool in a series of posts, beginning with this one.


What’s With the Title?

I’m sure you’re wondering why I titled this series “Science the Heck Out of Your DNA”. Short answer: It’s a (sorta) quote from the movie The Martian (sorta because he doesn’t say “heck”). I love the movie because the main character is an unrepentant science geek who uses his botany powers to survive on Mars. He also does a lot of math. And while botany powers aren’t likely to solve any genealogical brick walls, scientific thinking and math just might.


First, Some Basics

The probability table in my earlier post can tell me the likelihood that someone who shares a given amount of autosomal DNA with me is a 2nd cousin or 3rd cousin or whatever. But what if I have two DNA matches? Or three? Or more? Each match shares a different amount of DNA and could be related to me multiple ways. How do I figure out which scenario is most likely?

Here, we can turn to a basic rule of statistics: If you want to know the combined probability of two independent events (meaning, the chance that both events will happen), you simply multiply the probability of Thing 1 by the probability of Thing 2 (with apologies to Dr Seuss).


And if you want to know the odds of three, or four, or six, or more independent things happening, you multiply the probability of the first by that of the second by that of the third … and so on. The product is called a “compound probability”. (We’ll come back to what “independent” means later.)
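
For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of the multiplication rule. The scenario names and probabilities are made-up values for illustration only. They are not taken from the probability table in my earlier post, and this is not the code behind the tool described above.

```python
from math import prod


def compound_probability(probabilities):
    """Multiply the probabilities of independent events together."""
    return prod(probabilities)


# Hypothetical inputs: for each scenario, the probability of each
# match's relationship if that scenario were true.
scenarios = {
    "Scenario A": [0.60, 0.50],
    "Scenario B": [0.25, 0.10],
}

# One compound probability per scenario.
compound = {name: compound_probability(p) for name, p in scenarios.items()}

for name, value in compound.items():
    print(f"{name}: {value:.3f}")

# How many times more likely Scenario A is than Scenario B.
ratio = compound["Scenario A"] / compound["Scenario B"]
print(f"A is about {ratio:.1f} times more likely than B")
```

With these made-up inputs, Scenario A has a compound probability of 0.30 and Scenario B has 0.025, so A comes out about 12 times more likely. The multiplication is only valid when the events are independent.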


In my next post, I’ll describe how we can apply these ideas to a simple genealogical problem. Subsequent posts will expand to more complex cases.


Other posts in this series can be found here:

30 thoughts on “Science the Heck Out of Your DNA — Part 1”

  1. This sounds great! I think I have narrowed down the identity of my paternal great grandfather to one of 3 brothers. I have 40 matches feeding into this mostly at Ancestry, only 5 in GEDmatch. I used the Maguire method and all ICW info to help firm this up. I feel quite confident that it’s one of the 3 brothers but I would love to know whether the numbers confirmed it!

  2. Factoring in the variable that my birth parents were paternal 2nd cousins (shared the same paternal great-grandfather), does it make sense that I match a female cousin at 997 cM on 38 segments whose maternal great-grandmother was my maternal grandmother, meaning we are 1st cousins 1x removed?

    Best regards, Doug

    1. That’s beyond the upper range for a 1C1R. If the 2C connection between your birth parents was on your mother’s father’s side, it shouldn’t affect how much DNA you share with someone on your mother’s mother’s side. Is it possible that this match has another connection to you?

  3. And I love the new tool and have already been sharing it with the people I am working with. Fantastic job! Kudos to you, Dr. Millard and Jonny! I blogged about it the night I found it. Talk about keeping an old gal up with excitement. I can barely keep up and you have certainly helped with the improvement of the tools out there.

  4. I suspect that this kind of analysis will not work within endogamous populations, where a total of say 200 cM of matching segments is likely to be distributed among several common ancestors in different directions.

      1. Israel, you are right. Where there is sharing that is due to relationships not in the tree the calculation can’t be correct, but in principle the calculations can be extended to any complex tree with pedigree collapse. For endogamous populations the table of probabilities is likely to be wrong, and we might have to develop a different table, for example using the data in Lara Diamond’s tables (and I’m sure you have a similar dataset!)

  5. We had a recent success identifying the biological father of one of two sisters (who we all thought were full sisters, but who turned out to be half). We were looking in the wrong part of the tree for a while, because it turned out that some of the cousins were double 2nd cousins (one pair of grandparents were the siblings of the other pair).

    Although the statistical groups include double first cousins, they don’t include double 2nd, etc., so I’m not sure what the expected cM match range would be (in our case, the actual was almost exactly double the average expected for regular 2nd cousins).

    Any chance that future versions of the statistical tables might include double 2nd and 3rd cousins?

    1. I certainly hope so! Ideally, the community can develop probability tables for specific scenarios (like an isolated case of double 2C) as well as for specific endogamous populations, like Ashkenazi Jews. I’d like to see a combination of crowdsourced empirical data like from the Shared cM Project and simulated data.

    2. Double first cousins are a special case because they have a large proportion of the shared DNA in fully identical regions (FIR). Double second cousins have much less FIR and double 3rd cousins hardly any. It is a sufficient approximation to treat them as simply one group closer: second cousins are in Group E, so double second cousins are in Group D. Third cousins are in Group G, so double third cousins are in Group F. Tables like these will rarely suffice for complex multiple-cousinhood situations because there will rarely be enough data to reliably estimate the range. That is where we have to turn to simulations.

  6. Do you need a guinea pig? I have 22 cousin matches in a family that I’m sure is my birth father’s. This tool would be awesome!

  7. I only know that this female’s grandmother was a younger sister of my birth mother and that her maternal great-grandmother was my maternal grandmother. If there is another close relation that I am not aware of, then I hope to be able to discover that, but I have to wait for better communication with this family.

  8. This is exactly what I need to narrow down who my grandfather is (my dad was adopted)! I look forward to the next blogs in this series!! Thank you!

  9. 260cM!!!
    This blog item’s advice is priceless for those with close adopted or unknown ancestors.
    However, until recently, most discussion I saw was closer to my own case, where matches are 10-30cM. Out in 3rd – 5th cousin land. Unfortunately the overlap between possible options tends to make these statistics unhelpful out there. But I think it is where most people’s matches live, don’t they? Any advice for us? Plleeaase
    If I see 260cM I reach for 1) my address book, because I probably know them but not by the alias against their DNA identity; 2) my family tree; 3) family books of descendants of various ancestors.

    1. Distant matches are best for ruling out relationships rather than ruling them in. Even a 0-cM match can be useful as it can rule out hypotheses that would put you as 2C or closer. When you only have distant matches, the best use of the probability approach is to guide your targeted testing. I’ll be talking about that in a future post (the one about Ruth).

  10. I’d love to help test this tool. I have a great grandfather whose parentage is unknown. DNA testing has helped me determine that his grandparents were couples named Crumpton and Emerson, but there are 11 children named Crumpton and 5 named Emerson who are, in theory, possible parents for said great-grandfather. (Realistically, I have narrowed it down to six sons named Crumpton and two daughters named Emerson.) I have 19 DNA matches for whom I know the ancestry, including 4 who descend from BOTH families (which confuses things wonderfully, with a doubled relationship to consider).

    I don’t hold much hope of figuring out which of the two daughters was the mother, as I don’t have any DNA matches descending from either, only from their siblings. But many of the matches I have seem unusually high for the relationships they should represent. (I have two 3C1R matches at 157 and 113 cM, respectively. They are 2nd great grandchildren of a Crumpton daughter who is not a potential parent, as she was married and having legitimate children through the whole period when my great-grandfather was born.)

    It’s those high matches that make me hesitant about the one pair of matches I have that may hold the key. There is a man (R.C.) who matches my mother at 294 cM, and his nephew matches her at 197 cM. If R.C.’s grandfather Jonathan Crumpton (one of the six Crumpton sons) is my great-grandfather’s parent, then my mother and R.C. would be half 1C1R (p=0.57), and the nephew would be my mother’s half 2C (p=0.46). If Jonathan Crumpton was not the parent, then R.C. would be a 2C1R (p=0.10), and the nephew would be a 3C (p=0.05). Both are possible, but the former seems more likely based on your tables. Yet there are so many other high matches, it’s hard to be sure.

    1. Sounds like an interesting case! The probability tables aren’t designed for multiple relationships, unfortunately. If the match is a double cousin at the same level (e.g., 3C twice over), you can just bump them up one category (e.g., double 3C becomes 2C1R), but if there are multiple relationships at different levels (e.g., 3C + 3C1R) the table doesn’t really apply. That’s an area that would benefit from simulation programs that would let us generate probability distributions for specific cases.

      1. I have real trouble with those double relationships. Marion Crumpton married Lucy Emerson. I know my great grandfather isn’t one of their legitimate offspring, but Marion is a potential as the father, and Lucy would be a sibling of whichever girl was the mother. So my great grandfather’s relationship to their children would be either 3/4 sibling, or double first cousin. Neither relationship is really covered in the charts, but I have a feeling that the difference in DNA would be insignificant.
