Science the Heck Out of Your DNA — Part 1

January 2, 2018 thednageek 40 Comments

Scroll down for links to other posts in this series.

Basic Probability

Last year, I wrote a post entitled “The Limits of Predicting Relationships Using DNA” in which I shared a statistical approach for deciding which relationship to an unknown DNA relative is most likely. For example, if I have a new match who shares 260 cM with me, the probability table in that post would tell me that there’s a 62% chance that they’re a second cousin (or equivalent), a 23% chance that they’re a second cousin once removed (or equivalent), and a 15% chance that they’re a first cousin once removed (or equivalent).

I can use this information to guide where to start looking in my family tree for the connection.
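As a minimal sketch of that idea in Python: the three probabilities are the ones quoted above for a 260 cM match, while the variable and function names are hypothetical placeholders for illustration and do not come from any tool.

```python
# Probabilities quoted above for a 260 cM match (from the table in the earlier post).
probabilities_260_cm = {
    "2C (or equivalent)": 0.62,
    "2C1R (or equivalent)": 0.23,
    "1C1R (or equivalent)": 0.15,
}


def rank_relationships(probabilities):
    """Return (relationship, probability) pairs, most likely first."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)


# The top entry is the place to start looking in the tree.
ranked = rank_relationships(probabilities_260_cm)
for relationship, probability in ranked:
    print(f"{relationship}: {probability:.0%}")
```

The ranking only says where to look first; the less likely relationships are still possible.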

Unknown parentage cases have a different challenge. There may be several DNA matches who all descend from a certain couple, but we don’t know where the searcher belongs in that family tree. Consider the example of Ruth, who has an unknown birth father:

Based on how much DNA Ruth shares with Katie (140 cM) and Marie (136 cM), she could fit into the tree in several different places. We need to test other descendants to narrow down the possibilities. Because the costs of DNA testing add up quickly and because each test can take weeks or months to process, we want our decisions to be guided by the best evidence available. How can we use the existing DNA matches to decide which scenario is most likely and, therefore, which branch of the tree to focus on first?

I am very excited about a new tool that I’ve been working on with Dr Andrew Millard and Jonny Perl to address that exact question. I had the inchoate idea of applying probabilities to situations more complex than one-to-one comparisons, Andrew created a working model that calculates the odds for specific scenarios, and Jonny has been creating an online tool to make the process easily accessible. I am going to describe that tool in a series of posts, beginning with this one.

What’s With the Title?

I’m sure you’re wondering why I titled this series “Science the Heck Out of Your DNA”. Short answer: It’s a (sorta) quote from the movie The Martian (sorta because he doesn’t say “heck”). I love the movie because the main character is an unrepentant science geek who uses his botany powers to survive on Mars. He also does a lot of math. And while botany powers aren’t likely to solve any genealogical brick walls, scientific thinking and math just might.

First, Some Basics

The probability table in my earlier post can tell me the likelihood that someone who shares a given amount of autosomal DNA with me is a 2nd cousin or 3rd cousin or whatever. But what if I have two DNA matches? Or three? Or more? Each match shares a different amount of DNA and could be related to me multiple ways. How do I figure out which scenario is most likely?

Here, we can turn to a basic rule of statistics: If you want to know the combined probability of two independent events (meaning, the chance that both events will happen), you simply multiply the probability of Thing 1 by the probability of Thing 2 (with apologies to Dr Seuss).

And if you want to know the probability of three, or four, or six, or more independent things all happening, you multiply the probability of the first by that of the second by that of the third … and so on. The product is called a “compound probability”. (We’ll come back to what “independent” means later.)
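The multiplication rule can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the coin-flip numbers are a textbook example rather than anything taken from the probability table or the tool.

```python
from math import prod


def compound_probability(probabilities):
    """Multiply the probabilities of independent events together."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(probabilities)


# Two fair coin flips both landing heads: 0.5 * 0.5
two_heads = compound_probability([0.5, 0.5])

# Three fair coin flips all landing heads: 0.5 * 0.5 * 0.5
three_heads = compound_probability([0.5, 0.5, 0.5])

print(two_heads, three_heads)
```

Note that the product can only shrink or stay the same as more events are added, and the rule is valid only when the events really are independent.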

In my next post, I’ll describe how we can apply these ideas to a simple genealogical problem. Subsequent posts will expand to more complex cases.

40 thoughts on “Science the Heck Out of Your DNA — Part 1”

Veronica Williams says:

January 2, 2018 at 4:20 pm

This sounds great! I think I have narrowed down the identity of my paternal great grandfather to one of 3 brothers. I have 40 matches feeding into this, mostly at Ancestry, only 5 in GEDmatch. I used the Maguire method and all ICW info to help firm this up. I feel quite confident that it’s one of the 3 brothers but I would love to know whether the numbers confirm it!

1. thednageek says:
  
  January 3, 2018 at 11:00 am
  
  Forty matches … wow! I’d love to know whether the tool confirms your suspicions.
  
  1. Veronica Williams says:
    
    January 6, 2018 at 10:13 pm
    
    Count me in for beta testing if you need it!
    
Douglas W Fisher says:

January 2, 2018 at 4:42 pm

Factoring in the variable that my birth parents were paternal 2nd cousins (they shared the same paternal great-grandfather), does it make sense that I match a female cousin at 997 cM on 38 segments whose maternal great-grandmother was my maternal grandmother, meaning we are 1st cousins 1x removed?

Best regards, Doug

1. thednageek says:
  
  January 3, 2018 at 10:57 am
  
  That’s beyond the upper range for a 1C1R. If the 2C connection between your birth parents was on your mother’s father’s side, it shouldn’t affect how much DNA you share with someone on your mother’s mother’s side. Is it possible that this match has another connection to you?
  
  1. Douglas W Fisher says:
    
    December 8, 2019 at 3:01 pm
    
    Since I posted this, I have confirmed who my birth parents are, and I also had help from one of the dna geek gurus.
    
    Best regards, Doug
    
Barbara Shoff says:

January 2, 2018 at 9:02 pm

And I love the new tool and have already been sharing it with the people I am working with. Fantastic job! Kudos to you, Dr. Millard and Jonny! I blogged about it the night I found it. Talk about keeping an old gal up with excitement. I can barely keep up and you have certainly helped with the improvement of the tools out there.

1. thednageek says:
  
  January 3, 2018 at 10:58 am
  
  Thank you for the kind words!
  
  1. Robin Wiggin says:
    
    December 8, 2019 at 1:32 pm
    
    Been trying to find my DNA; it’s been at least 7
    
Israel Pickholtz says:

January 2, 2018 at 10:14 pm

I suspect that this kind of analysis will not work wiyjin endogamous populations, where a total of, say, 200 cM of matching segments is likely to be distributed among several common ancestors in different directions.

1. Israel Pickholtz says:
  
  January 2, 2018 at 10:15 pm
  
  “wiyjin” = within
  
  1. Andrew Millard says:
    
    January 3, 2018 at 5:51 am
    
    Israel, you are right. Where there is sharing that is due to relationships not in the tree the calculation can’t be correct, but in principle the calculations can be extended to any complex tree with pedigree collapse. For endogamous populations the table of probabilities is likely to be wrong, and we might have to develop a different table, for example using the data in Lara Diamond’s tables https://larasgenealogy.blogspot.co.uk/2017/09/endogamy-closer-look-part-3.html (and I’m sure you have a similar dataset!)
    
cleaverkin says:

January 2, 2018 at 10:20 pm

We had a recent success identifying the biological father of one of two sisters (who we all thought were full sisters, but who turned out to be half). We were looking in the wrong part of the tree for a while, because it turned out that some of the cousins were double 2nd cousins (one pair of grandparents were the siblings of the other pair).

Although the statistical groups include double first cousins, they don’t include double 2nd, etc., so I’m not sure what the expected cM match range would be (in our case, the actual was almost exactly double the average expected for regular 2nd cousins).

Any chance that future versions of the statistical tables might include double 2nd and 3rd cousins?

1. thednageek says:
  
  January 3, 2018 at 11:06 am
  
  I certainly hope so! Ideally, the community can develop probability tables for specific scenarios (like an isolated case of double 2C) as well as for specific endogamous populations, like Ashkenazi Jews. I’d like to see a combination of crowdsourced empirical data like from the Shared cM Project and simulated data.
  
2. Andrew Millard says:
  
  January 4, 2018 at 10:26 am
  
  Double first cousins are a special case because they have a large proportion of the shared DNA in fully identical regions (FIR). Double second cousins have much less FIR and double 3rd cousins hardly any. It is a sufficient approximation to treat them as simply one group closer: second cousins are in Group E, so double second cousins are in Group D. Third cousins are in Group G, so double third cousins are in Group F. Tables like these will rarely suffice for complex multiple-cousinhood situations because there will rarely be enough data to reliably estimate the range. That is where we have to turn to simulations.
  
  1. cleaverkin says:
    
    January 4, 2018 at 1:08 pm
    
    Thanks, Andrew, that’s consistent with what I saw in the very few samples I had to work with. Good to know.
    
Ann says:

January 3, 2018 at 10:26 am

Do you need a guinea pig? I have 22 cousin matches in a family that I’m sure is my birth father’s. This tool would be awesome!

1. thednageek says:
  
  January 3, 2018 at 10:47 am
  
  We will be doing beta testing of the online tool soon.
  
Ann says:

January 3, 2018 at 10:49 am

Awesome!

Douglas W Fisher says:

January 3, 2018 at 12:03 pm

I only know that this female’s grandmother was a younger sister of my birth mother and that her maternal great-grandmother was my maternal grandmother. If there is another close relation that I am not aware of, then I hope to be able to discover that, but I have to wait for better communication with this family.

Pingback: This week’s crème de la crème — January 6, 2018 | Genealogy à la carte
Tyler Foster says:

January 6, 2018 at 3:56 pm

This is exactly what I need to narrow down who my grandfather is (my dad was adopted)! I look forward to the next blogs in this series!! Thank you!

Christopher Schuetz says:

January 6, 2018 at 5:41 pm

260cM!!!
This blog item’s advice is priceless for those with close adopted or unknown ancestors.
However, until recently, most discussion I saw was closer to my own case, where matches are 10-30cM. Out in 3rd – 5th cousin land. Unfortunately the overlap between possible options tends to make these statistics unhelpful out there. But I think it is where most people’s matches live, don’t they? Any advice for us? Plleeaase
If I see 260 cM I reach for 1) my address book, because I probably know them but not by the alias against their DNA identity, 2) my family tree, 3) family books of descendants of various ancestors.

1. thednageek says:
  
  January 18, 2018 at 2:37 pm
  
  Distant matches are best for ruling out relationships rather than ruling them in. Even a 0-cM match can be useful as it can rule out hypotheses that would put you as 2C or closer. When you only have distant matches, the best use of the probability approach is to guide your future testing. I’ll be talking about that in a future post (the one about Ruth).
  
Rebecca Nielsen says:

January 10, 2018 at 12:18 pm

I’d love to help test this tool. I have a great grandfather whose parentage is unknown. DNA testing has helped me determine that his grandparents were couples named Crumpton and Emerson, but there are 11 children named Crumpton and 5 named Emerson who are, in theory, possible parents for said great-grandfather. (Realistically, I have narrowed it down to six sons named Crumpton and two daughters named Emerson.) I have 19 DNA matches for whom I know the ancestry, including 4 who descend from BOTH families (which confuses things wonderfully, with a doubled relationship to consider).

I don’t hold much hope of figuring out which of the two daughters was the mother, as I don’t have any DNA matches descending from either, only from their siblings. But many of the matches I have seem unusually high for the relationships they should represent. (I have two 3C1R matches at 157 and 113 cM, respectively. They are 2nd great grandchildren of a Crumpton daughter who is not a potential parent, as she was married and having legitimate children through the whole period when my great-grandfather was born.)

It’s those high matches that make me hesitant about the one pair of matches I have that may hold the key. There is a man (R.C.) who matches my mother at 294 cM, and his nephew matches her at 197 cM. If R.C.’s grandfather Jonathan Crumpton (one of the six Crumpton sons) is my great-grandfather’s parent, then my mother and R.C. would be half 1C1R (p=0.57), and the nephew would be my mother’s half 2C (p=0.46). If Jonathan Crumpton was not the parent, then R.C. would be a 2C1R (p=0.10), and the nephew would be a 3C (p=0.05). Both are possible, but the former seems more likely based on your tables. Yet there are so many other high matches, it’s hard to be sure.

1. thednageek says:
  
  January 10, 2018 at 1:44 pm
  
  Sounds like an interesting case! The probability tables aren’t designed for multiple relationships, unfortunately. If the match is a double cousin at the same level (e.g., 3C twice over), you can just bump them up one category (e.g., double 3C becomes 2C1R), but if there are multiple relationships at different levels (e.g., 3C + 3C1R) the table doesn’t really apply. That’s an area that would benefit from simulation programs that would let us generate probability distributions for specific cases.
  
  1. Rebecca says:
    
    January 10, 2018 at 2:57 pm
    
    I have real trouble with those double relationships. Marion Crumpton married Lucy Emerson. I know my great grandfather isn’t one of their legitimate offspring, but Marion is a potential as the father, and Lucy would be a sibling of whichever girl was the mother. So my great grandfather’s relationship to their children would be either 3/4 sibling, or double first cousin. Neither relationship is really covered in the charts, but I have a feeling that the difference in DNA would be insignificant.
    
Pingback: Science the Heck Out of Your DNA — Part 3 – The DNA Geek
Pingback: Science the Heck Out of Your DNA — Part 4 – The DNA Geek
Pingback: Science the Heck Out of Your DNA — Part 2 – The DNA Geek
Pingback: Science the Heck Out of Your DNA — Part 6 – The DNA Geek
Pingback: Science the Heck Out of Your DNA — Part 5 – The DNA Geek
Pingback: Science the Heck Out of Your DNA — Part 7 – The DNA Geek
Pam Wolf says:

December 4, 2018 at 9:33 am

How can an unidentified person (abandoned/foundling) use their results from several DNA companies to locate their immediate family of origin, i.e., mother, father, brothers, sisters?

1. thednageek says:
  
  December 4, 2018 at 12:12 pm
  
  First, you need to find the connections between the top DNA matches (that is, how they are related to one another). The goal is to identify your possible ancestors. To use a simple example, if your top two matches are 2nd cousins to you and you find that they are descended from the same great-grandparent couple, that couple is probably your great grandparents, too. Then, you trace the descendants of that couple to see where you could possibly fit in. Usually, you’ll have to ask additional people to test for you to answer the question definitively. If you need help, I offer both consultations and research services.
  
Pingback: Do You Have a DNA Outlier? – The DNA Geek
Pingback: DNA & A Question of Paternity: Part 6 – Statistically testing hypothetical relationships – Genes & History
Pingback: Eine Anleitung für das WATO-Tool – Genetic Genealogy Girl
Pingback: Improving the Odds – The DNA Geek
Pingback: A Major Update to “What Are the Odds?” – The DNA Geek