Science the Heck Out of Your DNA — Part 2

Scroll down for links to other posts in this series.

I presented a talk on this method at the i4GG conference in December 2017. The video is available for purchase here, either individually or as part of the all-conference package.

Testing Hypotheses

In this second post of this series, I am going to use a completely fabricated example to show how we can use hypotheses to address genealogical questions. An hypothesis is just a possible explanation for something. It doesn’t need to be correct, but it does need to be testable, meaning that if it’s wrong, there must be a way to prove that it’s wrong.

 

In genealogy, our hypotheses are about relationships. For example, a traditional genealogist could form an hypothesis about who’s on the other side of a brick wall, like I did for my great grandfather’s parents. Or an adoptee might hypothesize who their birth parent is.

Note that “Who is my birth father?” is a perfectly reasonable question, but it’s not an hypothesis, because it can’t be disproven. To form an hypothesis from the question, the adoptee would need to have a specific person or family in mind, like: “My birth father is Marcus Burrell”, or “My birth mother was descended from Simon Washington and Marie Pritchard”.

 

What Doesn’t Kill You Makes You Stronger

If hypotheses had a motto, this would be it. Hypotheses are meant to be challenged. By a “challenge”, I mean an active attempt to prove that the hypothesis is wrong. An hypothesis that survives challenges is likely to be correct, and the more challenges it survives, the more confidence we can have in it.

Sometimes, we have competing hypotheses, meaning multiple possible explanations. Only one can be true, but we don’t know which one. Consider the fictional Joan, whose birth father is unknown. And suppose that we’ve already narrowed her father down to one of two brothers, Frank and Harry, with no other possibilities.  We could show the two hypotheses like this:

 

In the lingo, we would say that Hypothesis 1 (or H1) is that Frank is Joan’s father and Harry is her uncle, and Hypothesis 2 (H2) is that Harry is her father and Frank is her uncle.

 

Is Joan’s father Frank or Harry? (a Fictional Example)

We could test Joan’s hypotheses by comparing her autosomal DNA results to those of either Frank or Harry. A biological parent shares about 3400 cM of DNA with a child, while an uncle (or aunt) shares between 1100 and 2350 cM. If the man we test shares 3400 cM with Joan, then he’s her father and the other is her uncle, but if he shares, say, 1900 cM, he’s her uncle and his brother is her father. (Recall that in this simple scenario, there are only two possibilities for Joan’s father.)
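The decision rule in this simple scenario can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes that any amount above the uncle range quoted above counts as a parent-level match, which is reasonable here only because we have already limited the possibilities to two men.

```python
# Uncle/aunt range quoted in the post, in centimorgans.
UNCLE_RANGE_CM = (1100, 2350)

def father_given_test(tested_man, shared_cm):
    """Return the implied father ('Frank' or 'Harry'), or None if the amount fits neither role.

    Assumption: anything above the uncle range is treated as a parent-level match.
    """
    other_brother = {"Frank": "Harry", "Harry": "Frank"}[tested_man]
    low, high = UNCLE_RANGE_CM
    if shared_cm > high:
        return tested_man      # too much shared DNA for an uncle, so he is the father
    if low <= shared_cm <= high:
        return other_brother   # uncle-level match, so his brother is the father
    return None                # too little DNA for either role; revisit the hypotheses

print(father_given_test("Frank", 3400))  # Frank
print(father_given_test("Frank", 1900))  # Harry
```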

Things get more complicated if Frank and Harry have passed away or are not willing to help. In that case, we’d turn to one of their children or grandchildren. I say “complicated” because we might be able to prove that one of the hypotheses is wrong, but we also might get a result that’s not definitive. Consider the case in which we test Frank’s son. He would either be a half brother to Joan (if H1 is true) or a first cousin (if H2 is true). The range of shared DNA from Blaine Bettinger’s Shared cM Project for a half sibling is about 1100–2350 cM, and for a first cousin it’s about 440–1500 cM. The problem is that between 1100 and 1500 cM, either relationship would fit the data.
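To make the overlap concrete, here is a small Python sketch that checks which relationships are consistent with a given amount of shared DNA. It uses only the approximate ranges quoted above; the function and variable names are mine, chosen for illustration.

```python
# Approximate shared-cM ranges quoted in the post (not authoritative values).
RANGES_CM = {
    "half sibling": (1100, 2350),
    "first cousin": (440, 1500),
}

def consistent_relationships(shared_cm, ranges=RANGES_CM):
    """Return the relationships whose quoted range contains shared_cm."""
    return [name for name, (low, high) in ranges.items() if low <= shared_cm <= high]

def overlap(range_a, range_b):
    """Return the (low, high) overlap of two ranges, or None if they do not overlap."""
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    return (low, high) if low <= high else None

print(consistent_relationships(1900))  # ['half sibling']
print(consistent_relationships(1300))  # ['half sibling', 'first cousin']
print(consistent_relationships(600))   # ['first cousin']
print(overlap(RANGES_CM["half sibling"], RANGES_CM["first cousin"]))  # (1100, 1500)
```

A result with one entry disproves the other hypothesis; a result with two entries is the ambiguous zone.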

There is even more overlap in the possibilities if we had to test the grandchildren of Frank and Harry. This diagram visualizes the problem. There are some centimorgan ranges that give you a definitive answer and some that don’t.

 

Let’s think about what happens in those ambiguous areas. Although 1500 cM is technically in the overlap zone between half sibling and first cousin, it’s near the middle of the range for half siblings and at the very end of the range for first cousins, so that match could be either relationship but is more likely to be a half sibling.

How much more likely? This is where probabilities come into play. At 1500 cM, there’s a 99% chance that the match is in the half-sibling range and only a 1% chance that they’re in the first-cousin range. (To see how I’m getting the numbers, read this post.) Assuming no other facts ruled out the half-sibling relationship—like, if the probable birth father was overseas at the time of conception—I would feel pretty confident concluding that the first-cousin hypothesis was wrong and that the half-sibling one was correct.

 

Not So Fast!

You didn’t think I was gonna let you off that easily, right? Let’s consider the worst-case scenario: we test Frank’s child, and the shared amount is 1300 cM. Now look at this graph, which I modified from Figure 5.2 in the AncestryDNA Matching White Paper.

 

The dark green line shows how likely a given centimorgan amount is to be in the half-sibling range (the higher the line, the greater the chance), and the red line shows how likely a given centimorgan amount is to be in the first-cousin range. The bracketed range shows where the amounts for half sibling and first cousin overlap. The two lines cross at 1300 cM (circled in black), meaning that each relationship is equally likely. A match of 1300 cM wouldn’t help us at all in determining whether Frank or Harry was the father of Joan.

While we’re looking at this graph, consider what’s happening at 100 cM. Five different relationship lines span that point, meaning that a match who shares 100 cM could be in any of five relationship categories, ranging from the 2nd cousin group (yellow line) to distant cousins (purple line). This is an important point: the lower the amount of shared DNA, the more possible relationships could be represented and the less we can conclude based on that one DNA match alone.

 

Now Back to Joan. Remember Joan?

Let’s pretend that Frank’s son shares 1350 cM with Joan. That’s an unfortunate amount, as there’s a 74% chance that he’s a half brother to Joan and a 26% chance that he’s a first cousin. In other words, it’s only about three times more likely that Frank is her father than that Harry is.

Imagine that we also test Harry’s son, and he shares 1200 cM with Joan.  That’s also in an overlap zone, with a 16% chance of being a half brother and an 84% chance of being a first cousin. Those individual odds would not satisfy me on an issue as important as parentage.
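For anyone who wants to check the "times more likely" figures, the odds for each tested man are just one probability divided by the other. A minimal Python sketch, using the percentages from this example (the helper name is hypothetical):

```python
def odds(p_for, p_against):
    """Odds in favor of one relationship over the other."""
    return p_for / p_against

# Frank's son at 1350 cM: half brother (0.74) vs. first cousin (0.26).
frank_son_odds = odds(0.74, 0.26)
# Harry's son at 1200 cM: first cousin (0.84) vs. half brother (0.16).
harry_son_odds = odds(0.84, 0.16)

print(round(frank_son_odds, 2))  # 2.85
print(round(harry_son_odds, 2))  # 5.25
```

Both results favor Frank as the father, but neither is strong on its own.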

 

 

I intentionally created an ambiguous situation to demonstrate how we can combine the individual probabilities to get a clearer picture. Recall from the first post in this series that we multiply the probabilities of independent events when we want to know the odds of both (or more) of them happening. The result is a compound probability.

If H1 (that Frank is Joan’s father) is true, then Frank’s son is her half brother (74% chance, or 0.74 probability) and Harry’s son is her first cousin (84%, or 0.84). The compound probability is 0.74 × 0.84 ≈ 0.62.

On the other hand, if H2 (that Harry is Joan’s father) is true, then Frank’s son is her first cousin (0.26 probability) and Harry’s son is her half brother (0.16). The compound probability is 0.26 × 0.16 ≈ 0.04.

Put another way, when I consider both DNA matches together, H1 is about 15 times more likely than H2, which is a much more satisfying conclusion than we could draw from either relative alone.
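The arithmetic for both hypotheses can be sketched in Python as follows. The variable names are hypothetical, and the final normalization step is not part of the calculation above; it follows the suggestion made by readers in the comments and assumes that exactly one of the two hypotheses is true.

```python
from math import prod

# Per-match probabilities from this example, grouped by hypothesis.
match_probs = {
    "H1": [0.74, 0.84],  # Frank's son = half brother, Harry's son = first cousin
    "H2": [0.26, 0.16],  # Frank's son = first cousin, Harry's son = half brother
}

# Multiply the independent probabilities to get a compound value per hypothesis.
compound = {h: prod(ps) for h, ps in match_probs.items()}

# How many times more likely H1 is than H2.
odds_h1_vs_h2 = compound["H1"] / compound["H2"]

# Assumption: only H1 or H2 can be true, so rescale the two values to sum to 1.
total = sum(compound.values())
normalized = {h: value / total for h, value in compound.items()}

print({h: round(v, 2) for h, v in compound.items()})    # {'H1': 0.62, 'H2': 0.04}
print(round(odds_h1_vs_h2, 1))                          # 14.9
print({h: round(v, 2) for h, v in normalized.items()})  # {'H1': 0.94, 'H2': 0.06}
```

Adding another independent match only means appending one more probability to each list.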

The beauty of this approach is that if we need more confidence in a conclusion, we can test more people (assuming we can find good candidates who are willing). As long as the DNA matches are independent of one another, we can simply look up the individual probability that each centimorgan amount fits the suspected relationship and then multiply that value by the others to get a compound probability for that hypothesis.

Independence is a key factor here. Because a child cannot share more DNA with Joan than their parent does (barring multiple lines of relationship), the child is not independent of the parent. That means that, from a mathematical standpoint, we couldn’t factor in both Frank’s son who shares 1350 cM and also a child of that son. We could, however, include grandchildren of Frank through his other children (nieces and nephews of Mr. 1350cM).

To simplify the rule of independence for this approach: If you match both a parent and their child (or children), ignore the data for the child(ren) and use only the match to the parent.
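As a rough sketch, that rule could be applied to a list of matches like this. Everything here is illustrative: the function name, the `parent_of` lookup, and the centimorgan amounts for the two grandchildren are made up for the example, and the rule itself is the simplified one stated above rather than a full treatment of dependence.

```python
def independent_matches(matches, parent_of):
    """Drop any match who has a parent, grandparent, etc. also in the match list."""
    kept = {}
    for name, shared_cm in matches.items():
        ancestor = parent_of.get(name)
        has_tested_ancestor = False
        # Walk up the tree until we find a tested ancestor or run out of known parents.
        while ancestor is not None:
            if ancestor in matches:
                has_tested_ancestor = True
                break
            ancestor = parent_of.get(ancestor)
        if not has_tested_ancestor:
            kept[name] = shared_cm
    return kept

# Hypothetical data: only the 1350 cM figure comes from the example above.
matches = {
    "Frank's son": 1350,
    "child of Frank's son": 700,
    "grandchild via Frank's daughter": 650,
}
parent_of = {
    "Frank's son": "Frank",
    "Frank's daughter": "Frank",
    "child of Frank's son": "Frank's son",
    "grandchild via Frank's daughter": "Frank's daughter",
}

print(independent_matches(matches, parent_of))
# {"Frank's son": 1350, "grandchild via Frank's daughter": 650}
```

The child of Mr. 1350cM is dropped because his parent was tested, while the grandchild through Frank's untested daughter stays in.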

 

In the next post in the series, I’ll describe a fabulous tool for looking up the probabilities.

 

Other posts in this series can be found here:


16 thoughts on “Science the Heck Out of Your DNA — Part 2”

  1. I do not see where you get 74% being the half sibling of Frank’s child. Similarly, I do not get the 26% being 1C of Frank’s child.

  2. > We could, however, include grandchildren of Frank through other of his children (nieces and nephews of Mr. 1350cM).

    I am not sure if I correctly understand this statement. Are you saying that the amount of DNA shared between Joan and Mr. 1350cM is independent of the amount of DNA shared between Joan and someone who is both a grandchild of Frank as well as a niece or nephew of Mr. 1350cM?

    More generally, are you saying the following? Let Joan share X cM of DNA with person A and Y cM of DNA with person B. Then a sufficient condition for X and Y to be independent is that A is neither a descendant nor an ancestor of B?

    1. For our purposes, yes, two matches can be considered independent as long as one is not a direct descendant of the other. (Strictly speaking, it’s more complicated than that. We’re working on some math to account for the partial dependence among, say, a set of siblings who are matches to the target person.)

      1. Ok, great. It is encouraging to hear that my understanding of independence for this situation is correct. I would love to discuss this with you in more detail. I wanted to do so by contacting you through your contact page (https://thednageek.com/contact/), but the contact form on it is broken.

  3. OK, I have a problem with the math in the case above. I think the problem is that H1 and H2 are not independent events (like the toss of a coin). They are mutually exclusive.

    The sum of the probabilities of all possible outcomes should always be 100%. In this case the probability that H1 is correct and H2 is incorrect is 62%. The probability that H2 is correct and H1 is incorrect is 4%. The sum is only 66%. To get to 100% you have to add in the probability that both are correct (.74 x .16) = 12% and the probability that both are incorrect (.26 x .84) = 22%.

    But in fact H1 and H2 cannot both be incorrect, or both be correct. So you really only have two possible outcomes, totalling 66%, of which the probability that H1 is correct is 62/66 or 94% and the probability that H2 is correct is 4/66 or 6%.

    Yes, that is 15 times more likely as you said. But calling it a *probability* of 62% versus a probability of 4% is somehow not sitting right with me.

    1. That’s a great question, and I’m not enough of a statistician to answer it well. The fact that only one hypothesis can be correct is intrinsic to the biology, not the math. It’s a bit like saying “I’m going to flip two coins, but I’m only interested in the probabilities of flipping all heads or all tails.” You’ll still get 25% heads, 25% tails, and 50% mixed, but you’re only looking at the two 25% probabilities.

      1. Here is the analogy with (fair) coin flips that I would give. First, for notation, let P(H) be the probability of hypothesis H. Suppose hypothesis H1 is “the coin flip is heads” and hypothesis H2 is also “the coin flip is heads”. Then P(H1) = P(H2) = 0.5 = 50%. Now consider the combined hypotheses (C1) H1 is false and H2 is false, (C2) H1 is false and H2 is true, (C3) H1 is true and H2 is false, and (C4) H1 is true and H2 is true.

        We need to consider both logic and probability theory in order to correctly compute the probabilities of the combined hypotheses. If we only considered probability theory and ignored logic, then P(C1) = P(C2) = P(C3) = P(C4) = 0.5 * 0.5 = 0.25 = 25%. However, from logic, we know that C2 and C3 are impossible, so in fact P(C2) = P(C3) = 0 = 0%. That doesn’t mean that our calculations that P(C1) = P(C4) = 0.25 are incorrect or wasted.

        Now I am much more comfortable speaking about discrete probability theory, which is the case here. These numbers, which you correctly pointed out are no longer probabilities, are called weights. We can recover probabilities by normalizing. For a given weight w, the corresponding probability is w / sum_i w_i. In words, the probability is w divided by the sum of all the weights involved. In our case, P(C1) = P(C4) = 0.25 / (0.25 + 0.25) = 0.5 = 50%.

        Another way to compare two probabilities is via odds, which is essentially to divide one probability by the other. Notice that if we started with weights w_1 and w_2, then we need not first normalize to obtain probabilities. Directly dividing the weights to get w_1 / w_2 is the same as first obtaining probabilities w_1 / (w_1 + w_2) and w_2 / (w_1 + w_2) and then dividing to get (w_1 / (w_1 + w_2)) / (w_2 / (w_1 + w_2)) = (w_1 / (w_1 + w_2)) * ((w_1 + w_2) / w_2) = w_1 / w_2.

        And that is what thednageek did in her post. She went directly from the weights 0.62 and 0.04 to obtain the odds of 15 to 1.
