Scroll down for links to other posts in this series.
I presented a talk on this method at the i4GG conference in December 2017. The video is available for purchase here, either individually or as part of the all-conference package.
In this second post of the series, I am going to use a completely fabricated example to show how we can use hypotheses to address genealogical questions. An hypothesis is just a possible explanation for something. It doesn’t need to be correct, but it does need to be testable, meaning that if it’s wrong, there must be a way to prove that it’s wrong.
In genealogy, our hypotheses are about relationships. For example, a traditional genealogist could form an hypothesis about who’s on the other side of a brick wall, like I did for my great grandfather’s parents. Or an adoptee might hypothesize who their birth parent is.
Note that “Who is my birth father?” is a perfectly reasonable question, but it’s not an hypothesis, because it can’t be disproven. To form an hypothesis from the question, the adoptee would need to have a specific person or family in mind, like: “My birth father is Marcus Burrell”, or “My birth mother was descended from Simon Washington and Marie Pritchard”.
What Doesn’t Kill You Makes You Stronger
If hypotheses had a motto, this would be it. Hypotheses are meant to be challenged. By a “challenge”, I mean an active attempt to prove that the hypothesis is wrong. An hypothesis that survives challenges is likely to be correct, and the more challenges it survives, the more confidence we can have in it.
Sometimes, we have competing hypotheses, meaning multiple possible explanations. Only one can be true, but we don’t know which one. Consider the fictional Joan, whose birth father is unknown. And suppose that we’ve already narrowed her father down to one of two brothers, Frank and Harry, with no other possibilities. We could show the two hypotheses like this:
In the lingo, we would say that Hypothesis 1 (or H1) is that Frank is Joan’s father and Harry is her uncle, and Hypothesis 2 (H2) is that Harry is her father and Frank is her uncle.
Is Joan’s father Frank or Harry? (a Fictional Example)
We could test Joan’s hypotheses by comparing her autosomal DNA results to those of either Frank or Harry. A biological parent shares about 3400 cM of DNA with a child, while an uncle (or aunt) shares between 1100 and 2350 cM. If the man we test shares 3400 cM with Joan, then he’s her father and the other is her uncle, but if he shares, say, 1900 cM, he’s her uncle and his brother is her father. (Recall that in this simple scenario, there are only two possibilities for Joan’s father.)
Things get more complicated if Frank and Harry have passed away or are not willing to help. In that case, we’d turn to one of their children or grandchildren. I say “complicated” because we might be able to prove that one of the hypotheses is wrong, but we also might get a result that’s not definitive. Consider the case in which we test Frank’s son. He would either be a half brother to Joan (if H1 is true) or a first cousin (if H2 is true). The range of shared DNA from Blaine Bettinger’s Shared cM Project for a half sibling is about 1100–2350 cM, and for a first cousin it’s about 440–1500 cM. The problem is that between 1100 and 1500 cM, either relationship would fit the data.
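Here is a minimal sketch of that range check in Python, using only the approximate ranges quoted above. The names `RANGES`, `overlap`, and `possible_relationships` are invented for this illustration and don't come from any genealogy tool.

```python
# Approximate shared-cM ranges quoted in this post (not authoritative values).
RANGES = {
    "half sibling": (1100, 2350),
    "first cousin": (440, 1500),
}


def overlap(range_a, range_b):
    """Return the (low, high) overlap of two cM ranges, or None if there is none."""
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    return (low, high) if low <= high else None


def possible_relationships(shared_cm, ranges=RANGES):
    """List every relationship whose range contains the shared amount."""
    return [name for name, (low, high) in ranges.items() if low <= shared_cm <= high]


print(overlap(RANGES["half sibling"], RANGES["first cousin"]))  # (1100, 1500)
print(possible_relationships(1900))  # ['half sibling']
print(possible_relationships(1300))  # ['half sibling', 'first cousin']
print(possible_relationships(800))   # ['first cousin']
```

A result with one entry disproves the other hypothesis. A result with two entries is the ambiguous case, where the ranges alone can't settle the question.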
There is even more overlap in the possibilities if we had to test the grandchildren of Frank and Harry. This diagram visualizes the problem. There are some centimorgan ranges that give you a definitive answer and some that don’t.
Let’s think about what happens in those ambiguous areas. Although 1500 cM is technically in the overlap zone between half sibling and first cousin, it’s near the middle of the range for half siblings and at the very end of the range for first cousins, so that match could be either relationship but is more likely to be a half sibling.
How much more likely? This is where probabilities come into play. At 1500 cM, there’s a 99% chance that the match is in the half-sibling range and only a 1% chance that they’re in the first-cousin range. (To see how I’m getting the numbers, read this post.) Assuming no other facts ruled out the half-sibling relationship—like, if the probable birth father was overseas at the time of conception—I would feel pretty confident concluding that the first-cousin hypothesis was wrong and that the half-sibling one was correct.
Not So Fast!
You didn’t think I was gonna let you off that easily, right? Let’s consider the worst case scenario: We test Frank’s child, and the shared amount is 1300 cM. Now look at this graph, which I modified from Figure 5.2 in the AncestryDNA Matching White Paper.
The dark green line shows how likely a given centimorgan amount is to be in the half-sibling range (the higher the line, the greater the chance), and the red line shows how likely a given centimorgan amount is to be in the first-cousin range. The bracketed range shows where the amounts for half sibling and first cousin overlap. The two lines cross at 1300 cM (circled in black), meaning that each relationship is equally likely. A match of 1300 cM wouldn’t help us at all in determining whether Frank or Harry was the father of Joan.
While we’re looking at this graph, consider what’s happening at 100 cM. Five different relationship lines span that point, meaning that a match who shares 100 cM could be in any of five relationship categories, ranging from the 2nd cousin group (yellow line) to distant cousins (purple line). This is an important point: the lower the amount of shared DNA, the more possible relationships could be represented and the less we can conclude based on that one DNA match alone.
Now Back to Joan. Remember Joan?
Let’s pretend that Frank’s son shares 1350 cM with Joan. That’s an unfortunate amount, as there’s a 74% chance that he’s a half brother to Joan and a 26% chance that he’s a first cousin. In other words, it’s only about three times more likely that Frank is her father than Harry.
Imagine that we also test Harry’s son, and he shares 1200 cM with Joan. That’s also in an overlap zone, with a 16% chance of being a half brother and an 84% chance of being a first cousin. Those individual odds would not satisfy me on an issue as important as parentage.
I intentionally created an ambiguous situation to demonstrate how we can combine the individual probabilities to get a clearer picture. Recall from the first post in this series that we multiply the probabilities of independent events when we want to know the odds of all of them happening. The result is a compound probability.
If H1 (that Frank is Joan’s father) is true, then Frank’s son is her half brother (74% chance, or 0.74 probability) and Harry’s son is her first cousin (84%, or 0.84). The compound probability is 0.74 × 0.84 = 0.62.
On the other hand, if H2 (that Harry is Joan’s father) is true, then Frank’s son is her first cousin (0.26 probability) and Harry’s son is her half brother (0.16). The compound probability is 0.26 × 0.16 = 0.04.
Put another way, when I consider both DNA matches together, H1 is about 15 times more likely than H2, which is a much more satisfying conclusion than we could draw from either relative alone.
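The arithmetic for Joan's two hypotheses can be sketched in a few lines of Python. The probabilities are the ones quoted above. The dictionary layout and the name `compound_probability` are just assumptions made for this illustration.

```python
from math import prod

# Probability that each tested match fits the relationship each hypothesis requires.
# H1: Frank is Joan's father. H2: Harry is Joan's father.
match_probabilities = {
    "H1": {
        "Frank's son (1350 cM) as half brother": 0.74,
        "Harry's son (1200 cM) as first cousin": 0.84,
    },
    "H2": {
        "Frank's son (1350 cM) as first cousin": 0.26,
        "Harry's son (1200 cM) as half brother": 0.16,
    },
}


def compound_probability(probabilities):
    """Multiply the probabilities of independent matches together."""
    return prod(probabilities)


compound = {
    hypothesis: compound_probability(matches.values())
    for hypothesis, matches in match_probabilities.items()
}
ratio = compound["H1"] / compound["H2"]

print(round(compound["H1"], 2))  # 0.62
print(round(compound["H2"], 2))  # 0.04
print(round(ratio))              # 15
```

Adding another independent match means adding one more entry under each hypothesis. The multiplication stays the same.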
The beauty of this approach is that if we need more confidence in a conclusion, we can test more people (assuming we can find good candidates who are willing). As long as the DNA matches are independent of one another, we can simply look up the individual probability that each centimorgan amount fits the suspected relationship and then multiply that value by the others to get a compound probability for that hypothesis.
Independence is a key factor here. Because a child cannot share more DNA with Joan than their parent does (barring multiple lines of relationship), the child is not independent of the parent. That means that, from a mathematical standpoint, we couldn’t factor in both Frank’s son who shares 1350 cM and also a child of that son. We could, however, include grandchildren of Frank through his other children (nieces and nephews of Mr. 1350cM).
To simplify the rule of independence for this approach: If you match both a parent and their child (or children), ignore the data for the child(ren) and use only the match to the parent.
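As a rough sketch, that rule amounts to a simple filter over a match list. Everything in this example is hypothetical: the layout of `matches`, the function name `independent_matches`, and the centimorgan amounts for the two grandchildren, which I made up purely for illustration.

```python
# Hypothetical match list: name -> (shared cM, name of that person's parent, or None).
# The 700 and 650 cM amounts are invented for this sketch.
matches = {
    "Frank's son": (1350, None),
    "Child of Frank's son": (700, "Frank's son"),
    "Frank's grandchild via his daughter": (650, "Frank's daughter"),
}


def independent_matches(matches):
    """Keep only the matches whose parent is not also in the match list."""
    return {
        name: shared_cm
        for name, (shared_cm, parent) in matches.items()
        if parent not in matches
    }


for name, shared_cm in independent_matches(matches).items():
    print(f"{name}: {shared_cm} cM")
```

Here the child of Frank's son is dropped because his parent is already in the list. The grandchild through Frank's daughter stays, because she herself has not tested.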
In the next post in the series, I’ll describe a fabulous tool for looking up the probabilities.
Other posts in this series can be found here:
- Part 1 — Basic Probability
- Part 2 — Testing Hypotheses (you’re here!)
- Part 3 — DNA Painter Look-up Tool
- Part 4 — GIs in Germany: Which Brother?
- Part 5 — Ruth: Using Probability to Guide Targeted Testing
- Part 6 — Ted, or When Close Targets Aren’t Available
- Part 7 (not yet published)