Science the Heck Out of Your DNA — Part 4

Scroll down for links to other posts in this series.

I presented a talk on this method at the i4GG conference
in December 2017. The video is available for purchase here,
either individually or as part of the all-conference package.

GIs in Germany: Which Brother?

In late 1919 or early 1920, an American GI had an encounter with an adolescent girl from a rural village in Germany. Nine months later, a son was born. The young woman grew into an old one and passed away without ever telling her son, E, or her grandchildren who the GI was.

 

Enter DNA

J, a grandchild of this liaison, turned to DNA testing to identify his unknown grandfather. By the time I consulted on this case, his grandfather had been narrowed down to one of two American brothers, LD and PD. At the time of conception, PD was 19 and stationed in the same village as the girl, while 27-year-old LD’s unit was only 15 km away. Proximity would favor PD as the father, but LD could easily have visited the village to see his kid brother and met the young lady there.

The autosomal data pointing to the brothers consisted of five DNA matches to members of the same American family. For a scientific approach, we express the two possibilities as hypotheses. The two hypotheses are labeled in red in the McGuire chart below: Hypothesis 1 (H1) is that PD was J’s grandfather and Hypothesis 2 (H2) is that LD was.

 

At first glance, the DNA results are problematic, because the match to CD2 (493 cM) suggests a half 1st cousin relationship, which would support H1, while the match to GD (855 cM) fits a half uncle relationship, supporting H2. The match to DH (318 cM) is right in the middle of the expected values for a half 1st cousin (H2) and a 2nd cousin (H1), so that’s not much help either.

So which brother was it?

Tobias Kemper, a German genealogist, asked for feedback on this case in the Facebook group Genetic Genealogy Tips & Techniques. It was a perfect opportunity for the probability approach. For each hypothesis, I determined the relationships between J and his DNA matches, manually looked up each probability, then multiplied them to get the compound probability. (Note that the relationships to CF and LE don’t change under either hypothesis.)
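For readers who like to see the arithmetic, the multiplication step can be sketched in a few lines of Python. The per-match probabilities below are made-up placeholders, not the values from J's case (the real ones come from the lookup table), and the function name is my own invention. The last two values are identical in both lists to mirror the fact that the relationships to CF and LE are the same under either hypothesis.

```python
import math

def compound_probability(match_probabilities):
    """Multiply the per-match probabilities for one hypothesis.

    This assumes the matches are statistically independent of one another.
    """
    return math.prod(match_probabilities)

# Hypothetical placeholder values, one per DNA match (NOT J's real numbers).
h1_probs = [0.5, 0.2, 0.4, 0.1, 0.3]
h2_probs = [0.4, 0.3, 0.5, 0.1, 0.3]

h1_compound = compound_probability(h1_probs)
h2_compound = compound_probability(h2_probs)
print(f"H1: {h1_compound:.5f}  H2: {h2_compound:.5f}")
```

Even with only five matches, the products are already down in the thousandths.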

 

A few observations: First, all those lookups are exceedingly boring, and each one has the potential for a transcription error that would mess with downstream calculations. Second, the compound probabilities get really small—and hard to read—really fast. Third, 0.00027 vs 0.00042 doesn’t mean much to the average person; we need a metric that’s easier to understand.

 

Jonny Perl’s Probability Test Tool

The solution to all three problems can be found in another fabulous resource from Jonny Perl: the Probability Test Tool, which converts the lookup table and calculations we’ve already been using into a handy online calculator.  It looks something like this when you first open it:

 

For J’s case, we click “Add new relationship hypothesis” to incorporate Hypothesis 2 and then click “Add new match” five times, once for each DNA match. Then, we fill in the comparisons, cM values, and relationships from the table above. The relationship fields have a handy pull-down menu that lets you select the correct designation.

 

When we click “Calculate Probabilities”, each individual probability will appear below the respective relationship.

 

Even better, the tool will calculate the compound probabilities, rank them from most to least likely, and also compute the odds of each one. This information is presented above the data input table. Odds are far more intuitive than compound probabilities; the larger the odds, the more likely the hypothesis is.

Here’s what that output looks like for the two GIs in Germany. (You can ignore those massive decimals in parentheses; focus on the odds.)

 

We see that the odds are about 1.5 for H2 (that LD was J’s grandfather) and 1.0 for H1 (that PD was J’s grandfather).
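Those odds can be reproduced from the compound probabilities quoted earlier (0.00027 and 0.00042). The sketch below assumes the odds are simply each compound probability divided by the smallest one. That assumption matches the numbers reported here, but I have not checked it against the tool's internals, and the function name is hypothetical.

```python
def odds_from_compound(compound):
    """Scale compound probabilities so the least likely hypothesis has odds 1.0."""
    baseline = min(compound.values())
    if baseline <= 0:
        raise ValueError("every compound probability must be greater than zero")
    scaled = {name: p / baseline for name, p in compound.items()}
    # Rank from most to least likely.
    return dict(sorted(scaled.items(), key=lambda item: item[1], reverse=True))

# Compound probabilities quoted earlier in this post. Which value belongs to
# which hypothesis is inferred from the odds the tool reported.
compound = {"H1 (PD)": 0.00027, "H2 (LD)": 0.00042}
odds = odds_from_compound(compound)
for name, value in odds.items():
    print(f"{name}: {value:.2f}")
```

This gives roughly 1.56 for H2 and exactly 1.00 for H1, in line with the "about 1.5" shown by the tool.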

 

O Brother, Which Art Thou?

The difference in odds between H1 (1.0) and H2 (1.5) isn’t nearly enough to draw a conclusion about which brother was J’s grandfather. Ideally, we’d like one hypothesis to have odds at least 10 or 20 times greater than the other hypotheses.

We needed more data.

Fortunately, J’s sister C was willing to test. The McGuire chart below shows C’s cM shares to CD2, DH, GD, and LE. (She and CF tested at different sites and could not be compared to one another. This doesn’t affect our hypothesis testing at all, because CF is a 2C1R to C either way and can’t help distinguish between H1 and H2.)

 

Next, we add these four additional comparisons to the ones we already had in Jonny Perl’s probability tool.

 

What does this tell us?  Let’s look.

 

Well! Well! Well! That certainly changes things. Not only is H1 now favored instead of H2, but H1 is favored strongly: 124-to-1 odds. Recall that we want one hypothesis to have odds at least 10–20 times greater than the other. We’re well beyond that limit!
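The "10 to 20 times" rule of thumb is easy to express as a check. This is only a sketch: the function name and the default threshold of 10 are my choices for illustration, and the rule itself is a guideline rather than a formal statistical test.

```python
def is_decisive(odds, threshold=10.0):
    """Return True when the top hypothesis beats the runner-up by `threshold` times."""
    ranked = sorted(odds.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two hypotheses to compare")
    return ranked[0] >= threshold * ranked[1]

j_only = {"H1": 1.0, "H2": 1.5}      # odds with J's matches alone
j_and_c = {"H1": 124.0, "H2": 1.0}   # odds after adding C's matches

print(is_decisive(j_only))                  # False: 1.5-to-1 is far too close
print(is_decisive(j_and_c))                 # True at the 10x threshold
print(is_decisive(j_and_c, threshold=20))   # True at the stricter 20x threshold
```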

We can conclude that PD was J’s grandfather. This agrees with the circumstantial evidence that he was stationed in the young lady’s village and that his age was closer to hers.

 

More Data Is Better Data

If the motto of this story is Probabilities Rule!, the moral is that more data is better. Recall that when we first ran the numbers with just J’s data, the LD hypothesis (H2) was supported, but only weakly. Because the odds were only 1.5-to-1.0, we couldn’t draw a firm conclusion. However, when we added J’s sister C, not only was the other hypothesis favored (H1, that PD was the grandfather), the odds strongly supported it. This is a very important point! Weak support for a hypothesis can be used to guide future testing, but it shouldn’t be used to draw conclusions.

 

Just a few days ago, Tobias messaged me with new information. After we’d identified PD as J and C’s grandfather, DH’s brother decided to test. BH shares 335 cM with J and 222 cM with C. These two new comparisons factored into the overall calculations shifted the odds from 124-to-1 to 791-to-1 in favor of the PD hypothesis.

Even better!

 

Other posts in this series can be found here:

34 thoughts on “Science the Heck Out of Your DNA — Part 4”

  1. Thank you for this great post, Leah. This is the information that I was looking for when I contacted you because I did not do a great job of note taking at your presentation at the i4GG conference.

  2. An excellent working example here – thank you!! Exactly my situation too (looking for my grandfather).

    I don’t have any brothers or sisters who could test (like in your example, where J’s sister C tested) but rather have my father and myself tested. Could I enter matches and cMs for matches common to both him and to me to increase my data, or is a match to the parent and to the child not independent probabilities that can be multiplied together?

    1. You should use only your father in this case. Because all of the DNA that you inherited from your grandfather came through your father (I assume you’re looking for your paternal grandfather), you’re not independent of your father. These statistical tests assume independence.

      1. Thank you! That makes total sense.

        One more question, and I promise no more!
        My father’s strongest match is doubly related; they are likely 2C (through his match’s paternal great-grand) *and* 4C (through his match’s maternal 3rd great). Their total shared cM is 340 cM.

        Now, my father has another different 4C match through the same 3rd-great grandparent, and that is 89cM. Together the three form a matching triangle, and I assume that the 89cM was passed on to all three from this 3rd great grandparent couple.

        Question: should I enter *both* 2C and 4C matches in the probability calculator separately as 340-89=251 cM (2C) *and* 89 cM (4C)?
        or
        should I just enter the total 340cM and a 2C relationship?

        1. There’s no good answer to that. The probability distributions on which this method is based assume that each match is related only one way. I would go with just 340/2C, but I might demand a higher odds ratio before accepting one hypothesis over the other. There’s no road map for this, unfortunately.

  3. I plugged all the matches in for my hypotheses on my great-grandfather’s parentage, and the odds ratio was 12.8 to 1 (relative probabilities of 0.9277 and 0.0722). Not quite “statistically significant” (p<0.05) but close enough that I'm getting fairly confident that I know who my great-great-grandfather was.

  4. Hi Leah, I love this post and the information you provided! I really want to try this for my bio-grandfather since my Dad was adopted. Similar to your example I had it narrowed down to 2 brothers, I hope I have enough close testers to generate strong odds for one of the brothers. I’ve also been working with an adoptee on my maternal side who seems to be the granddaughter of one of my great-uncles (2 brothers again!). I’ll try these techniques for her as well.

    Great stuff, thanks for all that you do!

      1. Re: My bio-grandfather, I have mine and my sister’s test to compare to a prospective Half 1C from both prospective grandfathers. If that sample size is large enough to trust, the tool gave a 1 to 0 ratio to one grandfather over the other. Based on cM’s, that’s completely in line with what the DNA relationship charts predict.

        For the other case of my g-uncle’s granddaughter (my “new” cousin 🙂 ). G-uncle2 died at age 30 with no known children, so I don’t have any known data points for that hypothesis. My new cousin could be the 1st known descendant from his line which would be great to know!

  5. I love it !!!
    Gee, I thought I was weird looking at first and last name probabilities from namestatistics com – this is even better.

  6. Hi Leah

    Thanks so much for sharing this. It is immediately helpful on a number of cases I am working with. I have a couple of questions, if I may.

    First, is there a practical minimum shared cM below which we should not include a match? Barbara Rae-Venter, whom I am working with, seems to recall 90cM. Sadly this is rather high for a number of my cases. I realize that the range of estimated relationships will widen as we go lower.

    Second, how would you deal with a single hypothesis? The process seems to rely on comparison. Should we pull in a couple of others even if they are not likely? As a way of dealing with a single hypothesis I have used an average of probabilities rather than the compound. I am not very good at math so this may be daft.

    Many thanks for your great work

    1. You’re welcome! I hope the method leads to many, many reunions.

      First question: There’s no lower limit. In fact, small cM shares convey information, same as large ones. For example, someone who doesn’t share DNA with your searcher cannot possibly be a full 2C or closer. That is often valuable to know.

      Second question: With a single hypothesis, all the numbers can do is rule it out (e.g., if a proposed cousin has a probability of zero given the DNA share amount). Technically, though, I think you do have two hypotheses: (H1) Searcher is related to this family in this way and (H2) Searcher is not related to this family. For H2, I’d plug in all the relationships as 5C and see what you get.

  7. Leah
    This is great stuff. Thank you. I am working with someone who has a bizarre match that might be throwing off our calculations. She has a 182 cM match to one match and 54 to his 1C. He and his 1C share ~850 cM, and the tree shows they cannot be half uncle/nephew. This searcher is UK-based and has very few matches, but we do have 5 — the lowest is 22 cM which I see you say is workable. Is that very skewed match between the two cousins throwing off the calculations?
    Thanks
    Robin

    1. The “skew” in this case can be helpful, as long as you’re certain that the two matches are 1C to one another. The 182-cM match alone could be anything from a 1C1R to a 3C, while the 54-cM match alone could be anything from 2C1R on down. The two together, however, can only be 2C1R or 3C (or equivalent).

      1. Using both these matches seems to skew the results quite a bit. Should I still use them both? I get 0.505083 probability for the 182 cM match being a 2C1R and 0.089687 for the 54 cM match (his confirmed 1C).

        1. Yes, use both. The only reason to throw out data is when you can confirm that the relationships aren’t what you thought they were (e.g., if you later found that they weren’t 1C to one another).

        2. Leah

          I’m still lost. How can you determine what range they would have to be in, when you say “The two together, however, can only be 2C1R or 3C (or equivalent).” I tried to average them and that doesn’t work. I tried to extrapolate data and just got confused.

          Thanks

        3. If the two matches are 1C to one another, they should have the same relationship with your target. If you only look at the one who shares 182 cM with your target, the relationships could be anywhere from 1C1R to 3C. If you only look at the one who shares 54 cM with your target, the relationships could be 2C1R to distant. The intersection of those two sets limits you to the 2C1R or 3C categories.

  8. This? Is freaking awesome!

    It’s amazing how you can take raw data and extract clarity from it just by applying some simple math.

  9. I am blown away. I am helping a woman identify her birth grandfather. She has six good matches (207 cm to 553 cm) with the descendants of a man born in 1862. That man had 9 sons, one of whom (1885) lived next door to her grandmother when her grandmother was growing up. Her grandmother was born in 1904, her father in 1923.

    She was convinced that 1885 was her grandfather, especially since she shares the most cm (553) with his daughter. But the daughter is the only one in her generation who tested, the other five matches are all one generation more removed from the shared ancestor, so the daughter *would* share more cm.

    I tested out four hypotheses just scratching out the family tree by hand and placing the probabilities for each relationship. By using the dnapainter.com tool for shared matches, it seemed as though one scenario was much more likely than the others. All but one of the relationships in that hypothesis were in the highest probability group, the remaining relationship was in the second highest probability group (and not out of first place by much). The other three scenarios all had problems with at least one, sometimes more than one being very low probability (2-4%).

    But using the Perl tool I can see how *much* more likely and it’s crazy! Scenario 1, which I thought was the overwhelming favorite, comes up at 50,045. Scenario 3 which I thought was possible but much less likely, comes up at 196. Scenarios 2 and 4 come up at 1 and 12 respectively.

    I think we have cleared the 10-20x bar with a bit to spare!

    It seems that 1862 is the winner, and his son 1885 has been unfairly maligned (yes, she told his granddaughter that he had had an affair with her 18 year old grandmother).

    Can I stop there, or should I mirror 1862’s wife to kick the tires a bit? That would be rather tough as we get into Irish immigrants in the 1800’s and difficulty tracing their trees very far back, plus her matches generally do not have trees that go back very far in Ireland either.

    Another possibility is chromosome mapping via GEDmatch – I’m very weak at that but hey, learning opportunity!

  10. Is Jonny Perl’s Probability Test Tool supposed to be used only for Ancestry provided shared DNA amounts? The reason why I am asking, is that the probabilities were extracted from the AncestryDNA Matching White Paper. Would you use the Probability Test Tool only for Ancestry data? Do we have probabilities from any of the other testing companies like Family Tree DNA, 23andMe, MyHeritage, etc?

    1. That’s a great question! I use data for all of the companies in the tool, with the proviso that I omit segments smaller than 7 cM at FTDNA. None of the other companies has provided probability data in this form.
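The intersection logic in the reply about the 182-cM and 54-cM matches (comment 7) can be sketched in Python as a plain set intersection. The relationship "ladder" below is a simplified, truncated ordering made up for illustration: it leaves out equivalents such as half relationships, and the function name is hypothetical. The two ranges themselves are the ones stated in that reply.

```python
# Relationship categories ordered from closer to more distant. Simplified and
# truncated for illustration; equivalent relationships are left out.
LADDER = ["1C", "1C1R", "2C", "2C1R", "3C", "3C1R", "4C", "distant"]

def plausible(closest, most_distant):
    """Return the set of categories between two rungs of the ladder, inclusive."""
    start, end = LADDER.index(closest), LADDER.index(most_distant)
    return set(LADDER[start:end + 1])

match_182 = plausible("1C1R", "3C")       # 182-cM match considered alone
match_54 = plausible("2C1R", "distant")   # 54-cM match considered alone

# Two matches who are 1C to each other have the same relationship to the
# target, so only categories that appear in both sets survive.
both = match_182 & match_54
print(sorted(both, key=LADDER.index))
```

Only 2C1R and 3C appear in both sets, which is why those are the only categories left once the two matches are considered together.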
