Science the Heck Out of Your DNA — Part 3

Scroll down for links to other posts in this series.

I presented a talk on this method at the i4GG conference in December 2017. The video is available for purchase here, either individually or as part of the all-conference package.
The DNA Painter Lookup Tool

We are learning how to use probabilities to figure out genealogical relationships. The first post in this series explained how to combine independent probabilities (just multiply them), and the second described hypotheses and how we can test them numerically. But where do we get the probabilities in the first place?

The probabilities used in this series (and in the tools I’ll be describing) come originally from simulations done by AncestryDNA scientists. Specifically, they come from Figure 5.2 in AncestryDNA’s Matching White Paper. I’ve modified the labels in this version to list some of the relationships included in each color-coded group.

I used an online plot digitizer to extract numeric values for the lines on the graph, then converted the data into a table. (The whole gory process is described here.)

I’ve used this information—with great success—almost daily over the past year in my own genealogical work and in the research I do for clients. I will forgive you, however, if your eyes glaze over at the very sight. It’s boring. It’s too much information. It’s easy to cross rows. And it doesn’t help below 40 cM at all.

Enter Jonny Perl, who is rapidly becoming a knight in shining armor for the genealogical community by creating fabulous online tools to rescue us from our distress.

He’s converted the table above into a more visual interface that not only tells you relationship probabilities but also highlights where your DNA match could theoretically fit into your family tree. Dr. Andrew Millard made some refinements to get more precise values between the points in the table. And I extrapolated the numbers below 40 cM.

Go ahead: try it!

Let’s do a simple example together. In the “Filter” field at the top of the window, enter 140.

This table will appear, ranking the possible relationship groups from most to least likely:

And below that, you’ll see a stylized family pedigree with the possible relationships highlighted. (The dimmed relationships are ruled out for a match who shares 140 cM with you.)

The percentages from the tool are what we need for the probability calculations described in Part 1 and Part 2 of this series. Remember to convert the percentages to probabilities first by dividing by 100. For example, if your first hypothesis places this 140-cM match as a 2nd cousin, the probability you would use in the calculation is 18.88% ÷ 100 = 0.1888. And if your second hypothesis places the match as a 2nd cousin once removed, the probability would be 6.42% ÷ 100 = 0.0642.
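
Here is a minimal Python sketch of that conversion. The function name and the relationship labels are made up for illustration; the two percentages are the ones quoted above for a 140 cM match. Taking the ratio of the two probabilities is just one simple way to compare the hypotheses, and it assumes this single match is the only evidence being considered.

```python
def percent_to_probability(percent: float) -> float:
    """Convert a percentage from the lookup tool into a probability (0 to 1)."""
    return percent / 100.0


# Percentages quoted in the text for a match sharing 140 cM.
# The dictionary keys are illustrative labels, not names used by the tool.
tool_percentages = {
    "2C": 18.88,    # Hypothesis 1: 2nd cousin
    "2C1R": 6.42,   # Hypothesis 2: 2nd cousin once removed
}

probabilities = {
    relationship: percent_to_probability(percent)
    for relationship, percent in tool_percentages.items()
}

# With only one match, comparing the hypotheses is a simple ratio.
odds = probabilities["2C"] / probabilities["2C1R"]

for relationship, probability in probabilities.items():
    print(f"{relationship}: {probability:.4f}")
print(f"Hypothesis 1 is about {odds:.1f} times as likely as Hypothesis 2")
```

With more than one match, you would multiply the probabilities for each match under a given hypothesis before comparing, as described in Part 1 and Part 2.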

In the next post of this series, we’ll consider the mystery of the American GIs in Germany. Which brother was the grandfather?

Other posts in this series can be found here:

32 thoughts on “Science the Heck Out of Your DNA — Part 3”

  1. Phenomenally useful Leah! Thank you so much!!
    A couple questions:
    Which relationships would be too distant and uncertain to use with this method of combining probabilities? I.e., should only probabilities for 4th cousins or closer be included in the combined product of probabilities, or are all relationships from 1C to 8C to be weighted the same?

    What about matches that show 0 cM when there should be a fairly recent (e.g. 4th cousins) genealogical relationship? E.g. my dad matches someone at 89 cM whereas I do not match him on any segment. Does the predicted probability of 50% for a 0 cM match really mean something?

    1. Great questions. Theoretically, the “distance” of the cousin should already be factored into the probabilities, so it shouldn’t matter. That said, Fig 5.2 only goes out to 4C and only goes down to 40 cM. I had to make some assumptions when extrapolating the probabilities into the 0–40 cM range. I would love to have actual simulated probabilities in that range.

      As for 0 cM matches, they’re better for ruling out relationships than ruling them in. We know this match of yours can’t be a 2C or closer, because there are no known reports of a 2C not sharing any DNA.

  2. Group B start text is 1300
    Group C end text is 1300
    Chart indicates an overlap suggesting that group B start should be lower, ie 1200, and that group C end should be higher, ie 1400
    cheers
    Al

  3. But the numbers don’t match … this is where I have trouble wrapping my brain around it.

    For instance, the Ancestry Table says 90-200 cM is a 2nd cousin. Blaine’s chart says 90-200 is a 3rd cousin …..

    1. The numbers come from different sources, so we wouldn’t necessarily expect them to be the same. Which Ancestry Table are you referring to?

  4. I think the table you created is very misleading. I do not think it works! In particular, the conditional probability of a relationship given the sharing cM is much much more complicated than what you presented in your blog.
    I believe you need to consult a couple knowledgeable statisticians and genetic genealogists.

  5. In defense of this post, I’d say it’s a very, very good start. Ancestry’s simulations are quite a good match to Blaine Bettinger’s actual cM distributions, and there’s no arguing with those. That’s real data.
    The next upgrade would be to calculate probabilities from normalized Gaussians fitted to Blaine’s distributions instead of Ancestry’s synthetic data. I made a spreadsheet to do just this (which is how I know that Ancestry’s sims are decent).

    This post and the tool it describes are excellent … exactly what anyone looking for a birth parent or grandparent using DNA needs!

  6. This is all very interesting and useful. In the case of two siblings, who have a common match of interest, I’m assuming it would make sense to use the cM total that is higher. EX: I match John Doe at 55 cMs, but my brother matches John Doe at 122 cMs. Should I use brother’s 122 cMs for figuring out the relationship probabilities of John Doe to us?

      1. Hmmm. My inclination would be to use the cM total from the sibling that was higher. My thought being that the one sibling, through luck of the draw, happened to get more of the common DNA passed down. So, in theory, and if DNA was passed down evenly which it isn’t, then both sibs would have the same amount of shared DNA with John Doe. So, why not use just the higher cM value? Not arguing with you, just want to understand. Thanks.

      2. Leah, sorry to belabor this, but I think I now understand what you mean by “use both.” I think you’re saying I should use a chromosome browser to look at my brother’s 122 cMs and my 55 cMs that match John Doe. If my 55 cMs includes a segment that my brother didn’t get, I should add that unique segment cM to my brother’s total and use that for the probability. Is this correct thinking? Thanks.

        1. Not quite. If you were testing two (or more) alternate hypotheses, you would multiply the probability for the first sib by the probability for the second sib under Hypothesis 1 and compare it to the compound probability for Hypothesis 2. If you haven’t yet gotten to the stage of forming hypotheses and just want to know what range to look in, I’d use the average for the two siblings, which is about 88 cM. That puts you in the 2C1R to 3C1R range.

      3. Do you mean using the combined probabilities that you discussed in part II? This is the very issue I was trying to clear up!

        1. Yes, we use combined probabilities for situations with multiple DNA matches within the same family. The next posts in the series show some examples.

  7. I like everything but one thing. Why did you take it out to two decimal places? (53.00%, 20.01%, etc.) Too many people are entirely too literal in their interpretation of data, and that’s why you get all the stupidity around people who see 0.1% Scottish and think they need to go buy a kilt. Why not just round it to whole numbers or better yet, put some ranges around it? It won’t change anything, and it will make it easier for the average person, who isn’t into numbers, to absorb. Frankly I wish all the testing companies would use whole numbers or ranges, so you wouldn’t have all the problems of people reacting to what’s almost always noise. The same should go with these reference calculators.

    1. AncestryDNA and Family Tree DNA do use whole-number percentages, and AncestryDNA shows the range if you click on the ethnicity name in your report. 23andMe and Living DNA allow you to choose a confidence level for the estimate.

      I will pass along your concerns about the decimal places to Jonny Perl, who created the tool.

  8. Ancestry DNA shared cM amount is almost always lower than the same kits compared in GedMatch.
    If I put the Ancestry DNA number into the calculator, is there any ‘fudge factor’ I should be allowing for?

  9. I have a recent dna match to Subject A, who matches me 791cMs on 22 segments at 23andMe. Subject A was tested with V5 version autosomal test. I asked A to transfer his raw results to gedmatch, so I could see how he matches my daughter Subject B and my grandson Subject C. I will label myself Subject D to make this easier. At gedmatch, A & D matched 761cMs on 17 segments, A & B matched 471cMs on 12 segments, A & C matched 195cMs on 7 segments. Subject A shared the names of his parents, grandparents, great-grandparents with me, and I am not finding a matching ancestral name yet. Subject A is the same generation as my daughter-Subject B, so his parents are from my generation. I am thinking I could be a 1st cousin, 1st cousin 1x removed, half uncle, great uncle? I thought I could identify our common ancestor easily by knowing who his great-grandparents are. I have also searched siblings of his grandparents, great-grandparents, so maybe this will just take time to figure out.

    1. A relationship this close can only be a few things, so if you’re not finding the connection in your two trees, there may be a misattributed parentage some place.
