In Which Citizen Science Finds an Error, and the Error Is Fixed

A key reason to evaluate our DNA tools is to improve them. After all, that’s how science works: a perennial cycle of testing, critique, and adjustment. Improving our tools is the primary motive behind the ongoing survey of DNA relationship predictors.

Recently, some unexpected entries to the survey unearthed a longstanding error with the original Shared cM Tool (SCT) at DNA Painter – unexpected because the tool has been available since 2017, and no one had previously reported this problem!

A volunteer, Ricki, sent in a couple of parent–child matches that fooled the SCT. It predicted them as full siblings to their mother rather than her children! Both children shared an unusually low amount of DNA with their mother: 3,062 cM and 3,018 cM, respectively. The average parent–child match at AncestryDNA is closer to 3,455 cM.

This discrepancy immediately made me suspect that these entries were typos. “Trust, but verify” works in science as well as politics!

I reached out to Ricki, who not only confirmed the entries but sent a screenshot of the mom’s matches to nine of her children (the original two plus seven more). All nine siblings and their mother had tested at AncestryDNA, and all of them share substantially less than the average parent–child match. (Screenshot used with permission.)

The siblings were correctly predicted as their mother’s children by AncestryDNA’s algorithm because it can “see” that each one shares an entire copy of each chromosome pair with their mom, as expected. (The children inherited their other chromosome copies from their father.)

But the third-party SCT only has the total amount of shared DNA to go on, not how that DNA is distributed. The low centimorgan amounts for this family are below the parent–child threshold in the original SCT, so the tool got the prediction wrong.
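To see why a total-only tool can stumble here, consider a minimal sketch in Python. It is a toy model, not the SCT's actual code: the function name is hypothetical, and the floor of roughly 3,300 cM is an assumption made for illustration. Any total below the floor is never considered as parent–child, no matter how the DNA is distributed across the chromosomes.

```python
# Toy model only: the real tool reports probabilities for many relationships.
ASSUMED_PARENT_CHILD_FLOOR_CM = 3300  # assumption for illustration


def parent_child_is_considered(total_cm, floor_cm=ASSUMED_PARENT_CHILD_FLOOR_CM):
    """Return True if a total-only tool with a hard floor would even
    consider parent-child for this amount of shared DNA."""
    return total_cm >= floor_cm


# 3,455 cM is the AncestryDNA average; 3,062 and 3,018 cM are Ricki's matches.
for total_cm in (3455, 3062, 3018):
    print(total_cm, parent_child_is_considered(total_cm))
```

A tool that could also see the segment data would not need a floor like this, because a parent and child match along the full length of every chromosome.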

Thanks to Ricki, the tool has been fixed! When these same matches are entered now, the predictions range from an 86% chance of a parent–child relationship (for 3,258 cM) to a 6% chance (for 3,018 cM).

As always, it’s up to the individual genealogist to evaluate the predictions based on other information, like ages, family history, and other DNA matches.

Two Questions

Why did we not catch this problem in the SCT sooner? First, parent–child matches this low are extremely rare. Second, there was never a need to use the SCT for these matches in the first place, because the predictions from AncestryDNA were unambiguous and accurate. This error was only revealed thanks to Ricki’s volunteer efforts with the survey.

The second question is: Why are these matches so low? Parents and their children should always match across the entire genome. The exact amount can vary slightly between testing companies and with genotyping errors, but it’s almost invariably more than 3,400 cM. In this case, I suspect that the mom’s DNA test had an unusually high error rate. That, coupled with Ancestry’s stringent matching algorithm, caused the site to underestimate how much the children share with her.

Two lines of evidence support that suspicion. First, parents and children should share 22 segments of autosomal DNA, one for each of the 22 autosomes. Occasional genotyping errors can inflate that number slightly; for example, I share 26 and 27 segments at AncestryDNA with my two children. The nine siblings in Ricki’s family, however, share between 42 and 65 segments with their mother at AncestryDNA.
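That first check is simple enough to sketch in Python. This is only an illustration: the function name and the tolerance of 8 extra segments are hypothetical choices, not a published cutoff.

```python
EXPECTED_AUTOSOMAL_SEGMENTS = 22  # one segment per autosome for parent-child
ASSUMED_TOLERANCE = 8             # hypothetical allowance for occasional errors


def has_inflated_segment_count(segment_count, tolerance=ASSUMED_TOLERANCE):
    """Flag a parent-child match that is broken into far more segments
    than the expected one per autosome."""
    return segment_count > EXPECTED_AUTOSOMAL_SEGMENTS + tolerance


# 26 and 27 are my own matches; 42 and 65 are the ends of the family's range.
for count in (26, 27, 42, 65):
    print(count, has_inflated_segment_count(count))
```

Under that assumed tolerance, counts of 26 and 27 pass while 42 and 65 are flagged.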

What’s more, all of these kits were uploaded to MyHeritage, which uses a statistical trick called imputation that’s able to smooth out the errors. The centimorgan totals and numbers of segments are normal there.

That in itself provides a cautionary tale for tool developers: we can’t always assume that real-life data fits our assumptions. In other words, the original SCT assumed that a parent–child match would not have significant genotyping errors, and when that assumption failed, so did the tool. Fortunately, once we became aware of the problem, it was easy to fix!

You can contribute to the predictor study (and to the Shared cM Project at the same time) here!

9 thoughts on “In Which Citizen Science Finds an Error, and the Error Is Fixed”

At MyHeritage, I share 820 cM with my great-grandchild (largest segment 56.8 cM), and they acknowledge her as my GREAT NIECE.

thednageek says:

May 2, 2023 at 9:19 am

Presumably they still list great grandchild as an option. I expect they’ll refine how they use ages over time.

Craig Smith says:

May 2, 2023 at 12:21 pm

“Acknowledge” her as your great-niece – or list that relationship as the one with the highest probability?

I am not certain that MyHeritage “acknowledges” anything.

Thank you for your work.

thednageek says:

May 2, 2023 at 10:12 am

It’s a pleasure!

It will be interesting to see what MyHeritage has to say about this case!

thednageek says:

May 2, 2023 at 11:09 am

Yes! I only know the ages for the two that were originally submitted to the survey. They were predicted as parent–child at 72.8% and 71.5% probability.

This is indeed a nice example of citizen science in action, but I’m not sure the remedy is to “fix” the shared cM tool to accommodate this case. There is some sort of error or anomaly in the data source that should be investigated. It also seems odd that the new calculation comes up with a 21% probability when the footnote says the results fall outside of the 99th percentile.

Perhaps Ancestry would be willing to retest the mother if Ricki shares this blog with them. Inspection of the raw data for mother/children could also reveal some problems. I’d be happy to look into this if Ricki is interested. My email is DNACousins@gmail.com

thednageek says:

May 3, 2023 at 12:28 pm

Unlike simulated data, empirical data isn’t always perfect. I am of the opinion that our sim-based tools should be adapted whenever possible rather than the other way around.

In this case, we knew what the problem was (in 2016, I didn’t interpolate below ~3300 cM from Ancestry’s Figure 5.2) and better data was available (in the beta version), so we simply “cheated” off of the better beta data.

It’s important not to confuse probability with percentile. A match of 3062 is in a marginal percentile for both parent-child and full sib. That’s a completely different mathematical concept from how likely it is to be one versus the other.
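Here is a minimal sketch of the difference in Python, using made-up bell curves. The spreads, the full-sibling mean, and the equal priors are all assumptions for illustration, not the tool's real data; only the 3,455 cM parent–child average and the 3,062 cM match come from the post.

```python
from statistics import NormalDist

# Hypothetical bell curves; real shared-cM distributions are not this tidy.
parent_child = NormalDist(mu=3455, sigma=150)  # assumed spread
full_sibling = NormalDist(mu=2613, sigma=200)  # assumed mean and spread

shared_cm = 3062

# Percentile: where the match falls within each relationship's own distribution.
pct_parent_child = parent_child.cdf(shared_cm)
pct_full_sibling = full_sibling.cdf(shared_cm)

# Probability: how the two relationships compare at this exact amount,
# assuming both are equally likely beforehand.
like_parent_child = parent_child.pdf(shared_cm)
like_full_sibling = full_sibling.pdf(shared_cm)
prob_parent_child = like_parent_child / (like_parent_child + like_full_sibling)

print(f"percentile if parent-child: {pct_parent_child:.1%}")
print(f"percentile if full sibling: {pct_full_sibling:.1%}")
print(f"probability of parent-child: {prob_parent_child:.1%}")
```

With these invented curves, the match sits in a tail of each distribution, yet the probability of parent–child is neither near zero nor near certainty.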
