Science is like a ratchet. Each discovery builds upon previous work, and ideas that withstand scrutiny become stepping stones to future progress. Competition can be fierce, but good scientists credit their predecessors and accept critique with grace in a perennial collaboration toward greater knowledge. That’s the beauty of science.
The ratchet is evident in the evolution of relationship prediction tools for genealogy. The DNA testing companies all suggest ranges of relationship for our DNA matches, like “2nd–3rd cousin,” but independent genealogists have been instrumental in pushing the field forward.
The bedrock of relationship prediction is the Shared cM Project by Blaine Bettinger, which has amassed data for more than 60,000 known relationships since 2015. Although not a predictive tool itself, its statistical approach to DNA matches inspired the first use of probabilities to evaluate them. (See below for a more detailed timeline.)
We now have several tools that assign relationship probabilities to autosomal DNA matches, including the recently announced cM Explainer from MyHeritage. But which tool is best? What does “best” even mean in this context?
No tool will predict the correct relationship 100% of the time; DNA inheritance is too random for that. And no prediction can be accepted without further investigation. Even for a “parent–child” match, we must consider which person is older and whether an identical twin exists.
A tool should make our work easier. It should point us in the right direction without misleading us or giving false confidence. An incorrect prediction, or an overly confident one, can send us down a rabbit hole and waste our time.
I suggest the following criteria for which relationship predictor is “best”:
- It predicts the correct relationship as the top-ranked option more frequently than others, across the spectrum of shared DNA amounts.
- It predicts the correct relationship with more confidence. A tool that correctly predicts a 2nd cousin at 68% probability is better than one that gives the same match a 47% chance of being a 2C.
- When it’s wrong, the true relationship is still ranked within the top three possibilities.
- When it’s wrong, the true relationship has a similar probability to the top-ranked one. A wrong prediction that scores only 1% less than the incorrect one is better than a wrong prediction that scores 30% less.
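To make the four criteria concrete, here is a minimal sketch in Python of how they could be scored for one tool. It is an illustration only, not the scoring method used in the study. The `Prediction` record, the `score_tool` function, the relationship labels, and the sample probabilities are all hypothetical, and the sketch assumes each tool's output can be reduced to one probability per relationship label.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Prediction:
    """One evaluated match (hypothetical record layout)."""
    true_rel: str             # the known relationship, e.g. "2C"
    probs: dict[str, float]   # tool output: relationship label -> probability


def score_tool(predictions: list[Prediction]) -> dict[str, float]:
    """Summarize one tool against the four criteria listed above."""
    correct_conf = []   # probability given to the truth when it ranked first
    wrong_in_top3 = []  # 1.0 if the truth still made the top three, else 0.0
    wrong_gap = []      # top-ranked probability minus the truth's probability

    for p in predictions:
        # Rank labels from most to least probable. Ties keep the tool's order.
        ranked = sorted(p.probs, key=p.probs.get, reverse=True)
        true_prob = p.probs.get(p.true_rel, 0.0)
        if ranked and ranked[0] == p.true_rel:
            correct_conf.append(true_prob)
        else:
            wrong_in_top3.append(1.0 if p.true_rel in ranked[:3] else 0.0)
            top_prob = p.probs[ranked[0]] if ranked else 0.0
            wrong_gap.append(top_prob - true_prob)

    def mean(values: list[float]) -> float:
        return sum(values) / len(values) if values else float("nan")

    n = len(predictions)
    return {
        "top1_accuracy": len(correct_conf) / n if n else float("nan"),
        "mean_confidence_when_correct": mean(correct_conf),
        "top3_rate_when_wrong": mean(wrong_in_top3),
        "mean_gap_when_wrong": mean(wrong_gap),
    }


# Made-up sample data, for illustration only.
sample = [
    Prediction("2C", {"2C": 0.68, "1C1R": 0.20, "Half 1C": 0.12}),
    Prediction("2C", {"1C1R": 0.48, "2C": 0.47, "Half 1C": 0.05}),
    Prediction("3C", {"2C": 0.50, "2C1R": 0.30, "Half 2C": 0.15, "3C": 0.05}),
    Prediction("1C", {"1C": 0.90, "Half uncle": 0.10}),
]
scores = score_tool(sample)
```

A higher `top1_accuracy`, `mean_confidence_when_correct`, and `top3_rate_when_wrong` would favor a tool, as would a lower `mean_gap_when_wrong`. To cover the "across the spectrum of shared DNA amounts" part of the first criterion, the same function could be run separately on matches grouped by centimorgan range.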
You can help!
Blaine Bettinger and I are collaborating on a study to compare three different relationship prediction tools: the original Shared cM Tool at DNA Painter, the Shared cM Tool with updated probabilities, and the cM Explainer at MyHeritage. For matches in the AncestryDNA database, you can also evaluate their built-in probabilities. (A fourth tool was originally included in this study but was removed after strong objections by its developer.)
We are soliciting volunteers in two ways. If you prefer to work with your own DNA kits, you can fill out this online survey. Each pass through the survey will prompt you to evaluate the same match in different tools. To get started, you’ll want to have handy: the known relationship, the amount of shared DNA in centimorgans, the number of segments, and (if possible) the age of the DNA match. You can submit as many matches as you like by going through the survey multiple times.
Alternatively, you can help by using data from the Shared cM Project. Blaine has collected the shared DNA amount and number of segments for about 40,000 pairs of known relatives, so we can process large amounts of data quickly and efficiently. The one missing parameter is age (used by the MyHeritage tool), which is why we’re giving volunteers a choice of how to contribute. To help this way, see the dedicated post in the Genetic Genealogy Tips & Techniques group on Facebook.
Timeline of Relationship Prediction Tools
The list below is by no means comprehensive. Instead, it’s meant to acknowledge the key advances in the field and introduce the main options for relationship prediction in the survey described above.
2015—Blaine Bettinger launched the crowdsourced Shared cM Project. Today, it has amassed more than 60,000 data points that provide averages, ranges, and histograms for shared DNA between known relatives. The project was foundational for interpreting DNA matches. All data comes from volunteers, but the sheer size of the dataset should, in theory, outweigh occasional errors.
2016—Christa Stalcup and The DNA Detectives group produced “The Green Chart,” compiling similar information to the Shared cM Project but from relationships that were vetted by experienced genealogists. This more-careful approach necessarily yields less data. (This is a perennial trade-off in science.)
2016—AncestryDNA published a Matching White Paper with a key graph (Figure 5.2) that allowed us to take a more statistical approach to relationship prediction. Instead of simply showing which relationship groups shared the same amounts of DNA, it indicated which were more likely for any given centimorgan amount.
2017—Jonny Perl integrated two datasets—the Shared cM Project and Ancestry’s Figure 5.2—to create the Shared cM Tool. This was the first interactive tool to provide statistical probabilities for relationships and has become the gold standard for DNA-based prediction, both for its ease of use and its accuracy.
2017—Professor Andrew Millard performed computer simulations to show that the number of segments can sometimes help distinguish among close relationships that have the same expected total of shared DNA. This concept built on an earlier scientific publication by 23andMe scientists in 2012.
2019—AncestryDNA began reporting relationship probabilities in their DNA match lists. These probabilities have occasionally been updated. To date, AncestryDNA has not produced a white paper explaining its methods.
2023—Brit Nicholson introduced the interactive SegcM, a tool that predicts relationships based on both total shared DNA and number of segments. This tool builds on the earlier work of 23andMe scientists and Andrew Millard and attempts to distinguish among certain close relationships that other tools cannot. It claims to be “the most accurate relationship predictor available,” but the developer has taken extreme measures to prevent it from being compared to other tools. (See below.)
Updates to This Post
The text has been edited to reflect recent events. On 17 March 2023, a tool under consideration was password-protected by its developer to shield it from scrutiny. It was re-released on 22 March with new “terms and conditions” that explicitly prevent it from being used in this study. This behavior is profoundly unscientific, and I’m sad that the genealogical community had to witness it.
Posts in This Series
- Relationship Prediction Tools: Which Is Best?
- The Relationship Predictor Comparison: A First Peek
- In Which Citizen Science Finds an Error, and the Error Is Fixed
- The Science Behind Relationship Predictors