This post has been updated
On 15 July 2020, AncestryDNA updated their “Matching White Paper”, which is a detailed document describing how they use our DNA data to match us to our genetic relatives. The previous Matching White Paper was released in 2016.
On the surface, it seems like finding identical DNA segments shared by two people would be simple. It’s anything but. For biological, technical, economic, and computational reasons, all of the genetic genealogy companies have to use sophisticated algorithms to achieve the goal. In fact, differences in how each company approaches the problem are why you can match the exact same person at different sites and appear to share more or less DNA.
How AncestryDNA Matches Users
As a very broad overview, DNA matching at AncestryDNA involves four main steps:
- We have two copies of each autosomal chromosome, but the laboratory technique used by the companies (called a microarray) doesn’t analyze each one individually. Instead, your two copies of chromosome 1 are analyzed as a unit, your two copies of chromosome 2 are analyzed together, etcetera. After the fact, the computer algorithm has to determine which data came from your maternal copy of chromosome 1 versus your paternal copy, and so on for each chromosome pair. This step is called phasing.
- After the raw data is phased, people in the database are compared to one another to determine whether they share matching DNA sequences as a result of recent common ancestry. Such segments are called “identical by descent” or IBD to distinguish them from DNA that might appear to be identical by chance or because the matching DNA dates back dozens of generations. This matching is complicated by the sheer size of AncestryDNA’s database. There are hundreds of trillions of comparisons to be made, and the database is growing all the while.
- Segments of DNA can be shared for reasons other than recent common ancestry. For example, there is a cluster of three genes on chromosome 4 around position 38,800,000 that appear to give resistance to the plague. Two people could share this segment of DNA not because they are recent cousins but because both came from populations that survived the Black Death roughly 700 years ago. AncestryDNA applies an algorithm called Timber to adjust for these population-level segments, sometimes called pile-ups.
- Once AncestryDNA has determined how much DNA two people share, the final step is relationship estimation. It’s all well and good to say that cousins Tyneka and William share 192 cM of DNA, but what does that mean for how they’re related to one another? Here, AncestryDNA sorts our matches into broad categories of relationship meant to be a starting point for us to look for the connection in our family trees. To use our example, Tyneka and William would appear in the 3rd Cousin category in one another’s lists, but they might well be 2nd cousins instead. (AncestryDNA tends to err on the side of underestimating relationships, so you’re far more likely to see a true 2nd cousin estimated as a 3rd than vice versa.)
So What’s New?
That’s the general overview of how AncestryDNA provides us matches. What’s changed in this new White Paper? There are three key updates that you should be aware of. They go into effect in early August. Ultimately, they’ll position AncestryDNA to start incorporating NextGen sequence data into their database.
The Number of Shared Segments Will Be More Accurate
First, the number of unique segments shared by two people will be more accurate. Because of a strict matching algorithm, AncestryDNA sometimes reports a single long segment as two separate segments. This occurs when one of the people being compared has a random error in their data, making it appear that they don’t match at a single spot within the segment when they really do.
The effect is most obvious when comparing a parent and child. For example, AncestryDNA currently says that my mother and I share 3,475 cM of DNA across 44 segments.
That’s impossible. I only have 22 pairs of autosomal chromosomes, and I match her all the way across one copy of each. In reality, she and I share 22 segments, each of which is an entire chromosome. There must be 22 errors in either my data or hers causing some of our chromosomes to match in discrete chunks rather than across the entire thing. (Given that AncestryDNA analyzes more than 600,000 markers in our DNA, it’s remarkable that there are only 22 such errors!)
The update will allow such “single SNP mismatches” nested within otherwise matching regions to be ignored, so the segment correctly appears as continuous rather than broken in two. By August, my mom and I should be reported to share 3,475 cM of DNA across 22 segments rather than the current 44.
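To make the idea concrete, here is a minimal sketch of how two segments broken apart by a single mismatched marker could be stitched back together. It is only an illustration, not AncestryDNA's actual method: the `Segment` tuple, the use of marker indices, and the function name `merge_single_snp_gaps` are all my own assumptions.

```python
from typing import List, Tuple

# Hypothetical representation (an assumption, not AncestryDNA's format):
# (chromosome, first_marker_index, last_marker_index), indices inclusive
# and counted along each chromosome.
Segment = Tuple[int, int, int]


def merge_single_snp_gaps(segments: List[Segment]) -> List[Segment]:
    """Join neighbouring segments separated by exactly one unmatched marker."""
    merged: List[Segment] = []
    for chrom, start, end in sorted(segments):
        if merged:
            prev_chrom, prev_start, prev_end = merged[-1]
            # A difference of 2 means exactly one marker sits in the gap.
            if chrom == prev_chrom and start - prev_end == 2:
                merged[-1] = (prev_chrom, prev_start, end)
                continue
        merged.append((chrom, start, end))
    return merged


# Chromosome 1 is split by one mismatched marker (index 500);
# chromosome 2 has a wider gap, so it stays as two segments.
example = [(1, 0, 499), (1, 501, 999), (2, 0, 300), (2, 310, 800)]
print(merge_single_snp_gaps(example))
```

In this toy example the two pieces on chromosome 1 collapse into one segment, while the wider gap on chromosome 2 is left alone.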
This is good news but will only affect the subset of genetic genealogists who use the number of matching segments in their work. I calculate average segment size when working with endogamous populations, so I am very pleased to see this update. More accurate averages are always better.
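Average segment size is simply the total shared centimorgans divided by the number of segments, so an inflated segment count drags the average down. The short sketch below uses the figures for my mother and me; the function name `average_segment_cm` is just a label I chose for illustration.

```python
def average_segment_cm(total_cm: float, segment_count: int) -> float:
    """Return the mean segment length in cM."""
    if segment_count <= 0:
        raise ValueError("segment_count must be positive")
    return total_cm / segment_count


# Same total, two different segment counts.
reported_44 = average_segment_cm(3475, 44)  # count with broken segments
reported_22 = average_segment_cm(3475, 22)  # count with whole chromosomes
print(f"44 segments: {reported_44:.1f} cM on average")
print(f"22 segments: {reported_22:.1f} cM on average")
```

Halving the segment count doubles the average, which is why the corrected counts matter to anyone who relies on this number.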
AncestryDNA Will Report the Length of Longest Shared Segment
Thus far, AncestryDNA has only shown us the total amount of shared DNA and the number of shared segments. With the pending update, they will also report how long the longest shared segment is. For many users, this won’t make a difference in how they work with their matches, but for those of us from endogamous populations, this will be a huge benefit.
Endogamy occurs when people marry within the same group for many generations. This eventually causes complicated webs of relationship, in which individuals are cousins many times over.
Endogamous matches often share far more DNA than would be expected given their closest relationship, because those more distant connections are also adding DNA to the total. Those additional genetic contributions, though, tend to be very small segments. That’s why knowing the size of the largest segment is so important. A match who shares 70 cM with a largest segment of 35 cM is far more likely to be a recent cousin than a match of 70 cM whose largest segment is 12 cM.
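As a rough sketch of how the longest segment can be used to prioritize matches, the snippet below ranks the two 70 cM matches from the example above. The `Match` class and its field names are hypothetical, and the ranking is only a way to decide whom to look at first, not a relationship predictor.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical fields; AncestryDNA does not expose data in this form.
    name: str
    total_cm: float
    longest_cm: float

    @property
    def longest_fraction(self) -> float:
        """Share of the total that sits in the single longest segment."""
        return self.longest_cm / self.total_cm


matches = [
    Match("Match B", total_cm=70, longest_cm=12),
    Match("Match A", total_cm=70, longest_cm=35),
]

# Put the matches with the longest single segment first.
ranked = sorted(matches, key=lambda m: m.longest_cm, reverse=True)
for m in ranked:
    print(f"{m.name}: longest {m.longest_cm} cM ({m.longest_fraction:.0%} of total)")
```

Both matches share the same total, but half of Match A's DNA sits in one segment, so that is the match I would investigate first.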
Minimum Match Raised from 6 cM to 8 cM
(UPDATE: This change has been delayed until early September.)
Currently, our match lists at AncestryDNA include people who share as little as 6 cM of DNA with us. That minimum will be raised to 8 cM in the new update, meaning many of us will “lose” matches.
I have about 50,000 matches at AncestryDNA, of which roughly 21,000 are below 8 cM. While it might seem alarming to lose 42% of my matches, in practice it’s not such a bad thing. First, when a child and both parents have tested, studies show that about 40% of the child’s matches in that range don’t match either parent, meaning they’re false positives. There’s no easy way to tell which matches are false, meaning many of us are being misled by them. With these matches, we’re not just chasing ghosts, we’re chasing someone else’s ghosts!
Second, the tiny matches that are valid may represent genetic connections dozens of generations back, ones I’ll never be able to document. I’ve only managed to connect a small fraction of my closer matches to my tree in the years since I first tested, and more matches are rolling in all the time, so I’ll never be able to systematically analyze those extremely distant matches.
Finally, even though I’ll miss out on some valid matches that might be traceable, I recognize that this compromise will accommodate the ever-growing database. And I’d much rather AncestryDNA invest in growing their database than divert resources to matches I’ll probably never look at.
That said, there are some distant matches I’d like to keep in my list — and I can. Any match that I’ve messaged, added a note to, starred, or included in a custom group (the color dots) will be retained after the update goes into effect. Better yet, if I act to retain a tiny match in my list, I will still show in their list after the update, even if they don’t tag me. In other words, tiny matches will be symmetrical.
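The retention rule can be summarized as a small sketch. This is my reading of the announcement, not AncestryDNA's code: the function name, the parameter names, and the assumption that a match of exactly 8 cM stays in the list are all mine. "Flagged" here means messaged, noted, starred, or added to a custom group.

```python
NEW_MINIMUM_CM = 8.0  # threshold described in the update


def is_retained(
    shared_cm: float,
    flagged_by_me: bool,
    flagged_by_them: bool,
    minimum_cm: float = NEW_MINIMUM_CM,
) -> bool:
    """Return True if the match should still appear in both lists."""
    # Assumption: a match at exactly the minimum is kept.
    if shared_cm >= minimum_cm:
        return True
    # Retention is symmetrical: a flag from either side keeps the match.
    return flagged_by_me or flagged_by_them


print(is_retained(7.2, flagged_by_me=True, flagged_by_them=False))
print(is_retained(7.2, flagged_by_me=False, flagged_by_them=True))
print(is_retained(7.2, flagged_by_me=False, flagged_by_them=False))
```

The first two calls return `True` and the third returns `False`, which is the symmetry described above.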
What Should You Do?
For the first two updates—improved number of segments and longest segment size—you needn’t do anything. However, if you want to retain your very distant matches, you’ll need to take some extra steps.
Here’s what I’m doing: First, I’m triaging. I’m not even trying to preserve every single match below 8 cM. Most of them are either false positives or too distantly related to ever sort out. There are some, though, that I’d really like to keep.
There’s one particular question I’ve been working on lately: the parentage of my 4th great grandmother Marianne Dykes. Without documentation, I added a likely Dykes couple to my tree to see whether ThruLines were generated, and they were! Some of those matches, though, are 8 cM or smaller.
Here, it’s important to know that AncestryDNA rounds the numbers they show us. A match that’s labeled as 8 cM might be 8.2 cM, above the new threshold, but could just as easily be 7.5 cM and scheduled to disappear. For that reason, I’m triaging all matches of 8 cM or below.
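The rounding logic can be sketched as follows. I am assuming that values are rounded to the nearest whole centimorgan with halves rounding up, which fits the 8.2 and 7.5 examples but is not something AncestryDNA has spelled out; the function names are hypothetical.

```python
import math


def displayed_cm(true_cm: float) -> int:
    """Round to the nearest whole cM, halves rounding up (an assumption)."""
    return math.floor(true_cm + 0.5)


def might_be_below_minimum(shown_cm: int, minimum_cm: float = 8.0) -> bool:
    """True if the smallest value that rounds to shown_cm is under the minimum."""
    lowest_possible = shown_cm - 0.5
    return lowest_possible < minimum_cm


for true_value in (7.5, 8.2, 8.6):
    shown = displayed_cm(true_value)
    print(true_value, "shows as", shown, "at risk:", might_be_below_minimum(shown))
```

Under that assumption, anything displayed as 8 cM could really be as low as 7.5 cM, while anything displayed as 9 cM is at least 8.5 cM and safe.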
To keep them, first I created a custom group for them. Next, I clicked on the ThruLines for William Dykes and for his wife Phoebe Singleton and viewed the matches in list format.
Then I opened each match of 8 cM or less in a new tab and added them to the Dykes–Singleton custom group. (Quick tip: Put an exclamation mark before the group name so it sorts at the top of the list of groups.)
Importantly, I did all of this in my mom’s match list and my uncle’s rather than my own. They’re one generation closer to Marianne Dykes than I am, so their matches to the Dykes–Singleton family are more important than mine. It doesn’t matter if I lose them from my own match list, as long as they’re preserved in my mother’s and uncle’s.
Keeping Surnames of Interest
I’m also using the custom filters in the main match list to quickly find and flag potential Dykes–Singleton matches. I first filtered the centimorgan range to between 6 and 8 cM. (Remember: the ones above 8 cM are not in danger of disappearing soon.)
Then I searched for trees with the surnames Dykes or Singleton and added those matches to the Dykes–Singleton group using the “Add to group” feature. No need to open a tab for each match this time, but be sure to scroll all the way down so you don’t miss anyone.
It took about 15 minutes total to preserve 110 distant matches in my mom’s list and another 107 in my uncle’s. Of course, they may turn out to be false leads, but I can decide that later.
Matches with Common Ancestors
The third group of matches I’d like to preserve are those for whom AncestryDNA has identified common ancestors. Because I don’t have time to sort these by which branch of my tree they’re on, I created a new custom group called “Common Ancestor”. Then, using the “Common ancestors” filter and my custom DNA range, I’ve started labeling these matches, too.
I may not get through all of these before they disappear, so I’m starting with the largest ones (8 cM) and working my way down.
Updates to This Post
- 17 Jul 2020: Added the approximate position of the Toll-like receptor (TLR) genes that appear to confer resistance to the plague
- 18 Jul 2020: Explained that AncestryDNA rounds cM values so matches of 8 cM and below should be triaged; added new clarification from AncestryDNA that starred matches will be retained, and retention is symmetrical
- 23 Jul 2020: The elimination of 6–8 cM matches has been delayed until early September.