For years, a heated debate has roiled the genetic genealogy community about small DNA segments. Judy Russell has recently written an excellent summary of the issue.
The precise definition of “small” varies, but the scientific consensus is that segments shorter than 6–8 cM (centimorgans) are not trustworthy for genealogy purposes. Such segments are likely to be “false positives”: they look like real segments of DNA inherited from a shared ancestor (called “identical by descent”, or IBD), but they’re not.
Scientists from 23andMe even published a peer-reviewed paper about false-positive segments way back in 2014. They compared child–parent trios (where a child and both parents had tested) to other people and tracked how often those other people matched the child but neither parent. For a genuine IBD segment, that is biologically impossible: all of the child’s DNA came from the parents, so a real segment must also show up in one of them. Segments that match the child alone are false positives.
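The trio test can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the method from the 23andMe paper: the `Segment` type, its field names, and the simple overlap rule are all assumptions made for the example, and the numbers are made up.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """A matching segment; the type and field names are hypothetical."""
    chrom: int
    start_cm: float
    end_cm: float


def overlaps(a: Segment, b: Segment) -> bool:
    """True if two segments share any stretch of the same chromosome."""
    return a.chrom == b.chrom and a.start_cm < b.end_cm and b.start_cm < a.end_cm


def is_false_positive(child_seg: Segment, parent_segs: list[Segment]) -> bool:
    """Flag a child-vs-stranger segment that neither parent shares with that stranger."""
    return not any(overlaps(child_seg, p) for p in parent_segs)


# Segments the same stranger shares with either parent (made-up numbers).
parent_segs = [Segment(1, 50.0, 60.0)]
print(is_false_positive(Segment(1, 10.0, 14.0), parent_segs))  # True
print(is_false_positive(Segment(1, 52.0, 58.0), parent_segs))  # False
```

A real analysis would need a stricter rule than "any overlap", but the idea is the same: a segment in the child with no counterpart in either parent cannot have been inherited.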
The smaller the segment, the worse the problem is, as this graph shows. That’s why most leading DNA companies ignore segments less than 6 cM.
Even so, some genealogists continue to use small segments as “proof” of relationships. One DNA matching company, FamilyTreeDNA, even reported segments down to 1 cM, adding a sheen of credibility to the idea that tiny segments are somehow valuable for genealogy research.
FamilyTreeDNA Updates Their Matching Program
Recently, FamilyTreeDNA updated their matching algorithm to exclude all segments less than 6 cM. They also published a white paper explaining why. The key evidence is summarized in their Table 2.
Focus on the left-most and right-most columns. At 1 cM, almost all of the “matching” segments—99.96%—are false positives. Put another way, only four out of every 10,000 such segments are identical by descent (IBD), and there’s no way to tell which four are legit from looking at them. The situation is only marginally better for 2-cM and even 3-cM segments. Even at 5 cM, nearly half of all segments are false. And again, there’s no indication of which half are real.
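The "four out of every 10,000" figure is just the 99.96% false-positive rate turned around. A quick check in Python, where the function name is made up for this example and the only number taken from the white paper is the 1-cM rate quoted above:

```python
def expected_true(n_segments: int, false_positive_pct: float) -> float:
    """Expected number of genuine (IBD) segments among n reported segments."""
    # Rounding hides floating-point noise in the subtraction.
    return round(n_segments * (100 - false_positive_pct) / 100, 6)


print(expected_true(10_000, 99.96))  # 4.0
```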
Why the False Segments?
At this point, you may be thinking “But I thought DNA matching was hard science. How can so many segments be wrong?” And that’s a great question! We should always be asking why.
The reason is that the technology we use for genetic genealogy is not perfect. It’s extremely good, but it has a fundamental weakness: it can’t tell which bits of autosomal DNA we inherited from our mothers and which we inherited from our fathers. This causes a mix-and-match problem (pun intended) that leads to false segments.
Consider the example to the left. Person A and Person B each have two copies of the chromosome, one paternal (blue) and one maternal (red). Neither sequence (haplotype) from Person A matches either sequence from Person B. Even so, this region gives a false positive match because at every point, at least one base in Person A matches at least one base in Person B. The error is caused by “haplotype switching”, where the match only appears to exist because the computer analysis is switching back and forth from one haplotype to the other.
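Here is a toy version of that situation in Python. It assumes a deliberately simplified comparison (two people "match" at a site if they share at least one base there), which is not any company's actual algorithm, and the four-site sequences are invented for the example.

```python
def any_haplotype_match(person_a, person_b):
    """True only if a whole haplotype from A equals a whole haplotype from B."""
    return any(ha == hb for ha in person_a for hb in person_b)


def unphased_match(person_a, person_b):
    """True if, at every site, A and B share at least one base.

    This is what a comparison sees when it cannot tell which base
    came from which parent.
    """
    a1, a2 = person_a
    b1, b2 = person_b
    return all({w, x} & {y, z} for w, x, y, z in zip(a1, a2, b1, b2))


# Each person is (paternal haplotype, maternal haplotype); made-up data.
person_a = ("ACAC", "GTGT")
person_b = ("ATGC", "GCAT")

print(any_haplotype_match(person_a, person_b))  # False: no real shared haplotype
print(unphased_match(person_a, person_b))       # True: looks like a match anyway
```

The shorter the stretch, the easier it is for this kind of site-by-site coincidence to hold all the way across, which is why small segments are hit hardest.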
Yes, there is newer sequencing technology that can get around the haplotype-switching problem, but it is both expensive and error prone. Our current methods may be imperfect, but they’re affordable, and there’s a simple workaround: ignore the segments that are statistically likely to be false.
Which is precisely what FamilyTreeDNA is now doing. As of August 2021, they only consider segments of 6 cM or more. This may cause substantial changes to your Family Finder match lists, but it’s a long-overdue change. It should also better align the DNA matches there with those in the other databases.
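If you keep your own segment data in a spreadsheet or script, the same cutoff is easy to apply yourself. A minimal sketch, assuming the segment lengths are already in a plain list of cM values; the function name and the sample numbers are invented, and only the 6 cM threshold comes from the text:

```python
MIN_SEGMENT_CM = 6.0  # threshold described above


def keep_reportable(segments_cm, min_cm=MIN_SEGMENT_CM):
    """Keep only segments at or above the minimum length."""
    return [cm for cm in segments_cm if cm >= min_cm]


segments = [1.2, 3.8, 5.9, 6.0, 23.4]  # made-up segment lengths in cM
print(keep_reportable(segments))  # [6.0, 23.4]
```

Dropping the short segments also lowers the total shared cM for a match, which is one reason a match list can look different after a change like this.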
Worse Than It Looks
The false-positive problem is even worse than the data above suggest. That’s because FamilyTreeDNA’s Table 2 is based on simulations.
There were no mis-calls or no-calls or imputation errors in the data. There was no pedigree collapse. And the tree was known with absolute certainty, so the false positive segments were easy to spot. Even in that best-case scenario, segments below 5 cM were more likely to be false than true.
In the real world, there will be more false positives, because real data isn’t perfect. Some bases are called incorrectly. Some aren’t called at all. What’s more, companies that accept uploaded data files, like FamilyTreeDNA and MyHeritage, have to account for the fact that those files are not fully compatible with their own tests. They use a statistical trick called “imputation” to get around the incompatibility, but imputation can introduce its own errors. All of those errors can increase the false positive rate.
The Population Problem
Layer onto that one more issue: some segments really are IBD but don’t reflect a recent common ancestor. Instead, they are shared because they were widespread in the ancestral population. This often happens because the ancestral population was endogamous and the members were all related to one another. Ultimately, all ancestral populations were endogamous, so we all need to be concerned about small segments misleading us.
Consider this scenario: Person A and Person B share the “blue” DNA segment (solid path), and both also have “Purple” in their trees (dashed path). While it might seem reasonable to say the blue segment is proof that A and B are both descended from “Purple”, that would be wrong. The shared DNA came to both A and B through a much more distant ancestor via a population that had lots of blue descendants.
In this case, three statements are true:
- A and B might both be descended from Purple.
- A and B share a “blue” segment of DNA.
- The “blue” segment is not evidence that A and B are descended from Purple.
In other words, the small IBD segment in this case might be proof that A and B are related, but it’s not proof of how they are related.
And that conclusion can only be drawn after you have confirmed that the segment is IBD in the first place.