Once you get below 20 cM, a match is more likely to be a 10th cousin than a 4th cousin.
Sounds nuts, right? How is that possible when most of our distant cousins don’t share any DNA at all? Let me explain.
It’s absolutely true that beyond 2nd cousins, some of our biological relatives will not share measurable autosomal DNA with us. There’s about a 7% chance that a true 3rd cousin will not match. For 4th cousins, it’s about 50%. Only about 16% of our 5th cousins will match, and the odds of matching obviously go down from there.
On the flip side, we have a lot more 5th cousins than 4th, a lot more 6th cousins than 5th, and so on. Using a very simple assumption of 2.5 children per couple over the generations, we’d expect to have about 938 4th cousins, roughly 586,000 8th cousins, and more than 14.6 million 10th cousins. It turns out that even if only a tiny fraction of our 10th cousins share DNA with us, that can still outnumber the roughly half of our 938 4th cousins who do.
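If you'd like to check the cousin counts yourself, here is a minimal Python sketch. The formula is my reading of the "2.5 children per couple" assumption rather than something spelled out in the post: you share 2^n ancestral couples with your nth cousins, each couple had 2.5 children (one of whom is your ancestor, leaving 1.5 siblings), and each sibling's line grows by 2.5 per generation for n generations. The function name is hypothetical.

```python
def expected_cousins(degree: int, children_per_couple: float = 2.5) -> float:
    """Expected number of full nth cousins under a constant family size.

    Assumption (not stated explicitly in the post): every couple has the same
    number of children, all of whom leave descendants, and no lines overlap.
    """
    ancestral_couples = 2 ** degree                     # couples shared with nth cousins
    siblings_of_ancestor = children_per_couple - 1      # children other than your own ancestor
    descendants_per_sibling = children_per_couple ** degree  # n generations down to your level
    return ancestral_couples * siblings_of_ancestor * descendants_per_sibling


if __name__ == "__main__":
    for degree in (4, 8, 10):
        print(f"{degree}th cousins: {expected_cousins(degree):,.0f}")
```

With 2.5 children per couple this simplifies to 1.5 × 5^n, which is why each extra cousin level multiplies the count by five.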
This concept can be difficult to intuit, so I performed some computer simulations to give you a better feel for the numbers.
Ped-sim Simulations
Ped-sim (short for “pedigree simulator”) is an open-source computer program released by the laboratory of Dr Amy Williams at Cornell University. (Dr Williams is now a senior scientist at 23andMe.) Ped-sim does not have a graphical user interface, so it’s not for the layperson. However, it has features that are useful to us:
- It’s lightning fast.
- It can simulate shared DNA amounts for any pedigree.
- It can account for the fact that the crossover rate is higher in women than men. Crossing over is how DNA segments break down from one generation to the next.
- It can account for a phenomenon called crossover interference, which limits how closely two crossovers can happen to one another in a given generation.
On the other hand, the default genetic map for Ped-sim is smaller than the maps the genealogy companies use (Bhérer et al., 2017). For example, at AncestryDNA, a parent–child match is about 3,470 cM, whereas with Ped-sim’s map, it would be 3,346 cM.
Because you can’t simply scale up a centimorgan any more than you can scale up a mile, I created a new genomic map modeled on the Ped-sim one but with the AncestryDNA centimorgan total. This map would be useless for biomedical research, but it should be fine for our purposes. All we care about are the sizes of the DNA segments, not precisely where they start and stop.
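The post doesn't say how the custom map was built, so treat the following only as one plausible sketch: stretch every genetic position in the map by the same factor so that the total matches the AncestryDNA figure. The row layout (chromosome, physical position, centimorgan position) and the function name are assumptions for illustration, not Ped-sim's documented file format.

```python
ANCESTRY_TOTAL_CM = 3470.0  # parent-child total at AncestryDNA, as quoted above
PEDSIM_TOTAL_CM = 3346.0    # total with Ped-sim's default map, as quoted above


def rescale_map(rows, target_total=ANCESTRY_TOTAL_CM, source_total=PEDSIM_TOTAL_CM):
    """Return a copy of the map with every cM position stretched uniformly.

    rows: iterable of (chromosome, physical_position_bp, position_cM) tuples.
    Assumption: a uniform stretch is acceptable because only segment sizes
    matter here, not the exact location of recombination hotspots.
    """
    factor = target_total / source_total
    return [(chrom, bp, cm * factor) for chrom, bp, cm in rows]


if __name__ == "__main__":
    # Toy rows, not real map data.
    toy_map = [("1", 1_000_000, 0.0), ("1", 2_000_000, 1.5), ("1", 3_000_000, 2.9)]
    for row in rescale_map(toy_map):
        print(row)
```

The point of rescaling the map before simulating, rather than multiplying the simulated segment sizes afterward, is that the 7-cM threshold is then applied in the same units the genealogy companies use.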
For each relationship, I simulated 50,000 matches using my custom map with sex-specific crossover rates, crossover interference, and a 7-cM minimum segment size. I simulated relationships from grandparent–grandchild down to 10th cousin, then summarized the data in this spreadsheet.
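To make the summary step concrete, here is a small Python sketch of how simulated pairs might be tallied. It is not the code behind the spreadsheet. It assumes each simulated pair has already been reduced to a list of shared segment lengths in cM, and it treats "between 7 and 20 cM" as 7 ≤ total < 20, which is my choice of boundary.

```python
MIN_SEGMENT_CM = 7.0  # minimum segment size used in the simulations


def total_shared_cm(segments_cm, min_segment_cm=MIN_SEGMENT_CM):
    """Sum only the segments that meet the minimum size."""
    return sum(s for s in segments_cm if s >= min_segment_cm)


def summarize(simulated_pairs, low=7.0, high=20.0):
    """Fraction of pairs that match at all, and fraction in the small-match range."""
    totals = [total_shared_cm(pair) for pair in simulated_pairs]
    n = len(totals)
    matching = sum(1 for t in totals if t > 0)          # at least one qualifying segment
    small = sum(1 for t in totals if low <= t < high)   # total falls in [low, high)
    return {"match_rate": matching / n, "small_match_rate": small / n}


if __name__ == "__main__":
    # Toy data: five simulated pairs, each a list of segment lengths in cM.
    toy_pairs = [[], [5.2], [8.1], [12.0, 9.5], [35.0]]
    print(summarize(toy_pairs))
```

Multiplying these rates by the expected number of cousins at each level gives the kind of counts discussed in the next section.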
See for Yourself
Let’s return to my statement about low matches, those who share less than 20 cM. Focus on spreadsheet rows 20–22, which are highlighted in lavender. These calculations assume 2.5 children per generation. For each cousin level, row 20 shows the average number of cousins, row 21 shows the average number who will share measurable DNA, and row 22 shows the average number who will share between 7 and 20 cM.
On average, we have 938 4th cousins, of which about 471 will match and 262 will share between 7 and 20 cM. (Some will share more than 20 cM.) In that same centimorgan range, we’ll have approximately 900 6th cousins, 1,242 8th cousins, and 2,344 10th cousins. In other words, our sub-20 matches contain about nine times as many 10th cousins as 4th cousins.
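The "nine times" figure is simple arithmetic on the numbers above, and a few lines of Python make it easy to replay. The shares printed at the end cover only the four cousin levels quoted in this post (the odd-numbered levels and removed relationships are left out), so read them as a rough illustration rather than true probabilities.

```python
# Expected number of cousins sharing between 7 and 20 cM, as quoted above.
small_matches = {4: 262, 6: 900, 8: 1242, 10: 2344}

# How many 10th cousins per 4th cousin in the sub-20 cM range?
ratio_10th_to_4th = small_matches[10] / small_matches[4]

# Share of each level among only the four levels listed here.
listed_total = sum(small_matches.values())
shares = {degree: count / listed_total for degree, count in small_matches.items()}

if __name__ == "__main__":
    print(f"10th cousins per 4th cousin: {ratio_10th_to_4th:.1f}")
    for degree, share in shares.items():
        print(f"{degree}th cousins: {share:.1%} of the listed sub-20 cM matches")
```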
Be Skeptical of Small Segments
Good genealogy is all about evidence, not just the easy stuff that falls in our laps, but what we find after reasonably exhaustive research. You can’t assume that a funeral record for “Mary Smith” in a city full of Marys and Smiths is proof about your 3rd great grandmother until you’ve vetted it: year, age, race, neighborhood, religious denomination, maiden or married surname, next of kin, and so on. You need to demonstrate that the record is not for some other Mary Smith before you can claim it for your Mary Smith.
The same goes for DNA. A shared segment is not evidence for any given relationship until you’ve shown that you could not have inherited it through any other relationship path.
This is true even if traditional genealogy has led you to a common ancestor with your match. You might truly be a 4th cousin to that match through Mary Smith, but one or both of you could have inherited a shared DNA segment via different ancestors. In other words, the segment is not evidence for the relationship until you can show that you both inherited it from Mary and not some other ancestor. That’s especially challenging when you consider that a segment less than 20 cM could have come from a 10th great grandparent on a completely different line for either of you.
The truth is, most of us stop looking for connections to our matches once we find one, and we’re more likely to find a 4th cousin connection than a 10th cousin one, even though the 10th cousin relationship is statistically more likely. (I’m guilty of this too!) But finding a genealogical connection isn’t the same thing as proving that the DNA match came via that connection. If you use the paper trail to argue that the segment came from Mary Smith, and the segment to validate the paper trail, you are guilty of circular logic.
To understand why this is important, we need to think about the lives of our ancestors. Travel was challenging. Social norms restricted associations. Even immigrants tended to move in extended family units. That is, our ancestors from any given place were probably all related to one another. In fact, in 2018, scientists from MyHeritage reported in a scientific paper that married couples between 1650 and 1850 tended to be 4th cousins to one another (Kaplanis et al., 2018).
In other words, Mary Smith was probably related to her husband, and her children were probably related to their spouses, as were her parents and siblings and cousins and so on. A shared segment with another descendant of Mary has many alternate paths to follow.
What Good Are Small Matches?
Small matches can point you to a geographic region or possibly even an extended family. But they can’t tell you who, exactly, your ancestor was. For that, you’ll need traditional research. Even if the documentation pans out, you still can’t assume that the segments in question came from that ancestor.
For example, I have a 4th great grandmother named Marianne Dykes of unknown parentage. My family also has very distant DNA matches (7–32 cM) who are descended from a couple named William Dykes and Phoebe Singleton from eastern Louisiana. I can find no record that they had a daughter named Marianne, but Phoebe’s father was in the process of moving to Cajun Louisiana when he died, so I have a plausible connection between the two regions. Does that prove who Marianne’s parents were? No, I’ll need records for that. Those small matches did draw my interest to eastern Louisiana, though, and that’s a clue I never would have found otherwise.
The simulations provide other valuable insights, which I’ll explore in future posts.
Thanks for the post! The further back my MRCAs are with the match I’m researching, the more I focus on following the spouse lines too, for this very reason. I’ve made a tree tag on Ancestry, “MultipleRelationships,” which I apply as soon as I find a second one, and often there are more which are easily found – if I look. And these days, I always look. And then look again.
For once it is useful that my paternal grandparents were born thousands of miles away from each other. Both, ultimately Rhine Valley, though.
It helps to have first cousins testing. The largest red flag is actually IF a match has DNA in common with all cousins.
It helps to know segments and to plot them on DNAPainter. If the cousins share the same pieces and a couple of matches share the same pieces, it suggests I am on the right track. Three cousins only have UK ethnicity on their other side. I have to be cautious there.
I tend to think Occam’s razor has a place.
If by Occam’s razor you mean that the MRCA is the source of the segment, I fear that can get you into trouble. Sorting matches by the four grandparent lines is pretty straightforward if you don’t have endogamy. That narrows the possibilities for alternate inheritance paths a great deal.
Thank you for a very interesting post. I note that the model is very sensitive to the assumed number of children each couple has. On what basis has 2.5 children been calculated?
It’s not the number of children born to each couple that is important, but the number of children born who have descendants to my generation.
I would have guessed a higher figure than 2.5, but it would just be a guess.
Fabulous insight! Yes, what matters is the number of children that leave descendants of their own. I used 2.5 children/couple because I’m pretty sure that’s what AncestryDNA used for their probabilities, and it’s a fairly standard population growth rate. (Long story: I tried for years to get a straight answer out of them, and that’s the closest I got.)
A more realistic model would be to have larger families in, say, the 1800s and smaller ones today. I’d like to do a post where I play around with those numbers to see how it affects the matching estimates.
Thanks. As an average 2.5 seems to be as good as any other number.
One other factor to consider: we probably have a smaller percentage of our 10th cousins living today than the percentage of 4th cousins. Some (many?) of our 10th cousins were born many years ago, or haven’t yet been born, and so they will not have been tested. A higher percentage of fourth cousins will have done a DNA test.
Is there a measure of the spread of the ages for cousins?
Dewi
You make an interesting point about the offset in birth years for different generations. I suspect as far as our matches go that it’s offset by genetically equivalent removed relationships. For example, we might not overlap lifespans with all of our 10C, but that category also includes 9C2R, 8C4R, etc.
> The same goes for DNA. A shared segment is not evidence for any given relationship until you’ve shown that you could not have inherited it through any other relationship path.
I think you’re missing something important, maybe two somethings.
* At present, most test labs look for SNPs, not full DNA sequences. So, you could have one SNP from one source and another from another source. When you get down to the lower cMs, you’re getting into statistically uncertain territory.
* At present, most test labs don’t identify which DNA strand any particular SNP came from, so one SNP could come from one parent and the next from the other parent, so that it appears you have a match.
False matches (caused by the phasing issues you describe) are a separate and important issue but they don’t affect this analysis. There are no false matches in the simulated data.
Thank you for doing this. I find DNA the most complicated aspect of genealogy. I have a long way to go to understand, but this kind of information will lead the way to dealing better with DNA results.
You’re welcome!
Except…I have begun a study of siblings and the variation of cM amounts within matches to each. This began when I was looking for my grandmother’s father and I began comparing the DNA matches of two of her children, my aunt and uncle. Their largest delta match was 110 cM vs 23 cM for the same person. Since there were some large swings in matches, I compared my brother and me. Our largest was 149 cM vs 17 cM. I have sent two more kits to siblings of other cousins so I can compare those also. Had I not compared our matches, I would have clearly dismissed relevant relationships, depending on who I looked at first.
You are describing issues with the matching algorithms at the various sites rather than the segments themselves. That’s not a factor for simulated data.
I’m curious which site gives you the larger matches and which the smaller. Which site do you think is most accurate and why?
Couldn’t agree more about what you say, however using segment data and chromosome analysis techniques such as visual phasing and walking back the segments through the generations can confirm, or at least increase the likelihood, that the segments and the genealogy are correct even with smaller segments <20cMs. I totally agree that people jump too quickly to conclusions with one genealogical match, without confirming that it is also a genetic match on that line. Shared matches can be very misleading the further back you go, but shared segments on the other hand help to increase the chance that the hypothesis about the genealogy is correct.
Thank you so much for saying this! Chromosome mapping and especially the “walking back the segment” of Jim Bartlett can go a long way to ruling out the alternate paths of inheritance. Once you’ve done that, those small segments can be a valuable part of our work.
Thanks for bringing this article to my attention, but the Kaplanis et al., 2018 article is seriously flawed. There are two main issues. First, while the statistical methodology used might be appropriate for the authors’ aims, the base problem is that assumptions are made which a science-based genealogist who knows the data in these pedigrees would not make. Second, some of their ‘tests’ for validation are biased to the point of being of no utility. The Vermont vital records database, for example, is just silly to use as a representation of, I suppose, class bias in genealogy trees. And one NPE per family? No, that’s clearly not supported by historic demographic data. Despite all those big words, the old axiom applies: junk in, junk out.
I assume you’re referring to this sentence in Kaplanis et al.: “Using a prior of no more than a single non-paternity event per lineage, we estimated a non-maternity rate of 0.3% per meiosis and non-paternity rate of 1.9% per meiosis.”
They aren’t saying that there was one NPE per family. A prior is a statistical parameter in Bayesian analysis. Put simply, once a misattributed parentage event disassociates the haplogroup from the paper lineage, it’s difficult to tell whether there was one MPE, two, or several. In the analysis, Kaplanis et al. assumed no more than one (i.e., between zero and 1 inclusive). If anything, they underestimated the MPE rate, because there will be lineages with more than one.
In any case, that part of the paper is not relevant to this blog post.
Thanks for the great post.
This analysis (https://www.biorxiv.org/content/10.1101/352732v1) is a bit dated (done in 2017/8), so I am not sure if the data are still completely accurate, but I suspect that even if the exact SNPs examined have changed over time, the companies’ philosophies are likely the same. I think genealogy companies tune their SNP selection to optimize geographical information rather than inter-individual matching. The two goals are intertwined but not identical, and inter-individual matches are completely independent of the geographical databases maintained by each company. 23andMe was an outlier in SNP choice in that they favored many more rare SNPs with a higher ancestral bias than either Ancestry or MyHeritage. This should also have an impact in that agreement on the rare SNPs should be more meaningful than matching of the more common SNPs. I don’t know whether their analysis was skewed such that those SNPs were more important or not. It would be useful to know which SNPs are shared among supposedly related individuals. Sharing common SNPs is much less meaningful than sharing rare SNPs.
Thanks for the link. You make some interesting points. Fortunately, SNP selection is not a factor in the simulations, so we can draw conclusions about the longevity of small segments regardless of the SNPs used.
Wow, what a great tool, totally improved my understanding of the role of segment length!
Just a comment about your spreadsheet. I include in a separate column, the ahnentafel of the match to the CA as well as mine. It certainly shows which way the removed generation(s) are. Also when I have endogamy because my grandparents were cousins, for instance, I show ahnentafels for both paths, e.g. 54/56 which immediately shows why cMs are higher than expected. Thank you for your interesting post.
Your paragraphs about Mary Smith describe endogamy. Has any research been done to find a measure for endogamy?
E.g. on Ancestry, the percentage of shared matches, compared to total matches. Or the percentage of total matches which are over 20 cM.
I use average segment size as a gauge of endogamy.
“There’s about a 7% chance that a true 3rd cousin will not match. For 4th cousins, it’s about 50%. Only about 16% of our 5th cousins will match, and the odds of matching obviously go down from there.” 7% of 3rd cousins won’t match and then 50% of 4th cousins won’t match? There is a difference between ‘not matching’ and not sharing DNA though, right? If I paraphrase, are you saying that 50% of 4th cousins will share less than 7 cM, resulting in the testing company not matching the two true 4th cousin testers as related? There are only 512 ways (paths) for someone to be your 4th cousin. 4th cousins share an average of 0.20% of their DNA, which coincidentally is 1/512 of their DNA. Since Tester 1 can be either a man or a woman, the total ways that two people can be 4th cousins are 1024, and 169 of 1024 happen to be on paths that can share centimorgans on the X chromosome, which won’t be included in the testers’ total shared cM by most testing companies except for 23andMe. Ancestry has the lowest 100% total cM of all the testing companies, and the low end of 4th cousin is 9.8 cM. Testers that shared 12 cM total with all of it on the X chromosome would be 4th cousins at 23andMe but would not even match at Ancestry. About 16% of 4th cousins have the potential to have autosomal amounts dip below the 7 cM reporting threshold. It is just a different way to look at the same issue: a reason why some 4th cousins are not matching has nothing to do with the amount of DNA they share; it has to do with the amount of shared DNA not being reported by the testing companies. At 23andMe the reverse happens on a few paths. Some paths share more cM on X than the expected high end of the 4th cousin range, so they might have cM amounts consistent with half 3rd cousins.
Correct: 7% of 3rd cousins, 50% of 4th cousins, and 84% of 5th cousins won’t match one another in the genealogy databases, assuming a 7-cM match threshold. The X chromosome is not included in those calculations.
Wonderful work.
I try to persuade people to start from the big matches and work downwards. Someone always pipes up “but I had a great 6 cM match”, ignoring all similar matches that were rubbish. So it’s really good to have some statistics to back me up.
In my own research I am very aware that there may be more than one possible path of connection between me and the match. And some may be on the other parental side. I have to accurately assign each individual match segment to a line to be certain.
Your careful approach is the model we should all aspire to.