Have you ever looked at a particularly large shared segment with a distant cousin and wondered “What are the odds?” We wouldn’t question a 100-cM segment shared with a 1st cousin, but with a paper-trail 5th cousin, we might be suspicious. Turns out, we can calculate the probability using some basic statistics. Math to the rescue!
Centimorgans
Some background: We each have two sets of chromosomes, one inherited from our mother and one from our father. Each of our parents also has two sets of chromosomes, but they passed on only one set to us through their gametes (the egg for Mom, the sperm for Dad). The process of making gametes is called meiosis, and two important things happen. (1) The two sets of chromosomes in our parents pair up with their partners (the two copies of chromosome 1 pair, the two copies of chromosome 2 pair, etc.) and swap bits. This is called crossing over, and it’s what causes DNA segments to get smaller over generations. (2) After crossing over, the resulting mix-and-match chromosomes are divvied up into the gametes, such that each egg or sperm gets only one copy of each chromosome, but which one it gets is random. This is called independent assortment.
The image below shows a hypothetical pair of chromosome 1 (at the top) that experiences two cross-over events (the black Xs) to produce two recombined products (at the bottom). Only one of the two daughter chromosomes will be passed on to the child during independent assortment. (It’s actually more complicated than this in real life; this cartoon is meant to convey the basic concept.)
We “measure” DNA segments in units called centimorgans (abbreviated cM). A centimorgan isn’t really a distance, though, because chunks of DNA with the same length in bases can have different centimorgan values. Instead, centimorgans are an indicator of how much crossing over is likely to take place in the daughter chromosome that is passed on. Many people think a centimorgan is a probability, that a 100-cM segment has a probability of 1.0 for crossing over, but it’s not that, either. If it were, the maximum possible centimorgan value would be 100, and most chromosomes are longer than that.
A centimorgan is actually an average. Take chromosome 1, which is 281.5 cM at GEDmatch. That means that, on average, it will experience 2.815 inherited crossovers in a single generation. Sometimes it will cross over once, sometimes four times, sometimes not at all. Over thousands of cases, though, the number of crossovers will average to 2.815. The same goes for any chromosome or segment: It should experience crossing over, on average, C/100 times, where C is the centimorgan value.
A Law of Probabilities
The probability of two independent events occurring is the probability of the first one times the probability of the second. For example, the chances that I could flip a coin twice and get two heads in a row is 0.5 x 0.5 = 0.25, or a 25% chance. It doesn’t matter what the events are, as long as they’re independent. The probability that I could flip heads twice (0.25) and that my dog will be hungry (1.0. She’s a Lab; she’s always hungry!) is 0.5 x 0.5 x 1.0 = 0.25. If I wanted to add a third independent event, I’d simply multiply its probability times the combined probability that I already have. My family goes out to dinner about once a week (1/7, or about 0.14 probability), so the chances that I could flip two heads, my dog would be hungry, and we’d eat out are 0.5 x 0.5 x 1.0 x 1/7 ≈ 0.0357, or about 3.6%.
These are somewhat silly examples to demonstrate that the probability of multiple independent events all occurring is equal to the probability of the first x the probability of the second x the probability of the third, and so on until each individual event has been factored in. You will see how this relates to segment inheritance in a bit.
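As a quick check of the arithmetic above, here is a minimal Python sketch that multiplies the probabilities of independent events. The function name `prob_all` is only an illustrative choice, not something from the original post.

```python
from math import prod

def prob_all(*probabilities):
    """Probability that every one of several independent events occurs."""
    return prod(probabilities)

two_heads = prob_all(0.5, 0.5)                # two coin flips
with_dog = prob_all(0.5, 0.5, 1.0)            # plus an always-hungry Lab
with_dinner = prob_all(0.5, 0.5, 1.0, 1 / 7)  # plus a weekly dinner out

print(two_heads)              # 0.25
print(with_dog)               # 0.25
print(round(with_dinner, 4))  # 0.0357
```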
The Poisson Distribution
We can estimate the chances that a segment of a known length crosses over a certain number of times using what’s called the Poisson distribution. Without going into the gory mathematical details, the Poisson distribution lets us predict the odds of an event happening once, twice, thrice, etc.—or even not at all—as long as certain assumptions are met. For the Poisson distribution, those assumptions happen to fit our situation pretty well: (1) the average rate must be known (that’s what centimorgans are), and (2) each event must be independent of the others. That is, a segment can’t be more or less likely to cross over in one generation just because it did (or didn’t) cross over in a previous one.
Wikipedia gives the formula for the Poisson distribution:
$$P(k) = e^{-\lambda}\,\frac{\lambda^{k}}{k!}$$
Before your eyes glaze over, let me explain. Recall that a 100 cM segment crosses over an average of once; put another way, our average number of events per interval (λ, or lambda) is the centimorgan value divided by 100. The variable k represents the number of events you’re interested in. Here, k is zero, because we want to know how likely it is that a segment will experience no crossovers.
Here’s where math is on our side. Any number to the power of zero (the numerator in the equation’s fraction) is equal to 1. And the factorial of zero (the denominator) is also 1. Because k = 0 in our situation, the fraction part of the equation simplifies to 1/1 and can be ignored entirely! Thus, the probability that a segment of C centimorgans will experience zero crossovers in a single parent–child transmission event is $P = e^{-C/100}$.
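To make the simplification concrete, the sketch below computes the general Poisson probability and confirms that the k = 0 case reduces to e raised to the power –C/100. The function names `poisson_pmf` and `p_no_crossover` are illustrative assumptions, not part of any particular tool.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average number is lam."""
    return exp(-lam) * lam**k / factorial(k)

def p_no_crossover(cm):
    """Chance that a segment of `cm` centimorgans has zero crossovers
    in a single transmission (the k = 0 case with lam = cm / 100)."""
    return exp(-cm / 100)

for cm in (10, 50, 100):
    # The full formula with k = 0 agrees with the simplified form.
    assert abs(poisson_pmf(0, cm / 100) - p_no_crossover(cm)) < 1e-12
    print(cm, round(p_no_crossover(cm), 3))
# 10 0.905
# 50 0.607
# 100 0.368
```

Note that even a 100-cM segment escapes crossing over about 37% of the time, which is why a centimorgan value is not itself a probability.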
To figure out how likely the segment is to survive more than one transmission event, recall the law of probabilities above. For multiple events, we multiply the probability of each. In this case, it’s P for the first transmission x P for the second x P for the third, and so on. For two steps (say, grandparent to parent to child), it would be P x P, or $P^2$. For three transmissions, it would be $P^3$, and for N transmissions, the calculation is $P^N$.
Easy Peasy, Right?
Not quite. There are two more things we have to consider. The first is that we have no way of “seeing” crossovers in a parent–child comparison. A parent and child are half identical across all the autosomes, no matter how many crossovers took place in the parent. When we compare a grandparent and grandchild, although there were two generations in which crossover occurred, we can only see the results of one. That means we need to adjust our formula to account for the fact that one generation of crossovers will always be invisible to us: instead of $P^N$, the correct calculation is $P^{N-1}$.
But Wait! There’s More!
We have one more consideration when calculating the probability we’re after: independent assortment. So far, I’ve mainly talked about segments as if there’s only one that either does or does not experience a crossover, but that’s a simplification. Recall that our parents have two copies of each chromosome. When we’re tracking a specific segment (like one we share with a cousin), we need to consider both the chance that that region didn’t cross over, $P^{N-1}$, as well as the chance that it gets passed on at all.
Let’s go back to our visual from above. Imagine that these are my mom’s two copies of chromosome 1 part way through meiosis. Crossing over has occurred but not independent assortment. Also imagine that my mom matches her niece (my 1st cousin) on that red 90-cM segment. Will I match my cousin there?
If I inherited the upper chromosome in this image, I will match my cousin; if I inherited the lower one, I won’t. There’s a 50-50 chance, either way.
So now we have the last piece of our mathematical puzzle: We multiply the probability P that the segment of interest escapes crossing over by the probability that it gets passed on (0.5). Only then do we consider what happens over multiple generations. Let’s put it all together. The chances that a segment of C centimorgans will survive N transmissions is:
$P = (e^{-C/100} \times 0.5)^{N-1}$
Clear as mud, right?!?
Don’t feel bad if you’re confused. This is complicated stuff. I do these calculations in a spreadsheet to avoid mistakes. Here’s an example with some pre-set centimorgan units for reference.
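For readers who prefer code to a spreadsheet, the formula can be sketched in a few lines of Python. The function name `p_specific_segment` and the example inputs (a 50-cM segment and N = 10 transmissions, which assumes 4th cousins are five transmissions down each line from the shared ancestors) are illustrative assumptions.

```python
from math import exp

def p_specific_segment(cm, n_transmissions):
    """Chance that one specific segment of `cm` cM survives intact.

    Each counted transmission requires no crossover in the segment,
    e^(-cm/100), AND the right chromosome copy being passed on (0.5).
    One transmission is left out (N - 1) because crossovers between a
    parent and child cannot be observed.
    """
    per_step = exp(-cm / 100) * 0.5
    return per_step ** (n_transmissions - 1)

# Assumption: 4th cousins are separated by N = 10 transmissions.
p = p_specific_segment(50, 10)
print(f"{p:.3g}")    # 2.17e-05
print(round(1 / p))  # roughly 46,000, i.e. about 1 in 46,000
```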
Ideally, someone will be willing to collaborate with me on a calculator that would let users plug in a relationship and centimorgan amount and have a probability pop out. Please contact me if you’re interested.
Another Law of Probability
If those numbers look really low to you, you’re getting it! They are low. That’s because I’m not calculating the odds of a cousin sharing any segment of C cM with you, but rather the odds of a cousin sharing a specific segment of C cM with you. That’s an important distinction, because the odds of the former are a lot higher.
This brings us to another law of probability: if you’re interested in the chance that either of two mutually exclusive events will happen (instead of both), you add each probability (instead of multiplying them). That rule can’t help here, because the multiple ways that two cousins could share a C-cM segment are neither mutually exclusive nor independent of one another. Take this hypothetical example from chromosome 22, which is about 79.1 cM in total.
Two cousins could share a 50-cM segment all the way on the left or all the way on the right (or anywhere in between), but the possibilities would always overlap (black arrow). That is, they aren’t separate, non-overlapping outcomes, and we can’t simply add up the individual probabilities.
To get at the question of how likely two cousins are to share any segment of a given size, we’re going to need simulated data, which is above my pay grade. I know a few people are working on simulations, and I look forward to seeing their findings.
Why Do We Need to Know the Odds?
DNA data are becoming increasingly important in genealogical proofs. For segments to provide compelling evidence, especially for distant relationships, we must consider how likely they are to be shared in the first place. If you are looking at a match that is statistically improbable, like a 50-cM segment shared between putative 5th cousins, you need extra diligence in ruling out other, closer relationships as part of your analysis. That’s not to say that 5th cousins will never share a 50-cM segment, just that there’s an extra burden of proof associated with it.
As described in the previous section, we’re going to need simulation data to determine the overall probabilities that two cousins share any segment of a given size, but the numbers described here do have some uses.
The first is when comparing two different relationship scenarios for the same match. For example, my uncle SH shares a 53.9-cM segment with cousin RJ. RJ is a 2C1R through SH’s father and also a 4C1R through SH’s mother. How much more unlikely is it for that 53.9-cM segment to have come through the 4C1R relationship rather than the 2C1R one? Because we’re now considering a specific segment, we can use the calculations presented here. The numbers are low for both relationships ($6.54\times10^{-4}$ and $4.92\times10^{-6}$, respectively), but that’s not what we’re interested in. We’re interested in the ratio between the two. That ratio is 133. Put another way, it’s 133 times more likely for that segment to have come through the 2C1R relationship than the 4C1R one. (Turns out in this case it did come through the more distant relationship, but that’s a topic for another post.)
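The comparison of two relationship scenarios can be sketched as a ratio of two probabilities from the same formula. The transmission counts used here (N = 7 for a 2C1R and N = 11 for a 4C1R) and the function names are assumptions for illustration. Under those assumptions, the quoted probabilities and the ratio of 133 are reproduced at about 52.9 cM, while 53.9 cM gives a ratio near 138, so the inputs should be treated as approximate.

```python
from math import exp

def p_specific_segment(cm, n_transmissions):
    """Chance that one specific segment of `cm` cM survives intact."""
    return (exp(-cm / 100) * 0.5) ** (n_transmissions - 1)

def likelihood_ratio(cm, n_close, n_distant):
    """How many times more likely the closer path is than the distant one."""
    return p_specific_segment(cm, n_close) / p_specific_segment(cm, n_distant)

# Assumed transmission counts: 2C1R = 7, 4C1R = 11.
for cm in (52.9, 53.9):
    print(cm, round(likelihood_ratio(cm, 7, 11)))
# 52.9 133
# 53.9 138
```

Because the per-transmission factor is the same for both paths, the ratio depends only on the segment size and on how many extra transmissions the more distant path requires.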
The other way we can use these numbers is with triangulation. Once two people match one another on a segment, any additional cousins forming a triangulation group have to match in that specific spot. We can use calculated odds for a triangulation group to gauge whether our proposed MRCA is reasonable or whether we might be dealing with a different MRCA or even a pile-up region.
Ultimately, we can even extend the math to include the different recombination rates between men and women. But first, I think the community must become comfortable with simpler, pairwise comparisons and reach a general consensus on when, statistically, to expect an extra accounting for unlikely results. Is it at a 1 in 20 chance (P = 0.05)? One in 100 (P = 0.01)? One in a thousand (P = 0.001)? Where would you draw the line?
“If you are looking at a match that is statistically improbable, like a 50-cM segment shared between putative 4th cousins ($P = 2.17\times10^{-5}$, or 1 in about 46,000), you need extra diligence in ruling out other, closer relationships as part of your analysis. That’s not to say that 4th cousins will never share a 50-cM segment, just that there’s an extra burden of proof associated with it.” This is a tangled problem. The problem is that simply multiplying exponentially decaying functions doesn’t represent the actual inheritance in our population. That is why approaches like Markov chain models are used (see the discussion by AncestryDNA on how they modeled their probability graphs for expected shared chromosome lengths for various numbers of meioses).
There will be many people who will find 4th cousins with whom they share at least 50 cM in statistically significant chromosome regions, though usually each segment will be less than 50 cM. And the smaller the founding population relative to the testing population within a defined number of generations, the more common it will be for 4th cousins to share many segments.
The initial conditions become quite important when the founding population themselves are already close-ish cousins.
The final wrinkle, and I deal with this in my genealogy, is that often matches tend to be multiple cousins. Just the other day a 4C1R showed up for a kit of mine, sharing over 100cM. Upon closer inspection of the pedigrees, it seems like the two people are also 5th cousins, and perhaps by two different paths. Untangling this is difficult. But with a population breeding like this, the likelihood of finding large segments among 4th cousins is not surprising.
Thanks for your thoughts. It’s important to distinguish between a total shared amount of, say, 50 cM and an individual 50-cM segment (whether there are other shared segments or not). I’m only addressing the latter here. I also need to revise the wording because I’m not calculating the probability that 4th cousins will share a 50-cM segment, rather that 4th cousins will share a specific 50-cM segment. We need simulations to get at the former question.
I’m not too sure what to say – I’m lost for words, except to say that [a] I love dogs, [b] I love probability / statistics, [c] you and Blaine are an excellent inspiration to other researchers, but with one minor constructive comment – could you include a discussion on confidence levels, or similar.
Do you or Blaine know, or have met Abby Sciuto from NCIS ??? I’m sure you’d get on just fine, LOL
Thank you for the kind words. As for confidence levels, the way I’m doing the math doesn’t lend itself to that. Simulations would be a great asset here, but they’re above my pay grade. I’m hoping to spur others to do them!
A calculator in Excel based on segment size and relationship can be built using a lookup table with input variable Relationship and output variable N.
Please email me if you’d like to receive an example.
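The lookup-table idea could also be sketched in Python. The table below is hypothetical: the relationship labels and transmission counts (for example, 10 for 4th cousins, with each “removed” adding one) are assumptions for illustration and would need to be extended and checked for real use.

```python
from math import exp

# Hypothetical lookup table: relationship -> number of transmissions (N)
# separating the two testers through their shared ancestors.
TRANSMISSIONS = {
    "1C": 4, "1C1R": 5,
    "2C": 6, "2C1R": 7,
    "3C": 8, "3C1R": 9,
    "4C": 10, "4C1R": 11,
    "5C": 12,
}

def p_for_relationship(relationship, cm):
    """Look up N for the relationship, then apply the post's formula."""
    n = TRANSMISSIONS[relationship]
    return (exp(-cm / 100) * 0.5) ** (n - 1)

print(f"{p_for_relationship('4C', 50):.3g}")  # 2.17e-05
```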
That sounds promising! Do you know how to put something like that online as a simple tool for end users?
Could this explain why the cousins on my chromosome 1 are all getting shaved down by Timber? Does chromosome 1 have a higher chance of this recombination? I have one on GEDmatch with a 65-cM segment on chromosome 1, but Ancestry stripped him down to 14 cM. Conversely, I have a match with only 3 segments, and one is 72 cM long! We share a total of 88 cM.
Timber isn’t based on crossovers. Instead, it finds chromosome regions in your matches that are hugely overrepresented compared to the rest of your genome and then down-weights them. The assumption is that those regions aren’t shared because of recent ancestry but because of deep population-level matching called “excess IBD”. It’s unusual to have them down-weight a segment that large, but it can happen. What is your population background?
okay, I did not read down all the way to the bottom, but remember too, that your cousin, also received the same similar information, one of which is related to your parent. Brother or sister to one or the other of them. SO they too also share the same information, proportionally, possibly as well. (or in the case of the Mormon persuasion, could also be two or three times related, based on pedigree collapse.)
So you would have to factor that into your equation, to calculate the end results that you are attempting to figure out.
Bottom LINE, DNA IS RANDOM at best. We are just lucky that at conception, it was ALL GOOD, and voilà, you are here.
Every step in the connection between two cousins is being considered here.
Thanks for the interesting blog. I would like to know if we can adapt your formula to triangulated relationships and multiple regions of shared DNA. We found Evelyn shares about 50 cM DNA with Judy on chromosome 3 and about 50 cM with both Judy and her brother Michael on chromosome 11. Based on a putative 2nd cousin once removed (2C1R) relationship the pairwise probability that 50 cM sharing is random is <0.001. Since there were 3 separate 50 cM matches the probabilities are multiplied, ie. P20 cM of DNA at chromosome 10 (Judy and Michael), chromosome 12 (Michael) and chromosome 22 (Judy). Each of the latter are associated with an <0.005 probability at the 2C1R level. The combined probability that the latter 4 shared segments can be attributed to chance is <10E-9. Thus, the data are highly consistent with a 2C1R or more distant genetic relationship. However, the results also seem to be consistent with a closer familial relationship. For example, if we propose a potential 1st cousin relationship the combined finding of three 50 cM matches computes to a combined probability of <0.00003 as do the chances of four 20 cM matches. Analysis of vital records indicates the closest potential relationship between Evelyn with either Michael or Judy is 2nd cousin once removed. So how does one apply probability to decide most likely familial relationship in this situation?
I’m afraid I don’t quite follow which chromosomes the shared segments are on, but remember that the probability that full siblings (Judy and Michael) share a given segment is much higher than the probability that one of them will share with a 2C1R. Also, to prove a 2C1R relationship, I wouldn’t use this approach at all; it only applies to individual segments at fixed locations. You might want to read this post to see how I proved a relationship in that range:
https://thednageek.com/claude-duval-lacoste-1871-1926/
My husband and I have both done our DNA. We are both looking to break down brick walls from the early to mid 1800s. As I read “average” tables, we should be actively seeking matches with 5 to 15 cM, yet when you find someone with segments that size AND a name you are looking for, they usually go, “It’s not real.” What I find really frustrating is that with a GEDmatch-type comparison down to 3 cM, we are getting up to 8 segments around the 5-cM size. Although 5 cM is low confidence of a findable paper trail, there must be a formula to increase the probability that a match is real where there are multiple segments? As a side note, I have a paper trail to a known 4th cousin relative and we share 7 segments between 3 and 7 cM. I consider the combination of paper AND so many shared segments to be as close as I can get to verifying the paperwork.
By the time a segment gets down to 40 cM (yes, forty), it’s more likely to be lost completely than to be broken down and passed on in part. The smaller the segment, the worse those odds are. That means true small segments that reflect relationship are expected to be rare. Another factor is that small segments (less than 7 cM) are statistically more likely to be false positives than real IBD. Finally, many small IBD segments are what we call excess IBD, pileups, or population segments. Those terms all mean the same thing: that particular segment was relatively widespread in the historical population that your ancestors came from, so the fact that you share it with someone else from that population doesn’t mean your shared ancestor with that match lived in that population; the shared ancestor could have been hundreds of years prior.
All of which is to say, small segments are dangerous. They are more likely than not to give you misleading information.
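The 40-cM figure can be motivated with a rough sketch under the same Poisson assumption used in the post. In one transmission, a segment is passed on intact with probability 0.5 × e^(–C/100), lost completely with the same probability, and otherwise (at least one crossover inside it) passed on in part. This three-way split and the function name `transmission_outcomes` are simplifying assumptions for illustration, not necessarily how the figure was originally derived.

```python
from math import exp, log

def transmission_outcomes(cm):
    """Return (intact, lost, partial) probabilities for one transmission."""
    no_crossover = exp(-cm / 100)
    intact = 0.5 * no_crossover   # no crossover, right copy passed on
    lost = 0.5 * no_crossover     # no crossover, other copy passed on
    partial = 1 - no_crossover    # at least one crossover in the segment
    return intact, lost, partial

# Complete loss overtakes partial inheritance when
# 0.5 * e^(-C/100) > 1 - e^(-C/100), i.e. when C < 100 * ln(1.5).
threshold_cm = 100 * log(1.5)
print(round(threshold_cm, 1))  # 40.5
```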
I have set up a spreadsheet — with a lookup table for number of transmission events — to calculate the probability of two people inheriting an identical segment using the approach you’ve outlined here. I’ve also added an additional step to calculate the probability of several individuals inheriting the same stretch (I’d love to have my math checked on that).
Contact me by e-mail and I’ll send the spreadsheet.
Will do. Thanks!
Dave, I’m interested in that, do you have a blogpage or FB group I can find you via for connecting?
Hello Dave, I would love that table !
Pls send
thank-you !
Julian
Great post. My question is twofold: (1) transposons constitute some 45% of the human genome. What part do they play in what you describe? (2) we do chromosome mapping of matching DNA segments to our genome, attributing those segments to an ancestor/ancestral couple. Are the ancestral segments on my chromosomes in the exact position (start, stop, chromosome) as they were in the ancestor from whom they were inherited? Everyone makes that assumption but I’m not clear on why this should be the case.
Great questions! For those unfamiliar with transposons, they are short segments of DNA that can move around in the genome and make copies of themselves. Short answer to your question is that it depends on how old the transposon is (how many million years it’s been since it stopped jumping around the genome). I assume that the chips are designed to exclude SNPs in transposons that are unreliable.
Re ancestral segments, I assume you’re asking about transposons and indels (insertions/deletions) in the genome. It’s entirely possible that a segment that starts at position 1,000,000 in you started slightly earlier or later in your ancestor, due to length changes in the genome along the way. However, you’d still be able to cross-reference them because of the SNP composition.
I hope that answers your questions.
I’m a little confused about your paragraph on raising to N-1 instead of N. From my understanding you are calculating the probability of inheriting a specific unbroken segment of length C from a parent, by combining the probability of a cross-over appearing on that section and the probability of inheriting that section. If you plug N = 1 into your formula (i.e. calculate the probability of inheriting that specific segment from your parent), then you get P = 1, implying that you have a 100% chance of inheriting that specific segment unbroken from your parent.
Also, I seem to remember from a molecular ecology textbook from a class I was in a while ago that the probability of inheriting any segment of length C instead of a specific segment of length C can be calculated by summing the probability of inheriting a specific segment of length C across all the possible starting positions for that length of segment. So, if there are 3 billion base pairs in the human genome, then the probability of inheriting a segment of length C would be roughly:
P = ((e^–C/100) x 0.5)^(N–1) x (3,000,000,000 – C*x)
where x is the number of base-pairs in centimorgans (~ 1 million?)
Are those probabilities really not independent? (above my pay-grade)
The N-1 adjustment is explained in the “Easy Peasy” section. It’s there to account for the fact that we can’t “see” the crossovers that occurred between a parent and a child. When I compare myself to each of my parents, we match all the way across each of the 22 autosomes, and it looks like there were no crossovers, even though there were 35 on average.
You’re right that the probability of inheriting any segment of length C is different than the probability of inheriting a specific segment of length C. Your formula would need to be adjusted to accommodate chromosome size, though. The probability of inheriting a segment of 300 cM is zero, because none of the chromosomes is that long. The probability of inheriting a segment of 200 cM would be lower than your calculation, because only 5 of the 22 chromosomes are that long. Even for shorter segments, the true probabilities would be lower than your calculations because as you “walk” the chromosome, you get to a point that is C centimorgans from the end and you can no longer make a segment of size C.
Also, converting cM to BP introduces a lot of error because there is no fixed ratio.