Have you ever looked at a particularly large shared segment with a distant cousin and wondered “What are the odds?” We wouldn’t question a 100-cM segment shared with a 1st cousin, but with a paper-trail 5th cousin, we might be suspicious. Turns out, we can calculate the probability using some basic statistics. Math to the rescue!
Some background: We each have two sets of chromosomes, one inherited from our mother and one from our father. Each of our parents also has two sets of chromosomes, but they passed on only one set to us through their gametes (the egg for Mom, the sperm for Dad). The process of making gametes is called meiosis, and two important things happen. (1) The two sets of chromosomes in our parents pair up with their partners (the two copies of chromosome 1 pair, the two copies of chromosome 2 pair, etc.) and swap bits. This is called crossing over, and it’s what causes DNA segments to get smaller over generations. (2) After crossing over, the resulting mix-and-match chromosomes are divvied up into the gametes, such that each egg or sperm gets only one copy of each chromosome, but which one it gets is random. This is called independent assortment.
The image below shows a hypothetical pair of chromosome 1 (at the top) that experiences two cross-over events (the black Xs) to produce two recombined products (at the bottom). Only one of the two daughter chromosomes will be passed on to the child during independent assortment. (It’s actually more complicated than this in real life; this cartoon is meant to convey the basic concept.)
We “measure” DNA segments in units called centimorgans (abbreviated cM). A centimorgan isn’t really a distance, though, because chunks of DNA with the same length in bases can have different centimorgan values. Instead, centimorgans are an indicator of how much crossing over is likely to take place in the daughter chromosome that is passed on. Many people think a centimorgan is a probability, that a 100-cM segment has a probability of 1.0 for crossing over, but it’s not that, either. If it were, the maximum possible centimorgan value would be 100, and most chromosomes are longer than that.
A centimorgan is actually an average. Take chromosome 1, which is 281.5 cM at GEDmatch. That means that, on average, it will experience 2.815 inherited crossovers in a single generation. Sometimes it will cross over once, sometimes four times, sometimes not at all. Over thousands of cases, though, the number of crossovers will average to 2.815. The same goes for any chromosome or segment: It should experience crossing over, on average, C/100 times, where C is the centimorgan value.
A Law of Probabilities
The probability of two independent events occurring is the probability of the first one times the probability of the second. For example, the chances that I could flip a coin twice and get two heads in a row is 0.5 x 0.5 = 0.25, or a 25% chance. It doesn’t matter what the events are, as long as they’re independent. The probability that I could flip heads twice (0.25) and that my dog will be hungry (1.0; she’s a Lab, so she’s always hungry!) is 0.5 x 0.5 x 1.0 = 0.25. If I wanted to add a third independent event, I’d simply multiply its probability times the combined probability that I already have. My family goes out to dinner about once a week (1/7, or about 0.14 probability), so the chances that I could flip two heads, my dog would be hungry, and we’d eat out are 0.5 x 0.5 x 1.0 x 1/7 = 0.0357, or about 3.6%.
These are somewhat silly examples to demonstrate that the probability of multiple independent events all occurring is equal to the probability of the first x the probability of the second x the probability of the third, and so on until each individual event has been factored in. You will see how this relates to segment inheritance in a bit.
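For readers who like to check the arithmetic, here is a minimal Python sketch of the multiplication rule using the coin, dog, and dinner numbers from above. The function name `joint_probability` is my own invention for illustration, and the sketch assumes the events really are independent.

```python
from math import prod


def joint_probability(*probabilities: float) -> float:
    """Chance that ALL of the given independent events occur.

    Assumes the events are independent; the rule does not hold otherwise.
    """
    return prod(probabilities)


# Two heads in a row.
two_heads = joint_probability(0.5, 0.5)

# Two heads and a hungry Lab (probability 1.0).
two_heads_hungry_dog = joint_probability(0.5, 0.5, 1.0)

# Two heads, a hungry Lab, and dinner out (about once a week, 1/7).
all_events = joint_probability(0.5, 0.5, 1.0, 1 / 7)

print(f"Two heads:                 {two_heads:.4f}")
print(f"Two heads + hungry dog:    {two_heads_hungry_dog:.4f}")
print(f"All of the above + dinner: {all_events:.4f}")
```

Adding another independent event is just one more argument to the function, which is the same pattern used for segments across generations.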
The Poisson Distribution
We can estimate the chances that a segment of a known length crosses over a certain number of times using what’s called the Poisson distribution. Without going into the gory mathematical details, the Poisson distribution lets us predict the odds of an event happening once, twice, thrice, etc.—or even not at all—as long as certain assumptions are met. For the Poisson distribution, those assumptions happen to fit our situation pretty well: (1) the average rate must be known (that’s what centimorgans are), and (2) each event must be independent of the others. That is, a segment can’t be more or less likely to cross over in one generation just because it did (or didn’t) cross over in a previous one.
Wikipedia gives the formula for the Poisson distribution:
$$P(k \text{ events in interval}) = e^{-\lambda} \frac{\lambda^k}{k!}$$
Before your eyes glaze over, let me explain. Recall that a 100 cM segment crosses over an average of once; put another way, our average number of events per interval (λ, or lambda) is the centimorgan value divided by 100. The variable k represents the number of events you’re interested in. Here, k is zero, because we want to know how likely it is that a segment will experience no crossovers.
Here’s where math is on our side. Any number to the power of zero (the numerator in the equation’s fraction) is equal to 1. And the factorial of zero (the denominator) is also 1. Because k = 0 in our situation, the fraction part of the equation simplifies to 1/1 and can be ignored entirely! Thus, the probability that a segment of C centimorgans will experience zero crossovers in a single parent–child transmission event is $P = e^{-C/100}$.
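If you would rather see this in code than in symbols, the following Python sketch computes the Poisson probabilities for chromosome 1 (281.5 cM, as quoted earlier) and confirms that the k = 0 case reduces to e^(-C/100). The function names are hypothetical, and the sketch takes the Poisson assumptions described above at face value.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k events when the average is lam."""
    return exp(-lam) * lam**k / factorial(k)


def prob_no_crossover(cm: float) -> float:
    """Chance a segment of `cm` centimorgans has zero crossovers
    in one parent-child transmission (the k = 0 case)."""
    return exp(-cm / 100)


# Chromosome 1 at 281.5 cM averages 2.815 crossovers per generation.
chr1_cm = 281.5
lam = chr1_cm / 100
for k in range(5):
    print(f"P({k} crossovers on chromosome 1) = {poisson_pmf(k, lam):.4f}")

# The shortcut and the full formula agree when k = 0.
print(f"Full formula, k = 0: {poisson_pmf(0, lam):.4f}")
print(f"Shortcut e^(-C/100): {prob_no_crossover(chr1_cm):.4f}")
```

The loop also illustrates the earlier point that a chromosome sometimes crosses over once, sometimes several times, and sometimes not at all.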
To figure out how likely the segment is to survive more than one transmission event, recall the law of probabilities above. For multiple events, we multiply the probability of each. In this case, it’s P for the first transmission x P for the second x P for the third, and so on. For two steps (say, grandparent to parent to child), it would be P x P, or $P^2$. For three transmissions, it would be $P^3$, and for N transmissions, the calculation is $P^N$.
Easy Peasy, Right?
Not quite. There are two more things we have to consider. The first is that we have no way of “seeing” crossovers in a parent–child comparison. A parent and child are half identical across all the autosomes, no matter how many crossovers took place in the parent. When we compare a grandparent and grandchild, although there were two generations in which crossover occurred, we can only see the results of one. That means we need to adjust our formula to account for the fact that one generation of crossovers will always be invisible to us: instead of $P^N$, the correct calculation is $P^{N-1}$.
But Wait! There’s More!
We have one more consideration when calculating the probability we’re after: independent assortment. So far, I’ve mainly talked about segments as if there’s only one that either does or does not experience a crossover, but that’s a simplification. Recall that our parents have two copies of each chromosome. When we’re tracking a specific segment (like one we share with a cousin), we need to consider both the chance that that region didn’t cross over, $P^{N-1}$, as well as the chance that it gets passed on at all.
Let’s go back to our visual from above. Imagine that these are my mom’s two copies of chromosome 1 part way through meiosis. Crossing over has occurred but not independent assortment. Also imagine that my mom matches her niece (my 1st cousin) on that red 90-cM segment. Will I match my cousin there?
If I inherited the upper chromosome in this image, I will match my cousin; if I inherited the lower one, I won’t. There’s a 50-50 chance, either way.
So now we have the last piece of our mathematical puzzle: We multiply the probability P that the segment of interest escapes crossing over by the probability that it gets passed on (0.5). Only then do we consider what happens over multiple generations. Let’s put it all together. The chances that a segment of C centimorgans will survive N transmissions is:
$$P = \left(e^{-C/100} \times 0.5\right)^{N-1}$$
Clear as mud, right?!?
Don’t feel bad if you’re confused. This is complicated stuff. I do these calculations in a spreadsheet to avoid mistakes. Here’s an example with some pre-set centimorgan units for reference.
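For anyone who prefers code to a spreadsheet, here is a rough Python sketch of the same calculation. It is not my spreadsheet itself: the function name, the segment sizes, and the transmission counts are all placeholders I picked for illustration, and the formula is the one given above, (e^(-C/100) x 0.5) raised to the power N-1.

```python
from math import exp


def segment_survival_probability(cm: float, transmissions: int) -> float:
    """Chance that one specific segment of `cm` centimorgans survives intact.

    Each visible transmission needs no crossover in the segment,
    e^(-cm/100), and the right chromosome copy being passed on (0.5).
    One transmission is invisible to us, so the exponent is N - 1.
    """
    if transmissions < 1:
        raise ValueError("transmissions must be at least 1")
    per_transmission = exp(-cm / 100) * 0.5
    return per_transmission ** (transmissions - 1)


# Placeholder values, not the ones from the original spreadsheet.
SEGMENT_SIZES_CM = (10, 25, 50, 100)
TRANSMISSION_COUNTS = (2, 4, 6, 8, 10, 12)


def print_table() -> None:
    """Print one row per transmission count, one column per segment size."""
    header = f"{'N':>3}" + "".join(
        f"{str(cm) + ' cM':>12}" for cm in SEGMENT_SIZES_CM
    )
    print(header)
    for n in TRANSMISSION_COUNTS:
        cells = "".join(
            f"{segment_survival_probability(cm, n):>12.2e}"
            for cm in SEGMENT_SIZES_CM
        )
        print(f"{n:>3}{cells}")


print_table()
```

Scientific notation keeps the very small values readable; change the two tuples to explore other segment sizes or relationships.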
Ideally, someone will be willing to collaborate with me on a calculator that would let users plug in a relationship and centimorgan amount and have a probability pop out. Please contact me if you’re interested.
Another Law of Probability
If those numbers look really low to you, you’re getting it! They are low. That’s because I’m not calculating the odds of a cousin sharing any segment of C cM with you, rather the odds of a cousin sharing a specific segment of C cM with you. That’s an important distinction, because the odds of the former are a lot higher.
This brings us to another law of probability: if you’re interested in the chance that either of two events will happen (instead of both), you add each probability (instead of multiplying them), but only when the events are mutually exclusive, meaning they can’t both happen. That rule can’t help us here, because the multiple ways that two cousins could share a C-cM segment overlap with one another rather than being mutually exclusive. Take this hypothetical example from chromosome 22, which is about 79.1 cM in total.
Two cousins could share a 50-cM segment all the way on the left or all the way on the right (or anywhere in between), but the possibilities would always overlap (black arrow). That is, they wouldn’t be mutually exclusive (or independent of one another, for that matter), and we can’t simply add up the individual probabilities.
To get at the question of how likely two cousins are to share any segment of a given size, we’re going to need simulated data, which is above my pay grade. I know a few people are working on simulations, and I look forward to seeing their findings.
Why Do We Need to Know the Odds?
DNA data are becoming increasingly important in genealogical proofs. For segments to provide compelling evidence, especially for distant relationships, we must consider how likely they are to be shared in the first place. If you are looking at a match that is statistically improbable, like a 50-cM segment shared between putative 5th cousins, you need extra diligence in ruling out other, closer relationships as part of your analysis. That’s not to say that 5th cousins will never share a 50-cM segment, just that there’s an extra burden of proof associated with it.
As described in the previous section, we’re going to need simulation data to determine the overall probabilities that two cousins share any segment of a given size, but the numbers described here do have some uses.
The first is when comparing two different relationship scenarios for the same match. For example, my uncle SH shares a 53.9-cM segment with cousin RJ. RJ is a 2C1R through SH’s father and also a 4C1R through SH’s mother. How much more unlikely is it for that 53.9-cM segment to have come through the 4C1R relationship rather than the 2C1R one? Because we’re now considering a specific segment, we can use the calculations presented here. The numbers are low for both relationships ($6.54 \times 10^{-4}$ and $4.92 \times 10^{-6}$, respectively) but that’s not what we’re interested in. We’re interested in the ratio between the two. That ratio is 133. Put another way, it’s 133 times more likely for that segment to have come through the 2C1R relationship than the 4C1R one. (Turns out in this case it did come through the more distant relationship, but that’s a topic for another post.)
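The comparison between two relationship scenarios can be sketched in Python along these lines. Everything specific here is an assumption for illustration: the function names are made up, the segment is a hypothetical 50 cM, and I have counted 7 transmissions for a 2C1R and 11 for a 4C1R (the parent-child steps linking the two cousins through their common ancestors). Because the inputs are placeholders, the printed ratio is illustrative only.

```python
from math import exp


def segment_survival_probability(cm: float, transmissions: int) -> float:
    """(e^(-cm/100) x 0.5) raised to the power (transmissions - 1)."""
    if transmissions < 1:
        raise ValueError("transmissions must be at least 1")
    return (exp(-cm / 100) * 0.5) ** (transmissions - 1)


def scenario_ratio(cm: float, closer: int, farther: int) -> float:
    """How many times more likely the closer relationship is than the
    farther one to have delivered the same specific segment."""
    return (segment_survival_probability(cm, closer)
            / segment_survival_probability(cm, farther))


# Hypothetical inputs; the transmission counts are my assumption.
SEGMENT_CM = 50.0
TRANSMISSIONS_2C1R = 7
TRANSMISSIONS_4C1R = 11

p_close = segment_survival_probability(SEGMENT_CM, TRANSMISSIONS_2C1R)
p_far = segment_survival_probability(SEGMENT_CM, TRANSMISSIONS_4C1R)
ratio = scenario_ratio(SEGMENT_CM, TRANSMISSIONS_2C1R, TRANSMISSIONS_4C1R)

print(f"2C1R probability: {p_close:.3e}")
print(f"4C1R probability: {p_far:.3e}")
print(f"Ratio (2C1R / 4C1R): {ratio:.1f}")
```

One thing the code makes plain: for a given segment size, the ratio depends only on the difference in the number of transmissions between the two scenarios, not on the absolute counts.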
The other way we can use these numbers is with triangulation. Once two people match one another on a segment, any additional cousins forming a triangulation group have to match in that specific spot. We can use calculated odds for a triangulation group to gauge whether our proposed MRCA is reasonable or whether we might be dealing with a different MRCA or even a pile-up region.
Ultimately, we can even extend the math to include the different recombination rates between men and women. But first, I think the community must become comfortable with simpler, pairwise comparisons and reach a general consensus on when, statistically, to expect an extra accounting for unlikely results. Is it at a 1 in 20 chance (P = 0.05)? One in 100 (P = 0.01)? One in a thousand (P = 0.001)? Where would you draw the line?