The Future of Relationship Predictions

Autosomal DNA is a fabulous tool for genetic genealogy.  We can use it to predict how we might be related to our DNA cousins, then follow up with documentary research to corroborate and expand our trees.

Currently, we base these predictions on the total amount of shared DNA.  A simple mnemonic is that, on average, the amount of shared DNA halves with each generational step.  The more shared DNA, the closer the relationship.  You can see this effect in the “expected cM” column in the table below.  The “observed” and “range” columns show real data from the Shared cM Project by Blaine Bettinger.
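
To make the halving rule concrete, here's a quick sketch in Python.  The ~3,500 cM starting value and the step counts are rough approximations for illustration only; real matches vary widely around these expectations (see the Shared cM Project ranges).

```python
# A rough sketch of the halving mnemonic: start from an approximate
# parent-child total of ~3,500 cM and halve it once for each additional
# generational step.  These are ballpark expectations only; observed
# matches vary widely (see the Shared cM Project ranges).

PARENT_CHILD_CM = 3500  # approximate, not any one company's figure

steps_from_parent_child = {
    "parent / child": 0,
    "grandparent, aunt/uncle, half-sibling": 1,
    "great-grandparent, first cousin": 2,
    "first cousin once removed": 3,
    "second cousin": 4,
}

for relationship, steps in steps_from_parent_child.items():
    expected = PARENT_CHILD_CM / 2 ** steps
    print(f"{relationship:40s} ~{expected:,.0f} cM expected")
```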

Relationship prediction is complicated by several factors:  (1) there are genetically equivalent relationships (for example, an aunt averages the same amount of shared DNA as a half-sibling), (2) each relationship has a range of expected DNA rather than a precise value (see the table above), (3) the ranges for different relationships overlap, and (4) we tend to have more cousins at each degree of cousinship (more 2nd cousins than 1st, more 3rd than 2nd, and so on).

“You can’t get there from here.”  You can’t predict a relationship knowing only how much DNA is expected for different cousinships.  You also need to know how many cousins you have at each level.

To predict a relationship from a centimorgan amount, you need accurate information for the averages, ranges, and distributions of how much DNA is typical for each cousin relationship.  Then you need to account for how many cousins you have at each level, as I explained in an earlier blog post.
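
Here's a minimal, hypothetical sketch of what that second step looks like in practice.  The likelihoods and cousin counts below are made-up placeholders, not real values; the point is only to show how the two ingredients combine.

```python
# Hypothetical sketch of relationship prediction as a weighted (Bayesian)
# calculation.  The likelihoods and expected cousin counts are PLACEHOLDERS
# for illustration; a real tool would use accurate distributions and your
# own family structure.

# P(observed cM | relationship): how probable a hypothetical 250 cM match
# would be under each candidate relationship (placeholder values).
likelihood = {
    "second cousin": 0.40,
    "second cousin once removed": 0.35,
    "third cousin": 0.10,
}

# Expected number of testable cousins at each level (placeholders).  You
# typically have more distant cousins than close ones.
expected_cousins = {
    "second cousin": 6,
    "second cousin once removed": 15,
    "third cousin": 30,
}

# Posterior weight is proportional to likelihood times cousin count.
weights = {r: likelihood[r] * expected_cousins[r] for r in likelihood}
total = sum(weights.values())
for relationship, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{relationship:30s} {w / total:6.1%}")
```

With these placeholder numbers, the ranking changes once cousin counts are included: second cousin had the highest likelihood on its own, but second cousin once removed comes out on top overall.  That is exactly why the second step matters.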

If the first step in the process is flawed, the second will necessarily be flawed, too.

Figure 5.2 from AncestryDNA’s Matching White Paper, showing relationship probabilities for a given amount of shared DNA

For now, the best resource we have for relationship prediction comes from Figure 5.2 in AncestryDNA’s Matching White Paper.  Their data forms the basis of both the Shared cM Tool and the What Are the Odds? tool at DNA Painter.

Is the AncestryDNA data perfect?  I doubt it; no dataset is.  They’ve never publicly explained how they got the data in Figure 5.2 or the updated values available on their website, so we can’t evaluate their methods except through the thousands of successes genealogists have had using the tools mentioned above.

If we can’t evaluate what they did, we can’t improve upon it directly.  Where else can we get data to forge the next generation of predictive tools?  Where does the future lie?

The Shared cM Project

The Shared cM Project (SCP) is an invaluable asset to genetic genealogy.  It is real data, contributed by thousands of volunteers, from the very databases we use for genetic genealogy.

The SCP has two main weaknesses, though.  First, all of the information is self-reported.  Some volunteers may have submitted incorrect data, either because they didn’t realize Cousin Ed is really a half cousin or because of typos.  Second, the volume of data is limited.  While the SCP represents a remarkable ≈60,000 individual data points, those points are spread across 48 different relationships, for an average of only about 1,250 comparisons per relationship type.

An alternative approach is to generate artificial DNA data using computer algorithms.  The benefits are that you always know the exact relationship (if you tell the computer that Cousin Ed is a full cousin, he’s a full cousin) and that you can collect as much data as your computer can handle.  The drawback—and I can’t emphasize this enough—is that if the mathematical model you use for the computer simulations is not right for the task, then all of the downstream predictions will be wrong.

Let me say that louder for those in the back:  If you use the wrong model, your results will be wrong.  Period.  Over-reliance on simulated data just because it came from a computer is bad science.  It always always ALWAYS needs to be validated against theoretical expectations and against real data.

The good news is that we don’t have to choose.  We benefit most from using both an empirical approach, like the SCP, and an in silico strategy, like computer modeling.  In fact, they are complementary.  We can use SCP data to evaluate whether our computer models are correct, and we can use simulated data to flag likely errors in the self-reported values.
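
As a toy illustration of that cross-checking idea, here is a deliberately simplified Monte Carlo sketch.  It uses a Poisson crossover model with no interference and no sex-specific maps (exactly the kind of shortcut this post warns about) and rough, made-up chromosome lengths, so treat it as a demonstration of the workflow, not a usable predictor.  The idea is to simulate grandparent sharing and then compare the simulated average and spread against the grandparent entry in the SCP chart.

```python
import random

# Rough, illustrative chromosome lengths in cM (sum ~3,580); these are NOT
# any company's map and NOT the Bhérer et al. map.
CHROM_CM = [281, 263, 224, 214, 209, 194, 187, 169, 167, 174, 161, 176,
            131, 125, 132, 133, 137, 129, 111, 114, 70, 79]

def simulate_grandparent_share(chrom_lengths_cm):
    """Total cM a grandchild shares with one specific grandparent under a
    Poisson (no-interference), sex-averaged crossover model."""
    shared = 0.0
    for length in chrom_lengths_cm:
        # Crossover points form a Poisson process: one crossover per 100 cM.
        points, pos = [], 0.0
        while True:
            pos += random.expovariate(1 / 100)
            if pos >= length:
                break
            points.append(pos)
        boundaries = [0.0] + points + [length]
        # The transmitted chromosome alternates between the two grandparental
        # copies at each crossover; which copy it starts with is random.
        from_this_grandparent = random.random() < 0.5
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            if from_this_grandparent:
                shared += end - start
            from_this_grandparent = not from_this_grandparent
    return shared

sims = [simulate_grandparent_share(CHROM_CM) for _ in range(10_000)]
print(f"simulated grandparent-grandchild average: {sum(sims) / len(sims):,.0f} cM")
# Compare the average and spread against the grandparent entry in the SCP
# chart; a big mismatch means the model (or the empirical data) needs scrutiny.
```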

The problem with simulations is that the average genealogist doesn’t have easy access to that kind of computational firepower.  Thus far, we’ve been reliant on data from AncestryDNA for our predictive tools, but it would be fabulous to be able to play around with these analyses ourselves.

Enter Ped-sim

There simply aren’t many DNA simulation programs available to the public, and the few that exist are not user friendly.

A promising one is called Ped-sim.  It is an open-source program from the laboratory of Professor Amy Williams at Cornell University.  Ped-sim can simulate shared DNA data for any genealogical relationship you tell it to.  And it’s fast!

Even better, it incorporates two features of biology that affect how DNA segments are passed on:  crossover interference and sex-specific recombination rates.  I won’t go into the gory biological details (unless you ask politely); suffice it to say that for advanced genetic genealogy applications, we want to account for both factors.  And Ped-sim is the only publicly available program I know of that will do both.

Dr. Williams has posted a “lite” version of Ped-sim on her website HAPI-DNA (pronounced “happy DNA”).  It’s a neat introduction but shows only a fraction of what Ped-sim can do.

Unfortunately, we can’t use Ped-sim to create a predictive tool for genealogy.  At least not in its current form.  That’s because Ped-sim uses a different “map” of the genome than our testing companies do.  The map in Ped-sim was published by Bhérer et al. (2017) and has fewer centimorgans than the maps the genealogy companies use.

For example, a parent and child at AncestryDNA share about 3,470 cM, and all of the other major databases report more than 3,500 cM for that relationship.  When the sizes of the databases are factored in, the overall weighted average is almost exactly 3,500 cM.
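
For anyone curious how such a weighted average works, here's a tiny sketch.  The database sizes and the non-Ancestry parent–child total below are placeholder assumptions chosen only to illustrate the arithmetic; they are not quoted figures.

```python
# Weighted average of parent-child totals, weighted by database size.
# Sizes (in millions of testers) and the 3,560 cM non-Ancestry total are
# PLACEHOLDERS to illustrate the calculation, not quoted figures.
parent_child = {
    "AncestryDNA":     (20.0, 3470),
    "other company A": (12.0, 3560),
    "other company B": (5.0, 3560),
    "other company C": (1.5, 3560),
}

total_weight = sum(size for size, _ in parent_child.values())
weighted_avg = sum(size * cm for size, cm in parent_child.values()) / total_weight
print(f"weighted average parent-child total: {weighted_avg:,.0f} cM")  # about 3,500
```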

The sex-averaged map of Bhérer et al., by contrast, is only 3,346 cM.  I’ve confirmed that total by tallying up the chromosome sizes from their scientific paper and by running parent–child simulations in Ped-sim.

If the total genome size is different from the ones we use for genetic genealogy, the averages for each relationship will be different, too.  Brit Nicholson, who has been trying to develop new relationship predictors, considers correct averages to be Rule #1 for a genetic genealogy dataset to be accurate.  As he so aptly says, “This is the easy one.”

By that standard, the Bhérer averages are not “correct”.  More precisely, they are correct for many applications, just not for genetic genealogy.  Thus, a predictive tool based on data from Ped-sim will not be accurate for our purposes.

Such a tool would be “off” in its predictions for all of the DNA matching companies.  The problem would be increasingly severe at sites that use larger totals for parent–child matches, like FamilyTreeDNA and GEDmatch.  You can see the effect at different cousin levels in the table below.  For each relationship, look at how different the company averages are from the Bhérer et al. values (3.7%–6.4%).
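
Here's the back-of-the-envelope arithmetic behind those percentages.  The 3,470 cM and 3,346 cM figures come from the text above; the 3,560 cM value stands in for a company at the high end and is an assumption for illustration.

```python
# Relative difference between company parent-child totals and the
# Bhérer et al. sex-averaged map total.
BHERER_TOTAL_CM = 3346

company_parent_child = {
    "AncestryDNA (from the text above)": 3470,
    "high-end company (illustrative)":   3560,  # assumption, not a quoted figure
}

for name, total in company_parent_child.items():
    diff = (total - BHERER_TOTAL_CM) / BHERER_TOTAL_CM
    print(f"{name:35s} {diff:.1%} larger than the Bhérer map")
# -> roughly 3.7% and 6.4%, matching the range quoted above.
```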

Let’s be very clear:  Ped-sim is not “wrong”.  It’s an excellent piece of software that’s been validated against empirical data and published in a peer-reviewed scientific journal.  In fact, the paper (Caballero et al., 2019) clearly shows why crossover interference and sex-specific recombination rates are important.  (Previous posts on this blog, Julie’s Story and Gordon, considered both factors.)

Ped-sim is just not the right tool for our particular needs.  It’s like using a flathead screwdriver on a Phillips screw—close but not quite—and trying to force it could have unintended consequences.

I suspect future iterations of Ped-sim will make adaptations for genetic genealogy.  When that happens, or when other biologically accurate simulation tools become available to the public, the next frontier in relationship prediction will be upon us!

References:

Bettinger B (2020) The Shared cM Project Version 4.0 (March 2020).  https://thegeneticgenealogist.com/wp-content/uploads/2020/03/Shared-cM-Project-Version-4.pdf

Bhérer C, Campbell CL, Auton A (2017) Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales. Nature Communications 8:14994. https://doi.org/10.1038/ncomms14994

Caballero M, Seidman DN, Qiao Y, Sannerud J, Dyer TD, Lehman DM, et al. (2019) Crossover interference and sex-specific genetic maps shape identical by descent sharing in close relatives. PLoS Genetics 15(12): e1007979. https://doi.org/10.1371/journal.pgen.1007979

Nicholson B (2021) What is an accurate dataset in genetic genealogy?  https://dna-sci.com/2021/08/19/what-is-an-accurate-dataset-in-genetic-genealogy/

8 thoughts on “The Future of Relationship Predictions”

  1. Thank you for this. Although I cannot pretend to understand everything about all the contributing factors, including crossover interference, it helps to see a layout of the differences in various testing companies. The starting point affects the entire outcome, and reminds us to be cautious in predicting a relationship based on numbers alone.

  2. In that the ratios of the Bhérer, et al. numbers to the other “standards” are fairly consistent, does that mean using Ped-sim becomes more valuable simply by applying the ratio multiplier?

    1. For closer relationships (which should be normally distributed), a simple multiplier might work to shift the distributions to the averages we expect. We start to see skew (that is, distributions that are “tilted” one way) at about 2C, and it gets worse from there. For those, a simple multiplier won’t work as well. And for relationship prediction, those matches are where we need our models to be as accurate as possible.

  3. Agree with what Mike says above!
    Thanks again Leah for being “geek” enough to boil this down for us!!

  4. Good to know that.
    I am happy using the models currently available, such as the Bettinger chart and Jonny Perl’s tool. My statistical work has been using data with much less overlap, so I found the massive overlaps from different relationships a bit different at first. Then there are the segment lengths at different places. Some of my Ancestry cM match totals have been massively trimmed and are greatly different from elsewhere. Maybe the Ancestry figure is better? I don’t know. And then MyHeritage uses imputation, which will also tend to result in a slightly different segment length.
    But genealogists are used to “which Smith in the village is mine?” type problems and work with a variety of data sources to reduce uncertainty. It comes with the territory. We just have to remember that uncertainty lives here too.
    I have a wide variety of matches across a group of related people for the same relationship, so in my experience, biology is often the biggest source of variability, and sometimes those 1% chances do come up!
