People Power: Crowdsourcing DNA Stats

March 4, 2019 thednageek 17 Comments

AncestryDNA’s updated DNA Matches (now in beta testing; available here) has a nice little surprise for those of us who take a statistical approach to our DNA work.

Click on the shared DNA amounts in your match list.

Alternatively, you may have a little “i” in a grey circle beneath the predicted relationship. Click it.

You’ll get a pop-up window with some interesting information.

The example above is my mom’s first cousin, which makes her my first cousin once removed (1C1R). We share 293 cM, less than average for that relationship: only 32% of people who share 293 cM are 1C1Rs, while the majority (63%) are second cousins.

The most likely relationship isn’t always the relationship. An essential question in DNA-based predictions is ‘How much more likely is one relationship than another?’
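One way to put a number on that question is to divide one probability by the other. The snippet below is a minimal sketch using the two percentages from the 293 cM example in this post; the function name and the dictionary labels are my own illustrative choices, not anything Ancestry or DNA Painter provides.

```python
def odds_ratio(p_a: float, p_b: float) -> float:
    """Return how many times more likely relationship A is than relationship B."""
    if p_b <= 0:
        raise ValueError("p_b must be greater than zero")
    return p_a / p_b


# Percentages quoted in this post for a 293 cM match.
# The labels are illustrative; the remaining percentage is not given in the post.
probabilities = {
    "second cousin": 0.63,
    "first cousin once removed": 0.32,
}

ratio = odds_ratio(
    probabilities["second cousin"],
    probabilities["first cousin once removed"],
)
print(f"Second cousin is about {ratio:.2f} times as likely as 1C1R at 293 cM")
```

A ratio near 2 means the less likely relationship is still very much in play, which is exactly what happened with this match.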

Previously, the best data we had for that came from simulations shown in Figure 5.2 of the AncestryDNA Matching White Paper, and it took some doing to get the numbers into a usable form. The probabilities from Figure 5.2 are integral to the Shared cM Project Tool and What Are the Odds? at DNApainter.com.

The numbers in the new DNA Matches feature at AncestryDNA are different from the previous data, and they also go down to lower centimorgan levels. I suspect that Ancestry’s scientists have redone their simulations, now that their database is larger, and improved the likelihoods in the process. This is good news! Better probabilities will improve the predictions in tools like What Are the Odds?

Would you like to help? Compiling this new data from match lists is perfectly suited to crowdsourcing. The values for centimorgan amounts below 280 cM have already been compiled. You can contribute percentages for matches above 280 cM by completing this survey.

If warranted, the new probabilities will be used to improve the relationship prediction tools at DNApainter.com.

Thank you!


17 thoughts on “People Power: Crowdsourcing DNA Stats”

Wallace Fullerton says:

March 4, 2019 at 1:11 pm

But if Ancestry calculated the probabilities based on the family tree analysis they have used for ThruLines, wouldn’t the results be skewed by all the questionable data in those trees?

1. thednageek says:
  
  March 4, 2019 at 2:23 pm
  
  The probabilities almost certainly came from computer simulations, not from ThruLines. That’s where they got the last set of statistics. We’ll have to wait for the White Paper to know precisely what they did.
  
Dana Leeds says:

March 4, 2019 at 1:54 pm

I’m eager to see the results – thanks for doing this!

1. Elle B. says:
  
  March 4, 2019 at 2:38 pm
  
  Thank you so much for informing us.
  So, currently at this point, which do you think is likely to be more accurate — the prediction tool at DNApainter (The Shared cM Project 3.0 tool v4, to which I often refer) or this new addition at AncestryDNA… or something else?
  Thanks!
  
  1. thednageek says:
    
    March 4, 2019 at 8:57 pm
    
    I wish I knew. We don’t know how Ancestry generated the new numbers, so it’s hard to say which is more accurate.
    
Christopher Schuetz says:

March 4, 2019 at 6:07 pm

“The probabilities from Figure 5.2 are integral to the Shared cM Project Tool and What Are the Odds? at DNApainter.com.”
For most people that implies that the Shared cM Project Tool figures were extracted from Ancestry modelling. I am sure that is not what you mean.
Surely the Shared cM Project Tool came from community contributions.
Why not support that existing project rather than duplicate effort?

1. thednageek says:
  
  March 4, 2019 at 9:01 pm
  
  The ranges in the Shared cM Project Tool came from Blaine Bettinger’s Shared cM Project. The probabilities came from Figure 5.2 in the AncestryDNA White Paper.
  
  1. Christopher Schuetz says:
    
    March 4, 2019 at 10:20 pm
    
    They are two different sets of data. Any statistician would usually take both ranges and probabilities from one set of data – or the other. Normally only politicians cherry pick like that.
    Or the model would be adjusted to better fit the collected data. That could handle some relationships that were not collected from real data.
    
    1. thednageek says:
      
      March 5, 2019 at 8:46 am
      
      Your charm notwithstanding, I’m afraid you don’t understand the datasets. They convey two different—but related—things. The Shared cM Project tells us the probability of a cM amount given the relationship. AncestryDNA’s figure tells us the probability of a relationship given the cM amount.
      
      As an aside, independent replication is a cornerstone of the scientific method, so even if these datasets did measure the same thing (which they don’t), they would both be valuable.
Pam Tabor says:

March 4, 2019 at 7:18 pm

I was attempting to submit data for 1747 cM after already submitting 1928 cM. I realized that there is no option for half siblings. (I misread it on 1928 and entered 100% for siblings when that should have been 0%. It was 100% for half siblings.) I think you need to add half siblings to your survey.

1. thednageek says:
  
  March 4, 2019 at 9:02 pm
  
  The survey uses the groups as defined by Ancestry. Half siblings are in the same group as grandparent/grandchild and aunt/uncle/niece/nephew.
  
Richard Martin says:

March 5, 2019 at 1:21 pm

Apparently Ancestry.com changed its beta “New & Improved DNA Matches” page after you created your survey form, which I tried to use on March 5. The only blue link I could find now says “Add to group.” The name of each match on that page does change from black to blue when hovered over, but clicking it calls another page, not a pop-up. You do get a pop-up if you click on the little [i] after the amount of shared DNA, but for my granddaughter (1,677 cM) that pop-up fails to show on its list of possible relationships most of the choices presented in your survey form, which will not accept “not listed” as a response.

1. thednageek says:
  
  March 5, 2019 at 2:50 pm
  
  I have edited the post to show both ways of accessing the percentages. I’m seeing both ways at times, but I can’t quite figure out why.
  
Pam T says:

March 5, 2019 at 2:13 pm

Yes, I understand now. I was just suggesting that it is easily misunderstood as it is currently phrased.

Valorie A Zimmerman says:

March 11, 2019 at 9:19 pm

My top matches in Ancestry are about 800 cM (first cousins). However I have my father as a match as well as my sister in MyHeritage, 23andme and Gedmatch, and my father on FamilyTreeDNA. Do you want such matches reported?

1. thednageek says:
  
  March 13, 2019 at 8:39 am
  
  Thanks so much for offering. We have almost all of the data we need for this survey, but the Shared cM Project is still accepting data: https://docs.google.com/forms/d/e/1FAIpQLSc5a0SIHIeiwLl5Wxn4sLqgnRV-su2klK2W_YzIJc9xq2i4zw/viewform
  
Pingback: Improving the Odds – The DNA Geek