At the end of Part 2, I presented some data comparing different endogamous populations.
Without going into detail (watch the talks!), the “hotter” the colors, the more endogamous the population is, and the more caution you need to exercise when working with your autosomal DNA matches.
I also made an appeal for volunteers to contribute their own match data anonymously to an ongoing study of endogamy. The goals are twofold: to gauge how much endogamy is in different populations, and to develop best practices based on that information.
If two or more of your grandparents were from the same endogamous population, you can help! The study can use match data from either MyHeritage or AncestryDNA. The other companies don’t provide the information in a format that’s easy to use. However, if you’ve tested elsewhere, you can upload your raw data from your testing company to MyHeritage and still participate in the endogamy study. Instructions are here; scroll down to the section on MyHeritage.
These instructions describe how to get the two or three columns of information needed for the study. You can then email the file to me at theDNAgeek (at) gmail (dot) com. Please also tell me the name or location of the population and the number of grandparents who belonged (e.g., ‘all four grandparents were Ashkenazim’ or ‘two of four grandparents were Samoan’).
If you manage kits for relatives who are also willing to contribute, you can send their match data, too. Reassure them that I don’t need any identifying information about them or their matches.
MyHeritage
The data is easiest to obtain from MyHeritage. At the top right of your match list, you’ll see three vertical dots. Click them to get a pop-up with some options, and select “Export entire DNA Matches list”.

MyHeritage will ask you to confirm (click “OK”), then they’ll email you a csv file. It may take a few hours for that file to come through, so don’t worry if you don’t see it right away. The file will be called “Firstname Lastname DNA Matches list” with some additional code for the date and kit identifier.
When you receive the email, open the attached file in a spreadsheet program, like Excel or Google Sheets. Delete every column except “Total cM shared”, “Number of shared segments”, and “Largest segment (cM)” (highlighted below).

Save the edited file (with just the three columns) in csv format with a descriptive name for the population and the number of grandparents from it, e.g., Iceland4.csv or Garifuna2.csv. Note that the file will no longer contain any identifying information about your matches.
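If you would rather not delete columns by hand, the same trimming can be scripted. The following is only a minimal sketch, not part of the study's official instructions: it assumes the export uses exactly the three column headers named above, that the file is UTF-8 text (possibly with a byte-order mark), and the function name keep_columns and the file names in the usage note are made up for illustration.

```python
import csv

# Column headers as named in this post; adjust them if your export differs.
MYHERITAGE_COLUMNS = [
    "Total cM shared",
    "Number of shared segments",
    "Largest segment (cM)",
]


def keep_columns(src_path, dst_path, columns):
    """Copy only the named columns from src_path into a new CSV at dst_path."""
    # "utf-8-sig" tolerates a byte-order mark at the start of the file.
    # (That the export is UTF-8 is an assumption, not something the post states.)
    with open(src_path, newline="", encoding="utf-8-sig") as src:
        reader = csv.DictReader(src)
        missing = [c for c in columns if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"Columns not found in {src_path}: {missing}")
        with open(dst_path, "w", newline="", encoding="utf-8") as dst:
            # extrasaction="ignore" drops every column not listed in `columns`,
            # which is what removes names and other identifying fields.
            writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
            writer.writeheader()
            for row in reader:
                writer.writerow(row)
```

For example, keep_columns("my_export.csv", "Iceland4.csv", MYHERITAGE_COLUMNS) would write a three-column file, where my_export.csv stands in for whatever name your emailed export has.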
Email the file(s) to me at theDNAgeek (at) gmail (dot) com. I’ll let you know how your kit(s) rank compared to other endogamous populations.
AncestryDNA
The data isn’t quite as straightforward to get from AncestryDNA. You have to use a third-party tool called the DNAGedcom Client, which is a standalone program you install on your computer. It requires a nominal subscription of $5/month. It’s worth trying it for a month to see if you find it useful. (It does a lot more than I describe below.)
Alternatively, if you aren’t already in the MyHeritage database, you might consider uploading your data there, especially as they’re offering a free “unlock” of their tools this week. Instructions are here.
However, if you’re at AncestryDNA and have (or can get) the DNAGedcom Client, here’s what to do.
Open the DNAGedcom Client and log into your account there.

Next, click “Gather” in the top menu bar, then click the AncestryDNA button.

On the next screen, log into your AncestryDNA account using your credentials. This information will not be stored anywhere else.

Once you’re logged in, select the kit you’d like to scan, set the minimum cM value to 20, and make sure none of the tick boxes are checked.

Then click the green “Gather DNA Data” button. The scan may take a while, depending on how many matches you have. (Rescanning a kit is a lot faster. I promise!)
When it’s done, you’ll see a note below the login panel that says “Creating Ancestry Reports Completed.” It will have saved a file to your computer called “m_Firstname_Lastname.csv”.
Open that file in a spreadsheet program, like Excel or Google Sheets. Delete every column except “sharedCM” and “sharedSegments” (highlighted below).

Save the edited file (with just the two columns) in csv format with a descriptive name for the population and the number of grandparents from it, e.g., Oaxaca4.csv or Afrikaans3.csv. Note that the file will no longer contain any identifying information about your matches.
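Before emailing, it may be worth double-checking that the saved file really holds nothing but the two numeric columns. The sketch below is one hypothetical way to do that; the function name check_anonymized is invented for this example, and it assumes the headers are exactly “sharedCM” and “sharedSegments” as named above.

```python
import csv

# Column headers as named in this post for the AncestryDNA file.
ALLOWED = {"sharedCM", "sharedSegments"}


def check_anonymized(path, allowed=ALLOWED):
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return ["file is empty"]
        # Any header outside the allowed set might carry identifying data.
        extra = [h for h in header if h not in allowed]
        if extra:
            problems.append(f"unexpected columns: {extra}")
        # Every remaining cell should be a plain number (cM or segment count).
        for line_no, row in enumerate(reader, start=2):
            for value in row:
                try:
                    float(value)
                except ValueError:
                    problems.append(f"line {line_no}: non-numeric value {value!r}")
    return problems
```

Calling check_anonymized("Oaxaca4.csv") and getting back an empty list suggests the file contains only the two expected columns of numbers.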
Email the file(s) to me at theDNAgeek (at) gmail (dot) com. I’ll let you know how your kit(s) rank compared to other endogamous populations.
And thank you!

Updates to This Post
14 October 2021 – corrected the subscription price for DNAGedcom
Could you tell me what happens to the DNA data after the analysis is done? Is the data deleted, or do we risk it being shared with unknown actors?
The study doesn’t collect any DNA data or even the names of matches. The only information needed is the total amount of DNA (in centimorgans), the number of segments, and the amount of the longest segment for each match.
I would be happy to send you many DNA tests from Ancestry that I manage, since all are family, for you to upload. You can reach me directly at Peggy Sue Druck
I have one grandparent who is Ashkenazi and another who is 3/4 Ashkenazi and 1/4 Sephardi. Will that satisfy the criteria for the study?
I’d love to see that data! Thank you for offering.
I would consider contributing to your study. I am 1/4 French Canadian (and the tree goes waaaay back), 1/2 Ashkenazi, and maybe 1/8 Nova Scotia.
It might be hard to distinguish how much each group is affecting your matches. I’d be happy to take a look for you, though.
My family comes from western Finland. When I did a cluster in December of 2019 with my MyHeritage DNA, 70 of 100 results were one big cluster, which is actually better than my initial cluster in March of that year of 93 of 100. Is that a traditionally endogamous area? Would you like any of that info? Your talk certainly gave me renewed hope for being able to make any sense of this!
I would love to see the data! I don’t have a good sense of how endogamous Finland is, so any new data would be helpful.
I come from an endogamous population from an island off the coast of current-day Croatia. This is on my mother’s side, and I am on 23andMe. So is my mother’s brother (my uncle). I don’t want to upload DNA data to another site, nor to read their ToS, but I’m handy with spreadsheets and math. From the explanation you give in a response above, it sounds like a list I can generate from the relatives csv file I can download from 23andMe. I can even anonymize people’s names myself.
The degree of endogamy I have found in my ancestors back to the 1700s: 2 ancestor pairs who appear four times, one ancestor pair that appears twice, and there are also repeat ancestors from those ancestor descendants. I was also able to, through a lot of work, find three common ancestor pairs with a DNA cousin. I was heartened to learn from your talk that in five years there may be methods to make DNA relations clearer in endogamous populations.
Whether you’re interested in my population or not, I salute your work and found your talks the most interesting and relevant to my efforts in RootsTech (the tiny subset I watched, of course).
If you’re willing to extract the data from your 23andMe matches, that would be great! I’m collecting total cM, total number of segments, and the cM of the longest segment (all excluding the X chromosome). Ideally, it would be in a spreadsheet with one match per row. No need to include match names.
I’m the descendant of Colonial Americans. I will try to manage this, but the number of marriage combinations in my family fried my brain. My father is a descendant of John Price and his first wife, and my mother is a descendant of the same man with his second wife. Brothers married sisters, and then their children married each other. Cousin married her uncle and her nephew. I was a flower girl in the wedding of my second cousins. I have third cousins who share triple the DNA of other same-generation cousins. I’ve stopped using “3rd cousin” and use cM instead. I will try this, but I’m not sure it will give the big picture.
It’s definitely complicated!
I have three different endogamous groups in my ancestry. Both of my maternal grandparents had about 90% of their ancestors from the northern part of Essex County in Massachusetts, and the rest came from towns in eastern Massachusetts. Most of my grandmother’s ancestors came from Amesbury, MA. There were 17 original settlers there; 8 are her ancestors, two of them 5 times over, and all of their ancestors were here before 1700, most by 1650. My paternal grandfather was from a small town in Nova Scotia, Canada, and my paternal grandmother was from a small town in Newfoundland, Canada.
I sent your article to my 4th cousin in Newfoundland (we connected on Gedmatch) and she would be perfect for your study. All 4 of her grandparents are from the same small area. She said that she would be interested. If you are interested in her send me an e-mail and I’ll send you her e-mail.
Thank you! Please have her contact me here: https://thednageek.com/contact/
I just re-read your comments about volunteers. I tested at 23andMe but transferred my data to MyHeritage and GEDmatch. I know that my 4th cousin tested at FTDNA, but she has transferred her data to GEDmatch and MyHeritage. I have a 2nd cousin in Nova Scotia who has 2 1/2 grandparents from that small area. Her data is on 23andMe and GEDmatch. Would hers be any good?
The data from MyHeritage is perfect!
Not sure if I can be of help. One set of 2nd great grandparents were second cousins (Acadians) who likely had church approval to marry. I once heard Blaine Bettinger comment about Acadians as an example of endogamous population. Let me know if this meets your criterion.
Yes, Acadians and their Cajun descendants (I’m Cajun) were endogamous. It’s quite common! Thank you for the offer to help. Right now, we’re focusing on people who have tested and are double second cousins to one another. We’d love your help later when we expand our testing to other scenarios.
Just listened to your two RootsTech talks on endogamy today with great interest. I would love to participate in your study and have at least two and possibly a third set of data to offer. First of all, mine may not qualify in that I have only one grandparent who is Cajun (hi cuz). His wife, Colonial American. My father-in-law, 3 grandparents Spanish Colonial New Mexico. Fourth grandparent a different endogamous population, namely Ireland. Finally my wife, 4 grandparents, 8 great-grandparents Spanish Colonial New Mexico. Do you want all 3 or just the two?
Could you also include links to the other endogamy tools you mentioned in your presentation?
Thanks
That would be great! Right now the study is focusing on people who have at least two grandparents from the same endogamous population, so your wife and FIL would be perfect!
I’ll work on it in the next couple of days.
Would be happy to share My Heritage data — I know I have endogamy, on both maternal and paternal sides, although different populations. What I don’t know is how to separate out the data into populations. The file I got from My Heritage has 13781 lines of data, and I know at least a few of the largest are paternal and many of the rest likely to be maternal. Would the data help? I can provide more detail if you like, or I can just go ahead and send with my best guess about likely groups! Thanks.
I appreciate the offer. For now, the study is focusing on individuals from just one endogamous population. I’ll keep you in mind for future work, though!
Hi, are you still accepting contributions to this? I’m waiting on test results at the moment, but I can send stuff your way later if you’re interested. I’m thoroughly Appalachian on both sides of my family and I have two 5th great-grandfathers that are brothers.
Yes please!
Hi, Leah,
Thoroughly enjoyed your Banyan talks at RootsTech. I manage several kits at MyHeritage for people with 4 Ashkenazic grandparents. Would you like me to download each one?
That would be wonderful! Note that if you process the data as described in the blog, it will be completely anonymous.
Are you still collecting data for your endogamous study?
Yes, feel free to contact me if you’d like to contribute. https://thednageek.com/contact/
When you calculate the average segment size do you use the pre-Timber or post-Timber amount?
I use post-Timber because it’s easier to access. If there’s a big difference, you almost definitely have endogamy.
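For readers following along, the average segment size for a match is taken here to mean the total shared cM divided by the number of shared segments. That definition is an assumption inferred from the columns the study collects, and the function name below is made up for illustration; this is a sketch, not the study’s actual code.

```python
def average_segment_cm(total_cm, segments):
    """Average segment size for one match: total shared cM / number of segments."""
    if segments <= 0:
        raise ValueError("number of segments must be positive")
    return total_cm / segments


# A match sharing 120 cM across 8 segments averages 15 cM per segment.
print(average_segment_cm(120.0, 8))  # 15.0
```

Running the same calculation on pre-Timber and post-Timber totals for the same match is one way to see how large the difference mentioned above is.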
The problem with this methodology is that it creates a double counting distortion, in the best case scenario. Furthermore, that kind of advice could lead to people reaching the wrong conclusions about the actual rate of endogamy among their DNA matches.
The endogamy among some populations, e.g. Roma, is so heavy that it makes AJ endogamy look like a cakewalk by comparison. Yet the former are barely grazed by Ancestry’s TIMBER. Meanwhile, those of British descent get pummeled and hammered by TIMBER at a rate disproportionate to their actual level of endogamy, which is broadly agreed to be at the mild end of the spectrum.
Ditto for Ashkenazi Jews, believe it or not, at least in the disproportionate sense, and those who are partially AJ get it even worse. It would help to explain why there is such a sharp drop between 2C and 3C: the average segment size among Ashkenazim in the Close Family to 2nd Cousin cM range is similar to populations in the Mild Endogamy category, while the average segment size for the 3rd Cousin to Distant Cousin cM range is similar to populations in the Extensive Endogamy category. The fact that Ashkenazi Jews are awkwardly straddling these two disparate categories should have been a clue that something is off. Although part of it has to do with the historical tempo of AJ endogamy, that’s not the whole story. It’s also an artifact of the ham-fisted way that Ancestry decides if you are TIMBER-worthy, which causes bizarre discontinuities.
Anyone who can guess why these two groups get the brunt of TIMBER deserves a free year of Pro Tools, or a decade. Hint: Ancestry’s TIMBER algorithm has a major base rate problem.
The fact that Ancestry limits TIMBER to matches below 90 cM could be a hint that even they don’t have much faith in its validity, and are actually more concerned with keeping our match lists to a “manageable” size, a 23andMe-style truncation by a different name. They could have done it in a less micromanaging way, such as by providing us with the ability to order our matches by longest segment or average segment size, or by introducing a chromosome browser or a triangulation tool.
Plus, we know that the 90 cM limit is not some magical safe harbor that makes it immune to the effects of pedigree collapse or endogamy, with the zone of endogamy extending well above it for some populations. And since that’s where the closer matches are, that’s also where a valid TIMBER, as opposed to the current overwrought, sloppy, in-your-face one, would be most helpful. It’s small comfort knowing that you have to wade through hundreds of endogamous IBS matches mixed in with the valid ones to reach the still potentially mis-TIMBERed ones below 90 cM.
The takeaway is that everyone gets shafted by TIMBER, albeit from different ends, with some systemically over-TIMBERed and some systemically under-TIMBERed. Ancestry’s TIMBER algorithm deserves to be viewed in the same light as MyHeritage’s overeager imputation algorithm which so generously gifts us those beloved phantom Frankensegments, and not simply taken at face value.
Ancestry’s TIMBER is also causing knock-on problems for their AutoCluster tool. The other implication is that the data in the Shared cM Project is corrupted in the sub-90 cM range, and isn’t comparable across different populations.
Yes, this issue seriously needs to be brought to the attention of TPTB at Ancestry, and not merely hand-waved away, as it’s already caused much misunderstanding and confusion.
There’s more that I can say on the matter, but I’m trying to keep my comment “manageable” here. I know it would have no hope of making it through at, say, GGTT.
I agree that TIMBER is imperfect, especially for groups that are underrepresented in the database, because they probably don’t have enough of a reference sample to assess the “pileups.” I can’t think of a mechanism that would cause overcorrection for groups like Brits and Ashkenazim. Can you?
(Apologies for the delay in posting. I was at RootsTech.)