Last month, scientists from AncestryDNA published a research paper in the journal Nature Communications that describes the science underlying the forthcoming Genetic Communities feature. The title is “Clustering of 770,000 genomes reveals post-colonial population structure of North America.” While we wait not-so-patiently for this new tool, I thought I’d devote some attention to the paper itself. Scientific papers are not written for a general audience. They are dense, even for scientists like myself. So, I suspect the genetic genealogy community will benefit from a scientist’s take expressed in layperson’s terms. The paper is open access (fancy-pants wording for “free”), so you may want to download it and have it handy as you read this post. I will be describing both how to read a scientific paper and what’s in this particular report.
The first thing I do when reading a paper is check out some of the accessory information that relates more to who wrote the paper and when than to the content itself. The title and author list look like this:
The first thing I notice is that the paper took nearly a year from submission (24 Feb 2016) to final publication (7 Feb 2017). That tells me that the peer reviewers who were asked to evaluate the initial version of the paper suggested that the authors do additional analyses and that those new analyses took several months. For an author, it’s always frustrating when you get reviews back from a journal and realize you still have a lot of work to do, but the changes almost always make for a better study.
This particular report has 19 coauthors at four different institutions. Most of the authors have the superscript 1 or 2 after their names, indicating that they work for AncestryDNA in either San Francisco, CA, or Lehi, UT. If I dig more deeply, I see that Shiya Song and Theodore Roman were interns at Ancestry at the time and that Erin Battat was a paid consultant. (You’ll see how I learned that in a minute.) Thus, the study was a broad collaborative effort (lots of people) but done in house at AncestryDNA.
Many journals now include a section called “Author contributions” (or similar) that tells you how each author participated.
This project had a lot of moving parts, including preliminary experiments; genotyping; identity-by-descent (IBD) detection; and admixture, statistical, and genealogical analyses in addition to coordinating the work, interpreting the results, creating figures, and the actual writing.
The next thing I do is check whether the authors had conflicts of interest. Conflicts don’t necessarily mean that the work itself is suspect, but they should always be disclosed. Readers can decide for themselves whether they think the conflicts of interest color the conclusions of the study.
None of these competing financial interests is unexpected: AncestryDNA employees might hold stock in the company, two authors were interns, and one author was a paid consultant. The patent application describes what we’re hearing about the Genetic Communities feature. No surprises there, either.
The first paragraph in a paper is usually the abstract, which may or may not be labeled as such. This one isn’t. The abstract is a brief summary of the entire paper, so it packs a lot of information into a few sentences: what the question or challenge was, what was done to address it, what was found, and what it all means. The abstract is often the only part of a paper that gets read. For the more intrepid, it helps to set the stage for what you’re about to read.
Rather than dissect this abstract, I’ll move on to the paper itself.
The introduction (which is also not labeled) describes the rapid population expansion in the post-colonization Americas and some of the historical and environmental factors affecting demography. Previous studies of genetic patterns in the Americas have used admixture-type data, which doesn’t evolve fast enough to shed much light on patterns that emerged over the last 500 years. Instead, this study used statistical methods to find clusters of people who had taken AncestryDNA’s genealogy test (and consented to participate in research). The authors then integrated information about historical origins from the public family trees of those users to better understand large-scale patterns in American demography.
(This paper goes straight to the Results. The Methods are placed at the end, which is a newer style adopted by some journals. I assume it’s a concession to the fact that most scientists don’t read the Methods, although they should. A well-written paper using the Methods-last organization will provide enough summary information about the methods in the Results section to put the findings in context. In this post, I may combine information from both sections.)
The study was based on nearly 775,000 consenting individuals who were genotyped with AncestryDNA’s v1 SNP chip. The data underwent the standard AncestryDNA matching process, including phasing and IBD detection, with the added requirement that two people share more than 12 cM total. (I did not see a reference to their Timber algorithm, which is meant to remove excess IBD, a.k.a. pile-ups, that does not reflect recent shared ancestry.) The workflow from there is summarized in their Figure 2, an example using data from the African American cluster. (Note: for all of the figures and tables I show here, you can find additional details in the original paper.)
Panel a shows a network of shared matches, where each person is represented by a circle, and each line indicates at least 12 cM of shared DNA. Panel b is the same network colored to show the extent of the three clusters. Explaining panel c is above my pay grade (“Spectral embedding … from eigen-decomposition of Laplacian matrix”?!? Yeah, no.), but this step refines the clusters to find stable subsets within them. Finally, more traditional genealogy comes into play: each cluster is annotated with (1) admixture results of the cluster members and (2) birth locations of their ancestors taken from family pedigrees. The map (panel e) shows how birth location is tied to cluster membership, with “hotter” colors (orange and red) more tightly linked to the African American cluster.
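For readers who like to tinker, here is a toy sketch in Python of the general idea behind panels a through c. To be clear, this is not the authors' code or pipeline. The six people, the cM values, and the function names `build_network` and `fiedler_vector` are all made up for illustration. The sketch draws a line between two people only if they share more than 12 cM, then uses one eigenvector of the network's Laplacian matrix to place each person on a number line, which is a bare-bones version of a spectral embedding. As a further simplification, I use a basic power iteration in place of a full eigen-decomposition.

```python
# Hypothetical toy data: (person_a, person_b, total shared cM). Not real results.
pairs = [
    ("A", "B", 40.0), ("A", "C", 35.0), ("B", "C", 50.0),  # one tight trio
    ("D", "E", 45.0), ("D", "F", 30.0), ("E", "F", 38.0),  # a second tight trio
    ("C", "D", 13.0),                                      # a single weak bridge
    ("A", "F", 8.0),                                       # too little DNA: no line
]
MIN_CM = 12.0  # sharing threshold quoted in the post


def build_network(pairs, min_cm=MIN_CM):
    """Return {person: set of matches}, keeping only pairs above the threshold."""
    network = {}
    for a, b, cm in pairs:
        network.setdefault(a, set())
        network.setdefault(b, set())
        if cm > min_cm:
            network[a].add(b)
            network[b].add(a)
    return network


def fiedler_vector(network, iterations=500):
    """Approximate the Laplacian eigenvector with the smallest nonzero eigenvalue.

    Assumes the network is connected. Uses power iteration on (shift*I - L),
    where L = degree matrix - adjacency matrix.
    """
    people = sorted(network)
    index = {p: i for i, p in enumerate(people)}
    n = len(people)
    # Twice the largest degree is an upper bound on the Laplacian's eigenvalues.
    shift = 2 * max(len(network[p]) for p in people)
    x = [float(i) for i in range(n)]  # arbitrary non-constant starting guess
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # strip the constant vector (eigenvalue 0)
        new = []
        for i, p in enumerate(people):
            # (L x) for person p: degree * x_p minus the sum over p's matches
            lx = len(network[p]) * x[i] - sum(x[index[q]] for q in network[p])
            new.append(shift * x[i] - lx)
        norm = sum(v * v for v in new) ** 0.5
        x = [v / norm for v in new]
    return dict(zip(people, x))


network = build_network(pairs)
embedding = fiedler_vector(network)

# People on the same side of zero end up in the same group.
group_1 = sorted(p for p, v in embedding.items() if v > 0)
group_2 = sorted(p for p, v in embedding.items() if v <= 0)
print(group_1, group_2)
```

In this toy network, the two trios land on opposite sides of zero even though the C and D link joins them into one connected web. That is the intuition behind using a spectral step to find stable subsets inside a larger cluster.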
The first round of clustering found five large clusters and one small one. The five large clusters were then subjected to another round of clustering to divide them into smaller groups and then divided again using spectral analysis (the Laplacian yadda yadda).
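The "cluster, then cluster again" structure can be sketched in a few lines of Python. This is only an illustration of the hierarchy, under loud assumptions: the numbers are toy data, the names `subdivide`, `leaves`, and `split_at_largest_gap` are hypothetical, and the splitting rule (cut a sorted list at its widest gap) is a deliberately simple stand-in for the real clustering and spectral steps described in the paper.

```python
def split_at_largest_gap(values):
    """Toy stand-in for a clustering step: cut a sorted list at its widest gap."""
    ordered = sorted(values)
    if len(ordered) < 2:
        return [ordered]
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [ordered[:cut], ordered[cut:]]


def subdivide(members, split, min_size, max_depth, depth=0):
    """Recursively split a cluster, leaving small clusters intact."""
    node = {"members": sorted(members), "children": []}
    if depth >= max_depth or len(members) < min_size:
        return node  # too deep or too small to divide further
    parts = [part for part in split(members) if part]
    if len(parts) > 1:
        node["children"] = [
            subdivide(part, split, min_size, max_depth, depth + 1)
            for part in parts
        ]
    return node


def leaves(node):
    """Collect the final, undivided clusters from the hierarchy."""
    if not node["children"]:
        return [node["members"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


toy = [1, 2, 3, 10, 11, 12, 50, 51]
tree = subdivide(toy, split_at_largest_gap, min_size=4, max_depth=2)
final_clusters = leaves(tree)
print(final_clusters)  # [[1, 2, 3], [10, 11, 12], [50, 51]]
```

Notice that the small group (50 and 51) is left alone after the first round, while the large group gets divided again, much as the one small cluster in the paper was not subjected to the second round.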
All of this statistical analysis resulted in 25 main clusters that fit into four broad categories based on similar demographic histories. Being in the same category doesn’t mean that two groups are necessarily related to one another, just that they’ve experienced similar patterns of genetic isolation and/or integration.
(1) Each of the intact immigrant clusters maintains a substantial amount of population structure from its source region, dating from before immigration to the Americas. A few examples include African Americans, Ashkenazi Jews, Scandinavians, and my personal favorite, Acadians. In contrast, (2) the continentally admixed clusters show ancestry from two or more continents. All seven such clusters listed in the table are associated with Spanish colonization and have 10% or more Iberian admixture. (3) A third category of clusters is assimilated immigrants, who lack strong connections to Old World populations and are not strongly separated from other clusters. Finally, (4) the four post-immigration isolated clusters reflect genetic isolation that happened after immigration to North America, due to either cultural or geographic constraints on intermarriage.
The most visually stunning part of this paper is their Figure 3, which maps each of the 25 clusters. For several — Pennsylvania, Lower Midwest and Appalachians, Upland South, Lower South, African Americans — you can see the steady overland migrations westward and, to a lesser extent, southward. For others — Acadians, Northeast and Utah, Caribbeans — the migration paths are disjoint, without much settlement along the way. Acadians, for example, arrived in Louisiana by ship, not over land. And some clusters — European Jewish, Irish, Appalachians, Mennonites — are still very localized.
The Discussion is where the authors explain what their results mean and what’s novel about them, address any weaknesses of their study, and speculate about how their findings might be useful in the future. In this case, they devote several pages to the four categories of clusters that they found (intact immigrant clusters, continentally admixed clusters, assimilated immigrant clusters, and post-immigration isolated clusters). This text is fairly accessible and interesting, so you should read it yourself. I don’t think I could do it justice in a summary.
The remainder of the discussion (starting on page 8) covers a variety of topics. First, the authors consider the ideal amount to subdivide the populations. Recall that they initially found five major groups using clustering, subdivided each of those groups again via clustering, then partitioned them further using spectral analysis. The more finely you divide a cluster, the more specificity you can uncover in the demographic history, but that comes at a loss of IBD signal. The optimal amount of subdivision is an area that will likely see improvements both as the database grows and as analytical methods are refined.
To validate the method and demographic conclusions, the authors analyzed samples from the 1000 Genomes Project, in which the populations were labeled by experts. Although the 1000 Genomes dataset was much smaller (1,816, which isn’t 1,000 … don’t get me started!), this analysis also recovered some of the same structure as found with the data from AncestryDNA’s database.
A novel product of this study is the ability to find genetic structure caused by recent population history, much of which could not be detected by other methods that primarily used global admixture patterns. Examples include Nuevomexicanos (European settlers in what is now the state of New Mexico) and unique groups of Mexican Americans whose IBD clusters reflect the different regions of Mexico from which they came.
A limitation of this work is that the algorithm requires large sample sizes to be able to detect accurate clusters. For example, some Jamaicans appeared in the Portuguese cluster, which makes no sense from an historical perspective and probably is an artifact of the small number of connections. Also, some known population structure in North America, such as among Southeast Asians and Chinese, isn’t represented. These issues should improve as more people from those populations test.
Another potential weakness is that the genealogical information came from Ancestry user trees. Some trees are supported by solid documentation, while others are not, either because documentation does not exist (e.g., Jewish records from parts of Europe) or because the tree was built by a novice genealogist. However, the authors conclude that the sheer scale of the genealogical “signal” swamps out the “noise” from incorrect or missing tree information.
The authors briefly compare their method (two types of partitioning of the genetic data) to similar approaches that don’t scale well to a dataset this large. They highlight how well their IBD approach reflects the demographic history of recent colonization of North America. Finally, they note that some of the genetic clusters they found have higher-than-average frequencies of certain alleles (versions of a gene) associated with disease risk. This finding has potentially huge implications for improving biomedical research studies.
Some additional thoughts
At several places in the manuscript, the authors suggest that the clustering method will improve as their database grows (e.g., as more people test). This brings us full circle to one of those accessory bits of information I pointed out at the beginning of this post: the paper was first submitted in February of 2016. At the time, AncestryDNA’s database had about 1.5 million people. By the time the paper was accepted for publication in December, the database had doubled in size to about 3 million people. This means that some of the problems attributed to small population samples may already be rectified and that some of the clusters reported here may be divided into more fine-scale groups by the time Genetic Communities are rolled out to customers.
Any. Day. Now.