What's Coming Down the Pike: AncestryDNA

The genetic genealogy testing companies were out in force at the i4GG conference this past weekend in San Diego! Representatives from Living DNA and MyHeritage gave hour-long talks on Saturday, and people from Family Tree DNA, AncestryDNA, and 23andMe spoke on Sunday.

They all gave polished and informative presentations, and I made a point to see them all so that I could report back to my readers. This is the fourth in a series of five (in the order of the original presentations) on what they had to say.

How DNA Testing Works: The Science Behind Your DNA Results

Ben Wilson, Population Geneticist at AncestryDNA

The “What’s Coming Down the Pike” title of this post is a bit misleading, because Dr Ben Wilson’s talk wasn’t about upcoming new features at AncestryDNA. His talk was enlightening, nonetheless, because he helped me to understand ethnicity estimates better. I’ve long been skeptical of them.

The one word that best summarizes his presentation is “haplotypes”. A haplotype is a contiguous stretch of DNA that is inherited together. It could be as large as an entire chromosome or as small as a few centimorgans. (I suppose, technically, a haplotype could be just a few units of DNA, called nucleotides, but we ignore strings that short for our purposes.) In genealogy, we often talk about mitochondrial haplotypes or Y-chromosome haplogroups (a set of similar haplotypes), because those two types of DNA are inherited as single units, with minor exceptions.

The haplotypes in autosomal DNA, on the other hand, become shorter over the generations through a process called “crossing over” or “recombination”. Large shared haplotypes, what we commonly call matching segments, are indicators of recent shared ancestry. Small shared haplotypes and individual SNPs (single nucleotide polymorphisms) reflect more distant connections and can be used for ethnicity estimates.

Each nucleotide and each haplotype exists at a certain frequency in a population. Most human haplotypes are present at 10%–20% frequency overall, but that frequency will vary among populations. Some populations may be enriched for certain haplotypes and lack others entirely. It’s these differences that form the basis of ethnicity estimates.

Why Populations Have Different Haplotype Frequencies

How do haplotype frequencies differ in the first place? There are three main factors at play. Let’s consider this hypothetical population with three different haplotypes: 40% red, 30% blue, and 30% green.

Imagine that a few individuals leave this population to settle elsewhere and pass on the haplotypes they bring with them to their children. Unless a very large number of individuals emigrate together, the haplotype frequencies in the new population will be different from those in the parent population, and some haplotypes may not make the journey at all. In this example, the resulting Population A is 67% blue and 33% red, although the parent population contained all three haplotypes at similar frequencies. In biology, this loss of genetic diversity is called a “founder effect”.
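For readers who like to tinker, the founder effect can be sketched in a few lines of Python. This is only an illustration of the example above, not anything shown in the talk. The head counts (100 people in the parent population, three founders) and the function names are assumptions chosen so that the percentages match.

```python
from collections import Counter
import random

def haplotype_frequencies(haplotypes):
    """Return each haplotype's share of the population as a fraction."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {name: count / total for name, count in counts.items()}

def draw_founders(population, n_founders, seed):
    """Pick a random founding group; the seed just makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(population, n_founders)

# Parent population from the example: 40% red, 30% blue, 30% green.
# The size of 100 is an assumption; the post gives only percentages.
parent = ["red"] * 40 + ["blue"] * 30 + ["green"] * 30

# One hypothetical founding group that yields Population A (67% blue, 33% red).
population_a = ["blue", "blue", "red"]

print(haplotype_frequencies(parent))
print(haplotype_frequencies(population_a))

# A random draw of three founders will usually look different again.
print(haplotype_frequencies(draw_founders(parent, 3, seed=1)))
```

With only three founders, at most three haplotypes can make the journey, and often fewer do, which is the loss of diversity described above.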

A parent population can also be divided by an event (geographic, religious, political, etc.) that isolates one part from the other. This can also affect haplotype frequencies. Below, we see a split from our original population that results in two daughter populations: Population B is 60% red, 20% blue, and 20% green while Population C is 20% red, 40% blue, and 40% green, although both originated from the same parent population.

Over time, these types of events lead to something called “fixation”, in which all but one haplotype is lost in a population. Fixation is great for estimating our affiliations with specific geographical populations, i.e., ethnicity. If one population is fixed for, say, the red haplotype, another is fixed for blue, and your DNA results indicate you have blue, it’s trivial to infer to which population your ancestors belonged.
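To see how a small, isolated population can drift to fixation by chance alone, here is a minimal sketch using a simplified random-resampling model (in the style of the textbook Wright-Fisher model). That model is a stand-in chosen for illustration and was not part of the talk. The population size of 20 and the function name are assumptions.

```python
import random

def generations_to_fixation(population, seed=0, max_generations=100_000):
    """Resample the population each generation until one haplotype remains.

    Each child copies the haplotype of a randomly chosen member of the
    previous generation. Returns (surviving haplotype, generations taken),
    or (None, max_generations) if fixation was not reached in time.
    """
    rng = random.Random(seed)
    current = list(population)
    generation = 0
    while len(set(current)) > 1:
        if generation >= max_generations:
            return None, generation
        current = rng.choices(current, k=len(current))
        generation += 1
    return current[0], generation

# Hypothetical isolated population of 20: 40% red, 30% blue, 30% green.
start = ["red"] * 8 + ["blue"] * 6 + ["green"] * 6

for seed in range(5):
    winner, generations = generations_to_fixation(start, seed=seed)
    print(f"seed {seed}: fixed for {winner} after {generations} generations")
```

Different seeds can end with different surviving haplotypes, which is how separated populations can become fixed for different ones.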

Of course, in real life, ethnicity estimates are not so simple. One complicating factor is migration, which increases diversity rather than decreasing it the way founder effects and population splits do. Imagine that three individuals from Population C move into Population A, bringing the green haplotype along with them. Now, Population A has more genetic diversity than it did before and will be less distinct from neighboring populations.
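The effect of migration on diversity can be checked with simple arithmetic. The sketch below assumes Population A has just three members (two blue, one red) and that the three migrants from Population C carry two green haplotypes and one blue; both are hypothetical numbers picked for illustration. Diversity is measured here as the chance that two haplotypes drawn at random (with replacement) are different, which is one common choice among several.

```python
from collections import Counter

def haplotype_frequencies(haplotypes):
    """Return each haplotype's share of the population as a fraction."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {name: count / total for name, count in counts.items()}

def diversity(haplotypes):
    """Chance that two haplotypes drawn at random (with replacement) differ."""
    freqs = haplotype_frequencies(haplotypes)
    return 1.0 - sum(p * p for p in freqs.values())

# Hypothetical Population A before migration: 67% blue, 33% red.
before = ["blue", "blue", "red"]

# Hypothetical migrants from Population C, which carries green.
migrants = ["green", "green", "blue"]
after = before + migrants

print("before:", haplotype_frequencies(before), round(diversity(before), 3))
print("after: ", haplotype_frequencies(after), round(diversity(after), 3))
```

Under these assumptions, Population A gains a haplotype it did not have before and its diversity score goes up, so it looks more like its neighbors.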

The tug-of-war between events that lead to fixation and those that increase diversity in a population is an area of ongoing research.

Practical Estimation of Ethnicity

In practice, the companies estimate ethnicity by comparing the haplotypes in our results to those of a “reference panel”. The reference panel comprises carefully selected individuals with deep roots in a region. Because most people didn’t move very far historically, if all of someone’s grandparents or great grandparents were from the same area, we can often assume that their more distant ancestors were also from that locale. Such a person would be a good candidate to represent their population in the reference panel.

Because each of us has many haplotypes and because populations are not fixed for all of them, estimating ethnicity becomes a question of statistics rather than a simple this-or-that assessment. The reference panel is analyzed to find distinguishing haplotypes. The plot below (Figure 3.3 in the AncestryDNA Ethnicity Estimate White Paper, October 2013) shows some candidates for the European Panel used by AncestryDNA.

Our individual results are compared to the most definitive haplotypes to estimate the percent contribution that each population has had to our own genomes.
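As a rough idea of what "a question of statistics" can mean here, the sketch below estimates percent contributions with a simple expectation-maximization loop over known reference frequencies. This is a generic mixture-proportion technique used for illustration and should not be read as AncestryDNA's actual method. The function name, the reference panel (Populations B and C from the example above), and the test person's haplotypes are all assumptions.

```python
def estimate_contributions(observed, reference, iterations=200):
    """Estimate what share of the observed haplotypes came from each population.

    observed:  list of haplotype names found in one person
    reference: {population: {haplotype: frequency in that population}}
    Returns {population: estimated share}, with shares summing to 1.
    """
    populations = list(reference)
    weights = {pop: 1.0 / len(populations) for pop in populations}
    for _ in range(iterations):
        totals = {pop: 0.0 for pop in populations}
        for haplotype in observed:
            # How plausible is each population as the source of this haplotype?
            scores = {pop: weights[pop] * reference[pop].get(haplotype, 0.0)
                      for pop in populations}
            norm = sum(scores.values())
            if norm == 0:
                continue  # haplotype absent from every reference population
            for pop in populations:
                totals[pop] += scores[pop] / norm
        used = sum(totals.values())
        if used == 0:
            break  # nothing in the sample was informative
        weights = {pop: totals[pop] / used for pop in populations}
    return weights

# Hypothetical reference panel built from Populations B and C above.
reference_panel = {
    "B": {"red": 0.6, "blue": 0.2, "green": 0.2},
    "C": {"red": 0.2, "blue": 0.4, "green": 0.4},
}

# Hypothetical person carrying ten informative haplotypes.
person = ["red"] * 6 + ["blue"] * 2 + ["green"] * 2

for population, share in estimate_contributions(person, reference_panel).items():
    print(f"Population {population}: {share:.0%}")
```

When the reference populations are fixed for different haplotypes, the loop simply counts: three red and one blue haplotype give 75% and 25%. When frequencies overlap, as with B and C, the answer is an estimate with real uncertainty.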

Haplotypes Are Also Used in the Matching Algorithm

AncestryDNA uses haplotypes when matching us to our DNA relatives in two different ways. The first is to “pseudophase” our DNA, separating the strings we inherited from one parent from those inherited from the other parent, even if our parents haven’t tested. The mathematical models are complicated and include the recombination rate, specific genotype information, and reference haplotypes.

The second use addresses a pitfall. Some haplotypes may match between two people because they exist at high frequency in a population and not necessarily because those individuals share recent ancestors. Trying to chase down such a connection would be fruitless, because the segment may have originated dozens of generations back. Worse, one could be misled into thinking that such a segment is proof of a specific relationship when it isn’t.

To correct for this problem, AncestryDNA uses an algorithm called TIMBER that down-weights high-frequency haplotypes. These common haplotypes aren’t thrown out entirely; they are just given less emphasis. For example, before TIMBER a matching segment might be 7 cM, but if it is very widespread in the reference population affiliated with two matching people, it might be down-weighted to contribute only 5 cM to the total they share. This adjustment helps account for population history in the matching process.
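The arithmetic of down-weighting can be sketched as follows. To be clear, this is not the TIMBER algorithm, whose details were not covered here. It is a toy weighting rule invented for illustration, in which a segment is cut into windows and windows that many other people also match count for less. The window sizes, match counts, the `typical_count` parameter, and both function names are assumptions.

```python
def window_weight(match_count, typical_count):
    """Full weight for ordinary windows, less weight for crowded ones.

    A window matched by no more people than is typical keeps weight 1.0.
    A window matched by twice as many people gets weight 0.5, and so on.
    The weight never reaches zero, so nothing is thrown out entirely.
    """
    if match_count <= typical_count:
        return 1.0
    return typical_count / match_count

def adjusted_cm(windows, typical_count):
    """Sum the window lengths (in cM) after applying each window's weight."""
    return sum(cm * window_weight(count, typical_count) for cm, count in windows)

# Hypothetical 7 cM segment split into seven 1 cM windows.
# Each tuple is (length in cM, number of people matching in that window).
segment = [(1.0, 50), (1.0, 80), (1.0, 100), (1.0, 100),
           (1.0, 200), (1.0, 400), (1.0, 400)]

raw_total = sum(cm for cm, _ in segment)
print("before weighting:", raw_total, "cM")
print("after weighting: ", adjusted_cm(segment, typical_count=100), "cM")
```

With these made-up numbers, the last three windows are unusually crowded, so the 7 cM segment contributes 5 cM to the shared total, mirroring the example above.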

Haplotypes in Genetic Communities

Finally, Genetic Communities at AncestryDNA bridge the gap between ethnicity estimates, which reflect shared history thousands of years ago, and close family matching, which only extends 5–10 generations back. Genetic Communities are based on extended networks of DNA matches. Not everyone in a network matches everyone else, but everyone matches someone. The networks can get big and messy, so a process called community detection helps to pull signal from the noise. The figure below from the AncestryDNA Help topic “How are Genetic Communities identified?” shows a network before and after community detection.

These networks are then annotated with pedigree information from public member trees to find common population history. For example, I was placed in the Southwestern Louisiana Acadians community, which fits perfectly with my family history.
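To get a feel for how groups can be pulled out of a match network, here is a deliberately crude sketch: ignore weak matches, then group everyone who remains connected. This is a much simpler stand-in for real community detection, and it is not the method AncestryDNA uses. The names, shared cM values, and the `min_shared_cm` threshold are all made up.

```python
from collections import defaultdict

def find_groups(matches, min_shared_cm):
    """Group people who stay connected once weak matches are ignored.

    matches: list of (person_a, person_b, shared_cm) tuples
    Returns a list of groups, each a sorted list of names.
    """
    neighbors = defaultdict(set)
    people = set()
    for a, b, shared_cm in matches:
        people.update((a, b))
        if shared_cm >= min_shared_cm:
            neighbors[a].add(b)
            neighbors[b].add(a)

    groups = []
    seen = set()
    for person in sorted(people):
        if person in seen:
            continue
        # Walk outward from this person along the remaining strong matches.
        group = set()
        stack = [person]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Hypothetical network: two tight clusters joined by one weak match.
matches = [
    ("ann", "bob", 45), ("bob", "cat", 60), ("ann", "cat", 38),
    ("dan", "eve", 52), ("eve", "fay", 41), ("dan", "fay", 47),
    ("cat", "dan", 8),
]

print(find_groups(matches, min_shared_cm=20))
print(find_groups(matches, min_shared_cm=0))
```

In this toy network, not everyone matches everyone else, but everyone matches someone. Dropping the single weak link between Cat and Dan separates the six people into two groups of three.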

In summary, Dr Wilson’s talk was a great overview of some of the population genetics principles underlying our results at AncestryDNA. He gave clear explanations of complex topics and answered questions from the audience in a way that the layperson could understand. As a geek, I appreciate that AncestryDNA sent someone who is actively working on the science behind our results to speak at the conference.