A song by the British–Irish folk-rock band The Waterboys has been playing on repeat in my head lately:
I pictured a rainbow
You held it in your hands
I had flashes
But you saw the plan
I wandered out in the world for years
While you just stayed in your room
I saw the crescent
You saw the whole of the moon
The whole of the moon
That’s because when I wrote about the risks to genetic privacy a couple of years ago using an example of cystic fibrosis, I saw only the crescent, a sliver of what is possible. The whole of the moon is much, much more sobering.
The Expectation of Genetic Privacy
The DNA testing companies go to great lengths to ensure that our raw data—the actual content of our genome tests—is protected. For example, AncestryDNA, 23andMe, and Living DNA all encrypt the genetic data in their possession so it’s unreadable without a special decoder “key”, and none of the genealogy matching databases knowingly expose our raw data to other users.
Most genealogists thus assume that their own genetic data is private. However, at sites with a chromosome browser, we can often infer some of our matches’ genomes simply by knowing which DNA segments they share with us. I previously showed how this would work for a medically relevant trait like cystic fibrosis. What I didn’t envision was how much of our genomes could be exposed with more sophisticated approaches.
Enter the Adversary
In a new manuscript entitled “Attacks on genetic privacy via uploads to genealogical databases,” Dr. Michael Edge and Professor Graham Coop of the University of California, Davis, describe how a so-called adversary could reconstruct personal genetic data held in any database that both accepts uploads and has a chromosome browser. That includes GEDmatch, Family Tree DNA, and MyHeritage. (Neither AncestryDNA nor 23andMe accepts uploads, and Living DNA does not currently offer a chromosome browser.)
An accompanying FAQ for the paper can be found here.
The intent of the paper was not to undermine genetic security but to alert the databases to vulnerabilities so that they can mitigate the risks. As part of this “white hat” approach, the authors first described three approaches for learning the genotypes of unsuspecting people, then recommended measures the databases can take to protect their users. Representatives at GEDmatch, Family Tree DNA, MyHeritage, Living DNA, and DNA.Land (which has since restructured) were notified 90 days before the manuscript appeared online so they would have time to improve security.
The methods are called IBS tiling, IBS probing, and IBS baiting. I will summarize them here. The paper is technical and goes beyond my knowledge of population genetics; any failures to explain clearly or accurately are entirely my fault.
The first method, IBS tiling, starts from a simple observation. I have access to my own genome. Someone who matches me on a given segment in a chromosome browser has some of the same genetic variants as I do at that spot. By looking at my own matching segments, I can learn a little bit about a lot of people. The more closely related to me they are, the more DNA we will share, and the more about their genomes I can figure out. I could learn much less about a distant relative’s genome.
However, if an adversary were to upload hundreds or thousands of real data files, they could figure out a lot about me. The bits and pieces of my genome that matched the uploaded files could be assembled like tiles in a mosaic—some bigger, some smaller—to get a fuller picture of my genetic makeup.
Drs Edge and Coop tested this approach using 872 genomes available in public databases. With a minimum matching segment size of 1 cM, they were able to recover at least one of the two alleles across 82% of a typical European’s genome. (An allele is a version of a gene or SNP.) For some people, it was up to 93%. They could only reconstruct 11% of the typical genome using a 3-cM threshold and only 1% with an 8-cM limit.
(In genealogy, we should never use 1-cM or even 3-cM segments, because they’re not likely to be identical by descent, or IBD. That is, the matching segment could match the chromosome you inherited from your mother at some bases and the one you inherited from your father at others, rather than a single continuous sequence of DNA in your genome. For tiling, though, it doesn’t matter whether the segments are IBD, just that they’re identical by state, or IBS.)
Consider that the authors used only 872 genomes. There are somewhere between 10,000 and 25,000 publicly available genomes.
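The mosaic idea can be sketched as a simple interval-merging exercise. This is my own toy illustration, not the authors' code: the segment positions are made up, and the function names `merge_segments` and `covered_fraction` are hypothetical. Each tuple stands for one matching segment (start and end in cM) that a chromosome browser might report between the target and one uploaded file.

```python
def merge_segments(segments):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous tile: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(segments, chrom_length, min_cm):
    """Fraction of a chromosome covered by segments at least min_cm long."""
    kept = [(s, e) for s, e in segments if e - s >= min_cm]
    covered = sum(e - s for s, e in merge_segments(kept))
    return covered / chrom_length


# Made-up matching segments on a hypothetical 100-cM chromosome.
matches = [(0, 4), (3, 12), (20, 21.5), (40, 48), (47, 50), (70, 72)]

for threshold in (1, 3, 8):
    print(threshold, covered_fraction(matches, 100, threshold))
```

With these invented numbers, the tiles cover 25.5% of the chromosome at a 1-cM threshold, 22% at 3 cM, and 17% at 8 cM. The real effect of raising the threshold, as reported in the paper, is far steeper, because short segments vastly outnumber long ones.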
IBS probing extends the idea of IBS tiling, but rather than trying to reconstruct the genome of a target person or persons, the goal is to identify multiple people with a specific genotype of interest. With probing, the adversary inserts a real segment of DNA into an otherwise fabricated genome that is designed not to match anyone. Thus, anyone who matches the modified genome must match on the region of interest.
The authors demonstrated how IBS probing would work using the APOE gene. Some variants of this gene increase the risk of late-onset Alzheimer’s disease. Using the same 872 publicly available genomes as before, they created two sets of probes, one set for a 1-cM matching threshold and one for a 3-cM matching threshold.
With the 3-cM threshold, 9.4% of the target people in the dataset matched one of the probes, and with a 1-cM threshold, 86% of them matched a probe.
Millions of human DNA sequences are publicly available in the GenBank repository, providing ready material for probe design.
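Here is a rough sketch of how a probe might behave, under several assumptions of my own. Genotypes are coded as the number of copies of the rarer allele (0, 1, or 2). The matcher is a naive one that breaks a segment only when one person has 0 and the other has 2. The "designed not to match" filler is simply 2 at every site, on the assumption that most people carry 0 there. The paper's actual probe construction and the databases' real matching algorithms may well differ, and the names `longest_ibs_run` and `build_probe` are hypothetical.

```python
def longest_ibs_run(a, b):
    """Return (start, length) of the longest run of sites with no 0-versus-2 clash."""
    best_start, best_len, run_start = 0, 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        if {x, y} == {0, 2}:
            run_start = i + 1  # opposite homozygotes end the current run
        elif i + 1 - run_start > best_len:
            best_start, best_len = run_start, i + 1 - run_start
    return best_start, best_len


def build_probe(donor, region, filler=2):
    """Keep the donor's real genotypes inside region; use filler everywhere else."""
    start, end = region
    return [g if start <= i < end else filler for i, g in enumerate(donor)]


# Toy data: 12 sites, with the region of interest at sites 4 through 7.
donor = [0, 0, 1, 0, 2, 1, 2, 0, 0, 1, 0, 0]
probe = build_probe(donor, (4, 8))

carrier = [0, 0, 0, 0, 2, 1, 2, 0, 0, 0, 0, 0]      # same genotypes as the donor in the region
non_carrier = [0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0]  # different genotypes in the region

print(longest_ibs_run(probe, carrier))
print(longest_ibs_run(probe, non_carrier))
```

In this toy case the carrier matches the probe across all four sites of the region and nowhere else, while the non-carrier never strings together more than a single compatible site. With a minimum segment length of four sites, only the carrier would show up as a match.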
The third method, IBS baiting, uses completely fabricated genomes. These fake datasets are designed to match everyone everywhere except at the sites of interest.
Technically, any given spot on a chromosome could contain one of four nucleotides, A, C, G, or T. In practice, though, only two are present in almost all cases. That is, at Site S, some people will carry two copies of A (one from each parent), some two copies of T, and some an A and a T, but no one will have either a C or a G.
This “biallelic” nature of our SNPs permits baiting. First, the adversary creates a genome in which every SNP is heterozygous. Such a dataset will match everyone in the database, regardless of which SNP or SNPs they have at each spot. For example, if the constructed dataset has AT at Site B, it will match everyone, whether their genotypes are AA, TT, or AT.
The adversary creates two such genomes. For the site of interest, one artificial genome will be homozygous for one allele (e.g., CC) and the other dataset homozygous for the other (e.g., TT). If both artificial genomes match someone at that spot, the adversary knows that person is heterozygous (has CT), whereas the adversary can determine whether the person has CC or TT by which of the two fake genomes matches there.
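The logic of a bait pair can be sketched in a few lines. This is a simplified illustration under my own assumptions: the matcher reports a match whenever two genomes share at least one allele at every site, the adversary can see only that yes-or-no answer, and the names `ibs_match`, `make_bait`, and `infer_site` are hypothetical. A real database reports matching segments rather than whole-genome matches and may tolerate occasional mismatches.

```python
def ibs_match(genome_a, genome_b):
    """True if the genomes share at least one allele at every site."""
    return all(set(a) & set(b) for a, b in zip(genome_a, genome_b))


def make_bait(site_alleles, site, homozygous_allele):
    """Fake genome: heterozygous everywhere except homozygous at one site."""
    bait = [a + b for a, b in site_alleles]
    bait[site] = homozygous_allele * 2
    return bait


def infer_site(match_oracle, site_alleles, site):
    """Infer the hidden genotype at one site from which of two baits match."""
    a, b = site_alleles[site]
    hit_a = match_oracle(make_bait(site_alleles, site, a))
    hit_b = match_oracle(make_bait(site_alleles, site, b))
    if hit_a and hit_b:
        return a + b  # both baits match: heterozygous
    if hit_a:
        return a + a
    if hit_b:
        return b + b
    return None  # neither bait matched; not expected at a biallelic site


# The two alleles known to occur at each of three toy sites.
site_alleles = [("A", "T"), ("C", "T"), ("A", "G")]

# The target's genotypes, which the adversary never sees directly.
hidden = ["AA", "CT", "GG"]


def oracle(bait):
    """Stand-in for the database: answers only whether the bait matches."""
    return ibs_match(bait, hidden)


recovered = [infer_site(oracle, site_alleles, i) for i in range(len(site_alleles))]
print(recovered)
```

Every genotype in `hidden` is recovered from nothing more than which baits matched, using two uploads per site. In practice one bait pair can carry many sites of interest at once, as long as they are far enough apart to fall in separate matching segments.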
Extending this method, the adversary could create pairs of genomes that are designed to test thousands of sites at once. Drs Edge and Coop predict that an adversary using 100 pairs of carefully constructed genomes with a 1-cM matching threshold could recover enough genome-wide data to impute the rest with 97–98% accuracy. (Imputing is a statistical inference method.) And with a few thousand pairs of uploads, every SNP could be determined directly, without imputation.
What Does It All Mean?
The key takeaway of this paper is that genealogy databases that allow uploads of DNA files and that have chromosome browsers are a much bigger risk to genetic privacy than previously thought. Precisely how risky is an open question, because the authors did not try to compromise a real database. (Remember, their goal was to alert the genealogy sites to vulnerabilities, not to exploit those weaknesses.) A valid question is whether baiting, which is the most intrusive approach, would work if the matching algorithm ignores occasional mismatches. Unfortunately, most of the companies do not explain their matching protocols in detail.
Another reasonable question is: Who would do this? I don’t have a good answer for that since it hasn’t, to our knowledge, happened yet. There have been suspicious uploads to GEDmatch from China that matched Europeans at abnormally high levels (hundreds of centimorgans), although those may have been innocent attempts to generate kits from incompatible technology.
I do believe that if there’s gain to be had, someone will try it eventually.
How the Databases Can Protect Us
Drs Edge and Coop outline ten approaches the databases can take to protect their users. I won’t enumerate them all here. Many would be invisible to the average user, like blocking file uploads that showed signs of manipulation. Some, like phased matching algorithms and ignoring small segments, would even improve the genealogical utility of the databases.
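As one example of what a "sign of manipulation" might look like, consider that a bait genome is heterozygous nearly everywhere, which no real person is. The sketch below flags uploads with an unusually high share of heterozygous sites or an unusually long unbroken run of them. This is my own illustration, not a description of any database's actual screening. The function names and both cutoffs (`max_fraction` and `max_run`) are placeholders I picked for the example, not values from the paper.

```python
def heterozygosity_stats(genotypes):
    """Return (fraction of heterozygous sites, longest run of heterozygous sites)."""
    if not genotypes:
        return 0.0, 0
    het_flags = [g[0] != g[1] for g in genotypes]
    longest = run = 0
    for is_het in het_flags:
        run = run + 1 if is_het else 0
        longest = max(longest, run)
    return sum(het_flags) / len(het_flags), longest


def looks_manipulated(genotypes, max_fraction=0.6, max_run=50):
    """Flag an upload whose heterozygosity is implausible (placeholder cutoffs)."""
    fraction, longest = heterozygosity_stats(genotypes)
    return fraction > max_fraction or longest > max_run


bait_like = ["AT"] * 100                   # heterozygous at every site
ordinary = ["AA", "AT", "TT", "AA"] * 25   # a mix of genotypes

print(looks_manipulated(bait_like))
print(looks_manipulated(ordinary))
```

A check like this would catch the crudest baits, though a careful adversary could presumably dilute the signal, which is one reason a single measure is unlikely to be enough.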
The single most effective protection would be cryptographic signatures, digital security codes to prove that the data files come from authorized sources and have not been altered. For this to work, though, all of the major databases would have to cooperate. The two largest testing companies, AncestryDNA and 23andMe, are the least vulnerable to an adversary because they do not accept uploads, yet they would have to invest the most in cryptography for it to work. They would benefit only indirectly, in that cryptographic signatures would increase general trust in DNA testing.
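To give a feel for how tamper detection works, here is a minimal sketch using Python's standard library. Note the substitution: a true cryptographic signature uses a private key to sign and a public key to verify, whereas this example uses an HMAC, which relies on a single shared secret. I chose it only because it needs no extra software. The key, the sample file contents, and the function names `sign_file` and `verify_file` are all hypothetical.

```python
import hashlib
import hmac


def sign_file(raw_data: bytes, key: bytes) -> str:
    """Compute an authentication tag for a raw data file (HMAC-SHA256)."""
    return hmac.new(key, raw_data, hashlib.sha256).hexdigest()


def verify_file(raw_data: bytes, tag: str, key: bytes) -> bool:
    """Check that the file still matches the tag issued by the testing company."""
    return hmac.compare_digest(sign_file(raw_data, key), tag)


# Hypothetical secret held by the testing company and the verifying database.
company_key = b"example-secret-key"

# A made-up two-line raw data file and the tag issued with it.
original = b"rs0001\t1\t1000\tAA\nrs0002\t1\t2000\tCT\n"
tag = sign_file(original, company_key)

# The same file with one genotype changed, as a bait or probe would require.
altered = original.replace(b"CT", b"CC")

print(verify_file(original, tag, company_key))
print(verify_file(altered, tag, company_key))
```

The untouched file verifies and the altered one does not, so a database that insisted on a valid tag would reject fabricated or edited uploads. A real deployment would need public-key signatures so that databases could verify files without being able to create valid tags themselves.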
How You Can Protect Yourself
It’s unclear at this time what measures the matching databases have taken to protect their users. For example, one recommendation to mitigate the dangers of IBS tiling, IBS probing, and IBS baiting is to report only matching segments of 8 cM and above, but as of this posting, both Family Tree DNA and GEDmatch still report segments as small as 1 cM. That doesn’t mean they haven’t done anything, of course, because several of the protective measures will not be apparent to users of the database.
To protect yourself, you must first decide how much protection you want. If you are comfortable uploading your raw genetic data to a public database like the Harvard Personal Genome Project, knowing that anyone could download it and re-identify you, then you really have nothing to worry about.
At the other extreme, if you don’t want anyone to know anything about the contents of your genome, then you should test only at sites without a chromosome browser: AncestryDNA, which has none, or 23andMe, where you can opt out of the browser. There’s no right or wrong answer, and most of us will find ourselves somewhere in between.
The table below summarizes the risks at the main databases. Vulnerabilities are shown in red.
| Database | Accepts Uploads | Chromosome Browser | Minimum Match Threshold | Smallest Matching Segment |
|---|---|---|---|---|
| AncestryDNA | No | No | 6 cM | 6 cM |
| 23andMe | No | Optional | 7 cM | 5 cM |
| MyHeritage | Yes | Yes | 8 cM | 6 cM |
| Family Tree DNA | Yes | Yes | 7.7 or 9 cM | 1 cM |
*The default match threshold at GEDmatch is 7 cM for one-to-many searches, but any kit can be compared directly to any other kit with a minimum segment size of 1 cM.
Updates to This Post
22 October 2019 — Added a link to the FAQ for the paper.