As this post was set to publish, MyHeritage announced a phishing attack that may stem from the GEDmatch hack. (Read more here.) Meanwhile, many GEDmatch users have seen odd matches over the past 2 days. This post addresses why those peculiar matches are probably not a concern. The bigger worry is that hackers may now try to use our GEDmatch email addresses to trick us into revealing our login credentials to other websites. Constant vigilance!
After the recent privacy breach at GEDmatch, many users who logged in saw unexpectedly close matches who shared “impossible” amounts of DNA. Several Europeans reported very close matches to fully Asian kits, which didn’t make sense given geography and family history.
Under the circumstances—a database hack followed by suspicious matches—it’s easy to be alarmed. There’s a potentially innocuous explanation, though, that GEDmatch users should be aware of.
First, a Little Background
DNA is made of four nucleotides (A, C, G, and T). Our genomes are 3 billion nucleotide pairs long, and at more than 99% of those sites, all humans are identical. It’s those other sites, the ones that vary among individuals, that we care about for genealogy.
Those variable sites are called single nucleotide polymorphisms, because biologists just love fancy terms. Fortunately, they also love acronyms for their fancy terms, so we can just call them SNPs (pronounced “snips”).
In theory, a given SNP position could have any of the four nucleotides. In practice, though, the SNPs we use for genealogy only have two states. (A very few have three states.)
Remember, we have two copies of each SNP, one inherited from our mothers and one from our fathers. If you looked at Hypothetical Position P for everyone in a database, each person would have either AA, TT, or AT, depending on whether they inherited the same nucleotide from each parent or an A from one parent and a T from the other. No one would have a C or a G in any combination at that position.
For two people to match at that site, they only need to share one of their two nucleotides. Someone with AA at Position P would not match someone with TT, but someone with AT would match both AA and TT people as well as other AT individuals. That is, AT would match everyone in the database. The fancy term for having two different nucleotides at a position is heterozygous. (Sorry, no nifty acronym for that one.)
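For readers who like to see the rule spelled out, here is a minimal Python sketch of the single-site comparison described above. The function name `shares_allele` is made up for illustration, and this is not how GEDmatch actually implements matching; it only encodes the “share at least one nucleotide” idea.

```python
def shares_allele(genotype_a: str, genotype_b: str) -> bool:
    """Return True if two genotypes at one SNP have at least one nucleotide in common.

    Each genotype is a two-letter string such as "AA", "AT", or "TT".
    (Hypothetical helper for illustration only.)
    """
    return bool(set(genotype_a) & set(genotype_b))


# Compare every pairing mentioned in the text at Hypothetical Position P.
for a, b in [("AA", "TT"), ("AT", "AA"), ("AT", "TT"), ("AT", "AT")]:
    print(a, "vs", b, "->", "match" if shares_allele(a, b) else "no match")
```

Only the AA versus TT pairing fails the test; every pairing that involves AT passes.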
While having two-state SNPs simplifies the computational analyses involved in matching, it also makes the system easy to manipulate. An artificial kit that is heterozygous at every position would match everyone in the database as a parent–child relative.
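To make that concrete, here is a toy Python simulation. It rests on several assumptions that are mine, not GEDmatch’s: every SNP is an A/T site, genotypes are drawn at random, and “matching” is just the fraction of sites where two kits share a nucleotide. Real matching works on long DNA segments with thresholds, so treat this as a sketch of the idea rather than the actual algorithm. All of the names (`random_kit`, `fraction_matching`, `fake_kit`) are hypothetical.

```python
import random


def shares_allele(genotype_a, genotype_b):
    """True if two genotypes at one SNP have at least one nucleotide in common."""
    return bool(set(genotype_a) & set(genotype_b))


def random_kit(n_sites, rng):
    """Build a made-up kit: one randomly chosen genotype per two-state SNP."""
    return [rng.choice(["AA", "AT", "TT"]) for _ in range(n_sites)]


def fraction_matching(kit_a, kit_b):
    """Fraction of sites at which the two kits share at least one nucleotide."""
    hits = sum(shares_allele(a, b) for a, b in zip(kit_a, kit_b))
    return hits / len(kit_a)


rng = random.Random(0)  # fixed seed so the toy example is repeatable
N_SITES = 1000

# A tiny pretend database of unrelated people.
database = [random_kit(N_SITES, rng) for _ in range(5)]

# The artificial kit: heterozygous at every single position.
fake_kit = ["AT"] * N_SITES

fake_scores = [fraction_matching(fake_kit, kit) for kit in database]
real_scores = [fraction_matching(database[0], kit) for kit in database[1:]]

print("Fake kit vs. everyone:  ", fake_scores)
print("Real kit vs. the others:", real_scores)
```

The all-heterozygous kit shares a nucleotide at every site with every kit in the pretend database, which is the same pattern a true parent and child show. The randomly generated “real” kits fall short of that when compared with one another.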
Whoa! Why Would Anyone Do That?
There are both legitimate and nefarious reasons why someone might create a kit that was partially or entirely fake.
Some of the unsavory ways an adversary might use altered kits to extract personal genetic information were described in scientific papers by Edge and Coop from the University of California, Davis, and by Ney et al. from the University of Washington.
The papers are complex, so you may prefer to read my layperson summary of the first one here.
After these papers were made public, GEDmatch implemented precautions to catch fake kits and prevent them from being uploaded to their database. They don’t seem to have deleted existing kits that had been altered, though, and that’s what people are probably seeing now.
Don’t Panic! It’s Not All Bad
As I said earlier, there are perfectly innocent reasons someone might manipulate kits. Some were undoubtedly created by hardcore genetic genealogists who were experimenting to learn more about how matching works. Others are Lazarus kits—reconstructions of an ancestor’s genome—that are matching imperfectly.
Many of the unusual Asian matches seem to be from legitimate genealogy companies. The excessive matching can be explained by differences in chips, the lab devices used to read our DNA. If the Asian companies use chips that don’t line up well enough with the data in GEDmatch, the kits won’t upload.
A workaround is to make educated guesses at the missing data. This process is called imputing, and it’s error prone. If it doesn’t work right, you’d get exactly what we’re seeing now: kits that match everyone when they shouldn’t.
The same imputing process is undoubtedly also being used for forensic kits in the database. Why? Because perps and crime victims don’t always leave behind perfect samples. The labs do the best they can with what they’re given, then they have to impute the rest.
Basically, many of the kits we’re seeing are failed experiments for good causes. Those experimental kits should have remained hidden but got opted into matching during the hack.
What percentage of the “weird” kits are innocuous versus malign? Frankly, I have no idea. But to gauge your own level of comfort with the privacy breach, it’s important to understand that most of what we’re seeing is probably benign.
A much more serious concern is how hackers might use identifying information obtained from GEDmatch—like names, email addresses, and where we tested—to trick us into revealing sensitive login information.