As this post was set to publish, MyHeritage announced a phishing attack that may stem from the GEDmatch hack. (Read more here.) Meanwhile, many GEDmatch users have seen odd matches over the past 2 days. This post addresses why those peculiar matches are probably not a concern. The bigger worry is that hackers may now try to use our GEDmatch email addresses to trick us into revealing our login credentials to other websites. Constant vigilance!
After the recent privacy breach at GEDmatch, users who logged in often saw unexpectedly close matches who shared “impossible” amounts of DNA. Several Europeans reported very close matches to fully Asian kits, which didn’t make sense given geography and family history.
Under the circumstances—a database hack followed by suspicious matches—it’s easy to be alarmed. There’s a potentially innocuous explanation, though, that GEDmatch users should be aware of.
First, a Little Background
DNA is made of four nucleotides (A, C, G, and T). Our genomes are 3 billion nucleotide pairs long, and at roughly 99.9% of those sites, all humans are identical. It’s those other sites, the ones that vary among individuals, that we care about for genealogy.
Those variable sites are called single nucleotide polymorphisms, because biologists just love fancy terms. Fortunately, they also love acronyms for their fancy terms, so we can just call them SNPs (pronounced “snips”).
In theory, a given SNP position could have any of the four nucleotides. In practice, though, the SNPs we use for genealogy only have two states. (A very few have three states.)
Remember, we have two copies of each SNP, one inherited from our mothers and one from our fathers. If you looked at Hypothetical Position P for everyone in a database, each person would have either AA, TT, or AT, depending on whether they inherited the same nucleotide from each parent or an A from one parent and a T from the other. No one would have a C or a G in any combination at that position.
For two people to match at that site, they only need to share one of their two nucleotides. Someone with AA at Position P would not match someone with TT, but someone with AT would match both AA and TT people as well as other AT individuals. That is, AT would match everyone in the database. The fancy term for such positions is heterozygous. (Sorry, no nifty acronym for that one.)
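For readers who like to see rules written out, here is a minimal Python sketch of the single-site matching rule just described. The function name `sites_match` is my own invention for illustration; it is not GEDmatch's code, and real matching tools look at long runs of SNPs rather than one site at a time.

```python
def sites_match(genotype_a: str, genotype_b: str) -> bool:
    """Return True if two genotypes share at least one nucleotide.

    Each genotype is a two-letter string such as "AA", "AT", or "TT".
    This is a toy rule for a single two-state SNP, not a real matching algorithm.
    """
    return bool(set(genotype_a) & set(genotype_b))


# Try every pairing discussed in the text for Hypothetical Position P.
for a, b in [("AA", "TT"), ("AT", "AA"), ("AT", "TT"), ("AT", "AT")]:
    print(a, "vs", b, "->", sites_match(a, b))
```

Only the AA versus TT pairing fails the test; every pairing that involves an AT genotype passes.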
While having two-state SNPs simplifies the computational analyses involved in matching, it also makes the system easy to manipulate. An artificial kit that is heterozygous at every position would match everyone in the database as a parent–child relative.
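To see why an all-heterozygous kit is so effective, here is a small simulation sketch. Everything in it is a simplifying assumption on my part: every SNP uses A and T, each nucleotide is equally common, the "real" kits are random rather than related, and the helper names (`random_kit`, `fraction_matching`) are hypothetical. Actual matching software works with segments and thresholds that this toy ignores.

```python
import random


def sites_match(genotype_a: str, genotype_b: str) -> bool:
    # Two genotypes match at a site if they share at least one nucleotide.
    return bool(set(genotype_a) & set(genotype_b))


def random_kit(n_sites: int, rng: random.Random) -> list[str]:
    # Assumption: each site is two-state (A or T) with equal frequencies,
    # and one nucleotide is drawn for each parental copy.
    return [rng.choice("AT") + rng.choice("AT") for _ in range(n_sites)]


def fraction_matching(kit_a: list[str], kit_b: list[str]) -> float:
    # Share of sites at which the two kits have a nucleotide in common.
    hits = sum(sites_match(a, b) for a, b in zip(kit_a, kit_b))
    return hits / len(kit_a)


rng = random.Random(42)
n_sites = 10_000
real_kits = [random_kit(n_sites, rng) for _ in range(5)]
fake_kit = ["AT"] * n_sites  # artificial kit: heterozygous everywhere

fake_scores = [fraction_matching(fake_kit, kit) for kit in real_kits]
real_score = fraction_matching(real_kits[0], real_kits[1])

print("fake kit vs each real kit:", fake_scores)
print("two unrelated random kits:", real_score)
```

The fake kit shares a nucleotide with every simulated kit at every site, which is the pattern a parent and child show. Two unrelated random kits fall short of that because some sites are AA in one person and TT in the other.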
Whoa! Why Would Anyone Do That?
There are both legitimate and nefarious reasons why someone might create a kit that was partially or entirely fake.
Some of the unsavory ways an adversary might use altered kits to extract personal genetic information were described in scientific papers by Edge and Coop from the University of California, Davis, and by Ney et al. from the University of Washington.
The papers are complex, so you may prefer to read my layperson summary of the first one here.
After these papers were made public, GEDmatch implemented precautions to catch fake kits and prevent them from being uploaded to their database. They don’t seem to have deleted existing kits that had been altered, though, and that’s what people are probably seeing now.
Don’t Panic! It’s Not All Bad
As I said earlier, there are perfectly innocent reasons someone might manipulate kits. Some were undoubtedly created by hardcore genetic genealogists who were experimenting to learn more about how matching works. Others are Lazarus kits—reconstructions of an ancestor’s genome—that are matching imperfectly.
Many of the unusual Asian matches seem to be from legitimate genealogy companies. The excessive matching can be explained by differences in chips, the lab devices used to read our DNA. If the Asian companies use chips that don’t line up well enough with the data in GEDmatch, the kits won’t upload.
A workaround is to make educated guesses at the missing data. This process is called imputing, and it’s error prone. If it doesn’t work right, you’d get exactly what we’re seeing now: kits that match everyone when they shouldn’t.
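Here is one way that could play out, purely as an illustration. The sketch below assumes a deliberately crude imputation step that fills every unread site with a heterozygous guess. That failure mode is my assumption, not a description of any company's pipeline (real imputation relies on reference panels), and the names `naive_impute` and `fraction_matching` are hypothetical.

```python
import random


def sites_match(genotype_a: str, genotype_b: str) -> bool:
    # Two genotypes match at a site if they share at least one nucleotide.
    return bool(set(genotype_a) & set(genotype_b))


def fraction_matching(kit_a: list[str], kit_b: list[str]) -> float:
    # Share of sites at which the two kits have a nucleotide in common.
    hits = sum(sites_match(a, b) for a, b in zip(kit_a, kit_b))
    return hits / len(kit_a)


def naive_impute(kit: list[str], missing: str = "--", guess: str = "AT") -> list[str]:
    # Assumed failure mode: every missing call becomes a heterozygous guess.
    return [guess if genotype == missing else genotype for genotype in kit]


rng = random.Random(7)
n_sites = 10_000

# Two unrelated toy kits at two-state (A/T) sites with equal frequencies.
other_kit = [rng.choice("AT") + rng.choice("AT") for _ in range(n_sites)]
true_kit = [rng.choice("AT") + rng.choice("AT") for _ in range(n_sites)]

# Pretend the chip only overlapped about 20% of the sites; the rest are missing.
partial_kit = [g if rng.random() < 0.2 else "--" for g in true_kit]
imputed_kit = naive_impute(partial_kit)

true_score = fraction_matching(true_kit, other_kit)
imputed_score = fraction_matching(imputed_kit, other_kit)

print("true genotypes vs stranger:   ", true_score)
print("badly imputed kit vs stranger:", imputed_score)
```

Under these assumptions, the badly imputed kit matches a complete stranger at more sites than the person's true genotypes would, because every guessed site behaves like the all-matching AT case.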
The same imputing process is undoubtedly also being used for forensic kits in the database. Why? Because perps and crime victims don’t always leave behind perfect samples. The labs do the best they can with what they’re given, then they have to impute the rest.
Basically, many of the kits we’re seeing are failed experiments for good causes. Those experimental kits should have remained hidden but got opted into matching during the hack.
What percentage of the “weird” kits are innocuous versus malign? Frankly, I have no idea. But to gauge your own level of comfort with the privacy breach, it’s important to understand that much of what we’re seeing is probably benign.
A much more serious concern is how hackers might use identifying information obtained from GEDmatch—like names, email addresses, and where we tested—to trick us into revealing sensitive login information.
Hi thednageek,
I had no idea these phishing attacks were going on until I read your post. Thanks very much for the heads-up. I have my DNA on various sites, as do many.
And also thanks for the biology/biochem lesson. Lovely to see such a nice diagram. Have a small query – re “DNA is made of four nucleotides (A, C, G, and T)”. Thought the A, C, G, T stood for the names of the bases within the nucleotides, as shown in the diagram (Adenine, Cytosine, Guanine and Thymine)? Of course the full names of the nucleotides are much more of an unwieldy mouthful. I’ll have a go at reading the papers you have included, and your summary. Many thanks again.
It’s complicated. Sometimes the distinctions between adenine, adenosine, and adenosine monophosphate are just too much to explain and distract from the main point, so I gloss that over and just use A, C, G, and T. I hope you can forgive me!
Great Post.
I had run into that same situation with a more ancient Lazarus kit I had built. It is for research, and private, but if I do a full search, the number of matches crashes the program. ~200,000? The positive was that by going back to specific segments, doing the matches, and running the interesting ones through Q-Comparison, I garnered a huge amount of information. I can fully understand someone building weird kits, and it is easily possible with the Lazarus Program. I could understand the confusion if it went public without explanation.
This kit actually pulled out matches to descent lines from around 1600. There was no DNA entered that would have directed the results in this direction, but the matches matched one of the theoretical descents of the family.
The loss of GEDMatch, and the capabilities that are there but not yet fully understood, would truly be a disaster.
(and thank you for your work on the DNA Geek Posts!)
Your talents never cease to amaze. I was just contemplating the source and use of these kits and behold: your well-executed explanation. My tired brain no longer had to ponder. It has been a tiring two days dealing with another matter. So thank you.
And I had at one time thought I would need to construct a Lazarus kit. I was searching for a great aunt with no name and no DOB. Few DNA matches were generated from family testing overall. Then the third cousin match that solved the missing connection from one family to another in 1905 with a name change was delivered by the DNA angels!
Every day I think how much would be accomplished if all the scammers and hackers put that energy and ingenuity to good use.
My research kit LB3843118 is a “mother of all people” kit. I created it just out of curiosity and as a proof of concept when I got the idea. I never thought it could become anything but a research kit. The kit is still available at GEDmatch for One to One matching if somebody wants to check. “Sun Mutsis :-D” is your mother, too. 😉
Very interesting! The kit doesn’t work for one-to-many, but it matches every other kit as a parent–child.