DNA data security
Public fears the misuse of genetic data
When it comes to assessing personal DNA (whether for medical or entertainment purposes), the most common worry people appear to have is the lack of privacy protection around such personal information. Their biggest concern is that genetic information could be misused by corporations or governments to sort people into subgroups that determine their access to employment, insurance or even education, or that expose them to public ostracism based on presumed ethnic backgrounds or conditions.
It should come as no surprise, then, that this fear is propagated even by some official channels, including genetic counseling sites. In Canada, however, the fear is unfounded: in 2017 Canada passed a law protecting the privacy of one's genetic information, thus safeguarding personal DNA information from being used for discriminatory purposes. It seems the majority of the internet and professional world has not yet caught up, so we hope this article can become a valuable source by exploring and explaining the current situation.
In the past, the fear of misuse of genetic information by interested third parties was always blasted your way in a bid to fulfill the ethical demand to fully inform the patient of all potential benefits and limitations of genetic testing. One of the most popular presumed limitations that has been latched onto (without much supporting evidence) is that your genetic information could be used by insurance companies to discriminate against you. Ironically, the breach of ethical standards now comes from the misinformation itself: the genetic discrimination we are warned about is now against the law, so either the fears are unfounded or the offending party will find itself in court.
Nevertheless, this does not take away from the importance of the issue, which is the protection of such a valuable resource: your DNA code.
Thus it was exciting for us when we recently discovered a live discussion on Reddit dedicated to DNA data security. However, the discussion dealt less with current security protocols than with people venting their emotional points of view on what DNA “should” versus “should not” be used for - at times in entire essays! So instead of you having to wade through all that, we thought we would provide you with a summary of the more interesting tidbits without all the digressions.
Right off the top, there was a somewhat dated reference from Nature magazine on how genomic information could be protected. It is a great review lesson, though, that complements the Reddit experience, so let us dive in. The authors of the article state that the prediction of genetic predispositions to complex traits, like the predisposition to type 2 diabetes currently offered by 23andMe, will require the genetic analysis of millions of people. That means there is a lot of genetic information that needs to be kept safe, now as well as into the future.
A broad range of security threats already exists against stored genetic data, including: cracking weak database passwords; hacking the servers that store the data; theft or loss of physical storage devices due to inadequate security; and intentional criminal misconduct by those with access to the data.
It comes down to what information can be obtained. Apparently, just having access to date of birth, sex and zip code can uniquely identify more than 60% of all American citizens. In this way, more than 30% of Personal Genome Project (PGP) participants were identified based on such information available in their PGP profiles.
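To make the idea of "uniquely identifying" combinations concrete, here is a minimal sketch of how one might measure it. The records are made-up toy values and the function name `unique_fraction` is our own invention for illustration, not something taken from the studies mentioned above.

```python
from collections import Counter

# Hypothetical toy records: (date_of_birth, sex, zip_code). Not real data.
records = [
    ("1980-03-14", "F", "02139"),
    ("1980-03-14", "F", "02139"),
    ("1975-11-02", "M", "02139"),
    ("1990-07-21", "F", "94110"),
    ("1962-01-30", "M", "60614"),
]

def unique_fraction(rows):
    """Return the share of rows whose combination of values occurs exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

print(unique_fraction(records))
```

In this toy set, three of the five people have a combination nobody else shares, so anyone holding a second list with names and the same three fields could single them out.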
Another type of information that is often associated with genetic information is family pedigree. Apparently even just knowing how many biological males versus females are present in each generation can help to uniquely identify 30% of the families. Once personal information can be attached to such pedigrees, identification of entire families becomes entirely feasible. This has already happened in Israel where the entire population registry was leaked to the web which allowed detailed family pedigrees to be constructed for all of the citizens of the nation! Another example was a similar breach of the Turkish registry in 2016.
Tracing through genetic data
In terms of genetic information, the first potential route of identity discovery is by identifying a surname from the Y-chromosome genetic data. This is because in most societies, surnames are passed from father to son, as is the Y-chromosome. One can actually compare someone's Y-chromosome data with available public genealogical databases to try to obtain a potential last name of the victim, and this could be nearly as valuable in identifying the exact person as having access to their zip code. On top of that, last names are highly searchable. This exact approach has been used in the past by individuals who were conceived by anonymous sperm donors or by adoptees, who used their own DNA to determine who their biological families were.
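As a rough illustration of the comparison described above, the sketch below ranks surnames by how closely their recorded Y-chromosome repeat counts match a query profile. The marker names are real Y-chromosome repeat markers, but the database, the repeat counts and the function `rank_surnames` are hypothetical; real genealogy services use many more markers and more careful statistics.

```python
# Hypothetical genealogy database: surname -> Y-chromosome repeat counts (toy values).
SURNAME_DB = {
    "Smith":  {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13},
    "Jones":  {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 13},
    "Miller": {"DYS19": 14, "DYS390": 22, "DYS391": 10, "DYS393": 12},
}

def rank_surnames(query, database):
    """Rank surnames by how few shared markers differ from the query profile."""
    scored = []
    for surname, profile in database.items():
        shared = [marker for marker in query if marker in profile]
        mismatches = sum(1 for marker in shared if query[marker] != profile[marker])
        scored.append((mismatches, surname))
    # Fewest mismatches first; ties are broken alphabetically by surname.
    return [surname for _, surname in sorted(scored)]

query = {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 12}
print(rank_surnames(query, SURNAME_DB))
```

The best-ranked surname is only a candidate, not proof, but as noted above it can be combined with other details to narrow a search considerably.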
Luckily though, Y-chromosome paternal relatedness is typically demonstrated by assessing repetitive DNA elements (the same stretch of DNA sequence repeated over and over), and that information is not usually released from research. The raw sequencing data would therefore have to be probed to get this type of information, which requires very specialized skills, or the data might simply not be available.
However, the remainder of an assessed genome in the genealogical public databases can also help zero in on potential family identity. In fact, this is even more powerful than scanning for Y-chromosome information.
For example, one study showed that having access to the genetic data of just 1% of a population would allow for over 50% of the population to have their anonymous DNA identified with a precise match to the correct individual or that individual’s same-sex sibling. The authors of that publication demonstrated that this entire process can be automated to make this even easier. Another group showed that access to genetic data of 2% of the population could provide a third-cousin genetic match to almost any person in that population.
In other words, due to genetic relatedness, this approach can be used to identify individuals who never submitted their genetic data to direct-to-consumer DNA testing companies if their relatives, even distant ones, provide their genetic information to these public databases. Considering how frequently this is now happening, and how readily genealogical records can be obtained, the authors questioned the possibility of sustaining individual genetic privacy for much longer without introducing new protection mechanisms.
This type of approach was used to demonstrate that specific persons could be identified among the supposedly anonymous participants of the 1000 Genomes Project, one of the earliest research projects to catalogue the diversity of the human genome.
With well over 25 million ancestry-related tests sold worldwide, we are rapidly approaching the point where 1-2% of the population of some countries has been genetically screened! This is especially relevant for US citizens of European descent, who have willingly provided their DNA in droves, and it is conceivable that the ability to precisely identify almost every single person in that group has already been reached. In other words, if you are a white American and you commit a crime and leave DNA evidence behind (almost a certainty), you can be caught! Or let’s twist it another way. If you are assaulted, scratch your attacker! Your assailant’s DNA will be collected and, depending on their ethnic background, potentially easily identified.
Genealogy as a culprit in privacy loss
On top of that, with improving technology, genealogical records can also be quite extensive! One recent study created the largest family tree in the world using genealogical records assembled online from the general population. It included records on 86 million people, which resulted in a single pedigree consisting of 13 million people!
In some instances both mitochondrial DNA and Y-chromosome short tandem repeat haplotypes were available for comparison, which helped to determine how accurate the non-genetic genealogical records can be. Comparing the parental lineage people claim against the genetic reality indicated a nonmaternity rate of only 0.3% (closely resembling the historical rates of adoption in the US), but the nonpaternity rate was much higher, at 1.9%! This means that approximately 1 in 50 people have a recorded father who is not actually their biological father, which is similar to what has been observed in other genetic studies. This shows a potential loss of privacy: such facts can no longer be covered up once genetic information is accessible.
Below is a TED talk about this by one of the authors.
So the alarming news is that successful identification of a person (whose DNA sample is available to you) depends mainly on the accessibility of the genetic data of a matched relative - which is becoming increasingly possible - and then on access to the genealogical data of those relatives, which can also be readily available.
The good news is that, at least for now, genetic data paired with the genealogical data that would allow such identification has not yet been released to the public. So someone would have to do some digging and organizing of the genealogical data. But this is exactly what law enforcement is doing to identify suspects in criminal cold cases when a DNA sample from the crime scene is available - they submit the crime scene sample to public databases as if it were just another client.
Personal pictures painted with DNA
Having access to DNA data also makes it possible to predict certain traits from the genetic data. This was a frequent worry voiced by people about how their genetic data could be abused, especially the idea that genetic information could be used to infer intimate details of personality type, which future employers could abuse down the road.
However, for the majority of personality traits, genetics explains only a small portion of a given trait, and thus the level of intimate personal detail would not be useful or reliable enough to categorize people.
As some of the researchers in the Reddit symposium pointed out, however, one study has shown that Facebook data can be used to reliably predict personality traits, and this could be a far greater cause for concern than the use of genetic data for such a purpose. We do have laws protecting us from this form of genetic information abuse, while Facebook information is as readily accessible as you inadvertently allow it to be.
There are some traits, however, that have a very high level of genetic contribution, such as eye color, and epigenetics can be used to determine the age of an individual. One company has even developed a DNA-based predictive model of facial features. However, databases do not normally collect these details to use towards personal identification (yet?). Still, you can see how powerful DNA data could be in the future in creating an image of a suspect or a target of abuse.
DNA itself is easily identifiable if it can be compared to an existing database. It only takes about 300 common mutations (SNPs) to uniquely identify any person. That's it! This fact is well understood, and it is why forensic DNA databases have been built around the world since the 1990s.
What are haplotypes?
And lastly, what many in the public might not even be aware of is the possibility of “educated guessing”, where some of the DNA data and the information it might hold can be guessed without even having access to it. This can be achieved with only partial genetic information from the targeted individual, or maybe none at all if genetic information from relatives is available!
Guessing what the missing genetic information is from pieces of the available DNA information is called genotype imputation. It has to do with the fact that when DNA is inherited, what you got from your ancestors came to you in specific large chunks of DNA called haplotypes. Half of your DNA comes from your mom and half from your dad. But the DNA you received from either of your parents was a scrambled version of their own parents' DNA, typically jumbled in large chunks at a time. So the DNA you got from your mom is made up of some big continuous pieces of DNA from your maternal grandma, which might be right next to another large stretch of DNA from your maternal grandpa, and so on. And the DNA you got from your dad is made up of a mixture of large segments of DNA from your paternal grandparents. In the same fashion, when you produce your own gametes to create your own child, that egg or sperm will be a random mixture of your mom's and dad's DNA, again pieced together in some big chunks.
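The chunk-wise shuffling described above can be pictured with a small simulation. This is only a sketch under simplifying assumptions of our own: one chromosome of 20 positions, exactly two crossover points chosen at random, and the letters "M" and "P" standing in for DNA that came from your mom and your dad. The function `make_gamete` is hypothetical.

```python
import random

def make_gamete(maternal, paternal, n_crossovers=2, rng=None):
    """Build one gamete by copying alternating chunks from two parental copies."""
    rng = rng or random.Random()
    length = len(maternal)
    # Pick distinct crossover positions, then switch source copy at each one.
    breakpoints = sorted(rng.sample(range(1, length), n_crossovers))
    sources = [maternal, paternal]
    current = rng.randrange(2)  # which copy the gamete starts with
    gamete, start = [], 0
    for stop in breakpoints + [length]:
        gamete.extend(sources[current][start:stop])
        current = 1 - current
        start = stop
    return "".join(gamete)

from_mom = "M" * 20  # the chromosome copy you inherited from your mom
from_dad = "P" * 20  # the chromosome copy you inherited from your dad
print(make_gamete(from_mom, from_dad, rng=random.Random(1)))
```

Each run prints a string of long runs of one letter followed by long runs of the other, rather than a fine-grained mix, which is the reason neighbouring mutations tend to be passed on together.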
Because DNA is inherited in these large chunks (haplotypes), certain mutations are inherited together. This phenomenon is called linkage disequilibrium. Yes, that is a super weird name - but it has to do with the fact that when mutations were being analyzed, some appeared to pop up along with others (exhibiting “linkage”), which is not the pattern you would expect if mutations were to occur just randomly. Hence “disequilibrium”: a departure from what would be expected by chance alone.
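For readers who like numbers, linkage disequilibrium between two positions is commonly summarized by two standard quantities: D, the difference between how often two variants actually appear together and how often they would if they were independent, and r², a version of D scaled to lie between 0 and 1. The sketch below computes both from a toy list of haplotypes; the sample data and the function name are hypothetical.

```python
def linkage_disequilibrium(haplotypes):
    """Compute D and r^2 for two sites from a list of 2-letter haplotypes.

    First letter is the variant at site 1 ('A' or 'a'), second letter the
    variant at site 2 ('B' or 'b'). Assumes both sites vary in the sample,
    otherwise r^2 is undefined (division by zero).
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "A" for h in haplotypes) / n   # frequency of A
    p_b = sum(h[1] == "B" for h in haplotypes) / n   # frequency of B
    p_ab = sum(h == "AB" for h in haplotypes) / n    # frequency of A and B together
    d = p_ab - p_a * p_b                             # zero if inherited independently
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

# Toy sample: A mostly travels with B, and a mostly travels with b.
sample = ["AB"] * 4 + ["ab"] * 4 + ["Ab"] + ["aB"]
d, r2 = linkage_disequilibrium(sample)
print(round(d, 4), round(r2, 4))
```

If the two variants were inherited independently, both values would be close to zero; the further they are from zero, the better one variant predicts the other.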
By now enough genetic data has been analyzed that we know which mutations travel together. This means that if you have access to information regarding one mutation, you can comfortably start guessing that somewhere nearby you could expect another specific one. This is the imputation process! You can use this process to start guessing personal DNA information from whatever you might already have available.
A very famous example of imputation concerns the very first person to have their full genome decoded with next generation sequencing technology - Dr. James Watson, one of the co-discoverers of the DNA structure. He had his genome sequenced in 2007 and made it publicly available with the exception of one gene, APOE, which could provide information about Alzheimer’s predisposition. Dr. Watson, who had witnessed the devastating effects of this condition first-hand in one of his family members, did not want to know this information, but it was quickly shown that his predisposition to the condition could easily be inferred by imputation, by looking at the DNA surrounding the APOE gene.
In the same fashion, getting hold of the genetic information of your relatives would allow you to start completing a potential genetic picture of yourself! It has also been demonstrated that a person's genetic predisposition to Alzheimer’s could be predicted by accessing the publicly available DNA information of their relatives! In this case it was no ordinary relative: it was the genome of Henrietta Lacks, one of the most studied human genomes in history, and studied without any consent ever being obtained from family members for this revelation of their genetic privacy.
Thus one can easily see how terribly such data could be abused! We should reconsider how publicly we share our genetic data, and maybe give thought to not leaving genetic data for just anyone to query, as who knows what the unintended consequences may be. Even the data made available for research to advance medicine should be very well protected.
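Here is a deliberately simplified sketch of that guessing step. It assumes a small reference panel of haplotypes (made-up strings of five neighbouring positions) and fills a missing position, marked "?", with the letter most common among the panel entries that best match the known positions. Real imputation tools use statistical models rather than this simple matching, and the names `REFERENCE_PANEL` and `impute` are our own.

```python
from collections import Counter

# Hypothetical reference panel: each string is one haplotype over 5 nearby positions.
REFERENCE_PANEL = ["ACGTA", "ACGTA", "ACGTA", "TCGCA", "TGACG", "TGACG"]

def impute(target, panel, missing="?"):
    """Fill missing positions using the panel haplotypes closest to the known ones."""
    known = [i for i, allele in enumerate(target) if allele != missing]

    def score(haplotype):
        # Number of known positions where this panel haplotype agrees with the target.
        return sum(haplotype[i] == target[i] for i in known)

    best = max(score(h) for h in panel)
    closest = [h for h in panel if score(h) == best]
    result = []
    for i, allele in enumerate(target):
        if allele == missing:
            # Take the most common letter at this position among the closest matches.
            allele = Counter(h[i] for h in closest).most_common(1)[0][0]
        result.append(allele)
    return "".join(result)

print(impute("AC?TA", REFERENCE_PANEL))
```

The hidden middle position is recovered purely from its neighbours, which is the same logic by which a withheld gene can be inferred from the DNA around it.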
How to protect your DNA
So, what are some mitigating practices to secure your genetic data? The simplest one is to remove all obvious personal identifiers. The information that is included should be anonymized so that every piece of biological or demographic information is shared by many other participants, and no record stands out and can be linked uniquely to one specific individual. Next is to actually control who has access to the data through a registration process, with demanding, rigorous data handling security that is closely monitored. The responsibility for monitoring can also be placed to some degree in the hands of the actual participants who deliver their genetic data to these databases. An example of this is a platform like genomes.io that controls access at an individual level.
In the next stage, the data can be encrypted and the system set up in such a way that any interpretation is done only on encrypted information, without releasing the genetic information. The field of cryptography has advanced rapidly in recent times, and many demonstrations have shown how genetic information could be searched without revealing compromising information.
One example of this approach is LunaDNA, where the link between personally identifiable information and the DNA data is encrypted and not accessible by any third parties. When an individual uploads their DNA to the database, LunaDNA becomes the custodian of the data but offers the participant company shares in return for the DNA data (the more comprehensive the data, the more shares received). When researchers wish to access the data for a specific purpose, it is the participants who decide if their data can be used, including access to any personal information that could be linked to the understanding of the genetic role underlying the trait being investigated. In this scenario, individuals control the right of access to their information while also reaping financial rewards for its use in the form of dividends. The participant can also remove themselves and all their data from the platform at any time.
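One common way to check the "no record stands out" requirement is known as k-anonymity: after coarsening the details, every combination of values should be shared by at least k people. We are naming this technique ourselves; the article's sources may use a different method. The sketch below uses made-up participants, and the particular coarsening (birth decade, first three zip digits) is just an assumption for illustration.

```python
from collections import Counter

def generalize(record):
    """Coarsen the details: keep only the birth decade and the first 3 zip digits."""
    birth_date, sex, zip_code = record
    decade = birth_date[:3] + "0s"
    return (decade, sex, zip_code[:3] + "**")

def smallest_group(records):
    """Size of the rarest combination of values (the 'k' in k-anonymity)."""
    return min(Counter(records).values())

# Hypothetical participants: (date_of_birth, sex, zip_code). Not real data.
participants = [
    ("1981-03-14", "F", "02139"),
    ("1984-09-02", "F", "02141"),
    ("1975-11-02", "M", "60614"),
    ("1979-05-19", "M", "60657"),
]

print(smallest_group(participants))                           # before coarsening
print(smallest_group([generalize(p) for p in participants]))  # after coarsening
```

Before coarsening every participant is unique; afterwards each combination is shared by two people. The trade-off is that coarser data is also less useful for research, so the level of generalization has to be chosen with care.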
Such a setup could also include searching the databases for a forensic match or a matching relative, where the identity could be provided without revealing the genetic information of an individual, for those who are in need of such a service. Another proposed alternative is that the genetic data that consumer companies release to their clients be signed with a cryptographic key. With the DNA data protected in this way, if it is uploaded to third party services, for example to identify relatives, the third party will have to obtain the key from the valid provider of the data before gaining access. This can ensure that the data is from a valid source and used in a valid setting. The ultimate goal here is to control access to an individual’s genetic information and prevent unauthorized access in any way, by anyone! Until this type of approach is enshrined in our culture and laws, the public will continue to face fears related to the potential misuse of their genetic data. The recent warrant granted to Florida police to scrutinize a public genetic database, despite the majority of the database's genetic contributors not providing consent to do so, is a prime example of why such fears have legitimate reasons to exist.
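The signing idea can be sketched with Python's standard `hmac` module. This is our own simplified illustration, not the scheme proposed in the literature: it assumes the testing company holds a secret key and attaches a tag to each raw data file, and a third party can only confirm the file is genuine with the provider's cooperation. A real deployment would more likely use public-key signatures. The key, the file contents and the function names are all hypothetical.

```python
import hashlib
import hmac

def sign_genotype_file(data: bytes, provider_key: bytes) -> str:
    """Provider side: compute a tag that binds the data to the provider's key."""
    return hmac.new(provider_key, data, hashlib.sha256).hexdigest()

def verify_genotype_file(data: bytes, tag: str, provider_key: bytes) -> bool:
    """Check that the data is unmodified and was tagged with the provider's key."""
    expected = sign_genotype_file(data, provider_key)
    return hmac.compare_digest(expected, tag)

# Hypothetical key and toy raw-data content, for illustration only.
provider_key = b"hypothetical-secret-held-by-the-testing-company"
raw_data = b"rs429358\tCT\nrs7412\tCC\n"
tag = sign_genotype_file(raw_data, provider_key)

print(verify_genotype_file(raw_data, tag, provider_key))               # genuine file
print(verify_genotype_file(raw_data + b"edited", tag, provider_key))   # altered file
```

A file that was altered, or that never came from the provider in the first place, fails the check, which is what lets a third party service refuse data that did not arrive through a valid channel.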
This article has been produced by Merogenomics Inc. and edited by Jason Chouinard, BSc. Reproduction and reuse of any portion of this content requires Merogenomics Inc. permission and source acknowledgment. It is your responsibility to obtain additional permissions from the third party owners that might be cited by Merogenomics Inc. Merogenomics Inc. disclaims any responsibility for any use you make of content owned by third parties without their permission.