Human versus gorilla DNA – size does matter!
Introduction to the human genome and its mutations
If you were to ask me what the greatest way to unlock information about the personal human genome would be, I would immediately think of the method termed “single-molecule genome sequencing”. It is a constantly evolving technology that reads single DNA molecules. Its main advantage is that it can read very long pieces of DNA, which allows for the study of the human genome on a macro scale that the most commonly employed methods cannot achieve. The primary producer of this technology is Pacific Biosciences. Oxford Nanopore is a later addition to the market, but one that has been stealing the limelight because its sequencer is a low-cost hand-held device, allowing anyone to have their own sequencer if they wish. It was catapulted into fame when NASA used it for the first sequencing experiments in space.
How does it work?
Here are some basics first.
The human genome is enormous. It is composed of approximately 3 billion units drawn from just four chemicals, called nucleotides, and it is the arrangement of these nucleotides that defines the program that cells use in order to execute their function. Cellular function then defines organ function, and eventually the function of the entire organism. So you can think of the DNA as the program that runs the show, because it is the blueprint for how the body is to be produced and how it reacts to stimuli. These nucleotides pair up with one another, which is why DNA is made of two strands of nucleotides, and each pair of nucleotides is referred to as a “base pair” - this is why you always see DNA portrayed as a spiral of two ribbons. So the human genome refers to the 3 billion base pairs of your DNA (packaged into 23 large fragments called chromosomes).
On top of that, each cell of your body has one genome inherited from your mother and one from your father, 3 billion base pairs each (and 23 chromosomes each), so the size of the genome in each of your cells is approximately 6 billion base pairs (and 46 chromosomes total). The exception is your reproductive cells, which contain a scrambled version of the two genomes you inherited from your parents, but divided in half (back down to 23 chromosomes again), and the union of reproductive cells from a man and a woman reunites two independent genomes to produce a new human being. The genomes from the reproductive cells of your mom and dad are nearly identical, more than 99% identical, but that still leaves millions upon millions of base pair differences.
When your genome is sequenced, it is the 6 billion base pairs of DNA that are being decoded in an instrument. That includes two sets of 3 billion base pairs of DNA that are nearly identical to one another, one set from your mom and one set from your dad, and all of the mutations that you have inherited from your mom and your dad are captured.
These mutations, or alterations (the scientific term for them is “variants”), are typically just a single DNA base pair changing into a different one, but can also include base pair insertions or deletions. The number of affected base pairs in a row can vary, and at times a very large number of DNA base pairs can be deleted, inserted, inverted or just moved into a new location, or a bizarre combination of all of these possibilities. These large alterations of the genome are referred to as “structural variants”.
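To make these variant types concrete, here is a minimal sketch that treats a piece of DNA as a plain string of letters. The function names and the 20-letter sequence are made up for illustration; real structural variants span thousands to millions of base pairs.

```python
# Toy model: one DNA strand as a string. All names and the sequence are hypothetical.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def substitute(seq, pos, base):
    """Single base change, the most common kind of variant."""
    return seq[:pos] + base + seq[pos + 1:]

def delete(seq, start, end):
    """Remove the bases in seq[start:end]."""
    return seq[:start] + seq[end:]

def insert(seq, pos, new_bases):
    """Add new bases in front of position pos."""
    return seq[:pos] + new_bases + seq[pos:]

def invert(seq, start, end):
    """Flip a segment; on double-stranded DNA that is its reverse complement."""
    flipped = "".join(COMPLEMENT[base] for base in reversed(seq[start:end]))
    return seq[:start] + flipped + seq[end:]

def move(seq, start, end, new_pos):
    """Cut seq[start:end] out and paste it at new_pos of what remains."""
    segment = seq[start:end]
    rest = seq[:start] + seq[end:]
    return rest[:new_pos] + segment + rest[new_pos:]

reference = "ACGTACGGTTCAGGCTAACG"
print("reference   ", reference)
print("substitution", substitute(reference, 3, "A"))
print("deletion    ", delete(reference, 5, 12))
print("insertion   ", insert(reference, 5, "TTTTTT"))
print("inversion   ", invert(reference, 5, 12))
print("moved       ", move(reference, 0, 5, 10))
```

Chaining two of these calls, for example an inversion followed by a deletion right next to it, gives the kind of complex combined event mentioned above.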
The grandest of structural variations is the duplication of an entire chromosome, but typically structural variations affect smaller amounts of the genome. Nevertheless, with so much of the DNA code potentially impacted, you can imagine that such outcomes might not be free of biological consequences. For example, it could influence traits by either affecting the gene expression dosage (the degree to which genes are used by the cells), or by exposing recessive mutations, meaning the type of mutations that need to be present in both the maternal and paternal copies of the genome you inherited to produce a disease. Having only one mutated copy is fine because the second good copy is enough for the cell to work, but what if a structural variant event messes up that one remaining good copy? Then there are problems!
Two types of technologies to sequence genomes
There are two types of technologies used to sequence genomes, any genomes.
The most common one cuts the genome up into millions of short fragments of about 250 base pairs each, and these short bits of your genome are all decoded by the instrument at the same time. This type of approach can be referred to as “short reads technology” since short fragments of the genome are being decoded (or “read” by the instruments, hence the name short reads). Computers then put all of these fragments back together to assemble your genome by comparing them with an existing reference of what a human genome looks like.
Basically imagine if you shredded a book in one of those office shredding machines, and afterwards you had to put it back together. Except that for a human genome, you wouldn’t be shredding one book, you would be shredding a whole bunch of bookshelves worth of books, and then putting it back together. You can probably appreciate the computational power required to decode a human genome in this way!
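The shred-and-reassemble idea can be sketched in a few lines. This is a toy under loud assumptions: the “genome” is 300 random letters, the sample being sequenced is identical to the reference, and reads are placed by exact text matching. Real aligners tolerate mismatches and use indexed search; the function names here are made up for illustration.

```python
import random

def shred(genome, read_length, n_reads, seed=0):
    """Cut n_reads random fragments of read_length bases out of the genome."""
    rng = random.Random(seed)
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_length] for s in starts]

def rebuild(reads, reference):
    """Place each read where it first matches the reference; gaps stay 'N'."""
    rebuilt = ["N"] * len(reference)
    for read in reads:
        pos = reference.find(read)
        if pos == -1:
            continue  # this toy version simply drops reads that do not match
        rebuilt[pos:pos + len(read)] = read
    return "".join(rebuilt)

# Hypothetical miniature reference genome of 300 random bases.
rng = random.Random(42)
reference = "".join(rng.choice("ACGT") for _ in range(300))

reads = shred(reference, read_length=25, n_reads=200)
rebuilt = rebuild(reads, reference)
covered = sum(base != "N" for base in rebuilt)
print(f"{covered} of {len(reference)} bases recovered from {len(reads)} reads")
```

Note that the reference does all the heavy lifting: without it, the only clue for ordering the fragments would be the overlaps between them.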
This is basically how nearly all human genomes are currently being sequenced, and it will almost certainly be your only commercial option, as this technology is brought to you nearly exclusively by a single company, the undisputed heavyweight of the sequencing world, Illumina.
As you can imagine, there are some limitations with such an approach. One of them is that you need a reference genome to compare to when putting your fragmented genome back together. That reference is a product of the famous Human Genome Project, which gave us the first ever look at the human genome in the early 2000s. But it is a combination of many different humans, and might not capture all of the different possibilities of what normal variation among the entire human species could be.
Another limitation is that reassembling the genome from so many tiny fragments against a reference does not allow one to properly capture the large structural alterations that might be present in your genome, and the information that can be captured is not easy to decipher. This is bound to lead to some misinterpretations, as human genomes can exhibit a great deal of large-scale variation, and structural variants are known to contribute to disease. Related to this are regions of the genome that are highly repetitive, whose architecture might not be captured with short reads. So while the technology is very good at figuring out what all the base pairs present in the genome are, it might miss out on some of the big picture arrangements of all of those base pairs.
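The trouble with repetitive regions can be shown with a toy example. The sequences below are invented: an 8-letter unit repeated five times between two unique flanks. A read that sits entirely inside the repeat matches in several places, while a read long enough to reach into the unique flanks can only go in one place.

```python
def find_all(read, reference):
    """Every position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

# Hypothetical region: unique flank + 5 copies of a repeat unit + unique flank.
unit = "ACGTTGCA"
reference = "GGATCC" + unit * 5 + "TTAGGC"

short_read = unit               # 8 bases, lies entirely inside the repeat
long_read = reference[3:49]     # 46 bases, spans the repeat and both flanks

print("short read could go at:", find_all(short_read, reference))
print("long read could go at: ", find_all(long_read, reference))
```

With only short reads, it is also hard to tell whether your genome carries five copies of the unit or, say, seven, because every copy produces the same reads.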
The final limitation I will comment on is that the short reads approach does not allow for one to differentiate the unique nucleotide content of your maternal genome contribution versus the paternal contribution. You get all of that information in one bag, and the maternal versus paternal components of your genome are not segregated. Such segregation of parental genomes is referred to as phasing. Again, having access to this type of information can have implications on understanding genetic disease origins.
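Here is a small sketch of why phasing needs long reads. Everything in it is hypothetical: a 33-letter stretch with two variable sites 20 letters apart, and idealized error-free reads. It compares two scenarios, one where the A variants at both sites sit on the same parental copy, and one where they are split between the two copies, and asks whether the pooled reads can tell them apart.

```python
def with_alleles(backbone, alleles):
    """Copy of the backbone with the given {position: base} substitutions."""
    bases = list(backbone)
    for position, base in alleles.items():
        bases[position] = base
    return "".join(bases)

def pooled_reads(maternal, paternal, read_length):
    """Every error-free read of this length, with both copies mixed in one bag."""
    reads = set()
    for haplotype in (maternal, paternal):
        for start in range(len(haplotype) - read_length + 1):
            reads.add(haplotype[start:start + read_length])
    return reads

backbone = "GATTACAGGCTTAACCGGATCCTAGGCATTCGA"  # made-up sequence
SITE_1, SITE_2 = 5, 25                          # two variable sites, 20 bases apart

# Scenario 1: one parental copy carries A at both sites, the other carries G at both.
same_copy = (with_alleles(backbone, {SITE_1: "A", SITE_2: "A"}),
             with_alleles(backbone, {SITE_1: "G", SITE_2: "G"}))
# Scenario 2: each parental copy carries one A and one G.
split_copies = (with_alleles(backbone, {SITE_1: "A", SITE_2: "G"}),
                with_alleles(backbone, {SITE_1: "G", SITE_2: "A"}))

for read_length in (8, 25):
    bag_1 = pooled_reads(*same_copy, read_length)
    bag_2 = pooled_reads(*split_copies, read_length)
    print(f"read length {read_length}: scenarios distinguishable = {bag_1 != bag_2}")
```

Reads shorter than the distance between the two sites never see both at once, so both scenarios produce exactly the same bag of reads. Only reads that span both sites reveal which variants travel together.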
On the other side of the spectrum, and the solution to the problems plaguing short read DNA sequencing technology, is long read DNA sequencing technology. As its name implies, it is a technology that can decode long stretches of DNA at a time, and its undisputed champion is the Pacific Biosciences company, which first showcased this approach.
How long are the reads that we are talking about? In one of its earliest examples of sequencing a human genome with long reads technology, the PacBio sequencing instruments decoded thousands of reads of an average length of around 5000 base pairs. That’s a huge difference from short read technology, and that’s just an average that was already achieved many years ago (in the genomic world, three years is a long time span, as this field progresses at an absurd pace!). There are far longer reads to be found as well.
The point of such long reads is that they make the assembly of the genome much more practical than trying to piece together a genome composed of tiny fragments. If we go back to the book analogy, the equivalent would be to try to put the book together with entire pages intact, as opposed to everything being shredded. Considering how frequently repetitive or duplicated elements are observed in the genome, and how complex structural variants can actually be, you can appreciate how such long reads can overcome many of the challenges observed with short reads. In the above cited example, just over 4% of the structural events were quite complex, such as the inversion of a DNA sequence combined with an adjacent sequence deletion. This technique is in fact so powerful that it allows for a “de novo genome assembly”, meaning an assembly of the genome without the need for a reference for comparison. This is a very formidable strategy.
De novo sequencing, or how to assemble your genome from scratch
While smaller-scale structural variation has been studied extensively in the past with different technologies, the picture will remain incomplete until more de novo human genomes are produced. The reality is that while we have what we think is an image of a complete human genome, the DNA of our species can accommodate a giant amount of variation, and much of this variation is still not known or understood precisely because it just has not been studied extensively.
Every single time a new human genome is assembled de novo, from scratch, we learn about new elements in the human genome that were never captured before in the current main reference. Going back to the above cited paper from 2015, sequencing of just that one human genome improved the most authoritative human genome reference standard by closing 34 thousand base pairs of previous gaps in the reference genome! Think about that! Sequencing of just one human genome with long read technology improved the gold standard reference used in short read genome assembly all over the world!
Why aren’t we just blasting through human genomes with PacBio technology? Well, first of all, there is the cost - in the publication cited above, the minimum cost was $30,000! That’s a lot of cash for a single genome, but we have heard that refrain before. Luckily, the cost has come down substantially since then, and rumors have it that the next generation of PacBio technology will rival the cost of the short reads technology offered by Illumina. Merogenomics will make sure you can have access to it in order to inspect your genome. For now, it is still not as cheap as the short reads alternatives (although, if you desire the best of what the world can offer, Merogenomics can set you up with that no problem).
In addition, in the above cited paper, the authors employed an additional very effective technique developed by the BioNano company, where a genome-wide map is produced by fluorescently labeling specific enzyme cuts along the entire genome. It sounds simple, but this type of technology has important implications for future genomic sequencing. Don’t take my word for it: it was voted as one of the top 5 inventions of 2014! In this way, the authors were able to combine their PacBio sequence reads with the BioNano genome map for superior genome assembly results. If you have already sequenced your genome with short reads technology, you can still employ BioNano technology to enhance the structural understanding of your already sequenced genome. Probably no one in the world has done that for private personal use, but again, if that was your desire, Merogenomics could connect you with that as well. Where there is a will, there is a way!
The other problem is that PacBio technology is not yet refined enough and has a high error rate, especially for single base pair insertions/deletions (indels). However, these are stochastic errors, meaning they pop up randomly. To overcome this problem and ensure extremely high levels of accuracy, the same genome can also be sequenced using short read Illumina technology, so that the two sets of data can be aligned together and the errors produced by PacBio corrected. Therefore, both technologies can be employed to reinforce one another to get high quality data, and this is a dream combination if you want to sequence your genome and get the most out of it.
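Why does it matter that the errors are random? Because random errors rarely hit the same spot twice, so stacking several reads of the same region and taking a vote at each position cancels them out. The sketch below illustrates only that principle. It is not the hybrid PacBio-plus-Illumina correction described above: it assumes the reads are already lined up, uses substitution errors rather than indels to keep the alignment trivial, and the sequences are made up.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote at each position across reads of equal length."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in this column
        for column in zip(*aligned_reads)
    )

truth = "ACGTACGTAC"  # hypothetical true sequence

# Five noisy reads of the same region; each error lands at a different position.
noisy_reads = [
    "ACGTACGTAC",  # no error
    "ACCTACGTAC",  # error at position 2
    "ACGTACGAAC",  # error at position 7
    "TCGTACGTAC",  # error at position 0
    "ACGTAGGTAC",  # error at position 5
]

corrected = consensus(noisy_reads)
print("consensus matches truth:", corrected == truth)
```

A systematic error, one that repeats at the same position in every read, would survive this vote. That is why it matters that a second technology with a different error profile can be brought in.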
300 pound gorilla in the room
Many more human genomes have been sequenced and assembled de novo since then, but only a few months after the first long read human genome publication, PacBio was already boasting of new sequencing technology and presented it in a unique form that grabbed some media attention: a new updated gorilla genome, to show off its enhanced capabilities. The average read length was already an impressive 13,000 base pairs, such a massive improvement that the BioNano genome map was not even used (although short reads technology was called into action to enhance the overall sequence quality). This has seriously improved the gorilla reference genome, which was previously produced with short reads only and assembled using the human genome reference, leading to potential misassemblies and 400,000 gaps in the genome.
I chose to focus on these older examples of technology because their back-to-back publication allowed for the comparison of gorilla and human genomes. This revealed 117,512 insertions and deletions and 697 inversion differences between the two species, with 72% of these differences being specific to the gorilla lineage. This shows on the genomic scale how evolution has shaped the structural architecture of chromosomes, beyond the obvious specific point mutations. It also points once again to the need to study such architecture in great detail in humans on a wider scale. Delving deeper, 2,151 of the indels observed to affect regulatory elements (for gene expression, or in essence, regulating how genes are used) were fixed structural differences between humans and gorillas. There were only 15 indels observed in genes that could not be tolerated in the human genome (due to the biological consequences). The final verdict: there is only 1.6% divergence between human and gorilla genomes!
It doesn’t seem like much difference between a gorilla and a human, and yet these differences were enough to be worlds apart. Imagine how little it can take to impact human health! Since the gorilla genome, even further strides have been made in this technology to make it cheaper and more powerful, so if you are super rich and want the most informative high-quality genome sequence of your own, let Merogenomics know, and we will set you up with that! It only takes a small army of scientists! ;)
This article has been produced by Merogenomics Inc. and edited by Kerri Bryant. Reproduction and reuse of any portion of this content requires Merogenomics Inc. permission and source acknowledgment. It is your responsibility to obtain additional permissions from the third party owners that might be cited by Merogenomics Inc. Merogenomics Inc. disclaims any responsibility for any use you make of content owned by third parties without their permission.