An international team of scientists says it has sequenced and assembled the entirety of the human genome, including parts that were missed in the sequencing of the first human genome two decades ago.
The claim, if confirmed, surpasses the achievement laid out by leaders from the Human Genome Project and Celera Genomics on the White House lawn in 2000, when they announced the sequencing of the first draft human genome. That historic draft, and subsequent human DNA sequences, have all missed about 8% of the genome.
The sequencing of the new genome fills in these gaps using new technology. It has different limitations, however, including the type of cell line that the researchers used in order to speed up their effort.
The work was detailed May 27 in a pre-print, meaning it has not yet been peer-reviewed.
“You’re just trying to dig into this final unknown of the human genome,” said Karen Miga, a researcher at the University of California, Santa Cruz, who co-led the international consortium that created the sequence. “It’s just never been done before and the reason it hasn’t been done before is because it’s hard.”
Miga emphasized that she won’t consider the announcement official until the paper is peer-reviewed and published in a medical journal.
The new genome is a leap forward, researchers say, that was made possible by new DNA sequencing technologies developed by two private-sector companies: Pacific Biosciences of Menlo Park, Calif., also known as PacBio, and Oxford Nanopore, of Oxford Science Park, U.K. Their technologies for reading out DNA have very specific advantages over the tools that have long been considered researchers’ gold standards.
Ewan Birney, the deputy director general of the European Molecular Biology Laboratory, called the result “a technical tour de force.” The original genome papers were carefully worded because they did not sequence every DNA molecule from one end to the other, he noted. “What this group has done is show that they can do it end-to-end.” That’s important for future research, he said, because it shows what is possible.
George Church, a Harvard biologist and sequencing pioneer, called the work “very important.” He said he likes to note in his talks that up until now no one has sequenced the entire genome of a vertebrate — something that is no longer true, if the new work is confirmed.
One unanswered question: How important are these missing pieces of the human puzzle? The consortium said that it increased the number of DNA bases from 2.92 billion to 3.05 billion, a 4.5% increase. But the count of protein-coding genes increased by just 0.4%, to 19,969. That doesn’t mean, researchers emphasized, that the work couldn’t also lead to other new insights, including those related to how genes are regulated.
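The base-count figures can be checked with a few lines of arithmetic. The sketch below uses only the two totals quoted by the consortium; the function and variable names are illustrative and do not come from the preprint.

```python
def percent_increase(old: float, new: float) -> float:
    """Return the percentage change going from old to new."""
    return (new - old) / old * 100.0


bases_before = 2.92e9  # base count quoted for the earlier reference
bases_after = 3.05e9   # base count quoted for the new assembly

added_bases = bases_after - bases_before
growth = percent_increase(bases_before, bases_after)

print(f"Added bases: {added_bases / 1e6:.0f} million")  # Added bases: 130 million
print(f"Increase: {growth:.1f}%")                        # Increase: 4.5%
```

The added sequence works out to roughly 130 million bases, which is how a 4.5% gain in bases can sit alongside a much smaller 0.4% gain in protein-coding genes.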
The DNA sequence used was not from a person, but from a hydatidiform mole, a growth in a woman’s uterus caused when sperm fertilized an egg that did not have a nucleus. This meant that it contained two copies of the same 23 chromosomes, instead of two differing sets of chromosomes, as normal human cells do.
The researchers chose these cells, which had been kept in a lab, because this made the computational effort of creating the DNA sequence simpler. The original draft genome created in 2003 also contained only 23 chromosomes, but as technologies for DNA sequencing have become cheaper and simpler, researchers have tended to sequence all 46 chromosomes.
Elaine Mardis, co-executive director of the Institute for Genomic Medicine at Nationwide Children’s Hospital, worried that because these cell lines were kept in the lab, potentially mutating, the new genetic information “may be largely the detritus that accumulates as a cell line is propagated over many years in culture.”
Miga said that studies of the cell line had shown it to be similar to human cells, and that the researchers used cells that had been kept frozen, not propagated for many years. “We went to great lengths in the preprints to demonstrate that these new sequences serve as biological reference for human genomes,” Miga wrote in an email. She agreed the next step was for the group to try to sequence all 46 chromosomes, known as a diploid genome.
Why did it take 20 years for this last 8% of the genome to be sequenced, even as the cost of sequencing the rest of the genome dropped from $300 million to as little as $300? The answer has to do with the way DNA sequencing technologies work.
The current workhorse DNA sequencers, made by Illumina, take little fragments of DNA, decode them, and reassemble the resulting puzzle. This works fine for most of the genome, but not in areas where DNA code is the result of long repeating patterns. If a supercomputer only had small fragments, how could it assemble a DNA sequence that repeated “AGAGAGA” for bases upon bases? That’s what the missing 8% of the genome looked like.
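A toy example makes the problem concrete. The sketch below is an illustration only, not how any real assembler works: the two "genomes" and their flanking sequences are made up, and the read lengths are far shorter than real ones. It shows that two sequences differing only in the number of "AG" repeats produce exactly the same set of short fragments, while longer fragments tell them apart.

```python
def reads(sequence: str, length: int) -> set:
    """Return the set of distinct fragments of the given length."""
    return {sequence[i:i + length] for i in range(len(sequence) - length + 1)}


# Hypothetical sequences: identical flanks, different repeat counts.
left_flank, right_flank = "TTCATGCA", "CGGTACCT"
genome_a = left_flank + "AG" * 10 + right_flank
genome_b = left_flank + "AG" * 15 + right_flank

# Short fragments never span the whole repeat, so both genomes
# yield the same pieces and the repeat count cannot be recovered.
print(reads(genome_a, 6) == reads(genome_b, 6))    # True

# Fragments long enough to reach across the repeat differ.
print(reads(genome_a, 34) == reads(genome_b, 34))  # False
```

With only the short fragments, a computer has no way to decide whether the repeat occurred 10 times or 15 times, because the evidence is identical in both cases.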
Among these “unmappable” regions was one of the most recognizable structures in biology. If you’ve ever looked at chromosomes (think back to high school biology), they look like strings that have been knotted together. Those knots are centromeres, bundles of DNA that hold the chromosomes together. They play a key role in cell division. And they are full of repeats.
It was the centromeres, in fact, that drew Miga to want to see these missing regions.
“Why are the regions that are so fundamental to life, so fundamental to how the cell operates, positioned over parts of our genome that are these giant seas of tandem repeats?” she remembers asking as a grad student.
It was that question that led her, in discussion with Adam Phillippy, a researcher at the National Institutes of Health, to propose in 2019 their current initiative, the Telomere 2 Telomere Consortium, named for the telomeres, which are the ends of the chromosome. They signed on Evan Eichler, a University of Washington biologist who had been worried about the missing parts of the genome for years, as a co-author.
The work was possible because the Oxford Nanopore and PacBio technologies do not cut the DNA up into tiny puzzle pieces. The Oxford Nanopore technology runs a DNA molecule through a tiny hole, resulting in a very long sequence. The PacBio tech uses lasers to examine the same sequence of DNA again and again, creating a readout that can be highly accurate. Both are more expensive than the existing Illumina technology.
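The idea that reading the same stretch of DNA repeatedly improves accuracy can be sketched with a simple majority vote. This is a simplified illustration, not PacBio's actual method: it assumes every pass has the same length and that errors are single-letter substitutions, whereas real reads also contain insertions and deletions. The sequence and the noisy passes are invented for the example.

```python
from collections import Counter


def consensus(passes: list) -> str:
    """Take a majority vote at each position across equal-length passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("all passes must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


truth = "ACGTTAGC"  # hypothetical true sequence
noisy_passes = [
    "ACGTTAGC",  # clean pass
    "ACCTTAGC",  # error at position 2
    "ACGTTAGA",  # error at position 7
    "TCGTTAGC",  # error at position 0
    "ACGTAAGC",  # error at position 4
]

print(consensus(noisy_passes) == truth)  # True
```

Four of the five passes contain an error, but because the errors fall in different places, each position is still read correctly by most passes and the vote recovers the original sequence.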
The companies are in a heated race. For this project, the researchers say, the PacBio technology’s accuracy proved invaluable, and they used Oxford Nanopore to finish up some areas. But Oxford Nanopore has already been promising new, more usable tech. “In the here and now, PacBio has the advantage but it’s not clear how long they’ll be able to keep it,” said Michael Schatz, an associate professor at Johns Hopkins University.
All the researchers spoke of a vision of the future where instead of using a single reference genome, they would assemble hundreds of different, complete genomes that are interlinked and ethnically diverse, and can be used as references. Miga is helping lead that work, as well. And this is just a step in that direction.
But until now, Schatz says, there have always been questions about what was missing. “Now finally we have the right data,” he said. “We have the right technology.”
Correction: A previous version of this story incorrectly described the chromosomes of a hydatidiform mole.