The feat made headlines around the world: “Scientists Say Human Genome is Complete,” the New York Times announced in 2003. “The Human Genome,” the journals Science and Nature said in identical ta-dah cover lines unveiling the historic achievement.

There was one little problem.

“As a matter of truth in advertising, the ‘finished’ sequence isn’t finished,” said Eric Lander, who led the lab at the Whitehead Institute that deciphered more of the genome for the government-funded Human Genome Project than any other. “I always say ‘finished’ is a term of art.”


“It’s very fair to say the human genome was never fully sequenced,” Craig Venter, another genomics luminary, told STAT.

“The human genome has not been completely sequenced and neither has any other mammalian genome as far as I’m aware,” said Harvard Medical School bioengineer George Church, who made key early advances in sequencing technology.

What insiders know, however, is not well understood by the rest of us, who take for granted that each A, T, C, and G that makes up the DNA of all 23 pairs of human chromosomes has been completely worked out. When scientists finished the first draft of the human genome, in 2001, and again when they had the final version in 2003, no one lied, exactly. FAQs from the National Institutes of Health refer to the sequence’s “essential completion,” and to the question, “Is the human genome completely sequenced?” they answer, “Yes,” with the caveat that it’s “as complete as it can be” given available technology.

Perhaps nobody paid much attention because the missing sequences didn’t seem to matter. But now it appears they may play a role in conditions such as cancer and autism.

“A lot of people in the 1980s and 1990s [when the Human Genome Project was getting started] thought of these regions as nonfunctional,” said Karen Miga, a molecular biologist at the University of California, Santa Cruz. “But that’s no longer the case.” Some of them, called satellite regions, misbehave in some forms of cancer, she said, “so something is going on in these regions that’s important.”

Miga regards them as the explorer Livingstone did Africa — terra incognita whose inaccessibility seems like a personal affront. Sequencing the unsequenced, she said, “is the last frontier for human genetics and genomics.”

Church, too, has been making that point, mentioning it at both the May meeting of an effort to synthesize genomes and last weekend’s meeting of the International Society for Stem Cell Research. Most of the unsequenced regions, he said, “have some connection to aging and aneuploidy” (an abnormal number of chromosomes, such as occurs in Down syndrome). Church estimates 4 percent to 9 percent of the human genome hasn’t been sequenced. Miga thinks it’s 8 percent.

The reason for these gaps is that DNA sequencing machines don’t read genomes like humans read books, from the first word to the last. Instead, they first randomly chop up copies of the 23 pairs of chromosomes, which total some 3 billion “letters,” so the machines aren’t overwhelmed. The resulting chunks contain from 1,000 letters (during the Human Genome Project) to a few hundred (in today’s more advanced sequencing machines). The chunks overlap. Computers match up the overlaps, assembling the chunks into the correct sequence.
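To make the chop-and-reassemble idea concrete, here is a minimal Python sketch of overlap-based assembly. It is a toy illustration, not the software the genome project used: the 20-letter "genome," the 8-letter reads, and the function names `overlap` and `greedy_assemble` are all assumptions made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of chunks with the longest overlap.

    Returns the list of assembled pieces; more than one piece means
    the chunks left a gap that could not be bridged.
    """
    pieces = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop chunks that are wholly contained in another chunk.
    pieces = [r for r in pieces if not any(r != s and r in s for s in pieces)]
    while len(pieces) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: a gap in the assembly
        merged = pieces[best_i] + pieces[best_j][best_n:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    return pieces


# Hypothetical 20-letter "genome" cut into 8-letter chunks that overlap by 4.
TOY_GENOME = "ATGGCGTACCTTAGGCATCA"
TOY_READS = [TOY_GENOME[i:i + 8] for i in range(0, 13, 4)]

if __name__ == "__main__":
    print(greedy_assemble(TOY_READS))
```

Because this toy sequence has no repeated stretches, every overlap is unambiguous and the chunks snap back together into the original string. Real assemblers also have to cope with sequencing errors, both DNA strands, and vastly more data.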



That’s between difficult and impossible to do if the chunks contain lots of repetitive segments, such as TTAATATTAATATTAATA, or TTAATA three times. “The problem is, when you have the same exact words, it’s hard to assemble,” said Lander, much as when jigsaw puzzle pieces all show the same exact blue sky.
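A small Python sketch shows why such repeats defeat short chunks. It builds two toy sequences that differ only in how many times TTAATA repeats (three versus five) and checks that they produce exactly the same set of distinct 10-letter chunks. The flanking sequences and the helper name `read_set` are invented for illustration, and the comparison ignores how many times each chunk turns up.

```python
def read_set(genome: str, read_len: int) -> set[str]:
    """All distinct chunks of length `read_len` that occur in `genome`."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


# Hypothetical unique sequence on either side of the repeat.
LEFT_FLANK = "GCCGATCGGA"
RIGHT_FLANK = "CAGTCGGCAT"
UNIT = "TTAATA"  # the repeated segment used as the example above

three_copies = LEFT_FLANK + UNIT * 3 + RIGHT_FLANK
five_copies = LEFT_FLANK + UNIT * 5 + RIGHT_FLANK

# Two different sequences, yet every distinct 10-letter chunk of one
# also occurs in the other, so the chunks alone cannot tell them apart.
short_reads_identical = read_set(three_copies, 10) == read_set(five_copies, 10)

if __name__ == "__main__":
    print(three_copies != five_copies, short_reads_identical)
```

Both values come out true: the sequences differ, but an assembler handed only these short chunks has no way to know how many copies of the repeat belong in the middle.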

In 2004, the genome project reported that there were 341 gaps in the sequence. Most of the gaps — 250 — are in the main part of each chromosome, where genes make the proteins that life runs on. These gaps are tiny. Only a few gaps — 33 at last count — lie in or near each chromosome’s centromere (where the two parts of a chromosome connect) and telomeres (the caps at the end of chromosomes), but these 33 are 10 times as long in total as the 250 gaps.

That makes the centromeres in particular the genome’s uncharted Zambezi. Evan Eichler of the University of Washington said every chromosome has such sequence-defying repetitive elements — think of them as DNA stutters — including an infamous one that’s 171 letters long and repeated end-to-end for thousands of letters.

At the beginning of the Human Genome Project, said Lander, now director of the Broad Institute of MIT and Harvard, “it became very clear these highly repetitive sequences would not be tractable with existing technology. It wasn’t a cause of a great deal of agonizing at the time,” since he and other project leaders expected the next generation of scientists to find a solution.

That hasn’t really happened, partly because there hasn’t been much motivation to map these regions. “I’m between agnostic and a little skeptical that these bits will be important for disease, but maybe I’m saying that because we can’t read them,” Lander said.

As new sequencing technology has begun allowing scientists to peek into unsequenced territory, however, they have seen that “these tough-to-sequence regions frequently have important genes,” said Michael Hunkapiller, chairman and CEO of Pacific Biosciences, which makes DNA sequencers. (In 1998, Hunkapiller recruited Venter to his new company, Celera Genomics, to race the government-backed genome project; the race ended in a de facto tie.)

PacBio’s “reason for being” is to increase the length of DNA segments that can be read and assemble them, Hunkapiller said. Longer reads have an effect like enlarging jigsaw puzzle pieces; even though the pieces still contain a lot of repeated blue sky, the greater size makes it more likely they’ll also contain something sufficiently novel to make assembling them easier. PacBio’s maximum DNA read is now about 60,000 letters, Hunkapiller said, and averages 15,000.
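The bigger-puzzle-piece effect can be sketched in a few lines of Python. The toy sequence below (a TTAATA repeat between two invented flanks) and the helper `unique_fraction` are assumptions for illustration only. The function measures what share of chunks of a given length occur exactly once in the sequence, meaning they contain enough novel material to be placed unambiguously.

```python
from collections import Counter


def unique_fraction(genome: str, read_len: int) -> float:
    """Share of chunk positions whose chunk occurs exactly once in `genome`."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] == 1) / len(reads)


# Hypothetical toy sequence: 24 letters of repeat between unique flanks.
TOY_SEQUENCE = "GCCGATCGGA" + "TTAATA" * 4 + "CAGTCGGCAT"

if __name__ == "__main__":
    for length in (6, 12, 30):
        print(length, round(unique_fraction(TOY_SEQUENCE, length), 2))
```

In this toy case, 6-letter chunks that fall inside the repeat look identical to one another, while every 30-letter chunk is longer than the whole repeat and so always carries some unique flanking sequence. Real centromeric repeats are far longer than this example, which is why read length matters so much.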

With such long reads, Lander said, “you could get through a lot of these nasty [unsequenced] regions.”

That’s looking more and more like a worthy undertaking, and not only because the unsequenced regions might contain actual protein-making genes. There is evidence that the non-gene parts, especially the DNA stutters, “clearly have disease implications,” Hunkapiller said. “Three-quarters of the [genome] differences between one person and another are in [such] variants” rather than the single-letter spelling differences in A’s, T’s, C’s, and G’s that get all the attention. In a 2007 paper, Venter (now the chairman of Human Longevity Inc.) and his team showed that there are more person-to-person differences like this, called structural variants, than there are single-letter changes.

Yet about 90 percent of the structural variants, the vast majority of which weren’t sequenced by either the genome project or a later effort called the 1000 Genomes Project, “have been missed,” Eichler and his colleagues reported last year.

One reason the stutters are unusually influential is that this repetitive DNA can move around, make copies of itself, flip its orientation, and do other acrobatics that “can have quite dramatic functional effects,” Hunkapiller said. For one thing, repetitive elements around the centromeres, called satellites, might cause a dividing cell to become cancerous, Miga said, because they can destabilize the entire genome.

When researchers at Stanford University tried to find the genetic cause of a young man’s mysterious disease, which caused non-cancerous tumors to grow throughout his body, they found nothing using the standard whole-genome sequencing, Hunkapiller said. But the “long reads” made possible by the PacBio machines “looked for structural variants and found the problem right away,” he said.

The stutters might even be what makes us human. Some of these complex duplications “appear to be important for the evolution of higher neuroadaptive function” — aka brain development, Eichler said. A gene called ARHGAP11B, which was created by one such duplication, causes the cortex to develop the myriad folds that support complex thought; SRGAP2C, also a duplication, triggers brain development.

“These are new genes that evolved specifically in our lineage over the last few million years,” said Eichler. The same duplications can also produce DNA rearrangements “associated with neurodevelopmental disorders such as autism and intellectual disability.”

“Finish the sequence!” hasn’t become a rallying cry, but maybe it should be, Venter said: “I’d be the last one to give you a quote saying that we don’t need to bother with these [unsequenced] regions.”


  • It would seem that a multitude of frequent checksums must be present within invariant sections meant to stay true without apoptosis across countless generations, while dynamism is required in the raw code that defies pattern matching. We are sequencing the stable units of the genome while missing the program code running on its OS.

    • Great analogy Hugh!
      While complex genetic OS ACTIVITY caused the duplications behind the ARHGAP11B and SRGAP2C genes, enabling complex thought and brain development, a disk clean-up procedure could also enable neurodevelopmental disorders and intellectual disabilities. That could potentially happen when a CPU is violently shaken or dropped, correct?
      So the flip-flopping, bouncing and stuttering makes us human. All with not only internal but external contributors that channel the stuttering sequences, multiplexing all thought and disability or ability in unthinkably complex action.
      Dr. Venter’s rallying cry “Finish the sequence” should be accompanied by a Chant & Smile “Find the real Human in us” AMAZING.

  • How about epigenetics? Genes explain only a small fraction of health outcomes. For example, if we consider something like schizophrenia, genes appear to explain only about 0.001 percent of outcomes – check out the following article:
    Also check out the following new review:
    Buric, I., et al. (2017) What Is the Molecular Signature of Mind–Body Interventions? A Systematic Review of Gene Expression Changes Induced by Meditation and Related Practices. Front. Immunol. 8:670.

  • It’s a long time since I did my biochemistry degree and life keeps getting more complicated than we thought possible.

  • I had already been pretty sure that the whole genome can’t be sequenced because of the highly repetitive sequence called heterochromatin, but I think there is one solution that may overcome this problem or at least reduce the error rate during sequencing: we may use a restriction enzyme plus a fluorescent dye. For example, take the TTAGGG repeats in telomeres. If we use a restriction enzyme against those repetitive segments, along with a catalyst that end-labels each repeat segment or makes it fluoresce, we may then overlap these DNA repeats and reduce the error rate.

  • It’s not just DNA at the regions noted in the story. It’s where and when the sample is taken. Cellular replication causes an average of 3 mistakes each time. Some cells replicate 1,000 to 10,000,000 times over a lifetime. We need to establish baseline variance in the genome, and this can only be done with oocytes. They are the least perturbed of all human cells. Basing the human genome on samples from older individuals complicates matters. Variance between individuals may be a few million to tens of millions of base pairs out of 3 billion. Sequencing errors and replication mistakes will forever cloud the picture unless removed from the process.

    • Defining “normal” is challenging when, as Brian Conner points out, we have an average of 3 mutations every time a cell divides. I personally have 3.5 million variants that differ from the reference sequence. We all do. Normal encompasses a lot of variation.
