
The geneticists had high hopes of identifying the mutation that had caused the little boy’s abnormalities: a flattened face, cognitive delays, cleft palate, stubby thumbs, and a host of other skeletal malformations. They were pretty sure he had a rare disease called Baratela Scott syndrome, which had been identified only in 2012. But because its symptoms overlap with those of other hereditary disorders, they couldn’t be sure.

At the University of Washington, geneticists looked for a gene mutation that had been found in other children with the disorder. Following the usual procedure, they checked the spelling of the boy’s DNA sequences against the closest thing genomics has to a dictionary — the “reference” human genome, the grand product of the decade-long, $2.7 billion Human Genome Project.


To the scientists’ puzzlement, however, the boy’s sequence showed no sign of the mutation in the gene known to cause Baratela Scott, called XYLT1. Nor did the DNA of the next boy with the disorder, or the next. As they tried to compare the boys’ DNA sequences to the reference genome, it was like trying to check a spelling in a Webster’s from which a prankster had torn handfuls of pages. Many pieces of the boys’ genomes, called short reads, “weren’t in the reference genome at all,” said Katia Sol-Church of the University of Virginia School of Medicine, with whom the UW geneticists were collaborating. There was no way to check them for disease-causing misspellings.

The human reference genome, largely completed in 2001, has achieved near-mythic status. It is “the book of life,” the “operating manual for Homo sapiens.” But the reference genome falls short in ways that have become embarrassing, misleading, and, in the worst cases, emblematic of the white European dominance of science — shortcomings that are threatening the dream of genetically based personalized medicine.

“There are so many uses of the reference genome, and for every single one it has problems,” said computational biologist Jesse Gillis of Cold Spring Harbor Laboratory, one of many scientists arguing that it’s long past time to fix those problems. They’re not just hampering the diagnosis of rare disorders, but also leading prenatal genetic tests to miss mutations in non-Europeans, and causing the newest way to assess the genetic risk of diseases to be worthless in many of the world’s ethnic groups.


The reference genome’s shortcomings are rooted in its history. On March 23, 1997, the then-nascent Human Genome Project placed an ad in a newspaper in Buffalo, N.Y. (site of a project scientist’s lab), seeking volunteers to donate blood from which they would sequence DNA. Through a quirk of whose DNA got processed when, about 70 percent of the reference genome comes from an anonymous man designated RP11, said UW genome scientist Evan Eichler, with the rest from a few score other volunteers. The reference genome is therefore a mashup of the sequences of these everyday people.

As a result, it isn’t a perfectly healthy genome: It has at least 3,556 variants that increase the risk of diseases, including type 1 diabetes and hypertension.

Its most serious shortcoming, however, reflects the fact that 1990s Buffalo was not exactly the United Nations. Its ethnic populations are almost all European — German, Irish, Polish, and others. The reference genome, therefore, is as well. That had long been known, but was largely swept under the rug.

Now, though, scientists are quantifying the disconnect between the reference genome and most of humanity, and the numbers aren’t exactly rounding errors.

In a 2017 study, Eichler and colleagues estimated that a genome sequence from a random individual differs from the reference genome by up to 16 million bases — roughly 0.5 percent of the 3.1 billion pairs of A’s, T’s, C’s, and G’s that make up the genetic code. In some cases, the differences aren’t simply a single-nucleotide change like a C for a T, but long lengths of DNA inserted in odd places, moved from one spot to another, or running backwards — any of which can make it nearly impossible to map a patient’s DNA to the reference genome.

The numbers get even worse the more an individual’s ancestry differs from those in the reference genome.

In a paper published last week, for instance, scientists led by Dr. Pui-Yan Kwok of the University of California, San Francisco, analyzed 154 genomes from 26 ethnic populations, from Han Chinese and Tuscans to Yoruba, Esan, Puerto Ricans, and Peruvians. They found 60 million bases in one or more of these populations that are missing from the reference genome.

“The reference genome was a huge triumph, but when it was done people weren’t thinking that much about population-geographic genetic variation,” said bioinformatics professor Mark Gerstein of Yale University. “One of its problems is that it’s very European-biased, which means that an African has many more differences from the reference than a European does.”

That can keep non-Europeans from benefiting from the genetic revolution. At Duke University, geneticists recently analyzed the DNA of a young African-American woman with intellectual disability and progressive cognitive decline, hoping to identify the genetic cause. They turned up 10 abnormal variants, said medical geneticist Dr. Queenie Tan. With white patients, whose genomes are a closer match to the reference, it’s rare to get more than a couple of hits; a short list like that can guide parents if they wish to undergo prenatal genetic testing before having additional children. “But with 10 candidates, it’s hard to come to any conclusion about whether one gene is more important than the others,” said genetic counselor Heidi Cope of Duke. “With this patient, we were stuck.”

In other cases, the reference genome is missing vast quantities of the DNA found in non-Europeans. Computational biologist Steven Salzberg of Johns Hopkins University and colleagues sequenced the genomes of 910 African Americans and measured how many pieces are present in all of them but are missing from the reference genome. Their count: 296,485,284 base pairs — nearly 10 percent of the human genome — they reported last November. One missing fragment is 100,000 base pairs long, and millions are at least 1,000 long.
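The percentages quoted in this article can be checked with a few lines of arithmetic. The sketch below uses only figures stated in the text (a genome of roughly 3.1 billion base pairs, and the counts of 16 million, 60 million, and 296,485,284 bases); the function name is made up for illustration.

```python
# Genome size as given in the article: about 3.1 billion base pairs.
GENOME_SIZE = 3_100_000_000


def percent_of_genome(bases):
    """Express a count of bases as a percentage of the 3.1-billion-base genome."""
    return 100 * bases / GENOME_SIZE


# Counts reported in the article.
figures = {
    "typical individual vs. reference (2017 estimate)": 16_000_000,
    "missing in one or more of 26 populations": 60_000_000,
    "missing in the 910 African-American genomes": 296_485_284,
}

for label, bases in figures.items():
    print(f"{label}: {percent_of_genome(bases):.2f}%")
```

On these numbers, the 16 million bases come to about half a percent of the genome, and the Johns Hopkins count comes to a little under 10 percent, in line with the descriptions in the text.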

Some experts believe Salzberg’s count is too high, but none disagrees with his conclusion. “Eighteen years after finishing the human genome, why are we still relying on just one genome, a mosaic of a few dozen people, to guide thousands of experiments?” he asked. “We can do far better.”

The National Institutes of Health is placing a multimillion-dollar bet that he’s right. At a 2018 meeting convened by its National Human Genome Research Institute, experts concluded that the reference “does not adequately represent human [genetic] variation,” and that it needed to be improved by creating a “pan-genome” that has all of that variation stuffed into it. NHGRI is now evaluating proposals to do that, offering up to $6 million per year to produce high-quality sequences of about 350 genomes.

“The number is less important than what populations we should sample,” said NHGRI’s Adam Felsenfeld. The current reference genome “is good for many, many things, but it’s not as good or as complete as it could be.”

The problems start with the standard way of sequencing a genome, including for medical purposes such as finding the genetic cause of a mystery syndrome. Scientists chop it into millions of segments, about 100 base pairs long. They feed these short reads into next-generation sequencing machines, which determine the order of the A’s, T’s, C’s, and G’s. Algorithms then figure out where each short read falls on a chromosome by using the reference genome as a guide.
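A toy version of that mapping step shows why a missing stretch of reference is such a problem. The sketch below is a deliberate simplification, not how production aligners work: it demands exact matches, uses 8-base reads instead of 100, and runs on made-up sequences. All function names are hypothetical.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Record where every k-base substring (seed) occurs in the reference."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def map_read(read, reference, index, k):
    """Return the first reference position the read matches exactly, or None."""
    for start in index.get(read[:k], []):
        if reference[start:start + len(read)] == read:
            return start
    return None


def map_sample(sample, reference, read_len=8, step=4, k=4):
    """Chop the sample into short reads and try to place each on the reference."""
    index = build_seed_index(reference, k)
    placed, unplaced = [], []
    for offset in range(0, len(sample) - read_len + 1, step):
        read = sample[offset:offset + read_len]
        position = map_read(read, reference, index, k)
        if position is None:
            unplaced.append(read)
        else:
            placed.append((read, position))
    return placed, unplaced


# Toy data (made up): the sample carries a CGG stretch the reference lacks.
reference = "ACGTTGCAAGGCTTAACCGGTATCGATCGGA"
sample = reference[:12] + "CGG" * 4 + reference[12:]

placed, unplaced = map_sample(sample, reference)
print("placed:", placed)
print("unplaced:", unplaced)
```

Reads drawn from the shared sequence land on the reference, while every read that touches the inserted stretch comes back unplaced. Real aligners tolerate mismatches and small gaps, but a read from a region that is simply absent from the reference still has nowhere to go.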

When the reference is missing a page, like that vandalized dictionary, scientists are stuck. That’s what happened to the Baratela Scott scientists. Only after time-consuming detours to the mouse genome and to alternative DNA sequencing that bypassed the reference genome did they finally find the abnormality — in a region upstream of XYLT1 — and confirm the boys had Baratela Scott.

“The problem was, this 238-base-pair region isn’t in the reference,” said Dr. Heather Mefford, the UW pediatrician and geneticist who led the sequencing analysis: The abnormality was a nucleotide stutter, with CGG repeated hundreds of times in a segment of DNA that activates XYLT1.

Because the region with the stutter is missing from the reference genome, if labs less advanced than UW’s analyze DNA from patients thought to have Baratela Scott, they will almost certainly miss it. The syndrome, a recessive disorder, has no cure, so that wouldn’t affect patient care. “But if you want to test for it prenatally,” said Mefford, perhaps when prospective parents know or suspect it runs in their family, “it wouldn’t be found.”

Mefford’s lab is grappling with a similar medical mystery. In a large fraction of patients with a form of epilepsy that she strongly suspects is genetic, she’s been unable to find any glitches in their DNA when she compares their short reads to the reference genome. “One of our nagging questions,” she said, “is, are the relevant regions missing from the reference genome?”

Experts aren’t sure why DNA sequencing identifies the genetic cause of a child’s mystery disease only about 40 percent of the time, said Felsenfeld, “but failure to align a patient’s short reads on the reference genome might be one reason.”

That’s especially likely to happen if the patient belongs to an ethnic group that is poorly represented in the reference. It has none of the thousands of variants that are specific to people from the Philippine island of Panay, for instance. That caused problems when scientists analyzed the genomes of 403 people from Panay with a rare neurodegenerative disease called X-linked dystonia-parkinsonism, looking for its precise genetic cause.

It turned out that “the causal mutation is in a stretch of DNA that exists only in the Panay population and isn’t in the reference genome,” said neuro-genomics expert Michael Talkowski of Massachusetts General Hospital, who led a 2018 study that, like the Baratela Scott team, eventually used an alternative approach to identify the cause of this parkinsonism. It turned out to be DNA that jumped into a gene called TAF1. That made the gene as meaningless as inserting letters into an English woPHOrd.

Several labs have tried to remedy the ethnic bias of the reference genome by producing Chinese, Korean, and Ashkenazi reference genomes. The problem is, people are often mistaken about their ancestry, so geneticists would get nowhere by trying to compare someone’s genome sequence to the wrong reference. Having a single reference genome is the only way to avoid that.

How to make one that best represents human diversity is a hot topic among computational biologists, with ideas such as “graph genomes” and “pan-genomes” competing for backing like presidential candidates for 2020 and promising to improve the solve rate of mystery diseases.

There’s no disagreement that, without a more representative reference genome, genetic medicine will never reach some ethnic groups, warns genome scientist Alicia Martin of Mass. General. Medical genetics is moving away from assessing disease risk from one or two genes and toward calculating a “polygenic risk score” based on hundreds.
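In its simplest form, a polygenic risk score is a weighted sum: for each variant, the number of risk alleles a person carries (0, 1, or 2) is multiplied by that variant's estimated effect size, and the products are added up. The sketch below assumes that basic additive form; the variant names and effect sizes are invented for illustration and do not come from any study.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum effect size times risk-allele count over every variant in the model.

    Variants that were not genotyped for this person are counted as zero,
    a simplification made here for illustration.
    """
    return sum(
        weight * dosages.get(variant, 0)
        for variant, weight in effect_sizes.items()
    )


# Hypothetical effect sizes, as if estimated in one study population.
effect_sizes = {"variant_a": 0.20, "variant_b": -0.10, "variant_c": 0.05}

# Hypothetical person: two copies of variant_a, one of variant_b,
# and no call at all for variant_c.
dosages = {"variant_a": 2, "variant_b": 1}

score = polygenic_risk_score(dosages, effect_sizes)
print(f"score = {score:.2f}")
```

The score can only be as good as its inputs. If the effect sizes were estimated mostly in Europeans, or if a person's relevant variants never get called because they sit in DNA missing from the reference, the sum says little about that person's actual risk.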

But with the European bias of the reference genome and other tools, polygenic risk scores for people who trace their ancestry to Africa, in particular, are often only “marginally better, if at all,” than flipping a coin, Martin and her colleagues argue in a paper posted on the preprint site bioRxiv. “They are therefore least likely to benefit” from DNA-based medicine — at least until genome scientists move beyond Buffalo.

  • This article should be about: “NHGRI is now evaluating proposals . . . offering up to $6 million per year to produce high-quality sequences of about 350 genomes.” Its Woke tone buries the lede.

  • Reiterating the point that the “reference genome” is largely from a (most likely) African American individual, RP-11. If you’re going to pronounce on scientific matters, please do your homework. Perhaps shockingly there are still people who care about, you know, accuracy.

  • Blaming the reference genome for Baratela Scott seems weird. You report that it’s a CGG repeat. Just like any disease that has a GC-rich expansion (Huntington’s, etc), NGS/short reads aren’t going to be able to sequence through it reliably. Even if the reference genome were correct, you cannot reliably sequence expansions if they extend beyond the read length. TL;DR – it’s not *just* reference. Choice of sequencing modality matters, and NGS isn’t going to be reliable for GC-rich expansions like poly-q diseases or Baratela Scott.

  • Where to start… 1) An informative piece. 2) Why is there a standard genome, when there is no standard person, just people arrayed along a movable bell curve depending on where one graphs the axis? 3) Even if one could average it all out, what does that do to the long tails, and won’t mutations change it by the time all the calculations and iterations are done? 4) Is this really the right question for medicine to solve the heavy burden of cancers and chronic diseases? 5) Can we please pay at least half as much attention to the exposome, from the external environment where we live, work, learn, play, and visit, to our diets, our dentistry and our medical devices? 6) Anyone done a polyexposome risk score yet? That said, genetic variants do explain some basics, and are as important as blood types for the proper application of medicine, and for the proper prescreening and selection of medical and dental device materials so they are right for each patient.

  • This is a hugely biased article which presents nothing new. The issues with the reference are well known and ancillary approaches to handle sequence and variant analysis exist. Trio analysis strategies inform inheritance for rare variants; ancestry, if not known a priori in an individual, can be largely discerned through sequence data; and long read data can fill gaps and repetitive regions. Steps are already underway to more fully characterize our references. The bigger issue will be user interfaces and other digital access tools which allow for exploration and use of that reference for translational research and the practice of medicine.

  • Ugh, this is a vile attempt to render the human genome reference sequence “problematic”. No single genome can represent all of humanity. This is not news, nor has the issue gone unaddressed. So much effort has gone into cataloging human genetic diversity. So much effort. The author does not even mention the 1000 Genomes Project, which was completed a decade ago. Also, the claim that having a reference genome is somehow hindering personalized medicine is unsupported.
    A more interesting article would discuss how short-read Illumina data is great for comparing DNA to a reference but not very good at de novo assembly of new reference genomes.
    As DNA sequencing gets cheaper, we will probably combine long- and short-read sequence data to assemble personalized genomes de novo instead of just comparing cheap short reads to a reference.
    There’s an interesting scientific story to be told here, but it’s not the virtue-signaling narrative that the author chose.

  • Seems that this “Reference Human Genome” is not a reference at all, with 70% derived from one subject. Although a huge exercise, this should be done all over again, with a collection of subjects that roughly represents the variety of the human populace. Maybe then there’d be a True Reference.

    • There are already large databases of human genetic variation. The key word though is “database”, as you cannot combine all this information into a single DNA sequence, which is all the reference genome is. Despite what the author suggests, a great deal of effort has gone into sampling diverse human populations.

    • It seems there’s some confusion about what the “reference genome” is good for, and what it’s not good for. Large scale assembly of a big genome, putting chromosomes together more or less correctly with genes and other features more or less in the right places, is a VERY different process than assessing variation across lineages. This whole article is an absurd misunderstanding of how genome science has worked, including with humans.

    • I was going to make the exact same point. The single person closest to the reference genome, RP-11, was almost certainly African-American. I also remember hearing once that an early draft of the reference genome accidentally included the sickle cell allele (which would obviously then have missed calling this variant). I think the article makes an important point (even databases like gnomAD that are trying to deal with genetic diversity still have some populations better represented than others), but let’s try to tell a more nuanced story about the reference genome.

    • This article does not use the word “race”, so your comment is misplaced. If you understand anything about the human diversity revealed by DNA sequencing, you will know that DNA sequences do not divide humans into the “races” that most of us divide people into (black, white, asian, etc). There is far more genetic diversity between regions within Africa than between any particular African region and the rest of the world. Another way to say that is that a black native of South Africa is likely to be as different, at the DNA sequence level, from a black Ethiopian, as he is from a white Swede. Yet we don’t typically think of Ethiopians and South Africans as being of different “races”.
