The geneticists had high hopes of identifying the mutation that had caused the little boy’s abnormalities: a flattened face, cognitive delays, cleft palate, stubby thumbs, and a host of other skeletal malformations. They were pretty sure he had a rare disease called Baratela Scott syndrome, which had been identified only in 2012. But because its symptoms overlap with those of other hereditary disorders, they couldn’t be sure.
At the University of Washington, geneticists looked for a gene mutation that had been found in other children with the disorder. Following the usual procedure, they checked the spelling of the boy’s DNA sequences against the closest thing genomics has to a dictionary — the “reference” human genome, the grand product of the decade-long, $2.7 billion Human Genome Project.
To the scientists’ puzzlement, however, the boy’s sequence showed no sign of the mutation in the gene known to cause Baratela Scott, called XYLT1. Nor did the DNA of the next boy with the disorder, or the next. As they tried to compare the boys’ DNA sequences to the reference genome, it was like trying to check a spelling in a Webster’s from which a prankster had torn handfuls of pages. Many pieces of the boys’ genomes, called short reads, “weren’t in the reference genome at all,” said Katia Sol-Church of the University of Virginia School of Medicine, with whom the UW geneticists were collaborating. There was no way to check them for disease-causing misspellings.
The human reference genome, largely completed in 2001, has achieved near-mythic status. It is “the book of life,” the “operating manual for Homo sapiens.” But the reference genome falls short in ways that have become embarrassing, misleading, and, in the worst cases, emblematic of the white European dominance of science — shortcomings that are threatening the dream of genetically based personalized medicine.
“There are so many uses of the reference genome, and for every single one it has problems,” said computational biologist Jesse Gillis of Cold Spring Harbor Laboratory, one of many scientists arguing that it’s long past time to fix those problems. They’re not just hampering the diagnosis of rare disorders, but also leading prenatal genetic tests to miss mutations in non-Europeans, and causing the newest way to assess the genetic risk of diseases to be worthless in many of the world’s ethnic groups.
The reference genome’s shortcomings are rooted in its history. On March 23, 1997, the then-nascent Human Genome Project placed an ad in a newspaper in Buffalo, N.Y. (site of a project scientist’s lab), seeking volunteers to donate blood from which they would sequence DNA. Through a quirk of whose DNA got processed when, about 70 percent of the reference genome comes from an anonymous man designated RP11, said UW genome scientist Evan Eichler, with the rest from a few score other volunteers. The reference genome is therefore a mashup of the sequences of these everyday people.
As a result, it isn’t a perfectly healthy genome: It has at least 3,556 variants that increase the risk of diseases, including type 1 diabetes and hypertension.
Its most serious shortcoming, however, reflects the fact that 1990s Buffalo was not exactly the United Nations. Its ethnic populations are almost all European — German, Irish, Polish, and others. The reference genome, therefore, is as well. That had long been known, but was largely swept under the rug.
Now, though, scientists are quantifying the disconnect between the reference genome and most of humanity, and the numbers aren’t exactly rounding errors.
In a 2017 study, Eichler and colleagues estimated that a genome sequence from a random individual differs from the reference genome by up to 16 million bases — roughly 0.5 percent of the 3.1 billion pairs of A’s, T’s, C’s, and G’s that make up the genetic code. In some cases, the differences aren’t simply a single-nucleotide change like a C for a T, but long lengths of DNA inserted in odd places, moved from one spot to another, or running backwards — any of which can make it nearly impossible to map a patient’s DNA to the reference genome.
The numbers get even worse the more an individual’s ancestry differs from those in the reference genome.
In a paper published last week, for instance, scientists led by Dr. Pui-Yan Kwok of the University of California, San Francisco, analyzed 154 genomes from 26 ethnic populations, from Han Chinese and Tuscans to Yoruba, Esan, Puerto Ricans, and Peruvians. They found 60 million bases in one or more of these populations that are missing from the reference genome.
“The reference genome was a huge triumph, but when it was done people weren’t thinking that much about population-geographic genetic variation,” said bioinformatics professor Mark Gerstein of Yale University. “One of its problems is that it’s very European-biased, which means that an African has many more differences from the reference than a European does.”
That can keep non-Europeans from benefiting from the genetic revolution. At Duke University, geneticists recently analyzed the DNA of a young African-American woman with intellectual disability and progressive cognitive decline, hoping to identify the genetic cause. They turned up 10 abnormal variants, said medical geneticist Dr. Queenie Tan. With white patients, whose genomes are a closer match to the reference, it’s rare to get more than a couple of hits, and a list that short can guide parents if they wish to undergo prenatal genetic testing before having additional children. “But with 10 candidates, it’s hard to come to any conclusion about whether one gene is more important than the others,” said genetic counselor Heidi Cope of Duke. “With this patient, we were stuck.”
In other cases, the reference genome is missing vast quantities of the DNA found in non-Europeans. Computational biologist Steven Salzberg of Johns Hopkins University and colleagues sequenced the genomes of 910 African Americans and measured how many pieces are present in all of them but are missing from the reference genome. Their count: 296,485,284 base pairs — nearly 10 percent of the human genome — they reported last November. One missing fragment is 100,000 base pairs long, and millions are at least 1,000 long.
Some experts believe Salzberg’s count is too high, but none disagrees with his conclusion. “Eighteen years after finishing the human genome, why are we still relying on just one genome, a mosaic of a few dozen people, to guide thousands of experiments?” he asked. “We can do far better.”
The National Institutes of Health is placing a multimillion-dollar bet that he’s right. At a 2018 meeting convened by its National Human Genome Research Institute, experts concluded that the reference “does not adequately represent human [genetic] variation,” and that it needed to be improved by creating a “pan-genome” that has all of that variation stuffed into it. NHGRI is now evaluating proposals to do that, offering up to $6 million per year to produce high-quality sequences of about 350 genomes.
“The number is less important than what populations we should sample,” said NHGRI’s Adam Felsenfeld. The current reference genome “is good for many, many things, but it’s not as good or as complete as it could be.”
The problems start with the standard way of sequencing a genome, including for medical purposes such as finding the genetic cause of a mystery syndrome. Scientists chop it into millions of segments, about 100 base pairs long. They feed these short reads into next-generation sequencing machines, which determine the order of the A’s, T’s, C’s, and G’s. Algorithms then figure out where each short read falls on a chromosome by using the reference genome as a guide.
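The mapping step can be illustrated with a toy sketch. It is not how production aligners work (they use far more sophisticated indexing and scoring), and the sequences, the function names `build_index` and `map_read`, and the parameters `k` and `max_mismatches` are all made up for illustration. The sketch looks up the first few bases of each read in an index of the reference, then checks whether the rest of the read matches at that spot. A read drawn from DNA that the reference lacks simply has nowhere to land.

```python
from typing import Dict, List, Optional


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Record every position at which each k-base "seed" starts in the reference."""
    index: Dict[str, List[int]] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 1) -> Optional[int]:
    """Return the first reference position where the read fits, or None.

    Only the first k bases of the read are used as the seed, so a mismatch
    inside the seed also leaves the read unmapped (a toy simplification).
    """
    if len(read) < k:
        return None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            return pos
    return None


# Hypothetical 25-base "reference" and three hypothetical short reads.
REFERENCE = "ACGTTGCAAGGCTTAACCGGTTAGC"
K = 4
INDEX = build_index(REFERENCE, K)

reads = {
    "exact_match": "GCAAGGCTTA",   # identical to reference positions 5-14
    "one_variant": "GCAAGGCTGA",   # same place, one base changed
    "novel_repeat": "CGGCGGCGGC",  # sequence absent from the reference
}

for name, read in reads.items():
    print(name, map_read(read, REFERENCE, INDEX, K))
```

In this toy case the first two reads map to position 5, so the single changed base in the second can be spotted as a variant. The third read comes back as `None`: with no matching stretch of reference, there is nothing to compare it against.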
When the reference is missing a page, like that vandalized dictionary, scientists are stuck. That’s what happened to the Baratela Scott scientists. Only after time-consuming detours to the mouse genome and to alternative DNA sequencing that bypassed the reference genome did they finally find the abnormality — in a region upstream of XYLT1 — and confirm the boys had Baratela Scott.
“The problem was, this 238-base-pair region isn’t in the reference,” said Dr. Heather Mefford, the UW pediatrician and geneticist who led the sequencing analysis. The abnormality was a nucleotide stutter, with CGG repeated hundreds of times in a segment of DNA that activates XYLT1.
Because the region with the stutter is missing from the reference genome, if labs less advanced than UW’s analyze DNA from patients thought to have Baratela Scott, they will almost certainly miss it. The syndrome, a recessive disorder, has no cure, so that wouldn’t affect patient care. “But if you want to test for it prenatally,” said Mefford, perhaps when prospective parents know or suspect it runs in their family, “it wouldn’t be found.”
Mefford’s lab is grappling with a similar medical mystery. In a large fraction of patients with a form of epilepsy that she strongly suspects is genetic, she’s been unable to find any glitches in their DNA when she compares their short reads to the reference genome. “One of our nagging questions,” she said, “is, are the relevant regions missing from the reference genome?”
Experts aren’t sure why DNA sequencing identifies the genetic cause of a child’s mystery disease only about 40 percent of the time, said Felsenfeld, “but failure to align a patient’s short reads on the reference genome might be one reason.”
That’s especially likely to happen if the patient belongs to an ethnic group that is poorly represented in the reference. The reference has none of the thousands of variants that are specific to people from the Philippine island of Panay, for instance. That caused problems when scientists analyzed the genomes of 403 people from Panay with a rare neurodegenerative disease called X-linked dystonia-parkinsonism, looking for its precise genetic cause.
It turned out that “the causal mutation is in a stretch of DNA that exists only in the Panay population and isn’t in the reference genome,” said neuro-genomics expert Michael Talkowski of Massachusetts General Hospital, who led a 2018 study that, like the Baratela Scott team, eventually used an alternative approach to identify the cause of this parkinsonism. It turned out to be DNA that jumped into a gene called TAF1. That made the gene as meaningless as inserting letters into an English woPHOrd.
Several labs have tried to remedy the ethnic bias of the reference genome by producing Chinese, Korean, and Ashkenazi reference genomes. The problem is, people are often mistaken about their ancestry, so geneticists would get nowhere by trying to compare someone’s genome sequence to the wrong reference. Having a single reference genome is the only way to avoid that.
How to make one that best represents human diversity is a hot topic among computational biologists, with ideas such as “graph genomes” and “pan-genomes” competing for backing like presidential candidates for 2020 and promising to improve the solve rate of mystery diseases.
There’s no disagreement that, without a more representative reference genome, genetic medicine will never reach some ethnic groups, warns genome scientist Alicia Martin of Mass. General. Medical genetics is moving away from assessing disease risk from one or two genes and toward calculating a “polygenic risk score” based on hundreds.
But with the European bias of the reference genome and other tools, polygenic risk scores for people who trace their ancestry to Africa, in particular, are often only “marginally better, if at all,” than flipping a coin, Martin and her colleagues argue in a paper posted on the preprint site bioRxiv. “They are therefore least likely to benefit” from DNA-based medicine — at least until genome scientists move beyond Buffalo.