DNA Racism: 300 Million Letters of DNA Are Missing from the Human Genome

Nov 29, 2018 – In March 1997, a man answered an ad in The Buffalo News. He agreed to give 50 milliliters of blood. He did not leave a name, as the researchers who took his blood wanted to keep things anonymous. They called him RP11.

The ad RP11 answered turned out to be for the Human Genome Project. And it happened that, as the race to complete the first human genome heated up, it was RP11’s sample that was ready to be sequenced first. Ultimately, 70 percent of the Human Genome Project’s DNA came from RP11. The rest came from about 50 other volunteers.

Since then, that first human genome has been continually refined. It’s become a “reference genome,” the standard against which practically every human whose DNA has been sequenced is compared. But most of it, still, comes from RP11. Every person’s genetic code is unique, so using just one reference genome—most of it from one person—to stand in for all of humanity has introduced subtle biases into genetics research.


A new study of DNA from people of African descent shows just how much the reference genome is missing: Across the 910 people in their study, scientists found 300 million letters of DNA that are not in the reference genome. Some of these newfound segments of DNA could represent new genes that were previously overlooked.

As impressive as 23andMe’s genetic database is, it still has noticeable gaps—especially among Africans, Middle Easterners, Central Asians, Southeast Asians, and indigenous Americans. Almost any group that is not European, basically. In the scientific literature, in fact, nearly 80 percent of the people who have participated in studies of associations between genes and diseases are of European descent.

Africans collectively are far more genetically diverse than people in other parts of the world, so their DNA is especially likely to differ from the reference genome. Currently, DNA sequencers work by chopping up a genome into segments that are individually “read” and then assembled like a jigsaw puzzle. Algorithms use the reference genome to tell them where to put each segment. If a new segment pops up—one that doesn’t match anything in the reference genome—the algorithms don’t know what to do. Usually, scientists ignore it.
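
The jigsaw-puzzle step described above can be illustrated with a toy sketch. It is not the software used in the study: real aligners tolerate mismatches and rely on indexed search, while this version assumes exact matching only. The function name `split_reads` and the short sequences are hypothetical and chosen only for illustration.

```python
def split_reads(reference: str, reads: list[str]) -> tuple[dict[str, int], list[str]]:
    """Place each read on the reference by exact match (a simplifying assumption).

    Returns the reads that were placed, with their positions, and the reads
    that match nothing in the reference.
    """
    placed: dict[str, int] = {}
    unplaced: list[str] = []
    for read in reads:
        position = reference.find(read)  # -1 when the read is absent
        if position >= 0:
            placed[read] = position
        else:
            unplaced.append(read)  # the pieces that usually get ignored
    return placed, unplaced


# Hypothetical toy data: a short "reference genome" and three sequenced segments.
reference = "ACGTACGTTAGCCGATAGGCTA"
reads = ["ACGTTAGC", "GATAGGCT", "TTTTGGGGCC"]

placed, unplaced = split_reads(reference, reads)
print("placed:", placed)
print("unplaced:", unplaced)
```

In this sketch the third segment lands in `unplaced` because nothing in the reference matches it. Segments of that kind are what a reference-guided analysis typically sets aside.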

“We wanted to look at what’s in the pieces we’re ignoring,” says Rachel Sherman, a computational biologist at Johns Hopkins University who is the lead author of the study. Sherman’s adviser, Steven Salzberg, collaborates with researchers studying the genetics of asthma in people of African descent. That study includes people living in West Africa, North America, South America, and the Caribbean. “It occurred to me we had this very genetically diverse group,” says Salzberg. In other words, their research presented an opportunity to study the diversity missing from the reference genome.

23andMe has recently launched a number of initiatives to add underrepresented groups to its database. The latest is the Populations Collaborations Program, which allows U.S.-based scientists already studying underrepresented groups to apply for free spit kits and DNA analysis. In return, 23andMe gets to add the DNA to its database. The company has had one-off collaborations with researchers in Sierra Leone, the Democratic Republic of Congo, and Tanzania, but the new program sets up a formal application process.

Smaller studies had found enough novel DNA sequences to estimate that as many as 40 million letters can appear in the human genome but are not in the current reference genome. Sherman and Salzberg found nearly 300 million missing letters in 125,715 separate DNA segments, much more than they expected. “I thought I must have done something wrong,” says Sherman. But when she went back and combed through all her code, she couldn’t find any errors.
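
For a sense of scale, the figures quoted above can be put through some rough arithmetic. The numbers are rounded ("nearly 300 million"), so the results are approximations, and the variable names are only illustrative.

```python
# Rounded figures quoted in the article; treat the results as rough estimates.
missing_letters = 300_000_000   # "nearly 300 million" letters
segments = 125_715              # separate DNA segments
earlier_estimate = 40_000_000   # upper estimate from smaller studies

average_segment_length = missing_letters / segments
ratio_to_earlier_estimate = missing_letters / earlier_estimate

print(f"average segment length: about {average_segment_length:,.0f} letters")
print(f"ratio to earlier estimate: about {ratio_to_earlier_estimate:.1f}x")
```

On these rounded figures, the average missing segment runs to a couple of thousand letters, and the total is several times the earlier estimate.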

Now the question is what those 300 million previously overlooked letters of DNA contain. Perhaps they code for novel, interesting mutations related to disease that studies using the reference genome overlooked. If you’re only comparing against the reference genome, “you’re never going to find that missing piece,” says Tina Graves-Lindsay, a geneticist at Washington University in St. Louis.

This is especially true in people of African descent, whose high genetic diversity makes their DNA more likely to differ from the reference genome. RP11, scientists later surmised, was probably African American himself, but the problem of using one reference genome to represent the whole human population still holds.

Earlier this year, 23andMe also announced the Global Genetics Project to give free tests to people who can trace all four grandparents to one of 61 underrepresented countries. That, in turn, is an expansion of the African Genetics Project, which did the same for certain African countries, chosen because they are regions from which slaves and recent immigrants to the United States came.

Graves-Lindsay says she wasn’t surprised by the number of new DNA sequences Sherman found, as she and her collaborators are also working to address the gaps in the reference genome. They’re assembling reference genomes from a diverse group of people so that scientists can compare against more than just the single current version. The team has sequenced 15 samples so far, including five from Africa. “It was a very good paper for us to read about,” Graves-Lindsay says of Sherman’s study. “It proves what we’re doing is really needed.” But her work is still constrained by what exists in DNA repositories. For example, she doesn’t have a sample to sequence from people indigenous to Australia.

De Vries says the key question is what’s in it for the people giving up their DNA. 23andMe does plan to evaluate applications based on how people sampled can benefit from the research. For example, Mountain says, could the research ultimately reveal insights about a group’s history or genetic predispositions to disease?

Genetic discrimination occurs when people are treated differently by their employer or insurance company because they have a gene mutation that causes or increases the risk of an inherited disorder. Fear of discrimination is a common concern among people considering genetic testing.

The Genetic Information Nondiscrimination Act of 2008, also referred to as GINA, is a federal law that protects Americans from being treated unfairly because of differences in their DNA that may affect their health. The law prohibits discrimination by health insurers and employers.

Mary Greeley News

credit: https://www.theatlantic.com/science/archive/2018/11/human-genome-300-million-missing-letters-dna/576481/