New AI tool classifies the effects of 71 million ‘missense’ mutations.
Unraveling the root causes of disease is one of the greatest challenges in human genetics. With millions of possible mutations and limited experimental data, it remains largely a mystery which ones could cause disease. This knowledge is vital for faster diagnosis and the development of life-saving treatments.
Today, we’re releasing a catalog of ‘missense’ mutations where researchers can learn more about what effect they may have. Missense variants are genetic mutations that can affect the function of human proteins. In some cases, they can lead to diseases such as cystic fibrosis, sickle cell disease or cancer.
The AlphaMissense catalog was developed using AlphaMissense, our new artificial intelligence model that classifies missense variants. In a paper published in Science, we show that it categorized 89% of all 71 million possible missense variants as either likely pathogenic or likely benign. In contrast, only 0.1% have been confirmed by human experts.
Artificial intelligence tools that can accurately predict the effect of variants have the power to accelerate research in fields from molecular biology to clinical and statistical genetics. Experiments to uncover disease-causing mutations are expensive and laborious – each protein is unique and each experiment must be designed individually, which can take months. Using AI predictions, researchers can preview results for thousands of proteins at a time, which can help prioritize resources and speed up more complex studies.
We have made all of our predictions freely available to the research community and open sourced the model code for AlphaMissense.
AlphaMissense predicted the pathogenicity of all possible 71 million missense variants. It classified 89% – predicting that 57% were likely benign and 32% were likely pathogenic.
What is a missense variant?
A missense variant is a substitution of a single letter in DNA that results in a different amino acid within a protein. If you think of DNA as a language, changing one letter can change a word and completely change the meaning of a sentence. In this case, a substitution changes which amino acid is translated, which can affect the function of a protein.
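To make the single-letter substitution concrete, here is a minimal Python sketch. It uses the well-known sickle cell change in the HBB gene, where the codon GAG (glutamic acid) becomes GTG (valine). The tiny codon table and the function names are illustrative assumptions, not part of AlphaMissense, and the sketch ignores stop codons.

```python
# Illustrative only: a three-entry codon table, not the full genetic code.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu",  # glutamic acid
    "GAA": "Glu",  # glutamic acid (synonymous with GAG)
    "GTG": "Val",  # valine
}


def substitute(codon: str, position: int, new_base: str) -> str:
    """Return the codon with the base at `position` (0-based) replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def describe_substitution(codon: str, position: int, new_base: str):
    """Report the mutated codon, both amino acids, and the kind of change."""
    mutated = substitute(codon, position, new_base)
    before = CODON_TO_AMINO_ACID[codon]
    after = CODON_TO_AMINO_ACID[mutated]
    # A substitution is missense when the encoded amino acid changes.
    kind = "missense" if before != after else "synonymous"
    return mutated, before, after, kind


# Middle letter A -> T: the amino acid changes, so this is a missense variant.
print(describe_substitution("GAG", 1, "T"))
# Last letter G -> A: the amino acid stays the same, so it is synonymous.
print(describe_substitution("GAG", 2, "A"))
```

The second call shows why not every single-letter change is a missense variant: some substitutions leave the amino acid unchanged.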
The average person carries more than 9,000 missense variants. Most are benign and have little or no effect, but others are pathogenic and can seriously disrupt protein function. Missense variants can be used in the diagnosis of rare genetic diseases, where a few or even a single missense variant can directly cause disease. They are also important for studying complex diseases, such as type 2 diabetes, which can be caused by a combination of many different types of genetic changes.
Classifying missense variants is an important step in understanding which of these protein changes could cause disease. Of more than 4 million missense variants already observed in humans, only 2% have been classified as pathogenic or benign by experts, about 0.1% of the 71 million possible missense variants. The rest are considered “variants of unknown significance” due to a lack of experimental or clinical data on their impact. With AlphaMissense we now have the clearest picture to date, classifying 89% of variants using a threshold that yielded 90% precision on a database of known disease variants.
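The relationship between the 2% and 0.1% figures can be checked with a quick calculation. This sketch uses the rounded numbers quoted in the text (4 million observed, 71 million possible), so the result is approximate. The variable names are illustrative.

```python
# Rounded figures taken from the text above.
observed_variants = 4_000_000       # "more than 4 million" seen in humans
expert_classified_fraction = 0.02   # 2% classified by experts
possible_variants = 71_000_000      # all possible missense variants

expert_classified = observed_variants * expert_classified_fraction
share_of_all_possible = expert_classified / possible_variants

print(f"Expert-classified variants: about {expert_classified:,.0f}")
print(f"Share of all possible variants: about {share_of_all_possible:.1%}")
```

With these inputs, roughly 80,000 variants have an expert classification, which is about 0.1% of all possible missense variants.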
Pathogenic or benign: How AlphaMissense classifies variants
AlphaMissense is based on our innovative model AlphaFold, which predicted structures for almost all proteins known to science from their amino acid sequences. Our adapted model can predict the pathogenicity of variants that alter single amino acids of proteins.
To train AlphaMissense, we fine-tuned AlphaFold on labels that distinguish variants seen in human and closely related primate populations. Variants commonly seen are treated as benign, and variants never seen are treated as pathogenic. AlphaMissense does not predict the change in protein structure upon mutation or other effects on protein stability. Instead, it leverages databases of related protein sequences and the structural context of variants to produce a score between 0 and 1 approximating the likelihood that a variant is pathogenic. The continuous score allows users to choose a threshold for classifying variants as pathogenic or benign that matches their accuracy requirements.
An illustration of how AlphaMissense classifies human missense variants. A missense variant is input and the AI system scores it as likely pathogenic or likely benign. AlphaMissense combines structural context and protein language modeling and is fine-tuned on human and primate variant population frequency databases.
AlphaMissense achieves state-of-the-art predictions across a wide range of genetic and experimental benchmarks, all without being explicitly trained on such data. Our tool outperformed other computational methods when used to classify variants from ClinVar, a public archive of data on the relationship between human variants and disease. Our model was also the most accurate method for predicting results from laboratory experiments, which shows that it is consistent with different ways of measuring pathogenicity.
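As a rough illustration of how a user might turn the continuous score into a class, the sketch below applies two cutoffs to a score between 0 and 1. The function name and the default cutoffs (0.3 and 0.7) are arbitrary placeholders chosen for this example, not the thresholds used in the paper. In practice a user would pick cutoffs that meet their own accuracy requirements.

```python
def classify(score: float,
             benign_below: float = 0.3,       # hypothetical cutoff
             pathogenic_above: float = 0.7    # hypothetical cutoff
             ) -> str:
    """Map a pathogenicity score in [0, 1] to a coarse class label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < benign_below:
        return "likely benign"
    if score > pathogenic_above:
        return "likely pathogenic"
    # Scores between the two cutoffs are left unclassified.
    return "uncertain"


for example_score in (0.05, 0.50, 0.95):
    print(example_score, classify(example_score))

# A stricter pathogenic cutoff moves borderline scores into "uncertain".
print(classify(0.80, pathogenic_above=0.9))
```

Moving the cutoffs apart trades coverage for confidence: fewer variants get a label, but the labels that remain are backed by more extreme scores.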
AlphaMissense outperforms other computational methods for predicting missense variant effects.
Left: Comparison of the performance of AlphaMissense and other methods in classifying variants from the ClinVar public archive. Methods shown in gray were trained directly on ClinVar and their performance on this benchmark is likely to be overestimated as some of their training variants are included in this test set.
Right: Graph comparing the performance of AlphaMissense and other methods in predicting measurements from biological experiments.
Building a community resource
AlphaMissense builds on AlphaFold to advance the world’s understanding of proteins. A year ago, we released 200 million protein structures predicted using AlphaFold – which helps millions of scientists around the world accelerate research and pave the way to new discoveries. We look forward to seeing how AlphaMissense can help solve open questions at the heart of genomics and across biological science.
We have made AlphaMissense predictions freely available to the scientific community. Together with EMBL-EBI, we are also making them more usable for researchers through the Ensembl Variant Effect Predictor.
In addition to the lookup table of missense mutations, we have shared the expanded predictions of all 216 million possible single amino acid sequence substitutions across more than 19,000 human proteins. We’ve also included the average prediction for each gene, which is similar to measuring the evolutionary constraint of a gene – it shows how important the gene is to the survival of the organism.
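The per-gene average can be pictured as a simple group-and-mean over variant scores. The sketch below uses made-up gene names and toy scores purely for illustration. It does not reflect the format of the released files or any real prediction values.

```python
from collections import defaultdict
from statistics import mean

# Toy (gene, score) pairs; both the names and the values are invented.
toy_predictions = [
    ("GENE_A", 0.5),
    ("GENE_A", 1.0),
    ("GENE_B", 0.25),
    ("GENE_B", 0.25),
]


def mean_score_per_gene(rows):
    """Average the variant-level scores for each gene."""
    scores_by_gene = defaultdict(list)
    for gene, score in rows:
        scores_by_gene[gene].append(score)
    return {gene: mean(scores) for gene, scores in scores_by_gene.items()}


gene_averages = mean_score_per_gene(toy_predictions)
for gene, average in gene_averages.items():
    print(f"{gene}: {average:.2f}")
```

A gene whose possible substitutions mostly score high would end up with a high average, which is the sense in which the average resembles a measure of how little change the gene tolerates.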
Examples of AlphaMissense predictions overlaid on predicted AlphaFold structures (red=predicted as pathogenic, blue=predicted as benign, grey=uncertain). Red dots represent known pathogenic variants, blue dots represent known benign variants from the ClinVar database.
Left: HBB protein. Variations in this protein can cause sickle cell disease.
Right: CFTR protein. Variations in this protein can cause cystic fibrosis.
Accelerating research into genetic diseases
A key step in translating this research is collaboration with the scientific community. We are working in partnership with Genomics England to explore how these predictions could help study the genetics of rare diseases. Genomics England cross-referenced the AlphaMissense findings with variant pathogenicity data previously collected with human participants. Their evaluation confirmed that our predictions are accurate and consistent, providing another real-world benchmark for AlphaMissense.
Although our predictions are not designed for use directly in the clinic – and should be interpreted with other sources of evidence – this work has the potential to improve the diagnosis of rare genetic disorders and aid in the discovery of new disease-causing genes.
Ultimately, we hope that AlphaMissense, along with other tools, will enable researchers to better understand diseases and develop new life-saving treatments.