ESMRank: using AI to predict the effects of protein mutations

At TIGEM, ESMRank uses artificial intelligence to rank protein mutations and help researchers interpret genetic variants linked to disease.

Technological advances have made genome sequencing faster, more accurate and increasingly accessible, generating an unprecedented volume of biological data. Today, however, the main challenge is no longer simply to produce this data, but to interpret it. Among the thousands of genetic variants found in each person’s DNA, which ones can affect health, and which are harmless?

This is the context for the work led by Gennaro Gambardella, tenure-track researcher at the Scuola Superiore Meridionale and independent investigator at TIGEM in Pozzuoli, where he heads a laboratory dedicated to computational biology. His group has developed ESMRank, a computational method that uses Artificial Intelligence to predict the effects of protein mutations.

The method is based on an idea that has become central to computational biology: proteins can be studied as sequences written in a twenty-letter alphabet, the letters being amino acids. Artificial Intelligence models trained on millions of protein sequences learn some of the “rules” that govern how these amino acids combine with one another. ESMRank uses this information to evaluate possible amino acid substitutions in a protein and rank them along a continuous scale: from the most damaging variants to the most tolerated, and even to those that may potentially improve protein properties.

This marks an important difference from many tools that classify variants in binary terms, for example as “pathogenic” or “benign”. Instead, ESMRank produces a ranking that can distinguish different degrees of impact. This approach can help researchers select the most relevant variants to investigate in the laboratory and, in some contexts, better interpret the relationship between mutations, protein stability and drug response.

The method starts from the amino acid sequence alone, without requiring the three-dimensional structure of the protein as an input, and has proved particularly effective in predicting the effects of mutations on protein stability. When applied to CFTR, the protein involved in cystic fibrosis, the ESMRank score showed a correlation with experimental measurements of function, maturation and pharmacological response. This suggests that it could help identify variants that may be corrected by the modulator drugs currently used in clinical practice.

This ability to rank mutations along a gradient of effect also opens up new perspectives in protein engineering. At TIGEM, the method is being explored to identify “enhanced” versions of therapeutic enzymes, potentially more effective than their natural counterparts, for use in enzyme replacement therapy for certain lysosomal diseases. This possibility still needs to be validated experimentally, but it shows how Artificial Intelligence can become a practical tool not only for interpreting genetic variants, but also for designing new therapeutic strategies.

The challenge to be addressed: variants of uncertain significance

Human genome sequencing has revealed the extraordinary extent of genetic variability between individuals: each of us is born with millions of variants that make us unique. Most are harmless, but some can alter the function of a protein and cause disease. The problem is that, for a very large number of these variants, their effect is still unknown. These are known as variants of uncertain significance, or VUS: genetic variants that may be disease-causing, irrelevant, or even potentially beneficial.

“In rare genetic diseases, there is usually one variant, among the many identified, that is responsible for the condition. When searching for mutations in a patient’s genome, the real difficulty lies in understanding which one may be the cause,” says Gambardella.

For disease genes that are already known, researchers can draw on the available scientific literature. For newly identified genes, however, the situation is far more complex: there may be thousands of variants to assess, and it is not possible to test them all in the laboratory to determine which one is responsible. ESMRank is designed to help fill precisely this gap. It does not replace experimental validation, but helps indicate which variants should be prioritised for further investigation.

“With our method, we try to predict which variants may affect protein function and which may not, helping to reduce the uncertainty associated with VUS. It is important to stress that any bioinformatics tool narrows down the hypotheses; it does not provide a definitive answer. Its role is to guide research, but the predictions must always be validated in the laboratory,” explains Gambardella.

ESMRank and the language of proteins

Large Language Models, such as GPT, are trained on millions of texts with one core task: to predict the next word, or to reconstruct a missing one, from its context. When trained at scale, such a model learns the grammar and syntax of the language on its own.

A similar principle can be applied to proteins. Each amino acid sequence can be read as a sentence, and the proteome of an organism as a book. Training a model on millions of these sequences, collected from different organisms, means teaching it the rules by which amino acids combine and influence one another. In doing so, the model learns to predict which amino acids are more compatible with a given sequence context. Among the best-known models trained on the language of proteins is ESM, developed by Meta, which can extract information about protein sequence and structure starting from the amino acid sequence alone.

ESMRank starts from this foundation, but takes it in a different direction: it does not predict protein structure, but the effect of mutations. “What we did with our method was to extract some of these features from Large Language Models and use them to our advantage in order to predict the effect of mutations,” says Gambardella.

The system takes an amino acid sequence as its input and extracts two types of information. The first consists of the features learned by the ESM model, including information on possible contacts between amino acids in the three-dimensional structure and the position occupied by each protein in a multidimensional mathematical space: proteins with similar properties are close to one another, while different proteins are further apart. The second type of information concerns the physicochemical properties of possible amino acid substitutions. This compact set of data is then fed into a machine learning algorithm — LambdaMART, implemented in XGBoost — trained on around one million integrated and normalised variants derived from more than two million experimental measurements.
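The second kind of input, the physicochemical description of a substitution, can be sketched as follows. This is a minimal illustration and not the ESMRank feature set: the article does not list the properties used, so the choice of the Kyte-Doolittle hydropathy index and a simple side-chain charge, as well as the function name `substitution_features`, are assumptions made for the example.

```python
# Illustrative only: the actual ESMRank features are not specified in the article.
# Kyte-Doolittle hydropathy index for the twenty standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Approximate side-chain charge at neutral pH (histidine treated as neutral).
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def substitution_features(wild_type: str, mutant: str) -> dict:
    """Describe a substitution by the change in two simple properties."""
    return {
        "delta_hydropathy": HYDROPATHY[mutant] - HYDROPATHY[wild_type],
        "delta_charge": CHARGE.get(mutant, 0) - CHARGE.get(wild_type, 0),
    }


# Leucine (hydrophobic) replaced by aspartate (negatively charged).
features = substitution_features("L", "D")
```

In a pipeline like the one described, values of this kind would sit alongside the features extracted from the language model before being passed to the ranking algorithm.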

The final result is what, in biology, is known as in silico deep mutational scanning. For each position within a protein, the system simulates replacement with each of the twenty possible amino acids and scores the effect, producing a ranking that ranges from variants causing loss of function to neutral or potentially beneficial ones. An equivalent laboratory experiment would require researchers to synthesise and test thousands, or even tens of thousands, of variants. ESMRank performs this process computationally, starting from the sequence alone and without requiring prior knowledge of the protein’s three-dimensional structure.
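The enumeration step of such a scan can be sketched in a few lines. The example below only lists the candidate variants; it does not score them, and the function name `single_substitutions` and the five-residue peptide are hypothetical choices made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard amino acids


def single_substitutions(sequence: str):
    """Yield (label, mutant_sequence) for every single amino acid change."""
    for index, wild_type in enumerate(sequence):
        for mutant in AMINO_ACIDS:
            if mutant == wild_type:
                continue  # replacing a residue with itself is not a mutation
            label = f"{wild_type}{index + 1}{mutant}"  # 1-based, e.g. "M1A"
            yield label, sequence[:index] + mutant + sequence[index + 1:]


toy_sequence = "MKTAY"  # hypothetical five-residue peptide
variants = list(single_substitutions(toy_sequence))
print(len(variants))  # 5 positions x 19 alternatives = 95
```

Because each position has nineteen alternatives, a protein of 1,000 residues already yields 19,000 single variants, which is why the equivalent laboratory experiment is so demanding.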

“Our system is very simple: it receives the amino acid sequence without needing to know the three-dimensional structure of the protein. We do not predict structure like AlphaFold does: our score is correlated with stability,” explains Gambardella.

The method does, however, have two acknowledged limitations. The first is that, by working exclusively on the amino acid sequence, it does not directly use an experimental or predicted three-dimensional structure, nor does it explicitly model the dynamics of protein folding. Folding is what determines the activity and function of a protein, and mutations can interfere with this process. The second limitation is that ESMRank predicts the effect of mutations on protein stability with high accuracy, but it is not yet optimised to predict other properties directly, such as protein activity.

The validation phase: ESMRank achieves leading results in stability benchmarks

To validate the method, the researchers used data from the scientific literature and external datasets. One of the key resources was the dataset generated by Ben Lehner and published in Nature, which includes around 500 mutagenised protein domains, for a total of approximately 500,000 mutations that the system had not encountered during training.

When asked to rank mutations from the most damaging to the most tolerated, ESMRank outperformed the methods currently used to predict effects on protein stability. The same result was obtained with ProteinGym, a collection of deep mutational scanning experiments on proteins that were also excluded from the training phase. Here too, ESMRank proved to be the most accurate method for predicting the effect of mutations on protein stability, outperforming much more complex deep learning approaches that also use information on three-dimensional structure.
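Comparing a predicted ranking with experimental measurements is commonly done with a rank correlation. The article does not name the metric used in these benchmarks, so the use of Spearman correlation here is an assumption, and the scores and measurements below are invented for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rank_p, rank_m = average_ranks(predicted), average_ranks(measured)
    mean_p = sum(rank_p) / len(rank_p)
    mean_m = sum(rank_m) / len(rank_m)
    covariance = sum((p - mean_p) * (m - mean_m) for p, m in zip(rank_p, rank_m))
    spread_p = sum((p - mean_p) ** 2 for p in rank_p) ** 0.5
    spread_m = sum((m - mean_m) ** 2 for m in rank_m) ** 0.5
    return covariance / (spread_p * spread_m)


# Hypothetical values for five variants; higher means better tolerated.
predicted_scores = [0.10, 0.40, 0.35, 0.80, 0.95]
measured_stability = [-2.1, -1.2, -0.9, -0.1, 0.3]
rho = spearman(predicted_scores, measured_stability)
```

A value close to 1 means the predicted order closely matches the measured one. Here two neighbouring variants are swapped, so the correlation is high but not perfect.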

“We believe the reason lies in the training dataset: we collected and harmonised more than two million experimental measurements, largely related to protein stability. The system learned the general substitution rules that alter stability,” explains Gambardella.

This advantage also emerges when ESMRank is compared with AlphaMissense, the tool developed by the AlphaFold group. AlphaMissense is highly effective at distinguishing between pathogenic and benign variants, whereas ESMRank shows its strength when variants need to be ranked along a gradient of effect — which is precisely its main purpose.

The most emblematic validation was carried out on CFTR, the protein involved in cystic fibrosis. The researchers analysed a dataset of 585 mutations for which the effects on both activity and protein maturation/stability had been measured. Compared with existing methods, ESMRank proved to be among the most informative tools for connecting mutations, protein function and pharmacological response.

One of its main advantages is that the output is not a binary verdict — pathogenic or non-pathogenic — but a continuous ranking, from loss of function to more tolerated or potentially beneficial variants. In the case of CFTR, this gradient proved particularly meaningful: there is a correlation between the degree of damage indicated by the score and the ability of drugs to intervene. A corrector can rescue a partially compromised protein, but not one that has been completely disrupted.

“When our score indicates that the effect of a mutation is significant but not devastating, CFTR correctors and stabilisers appear to be effective,” says Gambardella.

This score could therefore help predict which variants still retain part of the protein’s function and may respond to drugs, although such predictions always require experimental and clinical validation.

The future of ESMRank: learning from evolution to design better proteins

ESMRank’s ability to identify not only damaging mutations, but also potentially beneficial ones, opens up a further direction: protein engineering applied to therapy. In lysosomal diseases, enzyme replacement therapy involves administering the natural version of the defective enzyme to the patient. This can improve the patient’s condition, but treatment may require high doses and can be associated with unwanted effects. If a modified version of the enzyme proved more effective than the natural one, it could potentially reduce the required dose and, with it, the risk of side effects.

This is the focus of the work being carried out by Gambardella’s group in collaboration with Nicola Brunetti-Pierri, Professor of General and Specialist Paediatrics at the University of Naples, and Andrea Pasquadibisceglie, a structural biologist at TIGEM. The aim is to combine ESMRank with other computational methods to identify combinations of mutations capable of improving the activity of a therapeutic enzyme.

The challenge is far from straightforward. Proteins are the result of millions of years of natural selection, and the space of possible combinations is vast. “When you combine two, three or four mutations at a time, the possibilities reach billions, or even thousands of billions. Instead of starting from millions of candidates, our method allows us to focus on around one hundred promising variants and then test them in the laboratory,” explains Gambardella.
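The scale of this search space follows from simple counting: choosing k positions out of L, each with nineteen alternative amino acids, gives C(L, k) x 19^k combinations. The sketch below works this out for an enzyme of 400 residues, a hypothetical length chosen only for illustration.

```python
from math import comb

ALTERNATIVES_PER_POSITION = 19  # twenty amino acids minus the original one


def combined_variants(length: int, mutations: int) -> int:
    """Count sequences carrying exactly `mutations` substitutions."""
    return comb(length, mutations) * ALTERNATIVES_PER_POSITION ** mutations


enzyme_length = 400  # hypothetical length, chosen only for illustration
for k in (1, 2, 3, 4):
    print(k, f"{combined_variants(enzyme_length, k):,}")
```

For this length the count grows from 7,600 single variants to roughly 29 million double variants, about 73 billion triple variants, and more than one hundred thousand billion quadruple variants.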

At the same time, the group is developing a successor to ESMRank to overcome the limitations linked to the lack of explicit three-dimensional information. Current Large Language Models, because of the way they are trained, often struggle to distinguish the effect of single mutations: two sequences that differ by just one amino acid occupy almost identical positions within the model’s internal space. Even highly powerful tools such as AlphaFold may be relatively insensitive to individual amino acid substitutions, producing very similar structures even when a mutation may have important functional consequences. The goal is to build a model that not only recognises mutations, but can also detect these minimal differences and connect each mutation to its effect on the three-dimensional structure of the protein.
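A toy calculation gives a feel for how small a single substitution looks in a protein-level summary. The sketch below averages one-hot residue vectors, which is only a stand-in for a real language-model embedding: actual embeddings depend on context, and this example does not reproduce the training-related cause mentioned above. The sequence and all names are hypothetical.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mean_pooled_embedding(sequence: str):
    """Average one-hot residue vectors into a single protein-level vector."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    return [count / len(sequence) for count in counts]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


# Toy 330-residue sequence built by repeating an invented 33-residue motif.
wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ" * 10
# One substitution at position 100 (methionine replaced by aspartate).
mutant = wild_type[:99] + "D" + wild_type[100:]
similarity = cosine_similarity(
    mean_pooled_embedding(wild_type), mean_pooled_embedding(mutant)
)
print(similarity)
```

The two vectors are almost indistinguishable even though the sequences differ, which illustrates why a model needs to be made explicitly sensitive to single-residue changes.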

To achieve this, the group is teaching the new model to compare similar proteins with one another, drawing on the information that evolution has embedded in their sequences. “The information is already there: our genetic make-up, and therefore our genes, are the result of a very long process of selection. Over time, evolution has explored countless possible solutions; we need to learn how to read and model them,” says Gambardella.

Many proteins, across different organisms, share a common evolutionary origin: their sequences are similar, but they differ in some mutations. When their three-dimensional structure is also known — either because it has been measured experimentally or accurately reconstructed using AlphaFold — these pairs become valuable examples for understanding how a change in sequence can affect the shape of a protein.

“We used pairs of homologous proteins for which we know the three-dimensional structure, and we taught the model to recognise how a mutation changes their shape,” explains Gambardella.

By analysing hundreds of thousands of examples, the model gradually learns the general rules that evolution has used to alter protein structures. In doing so, it becomes sensitive even to very small amino acid differences, which current Large Language Models often struggle to detect.

The preliminary results are encouraging. In early analyses, the new model outperforms the current version of ESMRank and, because of the way it has been built, it could be specialised to predict not only protein stability, but also protein activity — the main limitation of the current method.

ESMRank is already being used internally at TIGEM as a tool to narrow down hypotheses, not as a definitive answer, since every prediction must be verified experimentally.

From the interpretation of variants of uncertain significance to the design of better therapeutic proteins, and even the possible prediction of drug response, TIGEM’s work shows how computational biology and Artificial Intelligence can become practical, effective tools in research into genetic diseases.
