Meta’s ESMFold: the rival of AlphaFold2
Meta uses a new approach to predict over 600 million protein structures
Last year AlphaFold2 was announced, and for many it represented a revolution. A short time ago, Meta also announced a model capable of predicting protein structure from sequence. What has changed? Why does it matter? This article tries to answer these questions.
The little engines of life
In short, why is it important to know the structure of proteins? Proteins can be considered the engines of life: they mediate virtually all the functions of cells and organisms. In a classic analogy, DNA is seen as the physical memory, RNA as the RAM, and proteins as the software.
A protein’s function is determined by its three-dimensional structure, which in turn is determined by its sequence (a protein can be read as a string whose amino acids are the letters of an alphabet). Predicting the structure is an extremely hard problem, both because we do not fully understand the physical laws that guide folding into the final structure and because even a sequence of 100 amino acids has an astronomical number of theoretically possible conformations (a rough estimate is sketched below).
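To get a feel for the scale of the problem, here is a back-of-the-envelope count in the spirit of Levinthal’s classic estimate. The figure of roughly three conformations per residue is an illustrative assumption, not a measured value.

```python
# Rough combinatorial estimate of the conformational space of a small protein.
# Assumption (illustrative only): each residue can adopt ~3 backbone conformations.
n_residues = 100
conformations_per_residue = 3

total_conformations = conformations_per_residue ** (n_residues - 1)
print(f"~{total_conformations:.2e} possible conformations")  # on the order of 10^47
```

Searching such a space by brute force is hopeless, which is why learned predictors are so valuable.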
In 2020, DeepMind’s AlphaFold2 won CASP14, showing that its predictions were virtually indistinguishable from those obtained experimentally. I explained more about the protein folding problem and how AlphaFold2 works here:
In any case, although AlphaFold2 is not free of limitations, it has opened up many interesting perspectives, which I have described here:
Meta joins the race
Meta has recently announced a new model capable of predicting the fold of a protein starting from its sequence. But we already have AlphaFold2, so what’s new?
As I described in another article, there are many analogies between natural language and the “language of life.” Protein sequences carry a message (the fold) and obey precise rules of syntax. Mutations in the sequence can disrupt the assembly and function of a protein. Moreover, the function of a protein family can be inferred from patterns in its sequences (proteins of the same family, such as kinases, share both similar functions and similar sequence patterns).
Meanwhile, there has been an explosion of large language models in recent years. Models based on the transformer and its self-attention mechanism have proven efficient and successful in a wide range of tasks (NLP, image analysis, text-to-image generation, mathematical problem-solving, and so on). AlphaFold2, by contrast, has a multi-block architecture inspired by the transformer but addresses the problem differently (treating proteins as graphs and relying on multiple sequence alignments), and its training is considerably more complex than that of a standard transformer.
In brief, Meta’s authors posed the question: we have used transformers everywhere, so can we also use them to predict protein structure?
First, the researchers designed the training so that it would serve the desired task. The intuition is that, in order to predict an amino acid that is missing from a protein sequence, the model must become aware of the underlying structure. And according to the authors, this understanding grows as a sufficiently capable model is shown enough protein sequences:
As the representational capacity of the language model and the diversity of protein sequences seen in its training increase, we expect that deep information about the biological properties of the protein sequences could emerge, since those properties give rise to the patterns that are observed in the sequences. — original article
Indeed, the authors scaled the model up to 15 billion parameters and observed that increasing the parameter count improved predictions (i.e., decreased perplexity).
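For context, perplexity is simply the exponential of the model’s average cross-entropy on the masked tokens: lower values mean the model is less “surprised” by the hidden amino acids. The numbers in the sketch below are illustrative, not taken from the paper.

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity = exp(average per-token cross-entropy loss)."""
    return math.exp(mean_cross_entropy)

# Illustrative values only: a larger model with a lower loss has lower perplexity,
# i.e. it narrows down the identity of the masked amino acid more effectively.
print(perplexity(2.3))  # ~10 -> roughly like guessing among 10 residues
print(perplexity(1.6))  # ~5  -> the model has learned much more about the sequence
```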
ESM-2 was trained with a masked language modeling objective (basically, a random amino acid is hidden behind a mask token and the model has to predict which one it is from the surrounding context). As the authors state, even though this objective is simple, it forces the model to learn information about the protein (function, structure, and so on):
[…] the model to learn dependencies between the amino acids. Although the training objective itself is simple and unsupervised, performing well on this task over millions of evolutionarily diverse protein sequences requires the model to internalize sequence patterns across evolution. We expect that this training will also cause structure to materialize since it is linked to the sequence patterns.
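As a concrete illustration of the masked-prediction setup, here is a minimal sketch built on the fair-esm package. The checkpoint name and helper calls follow the usage documented in the repository, but treat them as assumptions that may differ between versions.

```python
import torch
import esm  # pip install fair-esm

# Load a small ESM-2 checkpoint (assumption: name as documented in the fair-esm repo)
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
_, _, tokens = batch_converter([("protein", sequence)])

# Hide one residue behind the mask token (+1 offset for the beginning-of-sequence token)
position = 11
true_idx = tokens[0, position].item()
tokens[0, position] = alphabet.mask_idx

with torch.no_grad():
    logits = model(tokens)["logits"]

predicted_idx = logits[0, position].argmax().item()
print("predicted:", alphabet.get_tok(predicted_idx), "| true:", alphabet.get_tok(true_idx))
```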
The model was trained on millions of sequences from the UniRef database. Elegantly, as the model trains it develops attention patterns that represent interactions between amino acids in the sequence. By extracting these attention maps, the authors found that they correspond to residue-residue contacts in the tertiary structure of proteins.
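Below is a minimal sketch of how such attention-derived contact maps can be extracted with the fair-esm package, again following the repository’s documented usage (the checkpoint name and output keys are assumptions if your version differs):

```python
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33], return_contacts=True)

embeddings = results["representations"][33]  # per-residue internal representation
contacts = results["contacts"]               # predicted residue-residue contact map
print(embeddings.shape, contacts.shape)
```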
The researchers then used an equivariant transformer to turn these maps into spatial coordinates for the protein’s atoms. In other words, starting from the internal representation of the language model, one can produce an atomic-resolution structure prediction for a protein.
The authors packaged this into ESMFold: a protein sequence is fed to ESM-2 (the model seen above), its internal representation is passed to a series of folding blocks (which operate on a sequence representation and a pairwise representation), and the output is then passed to an equivariant transformer that produces the final atomic-level structure along with a predicted confidence.
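Putting the pieces together, a minimal ESMFold inference sketch, based on the usage documented in the project’s GitHub repository, might look like the following (function names such as esm.pretrained.esmfold_v1() and infer_pdb are assumptions if your installed version differs):

```python
import torch
import esm  # pip install "fair-esm[esmfold]"

# Load ESMFold (ESM-2 language model + folding blocks + equivariant structure head)
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()  # a GPU is strongly recommended

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

with torch.no_grad():
    # Returns a PDB-formatted string; per-residue confidence is reported alongside the structure
    pdb_string = model.infer_pdb(sequence)

with open("prediction.pdb", "w") as handle:
    handle.write(pdb_string)
```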
Meta’s network, called ESMFold, isn’t quite as accurate as AlphaFold, Rives’ team reported earlier this summer, but it is about 60 times faster at predicting structures, he says. “What this means is that we can scale structure prediction to much larger databases.” — (source)
The model is much faster than AlphaFold2 (for which a single prediction can take several minutes). This speed allowed the authors to predict the structures of more than 600 million proteins. To do so, they used a metagenomic database (DNA sequenced from soil, the human gut, and other microbial environments).
Meta announced that the metagenomic predictions are now accessible to the community (here). The company also stated that the ESM-2 and ESMFold models will soon be available on Hugging Face so that everyone can experiment with them.
Parting thoughts
Transformers, introduced in 2017, are now state-of-the-art in a multitude of tasks, yet DeepMind took a different approach. Meta shows how a language model can be used to predict protein structures.
This matters because ESMFold is much faster, enabling far quicker predictions even if they are not quite as accurate as AlphaFold2’s. A faster model makes it easier to experiment with different solutions, and the authors also showed that its capability grows with the number of parameters. Moreover, AlphaFold2 has several limitations; being able to experiment with different types of models will allow the community to develop new solutions and new applications.
In addition, ESMFold uses only the protein sequence as input, while AlphaFold2 relies on multiple sequence alignments (MSAs), which improve predictions but also make the model computationally more expensive.
The authors also predicted structures for metagenomic data: a huge number of sequences belonging to microorganisms that we have not been able to culture, whose proteins could have important practical applications. Obtaining a structure is the first step toward understanding a protein’s function and studying its possible uses (clean energy, fighting pollution, finding new cures).
The authors have made the predictions and code available to the community (GitHub repository). There are also other prediction models (RoseTTAFold, IntFOLD, RaptorX, and so on), and the field is evolving fast, so we will see new models and applications soon.
If you found this article interesting:
You can look for my other articles, subscribe to get notified when I publish new ones, or connect with me on LinkedIn. Thanks for your support!
Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.
Or feel free to check out some of my other articles on Medium: