Searching for proteins with similar structures
How do we compare three-dimensional shapes?
One way to identify the function of a protein is to compare the protein to others that have already been described. For 40 years, the most common way to compare proteins has been to align their amino acid sequences. Until recently, our ability to compare proteins by shape was limited; we lacked structural descriptions of most proteins because experimental techniques such as X-ray crystallography and nuclear magnetic resonance are time-consuming and expensive (and don’t work for many proteins). With the advent of AI tools such as AlphaFold, we are no longer limited to experimental methods for describing protein structures: we now have the ability to predict them computationally. Because structure is more conserved than sequence, structural comparison can identify distant homologs that sequence comparison methods might miss. But how do we compare complex three-dimensional shapes?
Digital representations of protein structure
Before we can compare protein structures, we need to define a standard way to represent them computationally. Experimentally determined structures are stored in the Protein Data Bank (PDB) in the PDBx/mmCIF format (mmCIF stands for macromolecular Crystallographic Information File). These files contain a variety of information, including (x, y, z) coordinates that describe the position of each atom in the molecule. Computational techniques for predicting protein structures, including AlphaFold, have inherited the PDBx/mmCIF format.
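To make the coordinate idea concrete, the sketch below reads (x, y, z) positions from a small hand-written fragment in the style of the atom table of a PDBx/mmCIF file. The fragment and its coordinates are made up for illustration, and the function names `parse_atom_site` and `alpha_carbon_coordinates` are hypothetical. The parser only splits on whitespace, so it is a simplification; real files have many more columns and quoting rules, and a dedicated mmCIF parser is the safer choice in practice.

```python
# A tiny, hand-written fragment in the style of an mmCIF "_atom_site" loop.
# The coordinate values are invented for illustration only.
MMCIF_FRAGMENT = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N  MET 1 27.340 24.430 2.614
ATOM 2 CA MET 1 26.266 25.413 2.842
ATOM 3 C  MET 1 26.913 26.639 3.531
ATOM 4 CA GLY 2 26.400 28.900 4.100
"""


def parse_atom_site(text):
    """Return one dict per atom row (simplified: whitespace-separated values only)."""
    columns, atoms = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_atom_site."):
            # Column names are declared one per line before the data rows.
            columns.append(line.split(".", 1)[1])
        elif columns:
            values = line.split()
            if len(values) == len(columns):
                atoms.append(dict(zip(columns, values)))
    return atoms


def alpha_carbon_coordinates(atoms):
    """Extract (x, y, z) tuples for alpha carbons (atom name "CA")."""
    return [
        (float(a["Cartn_x"]), float(a["Cartn_y"]), float(a["Cartn_z"]))
        for a in atoms
        if a["label_atom_id"] == "CA"
    ]


atoms = parse_atom_site(MMCIF_FRAGMENT)
ca_coords = alpha_carbon_coordinates(atoms)
print(len(atoms), "atoms;", len(ca_coords), "alpha carbons")
```

Keeping only the alpha carbons gives one point per amino acid, which is a common reduced view of a protein backbone.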
Because the PDBx/mmCIF file format was originally designed to describe experimental data, it is not necessarily ideal for use with modern computational methods and the massive data sets we have today. Some researchers are exploring alternative representations of protein structure that are optimized for use with modern machine learning algorithms. For example, PyUUL is a PyTorch library that can be used to represent Protein Data Bank structures using voxels (3D pixels) or point clouds. Unlike the idiosyncratic PDBx/mmCIF file format, these data structures can be used with neural network-based tools that were originally designed for fields such as computer vision.
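As a rough illustration of the difference between a point cloud and a voxel grid, the sketch below bins a list of atom positions into cubic cells. It does not use PyUUL and is not its API; the function name `voxelize`, the 2.0-unit cell size, and the example points are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def voxelize(points, voxel_size=2.0):
    """Count how many points fall into each cubic voxel.

    Returns a Counter keyed by integer (i, j, k) voxel indices.
    """
    grid = Counter()
    for x, y, z in points:
        index = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        grid[index] += 1
    return grid


# A point cloud is simply a list of (x, y, z) positions (invented values).
point_cloud = [
    (0.5, 0.5, 0.5),
    (1.9, 0.1, 0.3),
    (2.1, 0.0, 0.0),
    (-0.5, 3.9, 4.0),
]

grid = voxelize(point_cloud, voxel_size=2.0)
for index, count in sorted(grid.items()):
    print(index, count)
```

The point cloud keeps exact positions, while the voxel grid trades precision for a regular layout of the kind that image-style neural networks expect.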
Graphein, a Python library, provides a similar service: it can be used to convert molecular structures into graph and surface-mesh representations that can be used with existing geometric deep learning libraries.
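One simple way to picture a graph representation is a residue contact graph: each amino acid is a node, and two nodes are connected when their alpha carbons are close in space. The sketch below is not Graphein's API. The function name `contact_graph`, the 8.0 Å cutoff, and the example coordinates are assumptions for illustration.

```python
import math


def contact_graph(ca_coords, cutoff=8.0):
    """Build an adjacency dict linking residues whose alpha carbons lie within `cutoff`."""
    graph = {i: set() for i in range(len(ca_coords))}
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Invented alpha-carbon positions: four residues in a line, one off to the side.
ca_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (11.4, 0.0, 0.0),
    (3.8, 5.0, 0.0),
]

graph = contact_graph(ca_coords)
edge_count = sum(len(neighbors) for neighbors in graph.values()) // 2
print(graph)
print("edges:", edge_count)
```

An adjacency structure like this can be handed to general-purpose graph algorithms, which is the appeal of converting structures into graphs.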
Foldseek takes an alternative approach to codifying protein structure. Instead of using an alphabet to represent a linear sequence of amino acids, Foldseek uses an alphabet to represent the predicted three-dimensional conformations of adjacent amino acids. These “3Di” sequences are essentially “shorthand” for the more complex representations of protein structure described above, which allows for simplified methods of data analysis.
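The general idea of a structural alphabet can be sketched with a toy example. The code below is emphatically not the 3Di alphabet, which is learned from data and is considerably richer. It is a made-up three-letter scheme: each residue gets a letter according to the distance from its alpha carbon to the nearest residue that is at least three positions away in the chain. The function name `toy_structure_string`, the thresholds, and the coordinates are all assumptions.

```python
import math


def toy_structure_string(ca_coords, min_separation=3):
    """Encode each residue as 'A', 'B', or 'C' by its nearest non-local neighbor distance.

    Toy scheme (NOT 3Di): A = closer than 5, B = closer than 8, C = everything else.
    """
    letters = []
    for i, ci in enumerate(ca_coords):
        distances = [
            math.dist(ci, cj)
            for j, cj in enumerate(ca_coords)
            if abs(i - j) >= min_separation
        ]
        nearest = min(distances) if distances else math.inf
        if nearest < 5.0:
            letters.append("A")
        elif nearest < 8.0:
            letters.append("B")
        else:
            letters.append("C")
    return "".join(letters)


# Invented coordinates: a chain that folds back on itself, and a straight chain.
hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 4.5, 0.0), (3.8, 4.5, 0.0), (0.0, 4.5, 0.0),
]
extended = [(3.8 * i, 0.0, 0.0) for i in range(5)]

hairpin_string = toy_structure_string(hairpin)
extended_string = toy_structure_string(extended)
print(hairpin_string, extended_string)
```

Once a structure has been reduced to a string like this, ordinary text-comparison methods can be applied to it.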
Comparing three-dimensional protein structures
People have been developing algorithms to compare protein structures for decades. The first protein structure, that of myoglobin, was solved in 1958. By the 1970s, researchers were developing approaches for aligning one experimentally determined structure with another. As computing power grew, the structure of a protein could be compared to hundreds of thousands of structures in a database.
Over the past couple of years, however, the scale of the protein structure comparison problem has changed dramatically. Now that we can predict the structure of most proteins with relative accuracy, we have hundreds of millions of predicted protein structures to compare. This massive increase in data has necessitated the development of new tools that strike a balance between sensitivity and speed.
TM-align, which was developed in 2005, is one example of an algorithm that works well for smaller datasets but not for massive databases. TM-align produces an optimized alignment between the full-length structures of two proteins based on the position of the alpha carbon in each amino acid. The tool uses heuristic dynamic programming, which is computationally expensive: using TM-align to compare a single structure to a database containing 100 million structures would take one CPU core a month to complete.
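TM-align ranks candidate alignments with the TM-score, which rewards pairs of aligned alpha carbons that sit close together after superposition. The sketch below computes only the score for a given list of distances, using the standard formula with the length-dependent scale d0 = 1.24 * (L - 15)^(1/3) - 1.8. It does not perform the superposition or alignment search, which is the expensive part. The function names and the 0.5 floor on d0 for short proteins are assumptions of this sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale used by the TM-score (floored at 0.5 here)."""
    if target_length <= 15:
        return 0.5
    return max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(aligned_distances, target_length):
    """TM-score for aligned residue pairs.

    `aligned_distances` holds the distance between each aligned pair of alpha
    carbons after superposition; unaligned residues contribute nothing.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in aligned_distances)
    return total / target_length


# Invented examples for a 100-residue target protein.
perfect = tm_score([0.0] * 100, 100)   # every residue aligned with no deviation
half = tm_score([0.0] * 50, 100)       # only half of the residues aligned
loose = tm_score([4.0] * 100, 100)     # all aligned, but 4 units apart
print(perfect, half, loose)
```

Because every aligned pair contributes a value between 0 and 1, the score stays between 0 and 1 regardless of protein length, which makes scores comparable across proteins of different sizes.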
One approach to reducing the computational complexity of protein structure comparisons is to borrow from sequence comparison tools like BLAST. Sequence comparison tools find the optimal alignment between two or more strings of letters, with each string representing the sequence of a protein and each letter representing an amino acid. This process is much less computationally demanding than the structural alignments performed by tools like TM-align.
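The core of string-based comparison can be sketched with a textbook local-alignment score (the Smith-Waterman recurrence). This is not how BLAST is implemented; BLAST adds heuristics and substitution matrices on top of the same idea. The function name and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (Smith-Waterman, linear gaps)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the 0 option.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


identical = local_alignment_score("HEAGAWGHEE", "HEAGAWGHEE")
shared_motif = local_alignment_score("XXHEAGYY", "ZZHEAGWW")
unrelated = local_alignment_score("AAAA", "TTTT")
print(identical, shared_motif, unrelated)
```

Each cell of the table depends only on three neighbors, so the work grows with the product of the two string lengths and involves no three-dimensional geometry.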
Foldseek, developed in 2023, turns the three-dimensional structural alignment problem into a one-dimensional sequence alignment problem. As mentioned above, Foldseek uses a “3Di” alphabet to represent the 3D structure of a protein. In this case, each string represents the structure of a protein and each letter represents the conformation of adjacent amino acids. Foldseek then uses a sequence comparison algorithm (a modified version of MMseqs2) to align the 3Di sequences. Foldseek is reportedly 88% as sensitive as TM-align but four to five orders of magnitude faster.
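A back-of-the-envelope calculation shows what a speedup of this size could mean in practice. It uses only figures quoted in this article: one CPU-month for a TM-align search against 100 million structures, and a speedup of four to five orders of magnitude. The 30-day month is an assumption, and the results are rough estimates, not benchmarks.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month
DATABASE_SIZE = 100_000_000            # structures in the database

# Implied average TM-align cost per pairwise comparison on one CPU core.
tm_align_seconds_per_pair = SECONDS_PER_MONTH / DATABASE_SIZE

# The same full-database search, sped up by 10^4 and by 10^5.
estimated_search_seconds = {
    exponent: SECONDS_PER_MONTH / 10 ** exponent for exponent in (4, 5)
}

print(f"TM-align: about {tm_align_seconds_per_pair * 1000:.0f} ms per comparison")
for exponent, seconds in estimated_search_seconds.items():
    print(f"10^{exponent} times faster: about {seconds:.0f} s for the whole search")
```

Under these assumptions, a search that would occupy a CPU core for a month shrinks to somewhere between roughly half a minute and a few minutes.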
Another modern approach to identifying homologous proteins is the application of Natural Language Processing (NLP), a type of artificial intelligence that processes and extracts meaning from language. Protein sequences are well-suited for NLP because they are analogous to written language in many ways: amino acids are represented by letters, motifs/domains are similar to words, and the entire sequence can be thought of as a sentence. Large language models applied to proteins are referred to as protein language models. One such model, Protein structure-sequence T5 (ProstT5), bridges protein sequence and structure by translating a protein’s amino acid sequence to a 3Di sequence (and vice versa).
Deep learning is another field of artificial intelligence that can be applied to protein structural alignment. For example, the Sequence Alignments from deep-Learning of Structural Alignments (SAdLSA) tool, developed in 2021, is trained on the alignments of tens of thousands of experimentally determined protein structures. After learning to recognize patterns in this data, SAdLSA can predict the structural alignment of proteins based solely on their amino acid sequences.
As mentioned above, applying machine learning tools developed for other fields often requires converting protein structural data into more generic data types, such as graphs. The Graph-based protein Structure Representation (GraSR) tool, developed in 2022, constructs a graph of protein structure based on the coordinates of the alpha carbon in each amino acid. GraSR uses multiple neural networks to learn the geometric features of a protein and can compare protein structures without the need for alignment.
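Alignment-free comparison generally means summarizing each structure as a fixed-length vector and then comparing the vectors directly. The sketch below illustrates that idea with a hand-crafted descriptor (a histogram of pairwise alpha-carbon distances) and cosine similarity. It is not GraSR, whose descriptors are learned by neural networks; the function names, bin settings, and coordinates are assumptions for illustration.

```python
import math


def distance_histogram(ca_coords, bin_width=4.0, n_bins=6):
    """Summarize a structure as a normalized histogram of pairwise CA distances."""
    counts = [0] * n_bins
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            distance = math.dist(ca_coords[i], ca_coords[j])
            counts[min(int(distance // bin_width), n_bins - 1)] += 1
    total = sum(counts) or 1
    return [count / total for count in counts]


def cosine_similarity(u, v):
    """Cosine of the angle between two descriptor vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented coordinates: a straight chain, a shifted copy of it, and a hairpin.
extended = [(3.8 * i, 0.0, 0.0) for i in range(6)]
shifted = [(10.0, 3.8 * i + 5.0, -2.0) for i in range(6)]
hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 4.5, 0.0), (3.8, 4.5, 0.0), (0.0, 4.5, 0.0),
]

same_shape = cosine_similarity(distance_histogram(extended), distance_histogram(shifted))
different_shape = cosine_similarity(distance_histogram(extended), distance_histogram(hairpin))
print(same_shape, different_shape)
```

Because pairwise distances do not change when a structure is moved or reoriented, the two copies of the straight chain receive the same descriptor without any superposition step.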
Computer vision can also be used to compare protein structures. The Protein Cavity Registration (ProCare) tool, developed in 2020, takes a computer vision-based approach to identify similarities in the shapes of potential ligand binding sites on protein structures. ProCare uses a technique called point cloud registration to align and compare protein cavities in order to identify potential drug targets.
As powerful as these new tools are, they should also be used with caution. For example, tools that predict protein structures are generally trained on biological datasets and may not accurately predict the structures of engineered proteins. Tools that use predicted protein structures assume these structures are correct; this is not always the case.
The approaches discussed here generally compare the entire structures of monomeric proteins. Tools that identify proteins with partial structural similarity (e.g., proteins with shared domains) also have the potential to inform our understanding of protein function. As our ability to predict the structures of protein complexes improves, our need for tools that can compare the structures of protein complexes will also increase.
Closing thoughts
Just as organisms evolve from pre-existing organisms, proteins tend to evolve from pre-existing proteins. In theory, all (or nearly all) natural proteins should share at least partial homology with other proteins. The ability to identify homologous proteins throughout the tree of life would enable us to identify corresponding molecular pathways in different species, as well as pathways within an organism that have evolved through gene duplication. An improved ability to predict protein function would accelerate the study of diseases and potential treatments. The best method for comparing protein structures is still an area of active research, but the creative application of artificial intelligence methodologies developed for other fields is already yielding promising results.