Using AI to help describe protein function

Beth Graham
9 min read · Feb 22, 2024


Proteins are the molecular machines of the cell. Understanding protein function is necessary for understanding how cells work — and for understanding why cellular systems break down. At present, however, the functions of most proteins remain poorly understood.

Over the past several years, increasingly powerful experimental and computational approaches have enabled us to analyze complex data in novel ways. Given the resources we now have at our disposal, how can we use artificial intelligence to maximize the efficiency with which we characterize protein function?

From “protein” to “proteoform”

In the post-genomics era, the meaning of the word “protein” is more complex than it used to be. For the purposes of characterizing protein function, a “protein” can be thought of as the family of related protein products produced from the same genetic locus. At present, we tend to represent each protein with a canonical sequence. For example, the protein “Interferon-induced, double-stranded RNA-activated protein kinase,” or PKR, is produced from the EIF2AK2 gene on human chromosome 2. However, thanks to factors like sequence variation, RNA splicing, and post-translational modifications, many different versions of PKR exist. Each individual protein product from the same gene is referred to as a “proteoform.” The human genome is thought to contain approximately 20,000 protein-coding genes, but the number of proteoforms is probably in the millions.

Because different proteoforms of the same protein may have different functions, we should probably characterize function at the level of the proteoform rather than at the level of the protein.

The term proteoform was developed to describe each of the modified versions of a protein that can be produced from the same gene. Image source: Smith, et al. “The Human Proteoform Project: Defining the human proteome.” Science Advances (2021)

Defining “function”

In colloquial terms, the function of a proteoform is the “job” the molecule does for a cell or organism. At a fundamental level, the function of a proteoform can be defined as the way the proteoform interacts with one or more molecules. Function can also be defined on more of a “systems” level: for example, a proteoform may participate in a specific molecular pathway or may perform a role that benefits the cell or organism as a whole. A single proteoform can — and often does — have multiple functions.

Let’s limit this discussion to proteins that are naturally found in people. In a world with unlimited resources, our ultimate goal would be to characterize every function of every proteoform in every context in which it naturally appears in the human body. In addition, we would be able to predict the effect that a change in structure or expression would have on the function of the proteoform and the system as a whole. In theory, this would allow us to repair or improve upon existing biological processes.

Achieving the goal of even partially characterizing the structures, localizations, and interactions of millions of human proteoforms will require a combination of high-throughput proteomics techniques and advanced computational models.

A brief overview of artificial intelligence

Increasingly, molecular biologists need to consider the capabilities and limitations of artificial intelligence when designing their experiments. For molecular biologists who do not come from a computer science background, it can be difficult to understand the relationships between artificial intelligence, machine learning, neural networks, deep learning, and generative AI. As I’ll explain below, it’s actually fairly simple: each of these terms is nested within the previous one.

Many of the buzzwords related to artificial intelligence can be organized into a nested hierarchy: generative AI is an application of deep learning, deep learning is a type of machine learning that involves the use of neural networks, and machine learning is a type of artificial intelligence.

“Artificial intelligence” is a broad term: it refers to the ability of a machine (e.g. a computer) to simulate human intelligence. There are many possible approaches to creating artificially intelligent computer programs: researchers should choose which approach(es) to use based on the characteristics and quantity of the data they wish to analyze, as well as the type of analysis they wish to perform.

One way to categorize the various approaches to artificial intelligence is by whether or not they involve machine learning. “Traditional” approaches to artificial intelligence require programmers to explicitly define a set of rules that dictate a program’s behavior; any changes to the algorithm require human intervention. In contrast, machine learning algorithms “learn” how to improve their performance on a task as they gain experience with analyzing input data; the algorithm can basically alter itself. Traditional AI is becoming less common in molecular biology as approaches that involve machine learning gain traction.
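
To make this distinction concrete, here is a toy sketch in Python: a hand-written rule versus a model that derives its own rule from labeled examples. The feature (“hydrophobic run length”), the cutoff, and the data are all invented purely for illustration.

```python
# A toy contrast between rule-based AI and machine learning.
# The feature, cutoff, and data are invented for illustration only.
from sklearn.tree import DecisionTreeClassifier

# "Traditional" AI: a human explicitly encodes the decision rule.
def is_membrane_protein_rule(hydrophobic_run_length: int) -> bool:
    return hydrophobic_run_length >= 18  # hand-picked cutoff

# Machine learning: the model derives its own rule from labeled examples.
X = [[5], [12], [19], [23], [8], [21]]  # hydrophobic run lengths (toy data)
y = [0, 0, 1, 1, 0, 1]                  # 1 = membrane protein (toy labels)
model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[20]]))            # the cutoff was learned, not hard-coded
```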

Like “artificial intelligence,” “machine learning” is an umbrella term that includes a variety of approaches. One way to categorize these approaches is by whether or not they involve the use of neural networks, computational systems inspired by the way that information is transmitted between neurons in the brain. “Traditional” machine learning algorithms (e.g. decision trees, random forests, nearest neighbor models, and support vector machines) do not rely on neural networks. Many traditional machine learning algorithms are still highly relevant to contemporary proteomics.
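
As a simplified illustration of what a traditional machine learning workflow looks like, the sketch below trains a random forest with scikit-learn. The features and labels are synthetic stand-ins for real proteomic measurements.

```python
# A minimal "traditional" machine learning workflow with scikit-learn.
# Features and labels are synthetic stand-ins for real proteomic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # 500 proteins x 4 numeric features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```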

Machine learning algorithms involving neural networks are often more powerful than traditional machine learning approaches, but they are also more computationally expensive and their results can be more difficult to interpret. A neural network consists of interconnected “nodes” that are often organized into layers, including an input layer, an output layer, and one or more “hidden” layers sandwiched in the middle. Neural networks composed of only a few layers are considered “shallow.” “Deep learning” involves neural networks composed of many layers of nodes; it is used in computer vision, natural language processing, speech recognition, and search engines. Generative AI is a type of deep learning model that can “remix” raw data to produce something new.

Machine learning methods involving neural networks include [A] multilayer perceptrons, [B] convolutional neural networks (CNN), [C] recurrent neural networks (RNN), [D] graph convolutional networks (GCN), and [E] autoencoders. Image source: Greener, Joe G., et al. “A guide to machine learning for biologists.” Nature Reviews Molecular Cell Biology 23.1 (2022): 40–55.
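
To make the layered structure described above concrete, here is a minimal sketch of a “shallow” versus a “deep” multilayer perceptron in PyTorch. The layer sizes are arbitrary choices for illustration, not a recommendation.

```python
# A shallow vs. a deep multilayer perceptron in PyTorch.
# Layer sizes are arbitrary; a real model would be sized to the data.
import torch.nn as nn

# "Shallow" network: input layer -> one hidden layer -> output layer.
shallow = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),   # input -> hidden
    nn.Linear(64, 2),                # hidden -> output
)

# "Deep" network: many hidden layers sandwiched between input and output.
deep = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64),  nn.ReLU(),
    nn.Linear(64, 2),
)
print(shallow, deep, sep="\n")
```

Note that this sketch only defines the architectures; in practice, frameworks like PyTorch also handle the training loop (backpropagation), which is omitted here.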

Generally speaking, researchers should choose the most efficient type of data analysis that is suited to the problem at hand; using artificial intelligence to analyze data often isn’t necessary. When artificial intelligence is applied to modern proteomics, it generally entails machine learning. Traditional machine learning approaches may be useful for smaller and/or less complex data sets, for decreasing the time and cost of data analysis, or in cases where the computational model needs to be easily understood. The scikit-learn library in Python and the caret package in R can be used for traditional machine learning applications.

Deep learning approaches should be reserved for large and/or complex data sets, such as data sets with millions of points or where each data point has many features. Deep learning has numerous applications in modern proteomics: for example, many deep learning approaches are being applied to the study of protein structure, while generative AI has potential applications in synthetic biology and drug design. The PyTorch and TensorFlow (with Keras) libraries in Python can be used for deep learning.

Another way to classify the various approaches to machine learning is by whether they involve supervised, unsupervised, or reinforcement learning. Which approach is used to analyze proteomic data depends on the nature of that data and the questions being asked. In supervised machine learning, algorithms “learn” how to recognize features by analyzing labeled data sets; supervised machine learning can be used to classify data points or to make predictions via regression. In unsupervised machine learning, algorithms identify patterns in unlabeled data sets; for example, unsupervised machine learning can be used to cluster data. Some models use a combination of both supervised and unsupervised learning. Most neural networks involve supervised learning, although some (e.g. autoencoders) involve unsupervised learning.

Unlike supervised and unsupervised learning, reinforcement learning doesn’t involve a model that is trained on an input data set. Instead, reinforcement learning improves a model through “trial and error”: the model makes an “observation” about the state of various variables, uses this information to take an “action” or make a decision, receives “feedback” on its performance, and adjusts itself accordingly. Reinforcement learning is often used in robotics, but it also has applications in molecular biology; for example, some researchers are exploring the use of reinforcement learning in protein design.
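
The toy sketch below contrasts the first two paradigms on the same synthetic data: a supervised classifier that learns from labeled examples, and an unsupervised clustering algorithm that finds structure without any labels. The data are invented for illustration; real inputs might be, say, protein expression profiles.

```python
# Supervised vs. unsupervised learning on the same synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)  # labels exist only in the supervised case

# Supervised: learn a mapping from labeled examples, then classify new points.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[4.0, 4.0]]))

# Unsupervised: find structure (here, two clusters) without any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print(km.labels_[:5], km.labels_[-5:])
```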

Using AI to help identify every human proteoform

Advances in machine learning have coincided with advances in our ability to collect large amounts of proteomic data. As artificial intelligence infiltrates molecular biology, the scientific process is becoming an iterative, alternating cycle of real-world experiments and computational analyses. Experimental results are used to train computational models, computational models are used to make predictions about the functional characteristics of molecules, and predictions are used to inform the design of future experiments. A combination of carefully designed experiments and strategically chosen analyses should greatly accelerate our ability to characterize protein function.

One of the first steps toward characterizing the function of every human proteoform would be identifying what those proteoforms are. The Human Proteoform Project, an initiative spearheaded by the Consortium for Top-Down Proteomics, aims to create a definitive list of all of the proteoforms found in every type of human cell. According to the consortium’s paper, doing this will require the development of new technologies, including advancements in mass spectrometry; it will also require us to finish identifying all of the types of human cells.

How can we use artificial intelligence to accelerate the development of a comprehensive list of human proteoforms? Although an infinite variety of mutations and modifications could exist, natural selection tends to impose constraints on which versions of a protein actually appear in nature. By using experimental techniques to characterize the various proteoforms produced from a subset of human genes, we should be able to train models that allow us to predict the variety of protein products that are likely to be produced from any given gene.

Image source: Aebersold, Ruedi, et al. “How many human proteoforms are there?” Nature Chemical Biology 14.3 (2018): 206–214.

One way computational models can be used to help characterize proteoforms is by predicting the most probable sites of variation in protein sequence. Thanks to advances in machine learning and high-throughput experimental methods, we are increasingly able to predict where genetic sequence variation is likely to occur, how proteins tend to evolve over time, and what effect a change in sequence will have on protein function.

Computational models can also be used to detect and predict sites of post-translational modification. These alterations to a protein during or after synthesis — e.g. phosphorylation, ubiquitination, etc. — can have a large effect on protein function. For example, post-translational modifications can help regulate pathways and have been implicated in a variety of diseases. Hundreds of types of post-translational modification have been identified, and more are still being discovered. A single protein may be modified in more than one way, including multiple instances of the same modification. As a result, post-translational modifications can greatly increase the number of proteoforms that correspond to a given protein. These modifications are often identified through mass spectrometry, although current techniques often lack sensitivity in detecting and quantifying modified proteins. This sensitivity can be improved by using neural networks to predict the fragmentation spectra of modified peptides, then comparing this “library” of predicted spectra to experimental data. Experimental data on protein modification sites can be used to train deep learning models, which in turn can predict possible sites of modification in other proteins.
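
As a loose sketch of that last idea, the snippet below one-hot encodes short peptide windows around candidate residues and fits a classifier to predict modification sites. The sequences, labels, window size, and choice of model are all hypothetical; real training data would come from mass spectrometry experiments.

```python
# A hedged sketch: train a model on known modification sites so it can
# score candidate sites elsewhere. All sequences and labels are invented.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(window: str) -> np.ndarray:
    """One-hot encode a peptide window (e.g. 7 residues around a Ser/Thr)."""
    vec = np.zeros((len(window), len(AMINO_ACIDS)))
    for i, aa in enumerate(window):
        vec[i, AMINO_ACIDS.index(aa)] = 1.0
    return vec.ravel()

# Toy examples: 7-residue windows with a hypothetical phosphorylation label.
windows = ["AARSPKD", "LLKSDDE", "GGASAAG", "PPRSPRR", "VVVSLLL", "EEKSEEE"]
labels  = [1, 1, 0, 1, 0, 1]

X = np.array([encode_window(w) for w in windows])
model = GradientBoostingClassifier().fit(X, labels)
print(model.predict([encode_window("KKRSPQD")]))  # score a new candidate site
```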

Alternate transcriptional start sites, alternative splicing, and other factors also contribute to proteoform diversity. As with the topics above, we can use experimental data (RNA-seq, ribosome profiling, etc.) from characterized proteins to train computational models that will predict sources of variation in uncharacterized proteins. Even variation due to errors in transcription, translation, post-transcriptional modification, and protein folding may follow trends that can be predicted with the help of experimental data in combination with artificial intelligence.

Using AI to help characterize protein function

Identifying a comprehensive list of human proteoforms is just the first step in characterizing protein function; we also need to describe their various structures, localizations, and interactions, including the organization of these interactions into complex pathways. Acquiring this information will require countless experimental and computational approaches, including many that have not yet been developed. The general idea behind the integration of experimental biology and artificial intelligence, however, is likely to be the same: experiments will produce large quantities of proteomic data; machine learning will be used to classify, cluster, make predictions about, and/or make decisions based on this data; and the outputs of these models will be tested and applied in subsequent experiments.

Over time, the accuracy of computational models may approach or even exceed the accuracy of experimental results. Given enough computational power, we should ultimately be able to use experimental data and the laws of physics to produce computational models of cells that allow us to make accurate predictions about protein function without needing to return to the laboratory; we have a long way to go before we achieve that goal, but the distance we have come in recent years is remarkable.
