Project

Investigation of Multitask Learning for Fine-Tuning Protein Language Models on Mutagenesis Studies

During protein engineering, many mutants have to be synthesized and tested experimentally. As this process is time and resource intensive and the most common bottleneck in such studies, faster methods are required to improve the efficiency of mutagenesis studies. In recent years, researchers have investigated the use of machine learning for this purpose. In addition, the use of large neural networks from the field of natural language processing has enabled recent breakthroughs in protein structure prediction and related tasks. We proposed to combine these two approaches by fine-tuning protein language models on mutagenesis study data. This transfers knowledge from billions of unlabeled protein sequences to the few mutants at hand, thereby improving the performance of mutation effect prediction. Because fine-tuning on few mutants leads to severe overfitting, we additionally proposed to fine-tune the protein language model on many datasets from different mutagenesis studies simultaneously in order to regularize the model. These large models are computationally expensive to fine-tune and therefore required HPC resources for our experiments. The use of high-performance GPUs allowed us to accelerate our experiments.

Project Details

Project term

May 1, 2022–January 31, 2023

Affiliations

Forschungszentrum Jülich

Institute

Jülich Supercomputing Centre

Principal Investigator

Birgit Strodel

Researchers

Tilman Hoffbauer

Methods

To transfer knowledge from large collections of unlabeled protein sequences, Facebook AI created a protein language model that tries to predict masked amino acids from their context. This forces the model to learn the grammar of natural amino acid sequences and thus to encode valuable information in its parameters. To use this knowledge for predicting target values of mutants of a specific protein, two competing approaches exist:

• Fine-tuning: First, one may take the existing model, reset the parameters of its last layer, and train it on a set of known mutant / target value pairs.
• Embedding: Second, one may use an intermediate representation derived by the protein language model, i.e. the activation of the last layer, as an input to a downstream method (see the sketch after this list).
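
To illustrate the embedding approach, the following sketch extracts per-sequence representations with the publicly available fair-esm package. The chosen checkpoint, the placeholder sequences, and the mean pooling over residues are illustrative assumptions, not the exact setup of our study.

    import torch
    import esm  # pip install fair-esm

    # Load a pretrained protein language model and its batch converter (assumed checkpoint).
    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    # Hypothetical wild-type and mutant sequences of the protein under study.
    data = [
        ("wildtype", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
        ("mutant_Y5G", "MKTAGIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ]
    _, _, batch_tokens = batch_converter(data)
    batch_lens = (batch_tokens != alphabet.padding_idx).sum(1)

    # Per-residue activations of the last layer (layer 33 for this model).
    with torch.no_grad():
        out = model(batch_tokens, repr_layers=[33])
    token_reprs = out["representations"][33]

    # Mean-pool over residue positions (excluding special tokens) to get one vector per sequence.
    embeddings = torch.stack(
        [token_reprs[i, 1 : n - 1].mean(0) for i, n in enumerate(batch_lens)]
    )

These fixed-length vectors can then be passed to any downstream regressor without updating the language model itself.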

In this work, we tested an extension of the first approach by simultaneously fine-tuning the model on multiple datasets corresponding to different proteins. To this end, we extended the protein language model with a protein-specific last layer for each dataset. This forces the model to learn a shared representation for all proteins, which is then evaluated by the small, task-specific last layers. We hypothesized that this would increase the performance of the model on each individual dataset compared to a single-task fine-tuning approach.
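
A minimal sketch of this architecture in PyTorch is given below. The encoder stands in for the shared protein language model (here replaced by a small placeholder module), and all names, dimensions, and dataset labels are illustrative assumptions rather than our actual implementation.

    import torch
    import torch.nn as nn

    class MultiTaskRegressor(nn.Module):
        """Shared encoder with one small regression head per mutagenesis dataset (sketch)."""

        def __init__(self, encoder: nn.Module, embed_dim: int, dataset_names: list[str]):
            super().__init__()
            self.encoder = encoder  # shared across all datasets, e.g. a protein language model
            # One protein-specific last layer per dataset; only this part is task-specific.
            self.heads = nn.ModuleDict({name: nn.Linear(embed_dim, 1) for name in dataset_names})

        def forward(self, tokens: torch.Tensor, dataset: str) -> torch.Tensor:
            shared = self.encoder(tokens)                   # (batch, embed_dim) shared representation
            return self.heads[dataset](shared).squeeze(-1)  # per-dataset target prediction

    # Toy usage with a placeholder encoder instead of a real language model.
    encoder = nn.Sequential(nn.Embedding(33, 64), nn.Flatten(1), nn.LazyLinear(64))
    model = MultiTaskRegressor(encoder, embed_dim=64, dataset_names=["GFP", "beta_lactamase"])
    tokens = torch.randint(0, 33, (8, 50))   # 8 dummy token sequences of length 50
    preds = model(tokens, dataset="GFP")     # (8,) predicted target values

During training, batches from all datasets update the shared encoder, while each head only receives gradients from its own dataset.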

Results

In a first experiment, we compared fine-tuning one single-task model per dataset with a multi-task model shared across all datasets. Unfortunately, the single-task models outperformed the multi-task model. We additionally repeated this experiment on subsets of the dataset collection, e.g. including only proteins of the same type, which did not provide any benefit either. Finally, we compared the fine-tuning approach to the embedding approach using a Gaussian process as the downstream model on small subsamples of each dataset. In these cases, the Gaussian process usually outperformed the fine-tuning approach.
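
For reference, the embedding-plus-Gaussian-process baseline can be sketched roughly as follows with scikit-learn; the kernel, hyperparameters, and randomly generated stand-in embeddings are assumptions for illustration only.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    # Stand-ins for language-model embeddings of mutants and their measured target values.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(48, 1280))   # 48 mutants, 1280-dimensional embeddings
    y_train = rng.normal(size=48)           # experimentally measured target values
    X_test = rng.normal(size=(8, 1280))     # embeddings of candidate mutants

    # Gaussian process regression on fixed embeddings; the language model is not fine-tuned.
    kernel = ConstantKernel() * RBF(length_scale=10.0)
    gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-3, normalize_y=True)
    gp.fit(X_train, y_train)
    mean, std = gp.predict(X_test, return_std=True)  # predictions with uncertainty estimates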

Discussion

Contrary to our expectations, the multi-task fine-tuning approach did not improve over single-task fine-tuning or over the embedding approach with a Gaussian process as the downstream model. We therefore did not expect significant improvements from further research and more detailed comparisons. Instead, we further explored the Gaussian process approach, which requires significantly fewer computing resources. Thus, we did not use the full computing time budget provided.

Additional Project Information

DFG classification: 409-05 Interactive and Intelligent Systems, Image and Language Processing, Computer Graphics and Visualisation, 201-07 Bioinformatics and Theoretical Biology
Software: CUDA, NumPy
Cluster: CLAIX

Publications

Tilman Hoffbauer and Birgit Strodel, TransMEP: Transfer learning on large protein language models to predict mutation effects of proteins from a small known dataset, preprint