Recently, the Liang Hong group introduced PRIME (PRotein language model for Intelligent Masked pretraining and Environment prediction; Fig. 1), a novel protein language model (PLM), in Science Advances [1]. PRIME is a deep learning model designed to predict and improve protein stability and activity without relying on experimental mutagenesis data. The approach leverages a dataset of 96 million protein sequences, each annotated with the optimal growth temperature (OGT) of its host bacterium, to develop a model that guides protein engineering across various applications.
Protein engineering for pharmaceutical and industrial applications faces several major challenges. Traditional methods, such as directed evolution and rational design, typically demand extensive experimental screening or deep mechanistic insights into protein structures and functions [2,3]. In recent years, PLMs have emerged as promising tools for protein engineering [4]. However, many existing PLMs struggle to recommend mutations that enhance both stability and activity, two critical properties for engineered proteins.
PRIME addresses these challenges with a data-driven approach that predicts promising mutations to increase both stability and activity without relying on experimental data. The model's architecture is built on a transformer-based encoder, augmented with two specialized modules: one for Masked Language Modeling (MLM) [5] and another for OGT prediction [6]. This setup enables the model to capture the relationship between sequences and the temperature-related attributes that are crucial for protein stability and function, making it particularly advantageous for engineering industrial enzymes or proteins that need high-temperature tolerance and resilience in practical applications.
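A two-module design like the one described above is often trained with a joint objective. The following is a minimal sketch under stated assumptions, not the authors' implementation: the squared-error form of the OGT term, the `ogt_weight` parameter, and all function names are hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(logits_per_position, true_tokens, masked_positions):
    """Mean cross-entropy over the masked positions only."""
    losses = []
    for pos in masked_positions:
        probs = softmax(logits_per_position[pos])
        losses.append(-math.log(probs[true_tokens[pos]]))
    return sum(losses) / len(losses)

def ogt_loss(predicted_ogt, true_ogt):
    """Squared error between predicted and annotated OGT (assumed loss form)."""
    return (predicted_ogt - true_ogt) ** 2

def combined_loss(logits_per_position, true_tokens, masked_positions,
                  predicted_ogt, true_ogt, ogt_weight=0.01):
    """Hypothetical joint objective: MLM term plus a weighted OGT term."""
    return (mlm_loss(logits_per_position, true_tokens, masked_positions)
            + ogt_weight * ogt_loss(predicted_ogt, true_ogt))
```

Sharing one encoder between both terms is what would let temperature information shape the sequence representations; the relative weighting of the two terms is a tuning choice that the text above does not specify.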
One of the most notable strengths of PRIME lies in its “zero-shot” capability, which allows it to identify beneficial mutations for a given protein without any experimental data. The authors compared PRIME's zero-shot performance against several state-of-the-art models, including deep learning approaches such as SaProt [7] and Stability Oracle [8], as well as traditional computational methods like GEMME [9] and Rosetta [10].
Across 283 protein assays, PRIME demonstrated superior performance in predicting changes in melting temperature (ΔTm) and excelled in the ProteinGym benchmark [11], which encompasses diverse protein properties including catalytic activity, binding affinity, stability, and fluorescence intensity. Notably, PRIME achieved a score of 0.486 on the ProteinGym benchmark, significantly surpassing the second-best model, SaProt, which scored 0.457 (P = 1 × 10⁻⁴, Wilcoxon test).
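To make the zero-shot evaluation concrete, the sketch below scores substitutions with a log-odds rule that is a common convention for masked language models, then compares the scores with measurements using a Spearman rank correlation. Both choices are assumptions: the text above does not state PRIME's scoring rule, and the benchmark score is assumed to be a rank correlation, as is usual for ProteinGym. All probabilities and ΔTm values are hypothetical toy numbers.

```python
import math

def log_odds_score(position_probs, wild_type, mutant):
    """Zero-shot score for one substitution: log p(mutant) - log p(wild type).

    position_probs maps amino-acid letters to the probabilities a masked
    language model assigns at the masked position (toy values here).
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])

def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical model output at one masked position whose wild-type residue is A.
probs = {"A": 0.10, "L": 0.40, "V": 0.30, "G": 0.20}
scores = [log_odds_score(probs, "A", m) for m in ("L", "V", "G")]
measured_dtm = [2.1, 0.4, 1.0]  # hypothetical measured ΔTm values
rho = spearman(scores, measured_dtm)
print(rho)  # 0.5 for these toy numbers
```

A rank-based metric only asks whether the model orders mutants correctly, which is why a score computed without any experimental data can still be compared against assays that measure very different properties.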
To validate PRIME's efficacy, the authors conducted wet-lab experiments on five distinct proteins: LbCas12a, T7 RNA polymerase, creatinase, a nonnatural nucleic acid polymerase, and the variable domain of the heavy chain of a nano-antibody against growth hormone (VHH). PRIME was used to select top-ranking single-site mutants for each protein. Remarkably, over 30% of these mutants showed notable improvements in properties such as thermostability, catalytic activity, binding affinity, resilience to extreme alkaline conditions, or the ability to polymerize nonnatural nucleic acids.
The effectiveness of PRIME was further demonstrated through the engineering of LbCas12a and T7 RNA polymerase. For LbCas12a, a complex multidomain protein of 1228 amino acids, PRIME guided an iterative optimization process through three rounds of mutagenesis and experimental validation. In the final round, all 30 multisite mutants exhibited higher melting temperatures (Tm) than the wild type. The best-performing eight-site mutant achieved a Tm of 48.15 ℃, a 6.25 ℃ improvement over the wild type. The engineering of T7 RNA polymerase further showcased PRIME's capabilities. Aiming to enhance the enzyme's thermostability for applications such as mRNA vaccine production and isothermal amplification detection techniques, the team conducted AI-guided mutagenesis and wet-lab validation of 95 mutants. This process yielded a 12-site mutant with a melting temperature 12.8 ℃ higher than that of the wild type.
Notably, in both the LbCas12a and T7 RNA polymerase projects, PRIME was able to combine certain individually negative single-site mutations into positive multisite mutants. Such epistatic insights are typically elusive in conventional protein engineering but proved crucial for generating superior variants.
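One simple way to quantify the epistasis described above is the gap between a multisite mutant's measured effect and the sum of its single-site effects. The sketch below assumes an additive null model; the function name and all ΔTm values are hypothetical and are not taken from the paper.

```python
def epistasis(single_effects, combined_effect):
    """Deviation of a multisite mutant from the additive expectation.

    single_effects: ΔTm of each constituent single-site mutant (in ℃).
    combined_effect: measured ΔTm of the multisite mutant (in ℃).
    Positive values mean the combination outperforms the sum of its parts.
    """
    return combined_effect - sum(single_effects)

# Hypothetical numbers: the first single-site mutation is destabilizing alone.
singles = [-0.5, 1.2, 0.8]
combined = 3.0
print(round(epistasis(singles, combined), 2))  # 1.5
```

Under a purely additive model the destabilizing mutation would never be selected, which is why picking it for a combination requires a model that captures interactions between sites.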
These case studies illustrate PRIME's efficiency in protein engineering. PRIME was able to guide the development of notably improved enzyme variants in just a few rounds of mutagenesis. This approach not only enhances the precision of protein engineering but also substantially reduces the time and resources required for experimental validation.
Still, several limitations warrant further exploration. The reliance of PRIME on bacterial OGTs may restrict its applicability to certain protein families. Additionally, integrating structural information or combining PRIME with other computational methods could expand its applications in drug development, enzyme design, and synthetic biology. As researchers continue to refine and adapt PRIME, it holds great promise for transforming how we discover, design, and optimize proteins in a growing range of industrial and pharmaceutical applications.