Deep Learning for Predicting Biomolecular Binding Sites of Proteins

Minjie Mou^†, Zhichao Zhang^†, Ziqi Pan, Feng Zhu^*

Research. Vol 8 Article ID 0615

• Perspective •

Affiliations

College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China.

Published: 2025-02-24 doi: 10.34133/research.0615

Abstract

The rapid evolution of deep learning has markedly enhanced protein–biomolecule binding site prediction, offering insights essential for drug discovery, mutation analysis, and molecular biology. Advancements in both sequence-based and structure-based methods demonstrate their distinct strengths and limitations. Sequence-based approaches offer efficiency and adaptability, while structure-based techniques provide spatial precision but require high-quality structural data. Emerging trends in hybrid models that combine multimodal data, such as integrating sequence and structural information, along with innovations in geometric deep learning, present promising directions for improving prediction accuracy. This perspective summarizes challenges such as computational demands and dynamic modeling and proposes strategies for future research. The ultimate goal is the development of computationally efficient and flexible models capable of capturing the complexity of real-world biomolecular interactions, thereby broadening the scope and applicability of binding site predictions across a wide range of biomedical contexts.

Cite this Article

Minjie Mou, Zhichao Zhang, Ziqi Pan, Feng Zhu. Deep Learning for Predicting Biomolecular Binding Sites of Proteins[J]. Research, 2025, 8(2): 0615. DOI: 10.34133/research.0615

Introduction

Recent advancements in sequence-based and structure-based deep learning approaches have markedly propelled the field of binding site prediction, offering profound insights into protein interactions and molecular mechanisms. These developments have accelerated key applications in target identification, mutation analysis, and drug design [1,2]. This perspective highlights emerging trends, identifies critical challenges, and outlines directions for future research and practical implementation.

Advances in Binding Site Prediction

Binding site prediction methods are generally classified into 2 categories: sequence-based and structure-based approaches, each with its own advantages and limitations (Table). Sequence-based methods leverage amino acid sequences and evolutionary information, primarily focusing on linear sequence features [3–5]. Conversely, structure-based approaches rely on 3-dimensional protein structures to capture spatial arrangements crucial to binding interactions [6]. High-precision models, such as those in the AlphaFold series, provide the structural data needed for accurate binding site identification [7]. The integration of sequence and structural data has become a key trend for increasing prediction accuracy [8,9].

Geometric deep learning provides flexible ways to model protein structures by leveraging local and global geometric relationships. Point cloud models capture detailed spatial features of complex binding interfaces [6,10,11]. Surface property-based methods focus on overall surface features, such as hydrophobic patches or charge distributions, critical for protein interactions [12,13]. Graph neural networks (GNNs) encode proteins as graphs, incorporating physicochemical constraints and spatial relationships to improve binding predictions [11,14,15].
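To make the graph representation concrete, the following is a minimal sketch of how a protein might be turned into a residue graph and passed through one neighbor-averaging step. It is an illustration under stated assumptions, not the procedure of any cited method: the function names `contact_graph` and `mean_aggregate`, the 8 Å distance cutoff, and the toy coordinates are all hypothetical choices.

```python
import math

def contact_graph(coords, cutoff=8.0):
    """Link residues whose representative atoms lie within `cutoff` angstroms.

    `coords` holds one (x, y, z) tuple per residue (for example, C-alpha atoms).
    The 8.0 default is an illustrative assumption, not a value from the article.
    """
    n = len(coords)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges[i].append(j)
                edges[j].append(i)
    return edges

def mean_aggregate(features, edges):
    """One simplified message-passing round: average each node with its neighbors."""
    out = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in edges[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

# Toy example: four residues on a line, spaced 5 angstroms apart.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
features = [[1.0], [0.0], [0.0], [0.0]]  # one scalar feature per residue
graph = contact_graph(coords)
smoothed = mean_aggregate(features, graph)
```

In this toy case only sequence neighbors fall within the cutoff, so the feature of residue 0 spreads to residue 1 after one round. Practical GNNs replace the plain average with learned, often attention-weighted or equivariant, updates.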

Transformer-based approaches capture long-range dependencies in sequences using attention mechanisms and pretrained models, making them effective for complex sequence relationships. However, they are computationally intensive, especially with long sequences, as seen in methods like PepBCL. Convolutional neural network (CNN)-based approaches focus on local sequence patterns using convolutional layers, which is efficient for motif detection in smaller datasets. However, they struggle to capture global sequence dependencies, as demonstrated by methods like DELPHI.
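The trade-off described above can be sketched in a few lines. The example below is a simplified illustration rather than the architecture of PepBCL or DELPHI: `self_attention` uses identity query, key, and value projections (an assumption made for brevity), and `conv1d` is a plain sliding-window filter. The two counting helpers are hypothetical and only show why attention cost grows quadratically with sequence length while convolution cost grows linearly.

```python
import math

def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Every residue attends to every other residue, so a sequence of length L
    requires L * L pairwise scores.
    """
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out

def conv1d(signal, kernel):
    """Zero-padded sliding-window filter (cross-correlation, as in most deep
    learning libraries); each output sees only len(kernel) residues.
    Assumes an odd kernel length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def attention_pairwise_scores(length):
    """Number of pairwise scores computed by one attention head."""
    return length * length

def conv_multiplications(length, kernel_size):
    """Number of multiplications for one single-channel convolution filter."""
    return length * kernel_size

attended = self_attention([[1.0, 0.0], [1.0, 0.0]])
filtered = conv1d([1, 2, 3], [1, 1, 1])
```

For a 1,000-residue protein, `attention_pairwise_scores(1000)` is one million, whereas `conv_multiplications(1000, 7)` is seven thousand, which is the efficiency gap the paragraph refers to.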

Multi-task frameworks, like DeepDISOBind, capture features shared across interactions with different partner types, such as RNA, DNA, and proteins [3,16]. Additionally, ensemble learning frameworks enhance model robustness by combining diverse neural network architectures, as seen in EnsemPPIS, which integrates Transformers and gated CNNs to capture both global and local interaction features within protein sequences [17]. Recently, protein language models (PLMs), such as ESM-DBP, have also been applied to binding site prediction and can substantially enhance prediction accuracy [18,19].
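One generic way to combine models is to average their per-residue binding probabilities (soft voting). The sketch below illustrates that idea only; it is not the combination scheme of EnsemPPIS or any other cited method, and the function names, equal default weights, and 0.5 threshold are hypothetical.

```python
def ensemble_average(prob_sets, weights=None):
    """Weighted average of per-residue probabilities from several models.

    `prob_sets` is a list with one probability list per model; all lists must
    cover the same residues. Equal weights are assumed when none are given.
    """
    n_models = len(prob_sets)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    length = len(prob_sets[0])
    if any(len(p) != length for p in prob_sets):
        raise ValueError("all models must score the same number of residues")
    return [sum(w * p[i] for w, p in zip(weights, prob_sets))
            for i in range(length)]

def call_binding(probs, threshold=0.5):
    """Return indices of residues whose averaged probability reaches the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]

# Toy scores from two hypothetical models for a three-residue sequence.
global_model = [0.9, 0.2, 0.6]
local_model = [0.7, 0.4, 0.2]
averaged = ensemble_average([global_model, local_model])
binding_residues = call_binding(averaged)
```

Here residue 2 is favored by one model but not the other, and averaging pulls it below the threshold, which is the kind of disagreement an ensemble is meant to smooth out.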

Challenges and Prospects in Enhancing Binding Site Prediction

Both sequence-based and structure-based methods, despite substantial advances, have certain limitations. Structure-based methods, while highly accurate, rely heavily on high-quality structural data, often obtained from experiments or advanced prediction tools like AlphaFold [7,20]. However, even AlphaFold's high-precision predictions may not fully capture protein dynamics in complex biological contexts, where conformational changes and environmental factors influence binding. Furthermore, structure-based methods are inherently limited in addressing protein mutations that modify the protein's 3-dimensional configuration, underscoring the need for flexible models that can incorporate structural dynamics and account for binding site alterations. For instance, many biological processes, including enzyme catalysis and molecular signaling, require the consideration of protein flexibility and transient conformational states, which cannot always be captured by static structural models. These factors highlight the ongoing challenge of integrating structural dynamics into binding site prediction models, which could better reflect the complexity of biological systems.

In contrast, sequence-based methods offer computational efficiency and adaptability, making them valuable in scenarios where structural data are unavailable. These models leverage amino acid sequences and evolutionary conservation to identify binding residues but often struggle to capture the spatial context of protein interactions, which limits their prediction accuracy [21]. To mitigate this limitation, incorporating spatial constraints such as predicted residue–residue interactions or sequence-based structural motifs could provide a more nuanced understanding of binding sites without relying on structural data. Because sequence-based models do not depend on a fixed structure, they can also be applied when proteins undergo conformational changes, which positions them as valuable tools for studying mutation impacts. Incorporating such spatial and dynamic elements could improve prediction accuracy and enhance robustness across diverse biological conditions, expanding the potential applications of sequence-based methods. Given their simplicity, efficiency, and adaptability, sequence-based methods warrant further research, particularly as they evolve to integrate spatial and dynamic features.
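As a simple illustration of a sequence-only input, each residue can be encoded together with its neighbors in a sliding window, so that a per-residue classifier sees local context without any structural data. This is a minimal sketch under assumptions: one-hot encoding, the window size, and the names `one_hot` and `window_features` are hypothetical, and the methods discussed here typically use richer inputs such as evolutionary profiles or language model embeddings.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(residue):
    """20-dimensional indicator vector; unknown residues map to all zeros."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]

def window_features(sequence, half_window=2):
    """Concatenate one-hot vectors of each residue and its neighbors.

    Positions that fall outside the sequence are zero-padded, so every residue
    receives a vector of length (2 * half_window + 1) * 20.
    """
    pad = [0] * len(AMINO_ACIDS)
    encoded = [one_hot(r) for r in sequence]
    feats = []
    for i in range(len(sequence)):
        row = []
        for j in range(i - half_window, i + half_window + 1):
            row.extend(encoded[j] if 0 <= j < len(sequence) else pad)
        feats.append(row)
    return feats

# Toy example: a three-residue peptide with one neighbor on each side.
feats = window_features("ACD", half_window=1)
```

Spatial constraints such as predicted residue contacts could then be appended to these vectors as extra columns, which is one way to add the spatial context mentioned above.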

In the future, dynamically integrating sequence and structural data holds potential for advancing binding site prediction. Hybrid models that combine sequence specificity with structural context can more effectively capture a broad spectrum of biomolecular interactions [10,15]. Multi-task learning and ensemble frameworks, which leverage shared features across tasks and combine the strengths of individual models, offer promising strategies to achieve this integration. In particular, ensemble frameworks enhance adaptability and robustness across diverse biomolecular interactions [17,22].

In fact, the successful integration of multiple data modalities has already demonstrated substantial improvements in related fields. A notable example is SurfDock [23], which combines sequence, structural, and physicochemical information to enhance protein–ligand binding pose and affinity predictions. By leveraging this multimodal approach, SurfDock has achieved a 20% improvement in predicting binding affinities over traditional single-modal methods. This result suggests that multimodal integration could substantially advance the accuracy and reliability of protein–biomolecule binding site predictions.

Furthermore, while some models already integrate basic physicochemical properties such as charge distribution and hydropathy, advancements in molecular science can facilitate the inclusion of more complex molecular properties that offer potential for improving prediction accuracy. For example, with the development of advanced computational and experimental techniques such as molecular dynamics simulations and cryo-electron microscopy, researchers are now able to capture atomic-level interactions and transient molecular states that are critical for accurate protein–biomolecule interaction predictions. These properties, such as the flexibility of binding interfaces or the detailed molecular forces at play, are often overlooked in traditional static models. By further incorporating these dynamic factors, as shown in the Figure, prediction models could provide a more refined and realistic depiction of protein–biomolecule interactions. This is especially important in capturing molecular flexibility and transient binding events, which play key roles in biological processes like enzyme catalysis, molecular signaling, and protein folding. Moreover, integrating these complex properties can enhance model robustness by accounting for the inherent variability and dynamism of biological systems. As a result, these enhanced models could offer greater accuracy in predicting binding sites [18,24,25].
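One common way to summarize flexibility from a set of conformations is the per-residue root-mean-square fluctuation (RMSF), the square root of the mean squared deviation of each residue from its average position. The sketch below computes it for a toy ensemble and could, under the assumptions stated here, supply a dynamic per-residue feature. The function name `rmsf` and the input layout are hypothetical, and the frames are assumed to be already superimposed on a common reference.

```python
import math

def rmsf(frames):
    """Per-residue root-mean-square fluctuation across conformations.

    `frames` is a list of conformations; each conformation is a list with one
    (x, y, z) tuple per residue. Frames are assumed to be pre-aligned.
    """
    n_frames = len(frames)
    n_res = len(frames[0])
    result = []
    for i in range(n_res):
        # Average position of residue i over all frames.
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        # Mean squared deviation from that average position.
        msd = sum(
            sum((f[i][k] - mean[k]) ** 2 for k in range(3)) for f in frames
        ) / n_frames
        result.append(math.sqrt(msd))
    return result

# Toy ensemble: residue 0 is rigid, residue 1 moves 2 angstroms between frames.
frames = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
]
flexibility = rmsf(frames)
```

A model could concatenate such a value to each residue's static features, so that mobile interface residues are distinguished from rigid ones.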

Achieving computational efficiency remains crucial given the increasing complexity of deep learning models. Developing lightweight models that maintain high accuracy while reducing computational demands is essential, particularly in data-limited or dynamic prediction settings [26]. Sequence-based methods are particularly promising in this regard, as they are inherently less computationally intensive. Future work may focus on streamlined architectures capable of capturing essential spatial and evolutionary patterns without sacrificing accuracy. Incorporating efficient neural architectures like Transformers and deep reinforcement learning (RL) could speed up training and improve generalization across diverse protein sequences, addressing model complexity. Additionally, future research might explore reducing dependency on multiple sequence alignment (MSA) or developing MSA-independent alternatives. Such innovations could lower computational costs and expedite data processing, thereby enhancing model applicability across a broader range of prediction contexts.

Conclusion

Advancements in deep learning have profoundly enhanced protein–biomolecule binding site prediction. Hybrid models that seamlessly integrate sequence and structural data promise to substantially enhance predictive accuracy, mitigating the limitations inherent in individual approaches. Furthermore, the strategic integration of lightweight architectures and multimodal data will optimize computational efficiency. Continued progress in these areas will broaden the impact of binding site prediction, driving transformative advances in drug target identification, mutation analysis, and therapeutic development.

Funding

National Natural Science Foundation of China (82373790)
National Natural Science Foundation of China (22220102001)
National Natural Science Foundation of China (81872798)
National Natural Science Foundation of China (U1909208)
Natural Science Foundation of Zhejiang Province (LR21H300001)
Natural Science Foundation of Zhejiang Province (RG25H300001)
National Key R&D Program of China (2022YFC3400501)
Leading Talents of “Ten Thousand Plan” National High-Level Talents Support Plans of China, The Double Top-Class Universities (181201*194232101)
Fundamental Research Funds for Central Universities (2018QNA7023)
Key R&D Program of Zhejiang Province (2020C03010)

References

1. Senior, Evans, Jumper, Kirkpatrick, Sifre, Green, Qin, Zidek, Nelson AWR, Bridgland, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–710.
2. Gainza, Wehrle, Van Hall-Beauvais, Marchand, Scheck, Harteveld, Buckley, Ni, Tan, Sverrisson, et al. De novo design of protein interactions with learned surface fingerprints. Nature. 2023;617(7959):176–184.
3. Zhang, Zhao, Shi, Li, Kurgan. DeepDISOBind: Accurate prediction of RNA-, DNA- and protein-binding intrinsically disordered residues with deep multi-task learning. Brief Bioinform. 2022;23(1):bbab521.
4. Shen, Liu, Song, Yu. SAResNet: Self-attention residual network for predicting DNA-protein binding. Brief Bioinform. 2021;22(5):bbab101.
5. Huang, Li, Xiao, Zhao, Zheng, Li, Wang. PepCA: Unveiling protein-peptide interaction sites with a multi-input neural network model. iScience. 2024;27(10): Article 110850.
6. […], Zhao, Tian, Li, Chu, Gu, Zheng, Wang, Li, Jiang, et al. Highly accurate carbohydrate-binding site prediction with DeepGlycanSite. Nat Commun. 2024;15(1):5163.
7. Jumper, Evans, Pritzel, Green, Figurnov, Ronneberger, Tunyasuvunakool, Bates, Zidek, Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589.
8. Yuan, Chen, Rao, Zheng, Zhao, Yang. AlphaFold2-aware protein-DNA binding site prediction using graph transformer. Brief Bioinform. 2022;23(2):bbab564.
9. Shafiee, Fathi, Taherzadeh. DP-site: A dual deep learning-based method for protein-peptide interaction site prediction. Methods. 2024;229:17–29.
10. Xia, Xia, Pan, Shen. GraphBind: Protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res. 2021;49(9): Article e51.
11. […], Liu. GeoBind: Segmentation of nucleic acid binding interface on protein surface with geometric deep learning. Nucleic Acids Res. 2023;51(10): Article e60.
12. Gainza, Sverrisson, Monti, Rodola, Boscaini, Bronstein, Correia. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods. 2020;17(2):184–192.
13. Krapp, Abriata, Cortes Rodriguez, Dal Peraro. PeSTo: Parameter-free geometric deep learning for accurate prediction of protein binding interfaces. Nat Commun. 2023;14(1):2175.
14. Mahbub, Bayzid. EGRET: Edge aggregated graph attention networks and transfer learning improve protein-protein interaction site prediction. Brief Bioinform. 2022;23(2):bbab578.
15. Ding, Li, Han, Tian, Jing, Wang, Song, Fu, Kang. MEG-PPIS: A fast protein-protein interaction site prediction method based on multi-scale graph information and equivariant graph neural network. Bioinformatics. 2024;40(5):btae269.
16. Wang, Yan, Zhang, Liu. iDRNA-ITF: Identifying DNA- and RNA-binding residues in proteins based on induction and transfer framework. Brief Bioinform. 2022;23(4):bbac236.
17. Mou, Pan, Zhou, Zheng, Zhang, Shi, Li, Sun, Zhu. A transformer-based ensemble framework for the prediction of protein-protein interaction sites. Research. 2023;6:0240.
18. Zeng, Dou, Pan, Xu, Peng. Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein. Nat Commun. 2024;15(1):7838.
19. Liu, Tian. Protein-DNA binding sites prediction based on pre-trained protein language model and contrastive learning. Brief Bioinform. 2024;25(1):bbad488.
20. Alam, Mahbub, Bayzid. Pair-EGRET: Enhancing the prediction of protein-protein interaction sites through graph attention networks and protein language models. Bioinformatics. 2024;40(10):btae588.
21. Wang, Jin, Zou, Nakai, Wei. Predicting protein-peptide binding residues via interpretable deep learning. Bioinformatics. 2022;38(13):3351–3360.
22. Zeng, Zhang, Wu, Li, Wang, Li. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics. 2020;36(4):1114–1120.
23. Cao, Chen, Zhang, Wang, Huang, Yu, Jiang, Fan, Zhang, Zhou, et al. SurfDock is a surface-informed diffusion generative model for reliable and accurate protein-ligand complex prediction. Nat Methods. 2024.
24. Yin, Mi, Shukla. Leveraging machine learning models for peptide-protein interaction prediction. RSC Chem Biol. 2024;5(5):401–417.
25. […], Golding, Ilie. DELPHI: Accurate deep ensemble model for protein interaction sites prediction. Bioinformatics. 2021;37(7):896–904.
26. Baranwal, Magner, Saldinger, Turali-Emre, Elvati, Kozarekar, VanEpps, Kotov, Violi, Hero. Struct2Graph: A graph attention network for structure based predictions of protein-protein interactions. BMC Bioinformatics. 2022;23(1):370.

History

Received: 2024-11-16
Revised: 2025-01-21
Accepted: 2025-01-24
Published: 2025-02-24
Online: 2025-07-23

Corresponding:

^* Address correspondence to: zhufeng@zju.edu.cn

Table. Comparison of sequence-based and structure-based binding site prediction methods. The methodologies for predicting protein–biomolecule binding sites are categorized into sequence-based and structure-based approaches. The comparison includes representative models, highlights key features, and evaluates advantages and limitations.

Type	Subcategory	Models	Key features	Advantages	Limitations
Sequence-based methods	Transformer-based methods	EnsemPPIS, PepBCL	Leverages attention mechanisms and pretrained models to capture long-range sequence dependencies	Excels at capturing complex sequence relationships; highly adaptable to various tasks	High computational cost for long sequences
	CNN-based methods	DeepDISOBind, DELPHI	Uses convolutional layers to identify local sequence features, ideal for detecting motifs	Efficiently extracts local features; suitable for small datasets with minimal computing costs	Limited in capturing global sequence information
	Other sequence-based methods	ESM-DBP, SAResNet	Applies diverse neural networks (e.g., RNN, ResNet) for flexible and customizable sequence analysis	Customizable for specific tasks; easily integrates with biological features and databases	Relies on quality input features and sensitive to noisy or limited data
Structure-based methods	Geometric deep learning methods	DeepGlycanSite, GeoBind	Constructs 3-dimensional protein representations using point clouds or surface graphs	Handles complex surface shapes effectively for protein binding site analysis	Relies on data diversity; costly for large inputs
	GNN-based methods	EGRET, MEG-PPIS	Models proteins as residue graphs, capturing spatial and topological features effectively	Excels at integrating local and global residue interactions with detailed constraints	High complexity; sensitive to graph generation quality
	Surface property-based methods	MaSIF, PeSTo	Analyzes surface properties such as electrostatics and hydrophobicity via point clouds or meshes	Effectively and precisely identifies binding sites based on surface properties	May not capture internal structural information
