Deep Learning for Predicting Biomolecular Binding Sites of Proteins
Minjie Mou, Zhichao Zhang, Ziqi Pan, Feng Zhu*
Research. Vol 8 Article ID 0615
Perspective
Affiliations
  • College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China.
Published: 2025-02-24 doi: 10.34133/research.0615

The rapid evolution of deep learning has markedly enhanced protein–biomolecule binding site prediction, offering insights essential for drug discovery, mutation analysis, and molecular biology. Advancements in both sequence-based and structure-based methods demonstrate their distinct strengths and limitations. Sequence-based approaches offer efficiency and adaptability, while structure-based techniques provide spatial precision but require high-quality structural data. Emerging trends in hybrid models that combine multimodal data, such as integrating sequence and structural information, along with innovations in geometric deep learning, present promising directions for improving prediction accuracy. This perspective summarizes challenges such as computational demands and dynamic modeling and proposes strategies for future research. The ultimate goal is the development of computationally efficient and flexible models capable of capturing the complexity of real-world biomolecular interactions, thereby broadening the scope and applicability of binding site predictions across a wide range of biomedical contexts.

Minjie Mou, Zhichao Zhang, Ziqi Pan, Feng Zhu. Deep Learning for Predicting Biomolecular Binding Sites of Proteins[J]. Research, 2025, 8(2): 0615. DOI: 10.34133/research.0615
Recent advancements in sequence-based and structure-based deep learning approaches have markedly propelled the field of binding site prediction, offering profound insights into protein interactions and molecular mechanisms. These developments have accelerated key applications in target identification, mutation analysis, and drug design [1,2]. Furthermore, these advances highlight emerging trends, identify critical challenges, and outline forward-thinking directions for future research and practical implementation.
Binding site prediction methods are generally classified into 2 categories: sequence-based and structure-based approaches, each with its own advantages and limitations (Table). Sequence-based methods leverage amino acid sequences and evolutionary information, primarily focusing on linear sequence features [3–5]. Conversely, structure-based approaches rely on 3-dimensional protein structures to capture spatial arrangements crucial to binding interactions [6]. High-precision models, such as those in the AlphaFold series, provide the structural data needed for accurate binding site identification [7]. The integration of sequence and structural data has become a key trend for increasing prediction accuracy [8,9].
Geometric deep learning provides flexible ways to model protein structures by leveraging local and global geometric relationships. Point cloud models capture detailed spatial features of complex binding interfaces [6,10,11]. Surface property-based methods focus on overall surface features, such as hydrophobic patches or charge distributions, critical for protein interactions [12,13]. Graph neural networks (GNNs) encode proteins as graphs, incorporating physicochemical constraints and spatial relationships to improve binding predictions [11,14,15].
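As a rough illustration of how a protein can be encoded as a graph, the sketch below links residues whose representative atoms lie within a distance cutoff. The function name `contact_graph`, the 8 Å cutoff, and the toy coordinates are assumptions made for this example; they are not taken from any of the cited methods, which use richer node and edge features.

```python
import math

def contact_graph(coords, cutoff=8.0):
    """Build an adjacency list linking residues within `cutoff` angstroms.

    `coords` holds one (x, y, z) tuple per residue, e.g. C-alpha positions.
    The 8.0 cutoff is an illustrative assumption, not a recommended value.
    """
    n = len(coords)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors

# Toy coordinates invented for illustration (not from a real structure).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
graph = contact_graph(toy_coords)
print(graph)
```

In this toy case the first three residues are mutually connected and the fourth is isolated. A GNN would then pass messages along these edges, so spatial neighbors that are far apart in sequence can still exchange information.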
Transformer-based approaches capture long-range dependencies in sequences using attention mechanisms and pretrained models, making them effective for complex sequence relationships. However, they are computationally intensive, especially with long sequences, as seen in methods like PepBCL. Convolutional neural network (CNN)-based approaches focus on local sequence patterns using convolutional layers, which is efficient for motif detection in smaller datasets. However, they struggle to capture global sequence dependencies, as demonstrated by methods like DELPHI.
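The contrast between local convolution and global attention can be sketched with scalar features in plain Python. This is a deliberately simplified toy (one feature per residue, no learned projections) and does not reproduce PepBCL or DELPHI; the function names and the example signal are assumptions made for illustration.

```python
import math

def conv1d_same(x, kernel):
    """Zero-padded 1D convolution: each output sees only a local window.

    Assumes an odd-length kernel so the window is centered on position i.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out

def self_attention(x):
    """Scalar self-attention: each output is a softmax-weighted mix of all positions."""
    out = []
    for q in x:
        scores = [q * k for k in x]            # one score per (query, key) pair
        m = max(scores)                        # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w / total * v for w, v in zip(weights, x)))
    return out

# A signal whose only non-zero value sits at the far end of the sequence.
signal = [0.0] * 7 + [1.0]
local_out = conv1d_same(signal, [1.0, 1.0, 1.0])
global_out = self_attention(signal)
print(local_out[0], global_out[0])
```

With a width-3 kernel, the convolution output at position 0 is unaffected by the value at position 7, whereas the attention output at position 0 mixes in every position. The attention loop also computes one score per pair of positions, which is the source of the quadratic cost on long sequences.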
Multi-task frameworks, like DeepDISOBind, capture shared features across different interaction types, such as RNA, DNA, and proteins [3,16]. Additionally, ensemble learning frameworks enhance model robustness by combining diverse neural network architectures, as seen in EnsemPPIS, which integrates Transformer and gated CNNs to effectively capture both global and local interaction features within protein sequences [17]. Recently, advanced protein language models (PLMs) have also been applied to binding site prediction, such as ESM-DBP, which can substantially enhance prediction accuracy [18,19].
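The basic idea of combining several predictors can be sketched as a weighted average of per-residue probabilities. This is only the simplest form of ensembling and is not the EnsemPPIS architecture; the function name, the probabilities, and the 0.5 threshold are hypothetical values chosen for the example.

```python
def ensemble_average(prob_lists, weights=None):
    """Weighted average of per-residue binding probabilities from several models."""
    if weights is None:
        weights = [1.0] * len(prob_lists)
    n_res = len(prob_lists[0])
    if any(len(p) != n_res for p in prob_lists):
        raise ValueError("all models must score the same number of residues")
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, prob_lists)) / total
        for i in range(n_res)
    ]

# Hypothetical outputs of two models for a 3-residue protein.
model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.2]
combined = ensemble_average([model_a, model_b])
labels = [p >= 0.5 for p in combined]  # 0.5 is an assumed decision threshold
print(combined, labels)
```

Here the third residue is called binding by one model alone, but the averaged score falls below the threshold, which illustrates how an ensemble can damp the errors of a single model.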
Both sequence-based and structure-based methods, despite substantial advances, have certain limitations. Structure-based methods, while highly accurate, rely heavily on high-quality structural data, often obtained from experiments or advanced prediction tools like AlphaFold [7,20]. However, even AlphaFold's high-precision predictions may not fully capture protein dynamics in complex biological contexts, where conformational changes and environmental factors influence binding. Furthermore, structure-based methods are inherently limited in addressing protein mutations that modify the protein's 3-dimensional configuration, underscoring the need for flexible models that can incorporate structural dynamics and account for binding site alterations. For instance, many biological processes, including enzyme catalysis and molecular signaling, require the consideration of protein flexibility and transient conformational states, which cannot always be captured by static structural models. These factors highlight the ongoing challenge of integrating structural dynamics into binding site prediction models, which could better reflect the complexity of biological systems.
In contrast, sequence-based methods offer computational efficiency and adaptability, making them valuable in scenarios where structural data are unavailable. These models leverage amino acid sequences and evolutionary conservation to identify binding residues but often struggle to capture the spatial context of protein interactions, which limits their prediction accuracy [21]. To mitigate this limitation, incorporating spatial constraints such as predicted residue–residue interactions or sequence-based structural motifs could provide a more nuanced understanding of binding sites without relying on structural data. Because sequence-based models do not depend on a fixed structure, they remain applicable when proteins undergo conformational changes, positioning them as valuable tools for studying mutation impacts. Incorporating such spatial and dynamic elements would not only improve prediction accuracy but also enhance the model's robustness across diverse biological conditions, expanding the potential applications of sequence-based methods. Given their simplicity, efficiency, and adaptability, sequence-based methods warrant further research, particularly as they evolve to integrate spatial and dynamic features.
In the future, dynamically integrating sequence and structural data holds potential for advancing binding site prediction. Hybrid models that combine sequence specificity with structural context can more effectively capture a broad spectrum of biomolecular interactions [10,15]. Multi-task learning and ensemble frameworks, which leverage shared features across tasks and combine the strengths of individual models, offer promising strategies to achieve this integration. In particular, ensemble frameworks enhance adaptability and robustness across diverse biomolecular interactions [17,22].
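One simple way to picture a hybrid input is to concatenate a sequence-derived feature vector with a structure-derived one for each residue. The sketch below uses a one-hot amino acid encoding and a contact count as stand-ins; both choices, the 8 Å cutoff, and all names are assumptions for illustration, whereas published hybrid models typically rely on learned embeddings and richer geometric descriptors.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(residue):
    """Sequence feature: 20-dimensional one-hot encoding of the residue type."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]

def contact_number(coords, i, cutoff=8.0):
    """Structural feature: number of other residues within `cutoff` angstroms of residue i."""
    return sum(
        1 for j, c in enumerate(coords)
        if j != i and math.dist(coords[i], c) <= cutoff
    )

def hybrid_features(sequence, coords):
    """Concatenate sequence and structural features into one vector per residue."""
    if len(sequence) != len(coords):
        raise ValueError("sequence and coordinates must have the same length")
    return [
        one_hot(aa) + [float(contact_number(coords, i))]
        for i, aa in enumerate(sequence)
    ]

# Toy 3-residue example with invented coordinates.
features = hybrid_features("ACD", [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (30.0, 0.0, 0.0)])
print(len(features), len(features[0]))
```

Each residue ends up with a 21-dimensional vector (20 sequence dimensions plus 1 structural dimension) that a downstream classifier could consume.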
In fact, the successful integration of multiple data modalities has already demonstrated substantial improvements in related fields. A notable example is SurfDock [23], which combines sequence, structural, and physicochemical information to enhance protein–ligand binding pose and affinity predictions. By leveraging this multi-faceted approach, SurfDock has achieved a remarkable 20% improvement in predicting binding affinities over traditional single-modal methods. This highlights the promising potential of multi-modal integration, providing compelling evidence that such an approach could substantially advance the accuracy and reliability of protein–biomolecule binding site predictions.
Furthermore, while some models already integrate basic physicochemical properties such as charge distribution and hydropathy, advancements in molecular science can facilitate the inclusion of more complex molecular properties that offer potential for improving prediction accuracy. For example, with the development of advanced computational and experimental techniques such as molecular dynamics simulations and cryo-electron microscopy, researchers are now able to capture atomic-level interactions and transient molecular states that are critical for accurate protein–biomolecule interaction predictions. These properties, such as the flexibility of binding interfaces or the detailed molecular forces at play, are often overlooked in traditional static models. By further incorporating these dynamic factors, as shown in the Figure, prediction models could provide a more refined and realistic depiction of protein–biomolecule interactions. This is especially important in capturing molecular flexibility and transient binding events, which play key roles in biological processes like enzyme catalysis, molecular signaling, and protein folding. Moreover, integrating these complex properties can enhance model robustness by accounting for the inherent variability and dynamism of biological systems. As a result, these enhanced models could offer greater accuracy in predicting binding sites, thus pushing the boundaries of current prediction technologies [18,24,25].
Achieving computational efficiency remains crucial given the increasing complexity of deep learning models. Developing lightweight models that maintain high accuracy while reducing computational demands is essential, particularly in data-limited or dynamic prediction settings [26]. Sequence-based methods are particularly promising in this regard, as they are inherently less computationally intensive. Future work may focus on streamlined architectures capable of capturing essential spatial and evolutionary patterns without sacrificing accuracy. Incorporating efficient neural architectures like Transformers and deep reinforcement learning (RL) could speed up training and improve generalization across diverse protein sequences, addressing model complexity. Additionally, future research might explore reducing dependency on multiple sequence alignment (MSA) or developing MSA-independent alternatives. Such innovations could lower computational costs and expedite data processing, thereby enhancing model applicability across a broader range of prediction contexts.
Advancements in deep learning have profoundly enhanced protein–biomolecule binding site prediction. Hybrid models that seamlessly integrate sequence and structural data promise to substantially enhance predictive accuracy, mitigating the limitations inherent in individual approaches. Furthermore, the strategic integration of lightweight architectures and multimodal data will optimize computational efficiency. Continued progress in these areas will broaden the impact of binding site prediction, driving transformative advances in drug target identification, mutation analysis, and therapeutic development.
Funding
  • National Natural Science Foundation of China (82373790)
  • National Natural Science Foundation of China (22220102001)
  • National Natural Science Foundation of China (81872798)
  • National Natural Science Foundation of China (U1909208)
  • Natural Science Foundation of Zhejiang Province (LR21H300001)
  • Natural Science Foundation of Zhejiang Province (RG25H300001)
  • National Key R&D Program of China (2022YFC3400501)
  • Leading Talents of “Ten Thousand Plan” National High-Level Talents Support Plans of China, The Double Top-Class Universities (181201*194232101)
  • Fundamental Research Funds for Central Universities (2018QNA7023)
  • Key R&D Program of Zhejiang Province (2020C03010)
References
1. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Zidek A, Nelson AWR, Bridgland A, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–710.
2. Gainza P, Wehrle S, Van Hall-Beauvais A, Marchand A, Scheck A, Harteveld Z, Buckley S, Ni D, Tan S, Sverrisson F, et al. De novo design of protein interactions with learned surface fingerprints. Nature. 2023;617(7959):176–184.
3. Zhang F, Zhao B, Shi W, Li M, Kurgan L. DeepDISOBind: Accurate prediction of RNA-, DNA- and protein-binding intrinsically disordered residues with deep multi-task learning. Brief Bioinform. 2022;23(1):bbab521.
4. Shen LC, Liu Y, Song JN, Yu DJ. SAResNet: Self-attention residual network for predicting DNA-protein binding. Brief Bioinform. 2021;22(5):bbab101.
5. Huang JX, Li WK, Xiao B, Zhao CQ, Zheng HC, Li YR, Wang J. PepCA: Unveiling protein-peptide interaction sites with a multi-input neural network model. iScience. 2024;27(10): Article 110850.
6. He X, Zhao L, Tian Y, Li R, Chu Q, Gu Z, Zheng M, Wang Y, Li S, Jiang H, et al. Highly accurate carbohydrate-binding site prediction with DeepGlycanSite. Nat Commun. 2024;15(1):5163.
7. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zidek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589.
8. Yuan Q, Chen S, Rao J, Zheng S, Zhao H, Yang Y. AlphaFold2-aware protein-DNA binding site prediction using graph transformer. Brief Bioinform. 2022;23(2):bbab564.
9. Shafiee S, Fathi A, Taherzadeh G. DP-site: A dual deep learning-based method for protein-peptide interaction site prediction. Methods. 2024;229:17–29.
10. Xia Y, Xia CQ, Pan X, Shen HB. GraphBind: Protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res. 2021;49(9): Article e51.
11. Li P, Liu ZP. GeoBind: Segmentation of nucleic acid binding interface on protein surface with geometric deep learning. Nucleic Acids Res. 2023;51(10): Article e60.
12. Gainza P, Sverrisson F, Monti F, Rodola E, Boscaini D, Bronstein MM, Correia BE. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods. 2020;17(2):184–192.
13. Krapp LF, Abriata LA, Cortes Rodriguez F, Dal Peraro M. PeSTo: Parameter-free geometric deep learning for accurate prediction of protein binding interfaces. Nat Commun. 2023;14(1):2175.
14. Mahbub S, Bayzid MS. EGRET: Edge aggregated graph attention networks and transfer learning improve protein-protein interaction site prediction. Brief Bioinform. 2022;23(2):bbab578.
15. Ding H, Li X, Han P, Tian X, Jing F, Wang S, Song T, Fu H, Kang N. MEG-PPIS: A fast protein-protein interaction site prediction method based on multi-scale graph information and equivariant graph neural network. Bioinformatics. 2024;40(5):btae269.
16. Wang N, Yan K, Zhang J, Liu B. iDRNA-ITF: Identifying DNA- and RNA-binding residues in proteins based on induction and transfer framework. Brief Bioinform. 2022;23(4):bbac236.
17. Mou M, Pan Z, Zhou Z, Zheng L, Zhang H, Shi S, Li F, Sun X, Zhu F. A transformer-based ensemble framework for the prediction of protein-protein interaction sites. Research. 2023;6:0240.
18. Zeng W, Dou Y, Pan L, Xu L, Peng S. Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein. Nat Commun. 2024;15(1):7838.
19. Liu YF, Tian BX. Protein-DNA binding sites prediction based on pre-trained protein language model and contrastive learning. Brief Bioinform. 2024;25(1):bbad488.
20. Alam R, Mahbub S, Bayzid MS. Pair-EGRET: Enhancing the prediction of protein-protein interaction sites through graph attention networks and protein language models. Bioinformatics. 2024;40(10):btae588.
21. Wang R, Jin J, Zou Q, Nakai K, Wei L. Predicting protein-peptide binding residues via interpretable deep learning. Bioinformatics. 2022;38(13):3351–3360.
22. Zeng M, Zhang F, Wu FX, Li Y, Wang J, Li M. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics. 2020;36(4):1114–1120.
23. Cao DH, Chen MG, Zhang RZ, Wang ZK, Huang ML, Yu J, Jiang XY, Fan ZH, Zhang W, Zhou H, et al. SurfDock is a surface-informed diffusion generative model for reliable and accurate protein-ligand complex prediction. Nat Methods. 2024.
24. Yin S, Mi X, Shukla D. Leveraging machine learning models for peptide-protein interaction prediction. RSC Chem Biol. 2024;5(5):401–417.
25. Li Y, Golding GB, Ilie L. DELPHI: Accurate deep ensemble model for protein interaction sites prediction. Bioinformatics. 2021;37(7):896–904.
26. Baranwal M, Magner A, Saldinger J, Turali-Emre ES, Elvati P, Kozarekar S, VanEpps JS, Kotov NA, Violi A, Hero AO. Struct2Graph: A graph attention network for structure based predictions of protein-protein interactions. BMC Bioinformatics. 2022;23(1):370.
Year 2025 volume 8 Issue 2
Article Info
  • Received: 2024-11-16
  • Revised: 2025-01-21
  • Accepted: 2025-01-24
  • Published: 2025-02-24
  • Online Date: 2025-07-23
