Highly Accurate and Efficient Deep Learning Paradigm for Full-Atom Protein Loop Modeling with KarmaLoop

Highly Accurate and Efficient Deep Learning Paradigm for Full-Atom Protein Loop Modeling with KarmaLoop

PDF

Tianyue Wang¹^,^†, Xujun Zhang¹^,^†, Odin Zhang¹^,^†, Guangyong Chen², Peichen Pan¹, Ercheng Wang¹^,², Jike Wang¹, Jialu Wu¹, Donghao Zhou³, Langcheng Wang⁴, Ruofan Jin¹^,⁵, Shicheng Chen¹, Chao Shen¹, Yu Kang¹^,^*, Chang-Yu Hsieh¹^,^*, Tingjun Hou¹^,^*

Research. Vol 7 Article ID 0408

Less

Research. Vol 7 Article ID 0408

• Research Article •

Highly Accurate and Efficient Deep Learning Paradigm for Full-Atom Protein Loop Modeling with KarmaLoop

Full

Affiliations

¹Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.

² Zhejiang Laboratory, Hangzhou 311100, Zhejiang, China.

³Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China.

⁴Department of Pathology, New York University Medical Center, New York, NY 10016, USA.

⁵College of Life Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.

Published: 2024-07-25 doi: 10.34133/research.0408

Outline

Abstract

Less

Protein loop modeling is a challenging yet highly nontrivial task in protein structure prediction. Despite recent progress, existing methods including knowledge-based, ab initio, hybrid, and deep learning (DL) methods fall substantially short of either atomic accuracy or computational efficiency. To overcome these limitations, we present KarmaLoop, a novel paradigm that distinguishes itself as the first DL method centered on full-atom (encompassing both backbone and side-chain heavy atoms) protein loop modeling. Our results demonstrate that KarmaLoop considerably outperforms conventional and DL-based methods of loop modeling in terms of both accuracy and efficiency, with the average RMSDs of 1.77 and 1.95 Å for the CASP13+14 and CASP15 benchmark datasets, respectively, and manifests at least 2 orders of magnitude speedup in general compared with other methods. Consequently, our comprehensive evaluations indicate that KarmaLoop provides a state-of-the-art DL solution for protein loop modeling, with the potential to hasten the advancement of protein engineering, antibody–antigen recognition, and drug design.

Cite this Article

Tianyue Wang, Xujun Zhang, Odin Zhang, Guangyong Chen, Peichen Pan, Ercheng Wang, Jike Wang, Jialu Wu, Donghao Zhou, Langcheng Wang, Ruofan Jin, Shicheng Chen, Chao Shen, Yu Kang, Chang-Yu Hsieh, Tingjun Hou. Highly Accurate and Efficient Deep Learning Paradigm for Full-Atom Protein Loop Modeling with KarmaLoop[J]. Research, 2024 , 7 (7) : 0408 . DOI: 10.34133/research.0408

Full Text

Less

Introduction

Less

Loops are irregular segments in protein structures that often link 2 regular secondary structures such as α helices or β sheets [1,2]. These structures are typically situated near the surface of proteins [3], positioned in a way to allow them to facilitate numerous essential biological functions [4], including regulation of enzyme activity [5–8], protein–protein interaction (PPI) [9], and protein-ligand recognition [10–12]. Given their high flexibility, loops heavily influence the overall protein dynamics [13], rendering loop prediction one of the most challenging yet indispensable tasks in protein modeling [14]. Furthermore, since over half of the experimentally determined structures in the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) [15] contain missing fragments, and these missing regions predominantly correspond to loops [4], accurate modeling of loop structures is of high importance. For example, the prediction of antibody complementarity determining region (CDR) H3 loops is the most challenging task in antibody modeling [16], mainly due to the much higher flexibility of H3 compared with other CDR loops. To address this issue, various computational methods including traditional methods (knowledge-based, ab initio, and hybrid) and deep learning (DL) methods have been proposed. Knowledge-based methods, such as FREAD [17], LoopIng [18], SuperLooper [19], and DaReUS-Loop [20], rely on template repositories collected from pre-existing protein structures and perform modeling based on certain standards, such as structural similarity. For instance, FREAD performs database searching using the 4 rule-based filters based on anchor C_α separations and the environmentally constrained substitution score. LoopIng selects templates from the database by taking advantage of both sequence- and geometry-related features. SuperLooper provides the first online server for modeling globular and membrane protein loops, where the loop candidates are chosen by the standard of the root mean square deviation (RMSD) of the stem atoms. DaReUS-Loop uses fragments from remote or unrelated proteins to complete loop modeling and filters by sequence similarity and conformational profile. These methods are widely used in loop modeling, particularly in cases where a comparable fragment is available in the template database. However, the modeling performance of these methods for novel structures (especially for long loops) would drop sharply without proper references, suggesting that the accuracy of these methods is severely limited by the available templates [21,22].

Ab initio methods, mainly consisting of 2 components, conformational sampling and scoring, including DISGRO [1], LEAP [23], cyclic coordinate descent (CCD) [24,25], robotics-based kinematic closure (KIC) [26], and next-generation KIC (NGK) [27], offer advantages over knowledge-based methods due to their independence from templates and the ability to explore a broader conformational space. However, they need more computational resources than knowledge-based methods, and their computational cost increases exponentially with the loop length [21]. DISGRO, for instance, employs a chain growth sequential Monte Carlo sampling strategy for loop modeling and an atom-based distance-dependent empirical potential scoring function specifically designed for loops, and it has been described as an efficient computational method and particularly effective for modeling loops with 10 to 17 residues. CCD minimizes the distances between the backbone atoms of the C-terminal moving anchor and the corresponding atoms in the fixed anchor by adjusting one dihedral angle at a time. KIC utilizes a robotics-inspired local loop reconstruction method for loop modeling, achieving sub-angstrom accuracy. Among ab initio methods, NGK has been commonly regarded to be the state of the art [20,21], as it combines intensification and annealing strategies and yields a 4-fold improvement over standard KIC.

Hybrid methods, such as CODA [28] and Sphinx [22], integrate both ab initio and knowledge-based methods to improve performance [20]. CODA utilizes a combination of the knowledge-based approach FREAD and the ab initio method PETRA [29] for generating a consensus prediction that should satisfy a set of rule-based filters. Sphinx combines FREAD and an in-house ab initio method that first selects potential fragments shorter than the length of loop to be modeled. Then, the ab initio method adjusts the length of the fragments to generate accurate decoys of the desired length.

With the continual development of computing resources, DL has become increasingly popular and has demonstrated its potential in predicting protein structures. For instance, AlphaFold [30] incorporates biological and physical information and utilizes multi-sequence alignments (MSAs) to design a DL algorithm, achieves atomic-level accuracy, and ranks first in the 14th Critical Assessment of Structure Prediction (CASP) competition. Likewise, RoseTTAFold [31] utilizes a 3-track neural network that simultaneously considers protein sequences, amino acid interactions, and structures to attain accuracy close to AlphaFold [32]. However, both methods often face great challenges in accurately modeling loop regions [32,33], which is related to the fact that these methods typically do not incorporate factors such as ligand presence, the complexities of PPIs, and multiple conformations [34]. AlphaFold-Multimer [35] is developed as a useful tool in protein complex prediction and exhibits outstanding performance in antibody structure prediction [36]. More recently, AlphaFold3 [37] has shown remarkable capabilities in extending more accurate structural predictions to a broader scope that includes interactions with ligands, proteins, nucleic acids, and posttranslational modifications. Despite the need for further validation from the scientific community, these advancements in AlphaFold3 have brought great hope to this field.

The first DL method particularly for loop modeling was proposed in 2019 [38], which employs convolutional neural networks (CNNs) [39] on the distance matrix to predict the pairwise distances of loop atoms for conformation reconstruction. Recently, 2 DL architectures for loop modeling have been proposed [40], including one using a combined CNN–recursive neural network (RNN) structure (DeepMUSICS) and the other based on a refinement of histograms using a two-dimensional (2D) CNN architecture (DeepHisto). However, the source code of the aforementioned methods is not publicly available, which limits their broader applications. In the case of CDR H3 loop modeling, DeepH3 [41] was introduced using CNN to predict the interresidue distances and orientations from antibody sequences. Furthermore, DeepAb [42], the updated version of DeepH3, has demonstrated improvements with the inclusion of a pre-trained antibody sequence model and an interpretable attention mechanism to accurately predict the structures of antibodies.

So far, both traditional and DL-based methods have been unable to achieve high accuracy and efficiency simultaneously in loop prediction. Moreover, most proposed methods have focused solely on the prediction accuracy of heavy atoms in the backbone within the loop region [1,24,43–47]. Nonetheless, it is crucial and challenging to predict the conformation of side chains accurately [48–50], which highlights the importance of executing meticulous full-atom modeling of loops.

For the reasons above, we present a novel end-to-end DL paradigm named KarmaLoop, which is designed for accurate and efficient full-atom prediction of protein loops. Figure 1A depicts an example of a loop modeled by KarmaLoop, which starts with a random initialization of the loop coordinates and then predicts an accurate loop structure. KarmaLoop (Fig. 1B) employs graph neural network (GNN) architectures, which comprise 2 encoders [i.e., Graph Transformer (GT) [51–53] and Geometric Vector Perceptrons (GVPs) [54]] for intramolecular interaction encoding, a mixture density network (MDN) block is employed to assess confidence levels in the predictions, and an E(n) equivariant GNN (EGNN) block is used to generate the loop conformation. While MDN is adept at distinguishing between high-RMSD and low-RMSD conformations, it does not display a satisfactory linear correlation with the RMSD results. The innovations are as follows: (a) better representation: by integrating atom features within their respective residues, we achieve a more holistic representation of proteins (Fig. 1C); (b) enhanced accuracy: the MDN block is trained prior to the introduction of the EGNN to better learn the distribution of distances between loop atoms and nonloop residues, thereby refining the learning process for pose generation; (c) higher efficiency: instead of relying on searching algorithms, loop conformations are updated using pseudo-forces as predicted by EGNN, thereby accelerating the modeling process.

KarmaLoop has been evaluated on multiple benchmark datasets, including the general protein loop benchmark CASP13+14 (the combination of CASP13 [55] and CASP14 [56]), the CASP15 dataset, and the antibody H3 loop benchmark (RosettaAntibody and therapeutic benchmark used in DeepAb), and filter the samples with high sequence similarity with the training dataset. In terms of the general protein loop modeling capability, KarmaLoop outperforms all the tested traditional and DL-based methods on both accuracy and efficiency, with the medium full-atom RMSDs of 1.41 and 1.31 Å, the average RMSDs of 1.77 and 1.95 Å, the success rates of 75.09% and 71.71% for single-conformation generation and 77.51% and 76.21% for generating 5 conformations, and the speeds of 0.047 s and 0.049 s per task on the CASP13+14 and CASP15 benchmark datasets, respectively. Several successful predictions of antibody CDR H3 loops are shown in Fig. 1E, where KarmaLoop achieves the average full-atom and medium RMSDs of 3.25 and 2.97 Å, respectively, compared with the baseline methods with the best performance (average and medium RMSDs of 3.49 and 3.34 Å, respectively). In protein engineering and structure-based drug design, accurately modeling protein loops under an imprecise structural environment holds tremendous potential, but precisely reconstructing loop structures within an accurate framework structure is also quite important, a focus that has been extensively explored in most previous studies. To provide a comprehensive assessment, we also conducted the task of antibody H3 modeling based on inaccurate structural frameworks predicted by other methods. When using the AlphaFold-Multimer, AlphaFold, and DeepAb modeled antibody structures as the input, KarmaLoop could further refine the conformations of H3 loops and increase the accuracy by an average of 9%. Hence, we proposed a universal protocol for precise antibody modeling by employing DeepAb to first predict an overall antibody structure, followed by KarmaLoop to refine the H3 loops. In summary, KarmaLoop has shown a remarkable superiority (over prior arts) in predicting both general loops and antibody H3 loops across most loop lengths.

Results and Discussion

Less

As shown in Fig. 1B, KarmaLoop consists of 3 components: (a) 2 encoders, designed to comprehend representations of both residues and atoms, thereby implementing a hierarchical representation; (b) E(n) EGNNs enhanced with self-attention, aiming at updating the loop conformation by considering both intermolecular and intramolecular interactions; (c) an MDN designed for confidence evaluation and introducing the prior knowledge to facilitate prediction. To assess the ability of KarmaLoop, it is critical to make comparisons with other widely used methods in terms of prediction accuracy and efficiency. Consequently, in the subsequent sections, we conducted some experiments to evaluate the performance including the accuracy of generating conformations, the effect of loop length on accuracy, computational efficiency, the capacity for antibody H3 loop modeling, and 2 cases to prove that KarmaLoop is capable of reproducing diverse conformations according to varying ligand-binding conditions.

Prediction accuracy for general protein loops

The accuracy of loop modeling is of utmost importance for many practical applications. To validate the capability of our model, 7 widely used methods including 3 DL-based methods (AlphaFold, RoseTTAFold, and ColabFold [57], and we used a customized template database that contains all the nonloop structures in the CASP13+14 and CASP15 benchmarks), one knowledge-based method (FREAD), and 3 ab initio methods [DISGRO, NGK, and Rosetta model_missing_loop (RML) [58]] were employed for comparison with KarmaLoop in a comprehensive benchmark. The full-atom RMSDs were calculated for each method, with the AlphaFold, ColabFold, and RoseTTAFold predicted structures being aligned before computation.

To perform quantitative comparisons with the other methods, empirical cumulative distribution function (eCDF) was selected to describe the proportion of samples at given RMSD values. We defined a task as successful if the RMSD of a given sample falls below a specified threshold. Figure 2A and B demonstrates that when the RMSD threshold was set to 2 Å, KarmaLoop exhibited the highest success rates of 75.09% and 71.71%, respectively, on the CASP13+14 and CASP15 benchmark datasets, yielding impressive improvements by 22.92% and 14.30% over the next best method (i.e., AlphaFold, 52.17%; NGK, 57.41%). Even adopting a stricter RMSD standard of 1 Å, KarmaLoop (23.87% and 29.93%) still outperformed the other methods, ranging from RoseTTAFold (1.74% and 1.07%) to NGK (19.23% and 22.26%).

As some baseline methods may generate multiple conformations for a given loop sample, KarmaLoop was further evaluated in the context of generating multiple conformations (all methods produced 5 conformations). It is important to note that, regardless of the number of conformations generated for a given task, a successful prediction is acknowledged when the lowest RMSD is no more than 2 Å. As shown in Fig. 2C, KarmaLoop achieved the best performance and improved the success rates by 11.38% and 4.26%, respectively, on the 2 benchmark datasets, compared with the best model NGK with the success rates of 66.13% and 71.95%, respectively. ColabFold and RoseTTAFold exhibited the lowest success rates of only 34.40% and 30.58%, respectively.

To directly compare the performance, the distribution of each method was plotted in Fig. 2D. KarmaLoop showed the lowest medium RMSDs on the CASP13+14 and CASP15 datasets, with 1.41 and 1.31 Å, respectively, as compared to the other methods, and showed much lower RMSDs than the second-best method NGK (1.60 Å) and AlphaFold (1.90 Å). Besides, KarmaLoop achieved the average RMSD of 1.77 Å (±1.43 Å) and 1.95 Å (±2.33 Å), holding a marked leading margin of 42.34% and 34.56% over the next best baseline method NGK on the CASP13+14 and CASP15 datasets (Fig. 2E).

The impact of loop length on accuracy

We further investigated the influence of loop length on prediction accuracy. As shown in Fig. 1E, KarmaLoop exhibits the possibility to generate satisfactory conformations of loops with varying lengths. Figure 3A and B displays the RMSDs of loop length with more than one sample in the CASP13+14 and CASP15 datasets to avoid statistically inconclusive results. Our results indicated that KarmaLoop outperformed the baseline methods for the majority of length cases (9/11 and 8/12 on the CASP13+14 and CASP15 datasets) and maintained a relatively consistent performance across different loop lengths. In our analysis of over 20,000 proteins from AlphaFold pdb_mmcif dataset, we found that regions comprising no more than 10 residues, predominantly short loops, account for close to 80% of the total missing regions. This finding underscores the critical importance of accurate short loop prediction in protein modeling. Interestingly, AlphaFold and RoseTTAFold fall short in accurately predicting short loops compared with more traditional methods (Fig. 3). In contrast, Karmaloop emerges as a superior method, far surpassing nearly all other methods in short loop precision, with the RMSDs for most loops shorter than 10 residues around 2.0 Å. Besides, the observed increase in RMSD with loop length may be attributed to the fact that the training dataset predominantly consisted of short loops (residues ≤ 10), and long loops (particularly those over 20 residues) were only represented by 0.058% of all the samples (Fig. S1).

Evaluation of loop modeling efficiency

Computational efficiency is also a key factor in loop modeling. In this section, the time consumption of each method was evaluated. KarmaLoop was tested on a Tesla V100S GPU, traditional methods (FREAD, DISGRO, NGK, and RML) were run on a single core Intel Xeon Gold 6240R CPU @ 2.40 GHz, and AlphaFold, ColabFold, and RoseTTAFold were evaluated in parallel with 20 cores on the Intel Xeon Gold 6240R CPU @ 2.40 GHz and a Tesla V100S GPU. It is important to emphasize that the results presented by KarmaLoop do not include any postprocessing step, which typically requires approximately 5 s to complete a sample. Figure 4A shows the distribution of time spent for each method, illustrating the remarkable efficiency of KarmaLoop in terms of time cost on each task ranging from 0.015 to 0.14 s. Contrastingly, the time cost of DISGRO ranges from 0.773 to 51.07 s, and the other methods are considerably slower. As shown in Fig. 4B, the average time costs of KarmaLoop on the CASP13+14 and CASP15 datasets are 0.047 and 0.049 s, respectively, displaying a considerable speed advantage over traditional methods. Specifically, the minimum speed advantages of KarmaLoop over DISGRO (17.006 and 21.579 s) are 362× and 440× for CASP13+14 and CASP15, respectively, and the maximum speed advantages of KarmaLoop over RML (1172.342 s) and NGK (978.427 s) are 24,943× and 19,968×, respectively. However, it should be noted that, unlike other methods that only model loop region, AlphaFold, ColabFold, and RoseTTAFold require predicting the entire protein conformation. Although the modeling speed for these methods cannot be directly compared, ColabFold with a customized template database performs faster than AlphaFold and RoseTTAFold. While the reported speedup may vary slightly depending on the hardware used, the 2 to 4 orders of magnitude in speedup should still hold unless these baseline tools have been completely re-engineered to boost their efficiency.

Prediction of antibody H3 loops

The CDR H3 of antibody is widely acknowledged as the most flexible region and contributes the most to the structural diversity and binding site topography [16]. Consequently, accurate modeling of the H3 conformation is urgently needed. Sphinx, proposed by Marks et al. [22], is a hybrid method that models both general loops and antibody H3 loops with the same architecture but using different fragment databases. In contrast, KarmaLoop consistently employs the same process for H3 modeling.

In this section, we present a comparative analysis of the H3 modeling performance of KarmaLoop against 4 widely used methods (i.e., DeepAb [42], RosettaAntibody-G [59], RepertoireBuilder [60], AbodyBuilder [61], AlphaFold, and AlphaFold-Multimer) for predicting antibody structures on the antibody H3 benchmark. To evaluate the quality of the modeling results, we used the experimentally determined structures from SAbDab as the input. Figure 5A depicts the distribution of the RMSDs between the predicted samples from the baseline methods and the experimentally determined structures. The results demonstrate that KarmaLoop (2.97 and 3.25 Å) outperforms DeepAb (3.35 and 3.49 Å) in terms of the medium and average RMSDs by 0.38 and 0.24 Å, respectively. Figure 5B displays the RMSDs of samples with varying loop lengths, and KarmaLoop exhibits the lowest average RMSD for most lengths of the H3 loop samples.

Next, we assessed the ability of refinement by using the antibody structures predicted by AlphaFold-Multimer, AlphaFold, and DeepAb as the input for KarmaLoop. Notably, different from the version of the models used before, as for the models of the loops in imprecise protein structures, a refined version of the models was used (more details in the “Loop initialization and pocket selection” section). Figure 5C and D illustrates the predictions based on the structures modeled by DeepAb, which can be used to test the capability of KarmaLoop to improve structures under a suboptimal structural situation. The results depicted in Fig. 5 (C and D) demonstrate that KarmaLoop is capable of refining the loop structures with an average RMSD reduction from 3.55 to 3.32 Å, 3.73 to 3.42 Å, and 3.49 to 3.21 Å, and a medium RMSD reduction from 3.45 to 3.23 Å, 3.70 to 3.38 Å, and 3.34 to 2.95 Å for AlphaFold-Multimer, AlphaFold, and DeepAb, respectively. It is worth noting that this improvement is observed across most lengths of the H3 loops, highlighting the consistent ability of KarmaLoop to improve the H3 loop structures regardless of their length. Thus, a universal protocol can be applied for antibody prediction, whereby DeepAb models the entire antibody structures and KarmaLoop reconstructs the H3 structures.

Multi-preferential conformation prediction

In studying the conformational dynamics of protein loops, it is observed that these structures exhibit diverse conformations when binding with different small-molecule ligands, which is very relevant to drug design scenarios. To elucidate the predictive capabilities of KarmaLoop in capturing these multi-conformational states, we investigated the ground-truth structure of 2 well-documented protein families from PDB: the androgen receptor (AR) and Renin, each under distinct conditions and exhibiting unique loop conformations. Specifically, we examined AR in scenarios of small-molecule binding (PDB ID: 2PIX) and homodimer formation (PDB ID: 5JJM); and 2 holo forms of Renin in interactions with different small-molecule inhibitors (PDB IDs: 1BIL, 2I4Q). Figure 6 presents a comparative analysis between the conformational predictions made by KarmaLoop and the experimentally determined ground-truth conformations, along with an examination of the discrepancies between 2 sets of ground-truth conformations. In both cases, the loop regions exhibit distinct preferential conformations in complex with different guest molecules, and KarmaLoop demonstrates the possibility of reproducing diverse conformational responses across varying ligand-binding conditions.

Conclusion

Less

To verify the effectiveness of KarmaLoop, several evaluations and comparisons have been conducted, including general loop modeling accuracy, loop-length impact and compute time, as well as potential applications in antibody H3 prediction. Of note, KarmaLoop achieved a substantial improvement in the average RMSD by more than 2-fold and the medium RMSD by nearly 2-fold. Regarding the success rate, the minimum enhancement of 22.92% and 14.30% in the CASP13+14 and CASP15 benchmark datasets was observed on one conformation generated by each method. Even when compared to other methods that generate multiple conformations, KarmaLoop still achieved a minimum improvement of 11.38% and 4.26%. KarmaLoop is also capable of accurately modeling loops of different lengths and obtained a majority length preponderance against the other methods. Surprisingly, KarmaLoop achieved at least a 362× speed advantage over other traditional methods. In terms of antibody CDR H3 loop modeling, which is considered the most challenging task in antibody structure prediction, KarmaLoop also showed the best performance and maintained its superiority across various lengths. Notably, based on the AlphaFold-Multimer, AlphaFold, and DeepAb modeled antibody structures, KarmaLoop has successfully promoted its accuracy and achieved improvement upon those methods in almost all lengths of H3 loop.

In summary, KarmaLoop has demonstrated the potential to accurately model loops, refine loop conformations based on inaccurate structure environments, and markedly accelerate batch data processing. Besides, a universal protocol was proposed for antibody structure prediction by utilizing DeepAb to model the entire structure of the antibody and KarmaLoop to refine the H3 conformation. Except for an accurate loop modeling paradigm proposed, a large training dataset (0.27 million) and several benchmark datasets have been made available for researchers to further evaluate the prediction accuracy of the general loop and antibody H3. We anticipate that KarmaLoop will accelerate the advancement of protein design, antibody–antigen recognition, and drug discovery and will become an essential tool in the field of biology moving forward.

Methods

Less

Architecture overview

Inspired by Jin et al. [62], we consider the loop modeling problem as the docking task and selected pockets for reducing computational resources, which is a common and frequently used protocol for ligand docking [51]. As shown in Fig. 1B, the protein pocket is modeled as a 2D molecular graph, G_p, where nodes represent atoms and edges denote covalent bonds. For the nonloop region, a 3D residue graph (G_nl) is employed, where nodes represent residues. This graph connects each node to its 30 closest neighbors, enhancing the ability to capture long-range interactions more effectively and with less computational demand than an atom-based graph.

Rather than serving as direct inputs for subsequent tasks, the molecular graphs and residue graphs undergo encoding through GT [51–53] and GVP [54] to learn intra-molecular interactions and update node embeddings. Then, for nonloop residues, the node embeddings for atom nodes in the same residue are summed and concatenated to the corresponding residue node embeddings, followed by a feature merging block (FMB) to generate hierarchical node embeddings.

H p, s 1 = GT H p, s 0, E p, s 0

(1)

H nl, s 0.5 = GVP S nl, s 0, H nl, s 0, H nl, v 0, E nl, s 0, E nl, v 0

(2)

H nl, s, i 1 = FMB concat H nl, s, i 0.5, ∑ j = k k ′ H p, s, j 1

(3)

H l, s 1 = H p, s 1 loop_mask

(4)

where H denotes the features of nodes and E is the characteristics of edges; S indicates the type of residue involved; loop mask is a bool array for selecting loop atoms from the whole protein atoms; and p, nl, l, s, v, i, j, k, and k' on the lower right (i.e., subscripts) denote protein pocket atoms, nonloop residues, loop residues, scalar, vector, the index of residues, the index of atoms, the start index of the atom corresponding to ith residue, and the end index of the atom corresponding to ith residue, respectively.

Embeddings of nodes

H l, s 1, H nl, s 1

generated by the encoders, along with the node coordinates

X l, s 1, X nl, s 1

of loop atoms and nonloop residues, are integrated to create an interaction graph [G_l,nl = (H_l,nl, E_l,nl, X_l,nl)]. This graph is designed to encapsulate intermolecular interactions between molecules at both the levels of atoms and residues.

In the procedure for generating poses, the loop conformation, depicted as the coordinates of the loop atom nodes, is initialized by sampling from the normal distribution with means and SDs varying for various tasks (loop reconstruction and loop refinement, details in the “Loop initialization and pocket selection” section). The embeddings for nodes and edges begin from a baseline established by graph normalization and are further processed through an MLP layer (Eq. 5). Subsequently, an EGNN consisting of 8 layers with self-attention is employed to update the embeddings of nodes and edges as well as the positions of the loop atoms. This network architecture is designed to effectively handle interactions both within loop atoms and between loop residue and nonloop atoms (Eq. 6). Inspired by AlphaFold, a recycling approach is adopted to ensure that the EGNN block progressively enhances its ability to refine the structural poses. At the beginning of each recycling iteration, a gating mechanism combines the updated embeddings with the original raw embeddings (Eq. 7).

H l, nl 0, 0 = GraphNorm H l, s, H nl, s

(5)

H l, nl r, i, E l, nl r, i, X l, nl r, i = EGNN_Layer i H l, nl r, i − 1, E l, nl r, i − 1, X l, nl r, i − 1

(6)

F l, nl r + 1, 0 = Gate_Block F l, nl r, 8, F l, nl 0, 0

(7)

In this framework H and E denote the embeddings of node and edge, respectively; F is used to represent a generic embedding; and l, nl, r, and i represent loop atoms, nonloop residues, the index of recycling, and the index of EGNN layer, respectively.

Upon obtaining node embeddings, the MDN block is used to predict essential statistical parameters such as the mean, SD, and coefficient of variation. These parameters collectively provide a comprehensive representation of the internodal distance distribution, as outlined in Eq. 8. Following this step, the node-pair embeddings and internodal distances are integrated with the predicted distribution to ascertain the likelihood that an internodal distance mirrors that observed in the crystal conformation (Eq. 9). This resultant probability estimation can then be utilized as a scoring metric to evaluate the appropriateness of the current conformation in subsequent analyses.

μ l, nl, σ l, nl, π l, nl = MDN_Block H l, nl 1

(8)

U x = − ∑ nl = 1 NL ∑ l = 1 L log P d l, nl | h l, h nl = − score

(9)

where H represents the node embeddings; l and nl denote loop atoms and nonloop residues, respectively; and μ_l,nl, σ_l,nl, and π_l,nl represent the means, SDs, and mixing coefficients, respectively.

Loop initialization and pocket selection

Selecting the pocket instead of the entire protein can simplify the input of the network and guide KarmaLoop to focus on the relevant structural features that are essential for understanding the interaction between loop and nonloop regions. In our study, we introduce 2 methods for loop initialization and pocket selection.

In cases of loop reconstruction where loop conformations remain undetermined, we calculate the center of mass for each loop residue by uniformly dividing the distance between its anchor residues (those immediately preceding and following the loop region). The initial conformation of each residue is then sampled from a Gaussian distribution, centered on its individual center of mass with an SD of 1.75 Å. The pocket selection similarly relies on the anchor residues, retaining the surface residues within a sphere whose center is the midpoint between the 2 anchor residues (see the “Surface pocket selection” section).

Conversely, for loop conformation refinement, where KarmaLoop is tasked with optimizing the loop conformations that are inaccurately predicted by other algorithms, the conformations are generated from a normal distribution. This distribution has its mean set around the original loop's center of mass with an SD of 4 Å. Pockets, in this scenario, comprise the residues located within 12 Å of any original loop atom.

Graph representation

In this research, each protein is depicted as an undirected graph, symbolized by [G_p = (H_p, E_p, l_p)], with nodes corresponding to atoms and edges to covalent bonds. H_p and E_p represent the node features and edge features, while the coordinates of the atoms are represented by the node positions X_p.

To facilitate the prediction of loop conformations through nonbonded interactions, it is essential to establish connections between protein and loop nodes within the model. However, creating full connections between all nonloop and loop nodes is highly resource-intensive, particularly when dealing with a vast array of protein atoms. For obtaining both atom and residue information, we propose to use residues as nodes by merging atom information with the corresponding residues in the nonloop graph, which can also greatly reduce the number of nodes and edges in the graph. Furthermore, residue graphs are capable of encoding the geometric details of each residue while effectively capturing interactions over long distances between loop and nonloop residues, which enhances the prediction of loop conformations. To enhance the prediction of loop conformations, we devise a K-nearest neighbor (KNN) graph, denoted as [G_nl = (H_nl, E_nl, X_nl)], setting K to 30. In this graph, residues serve as nodes and their coordinates are represented by the coordinates of CA atoms. As detailed in Table S2, the features of both nodes and edges are not limited to scalar features (H_{nl_s}, E_{nl_s}) but also vector features (H_{nl_v}, E_{nl_v}), and the GVP is employed to learn the molecular topology and geometric characteristics effectively.

Graph transformer

In the demonstrated approach for effective representation of protein graphs, GTs have been utilized to learn intramolecular interactions (Fig. 1B). h_{p_i} ∈ R^d_h×1 represents ith node feature, and e_{p_ij} ∈ R^d_e×1 represents the edge feature between node i and j; at the beginning, it was initialed as

h p_i 0

and

e p_ij 0

with d-dimension through 2 distinct linear layers:

h p_i 0 = W h 0 h p_i + b h 0 e p_ij 0 = W e 0 e p_ij + b e 0

(10)

where

W h 0 ∈ R d × d h

W e 0 ∈ R d × d e

, and

b h 0

b e 0 ∈ R d

. The node and edge embeddings are first initialized and then processed through a series of 6 graph transformer layers, which produce the final embeddings. Each layer in this stack updates the embeddings through a process of message passing combined with a specialized version of multi-head self-attention (MHA), as detailed in the equations below:

q p_i k, l = W Q k, l Norm h p_i l

(11)

k p_j k, l = W K k, l Norm h p_j l

(12)

v p_j k, l = W V k, l Norm h p_j l

(13)

e p_ij k, l = W E k, l Norm e p_ij l

(14)

w p_ij k, l = Softmax j∈N i q p_i k, l · k p_j k, l d k · e p_ij k, l

(15)

h ̂ p_i l + 1 = h p_i l + W h 0 l Dropout Concat k∈ 1, …, H Aggregation_Sum j∈N i w p_ij k, l v p_j k, l

(16)

e ̂ p_ij l + 1 = e p_ij l + W e 0 l Dropout Concat k∈ 1, …, H w p_ij k, l

(17)

h p_i l + 1 = h ̂ p_i l + 1 + W h 2 l Dropout SiLU W h 1 l Norm h ̂ p_i l + 1

(18)

e p_ij l + 1 = e ̂ p_ij l + 1 + W e 2 l Dropout SiLU W e 1 l Norm e ̂ p_ij l + 1

(19)

where

W Q k, l

W K k, l

W V k, l

W E k, l ∈ R d k × d

W h 0 l

W e 0 l ∈ R d × d

W h 1 l

W e 1 k, l ∈ R 2 d × d

, and

W h 2 l

W e 2 l ∈ R d × 2 d

are learnable parameters from linear layers; k∈1, …, H represents attention heads number; dimension of each head is denoted by d_k; j∈N(i) denotes the set of nodes adjacent to node i; Norm refers to the application of batch normalization; Concat involves the operation of concatenation; Dropout operation involves randomly omitting units during training to prevent overfitting; SiLU refers to a specific activation function; Aggregation_Sum_j∈N(i) implies the summing up of messages from edges that link node i with its neighbors j; and Softmax_j∈N(i) denotes the SoftMax function applied across the neighbors j.

Geometric Vector Perceptrons

In the KarmaLoop implementation, GVPs serve as the nonloop residue encoder. They utilize node embeddings updated according to topological connections and geometric data within and between residues. The fundamental component, known as the gvp layer, processes scalar features f_s ∈ R^d and vector features f_v ∈ R^d×3. The operation within the lth layer of this system can be described as follows:

f v 1 l = W v 0 l f v l

(20)

f v 2 l = W v 1 l f v 1 l

(21)

S v 1 l = f v 1 l 2 row wise

(22)

S v 2 l = f v 2 l 2 row wise

(23)

f sv l = Concat f s l, S v 1 l

(24)

f ̂ s l = W sv l f sv l + b sv l

(25)

f s l + 1 = σ s f ̂ s l

(26)

f v l + 1 = σ v S v 2 l ⊙ f v 2 l row wise multiplication

(27)

where

W v 0 l ∈ R d v 1 × d v 0

W v 1 l ∈ R d v 2 × d v 1

, and

W sv l ∈ R d s 1 × d s 0 + d v 1

are learnable parameters; activation functions are denoted as σ_s and σ_v. Initially, sequence information is processed through an embedding layer and then concatenated with additional scalar node features before entering the gvp layers:

h seq = Embedding Sequence

(28)

h s = Concat h s 0, h seq

(29)

In the architecture of the embedding layer, the dimension of the word table is structured as (d_seq, d_seq) and h_s0∈R^d_hs0. Following this, both node features and edge features are fed into an initialization module, which includes a layer normalization and a gvp layer.:

h s 1, h v 1 = gvp LayerNorm h s, h v

(30)

e s 1, e v 1 = gvp LayerNorm e s, e v

(31)

where h_s∈R^d_seq+d_hs0, h_v∈R^d_hv0×3, h_s1∈R^d_hs1, h_v1∈R^d_hv1×3, e_s∈R^d_es0, e_v∈R^d_ev0×3, e_s1∈R^d_es1, and e_v1∈R^d_ev1×3. Following the initial setup, both node and edge attributes are fed into a sequence of GVPConv layers, where the process incorporates 2 stages of GVP layers for message transmission. The formulas defining the GVPConv layer operations are presented below:

m s_ij l = Concat h s_i l, e s_ij l, h s_j l

(32)

m v_ij l = Concat h v_i l, e v_ij l, h v_j l

(33)

m s_ij_1 l, m v_ij_1 l = gvp m s_ij l, m v_ij l

(34)

m s_ij_2 l, m v_ij_2 l = gvp m s_ij_1 l, m v_ij_1 l

(35)

m s_ij_3 l, m v_ij_3 l = gvp m s_ij_2 l, m v_ij_2 l

(36)

h ̂ s_j l, h ̂ v_j l = Aggregation_Mean i∈N j m s_ij_3 l, m v_ij_3 l

(37)

f ̂ s_j_0 l, f ̂ v_j_0 l = LayerNorm h s_j l + Dropout h ̂ s_j l, h v_j l + Dropout h ̂ v_j l

(38)

f ̂ s_j_1 l, f ̂ v_j_1 l = gvp f ̂ s_j_0 l, f ̂ v_j_0 l

(39)

f ̂ s_j_2 l, f ̂ v_j_2 l = gvp f ̂ s_j_1 l, f ̂ v_j_1 l

(40)

h s_i l + 1, h v_j l + 1 = LayerNorm h s_j l + Dropout f ̂ s_j_2 l, h v_j l + Dropout f ̂ v_j_2 l

(41)

In Eqs. 34, 35, and 39, the scalar and vector features utilize ReLU and sigmoid activation functions, respectively. Contrarily, other gvp layers operate without activation functions;

m s_ij l ∈ R 2 d hs 1 + d es 1

m v_ij l ∈ R 2 d hv 1 + d ev 1 × 3

m s_ij_1 l

m s_ij_2 l

m s_ij_3 l

h ̂ s_j l

f ̂ s_j_0 l

f ̂ s_j_2 l

and

h s_i l + 1 ∈ R d hs 1

m v_ij_1 l

m v_ij_2 l

m v_ij_3 l

h ̂ v_j l

f ̂ v_j_0 l

f ̂ v_j_2 l

and

h v_j l + 1 ∈ R d hv 1 × 3

f ̂ s_j_1 l ∈ R 4 d hs 1

, and

f ̂ v_j_1 l ∈ R 2 d hv 1 × 3

; and Aggregation_Mean_i∈N(j) indicates the process of averaging the messages across the edges that link node j to its adjacent nodes i.

Hierarchical representation

One of the innovations of KarmaLoop is the hierarchical representation of protein residues and atoms. By adopting both the advantage of atom-level representation (offering detailed descriptions but demanding high resource) and residue-level representation (requiring lower resource but providing less details) of protein, hierarchical representation achieves an effective balance between comprehensive protein characterization with reduced resource demands. As depicted in Fig. 1C, the scalar embeddings h_{p_i} and h_{nl_i} generated from the GVP block and GT block for the protein atoms and nonloop residues are merged. The atom node features are summed and concatenated with the corresponding residue features followed by linear layers and an activation layer for feature fusion.

m nl_i = Concat h nl_i, ∑ j = k k ′ h p_j

(42)

m nl_i = MLP LaekyReLU MLP m nl_i

(43)

where p, nl, i, j, k, and k' on the lower right (i.e., subscripts) denote protein atoms, nonloop residues, the index of nonloop residues, the index of nonloop atoms, the start index of the atom corresponding to ith residue, and the end index of the atom corresponding to ith residue, respectively.

Constructing interaction graph

For predicting loop conformations, we design an interaction graph [G_l,nl = (H_l,nl, E_{l, nl},X_l,nl)] derived from the overall protein atom graph G_p and nonloop residue graph G_nl. Previously, we constructed the loop graph and nonloop graph, with edges connecting nodes within the same region. However, it is crucial to establish connections between these regions. Therefore, we construct an interaction graph, which facilitates the exploration of interactions between loop and nonloop regions, thereby deepening our comprehension of their interconnected dynamics. We take nonloop residues and loop atoms as nodes. Then, we fully connect nonloop and loop nodes and consider the global interactions for accurate generation of loop conformations. Initial node features h₀ ∈ R^d_h0 are obtained from the GT encoder's output and merged feature from the output of merge block for loop atoms and nonloop residues, respectively. Edge features e₀ ∈ R^d_e0 are generated through one-hot encoding of edge types and the distances between nodes, with actual distances used for intra-residue and intra-loop edges and −1 for others. Moreover, the coordinates of nonloop nodes x₀ ∈ R^n×3 are directly inherited from the protein graphs, whereas the loop conformation are initialized randomly.

EGNN block

In this research, we incorporate self-attention mechanisms into the message-passing process of the E(n) equivariant model, EGNN, which is noted for its rapid processing capabilities in dynamical systems. Unlike traditional GNNs, the integration of self-attention, akin to transformations seen in RTMScore [51], greatly enhances the representational capacities of GNNs. Additionally, to enhance accuracy, we implement a recycling mechanism inspired by AlphaFold that compels the EGNN layer to predict atomic movements based on the current predicted conformation.

Initially, the scalar node embeddings for proteins and loops, h₀, are set using graph normalization techniques, with updates processed via GVP and GT blocks. The initialization of edge features, e₀, is updated through a dedicated linear layer:

h 1 = GraphNorm h 0

(44)

e 1 = W e_init e 0 + b e_init

(45)

where h₀, h₁ and e₁∈R^d_h, e₀∈R^d_e; W_{e_init}∈R^d_h × d_e and b_{e_init}∈R^d_h are learnable within the linear layer.

The primary building block for updating loop coordinates employs a sequence of 8 sequentially stacked EGNN layers. The message-passing mechanism in the lth layers of EGNN layer is depicted as follows:

q i k, l k∈ 1, …, H = W Q l h i_1 l + b Q l

(46)

k j k, l k∈ 1, …, H = W K l h j_1 l + b K l

(47)

v j k, l k∈ 1, …, H = W V l h j_1 l + b V l

(48)

e ij l = Concat e ij_1 l, x i l − x j l 2

(49)

m ij l = W m 2 l LeakyReLU Dropout W m 1 l e ij l + b m 1 l + b m 2 l

(50)

k ij k, l k∈ 1, …, H = Concat k∈ 1, …, H k j k, l ⊙ m ij l

(51)

w ij k, l = q i k, l ⊙ k ij k, l / d k

(52)

α ij k, l = Softmax j∈N i w ij k, l 2

(53)

h ̂ i_1 l = Dropout W h l Concat k∈ 1, …, H Aggregation_Sum j∈N i w ij k, l ⊙ v j k, l + b h l

(54)

h i_1 l + 1 = Gate_Block h i_1 l, h ̂ i_1 l

(55)

e ij_1 l + 1 = W e l Concat k∈ 1, …, H α ij k, l + b e l

(56)

x i l + 1 = Coords_Update_Block w ij k, l kϵ 1, …, H, jϵN i, x i l, x j l jϵN i

(57)

In linear layers

α ij k, l ∈ R 1

;

h i_1 l

h j_1 l

e ij_1 l

e ij l

m ij l

h ̂ i_1 l

h i_1 l + 1

, and

e ij_1 l + 1 ∈ R d h

;

q i k, l

k j k, l

k ij k, l

v j k, l

w ij k, l ∈ R d k

;

W Q l

W K l

W V l

W m 2 l

W h l

, and

W e l ∈ R d h × d h

W m 1 l ∈ R d h × d h + 1

b Q l

b K l

b V l

b m 1 l

b m 2 l

b h l

, and

b e l ∈ R d h

are learnable; k∈1, …, H indicates the total number of attention heads; each head's dimension, d_k, is the quotient of d_h by H; j∈N(i) indicates the nodes adjacent to node i; Concat refers to the operation of concatenation; Dropout is an operation used to prevent overfitting; LeakyReLU represents a variant of activation functions; Aggregation_Sum_j∈N(i) describes the process of aggregating messages from edges linking node i to its adjacent nodes j; and Softmax_j∈N(i) is the SoftMax function applied to the neighboring nodes j.

The loop node coordinates update was implemented by the Coords_Update_Block:

∆ x ij l ⃑ = x i l − x j l

(58)

∆ x ij l ⃑ = ∆ x ij l ⃑ / ∆ x ij l ⃑ 2

(59)

∆ x ij k, l = ∆ x ij l ⃑ · W x 2 l LeakeyReLU Dropout W x 1 l w ij k, l + b x 1 l + b x 2 l

(60)

∆ x ij l = W H l Concat k∈ 1, …, H ∆ x ij k, l

(61)

∆ x i l = Aggregation_Sum j∈N i ∆ x ij l

(62)

x i l + 1 = x i l + ∆ x i l

(63)

where

x i l

x j l

∆ x ij l ⃑

∆ x ij k, l

∆ x i l

, and

x i l + 1 ∈ R 1 × 3

;

W x 1 l ∈ R d k / 2 × d k

W x 2 l ϵ R 1 × d k / 2

W H l ϵ R 1 × H

b x 1 l ∈ R d k / 2

, and

b x 2 l ∈ R 1

Gate_Block serves as a foundational component for residual connections within EGNN layers and initiates each cycle of recycling.

g = Sigmoid Dropout W g Concat h ̂ new, h old, h ̂ new − h old + b g

(64)

h new = GraphNorm g ⊙ h ̂ new + h old

(65)

where h_old,

h ̂ new

, g, and h_new∈R^d_h; W_g∈R^d_h × d_h and b_g∈R^d_h are learnable parameters.

Mixture density network

The MDN block is used to learn the prior knowledge of distance. The process starts by sequentially concatenating the node embeddings from nonloop residues (h_nl) and loop atoms (h_l) produced by GVP block and GT block, respectively. This concatenated output is then passed through a linear layer, followed by batch normalization, an ELU activation function, and a dropout layer to enhance the capturing of interactions. Subsequently, 3 distinct linear layers are employed to compute the parameters: means μ_p,c, SDs σ_p,c, and mixing coefficients π_p,c. These parameters are vital for defining a mixture density model that captures a distribution of n different distance distributions for each pair between nonloop and loop nodes:

h nl, l = Dropout ELU BatchNorm W p, c Concat h nl, h l + b nl, l

(66)

μ nl, l = ELU W μ h nl, l + b μ + 1

(67)

σ nl, l = ELU W σ h nl, l + b σ + 1.1

(68)

π nl, l = Softmax W π h nl, l + b π

(69)

where W_nl,l∈R^d_nl,l×2d_h, W_μ, W_σ, W_π∈R^n×d_nl,l, b_p,c∈R^d_nl,l, b_μ, b_σ, and b_π∈Rⁿ are learnable parameters of linear layers; h_nl and h_l∈R^d_h.

In our approach to modeling proteins, we utilize both residue-level and atom-level representations. For determining the proximity between loop atoms and nonloop atoms, we select the minimum distance between each loop atom and every atom in a residue as the key metric for assessing the distance between loop and nonloop nodes.

Postprocessing

After KarmaLoop predicted the conformation, an optional force field (FF) optimization was implemented to remove unfavorable steric clashes and generate high-quality loop conformations. OpenMM [63] was deployed for energy minimization by using the ff14SB FF to optimize the predicted conformation. To prevent excessive changes to the protein, we applied a harmonic potential that restrains the backbone atoms (N, CA, C, and CB) to their original locations. Finally, we ran energy minimization on an A100 GPU and saved the optimized protein structure to a new PDB file (with an average speed of about 5 s per loop). Additionally, high tolerance was selected to make sure that not too much expensive computational cost was used, and in practical use, the tolerance can be selected automatically.

Surface pocket selection

The surface pocket selection includes 2 processes.

Radius determination

Initially, a midpoint is determined between 2 anchor residues (the nonloop residues immediately before and after the loop), serving as the preliminary center of the pocket. Subsequently, 2 specific radii are calculated:

a. Half the distance between the centers of mass of the 2 anchor residues.

b. A value derived from doubling the number of loop residues, expressed in Å, with a maximum limit of 16 Å.

From these, the larger radius is chosen for the pocket.

Residue reduction

The selected radius dictates one of 2 scenarios (Fig. S2):

a. if the radius is greater than 20 Å, pseudo-loop conformations are sampled uniformly using a linear interpolation method between the 2 anchor residue conformations. The residues located within 12 Å of these pseudo-loop conformations are included in the pocket.

b. In other cases, this radius is employed to define a smaller sphere, SS, centered at the previously determined midpoint. Nonloop residues located within SS are chosen as the initial pocket. A larger sphere, BS, is subsequently generated using an extended radius (initial radius + 12 Å), aiming to encompass all potential interacting residues. However, the BS-defined pocket may incorporate many residues distant from the loop–nonloop interaction interface, which may have minimal impact on loop conformation prediction. Hence, we reduced the pocket residues by removing the residues only selected by BS but forming similar residue-midpoint directions to these selected by SS. To further enhance the pocket, the residues located within 5 to 7 Å of the already selected pocket residues are also included.

Benchmark dataset

The biennial Critical Assessment of protein Structure Prediction (CASP) is a worldwide contest for protein structure prediction and is also commonly used for loop modeling assessment [17,20,23,64]. The general loop benchmark consists of 2 datasets: CASP13+14 and CASP15 monomer datasets. CASP13+14 is the combination of CASP13 and CASP14, which contains 549 loop structures from 52 proteins (the unmatched structure and sequence sample were removed). CASP15 containing 430 loop structures from 48 proteins was collected from the latest CASP competition. Then, the pocket sequence similarity was calculated by BLAST+ [65] (blastp module), and the sequences with no more than 50% similarity between CASP13+14/CASP15 and the training data were selected as the loop benchmark. Finally, 289 loop structures from 39 proteins in CASP13+14 and 304 loop structures from 41 proteins in CASP15 were selected as the benchmark. The antibody H3 benchmark derived from DeepAb is the combination of the RosettaAntibody and therapeutic benchmark [66], and comprises 92 antibody CDR H3 loop structures from 92 antibody complex structures. According to the standard of the pocket sequence similarity of less than 90% and H3 sequence similarity of less than 50%, we finally got 50 antibodies. The distribution of the loop lengths of those benchmarks is depicted in Fig. S1.

Training dataset

In this study, the training data were collected from PDB and Structural Antibody Database (SAbDab) [67]. The protein data from PDB were filtered by PISCES [68] using the following criteria: solved by x-ray crystallography, sequence identity ≤ 90%, resolution < 3.0 Å, R factor ≤ 0.25, and containing no DNA, RNA, and UNK molecules. The nonsecondary structure of PDB data was defined by DSSP [69], and we defined loops as regions connecting 2 secondary structures that consist of at least 4 residues. The antibody dataset was selected to be the same as that of DeepAb. The training and validation datasets were created using random stratified sampling according to the loop length in a 9:1 ratio. The antibody data were annotated using the Chothia numbering scheme [70]. Specifically, the training data contain the CDR L1, L2, L3, H1, H2, and H3 regions, and each type of CDRs was also split following the 9:1 ratio, according to loop length. Finally, the total numbers of the data for training and validation are 250,257 and 27,832, respectively.

Dataset preprocessing

All the protein data used in this work were downloaded from PDB, SAbDab, and CASP. Water molecules were subsequently removed from the proteins, and the loop regions with nonstandard amino acid residues, missing residues, and incorrect valence bonds were removed. The nonloop pockets were selected in the way mentioned in the “Loop initialization and pocket selection” section.

Training protocol

In response to different application requirements, we formulated 2 specialized training variants for KarmaLoop. For the loop reconstruction task, KarmaLoop was trained on surface pockets by initializing loop conformations without reliance on the ground-truth data. Conversely, when optimizing loop conformations derived from less precise frameworks, such as those predicted by AlphaFold, the training of KarmaLoop focused on pockets located within 12 Å of the raw loop conformations. These conformations were centered around the raw loop's center of mass.

In the training process of the model, it is hypothesized that the MDN block's ability to fit distributions of distances between nonloop and loop node pairs enhances the model's efficiency in constraining the search space for stable conformations, thereby facilitating the identification of the most optimal conformation. For the purpose of predicting conformations, we initially train the MDN block within the KarmaLoop framework using updated node embeddings from encoders and actual loop conformation data as inputs. This training employs an MDN-specific loss function L_MDN alongside 2 auxiliary cross-entropy loss functions (L_atom, L_bond), targeting different aspects of the model:

L = L MDN + 0.001 × L atom + 0.001 × L bond

(70)

L MDN = − log P d l, nl | h l, h nl = − log ∑ n = 1 10 π l, nl, n N d l, nl | μ l, nl, n, σ l, nl, n

(71)

where μ_l,nl,n, σ_l,nl,n, and π_l,nl,n represent mean, SDs, and the mixing coefficients of nth distance distribution; h_l, h_nl, and d_l,nl represent the node embeddings of loop atom nodes, nonloop residue nodes, and the distance between loop and nonloop nodes, respectively. The model was optimized using the Adam optimizer with a batch size of 64, a learning rate of 1 × 10⁻³, and a weight decay of 1 × 10⁻⁵. Training ceases when the validation loss rises for 70 consecutive epochs. At this stage, both the GT and GVP are partially attuned to the distance distributions. Subsequently, we initiate training of the EGNN module and the previously developed scoring module, utilizing the same dataset partitioning technique as in the initial training of the scoring module. We integrate the conformation predicted module's loss function into Eq. 1 with a coefficient of 1, correlating it to the RMSD between the predicted loop conformations and their actual counterparts. The training hyperparameters remain unchanged except for a reduction in the learning rate to 1 × 10⁻⁴ and the removal of weight decay.

L = L rmsd + L MDN + 0.001 × L atom + 0.001 × L bond

(72)

L rmsd = RMSD x l pred, x l label = ∑ n = 1 N x l, n pred − x l, n label 2 N

(73)

where N denotes the loop node number and n represents the index of the loop nodes.

Multi-conformation KarmaLoop

In our study, we observed that initializing the loops with different SDs led to diverse loop conformations, which partially captured the loop's conformational flexibility. Upon evaluating various SDs for loop initialization, we determined that a deviation of 1.75 Å, identical to the training set, yielded the highest success rate for single conformation. In contrast, a deviation of 7 Å was optimal for achieving success in multi-conformation predictions. Consequently, we used 1.75 Å for single-conformation generation and 7 Å for multi-conformation generation.

Evaluation methods

To assess the performance of KarmaLoop, we selected well-established traditional and DL methods renowned for their precision or efficiency. This choice was grounded in our thorough evaluation of loop modeling methods, as detailed in our previous research [71]. The general protein loop modeling methods that were considered include (a) knowledge-based method: FREAD; (b) ab initio methods: DISGRO, NGK, and Rosetta-missing-loop (RML); and (c) DL-based methods: AlphaFold RoseTTAFold and ColabFold, which is compatible with AlphaFold 2.3.1 (users can customize the template database). According to the user guide, NGK requires a start conformation of the loop region, and if there is a missing loop, RML can be used to fill it. In addition to these methods, several antibody structure prediction methods were also examined including DeepAb, RosettaAntibody-G [59], RepertoireBuilder [60], AbodyBuilder [61], AlphaFold, and AlphaFold-Multimer to assess the prediction ability of H3 loop region.

The versions of those methods and options are listed below:

FREAD (version 1.0): The “top_models” parameter was configured to 1 for single-conformation evaluations and 5 for multi-conformation evaluations, thereby enabling a streamlined assessment of both single and multi-conformational structures.

DISGRO (original version): The “pdbout” parameter was adjusted to 1 and 5 for various comparison modes, with the other options remaining at their default settings.

NGK and RML: Both integrated within the Rosetta suite (version 3.13), NGK employs the “out:nstruct” parameter, while RML utilizes “nstruct,” both set to 1 for single and 5 for multi-conformation evaluations. The other parameters for NGK matched the integration test standards, whereas RML mirrored the guidelines from the provided demo.

RoseTTAFold (version 1.1.0): The “run_pyrosetta_ver.sh” was chosen for the implementation of monomer structure prediction. This selection yields 5 results, analogous to the outcomes produced by AlphaFold, with the other options kept at the defaults.

AlphaFold (version 2.1.1): Two key configurations were made: the “db_preset” was set to “full-dbs”—a choice that, although time-intensive, ensures the acquisition of highly accurate structures. The “model_preset” was configured for “monomer” to facilitate monomer structure prediction.

AlphaFold-Multimer (version 2.1.1): This method was also evaluated. By choosing “model_preset” as “multimer” and other configurations as default.

ColabFold (version 1.5.2): ColabFold was implemented locally (LocalColabFold), and it was compatible with AlphaFold 2.3.1. By choosing “templates” and “custom-template-path” to the template database we made which consisted of all nonloop structures of CASP13+14 and CASP15 datasets.

Additionally, structural data from DeepAb (version 2.2), RosettaAntibody-G, RepertoireBuilder, and AbodyBuilder were sourced from the pdb-format results supplied by DeepAb.

Metrics

For loop modeling tasks, it is essential to consider both accuracy and efficiency. Full-atom RMSD was utilized for evaluating the accuracy of prediction. Because AlphaFold, ColabFold, and RoseTTAFold model the entire protein structures, the predicted structures of proteins are aligned before calculating the RMSD values. Additionally, AlphaFold-Multimer, AlphaFold, DeepAb, RosettaAntibody-G, RepertoireBuilder, AbodyBuilder, and RosettaAntibody predict the complete structures of antibodies. Following the same guidelines as DeepAb, the heavy chains of these antibodies are aligned prior to the RMSD calculation. Moreover, the success rate of the method was introduced as a complementary metric to RMSD, which measures the proportion of cases where the RMSD is less than or equal to 2 Å compared to experimentally determined conformations. The efficiency was simply measured by computing time.

Computing resource

KarmaLoop was trained on 8 NVIDIA A100-SXM4-80GB and 64-core Intel Xeon Platinum 8358P CPU @ 2.60 GHz. For evaluation, KarmaLoop was evaluated on a Tesla V100S GPU. Traditional methods (FREAD, DISGRO, NGK, and RML) were accomplished in parallel with 48-core Intel Xeon Gold 6240R CPU @ 2.40 GHz. AlphaFold-Multimer, AlphaFold, ColabFold, and RoseTTAFold were executed in parallel on 20 cores of the Intel Xeon Gold 6240R CPU @ 2.40 GHz and a Tesla V100S GPU. DeepAb, RosettaAntibody-G, RepertoireBuilder, AbodyBuilder, and RosettaAntibody predicted antibody conformation was derived from DeepAb provided data.

Funding

Less

National Key Research and Development Program of China(2022YFF1203000)

References

Less

Tang

, Zhang

, Liang

. Fast protein loop sampling and structure prediction using distance-guided sequential chain-growth Monte Carlo method. PLOS Comput Biol. 2014;10(4): Article e1003539.

Rufino

, Donate

, Canard

, Blundell

. Predicting the conformational class of short and medium size loops connecting regular secondary structures: Application to comparative modelling. J Mol Biol. 1997;267(2):352–367.

Lins

, Thomas

, Brasseur

. Analysis of accessible surface of residues in proteins. Protein Sci. 2003;12(7):1406–1417.

Stevens

, He

. Benchmarking the accuracy of AlphaFold 2 in loop structure prediction. Biomolecules. 2022;12(7):985.

Lee

, Lee

, Park

, Coutsias

, Seok

. Protein loop modeling by using fragment assembly and analytical loop closure. Proteins. 2010;78(16):3428–3436.

, Meng

, Nie

, Schwaneberg

, Davari

, Xu

, Ji

, Liu

. Flexibility regulation of loops surrounding the tunnel entrance in cytochrome P450 enhanced substrate access substantially. ACS Catal. 2022;12(20):12800–12808.

Crean

, Biler

, van der Kamp

, Hengge

, Kamerlin

SCL

. Loop dynamics and enzyme catalysis in protein tyrosine phosphatases. J Am Chem Soc. 2021;143(10):3830–3845.

Borges

, Brissos

, Hernandez

, Masgrau

, Lucas

, Monza

, Frazão

, Cordeiro

, Martins

. Methionine-rich loop of multicopper oxidase McoA follows open-to-close transitions with a role in enzyme catalysis. ACS Catal. 2020;10(13):7162–7176.

Jones

, Thornton

. Prediction of protein-protein interaction sites using patch analysis11Edited by G. von Heijne. J Mol Biol. 1997;272(1):133–143.

10.

Streaker

, Beckett

. Ligand-linked structural changes in the Escherichia coli biotin repressor: The significance of surface loops for binding and allostery. J Mol Biol. 1999;292(3):619–632.

11.

Myllykoski

, Raasakka

, Han

, Kursula

. Myelin 2′,3′-cyclic nucleotide 3′-phosphodiesterase: Active-site ligand binding and molecular conformation. PLOS ONE. 2012;7(2): Article e32336.

12.

Tamamis

, Floudas

. Molecular recognition of CXCR4 by a dual tropic HIV-1 gp120 V3 loop. Biophys J. 2013;105(6):1502–1514.

13.

Papaleo

, Saladino

, Lambrughi

, Lindorff-Larsen

, Gervasio

, Nussinov

. The role of protein loops and linkers in conformational dynamics and allostery. Chem Rev. 2016;116(11):6391–6423.

14.

Zhao

, Zhu

, Li

, Friesner

. Progress in super long loop prediction. Proteins. 2011;79(10):2920–2935.

15.

Berman

, Westbrook

, Feng

, Gilliland

, Bhat

, Weissig

, Shindyalov

, Bourne

. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–242.

16.

Marks

, Deane

. Antibody H3 structure prediction. Comput Struct Biotechnol J. 2017;15:222–231.

17.

Choi

, Deane

. FREAD revisited: Accurate loop structure prediction using a database search algorithm. Proteins. 2010;78(6):1431–1440.

18.

Messih

, Lepore

, Tramontano

. LoopIng: A template-based tool for predicting the structure of protein loops. Bioinformatics. 2015;31(23):3767–3772.

19.

Hildebrand

, Goede

, Bauer

, Gruening

, Ismer

, Michalsky

, Preissner

. SuperLooper—A prediction server for the modeling of loops in globular and membrane proteins. Nucleic Acids Res. 2009;37:W571–W574.

20.

Karami

, Guyon

, De Vries

, Tufféry

. DaReUS-loop: Accurate loop modeling using fragments from remote or unrelated proteins. Sci Rep. 2018;8(1):13673.

21.

Barozet

, Chacón

, Cortés

. Current approaches to flexible loop modeling. Curr Res Struct Biol. 2021;3:187–191.

22.

Marks

, Nowak

, Klostermann

, Georges

, Dunbar

, Shi

, Kelm

, Deane

. Sphinx: Merging knowledge-based and ab initio approaches to improve protein loop prediction. Bioinformatics. 2017;33(9):1346–1353.

23.

Liang

, Zhang

, Zhou

. LEAP: Highly accurate prediction of protein loop conformations by integrating coarse-grained sampling and optimized energy scores with all-atom refinement of backbone and side chains. J Comput Chem. 2014;35(4):335–341.

24.

Canutescu

, Dunbrack RL Jr. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci. 2003;12(5):963–972.

25.

Wang

, Bradley

, Baker

. Protein-protein docking with backbone flexibility. J Mol Biol. 2007;373(2):503–519.

26.

Mandell

, Coutsias

, Kortemme

. Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nat Methods. 2009;6(8):551–552.

27.

Stein

, Kortemme

. Improvements to robotics-inspired conformational sampling in rosetta. PLOS ONE. 2013;8(5): Article e63090.

28.

Deane

, Blundell

. CODA: A combined algorithm for predicting the structurally variable regions of protein models. Protein Sci. 2001;10(3):599–612.

29.

Deane

, Blundell

. A novel exhaustive search algorithm for predicting the conformation of polypeptide segments in proteins. Proteins. 2000;40(1):135–144.

30.

Jumper

, Evans

, Pritzel

, Green

, Figurnov

, Ronneberger

, Tunyasuvunakool

, Bates

, Žídek

, Potapenko

, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589.

31.

Baek

, DiMaio

, Anishchenko

, Dauparas

, Ovchinnikov

, Lee

, Wang

, Cong

, Kinch

, Schaeffer

, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373(6557):871–876.

32.

Lee

, Su

, Tseng

. Comparative studies of AlphaFold, RoseTTAFold and Modeller: A case study involving the use of G-protein-coupled receptors. Brief Bioinform. 2022;23(5):bbac308.

33.

Liang

, Jiang

, Yuan

, Othman

, Xie

, Feng

. Differential performance of RoseTTAFold in antibody modeling. Brief Bioinform. 2022;23(5):bbac152.

34.

Terwilliger

, Liebschner

, Croll

, Williams

, McCoy

, Poon

, Afonine

, Oeffner

, Richardson

, Read

, et al. AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination. Nat Methods. 2024;21(1):110–116.

35.

Evans R, O'Neill M, Pritzel A, Antropova N, Senior A, Gree T, Žídek A, Bates R, Blackwell S, Yim J, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2021. https://doi.org/10.1101/2021.10.04.463034

36.

Yin

, Pierce

. Evaluation of AlphaFold antibody-antigen modeling with implications for improving predictive accuracy. Protein Sci. 2024;33(1):16.

37.

Abramson

, Adler

, Dunger

, Evans

, Green

, Pritzel

, Ronneberger

, Willmore

, Ballard

, Bambrick

, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;1–3.

38.

Nguyen

, Li

, Xu

, Shang

. New deep learning methods for protein loop modeling. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(2):596–606.

39.

Ouyang W, Luo P, Zeng X, Qiu S, Tian Y, Li H, Yang S, Wang Z, Xiong Y, Qian C, et al. DeepID-Net: Multi-stage and deformable deep convolutional neural networks for object detection. arXiv. 2014. https://doi.org/10.48550/arXiv.1409.3505

40.

Pan F, Zhang Y, Lo C-C, Mandal A, Liu X, Zhang J. Protein loop modeling and refinement using deep learning models. bioRxiv. 2021. https://doi.org/10.1101/2021.11.03.467148

41.

Ruffolo

, Guerra

, Mahajan

, Sulam

, Gray

. Geometric potentials from deep learning improve prediction of CDR H3 loop structures. Bioinformatics. 2020;36(Suppl. 1):i268–i275.

42.

Ruffolo

, Sulam

, Gray

. Antibody structure prediction using interpretable deep learning. Patterns. 2022;3(2):100406.

43.

Liang

, Zhang

, Sarmiento

, Standley

. Protein loop modeling with optimized backbone potential functions. J Chem Theory Comput. 2012;8(5):1820–1827.

44.

Bonet

, Segura

, Planas-Iglesias

, Oliva

, Fernandez-Fuentes

. Frag'r'Us: Knowledge-based sampling of protein backbone conformations for de novo structure-based protein design. Bioinformatics. 2014;30(13):1935–1936.

45.

López-Blanco

, Canosa-Valls

, Li

, Chacón

. RCD+: Fast loop modeling server. Nucleic Acids Res. 2016;44(W1):W395–W400.

46.

Chys

, Chacón

. Random coordinate descent with spinor-matrices and geometric filters for efficient loop closure. J Chem Theory Comput. 2013;9(3):1821–1829.

47.

Coutsias

, Seok

, Jacobson

, Dill

. A kinematic view of loop closure. J Comput Chem. 2004;25(4):510–528.

48.

Gray

, Moughon

, Wang

, Schueler-Furman

, Kuhlman

, Rohl

, Baker

. Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol. 2003;331(1):281–299.

49.

Meiler

, Baker

. ROSETTALIGAND: Protein-small molecule docking with full side-chain flexibility. Proteins. 2006;65(3):538–548.

50.

Leach

. Ligand docking to proteins with discrete side-chain flexibility. J Mol Biol. 1994;235(1):345–356.

51.

Shen

, Zhang

, Deng

, Gao

, Wang

, Xu

, Pan

, Hou

, Kang

. Boosting protein–ligand binding pose prediction and virtual screening based on residue–atom distance likelihood potential and graph transformer. J Med Chem. 2022;65(15):10691–10706.

52.

Morehead A, Chen C, Cheng J. Geometric transformers for protein interface contact prediction. arXiv. 2021. https://doi.org/10.48550/arXiv.2110.02423

53.

Dwivedi VP, Bresson X. A generalization of transformer networks to graphs. arXiv. 2020. https://doi.org/10.48550/arXiv.2012.09699

54.

Jing B, Eismann S, Suriana P, Townshend RJL, Dror R. Learning from protein structure with geometric vector perceptrons. arXiv. 2020. https://doi.org/10.48550/arXiv.2009.01411

55.

Kryshtafovych

, Schwede

, Topf

, Fidelis

, Moult

. Critical assessment of methods of protein structure prediction (CASP)—Round XIII. Proteins. 2019;87(12):1011–1020.

56.

Kryshtafovych

, Schwede

, Topf

, Fidelis

, Moult

. Critical assessment of methods of protein structure prediction (CASP)—Round XIV. Proteins. 2021;89(12):1607–1617.

57.

Mirdita

, Schütze

, Moriwaki

, Heo

, Ovchinnikov

, Steinegger

. ColabFold: Making protein folding accessible to all. Nat Methods. 2022;19(6):679–682.

58.

Pache R, Sperber M, Combs S, Rosenberger G, Model a missing loop. https://new.rosettacommons.org/demos/latest/public/model_missing_loop/README

59.

Jeliazkov

, Frick

, Zhou

, Gray

. Robustification of RosettaAntibody and Rosetta SnugDock. PLOS ONE. 2021;16(3): Article e0234282.

60.

Schritt

, Li

, Rozewicki

, Katoh

, Yamashita

, Volkmuth

, Cavet

, Standley

. Repertoire builder: High-throughput structural modeling of B and T cell receptors. Mol Syst Des Eng. 2019;4(4):761–768.

61.

Dunbar

, Krawczyk

, Leem

, Marks

, Nowak

, Regep

, Georges

, Kelm

, Popovic

, Deane

. SAbPred: A structure-based antibody prediction server. Nucleic Acids Res. 2016;44(W1):W474–W478.

62.

Jin WG, Barzilay R, Jaakkola T. Antibody-antigen docking and design via hierarchical structure refinement. Paper presented at: 39th International Conference on Machine Learning (ICML); 2022; Baltimore, MD. p. 10217–10227.

63.

Eastman

, Friedrichs

, Chodera

, Radmer

, Bruns

, Ku

, Beauchamp

, Lane

, Wang

, Shukla

, et al. OpenMM 4: A reusable, extensible, hardware independent library for high performance molecular simulation. J Chem Theory Comput. 2013;9(1):461–469.

64.

Lee

, Heo

, Seok

. Effective protein model structure refinement by loop modeling and overall relaxation. Proteins. 2016;84:293–301.

65.

Camacho

, Coulouris

, Avagyan

, Ma

, Papadopoulos

, Bealer

, Madden

. BLAST+: Architecture and applications. BMC Bioinform. 2009;10:421.

66.

Raybould

MIJ

, Marks

, Krawczyk

, Taddese

, Nowak

, Lewis

, Bujotzek

, Shi

, Deane

. Five computational developability guidelines for therapeutic antibody profiling. Proc Natl Acad Sci USA. 2019;116(10):4025–4030.

67.

Dunbar

, Krawczyk

, Leem

, Baker

, Fuchs

, Georges

, Shi

, Deane

. SAbDab: The structural antibody database. Nucleic Acids Res. 2013;42:D1140–D1146.

68.

Wang

, Dunbrack

. PISCES: Recent improvements to a PDB sequence culling server. Nucleic Acids Res. 2005;33:W94–W98.

69.

Touw

, Baakman

, Black

, te Beek

TAH

, Krieger

, Joosten

, Vriend

. A series of PDB-related databanks for everyday needs. Nucleic Acids Res. 2015;43:D364–D368.

70.

Chothia

, Lesk

, Tramontano

, Levitt

, Smith-Gill

, Air

, Sheriff

, Padlan

, Davies

, Tulip

, et al. Conformations of immunoglobulin hypervariable regions. Nature. 1989;342(6252):877–883.

71.

Wang

, Wang

, Zhang

, Shen

, Zhang

, Wang

, Wu

, Jin

, Zhou

, Chen

, et al. Comprehensive assessment of protein loop modeling programs on large-scale datasets: Prediction accuracy and efficiency. Brief Bioinform. 2024;25(1): Article bbad486.

Appendix

Less

Year 2024 volume 7 Issue 7

PDF

186

Cite this Article

BibTeX

Article Info

doi: 10.34133/research.0408

Receive Date：2024-04-25
Online Date：2025-08-07
Published：2024-07-25

Article Data

Affiliations

History

Received：2024-04-25
Accepted：2024-05-22

Funding

National Key Research and Development Program of China(2022YFF1203000)

Affiliations

¹Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.

² Zhejiang Laboratory, Hangzhou 311100, Zhejiang, China.

³Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China.

⁴Department of Pathology, New York University Medical Center, New York, NY 10016, USA.

⁵College of Life Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.

Corresponding:

^* Address correspondence to: tingjunhou@zju.edu.cn (T.H.); kimhsieh2@gmail.com (C.-Y.H.); yukang@zju.edu.cn (Y.K.)

References

Share

https://castjournals.cast.org.cn/joweb/research/EN/10.34133/research.0408

Share to

Scan QR to access full text

Cite this article

BibTeX

Citations

表12种不同金属材料的力学参数

科 Family	属数 Number of genus	种数 Number of species	占总种数比例 Percentage of total species (%)	属 Genus	种数 Number of species	占总种数比例 Percentage of total species (%)
鹅膏菌科Amanitaceae	2	11	5.26	鹅膏菌属 Amanita	10	4.78
小菇科 Mycenaceae	2	12	5.74	丝盖伞属 Inocybe	5	2.39
多孔菌科 Polyporaceae	8	14	6.70	蜡蘑属 Laccaria	5	2.39
红菇科 Russulaceae	3	23	11.00	小皮伞属 Marasmius	6	2.87
				小菇属 Mycena	11	5.26
				光柄菇属 Pluteus	5	2.39
				红菇属 Russula	17	8.13
				栓菌属 Trametes	5	2.39

关闭全屏

BibTeX
EndNote
RefWorks
TxT

Fig. 1. The workflow of KarmaLoop for loop modeling. (A) A loop modeling example, from a protein having a missing loop region to a whole protein with the predicted loop coordinates. (B) KarmaLoop architecture: the GT block consists of multi-head attention layers of the GT, while the GVP block represents GVP. The merge block indicates the layers of merging the nonloop atom node features into the corresponding residue node features, and the Loop Mask denotes the process of distinguishing loop nodes from nonloop nodes. The architecture incorporates the MDN denoted as the MDN block and the E(n) EGNN layer denoted as the EGNN block. (C) Process of merging nonloop atom node features and residue node features. The nonloop residue node is represented by a dark blue dot with a black frame, the nonloop atom node feature is represented by a light blue dot, and the loop atom node feature is represented by a tawny dot. (D) Message-passing scheme of EGNN: the node features. The graph on the left is the initial state of each node, and the graph on the right is the updated state. During the message-passing process, both node features, edge features, the node distances, and the node directions form the messages passed along edges and the pseudo-force F for coordinate updating, resulting in the features and coordinates of the updated nodes. (E) The structures of 5 KarmaLoop-generated loops are displayed, with the pink representing the KarmaLoop-predicted conformation and the blue indicating the ground truth (CASP IDs and loop lengths from left to right: T1129s2-D1: 4 amino acids, T1145-D2: 4 amino acids, T1056: 4aa, T1044: 7 amino acids, and T1137s3-D1: 10 amino acids). All samples are taken from CASP13, CASP14, and CASP15.

Fig. 2. Performance of KarmaLoop and protein loop modeling methods on the CASP dataset. (A) Empirical cumulative distribution of RMSD of the tested methods on (A) the CASP13+14 dataset and (B) the CASP15 dataset, and the dashed lines indicate the 1- and 2-Å RMSD cutoff. (C) The success rate was calculated in the 2-Å RMSD threshold, with all the methods generating 5 conformations. (D) RMSD distribution of the tested methods. The red boxes indicate the CASP13+14 dataset, and the salmon boxes indicate the CASP15 dataset. (E) Average and SD of the RMSDs of all the tested methods.

Fig. 3. Performance of KarmaLoop and the other tested methods across various loop lengths (loop lengths consist of more than one sample). The performance was tested on (A) the CASP13+14 benchmark dataset and (B) the CASP15 dataset.

Fig. 4. The time cost of each tested method. (A) Time distribution of each method. (B) Average time cost. Further details on the hardware setup can be found in the main text.

Fig. 5. Comparison of the antibody CDR H3 loop prediction accuracy. (A) Distribution of the RMSD values for each method. AF-multimer represents AlphaFold-Multimer. (B) Average RMSDs of different lengths on the tested methods. The error bar shows the Std. of the corresponding length target. (C) Distribution of the RMSD values for each method (K indicates that KarmaLoop refines the H3 loops based on the antibody structures predicted by the corresponding method). (D) Average RMSDs of different lengths on the tested methods.

Fig. 6. Comparison of KarmaLoop predictions and ground-truth conformations in 2 case studies: (A) AR and (B) Renin. Columns from left to right depict the KarmaLoop predicted conformations (in pink) alongside the 2 experimental-determined ground-truth loop conformations (in blue and green), as well as a comparison between the 2 ground-truth conformations.

Articles: Latest Articles; Most Read; Collections

Updates: Events; News; Multimedia

About: About Us

Contact

No. 86 Xueyuan South Road, Haidian District, Beijing

100081

010-62199257

qkjq@cast.org.cn

Copyright © 2025 China Association for Science and Technology. All rights reserved. For all open access content, the relevant licensing terms apply.
Sponsored by the Office of the Leading Group for Cybersecurity and Informatization of CAST, and supported by Science and Technology Review Publishing House