Cytochrome P450 Enzyme Design by Constraining the Catalytic Pocket in a Diffusion Model

Cytochrome P450 Enzyme Design by Constraining the Catalytic Pocket in a Diffusion Model

PDF

Qian Wang¹^,²^,³^,^†, Xiaonan Liu¹^,²^,³^,^†, Hejian Zhang¹^,³^,⁴^,^†, Huanyu Chu¹^,²^,^†, Chao Shi⁵^,^†, Lei Zhang¹^,⁶, Jie Bai¹^,³, Pi Liu¹^,³, Jing Li¹^,³^,⁷^,⁸, Xiaoxi Zhu¹^,²^,³, Yuwan Liu¹^,³, Zhangxin Chen⁵, Rong Huang¹^,³, Hong Chang¹^,³, Tian Liu¹^,³, Zhenzhan Chang⁵^,^*, Jian Cheng¹^,³^,^*, Huifeng Jiang¹^,³^,^*

Research. Vol 7 Article ID 0413

Less

Research. Vol 7 Article ID 0413

• Research Article •

Cytochrome P450 Enzyme Design by Constraining the Catalytic Pocket in a Diffusion Model

Full

Affiliations

¹Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China.

² University of Chinese Academy of Sciences, Beijing 100049, China.

³ National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China.

⁴College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, China.

⁵Department of Biochemistry and Biophysics, School of Basic Medical Sciences, Peking University, Beijing 100191, China.

⁶College of Life Science and Technology, Wuhan Polytechnic University, Wuhan, Hubei 430023, China.

⁷State Key Laboratory of Elemento-Organic Chemistry, College of Chemistry, Nankai University, Tianjin 300071, China.

⁸College of Life Science, Nankai University, Tianjin 300071, China.

Published: 2024-07-08 doi: 10.34133/research.0413

Outline

Abstract

Less

Although cytochrome P450 enzymes are the most versatile biocatalysts in nature, there is insufficient comprehension of the molecular mechanism underlying their functional innovation process. Here, by combining ancestral sequence reconstruction, reverse mutation assay, and progressive forward accumulation, we identified 5 founder residues in the catalytic pocket of flavone 6-hydroxylase (F6H) and proposed a “3-point fixation” model to elucidate the functional innovation mechanisms of P450s in nature. According to this design principle of catalytic pocket, we further developed a de novo diffusion model (P450Diffusion) to generate artificial P450s. Ultimately, among the 17 non-natural P450s we generated, 10 designs exhibited significant F6H activity and 6 exhibited a 1.3- to 3.5-fold increase in catalytic capacity compared to the natural CYP706X1. This work not only explores the design principle of catalytic pockets of P450s, but also provides an insight into the artificial design of P450 enzymes with desired functions.

Cite this Article

Qian Wang, Xiaonan Liu, Hejian Zhang, Huanyu Chu, Chao Shi, Lei Zhang, Jie Bai, Pi Liu, Jing Li, Xiaoxi Zhu, Yuwan Liu, Zhangxin Chen, Rong Huang, Hong Chang, Tian Liu, Zhenzhan Chang, Jian Cheng, Huifeng Jiang. Cytochrome P450 Enzyme Design by Constraining the Catalytic Pocket in a Diffusion Model[J]. Research, 2024 , 7 (7) : 0413 . DOI: 10.34133/research.0413

Full Text

Less

Introduction

Less

Cytochrome P450 enzymes (P450s) are ubiquitous in nearly all living organisms, playing pivotal roles in various metabolic processes and pathways crucial for life, growth, and development [1]. As the most versatile biocatalysts in nature, P450s not only catalyze more than 95% of the reported oxidation and reduction reactions [2–4], but also are known as a “Universal catalyst” in industrial applications due to the ability of selective oxidation of inert carbon–hydrogen bonds under mild conditions [5,6]. Therefore, obtaining new P450s with better properties has become an important goal in the field of bioengineering [7,8]. In spite of huge functional diversity, most P450s share the same catalytic mechanism [9,10] and similar structural scaffolds [4]. However, the catalytic pockets exhibited significant variability in P450s with different functions (Fig. S1) [11]. Moreover, the nonpolar composition and unique conformational flexibility of the substrate binding pockets are likely to enhance the capacity of these enzymes to modify their active sites and adapt to new substrates and selectivity [4]. Considering the high evolvability of P450s, directed evolution has been extensively employed in engineering P450s with better traits [12–15]. However, this method often necessitates multiple rounds of random mutagenesis and high-throughput screening, making it challenging to exhaustively explore the potential protein space, whether in the laboratory or computationally [16].

The rapid development of deep learning has opened up a new method to acquire novel P450s with desired characteristics. Even though impressive achievements have been witnessed in protein structure prediction [17,18], the desired functional design is still a big challenge [19,20]. Recent developments in protein design leveraged by deep learning methods encompass a broad spectrum. These include designing sequences for fixed backbones [21], variable backbone design [22], and the direct generation of novel sequences and backbones within the natural protein space [23]. These models employ various architectures, including convolutional neural networks (CNNs), graph neural networks (GNNs), and Transformers, which are all instrumental in capturing the complex interactions between amino acids within a protein sequence [19]. The abundance of sequence and structure data contributes to these deep learning models surpassing the performance of traditional physical or statistical models [24,25]. However, when considering functional design, it is impossible to collect sufficient high-quality functional data to train a sophisticated model to create sequences with a desired function [26,27]. Considering the current shortage, an approach that fuses knowledge-based techniques to scrutinize the design principles of natural P450s with powerful deep learning models to expand the natural protein sequence space, may be appropriate for designing new P450s. As our comprehension of the fundamental mechanisms that govern the evolution of the catalytic pocket for functional innovation in natural P450s remains limited, elucidating the process by which a particular P450 adopts a new function becomes crucial in designing a new one.

In this work, we used a flavone 6-hydroxylase (CYP706X1) from Erigeron breviscapus as an example, which belongs to the CYP706X subfamily and converts apigenin into scutellarein in the biosynthetic pathway of scutellarein (Fig. S2) [28]. Firstly, we determined the founder residues constituting the catalytic pocket responsible for the functional innovation of the P450 gene through ancestral sequence reconstruction, reverse mutation assay, progressive forward accumulation, and crystallographic analysis. Then, we elucidated the design principle of catalytic pocket for the functional innovation by an in-depth structural analysis. Finally, we devised the P450Diffusion, an artificial P450 generative model, by integrating the catalytic pocket design principle with a denoising diffusion probabilistic model that has demonstrated outstanding performance in image generation [29]. With the P450Diffusion model, we successfully designed 10 artificial P450s with F6H activity, and one design outperforms the naturally best-performing gene about 3.5-fold, indicating the potential of P450Diffusion in the design of new P450 enzymes.

Results

Less

Functional innovation of F6H in CYP706 family

Among the characterized P450s in CYP706 family, only the P450s in CYP706X subfamily could catalyze the flavonoid substrates, indicating that the F6H function may be de novo innovated in the ancestor of the CYP706X subfamily (Fig. 1A and Fig. S3). Moreover, we found that the catalytic pocket's configuration of CYP706X1 (i.e., EbF6H from E. breviscapus) is totally different from other P450s in the CYP706 family. The substrate apigenin could not even be properly positioned in other P450s with a C6-prone reactive state, which refers to the molecular configuration that is best suited for binding to the catalytic pocket of the enzyme and undergoing a reaction (Fig. S4). Therefore, it provides us an opportunity to decipher the constructive mechanisms for the formation of F6H's catalytic pocket by comparing the neighboring genes in the CYP706 family.

We compared the evolutionary trajectory between the CYP706X subfamily and the most closely non-functional CYP706Y subfamily using ancestral sequence reconstruction (see the Phylogenetic analysis and ancestral sequence reconstruction section). By testing the function of the inferred ancestral P450s for all key nodes in the phylogenetic tree (Fig. 1B), most ancestral sub-nodes in the CYP706X subfamily displayed significant F6H activity (Fig. 1C and Fig. S5). Conversely, the F6H function disappeared in both the common ancestor of CYP706X and CYP706Y subfamilies (ancXY) and the ancestor of CYP706Y subfamily (ancY) (Fig. 1C). Thus, the F6H's catalytic pocket should be originated when the CYP706X subfamily diverged from the common ancestor of CYP706X and CYP706Y. To gain insight into the evolution of the catalytic pocket underlying functional innovation, we determined the crystal structure of ancX3, which was found to crystallize more readily after screening for crystallization conditions (Figs. S5 and S6 and Table S1). Indeed, the binding mode of apigenin in the common ancestor of the CYP706X subfamily (ancX) was obviously different from the non-functional ancXY, though they possessed very similar structural arrangement (RMSD < 1.0 Å, sequence identity = 83%) (Fig. 1D and E). In ancX, the substrate apigenin not only forms a strong π–π stacking interaction with residue Phe, but also forms an obvious hydrogen bond with the Oxo group of CpdI, which stabilized the substrate in a C6-prone reactive state in ancX's catalytic pocket (Fig. 1D). However, in ancXY, the substrate apigenin could not form interactions like ancX with the residues in catalytic pocket and CpdI species, which means a non-C6-prone reactive state in the catalytic pocket of ancXY, which takes away the possibility of ancXY being catalytic on apigenin (Fig. 1E).

Founder residues for functional innovation of F6H

In order to clarify the molecular mechanism of forming the catalytic pocket with F6H function, we proposed to analyze the changes of residue compositions between catalytic pockets of non-functional ancXY and functional ancX. Within 8 Å range of the active center, 16 out of 48 residues are different (Fig. 2A). Interestingly, when we replaced all of the 16 residues with the corresponding residues in ancX, the mutant (referred to as the ancXY-16) obtained F6H function (Fig. 2B). Given that not all residues in the catalytic pocket contributed significantly to substrate recognition and binding due to different locations of residues in 3-dimensional space [30], we attempt to find out the founder residues in the catalytic pocket by the reverse mutation assay (RMA: evaluating the mutational effect of each residue within the ancXY-16's catalytic pocket by reverting it to ancestral type).

Firstly, RMA was conducted on the 16 residues of ancXY-16. We found that one of these mutations (A220L) inactivated the ancXY-16, and 12 mutations significantly decreased the catalytic activity, but 4 mutations (i.e., G111A, N119Q, F251L, and V307L) had less impact on the activity (Fig. 2B). Structural analysis showed that these 4 mutations were distant from the P450 catalytic center and were not involved in the changes in the residue's intrinsic hydrophilicity/hydrophobicity (see the Structural modeling and analysis section, Fig. S7). We excluded these 4 mutations to construct the ancXY-12, which still shows F6H function. Furthermore, RMA against the 12 residues of ancXY-12 revealed that 2 reverse mutations (i.e., A220L and T114I) would deactivate the enzyme, while the remaining 10 mutations only weakened enzyme activity, indicating that the actual number of founder residues is fewer than 12.

In order to identify founder residues more quickly, we used a progressive forward accumulation (PFA) strategy that progressively added important mutations to ancXY until the mutant gains F6H function (Fig. 2C). Initially, we gradually added single mutation to ancXY according to the order of the RMA mutational effect in ancXY-12. Until 5 mutations were accumulated, the constructed ancXY-5 (i.e., L220A/I114T/T317A/W123F/L248M) displayed F6H activity. RMA analysis against ancXY-5 showed that each of the 5 reverse mutations deactivated the enzyme, which verified the necessity of the 5 mutations in ancXY-5. To further confirm the crucial roles of these 5 residues for F6H's functional innovation, we perform a second PFA according to the order of the RMA mutational effect in ancXY-16 (Fig. S8). The order of the RMA mutational effect in 2 rounds of RMA (ancXY-16 and ancXY-12) is very similar; only a minor discrepancy lies in S115T, which ranked fourth in the first round, dropped to sixth in the second round. Until 6 mutations are accumulated in the second PFA, the constructed ancXY-6 (i.e., L220A/T317A/1114T/T115S/L248M/W123F) displayed F6H activity. RMA analysis of ancXY-6 revealed only 5 reverse mutations (A220L, T114I, A317T, F123W, and M248L) deactivated the enzyme, indicating that T115S mutation is not critical for the emergence of F6H function.

Therefore, these results above showed that the mutations of the 5 amino acids (L220A/I114T/T317A/W123F/L248M) play a founder role (referred to as founder residues in the following) in the F6H functional innovation process from ancXY to ancX. As to the other 11 residues, the structural analysis showed that these mutations decreasing the catalytic activity might play auxiliary roles in the enzyme catalysis due to no direct interactions with the substrate apigenin (Figs. S7 and S9).

The principle of catalytic pocket for functional innovation of F6H

We further interpreted the underlying mechanism of 5 founder residues for functional innovation through an in-depth analysis of the apigenin-binding model in ancXY-5 (Fig. 3A). The 5 founder residues could be divided into 2 parts according to their roles in protein structure. The first part included I114T, W123F, and L248M, which mainly contributed to fix or bind the apigenin. For example, the I114T introduced a hydrogen bond with 7′ hydroxyl of apigenin with an energy contribution of 0.66 ± 0.10 kcal/mol (see the MD simulations and MM-PB/GBSA section, Fig. 3B). A null mutation of T114V in ancXY-5 also ascertained the indispensability of this hydrogen bond for the F6H function (Fig. S10). The W123F contributed to the apigenin binding (−3.14±0.37 kcal/mol) with an aromatic π–π stacking interaction to the phenyl ring of the apigenin and alleviated the spatial conflicts caused by ancestral tryptophan in the ancXY (Fig. 3C). The L248M, located in the substrate access gate, was not only involved in the substrate tunneling process (Fig. 3D), but also contributed to the apigenin binding with a π stacking to the phenyl ring of apigenin. The second part included L220A, and T317A contributed to alleviate inappropriate interactions and space conflicts. The L220A alleviated the space conflict conducted by ancestral leucine and provided sufficient space for the placement of the B ring of substrate apigenin through the introduction of a small side chain (Fig. 3E). The T317A not only provided sufficient space for the placement of the A ring of apigenin but also avoided the wrong-orientation apigenin-binding mode shown in nonfunctional ancXY caused by a hydrogen bond between the hydroxyl group of threonine and the substrate (Fig. 3F).

Based on the mutations of 5 founder residues, it appears that, with an appropriate spatial capacity (provided by small side chain residues A220 and A317), the catalytic pocket evolved following a “3-point fixation” model. The “3-point fixation” refers to essential interactions with 3 pivots in apigenin including the following: 4′-OH of apigenin molecule (the first pivot) was fixed by the hydrogen bond from T114, the “B” ring of apigenin (the second pivot) was fixed by the π stacking interactions from F123 and M248, and 7-OH of apigenin (the third pivot) was fixed by the hydrogen bond with CpdI iron-oxo moiety (Fig. S11). The model held the substrate apigenin in a reactive near-attack conformation (NAC), which maintained the relative orientation between the reaction site of apigenin and CpdI iron-oxo moiety at a favorable distance and angle (3.6 Å and 155°), thus serving to initiate the 6-hydroxylation reaction of apigenin in the catalytic process (Fig. S12). We propose that the “3-point fixation” model could serve as the design principle for the catalytic pocket responsible for the natural functional innovation of F6H, which also offers us the potential to de novo design P450s with the desired functions.

Diffusion model-based designing of P450 with the specific function

Hundreds of thousands of P450 protein sequences collected in public databases offer us an opportunity to learn natural P450 sequence diversity and design new functional P450s [31]. Recent advancements in diffusion models have shown significant potential in enhancing the design of P450 enzymes with specific functions [29,32]. Here, we proposed a P450 sequences diffusion model (P450Diffusion) to de novo design P450s with a desired function by combining the diffusion model with the design principle of F6H catalytic pocket (Fig. 4A). P450Diffusion mainly consists of 2 models (i.e., pre-trained and fine-tuning diffusion models). Firstly, 226,509 natural P450 sequences were collected to train a pre-trained P450 sequence diffusion model. This pre-trained model consists of 2 subprocesses: a forward diffusion subprocess, which gradually adds Gaussian noise to the representation of P450 sequence until it becomes random noise, and a reverse generation subprocess, which starts from random noise and gradually de-noises the representation of P450 sequence to generate a new P450 sequence. After 150,547 training rounds, the pre-trained diffusion model could generate a wide variety of sequences, with similarities to natural sequences ranging from 20% to 50%. Secondly, 19,202 P450 sequences with appreciable similarity to CYP706X subfamily were used to fine-tune the pre-trained diffusion model for ensuring that the generated sequences have a similar structural backbone to the F6H. Besides, the 5 founder residues including T114, F123, A220, M248, and A317 were constrained to ensure the reproduction of the “3-point fixation” design principle in de novo generated sequences. The model integrating training set fine-tuning with constrained generation was referred to as the fine-tuning diffusion model.

Furthermore, we used the fine-tuning diffusion model to generate a total of 60,000 non-natural P450 sequences, which share about 50% average amino acid identity to that of the natural sequences. In comparison with natural P450s, the generated sequences not only have a highly similar distribution of Shannon entropies for each position in multiple sequence alignments, but also display very consistent residue–residue co-evolution patterns and physicochemical properties (Figs. S13 and S14). However, the generated sequences can be grouped into smaller clusters and interpolated between the natural sequence clusters, indicating that the generated sequences have higher diversity than natural P450s (Fig. 4B). It is noteworthy that the sequences generated by the fine-tuning P450Diffusion model form a larger cluster, exhibiting greater similarity to the CYP706X subfamily, thereby demonstrating the effectiveness of the fine-tuning model. Besides, we compared the distribution of 5 founder residues among natural and generated P450s (Fig. 4C). It is found that except the threonine (T) in position 317, other positions are highly variable in natural and generated P450s from pre-trained model, even in natural P450s from CYP706 family. However, all 5 founder residues are relatively conserved in the generated P450s from the fine-tuning model, indicating that the P450Diffusion possessed the capability of generating sequences with an amino acid distribution similar to that of natural F6H on the basis of constrained 5 founder residues.

Experimental verification and structural insights of de novo generated P450s

Finally, we experimentally tested whether the generated sequences from P450Diffusion were true P450 enzymes, and performed F6H function. In order to accurately obtain functional sequences from numerous designs, we conducted virtual screening on 60,000 generated sequences based on 3 specific criteria: the computational scores of composite metrics for assessing the quality of generated sequences, the 3-dimensional pocket constraints of the 5 founder residues, and the robustness of the apigenin binding modes (see the “Computational evaluation and structure-based virtual screening for generated sequences” section, Fig. 4A). Following virtual screening, 17 promising designs were meticulously selected for further exploration. These designs were subsequently synthesized and expressed within yeast expression systems. Notably, these designs exhibited sequence identities ranging from 70% to 87% when compared to CYP706X1, underlining their potential as novel catalysts (Table S2). The recombinant yeasts were cultivated for 4 days by feeding apigenin as substrate and HPLC analysis revealed 10 designs with significant F6H activity (Fig. 5A). Surprisingly, there are 6 designs that exhibited a 1.3- to 3.5-fold increase in scutellarein production compared to CYP706X1 (Fig. 5B). The 4 remaining active designs also displayed comparable activities with other natural F6H enzymes (i.e., Cnan706X and Lsal706X). Therefore, the results indicated that the P450Diffusion not only could capture the fundamental design principle of F6H catalytic pocket and effectively generate P450s sequences with F6H activity, but also selected out the better P450 enzymes compared to natural sequences from the P450 sequence space.

Furthermore, we presented a structural perspective on the active designs as well as the distinctions between the active and inactive ones. The structural and sequence analysis reveals that nearly all mutations in the generated designs are distant from the substrate binding pocket (Fig. 5C, exemplified by non-functional Design91808 and functional Design4129, and Fig. S15). Additionally, the binding orientations of active designs closely resemble those of natural CYP706X1 (Fig. S16). However, molecular dynamics (MD) simulations have demonstrated significantly weaker binding stability of the substrate apigenin in the inactive designs when compared to the active ones (Fig. 5D). This discrepancy likely serves as the primary reason for the inactivity observed in these 7 designs. Besides, we observed that the overall protein structures of the active designs appear to exhibit greater stability than the inactive ones following extensive MD simulations (Fig. 5E). Notably, significant structural fluctuations are observed, particularly within the sequence ranges of 220 to 230 and 390 to 410, as illustrated in the inactive designs (Fig. S17). For instance, in Design33380, the R229K mutation disrupts the salt bridge with E251, while the S230P mutation causes a break in the alpha-helix structure (Fig. S18). In Design91808, the S407L mutation breaks the hydrogen bond with the backbone of A51, resulting in a less stable protein backbone than observed in active designs (Fig. S19). These results imply that the amino acid mutations on the surface of the protein could lead to a reduction in the global stability of the protein, which further leads to incorrect folding and substrate binding instability, and ultimately to the loss of activity of the designs.

In order to further analyze the other 7 designs without F6H activity, we tested the P450 expression in yeast expression systems by integrating green fluorescent protein (GFP) at the C-terminal. All recombinant proteins successfully showed green fluorescence (Fig. S20). Considering that the GFP is a very stable protein whose incorrect folding of N terminal protein would not affect the brightening of GFP and the GFP itself could also enhance the folding and dissolving, we next design a protein soluble expression experiment. Since the protein expression level in the yeast system is low, it is very difficult to isolate and purify when expressed in yeast cells. In this study, the E. coli system, which could produce more proteins, was used to express these designed P450s. We confirmed the expression of P450s through N-terminal truncation, protein purification, SDS-PAGE gel analysis, and spectral analysis of the P450 enzyme generated in this study (Supplementary Information, Figs. S21 and S22). Results indicate that all proteins were expressed; however, those with higher catalytic activity showed better soluble expression in the E. coli system. This analysis provided us with valuable insights for future improvements of the P450 generative model.

Discussion

Less

Nature has evolved an amazing array of enzymes to catalyze biological functions and enabled living systems to face diverse environmental challenges [33]. Gene duplication contributes most to the generation of new enzymes [34], especially for cytochrome P450s, which evolve to the largest enzyme family for plant metabolism by widespread whole-genome and tandem duplications [7,35,36]. Although most duplicates are lost or subfunctionalized by purifying or neutral selection, neofunctionalization often happened in P450 evolution due to high plasticity and variability of catalytic pockets [37,38]. The evolutionary trajectory of P450's functional innovation have attracted researchers' attention for a long time [39–41] and the previous researches were mainly focused at the gene level [42–46] or residue level [47,48]. In this study, based on ancestral sequence reconstruction, RMA and PFA, we identified 5 founder residues in the catalytic pocket of flavone 6-hydroxylase (F6H). However, considering that there are several mutations left that also affect F6H function, we speculated that these 5 residues may represent the most probable combination for F6H functionality, and there might be other residue combinations with similar effects.

Upon conducting additional structural analysis of the apigenin-binding model in ancXY-5, we have formulated a “3-point fixation” model to elucidate the design principle of the catalytic pocket that played a pivotal role in the functional innovations of F6H function. The “3-point fixation” model seems to be a general principle for the substrate binding in P450's catalytic pocket, such as the camphor binding in P450cam [49] and N-palmitoyl glycine binding in P450BM3 [50]. Similar fixation rules could also be found in the general enzymatic catalysis where the substrates or catalytic residues are held in the catalytic pockets [51], even as a term commonly used in medicine and architecture [52]. It is worth mentioning that besides the “3-point fixation” model, nature also evolved other catalytic pocket design principles for functional innovations in P450s. For example, the SbaiCYP82D4, as the isoenzyme of CYP706X1, has evolved to a completely different catalytic pocket configuration for flavone 6-hydroxylation [53,54]. The catalytic pocket of SbaiCYP82D4 consisted of more residues with strong hydrophobicity, and no obvious hydrogen bond was found between surrounding residues and substrate apigenin, making the substrate bind in an “oblique binding” orientation (Fig. S23), which is distinguished with the “vertical binding” orientation in CYP706X1. Although a different substrate binding model was found in SbaiCYP82D4, the substrate apigenin also formed a reactive conformation in a NAC model to enable the initiation of the catalytic reaction. This fact indicated that substrates in P450s could be held in favorable orientations with different fixing rules under the premise of sufficient space and suitable shape for the placement of the substrate.

P450 enzymes, as a type of membrane protein, are extremely difficult to express and purify in large quantities, which also increases the difficulty of designs and modifications for P450 enzymes. Under conventional conditions, yeast, as a eukaryotic microorganism, provides suitable conditions for the correct folding of membrane proteins like P450s due to its endomembrane system. However, due to the need to extract peroxisomes and the low expression levels of proteins, there are few cases of purifying P450 proteins from yeast systems. Here, we attempted the designed P450s folding tests in both yeast and E. coli systems. Preliminary results showed that all proteins were expressed, but in the E. coli system, proteins with higher catalytic activity exhibited better soluble expression. Currently, the purification of membrane proteins such as P450s does indeed face significant challenges, and it is believed that with the advancement of protein purification technologies, further in-depth studies on protein expression will be realized. Techniques for increasing solubility and ensuring correct folding are also promising directions for future research in P450 protein designs.

The rapid development of deep learning has witnessed many impressive achievements in protein structure design, while the desired functional design is still a big challenge [55–57]. Our research provides a novel strategy for the de novo design of P450s with specific function by coupling the design principle of catalytic packet with deep learning model. In this study, non-natural P450s with F6H function were successfully designed by integrating the “3-point fixation” model with a denoising diffusion probabilistic model. The structural analysis of active designs suggested that the design principle of F6H catalytic pocket has been fully incorporated into the deep learning model. Furthermore, the structural insights between active and inactive designs suggest that mutations on protein surface may be the fundamental factors contributing to the inactivity or reduced activity of designed sequences, providing us with valuable insights for future improvements of the P450 generative model. More structure or sequence-based features should be considered, like the substrate-tunneling feature, the overall stability of protein, and so on.

In general, the current work provides insights into the principle of pocket design in the P450 functional innovations and offers a potential research paradigm for the de novo design of P450 enzymes with desired functions. With the increase of in-depth investigated P450s, more catalytic pocket design principles would be deciphered and facilitate the design of P450s with novel and desired functions.

Materials and Methods

Less

Phylogenetic analysis and ancestral sequence reconstruction

The P450 sequences of CYP706 subfamilies were selected from the previous study [28], including 10 P450s of CYP706X/Y subfamilies for ancestral sequence reconstruction, and a P450 of CYP706W subfamily as an out-group. The transmembrane domains of P450 sequences were annotated with the TMHMM package [58]. Using the crystal structures of CYP76AH1 (PDB ID: 5YLW), a structural information-based sequence alignment of the P450s deprived of N-transmembrane region was generated by Expresso [59]. Poorly aligned regions (N- and C-termini) were trimmed. Then, a phylogenetic ML tree was created with the RAxML [60]. All protein sequences of ancestral nodes were deduced using FastML [61,62]. The N- and C-terminal amino acids including transmembrane domain derived from CYP706X1 were added to each ancestor. Ultimately, we obtained the most probable ancestor of CYP706Y subfamily (ancY) and CYP706X subfamily (ancX), the common ancestor of 2 subfamilies (ancXY), and all sub-ancestors of CYP706Y subfamily (ancY1, ancY2, and ancY3) and CYP706X subfamily (ancX1, ancX2, and ancX3) in the sub-nodes of the phylogenetic tree (Fig. 1B). The ancestral sequences are available in the Supplementary information.

Crystallization and structure solution

Initial crystallization screening was performed using the sitting-drop vapor-diffusion method with commercial crystal screen kits at 16 °C. The ancX3 protein at a concentration of 10 mg/ml in buffer (2 mM KH₂PO₄, 8 mM K₂HPO₄, 500 mM NaCl, 0.2 mM EDTA, 1 mM DTT, and 10% [v/v] glycerol, pH 7.4) was used in the initial crystallization screening to determine the crystallization condition. The ancX3 protein was mixed with precipitant solution at a drop size of 0.6+0.6 μl against the reservoir containing 50 μl of precipitant solution. The crystals grew from the mixture with the precipitant solution consisting of 1.34 M NaCl, 13.4% (w/v) PEG3350, 0.1 M MgCl₂, and 0.1 M imidazole, pH 6.5. The crystal optimization was performed using the hanging-drop vapor-diffusion method at 16 °C against the reservoir containing 0.5 ml of the precipitant solution. The drops contained 2 μl of precipitant solution, 2 μl of ancX3 protein, and 0.2 μl of additive solution (40% v/v polypropylene glycol P400) from the Hampton additive screen kit.

Crystals of ancX3 were mounted from the crystallization drops in nylon loops and flash-frozen in liquid nitrogen using the cryoprotectant consisting of 1.34 M NaCl, 13.4% (w/v) PEG3350, 0.1 M MgCl₂, 0.1 M imidazole, and 25% (v/v) glycerol, pH 6.5. Diffraction data (λ = 0.97918 Å) were collected on beamlines 17U1 at Shanghai Synchrotron Radiation Facility for IFS crystals. Diffraction images were indexed, integrated, and scaled using the XDS program. Details of the data collection statistics are summarized in Table S1.

The structure of ancX3 was solved by molecular replacement with the structure of CYP76AH1 (PDB code: 5YLW) as search model [63]. Iterative model building and refinement were performed using COOT and PHENIX, respectively. Coordinates and structure factors have been deposited with the PDB under accession id 8JC2.

Structural modeling and analysis

Structure modeling: The 3D models of ancestral proteins and generated designs are predicted by the local ColabFold algorithm through inputting the crystal structure of ancX3 as one of the templates [64].

Molecular docking: The Cartesian coordinates and atom charges of CpdI were obtained from published data [65]. The structure of substrate apigenin was obtained from PubChem [66], and assigned with AM1-BCC charges [67]. An ensemble of different conformations of the substrate were generated by enumerating these under OpenBabel [68]. Substrate rotamers were extensively sampled around the C2–C1′ axis with 5° intervals. The mol2-formatted CpdI and apigenin were parameterized with molfile_to_params.py script. Before molecular docking, the protein structure complex with CpdI species was firstly sampled and minimized by the RosettaRelax protocol without constraints [69,70]. Then, the apigenin was docked into relaxed structures using RosettaLigand [71–73]. Distance restraints were added between the Fe ion and ligated cysteine (2.3 Å ± 0.1 Å), between carboxylate groups of heme and arginines (2.2 Å ± 0.4 Å) in Rosetta-Scripts [74]. Each run of 100,000 models was generated with the MPI [75] version of RosettaLigand, and the top 100 models with the lowest Rosetta energy unit were clustered with Calibur [76], and the structures with the lowest binding free energy (interface_delta) were selected as our final docking models. The Rosetta scripts and option files for RosettaLigand are available in the Supplementary Information.

Pocket analysis: The hydrophilicity calculation and hydrophobic surface area evaluation of CYP pockets are implemented with fpocket software [77]. The dpocket function of fpocket2 was used to extract binding pocket descriptors for different P450s.

MD simulations and MM-PB/GBSA

The ancestral proteins underwent 100-ns simulations using MD to optimize their complex structures. Additionally, the complex structures of the generated designs underwent a 300-ns MD simulation for further structural optimization and to observe overall protein and substrate fluctuations within the catalytic pockets. All simulations were performed in triplicate.

System setup: Our target models with CpdI and substrate molecules were set as the initial structures for MD simulation. The protein structures were prepared with the pdb4amber application in the Amber20 package [78]. The force field for the CpdI species was taken from published data [65]. The partial atomic charges and missing parameters for substrate apigenin were generated by Antechamber with an AM1-BCC charge model [79,80]. Na⁺ ions (0.15 M) were added to the protein surface to neutralize the total charge of the system. Finally, the resulting system was solvated in a rectangular box of TIP3P waters extending up to a minimum cutoff of 12 Å from the protein boundary. The Amber ff14SB force field was employed for all the proteins in MD simulations.

MD simulations: After proper parameterizations and setup, the resulting systems were minimized with 2 steps (the first step with 5,000 steps of steepest descent and 10,000 steps of conjugate gradient, and the second step with 10,000 steps of steepest descent and 30,000 steps of conjugate gradient) to remove the poor contacts and relax the systems. The systems were then gently annealed from 0 to 300 K under the NVT ensemble for 50 ps with a restraint of 5 kcal mol⁻¹ Å⁻². Subsequently, the systems were maintained for a total of 5 rounds of density equilibration of 20 ps in the NPT ensemble at a target temperature of 300 K and a target pressure of 1.0 atm using the Langevin thermostat [81] with a restraint of 1 kcal mol⁻¹ Å⁻². A total of 5 rounds of density equilibration relaxed the system to achieve a uniform density after heating dynamics under periodic boundary conditions. Thereafter, we removed all of the restraints applied during heating and density dynamics and further equilibrated the systems for ∼2 ns to get a well-settled pressure and temperature for conformational and chemical analyses. This was followed by an MD production run for each of the systems. During all of the MD simulations, the covalent bonds containing hydrogen were constrained using SHAKE [82] and particle-mesh Ewald [83] was used to treat long-range electrostatic interactions. All of the MD simulations were performed with the GPU version of the Amber 20 package.

MM-PB/GBSA: The python script mmpbsa.py [84] in Amber20 package was used in this research to analyze the binding free energy of apigenin. According to the systematic research of Hou et al., the inclusion of the conformational entropy may be crucial for the prediction of absolute binding free energies but not for ranking the binding affinities of similar ligands [85]. The binding free energy analysis is implemented here just for analyzing the interaction energy contribution of each key residue. Therefore, the change of conformational entropy upon ligand binding has been ignored in our calculation because of expensive computational cost and low prediction accuracy. The calculation procedure mainly referred to the MMPBSA protocol in AMBER tutorial websites (https://ambermd.org/tutorials/advanced/tutorial3/index.php).

Building and training the P450Diffusion

Denoising diffusion probability models (or diffusion models, for short) work by applying a Markov process to corrupt the training data by successively adding Gaussian noise, then learning to recover the data by reversing this denoising process [86]. We adapt this framework to generate protein sequences, introducing necessary modifications to encode the discrete protein sequences into a vector of a specific length. We used physicochemical character-based schemes, the principal components score Vectors of Hydrophobic, Steric, and Electronic properties (VHSE8, [87]), to encode protein sequences. The P450Diffusion is composed of a U-Net with self-attention layers and features a classical U-shaped structure with down-sampling and up-sampling blocks.

To build the P450Diffusion, we screened and analyzed all potential P450s from a published P450 database [31] and public databases, filtering out sequences with a length greater than 560 and resulting in 226,509 sequences to form the training dataset. Then, we encode the training dataset, where each amino acid in the protein sequence is encoded as an 8-dimensional vector, and each batch protein sequence is encoded as a 64×1×560×8 vector. Here, 64 is the batch size equal to the number of samples in the training data; 1 represents the channel size; 560 represents the maximum length of the protein sequence; and 8 represents the VHSE8 encode vector for each amino acid in the protein sequence. If the protein sequence is shorter than 560, we add gaps until it reaches a length of 560. In this case, we assign a vector of 8 zeroes as the encoding for gaps. Then, we started to train the pre-trained P450 sequence diffusion model. After 150,547 training steps, the loss functions of the pre-trained diffusion model converged and the model was obtained (Fig. S24A).

In order to generate sequences with the F6H function more effectively, we fine-tune the pre-trained diffusion model with the filtered dataset by selecting sequences with more than 30% amino acid identity to the CYP706X subfamily and clustering them with 90% sequence similarity. Finally, a total of 19,234 sequences formed a fine-tuning dataset. Meanwhile, we assigned different sample weights to 30 sequences from the CYP706X subfamily and other sequences in the fine-tuning dataset. The sampling weight ratio between the 30 sequences from the CYP706X subfamily and other sequences was 600:1. The P450Diffusion was obtained after 150,500 training steps (Fig. S24B).

The P450Diffusion architecture to generate P450 sequences was based on the diffusion model. The diffusion model is composed of a U-Net with self-attention layers. The main difference with traditional U-Net is that the up-sampling and down-sampling blocks support an extra timestep argument on their forward pass. This is done by embedding the timestep linearly into the convolutions. In the training process, the network takes a batch of noisy protein sequences of shape (batch size, channels, height, and width) and a batch of noise levels of shape (batch size, 1) as input, and returns a tensor of shape (batch size, channels, height, and width). In this model, we used a mean squared error loss (MSELoss) function and optimized the networks with the AdamW algorithm, setting the learning rate to 2e-4. Our model was implemented in PyTorch and trained on 6 GeForce RTX 3090 systems for about 150,000 steps, which took approximately 63 h.

Computational evaluation and structure-based virtual screening for generated sequences

Three criteria were used to screen the generated sequences in silico to improved experimental validation success rates: the computational scores of composite metrics for assessing the quality of generated sequences, the 3-dimensional pocket constraints of the 5 founder residues, and the robustness of the apigenin binding modes. Details are as follows.

We used random protein sequences of length 560 with the 5 founder residues as the starting sequence for the diffusion model sample. In the reverse diffusion process, we perform 600 steps of denoising the 60,000 starting sequences to obtain 60,000 generated sequences. In order to increase the likelihood that the generated sequences would function as F6H, we evaluated the generated protein sequences using a variety of computational metrics, including esm-1v [88], Alphafold2 [18], ProteinMPNN [89], and others [90]. Firstly, the 60,000 generated sequences were screened by the sequence motif constructed by the 5 founder residues, and 77 sequences were filtered out. Secondly, both the 77 generated sequences and the F6H sequences were scored for esm-1v, and then the top 33 sequences in the esm-1v results were selected for alphafold2 structure modeling. Thirdly, the constructed structures and sequences were evaluated using ProteinMPNN, and the top 19 designs were selected based on their ProteinMPNN scores, which were higher than that of CYP706X1 (−1.63). Fourthly, substrate apigenin and CpdI were docked into constructed structures using RosettaLigand and the substrate-binding models were obtained based on binding affinity (interface_delta_X). Subsequently, MD simulations were performed to evaluate the overall structure stability and binding pocket stability for each designed sequences. Finally, 17 substrate-binding structures that meet the catalytic pocket constraints constituted by founder residues and maintain stable substrate binding modes were chosen as candidate sequences for experimental verification (Fig. S25).

Cloning construction and product detection

Chemicals and media used in this study are shown in the Supplementary Materials. All primers used in this study are listed in Table S3. All strains and plasmids are listed in Table S4. The protein sequences and DNA sequences can be found in the Supplementary Information. Nucleotide sequences of ancXY, ancX, ancX1, ancX2, ancX3, and ancX-16 were codon optimized for Saccharomyces cerevisiae and synthesis by Genscript, China. Subsequently, the gene fragments, ATR2 (P450 reductase from Arabidopsis thaliana) and the head-to-head promoters (pPGK1-pTDH3) were cloned into the vector Y22-TC using the Minerva Super Fusion Cloning Kit (US Everbright Inc., China). The assembly system was transformed into DMT competent cells and the sequences assembled successfully were verified by further sequencing. For mutants constructing, mutation sites were introduced by the mutant primers that are listed in Table S3 and used the same method for recombinant vectors' assembly. The nucleotide sequences of P450 designs were codon optimized for S. cerevisiae and subcloned between PGK1 promoter and CYC1 terminator of Y22-PE by Genscript, China.

Because the functional expression of P450 enzyme needed an auxiliary reductase partner (CPR), the ATR2 from Arabidopsis thaliana was cloned into expression vector YCplac33-TP, which contained a TDH3 promoter and a PDC1 terminator and named Y33-ATR2. The plasmid Y33-ATR2 was preserved in our laboratory. The recombinant vectors containing P450 enzymes designed by deep learning were separately co-transformed with Y33-ATR2 into W303-1B, and transformants were selected on a tryptophan and uracil minus plate (CM-Trp-Ura). Three colonies were picked for each genotype, and used to inoculate 3 ml of CM-Trp-Ura medium in a 24-well plate. The recombinant vectors containing ATR2 and P450 (ancXY, ancX, ancX1, ancX2, or ancX3) were directly transformed into W303-1B without extra Y33-ATR2 and cultured in tryptophan minus medium (CM-Trp). The cells were grown at 30 °C and 550 rpm for 48 h, after which the resulting seed cultures were transferred into fresh medium at a ratio of 1:50. The new cultivation was fermented under the same condition for 4 days after feeding 1 mM apigenin. For the mutants, flasks containing 30 ml of medium were then inoculated at a ratio of 1:50 using the resulting seed cultures by feeding 1 mM apigenin. The main cultures were grown at 30 °C and 220 rpm for 4 days. The product extraction method and HPLC detection method were based on our previous study [28] and were described in detail in the Supplementary Methods.

Funding

Less

Key Technologies Research and Development Program of Anhui Province (2019YFA0905700 and No. 2021YFC2103500)
National Natural Science Foundation of China (No. 32371499)
China Postdoctoral Science Foundation (Grant No. 2019M661032)
National Natural Science Foundation of China (Grant No. 31901026 and No. 32171418)
Tianjin Synthetic Biotechnology Innovation Capacity Improvement Project(No. TSBICIP-KJGG-002-02 and No. TSBICIP-CXRC-015)
Tianjin Science Fund for Distinguished Young Scholars(No.18JCJQJC48300)

References

Less

Nelson

. Cytochrome P450 diversity in the tree of life. Biochim Biophys Acta Proteins Proteom. 2018;1866(1):141–154.

Lamb

, Follmer

, Goldstone

, Nelson

, Warrilow

, Price

, True

, Kelly

, Poulos

, Stegeman

. On the occurrence of cytochrome P450 in viruses. Proc Natl Acad Sci USA. 2019;116(25):12343–12352.

Liang

, Wei

, Qiu

, Jiao

. Homogeneous oxygenase catalysis. Chem Rev. 2018;118(10):4912–4945.

Manikandan

, Nagini

. Cytochrome P450 structure, function and clinical significance: A review. Curr Drug Targets. 2018;19(1):38–54.

Coon

. Cytochrome P450: Nature's most versatile biological catalyst. Annu Rev Pharmacol Toxicol. 2005;45:1–25.

Bernhardt

, Urlacher

. Cytochromes P450 as promising catalysts for biotechnological application: Chances and limitations. Appl Microbiol Biotechnol. 2014;98(14):6185–6203.

Liu

, Zhu

, Wang

, Liu

, Cheng

, Jiang

. Discovery and modification of cytochrome P450 for plant natural products biosynthesis. Synth Syst Biotechnol. 2020;5(3):187–199.

, Jiang

, Guengerich

, Ma

, Li

, Zhang

. Engineering cytochrome P450 enzyme systems for biomedical and biotechnological applications. J Biol Chem. 2020;295(3):833–849.

Moody

, Raven

. The nature and reactivity of ferryl heme in compounds I and II. Acc Chem Res. 2018;51(2):427–435.

10.

Shaik

, Cohen

, Wang

, Chen

, Kumar

, Thiel

. P450 enzymes: Their structure, reactivity, and selectivity modeled by QM/MM calculations. Chem Rev. 2010;110(2):949–1017.

11.

Nair

, McKinnon

, Miners

. Cytochrome P450 structure–function: Insights from molecular dynamics simulations. Drug Metab Rev. 2016;48(3):434–452.

12.

, Schwaneberg

, Fischer

, Schmid

. Directed evolution of the fatty-acid hydroxylase P450 BM-3 into an indole-hydroxylating catalyst. Chem Eur J. 2000;6(9):1531–1536.

13.

Brandenberg

, Chen

, Arnold

. Directed evolution of a cytochrome P450 carbene transferase for selective functionalization of cyclic compounds. J Am Chem Soc. 2019;141(22):8989–8995.

14.

Yang

, Arnold

. Navigating the unnatural reaction space: Directed evolution of heme proteins for selective carbene and nitrene transfer. Acc Chem Res. 2021;54(5):1209–1225.

15.

Reetz

. Directed evolution of artificial metalloenzymes: A universal means to tune the selectivity of transition metal catalysts? Acc Chem Res. 2019;52(2):336–344.

16.

Brandenberg

, Fasan

, Arnold

. Exploiting and engineering hemoproteins for abiological carbene and nitrene transfer reactions. Curr Opin Biotechnol. 2017;47:102–111.

17.

Baek

, DiMaio

, Anishchenko

, Dauparas

, Ovchinnikov

, Lee

, Wang

, Cong

, Kinch

, Schaeffer

, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373(6557):871–876.

18.

Jumper

, Evans

, Pritzel

, Green

, Figurnov

, Ronneberger

, Tunyasuvunakool

, Bates

, Žídek

, Potapenko

, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589.

19.

Ding

, Nakai

, Gong

. Protein design via deep learning. Brief Bioinform. 2022;23(3):bbac102.

20.

Ferruz

, Heinzinger

, Akdel

, Goncearenco

, Naef

, Dallago

CJC

. From sequence to function through structure: Deep learning for protein design. Comput Struct Biotechnol J. 2022;21:238–250.

21.

Liu

, Zhang

, Wang

, Zhu

, Wang

, Li

, Zhang

, Li

, Chen

, Liu

. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci. 2022;2(7):451–462.

22.

Watson

, Juergens

, Bennett

, Trippe

, Yim

, Eisenach

, Ahorn

, Borst

, Ragotte

, Miles

, et al. De novo design of protein structure and function with RFdiffusion. Nature. 2023;620(7976):1089–1100.

23.

Repecka

, Jauniskis

, Karpus

, Rembeza

, Rokaitis

, Zrimec

, Poviloniene

, Laurynenas

, Viknander

, Abuajwa

, et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat Machine Intell. 2021;3(4):324–333.

24.

Liu

, Chen

. Computational protein design with data-driven approaches: Recent developments and perspectives. Wiley Interdiscip Rev Comput Mol Sci. 2023;13(3): Article e1646.

25.

Malbranke

, Bikard

, Cocco

, Monasson

, Tubiana

. Machine learning for evolutionary-based and physics-inspired protein design: Current and future synergies. Curr Opin Struct Biol. 2023;80: Article 102571.

26.

Sanderson

, Bileschi

, Belanger

, Colwell

. ProteInfer, deep neural networks for protein functional inference. eLife. 2023;12: Article e80942.

27.

, Verma

, Sheridan

, Liaw

, Ma

, Marshall

, McIntosh

, Sherer

, Svetnik

, Johnston

. Deep dive into machine learning models for protein engineering. J Chem Inf Model. 2020;60(6):2773–2790.

28.

Liu

, Cheng

, Zhang

, Ding

, Duan

, Yang

, Kui

, Cheng

, Ruan

, Fan

, et al. Engineering yeast for the production of breviscapine by genomic analysis and synthetic biology approaches. Nat Commun. 2018;9(1):448.

29.

, Jain

, Abbeel

. Denoising diffusion probabilistic models. Adv Neural Inf Proces Syst. 2020;33:6840–6851.

30.

Clifton

, Kaczmarski

, Carr

, Gerth

, Tokuriki

, Jackson

. Evolution of cyclohexadienyl dehydratase from an ancestral solute-binding protein. Nat Chem Biol. 2018;14(6):542–547.

31.

Wang

, Wang

, Liu

, Liao

, Chu

, Chang

, Cao

, Li

, Zhang

, Cheng

, et al. PCPD: Plant cytochrome P450 database and web-based tools for structural construction and ligand docking. Synth Syst Biotechnol. 2021;6(2):102–109.

32.

Anand N, Achim T. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. arXiv. 2022. https://doi.org/10.48550/arXiv2205.15019

33.

Copeland

. Enzymes: A practical introduction to structure, mechanism, and data analysis. John Wiley & Sons; 2023.

34.

Long

, VanKuren

, Chen

, Vibranovski

. New gene evolution: Little did we know. Annu Rev Genet. 2013;47:307–333.

35.

Cheng

, Chen

, Liu

, Li

, Zhang

, Dai

, Lu

, Zhou

, Cai

, Zhang

, et al. The origin and evolution of the diosgenin biosynthetic pathway in yam. Plant Commun. 2021;2(1): Article 100079.

36.

Cheng

, Wang

, Liu

, Zhu

, Li

, Chu

, Wang

, Lou

, Cai

, Yang

, et al. Chromosome-level genome of Himalayan yew provides insights into the origin and evolution of the paclitaxel biosynthetic pathway. Mol Plant. 2021;14(7):1199–1209.

37.

Liu

, Tavares

, Forsythe

, André

, Lugan

, Jonasson

, Boutet-Mercey

, Tohge

, Beilstein

, Werck-Reichhart

, et al. Evolutionary interplay between sister cytochrome P450 genes shapes plasticity in plant metabolism. Nat Commun. 2016;7(1):13026.

38.

Hansen

, Nelson

, Møller

, Werck-Reichhart

. Plant cytochrome P450 plasticity and evolution. Mol Plant. 2021;14(8):1244–1265.

39.

Jensen

. Enzyme recruitment in evolution of new function. Ann Rev Microbiol. 1976;30(1):409–425.

40.

Arnold

. The nature of chemical innovation: New enzymes by evolution. Q Rev Biophys. 2015;48(4):404–410.

41.

Copley

. Setting the stage for evolution of a new enzyme. Curr Opin Struct Biol. 2021;69:41–49.

42.

Long

, Betrán

, Thornton

, Wang

. The origin of new genes: Glimpses from the young and old. Nat Rev Genet. 2003;4(11):865–875.

43.

Zhou

, Zhang

, Xu

, Zhao

, Zhan

, Li

, Ding

, Yang

, Wang

. On the origin of new genes in drosophila. Genome Res. 2008;18(9):1446–1455.

44.

Carvunis

A-R

, Rolland

, Wapinski

, Calderwood

, Yildirim

, Simonis

, Charloteaux

, Hidalgo

, Barbette

, Santhanam

, et al. Proto-genes and de novo gene birth. Nature. 2012;487(7407):370–374.

45.

Ohno

. Evolution by gene duplication. Springer Science & Business Media; 2013.

46.

Zimmer

, Garrood

, Singh

, Randall

, Lueke

, Gutbrod

, Matthiesen

, Kohler

, Nauen

, Davies

TGE

, et al. Neofunctionalization of duplicated P450 genes drives the evolution of insecticide resistance in the brown planthopper. Curr Biol. 2018;28(2):268–274e5.

47.

Renata

, Wang

, Arnold

. Expanding the enzyme universe: Accessing non-natural reactions by mechanism-guided directed evolution. Angew Chem Int Ed. 2015;54(11):3351–3367.

48.

Giunta

, Cea-Rama

, Alonso

, Briand

, Bargiela

, Coscolin

, Corvini

PF-X

, Ferrer

, Sanz-Aparicio

, Shahgaldian

. Tuning the properties of natural promiscuous enzymes by engineering their nano-environment. ACS Nano. 2020;14(12):17652–17664.

49.

Raag

, Poulos

. Crystal structures of cytochrome P-450CAM complexed with camphane, thiocamphor, and adamantane: Factors controlling P-450 substrate hydroxylation. Biochemistry. 1991;30(10):2674–2684.

50.

Haines

, Tomchick

, Machius

, Peterson

. Pivotal role of water in the mechanism of P450BM-3. Biochemistry. 2001;40(45):13456–13465.

51.

Benkovic

, Hammes-Schiffer

. A perspective on enzyme catalysis. Science. 2003;301(5637):1196–1202.

52.

Rana

, Warraich

, Tahir

, Iqbal

, Von See

, Eckardt

, Gellrich

N-C

. Surgical treatment of zygomatic bone fracture using two points fixation versus three point fixation—A randomised prospective clinical trial. Trials. 2012;13(1):34.

53.

Liu

, Cheng

, Zhu

, Zhang

, Yang

, Guo

, Jiang

, Ma

. De novo biosynthesis of multiple pinocembrin derivatives in Saccharomyces cerevisiae. ACS Synth Biol. 2020;9(11):3042–3051.

54.

Gao

, Lou

, Hao

, Qi

, Tian

, Pu

, He

, Wang

, Xu

, et al. Comparative genomics reveal the convergent evolution of CYP82D and CYP706X members related to flavone biosynthesis in Lamiaceae and Asteraceae. Plant J. 2021;109(5):1305–1318.

55.

, Johnston

, Arnold

, Yang

. Protein sequence design with deep generative models. Curr Opin Chem Biol. 2021;65:18–27.

56.

Ovchinnikov

, Huang

P-S

. Structure-based protein design with deep learning. Curr Opin Chem Biol. 2021;65:136–144.

57.

Wang

, Lisanza

, Juergens

, Tischer

, Watson

, Castro

, Ragotte

, Saragovi

, Milles

, Baek

, et al. Scaffolding protein functional sites using deep learning. Science. 2022;377(6604):387–394.

58.

Möller

, Croning

, Apweiler

. Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics. 2001;17(7):646–653.

59.

Armougom

, Moretti

, Poirot

, Audic

, Dumas

, Schaeli

, Keduas

, Notredame

. Expresso: Automatic incorporation of structural information in multiple sequence alignments using 3D-coffee. Nucleic Acids Res. 2006;34(Suppl 2):W604–W608.

60.

Stamatakis

. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30(9):1312–1313.

61.

Ashkenazy

, Penn

, Doron-Faigenboim

, Cohen

, Cannarozzi

, Zomer

, Pupko

. FastML: A web server for probabilistic reconstruction of ancestral sequences. Nucleic Acids Res. 2012;40(W1):W580–W584.

62.

Kaltenbach

, Burke

, Dindo

, Pabis

, Munsberg

, Rabin

, Kamerlin

SCL

, Noel

, Tawfik

. Evolution of chalcone isomerase from a noncatalytic ancestor. Nat Chem Biol. 2018;14(6):548–555.

63.

Shi

, Gu

, Chen

, Huang

, Guo

, Huang

, Deng

, He

, Zhang

, Huang

, et al. Structural insights revealed by crystal structures of CYP76AH1 and CYP76AH1 in complex with its natural substrate. Biochem Biophys Res Commun. 2021;582:125–130.

64.

Mirdita

, Schütze

, Moriwaki

, Heo

, Ovchinnikov

, Steinegger

. ColabFold: Making protein folding accessible to all. Nat Methods. 2022;19(6):679–682.

65.

Dubey

, Wang

, Shaik

. Molecular dynamics and QM/MM calculations predict the substrate-induced gating of cytochrome P450 BM3 and the regio- and stereoselectivity of fatty acid hydroxylation. J Am Chem Soc. 2016;138(3):837–845.

66.

Kim

, Thiessen

, Bolton

, Chen

, Fu

, Gindulyte

, Han

, He

, Shoemaker

, et al. PubChem substance and compound databases. Nucleic Acids Res. 2015;44(D1):D1202–D1213.

67.

Wang

, Wang

, Kollman

, Case

. Automatic atom type and bond type perception in molecular mechanical calculations. J Mol Graph Model. 2006;25(2):247–260.

68.

O'Boyle

, Banck

, James

, Morley

, Vandermeersch

, Hutchison

. Open babel: An open chemical toolbox. J Chem. 2011;3(1):33.

69.

Misura

, Baker

. Progress and challenges in high-resolution refinement of protein structure models. Proteins. 2005;59(1):15–29.

70.

Bradley

, Misura

, Baker

. Toward high-resolution de novo structure prediction for small proteins. Science. 2005;309(5742):1868–1871.

71.

Meiler

, Baker

. ROSETTALIGAND: Protein–small molecule docking with full side-chain flexibility. Proteins. 2006;65(3):538–548.

72.

Davis

, Baker

. RosettaLigand docking with full ligand and receptor flexibility. J Mol Biol. 2009;385(2):381–392.

73.

Lemmon G, Meiler J. Rosetta ligand docking with flexible XML protocols. Comput Drug Discov Design. 2012:143–155.

74.

Fleishman

, Leaver-Fay

, Corn

, Strauch

E-M

, Khare

, Koga

, Ashworth

, Murphy

, Richter

, Lemmon

, et al. RosettaScripts: A scripting language interface to the Rosetta macromolecular modeling suite. PLOS ONE. 2011;6(6): e20161.

75.

Graham RL, Woodall TS, Squyres JM, editors. Open MPI: A flexible high performance MPI. In: Parallel Processing and Applied Mathematics: 6th International Conference, PPAM 2005. Springer Berlin Heidelberg; 2006. p. 228–239.

76.

, Ng

. Calibur: A tool for clustering large numbers of protein decoys. BMC Bioinformatics. 2010;11(1):25.

77.

Le Guilloux

, Schmidtke

, Tuffery

. Fpocket: An open source platform for ligand pocket detection. BMC Bioinformatics. 2009;10:1–11.

78.

Case DA, Aktulga HM, Belfon K, Ben-Shalom I, Brozell SR, Cerutti DS, Cheatham TE, Cisneros GA, Cruzeiro VWD, Darden TA, et al. Amber 2021. San Francisco: University of California; 2021.

79.

Wang

, Wang

, Kollman

, Case

. Antechamber: An accessory software package for molecular mechanical calculations. J Am Chem Soc. 2001;222:U403.

80.

Jakalian

, Jack

, Bayly

. Fast, efficient generation of high-quality atomic charges. AM1-BCC model: II. Parameterization and validation. J Comput Chem. 2002;23(16):1623–1641.

81.

Izaguirre

, Catarello

, Wozniak

, Skeel

. Langevin stabilization of molecular dynamics. J Chem Phys. 2001;114(5):2090–2098.

82.

Ryckaert

J-P

, Ciccotti

, Berendsen

. Numerical integration of the Cartesian equations of motion of a system with constraints: Molecular dynamics of n-alkanes. J Comput Phys. 1977;23(3):327–341.

83.

Darden

, York

, Pedersen

. Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. J Chem Phys. 1993;98(12):10089–10092.

84.

Miller BR III, McGee TD Jr, Swails

, Homeyer

, Gohlke

, Roitberg

. MMPBSA. Py: An efficient program for end-state free energy calculations. J Chem Theory Comput. 2012;8(9):3314–3321.

85.

Hou

, Wang

, Li

, Wang

. Assessing the performance of the MM/PBSA and MM/GBSA methods. 1. The accuracy of binding free energy calculations based on molecular dynamics simulations. J Chem Inf Model. 2011;51(1):69–82.

86.

Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. PMLR; 2015. p. 2256–2265.

87.

Mei

, Liao

, Zhou

, Li

. A new set of amino acid descriptors and its application in peptide QSARs. Biopolymers. 2005;80(6):775–786.

88.

Meier

, Rao

, Verkuil

, Liu

, Sercu

, Rives

. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv Neural Inf Proces Syst. 2021;34:29287–29303.

89.

Dauparas

, Anishchenko

, Bennett

, Bai

, Ragotte

, Milles

, Wicky

BIM

, Courbet

, de Haas

, Bethel

, et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science. 2022;378(6615):49–56.

90.

Johnson SR, Fu X, Viknander S, Goldin C, Monaco S, Zelezniak A, Yang KK. Computational scoring and experimental evaluation of enzymes generated by neural networks. bioRxiv. 2023. https://doi.org/10.1101/2023.03.04.531015

91.

Crooks

, Hon

, Chandonia

J-M

, Brenner

. WebLogo: A sequence logo generator. Genome Res. 2004;14(6):1188–1190.

Appendix

Less

Year 2024 volume 7 Issue 7

PDF

182

103

Cite this Article

BibTeX

Article Info

doi: 10.34133/research.0413

Receive Date：2024-03-25
Online Date：2025-07-24
Published：2024-07-08

Article Data

Affiliations

History

Received：2024-03-25
Accepted：2024-05-27

Funding

Key Technologies Research and Development Program of Anhui Province (2019YFA0905700 and No. 2021YFC2103500)

National Natural Science Foundation of China (No. 32371499)

China Postdoctoral Science Foundation (Grant No. 2019M661032)

National Natural Science Foundation of China (Grant No. 31901026 and No. 32171418)

Tianjin Synthetic Biotechnology Innovation Capacity Improvement Project(No. TSBICIP-KJGG-002-02 and No. TSBICIP-CXRC-015)

Tianjin Science Fund for Distinguished Young Scholars(No.18JCJQJC48300)

Affiliations