Deep learning (DL)-driven synthesis planning may profoundly transform how novel pharmaceuticals and materials are designed. However, progress on many DL-assisted synthesis planning (DASP) algorithms has been held back by the lack of reliable automated pathway evaluation tools. Reaction yield is a critical metric for evaluating chemical reactions, and predicting it accurately makes DASP algorithms more practical in real-world scenarios. Accurate yield prediction for reactions of interest still faces two main challenges: the absence of high-quality generic reaction yield datasets and the absence of robust generic yield predictors. To compensate for the limitations of high-throughput yield datasets, we curated a generic reaction yield dataset containing 12 reaction categories and rich reaction condition information. We then proposed Egret, a reaction yield predictor based on bidirectional encoder representations from transformers (BERT) and trained with 2 pretraining tasks: chemical reaction masked language modeling and contrastive learning. Egret achieved comparable or even superior performance to the best previous models on 4 benchmark datasets and established state-of-the-art performance on the newly curated dataset. We found that reaction-condition-based contrastive learning enhances the model's sensitivity to reaction conditions, and that Egret can capture subtle differences between reactions that involve identical reactants and products but different reaction conditions. Furthermore, we proposed a new scoring function that incorporates Egret into the evaluation of multistep synthesis routes. Test results showed that yield-incorporated scoring facilitated the prioritization of literature-supported high-yield reaction pathways for target molecules. In addition, through a meta-learning strategy, we further improved the reliability of the model's predictions for reaction types with limited data and lower data quality. Our results suggest that Egret has the potential to become an essential component of next-generation DASP tools.
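
To make the idea of yield-aware route evaluation concrete, the following is a minimal sketch. It is not Egret's scoring function, which is not specified in this text. It assumes a linear route whose overall yield is the product of its per-step yields, and it ranks candidate routes by that product. The function names `route_yield` and `rank_routes` and all numbers are hypothetical, chosen only for illustration.

```python
from math import prod


def route_yield(step_yields):
    """Overall fractional yield of a linear route.

    Assumption: steps run in sequence, so the overall yield is the
    product of the per-step yields, each given as a fraction in [0, 1].
    """
    if not step_yields:
        raise ValueError("a route needs at least one step")
    for y in step_yields:
        if not 0.0 <= y <= 1.0:
            raise ValueError(f"yield out of range: {y}")
    return prod(step_yields)


def rank_routes(routes):
    """Return route names sorted from highest to lowest overall yield."""
    return sorted(routes, key=lambda name: route_yield(routes[name]), reverse=True)


# Hypothetical predicted per-step yields for two candidate routes.
candidates = {
    "route_a": [0.90, 0.80],        # 2 steps
    "route_b": [0.95, 0.95, 0.70],  # 3 steps, one low-yield step
}
ranking = rank_routes(candidates)
```

In this toy example a shorter route with moderate yields outranks a longer route containing one low-yield step, which is the kind of distinction a step-count-only score would miss.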
1. We extracted and curated a high-quality generic reaction yield dataset named Reaxys-MultiCondi-Yield from the Reaxys database (Fig. 1C). Compared to high-throughput experimentation (HTE) datasets, Reaxys-MultiCondi-Yield covers a broader chemical space: 12 reaction types, 752 catalysts, 1,152 solvents, 15,007 reagents, and 84,125 reactions. The dataset consists of 11,831 reaction groups, which are sets of reactions with the same reactants and products but varying yields due to different reaction conditions.
2. To implement a general yield prediction model, we designed a pretraining framework named Egret (Fig. 2), which is based on BERT and includes 2 pretraining tasks: masked language modeling (MLM) and reaction-condition-based contrastive learning. Egret performed comparably to or better than the previous best models on 4 benchmark datasets and achieved state-of-the-art performance on the Reaxys-MultiCondi-Yield dataset.
3. We proposed a yield-incorporated scoring function for multistep retrosynthesis planning. The results indicated that it can prioritize literature-supported high-yield synthesis routes for target molecules.
4. Finally, we used a meta-learning strategy to model the low-sample-size or low-quality data of 5 reaction classes in the Reaxys-MultiCondi-Yield dataset, which significantly improved prediction accuracy. Specifically, the accuracy for reaction class 10 increased by 33.33%.
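
The notion of a reaction group from the first item can be sketched as follows. This is an illustrative reconstruction, not the curation pipeline used to build the dataset. It assumes each record is a dictionary with hypothetical fields `reactants`, `products`, `conditions`, and `yield_pct`, and it uses placeholder labels rather than real molecules or measured yields.

```python
from collections import defaultdict


def group_reactions(records):
    """Group records that share the same reactants and products.

    Each group collects the reactions that differ only in their
    conditions (and therefore, possibly, in their yields).
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["reactants"], rec["products"])
        groups[key].append(rec)
    return dict(groups)


def yield_spread(group):
    """Difference between the highest and lowest yield in one group."""
    yields = [rec["yield_pct"] for rec in group]
    return max(yields) - min(yields)


# Hypothetical records; labels and yields are placeholders.
records = [
    {"reactants": "A.B", "products": "C", "conditions": "cat1/solv1", "yield_pct": 85},
    {"reactants": "A.B", "products": "C", "conditions": "cat2/solv1", "yield_pct": 40},
    {"reactants": "D.E", "products": "F", "conditions": "cat1/solv2", "yield_pct": 62},
]
groups = group_reactions(records)
spread = yield_spread(groups[("A.B", "C")])
```

A large spread within a group indicates that conditions alone change the outcome substantially, which is the signal a condition-sensitive model is meant to pick up.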

| Family | Number of genera | Number of species | Percentage of total species (%) | Genus | Number of species | Percentage of total species (%) |
|---|---|---|---|---|---|---|
| Amanitaceae | 2 | 11 | 5.26 | Amanita | 10 | 4.78 |
| Mycenaceae | 2 | 12 | 5.74 | Inocybe | 5 | 2.39 |
| Polyporaceae | 8 | 14 | 6.70 | Laccaria | 5 | 2.39 |
| Russulaceae | 3 | 23 | 11.00 | Marasmius | 6 | 2.87 |
| | | | | Mycena | 11 | 5.26 |
| | | | | Pluteus | 5 | 2.39 |
| | | | | Russula | 17 | 8.13 |
| | | | | Trametes | 5 | 2.39 |