The digital transformation of Chinese Materia Medica (CMM) classics is critical for bridging ancient pharmacological wisdom with modern drug discovery. However, existing Knowledge Graphs (KGs) for CMM are often constructed in isolation, resulting in fragmented information silos that hinder data interoperability. Although Entity Alignment (EA) has become a focal point in the international Semantic Web community, research specifically targeting the alignment of ancient CMM literature remains scarce. Moreover, current state-of-the-art models, which are designed primarily for modern, high-resource languages, struggle with the unique challenges of ancient Chinese texts: severe structural heterogeneity caused by disparate historical writing styles, high terminological ambiguity in which distinct medical concepts share similar characters, and a critical scarcity of high-quality annotated datasets. This study aims to fill this gap by proposing a domain-specific deep learning framework that automates the fusion of multi-source historical medical knowledge.
To overcome these barriers, this paper proposes the Generative Adversarial Fuzzy-boundary Learning (GAFL-Align) model. The study used two representative classics from different historical eras: Shennong Bencao Jing and Tangye Bencao. After data cleaning, the datasets comprised 3 771 and 3 910 normalized entities, respectively, focusing on core categories such as herbs, symptoms, and diseases. The architecture integrated BERT for deep semantic encoding with Graph Attention Networks (GAT) to capture topological structure. To handle distribution shifts across heterogeneous texts, the model employed a Generative Adversarial Network (GAN) for domain adaptation, mapping entities from both sources into a unified feature space. Furthermore, a novel fuzzy-boundary negative sampling strategy was developed to distinguish "hard negatives", that is, terms with high lexical similarity but distinct medical meanings. To address data scarcity, an iterative self-training mechanism with confidence-aware filtering was implemented to augment the training set from a limited number of expert-annotated seed pairs.
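As a rough illustration of two ideas named above, the sketch below selects hard negatives from a similarity band and filters pseudo-labeled pairs by confidence. It is not the GAFL-Align implementation: the function names, the cosine-similarity criterion, the band limits `low` and `high`, the mutual-nearest-neighbour rule, and the `threshold` value are all assumptions made for illustration, and the embeddings are toy vectors rather than BERT or GAT outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hard_negatives(anchor, gold, candidates, low=0.6, high=0.95):
    """Return candidate ids that are similar to the anchor but are not its gold match.

    anchor:     embedding of the source entity
    gold:       id of the annotated counterpart (never returned as a negative)
    candidates: dict mapping target entity id -> embedding
    low, high:  hypothetical similarity band standing in for the "fuzzy boundary"
    """
    scored = [(cosine(anchor, vec), cid)
              for cid, vec in candidates.items() if cid != gold]
    in_band = [(s, cid) for s, cid in scored if low <= s < high]
    # Most similar (hardest) negatives first.
    return [cid for _, cid in sorted(in_band, reverse=True)]

def confident_pairs(source, target, threshold=0.9):
    """Pseudo-label pairs that are mutual nearest neighbours above a threshold.

    This is one assumed form of confidence-aware filtering for self-training.
    """
    def nearest(vec, pool):
        return max(pool, key=lambda k: cosine(vec, pool[k]))

    pairs = []
    for sid, svec in source.items():
        tid = nearest(svec, target)
        mutual = nearest(target[tid], source) == sid
        if mutual and cosine(svec, target[tid]) >= threshold:
            pairs.append((sid, tid))
    return pairs
```

In an iterative setup, the pairs returned by `confident_pairs` would be added to the seed set before the next training round, while `hard_negatives` would supply the contrastive examples for each seed pair.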
Experimental results indicated that GAFL-Align achieved a Hits@1 score of 83.59%, significantly outperforming nine baselines, including translation-based models, GNN variants, and Large Language Model (LLM)-augmented approaches such as ChatEA. The model constructed a fused KG containing 6 826 entities, merging heterogeneous data while preserving source-specific attributes. These findings demonstrate that combining adversarial domain adaptation with fine-grained semantic differentiation is more effective for low-resource historical knowledge fusion than generic LLMs. This research provides a technical foundation for the intelligent organization of CMM heritage, with implications for digital humanities and the global standardization of traditional medicine data.
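For readers unfamiliar with the metric, Hits@1 is the fraction of test entities whose correct counterpart is ranked first among the candidates. A minimal sketch of the standard Hits@k computation follows; the function name and the input layout (ranked candidate lists keyed by source entity id) are illustrative assumptions, not the evaluation code used in the study.

```python
def hits_at_k(rankings, gold, k=1):
    """Fraction of source entities whose gold target appears in the top-k candidates.

    rankings: dict mapping source id -> list of target ids, best candidate first
    gold:     dict mapping source id -> correct target id
    """
    if not gold:
        return 0.0
    hits = sum(1 for src, tgt in gold.items()
               if tgt in rankings.get(src, [])[:k])
    return hits / len(gold)
```

With `k=1` this gives Hits@1; larger `k` values give the looser Hits@k variants commonly reported alongside it.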
| Family | Number of genera | Number of species | Percentage of total species (%) | Genus | Number of species | Percentage of total species (%) |
|---|---|---|---|---|---|---|
| Amanitaceae | 2 | 11 | 5.26 | Amanita | 10 | 4.78 |
| Mycenaceae | 2 | 12 | 5.74 | Inocybe | 5 | 2.39 |
| Polyporaceae | 8 | 14 | 6.70 | Laccaria | 5 | 2.39 |
| Russulaceae | 3 | 23 | 11.00 | Marasmius | 6 | 2.87 |
| | | | | Mycena | 11 | 5.26 |
| | | | | Pluteus | 5 | 2.39 |
| | | | | Russula | 17 | 8.13 |
| | | | | Trametes | 5 | 2.39 |