Embodied intelligence represents a new stage in the evolution of artificial intelligence, marking a transition from "perception-cognition" to an integrated paradigm of "perception-cognition-action." By unifying visual perception, language understanding, and action generation, the Vision-Language-Action (VLA) model provides a critical technological pathway for enabling agents to operate autonomously in the real world. This paper systematically reviews the development trajectory and representative achievements of VLA technologies and summarizes their architectural paradigm, which comprises multi-modal perception, semantic fusion mechanisms, reinforcement and imitation learning, world models, and hierarchical action output. Drawing on application scenarios such as autonomous driving, human–computer interaction, and industrial equipment, we further analyze the core challenges facing VLA development, including scarce data resources, limited generalization and transferability, insufficient interpretability, and growing computational demands, and we outline future development trends.
| Family | Number of genera | Number of species | Percentage of total species (%) | Genus | Number of species | Percentage of total species (%) |
|---|---|---|---|---|---|---|
| Amanitaceae | 2 | 11 | 5.26 | Amanita | 10 | 4.78 |
| Mycenaceae | 2 | 12 | 5.74 | Inocybe | 5 | 2.39 |
| Polyporaceae | 8 | 14 | 6.70 | Laccaria | 5 | 2.39 |
| Russulaceae | 3 | 23 | 11.00 | Marasmius | 6 | 2.87 |
| | | | | Mycena | 11 | 5.26 |
| | | | | Pluteus | 5 | 2.39 |
| | | | | Russula | 17 | 8.13 |
| | | | | Trametes | 5 | 2.39 |