Operator-aware tensor offloading approach for large language model inference in resource-constrained scenarios
Jianfeng ZHANG, Dong XIE, Songlei JIAN*, Bao LI, Xiaochuan WANG, Yong GUO, Jie YU
Journal of National University of Defense Technology | 2025, 47(6): 60-70
Computer System and Technology
Affiliations
  • College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Published: 2025-12-28 doi: 10.11887/j.issn.1001-2486.25050035

Deploying large language models for efficient inference is challenging in resource-constrained scenarios. Although mainstream inference optimization techniques have improved inference efficiency to some extent, they still suffer from coarse-grained deployment and poor inference accuracy. Based on the observation that different operators exhibit varying degrees of GPU affinity, an operator-aware tensor offloading (OATO) approach is proposed. OATO extracts the operators' semantic knowledge and uses it to design an intelligent scheduling algorithm, which in turn yields a globally optimal model-deployment plan. The OATO approach is also integrated into the large model inference framework Llama.cpp to implement an inference engine enhanced with operator-aware tensor offloading, referred to as OALlama.cpp. Experimental results show that, compared with the state-of-the-art inference engines Llama.cpp and FlexGen, OALlama.cpp achieves the best inference performance on three large models. Notably, when 75% of the Llama3-8B model weights are loaded on the GPU, the first-token generation speed of OALlama.cpp is nearly double that of FlexGen and Llama.cpp.
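The abstract does not give the scheduling algorithm itself, so the following is only a minimal sketch of the general idea of choosing which tensors to keep on the GPU under a memory budget. It is not the authors' method: it assumes each tensor carries a single hypothetical "affinity" score (the benefit of placing it on the GPU) and swaps in a textbook 0/1 knapsack dynamic program. The names Tensor, plan_offload, and all sizes and scores are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size_mb: int      # memory footprint in MB (hypothetical value)
    affinity: float   # assumed benefit of placing this tensor on the GPU


def plan_offload(tensors, gpu_budget_mb):
    """Return the names of tensors to place on the GPU; the rest stay on the CPU.

    Textbook 0/1 knapsack: maximize total affinity subject to the GPU budget.
    This is an illustrative stand-in, not the OATO scheduling algorithm.
    """
    # best[b] = (total affinity, chosen names) using at most b MB of GPU memory
    best = [(0.0, frozenset())] * (gpu_budget_mb + 1)
    for t in tensors:
        # Walk budgets downward so each tensor is selected at most once.
        for b in range(gpu_budget_mb, t.size_mb - 1, -1):
            score = best[b - t.size_mb][0] + t.affinity
            if score > best[b][0]:
                best[b] = (score, best[b - t.size_mb][1] | {t.name})
    return best[gpu_budget_mb][1]


# Hypothetical tensors: matrix-multiply weights get high affinity, others low.
tensors = [
    Tensor("attn_qkv", size_mb=4, affinity=8.0),
    Tensor("ffn_up", size_mb=6, affinity=9.0),
    Tensor("norm", size_mb=1, affinity=0.5),
    Tensor("embed", size_mb=5, affinity=2.0),
]
on_gpu = plan_offload(tensors, gpu_budget_mb=10)
on_cpu = {t.name for t in tensors} - on_gpu
```

With a 10 MB budget, this toy instance places the two high-affinity tensors on the GPU and leaves the others on the CPU. A per-tensor decision of this kind is finer-grained than offloading whole layers, which is the contrast the abstract draws with coarse-grained deployment.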

large language models  /  resource constraints  /  model inference  /  GPU affinities of operators  /  operator-aware tensor offloading approach
Jianfeng ZHANG, Dong XIE, Songlei JIAN, Bao LI, Xiaochuan WANG, Yong GUO, Jie YU. Operator-aware tensor offloading approach for large language model inference in resource-constrained scenarios[J]. Journal of National University of Defense Technology, 2025, 47(6): 60-70. DOI: 10.11887/j.issn.1001-2486.25050035
Article Info
  • Received: 2025-05-24
  • Online: 2026-04-16
  • Published: 2025-12-28
https://castjournals.cast.org.cn/joweb/gfkjdxxb/EN/10.11887/j.issn.1001-2486.25050035