Operator-aware tensor offloading approach for large language model inference in resource-constrained scenarios

Jianfeng ZHANG, Dong XIE, Songlei JIAN^*, Bao LI, Xiaochuan WANG, Yong GUO, Jie YU

Journal of National University of Defense Technology | 2025, 47(6): 60-70

• Computer System and technology •

Affiliations

College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

Published: 2025-12-28
DOI: 10.11887/j.issn.1001-2486.25050035

Abstract

Efficient inference deployment of large language models faces severe challenges in resource-constrained scenarios. Although current mainstream inference optimization techniques have improved inference efficiency to some extent, they still suffer from coarse-grained deployment and poor inference accuracy. Based on the observation that different operators exhibit varying degrees of GPU affinity, an operator-aware tensor offloading (OATO) approach is proposed. OATO extracts the semantic knowledge of operators and uses it to design an intelligent scheduling algorithm, which in turn yields a globally optimal model deployment plan. The OATO approach is also integrated into the latest large model inference framework, Llama.cpp, to implement an operator-aware, tensor-offloading-enhanced inference engine referred to as OALlama.cpp. Experimental results show that, compared with the state-of-the-art inference engines Llama.cpp and FlexGen, OALlama.cpp achieves the best inference performance on three large models. Notably, when 75% of the LLaMA3-8B model weights are loaded on the GPU, the first-token generation speed of OALlama.cpp is nearly double that of FlexGen and Llama.cpp.

Key words

large language models / resource constraints / model inference / GPU affinities of operators / operator-aware tensor offloading approach
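
As a rough illustration of the idea in the abstract (placing tensors on the GPU according to how much their operators benefit from it, under a limited memory budget), a minimal sketch follows. It is not the OATO scheduling algorithm, which the abstract does not specify. The greedy heuristic, the `Tensor` class, the `plan_offload` function, the tensor names, and all affinity scores and sizes are hypothetical, and a greedy pass does not in general produce a globally optimal plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size_mb: int
    # Hypothetical score: estimated benefit of running this tensor's
    # operator on the GPU instead of the CPU (higher means more benefit).
    gpu_affinity: float


def plan_offload(tensors, gpu_budget_mb):
    """Greedy placement (illustrative only, not the paper's method).

    Tensors are ranked by affinity per megabyte and placed on the GPU
    while they still fit in the remaining budget; the rest stay on the CPU.
    """
    ranked = sorted(tensors, key=lambda t: t.gpu_affinity / t.size_mb, reverse=True)
    placement = {}
    free_mb = gpu_budget_mb
    for t in ranked:
        if t.size_mb <= free_mb:
            placement[t.name] = "gpu"
            free_mb -= t.size_mb
        else:
            placement[t.name] = "cpu"
    return placement


# Made-up example values, chosen only to exercise the function.
tensors = [
    Tensor("attn.qkv_proj", 400, 8.0),
    Tensor("ffn.up_proj", 900, 9.0),
    Tensor("token_embedding", 500, 1.0),
    Tensor("norm.weight", 1, 0.5),
]
plan = plan_offload(tensors, gpu_budget_mb=1400)
for name, device in plan.items():
    print(f"{name}: {device}")
```

With these made-up numbers, the compute-heavy projection tensors fit on the GPU, while the low-affinity embedding table is left on the CPU once the budget is nearly used up. A real scheduler would need measured per-operator costs and transfer overheads in place of a single score.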

Cite this Article

Jianfeng ZHANG, Dong XIE, Songlei JIAN, Bao LI, Xiaochuan WANG, Yong GUO, Jie YU. Operator-aware tensor offloading approach for large language model inference in resource-constrained scenarios[J]. Journal of National University of Defense Technology, 2025, 47(6): 60-70. DOI: 10.11887/j.issn.1001-2486.25050035

Article Info

Received: 2025-05-24
Online: 2026-04-16
Published: 2025-12-28

https://castjournals.cast.org.cn/joweb/gfkjdxxb/EN/10.11887/j.issn.1001-2486.25050035
