Efficient inference deployment of large language models faces severe challenges in resource-constrained scenarios. Although current mainstream inference optimization techniques improve inference efficiency to some extent, they still suffer from coarse-grained deployment and poor inference accuracy. Based on the observation that different operators exhibit varying degrees of GPU affinity, this work proposes OATO (operator-aware tensor offloading). OATO extracts the operators' semantic knowledge and uses it to design an intelligent scheduling algorithm, which in turn yields a globally optimal model-deployment plan. OATO is also integrated into the recent large-model inference framework Llama.cpp to implement an inference engine enhanced with operator-aware tensor offloading, referred to as OALlama.cpp. Experimental results show that, compared with the state-of-the-art inference engines Llama.cpp and FlexGen, OALlama.cpp achieves the best inference performance on three large models. Notably, when 75% of the LLaMA3-8B model weights are loaded on the GPU, the first-token generation speed of OALlama.cpp is nearly double that of FlexGen and Llama.cpp.
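To make the idea of operator-aware placement concrete, the following is a minimal sketch and not the OATO scheduling algorithm itself, which the abstract does not specify. It assumes each operator's tensor has a known memory footprint and a known latency gain from GPU placement (a stand-in for "GPU affinity"), and it swaps in a plain 0/1 knapsack to pick the set of tensors that maximizes total gain under a GPU memory budget. The names `OpTensor`, `plan_offload`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpTensor:
    """Hypothetical description of one operator's weight tensor."""
    name: str
    size_mb: int        # assumed memory footprint on the GPU
    gpu_gain_ms: float  # assumed latency saved if placed on the GPU


def plan_offload(tensors, gpu_budget_mb):
    """Pick tensors for the GPU with a 0/1 knapsack over integer MB sizes.

    Returns (total_gain_ms, names_on_gpu, names_on_cpu).
    """
    # best[cap] holds the best (gain, chosen names) using at most `cap` MB.
    best = [(0.0, ())] * (gpu_budget_mb + 1)
    for t in tensors:
        # Iterate capacities downward so each tensor is used at most once.
        for cap in range(gpu_budget_mb, t.size_mb - 1, -1):
            gain, chosen = best[cap - t.size_mb]
            if gain + t.gpu_gain_ms > best[cap][0]:
                best[cap] = (gain + t.gpu_gain_ms, chosen + (t.name,))
    gain, on_gpu = best[gpu_budget_mb]
    on_cpu = tuple(t.name for t in tensors if t.name not in on_gpu)
    return gain, on_gpu, on_cpu


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the paper.
    tensors = [
        OpTensor("attn_qkv", size_mb=4, gpu_gain_ms=9.0),
        OpTensor("ffn_up", size_mb=6, gpu_gain_ms=8.0),
        OpTensor("ffn_down", size_mb=6, gpu_gain_ms=7.0),
        OpTensor("embed", size_mb=5, gpu_gain_ms=1.0),
    ]
    gain, on_gpu, on_cpu = plan_offload(tensors, gpu_budget_mb=10)
    print("GPU:", on_gpu, "CPU:", on_cpu, "gain (ms):", gain)
```

The knapsack is exact only for this simplified cost model, in which gains are independent and additive. A real scheduler would also have to account for host-device transfer costs and dependencies between operators.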
| Family | Number of genera | Number of species | Percentage of total species (%) | Genus | Number of species | Percentage of total species (%) |
|---|---|---|---|---|---|---|
| Amanitaceae | 2 | 11 | 5.26 | Amanita | 10 | 4.78 |
| Mycenaceae | 2 | 12 | 5.74 | Inocybe | 5 | 2.39 |
| Polyporaceae | 8 | 14 | 6.70 | Laccaria | 5 | 2.39 |
| Russulaceae | 3 | 23 | 11.00 | Marasmius | 6 | 2.87 |
| | | | | Mycena | 11 | 5.26 |
| | | | | Pluteus | 5 | 2.39 |
| | | | | Russula | 17 | 8.13 |
| | | | | Trametes | 5 | 2.39 |