Transformer attention-guided optimal view selection and classification for 3D models

Songle Chen¹,², Ruyue Huang¹, Sixuan Huang¹, Yi Chen³, Qian Li⁴,*

Journal of Image and Graphics | 2025, 30(12): 3927-3940

• Computer Graphics •

Affiliations

¹Jiangsu Provincial Postal Big Data Technology and Application Engineering Research Center, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

²State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

³School of Digital Economy, Nanjing Audit University, Nanjing 211815, China

⁴College of Meteorology and Oceanography, National University of Defense Technology, Changsha 411107, China

Published: 2025-12-16 doi: 10.11834/jig.250037

Abstract

Objective

3D model classification is a fundamental problem in computer graphics and computer vision, with applications in computer-aided design, mixed reality, autonomous driving, and robotic navigation. Its challenges arise primarily from three aspects: the difficulty of representing 3D surface geometric features, the diversity of 3D transformations and deformations, and the incompleteness of geometric and topological structures. Existing multi-view-based 3D model classification methods typically render a 3D model from multiple preset viewpoints and feed all rendered views into a neural network for classification. However, because some views are redundant or ineffective, not all views contribute equally to the classification task. Selecting the views that substantially enhance classification performance can not only improve the overall accuracy of multi-view 3D model classification but also help identify representative views that capture the essential characteristics of the 3D model.

Method

This paper proposes a Transformer attention-guided approach for optimal view selection and classification of 3D models. The 3D model is first rendered from 20 viewpoints arranged on a regular icosahedron. A convolutional neural network then extracts features from these views, producing a sequence of local multi-view feature tokens. To retain spatial location information, position encoding is applied to the token sequence. Next, a learnable global classification token is concatenated with the multi-view feature tokens, forming the input to a Transformer encoder that performs global view feature fusion and generates an initial global classification token. Subsequently, the optimal view selection module calculates the contribution of each view to the initial global classification token using the attention score matrix from the feature fusion process, and the highest-scoring views are selected as the optimal views. The feature tokens of these optimal views are then concatenated with the initial global classification token and input into the Transformer encoder for a second round of feature fusion, producing the final global classification token. This final token is passed through a classifier to generate the classification probabilities, and the selected optimal views are output at the same time. To enhance generalization, training incorporates random view dropping and contrastive learning strategies.
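The selection step can be illustrated with a minimal sketch. The abstract does not give the exact contribution formula, so the version below assumes that a view's contribution is the attention weight the classification token (placed at index 0) assigns to that view, averaged over attention heads. The function names `cls_attention_scores` and `select_top_views` and the toy attention values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def cls_attention_scores(attn: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Average, over heads, the attention the classification token pays to each view.

    attn[h][i][j] is the softmax attention weight from query token i to key
    token j in head h. Token 0 is assumed to be the classification token and
    tokens 1..N the view tokens (an assumption of this sketch).
    """
    num_heads = len(attn)
    num_views = len(attn[0]) - 1
    scores = [0.0] * num_views
    for head in attn:
        cls_row = head[0]  # attention distribution of the classification token
        for v in range(num_views):
            scores[v] += cls_row[v + 1] / num_heads
    return scores


def select_top_views(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring views, best first."""
    order = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
    return order[:k]


# Toy input: 2 heads, 1 classification token plus 4 view tokens.
# Only row 0 (the classification token) matters for the scores.
attn = [
    [[0.10, 0.40, 0.10, 0.30, 0.10]] + [[0.2] * 5] * 4,
    [[0.20, 0.20, 0.10, 0.40, 0.10]] + [[0.2] * 5] * 4,
]
scores = cls_attention_scores(attn)
best = select_top_views(scores, k=2)
print(best)  # indices of the two views with the highest averaged attention
```

In a full pipeline, the feature tokens at the selected indices would be concatenated with the initial classification token for the second fusion pass described above.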

Result

Experiments are conducted on the ModelNet40 dataset, released by Princeton University, which comprises 40 object categories and is widely used for benchmarking 3D object recognition algorithms. Evaluation metrics include overall accuracy (OA), average accuracy (AA), and speed. OA measures classification accuracy across the entire dataset, while AA is the mean of the per-category accuracies, which makes it less sensitive to class imbalance. First, the proposed Transformer-based view selection and 3D model classification method is compared with other state-of-the-art deep learning-based 3D model classification methods to validate its effectiveness. Subsequently, ablation experiments analyze the impact of different settings on the performance of the proposed method, including the multi-view representation, feature extraction backbone, Transformer hidden layer dimension, number of attention heads, contrastive learning strategy, and random view dropping module. On the ModelNet40 benchmark, the proposed method achieves an OA of 97.61% and an AA of 96.36%. In addition to reaching state-of-the-art classification performance, the optimal views selected based on the Transformer attention score matrix are shown to be highly representative.
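The difference between the two accuracy metrics can be made concrete with a short sketch. It follows the definitions given above (OA over all samples, AA as the mean of per-category accuracies); the function name `overall_and_average_accuracy` and the toy labels are illustrative assumptions, not data from the paper.

```python
from collections import defaultdict
from typing import Sequence, Tuple


def overall_and_average_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (OA, AA) for integer class labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = defaultdict(int)  # correctly classified samples per true class
    total = defaultdict(int)    # samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    oa = sum(correct.values()) / len(y_true)
    aa = sum(correct[c] / total[c] for c in total) / len(total)
    return oa, aa


# Imbalanced toy set: 8 samples of class 0, 2 samples of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
oa, aa = overall_and_average_accuracy(y_true, y_pred)
print(f"OA={oa:.2f}, AA={aa:.2f}")  # OA=0.90, AA=0.75
```

A single error in the small class barely moves OA but lowers AA noticeably, which is why both metrics are reported.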

Conclusion

The proposed method leverages the Transformer architecture to perform feature fusion across different views. By employing mechanisms such as self-attention, residual connections, and multi-layer stacking, the Transformer effectively learns complex features and captures global contextual relationships among different views. Furthermore, the attention score matrix generated by the Transformer serves as a basis for optimal view selection, enabling efficient classification while identifying the most representative views.

Key words

3D model classification / Transformer / optimal view selection / contrastive learning / multi-view learning

Cite this Article

Songle Chen, Ruyue Huang, Sixuan Huang, Yi Chen, Qian Li. Transformer attention-guided optimal view selection and classification for 3D models[J]. Journal of Image and Graphics, 2025, 30(12): 3927-3940. DOI: 10.11834/jig.250037

Article Info

Received: 2025-02-12
Revised: 2025-04-28
Online: 2026-04-09
Published: 2025-12-16
