3D model classification is a fundamental problem in the fields of computer graphics and computer vision, with wide-ranging applications in areas such as computer-aided design, mixed reality, autonomous driving, and robotic navigation. The challenges associated with 3D model classification primarily arise from three key aspects: the difficulty in representing 3D surface geometric features, the diversity of 3D transformations and deformations, and the incompleteness of geometric and topological structures. Existing multi-view-based 3D model classification methods typically render 3D models from multiple preset viewpoints and input all rendered views into a neural network for classification. However, due to the presence of redundant and ineffective views, not all views contribute equally to the classification task. Selecting views that substantially enhance classification performance can not only improve the overall accuracy of multi-view 3D model classification but also help identify representative views that effectively capture the essential characteristics of the 3D model.
This paper proposes a Transformer attention-guided approach for optimal view selection and classification of 3D models. The 3D model is first rendered from 20 viewpoints arranged on a regular icosahedron. A convolutional neural network then extracts features from these views, producing a sequence of local multi-view feature tokens. To retain spatial location information, positional encoding is applied to the token sequence. Next, a learnable global classification token is concatenated with the multi-view feature tokens, forming the input to a Transformer encoder that performs global view feature fusion and generates an initial global classification token. Subsequently, the optimal view selection module calculates the contribution of each view to the initial global classification token using the attention score matrix from the feature fusion process, and the highest-scoring views are selected as the optimal views. The feature tokens of these optimal views are then concatenated with the initial global classification token and input into the Transformer encoder for a second round of feature fusion, producing the final global classification token. This final token is passed through a classifier to generate the classification probabilities, and the method outputs the selected optimal views alongside the prediction. To enhance generalization, the model incorporates random view dropping and contrastive learning strategies during training.
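The view selection step can be illustrated with a minimal sketch. It assumes that the contribution of a view is the attention weight the classification token assigns to that view, averaged over attention heads; the abstract does not specify how heads or layers are aggregated, so this averaging, the function names `view_scores` and `select_top_views`, and the toy numbers are all hypothetical.

```python
def view_scores(cls_rows):
    """Average, over heads, the attention the classification token pays to each view.

    cls_rows[h] is the classification token's attention row for head h:
    index 0 is its attention to itself, indices 1..N are the N view tokens.
    Averaging over heads is an assumption made for this sketch.
    """
    num_heads = len(cls_rows)
    num_views = len(cls_rows[0]) - 1
    return [
        sum(row[j + 1] for row in cls_rows) / num_heads
        for j in range(num_views)
    ]


def select_top_views(scores, k):
    """Return the indices of the k highest-scoring views, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: 2 heads, 3 views (hypothetical attention weights).
cls_rows = [
    [0.1, 0.2, 0.6, 0.1],
    [0.2, 0.1, 0.4, 0.3],
]
scores = view_scores(cls_rows)
best = select_top_views(scores, k=2)  # views 1 and 2 score highest here
```

In the full method, the tokens of the selected views would then be fused with the initial classification token in a second pass through the encoder; that part is omitted here.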
Experiments are conducted on the ModelNet40 dataset, which comprises 40 object categories. Created by Princeton University, the dataset is widely used for benchmarking 3D object recognition algorithms. Evaluation metrics include overall accuracy (OA), average accuracy (AA), and speed. OA measures classification accuracy across the entire dataset, while AA is the mean of the per-category accuracies, which reduces the influence of class imbalance. First, the proposed Transformer-based multi-view selection and 3D model classification method is compared with other state-of-the-art deep learning-based 3D model classification methods to validate its effectiveness. Subsequently, ablation experiments are conducted to analyze the impact of different settings on the performance of the proposed method, including multi-view representation, feature extraction backbone, Transformer hidden layer dimension, number of attention heads, contrastive learning strategy, and the random view dropping module. On ModelNet40, the proposed method achieves an overall accuracy of 97.61% and an average accuracy of 96.36%. In addition to reaching state-of-the-art classification performance, the optimal views selected based on the Transformer attention score matrix are shown to be highly representative.
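The difference between the two accuracy metrics can be made concrete with a short sketch. The function names and the toy labels below are illustrative assumptions, not taken from the paper's evaluation code.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly (OA)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def average_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies (AA); every class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Imbalanced toy data: 8 "chair" samples, 2 "lamp" samples.
y_true = ["chair"] * 8 + ["lamp"] * 2
y_pred = ["chair"] * 8 + ["chair", "lamp"]

oa = overall_accuracy(y_true, y_pred)  # 9 of 10 correct -> 0.9
aa = average_accuracy(y_true, y_pred)  # (8/8 + 1/2) / 2 -> 0.75
```

In this toy case the single error falls in the rare class, so AA drops well below OA; this is why reporting both metrics is informative on datasets with uneven category sizes.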
The proposed method leverages the Transformer architecture to perform feature fusion across different views. By employing mechanisms such as self-attention, residual connections, and multi-layer stacking, the Transformer effectively learns complex features and captures global contextual relationships among different views. Furthermore, the attention score matrix generated by the Transformer serves as a basis for optimal view selection, enabling efficient classification while identifying the most representative views.
| Family | Number of genera | Number of species | Percentage of total species (%) | Genus | Number of species | Percentage of total species (%) |
|---|---|---|---|---|---|---|
| Amanitaceae | 2 | 11 | 5.26 | Amanita | 10 | 4.78 |
| Mycenaceae | 2 | 12 | 5.74 | Inocybe | 5 | 2.39 |
| Polyporaceae | 8 | 14 | 6.70 | Laccaria | 5 | 2.39 |
| Russulaceae | 3 | 23 | 11.00 | Marasmius | 6 | 2.87 |
| | | | | Mycena | 11 | 5.26 |
| | | | | Pluteus | 5 | 2.39 |
| | | | | Russula | 17 | 8.13 |
| | | | | Trametes | 5 | 2.39 |