Journal of Image and Graphics

Self-supervised coal mine image denoising with adaptive masking

Yaru Zhang, Jiantao Liu, Guoqing Xu, Dingyi Hao

Journal of Image and Graphics. 2025, 30(12): 3884-3899.

Objective

The objective of this research is to enhance the quality and accuracy of information extracted from coal mine images， which are often degraded by high dust concentrations and uneven lighting conditions. These challenging environmental conditions introduce noise， reduce local contrast， and lead to the loss of fine details and edge textures， ultimately compromising the visual quality and the reliability of information extraction. Aiming to address these challenges， this study proposes a self-supervised coal mine image denoising algorithm based on adaptive masking. Designed to handle a wide range of noise levels and types， this algorithm aims to restore the original integrity of the image while preserving critical visual features. The proposed algorithm is divided into three main components： adaptive masking， mask integration， and an adaptive integrated loss function. Each component plays a vital role in enhancing the denoising process， ensuring that the final output is accurate and visually appealing.

Method

The adaptive masking component is the cornerstone of the proposed algorithm， enabling segmented processing of coal mine images. This segmentation not only reduces computational overhead but also allows for more targeted and effective denoising. By dividing each image into smaller blocks， the algorithm can analyze and process each section independently， thereby improving the overall efficiency of the denoising process. The module operates by sequentially applying a mask to the edge and corner pixels of each block， while deliberately excluding the central pixels. This method prevents the network from performing a trivial identity mapping that fails to enhance image quality. Instead， this approach introduces data variability that boosts the generalization capability and robustness of the neural network model， making it adaptable to previously unknown images. The adaptive nature of the mask ensures that the module responds dynamically to varying noise levels and image features. By analyzing local variance and texture complexity， the mask can adaptively determine the optimal masking strategy for each block. This tailored approach ensures that the denoising process is responsive to the specific characteristics of each image， substantially improving its effectiveness. Subsequently， once the masking process is complete， the mask integration module is employed. This module is responsible for fusing the neural network’s output with the masked areas to reconstruct a coherent and denoised image. The integration involves calculating the Hadamard product （element-wise multiplication） between the network’s output and the masked image. This strategic operation enhances the network’s capability to distinguish between actual image content and noise， especially around edges and texture boundaries. In this stage， considering local and global features of the coal mine images is crucial. 
Effective integration of these features allows the algorithm to interpret image context effectively, leading to denoised outputs that are coherent and structurally complete. The mask integration module also ensures that denoised areas blend seamlessly into the rest of the image, preserving the overall visual flow and structural integrity. Furthermore, this module incorporates a quality evaluation mechanism to assess the effectiveness of the integration. The feedback from these evaluations is used to iteratively refine the integration process. The final component of the algorithm is an adaptive integrated loss function, which guides the model during training. This loss function is specifically designed to address the unique challenges of coal mine image denoising, including complex noise patterns and the need to preserve subtle image details. The adaptive integrated loss uses the integrated image as a training label, allowing the model to learn effectively from the differences between the noisy input images and the denoised outputs. Additionally, by incorporating the original noisy image, the loss function increases the model’s sensitivity to signal changes, enhancing its adaptability across various denoising scenarios and noise conditions.
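The block-wise masking and Hadamard-product integration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed 3 × 3 block size, the function names, and the compositing rule (masked positions taken from the network output, unmasked positions from the noisy input) are all assumptions made for illustration, and the adaptive choice of mask based on local variance and texture is omitted.

```python
def block_mask(height, width, block=3):
    """Binary mask: 1 marks a masked (edge or corner) pixel of a block,
    0 marks the central pixel of a block, which is left unmasked."""
    centre = block // 2
    return [[0 if (r % block == centre and c % block == centre) else 1
             for c in range(width)]
            for r in range(height)]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally sized 2D lists."""
    return [[x * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def integrate(output, noisy, mask):
    """Assumed compositing rule: mask * output + (1 - mask) * noisy."""
    keep = [[1 - m for m in row] for row in mask]
    from_network = hadamard(mask, output)
    from_input = hadamard(keep, noisy)
    return [[p + q for p, q in zip(row_p, row_q)]
            for row_p, row_q in zip(from_network, from_input)]


if __name__ == "__main__":
    mask = block_mask(3, 3)
    network_output = [[10] * 3 for _ in range(3)]  # stand-in for a denoiser
    noisy_input = [[1] * 3 for _ in range(3)]
    for row in integrate(network_output, noisy_input, mask):
        print(row)
```

In this toy case the central pixel keeps the input value while the eight surrounding pixels take the stand-in network output, which shows how a mask and an element-wise product can combine the two sources.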

Result

The proposed algorithm was rigorously tested using an underground coal mine image dataset alongside four additional public datasets， including Kodak24 （Kodak lossless true color image suite）， BSD300 （Berkeley segmentation dataset 300）， and BSDS500 （Berkeley segmentation dataset 500）. The experiments were specifically designed to simulate real-world conditions， with a particular emphasis on dimly lit environments commonly encountered in coal mines. The results of these experiments demonstrated that the algorithm substantially outperformed other comparative denoising algorithms， in terms of subjective evaluations and objective metrics such as peak signal-to-noise ratio （PSNR） and structural similarity index （SSIM）. In tunnel scenes with a high level of Gaussian noise （level 50）， the algorithm achieved substantial improvements in PSNR/SSIM values compared to existing methods such as B2U and NBR2NBR， with increases of 4.2 dB/0.055 and 2.99 dB/0.077， respectively. Furthermore， when tested on images corrupted with Gaussian noise levels ranging from 5 to 50 on the public datasets， the algorithm consistently demonstrated substantial PSNR improvements over the second-best method， with increases of 1.09%， 0.72%， and 0.68% for Kodak24， BSD300， and BSDS500， respectively.
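For readers unfamiliar with the PSNR metric reported above, the following sketch computes it from the standard definition, 10 log10(peak² / MSE). The function name and the assumption of 8-bit images (peak value 255) stored as nested lists are illustrative choices, not details taken from the paper.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    squared_error = 0.0
    count = 0
    for row_ref, row_est in zip(reference, estimate):
        for a, b in zip(row_ref, row_est):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    clean = [[100, 120], [140, 160]]
    denoised = [[101, 119], [141, 159]]  # every pixel off by one grey level
    print(round(psnr(clean, denoised), 2))
```

Because the metric is logarithmic, a gain of a few decibels, such as the gains reported for the tunnel scenes, corresponds to a sizeable reduction in mean squared error.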

Conclusion

The proposed self-supervised denoising algorithm has demonstrated a strong capability to remove noise while preserving overall image information from single coal mine images, across various noise levels and types. This finding highlights the algorithm’s robustness and generalization capabilities, making it a promising tool for real-world applications in coal mine monitoring and safety systems. The effectiveness of the algorithm in enhancing image quality and improving the accuracy of information extraction, even under challenging conditions, underscores its potential to make a substantial contribution to the field of coal mine image processing and analysis. The code for this paper is available at https://www.sciclb.cn/anonymous/skpswk56.

Path stepwise estimation network combining social constraint and trajectory endpoints

Enhong Wu, Qingge Ji

Journal of Image and Graphics. 2025, 30(12): 3900-3913.

Objective

Pedestrian trajectory prediction constitutes a critical research challenge in autonomous driving systems， intelligent security surveillance， and human-robot interaction frameworks. The capability to accurately anticipate pedestrian movement patterns directly influences the operational safety of autonomous vehicles， the responsiveness of surveillance systems， and the adaptability of social robots in dynamic environments. While existing approaches predominantly focus on leveraging sequential data patterns and optimizing model architectures through recurrent neural networks， they often overlook the intrinsic social-semantic characteristics embedded in real-world pedestrian interactions. Current methodologies tend to treat trajectory prediction as a purely sequential modeling task， overlooking three fundamental aspects： 1） the social constraints governing crowd movement patterns， 2） the intentional， destination-oriented nature of human locomotion， and 3） the dynamic adaptation mechanisms pedestrians employ during path navigation. This oversight leads to suboptimal performance， particularly in dense pedestrian scenarios where social interactions and environmental adaptability notably influence movement decisions. Aiming to address these limitations， this paper proposes path stepwise estimation network （PSEN）， a novel framework that systematically integrates social relationship modeling， endpoint-aware trajectory planning， and environment-adaptive path refinement. The proposed model bridges the gap between conventional sequence prediction paradigms and the complex socio-spatial dynamics inherent in real-world pedestrian navigation scenarios.

Method

This paper incorporates the characteristics of path planning observed in daily human walking, which can be broadly divided into three key aspects. First, social restrictions are considered. To reflect these restrictions, the crowd is categorized based on movement direction, speed, and distance. Intra-class feature learning is then performed on the classified groups. The social relationships between predicted pedestrians and other pedestrians are calculated using social weights to obtain social attention, which affects the subsequent path estimation network. Second, an endpoint estimation network is introduced to simulate the tendency of pedestrians to identify a destination first and then purposefully plan their walking path. This network leverages the strengths of serialized prediction tasks by using spatiotemporal sequences to predict an endpoint. The estimated endpoint serves as a reference condition within the overall network model, guiding the complete path planning process. Third, this paper addresses the fact that pedestrians constantly fine-tune their local paths and adjust their focus based on environmental context and destination. Aiming to model this behavior, an endpoint and path fine-tuning network is constructed using a conditional variational autoencoder (CVAE) and a multilayer perceptron (MLP). This module takes the output of the endpoint estimation network as a condition and uses the output from the social restriction module, along with the historical trajectory, as inputs for feature learning. After every three frames of prediction, the social restriction and endpoint module outputs are updated according to the current environment of the pedestrians. This update allows the model to automatically fine-tune the planned path in response to dynamic surroundings.

Result

The experiments are conducted by comparing the proposed method with six baseline methods on the ETH/UCY dataset, five baseline methods on the SDD dataset, and four baseline methods on the NBA SportVU dataset. The evaluation metrics used are average displacement error (ADE) and final displacement error (FDE). On the entire ETH/UCY dataset, ADE and FDE are reduced by an average of 5.1% and 7.5%, respectively. On the SDD dataset, reductions of 1% in ADE and 2% in FDE are observed on average. When analyzing individual datasets, the performance improvements are most pronounced in scenarios with denser pedestrian traffic. Notably, in the ZARA1, ZARA2, and UNIV datasets, the proposed method achieves improvements of over 10% in prediction accuracy. Ablation experiments are also conducted on the ETH/UCY dataset to evaluate the contributions of individual components of the PSEN framework. The experimental results demonstrate that each module of PSEN notably improves the effectiveness of pedestrian trajectory prediction, achieving average reductions of 19% and 31% in ADE and FDE, respectively. Ablation experiments are also performed on parameters such as social distance, social attention weights, and the number of frames used in stepwise trajectory generation. These experiments confirm that all aspects of the network design positively impact pedestrian trajectory prediction. However, the model does not perform as well on the NBA SportVU dataset. This dataset is characterized by 10 players moving at high speeds, with trajectory endpoints changing dynamically based on in-game situations and players’ intentions. Unlike the ETH/UCY and SDD datasets, where movement is predictable and socially constrained, the varying roles and tactical decisions of agents in the NBA dataset play a crucial role in path planning, making prediction highly challenging. 
Therefore， achieving accurate predictions by relying solely on time-position information is difficult because the characteristics of pedestrians in this setting notably differ from those in typical pedestrian scenes. In sports scenes， athletes actively seek collisions and obstructions as part of their strategic movement. PSEN does not consider the role-specific behaviors of agents， limiting its effectiveness in such environments.
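The two evaluation metrics used above, ADE and FDE, can be illustrated for a single predicted trajectory as follows. This is a minimal sketch of the standard definitions; the function name and the representation of a trajectory as a list of (x, y) tuples are assumptions, and benchmark protocols that average over agents or take the best of several sampled predictions are not covered.

```python
import math


def ade_fde(predicted, ground_truth):
    """Return (ADE, FDE) for one trajectory given as a list of (x, y) points.

    ADE is the mean Euclidean distance over all predicted time steps;
    FDE is the Euclidean distance at the final time step only.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.hypot(px - gx, py - gy)
                 for (px, py), (gx, gy) in zip(predicted, ground_truth)]
    return sum(distances) / len(distances), distances[-1]


if __name__ == "__main__":
    pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    ade, fde = ade_fde(pred, truth)
    print(ade, fde)
```

Since FDE depends only on the last point, a model that conditions on an estimated endpoint, as PSEN does, can be expected to influence FDE more directly than ADE.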

Conclusion

The PSEN model proposed in this paper integrates the serialization task with three key features of real-world pedestrian scenes. By combining recurrent neural networks with a CVAE， PSEN effectively reflects the complex features of pedestrian trajectory prediction in realistic scenarios. The model achieves superior performance on the ETH/UCY and SDD datasets， providing a new direction for subsequent tasks in pedestrian trajectory prediction. However， this study focuses only on interactions among pedestrians and does not consider the relationship between pedestrians and other objects， such as vehicles and obstacles. In novel environments， or in scenes where pedestrians are sparse but other dynamic or static objects are abundant， the performance of the model may degrade. Further research is needed in terms of the relationships between pedestrians and objects， along with their associated feature information.

Segmented dental arch line design based on Hermite interpolation function

Weijie Liu, Long Ma, Guangshun Wei, Yeying Fan, Yuanfeng Zhou

Journal of Image and Graphics. 2025, 30(12): 3941-3954.

Objective

In recent years， rapid advancements in digital technology have positioned digital orthodontics as a critical research focus within the field of dentistry. Among the numerous challenges encountered during orthodontic treatment， designing an accurate dental arch line is fundamental for precisely calculating the target positions of teeth after treatment. The dental arch line should not only follow the natural growth patterns of the teeth but also satisfy aesthetic and functional requirements essential for optimal orthodontic outcomes. However， current automated tooth alignment methods typically model the dental arch line using Beta functions， which are inherently limited by their restricted degrees of freedom. This limitation often prevents these methods from generating curves that accurately capture the ideal dental arch form， especially when dealing with complex or irregular tooth arrangements. Moreover， orthodontists frequently require customized dental arch lines tailored to each patient’s unique oral condition. However， arch lines fitted solely from the patient’s initial intraoral scan may not always align with therapeutic or aesthetic expectations， necessitating labor-intensive manual adjustments. These challenges highlight the need for a flexible and precise approach to dental arch line design that effectively meets clinical standards and patient-specific requirements. Aiming to address these limitations， this paper proposes a novel dental arch line fitting method based on cumulative chord length parameterization combined with Hermite interpolation. This approach aims to enhance control over the dental arch shape， improve fitting accuracy， and provide orthodontists with a highly effective and efficient tool for designing and adjusting dental arch lines during orthodontic treatment planning.

Method

The proposed method begins by inputting the patient’s intraoral scan data, which undergoes a series of preprocessing steps to ensure data quality and consistency. A tooth segmentation algorithm is then applied to accurately isolate each individual tooth, following internationally recognized dental segmentation standards. After segmentation, a landmark detection algorithm is employed to extract key landmarks from each tooth, capturing essential geometric and morphological features. These landmarks serve as the foundation for subsequent dental arch line fitting. Aiming to facilitate the interpolation process, the extracted landmarks are initially reparameterized using cumulative chord length parameterization. This process generates a naturally distributed set of interpolation points along the dental arch by accounting for the varying distances between adjacent landmarks, thereby preserving the true spatial relationships among teeth. Subsequently, Hermite interpolation is employed to construct the dental arch line through the parameterized points. By incorporating position and tangent information, Hermite interpolation enables the construction of smooth, continuous curves with enhanced local control. Aiming to ensure fitting accuracy and smoothness, a coefficient matrix is constructed to formulate a system of linear equations. Solving this system yields the final dental arch line, represented as a piecewise continuous function. This piecewise structure allows for precise local adjustments, making the method particularly effective for accommodating complicated or irregular tooth arrangements. Furthermore, this paper introduces two new mathematical evaluation metrics: the mean shortest distance and the maximum shortest distance between the extracted landmarks and the fitted curve. These metrics offer an objective and robust means of assessing how accurately the generated dental arch line conforms to the patient’s actual dental morphology.
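The pipeline above (cumulative chord length parameterization, piecewise cubic Hermite interpolation, and the two shortest-distance metrics) can be sketched in a few lines. Several simplifications are assumed here and are not the paper's method: tangents are estimated by finite differences rather than obtained by solving the linear system, landmarks are 2D points, and the shortest distance to the curve is approximated by the distance to the nearest sampled curve point. All function names are hypothetical.

```python
import math


def chord_length_params(points):
    """Cumulative chord length parameter for each landmark (starts at 0)."""
    params = [0.0]
    for p, q in zip(points, points[1:]):
        params.append(params[-1] + math.dist(p, q))
    return params


def finite_difference_tangents(points, params):
    """Assumed tangent estimate: slope between the neighbouring landmarks."""
    last = len(points) - 1
    tangents = []
    for i in range(len(points)):
        lo, hi = max(i - 1, 0), min(i + 1, last)
        dt = params[hi] - params[lo]
        tangents.append(tuple((b - a) / dt
                              for a, b in zip(points[lo], points[hi])))
    return tangents


def hermite_point(points, tangents, params, u):
    """Evaluate the piecewise cubic Hermite curve at parameter u."""
    i = len(params) - 2
    for k in range(len(params) - 1):
        if u <= params[k + 1]:
            i = k
            break
    h = params[i + 1] - params[i]
    s = (u - params[i]) / h
    h00 = 2 * s**3 - 3 * s**2 + 1   # weight of the start position
    h10 = s**3 - 2 * s**2 + s       # weight of the start tangent
    h01 = -2 * s**3 + 3 * s**2      # weight of the end position
    h11 = s**3 - s**2               # weight of the end tangent
    return tuple(h00 * p0 + h10 * h * m0 + h01 * p1 + h11 * h * m1
                 for p0, m0, p1, m1 in zip(points[i], tangents[i],
                                           points[i + 1], tangents[i + 1]))


def shortest_distance_metrics(landmarks, curve_samples):
    """Mean and maximum of each landmark's distance to its nearest sample."""
    nearest = [min(math.dist(p, q) for q in curve_samples) for p in landmarks]
    return sum(nearest) / len(nearest), max(nearest)


if __name__ == "__main__":
    landmarks = [(0.0, 0.0), (1.0, 1.0), (3.0, 1.0), (4.0, 0.0)]
    t = chord_length_params(landmarks)
    m = finite_difference_tangents(landmarks, t)
    steps = 200
    samples = [hermite_point(landmarks, m, t, t[-1] * k / steps)
               for k in range(steps + 1)]
    print(shortest_distance_metrics(landmarks, samples))
```

Because the curve interpolates the landmarks, both metrics are close to zero for the landmarks used in the fit; they become informative when evaluated against landmarks that were not used as interpolation points, or after an orthodontist has moved control points.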

Result

The proposed fitting method， which integrates cumulative chord length parameterization with Hermite interpolation， exhibits substantial improvements over traditional approaches in dental arch line fitting. First， compared to conventional Beta function-based methods， the proposed approach offers substantially greater flexibility by allowing the inclusion of additional control points. This increased degree of freedom directly addresses the limitations of Beta functions， particularly their inability to support localized shape modifications. The resulting dental arch line provides orthodontists with the flexibility to manually adjust specific， predefined control points， enabling localized adjustments tailored to individual patient needs. The proposed method excels in offering excellent controllability for global and local morphology adjustments of the dental arch line while maintaining high accuracy and smoothness across all regions， attributed to the use of its piecewise functional structure. Experimental evaluations further highlight the advantages of the proposed method. Qualitative analyses show that the generated curves more naturally align with actual dental arch shapes than those produced by conventional methods. Quantitative results， assessed using the proposed shortest distance-based evaluation metrics， confirm a notable improvement in fitting accuracy and alignment with natural tooth arrangements. Additionally， the proposed method enhances clinical flexibility， allowing orthodontists to efficiently adjust the dental arch line by manipulating a limited number of control points， minimizing the need for extensive manual corrections. In practical scenarios， the proposed fitting method is integrated into an existing automated tooth alignment system. This integration led to noticeably improved orthodontic outcomes， further validating the practical effectiveness and clinical applicability of the proposed method.

Conclusion

Compared to existing dental arch fitting methods， the proposed method based on cumulative chord length parameterization and Hermite interpolation demonstrates clear advantages in fitting accuracy and flexibility. This method effectively addresses key limitations of traditional approaches， such as difficulty in achieving an ideal dental arch line and limited adaptability to patient-specific variations. By notably increasing the degrees of freedom and enhancing the controllability of the fitting function， the method produces dental arch lines that are not only smooth and accurate but also highly customizable to meet the diverse clinical requirements of modern orthodontic practice. Furthermore， the introduction of quantitative evaluation metrics offers a systematic and objective framework for assessing fitting quality， ensuring that the resulting dental arch lines are aesthetically aligned and functionally sound. Beyond its technical advantages， the method also improves clinical efficiency by reducing the time and effort typically required for dental arch adjustments during treatment planning. Overall， the proposed method offers strong technical support for the advancement of digital orthodontics and holds substantial potential for broader clinical adoption. This paper establishes a solid foundation for further innovations in automated orthodontic treatment systems， opening new possibilities for personalized and precise dental care.

Lightweight pyramid cross-attention network for orbital image defect detection

Sixu Guo, Huizheng Geng, Li Su, Shen He, Xinyue Zhang

Journal of Image and Graphics. 2025, 30(12): 3824-3837.

Objective

Most existing vision-based rail defect detection methods face challenges such as high parameter counts， computational complexity， slow detection speeds， and limited accuracy. Aiming to overcome these limitations， this paper introduces a lightweight pyramid cross-attention network （LPCANet） for orbital image defect detection using RGB images and depth images.

Method

LPCANet adopts MobileNetv2 as its backbone network to extract multiscale feature maps from RGB images. Simultaneously， a lightweight pyramid module （LPM） is employed to extract similarly-sized feature maps from depth images. Each stage of the LPM comprises a sequence of operations including max pooling， a 3 × 3 convolutional layer， batch normalization， and ReLU activation， enabling efficient extraction of features from depth images. By leveraging deep learning， RGB-D technology， and salient object detection， LPCANet efficiently extracts multiscale feature representations from RGB and depth data. The LPM handles depth image features， while the backbone captures detailed pyramid features from RGB images. Subsequently， a cross-attention mechanism （CAM） is applied to integrate the feature maps from both modalities， enhancing the network’s focus on relevant defect regions. Additionally， a spatial feature extractor （SFE） is introduced to further boost defect detection performance. Finally， a “pixel shuffle” operation is used to restore the output to the original image resolution.
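The “pixel shuffle” step that restores the output resolution can be illustrated without any deep learning framework. The sketch below rearranges C·r² low-resolution channels into C channels that are r times larger in each spatial dimension, following the commonly used sub-pixel layout; the function name and the nested-list representation are assumptions, and the code is not taken from LPCANet.

```python
def pixel_shuffle(channels, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Output pixel (c, y, x) is read from input channel
    c*r*r + (y % r)*r + (x % r) at position (y // r, x // r).
    """
    if len(channels) % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    height, width = len(channels[0]), len(channels[0][0])
    out_channels = len(channels) // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for y in range(height * r):
            for x in range(width * r):
                source = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = channels[source][y // r][x // r]
    return out


if __name__ == "__main__":
    # Four 1x1 channels become one 2x2 channel when r = 2.
    low_res = [[[1]], [[2]], [[3]], [[4]]]
    print(pixel_shuffle(low_res, 2))
```

The operation has no learnable parameters: it only moves values from the channel dimension into the spatial dimensions.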

Result

The proposed scheme was computationally evaluated using the PyTorch library in an environment equipped with an NVIDIA 3090 GPU, alongside several benchmark models for comparison. For the evaluation of LPCANet, three publicly available unsupervised RGB-D rail datasets were used: NEU-RSDDS-AUG, RSDD-TYPE1, and RSDD-TYPE2. Experimental results on the NEU-RSDDS-AUG dataset indicate that LPCANet achieves excellent efficiency, with 9.90 million parameters, a computational complexity of 2.50 G, a model size of 37.95 MB, and a running speed of 162.60 frames per second. Compared to 18 existing rail defect detection schemes, LPCANet is more lightweight. In particular, when compared against CSEPNet, the current best-performing model, LPCANet achieves improvements across several evaluation metrics: +1.48% in $S_\alpha$, +0.86% in intersection over union (IOU), +0.14% in $F_\beta^{\max}$, +0.03% in mean average precision (mAP), and +1.77% in mean absolute error (MAE). An ablation study was conducted on four upsampling methods (interpolation, transposed convolution, patch merging, and “pixel shuffle”) to evaluate their effectiveness within the LPCANet framework. Among these, the “pixel shuffle” method demonstrated clear advantages and was found to be the most suitable for the LPCANet model. Further ablation studies were conducted on four different components (backbone network, LPM, SFE, and CAM). The results indicate that CAM and SFE notably enhance the detection performance of LPCANet. An in-depth analysis of various backbone networks confirmed that the LPCANet model is not only compatible with existing backbone networks but also consistently achieves superior detection results. Aiming to evaluate the model’s generalization capability beyond rail datasets, experiments were also conducted on three non-rail defect datasets: DAGM2007, MT, and Kolektor-SDD2. 
The results show that LPCANet delivers improved performance across three key metrics： mAP， MAE， and IOU， demonstrating its potential for general-purpose defect detection tasks.

Conclusion

The LPCANet model proposed in this study effectively combines the advantages of traditional and deep learning approaches, demonstrating strong practical value in the field of rail defect image processing. Future work will focus on further reducing the model size to increase detection speed while continuing to improve detection quality.

Enhanced attention-based joint semantic instance segmentation network for point clouds

Wen Hao, Zhanbin Zuo, Hansen Lu, Wei Liang, Haiyan Jin, Zhenghao Shi

Journal of Image and Graphics. 2025, 30(12): 3914-3926.

Objective

With the rapid advancement of 3D sensing technologies such as LiDAR （light detection and ranging） and depth cameras， large-scale 3D point clouds have emerged as a crucial data source for a wide range of applications， including autonomous driving， robotic navigation， augmented reality， and urban scene reconstruction. Compared to 2D images， point clouds offer precise spatial geometry and provide a comprehensive representation of the environment without perspective distortion. Additionally， they are robust to variations in lighting and texture. Point cloud segmentation plays a crucial role in scene analysis and interpretation. The segmentation can be categorized into three types： semantic segmentation， instance segmentation， and joint semantic-instance segmentation. Semantic segmentation partitions a 3D scene into informative regions and assigns each region to a specific class. Instance segmentation identifies and separates individual objects at the point level， including those that belong to the same semantic category. In recent years， researchers have increasingly focused on combining the two tasks to achieve more consistent and informative scene-level interpretations. Joint semantic-instance segmentation leverages the intrinsic correlation between semantic and instance-level segmentation， enabling the two tasks to complement and reinforce each other. In 3D point cloud contexts， this joint approach substantially improves the capability of the system to comprehend complex environments and offers strong technical support for the development of intelligent systems. Consequently， this approach has become an area of growing interest and active research. However， most existing methods for joint semantic-instance segmentation rely on simplistic feature fusion strategies， which limit their effectiveness in fully capturing the potential relationship between semantic and instance features. 
Aiming to address this limitation， an enhanced attention-based joint semantic-instance segmentation network is proposed. This network is designed to effectively model and utilize the correlation between semantic and instance information.

Method

The enhanced attention-based joint semantic-instance segmentation neural network （EAJS-Net） incorporates a semantic feature extraction module based on an attention mechanism. This module focuses on the local neighborhood of each point and dynamically adjusts attention weights to emphasize key information， thereby enhancing the extraction of semantic features across points. Additionally， an attention-enhanced semantic/instance feature fusion module is introduced， which adaptively learns the similarity between central and adjacent features. This design reinforces key characteristics and effectively captures the correlation between instance and semantic segmentation， ultimately improving overall segmentation accuracy. EAJS-Net integrates PointNet++ and PointConv as its backbone network and comprises three main components： a point feature enhancement module， an encoder-decoder module， and an enhanced attention-based joint segmentation module. The input to EAJS-Net includes N × 9 dimensional point cloud data， where N represents the number of points， and the nine dimensions include coordinate values （XYZ）， color information （RGB）， and normalized coordinates. A semantic feature extraction module based on an attention mechanism is employed to effectively capture local contextual information between points. The enhanced features extracted by this module are then fed into the encoding layer， which includes four encoding modules： one attention pooling-based set abstraction layer adapted from PointNet++ and three feature encoding layers derived from PointConv. The corresponding decoding layer comprises four decoding modules： three deep feature decoding layers derived from PointConv and one feature propagation layer from PointNet++. By utilizing the attention pooling-based set abstraction layer from PointNet++， the network effectively captures spatial geometric relationships among features. 
Through the combination of the encoding and decoding layers， the initial semantic and instance features of the point cloud are extracted， laying the foundation for accurate joint segmentation. An enhanced attention module is designed to adaptively learn the similarity between central and neighboring features through dual attention mechanisms， which dynamically compute attention weights. These dual attention weights are summed and applied to the initial semantic features， resulting in enhanced semantic representations. This module is embedded within the semantic branch of the joint segmentation module， enabling more effective integration of semantic and instance features to improve joint segmentation accuracy. The encoded features are then upsampled through two parallel decoder branches to generate an instance feature matrix and a semantic feature matrix， which serve as inputs to the joint segmentation module. Within this module， the semantic and instance branches are integrated using the enhanced attention mechanism. The final output comprises instance embeddings and semantic predictions， supporting precise and consistent segmentation results.

Result

The proposed network is evaluated on the Stanford large-scale 3D indoor spaces (S3DIS) dataset and ScanNet V2 to assess its performance on point cloud segmentation tasks. Six-fold cross-validation is performed on the S3DIS dataset, and the results of EAJS-Net are compared with those of state-of-the-art (SOTA) methods. For semantic segmentation on the S3DIS dataset, EAJS-Net achieves a mean intersection over union (mIoU) of 65.9%, overall accuracy (oAcc) of 89.1%, and mean accuracy (mAcc) of 76.0%. Compared to JSNet++, these results represent improvements of 3.5% (mIoU), 0.4% (oAcc), and 3.2% (mAcc). For instance segmentation, EAJS-Net reaches a weighted coverage rate of 61.1%, outperforming JSNet++ by 4.1% (mean weighted coverage, mWCov), 4.6% (mean coverage, mCov), and 1.2% (mean recall, mRec). On the ScanNet dataset, EAJS-Net improves the mIoU for semantic segmentation by 3.2% and increases the weighted coverage rate for instance segmentation by 2.8% compared to JSNet. Visual comparisons between EAJS-Net and other SOTA methods are also presented, demonstrating that EAJS-Net consistently achieves superior segmentation results, even in complex indoor scenes. In addition, ablation experiments are conducted to validate the effectiveness of individual modules within the network. The enhanced attention-based joint segmentation module in EAJS-Net dynamically adjusts attention weights to effectively capture various features, successfully integrating semantic and instance features into the semantic feature space. This integration notably enhances the performance of the semantic segmentation task.
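The three semantic segmentation scores reported above (mIoU, oAcc, and mAcc) can be computed from per-point labels as in the following sketch. It follows the standard definitions via a confusion matrix; the function name and the choice to skip classes that never occur are assumptions, and the instance-level coverage metrics are not covered.

```python
def segmentation_scores(truth, predicted, num_classes):
    """Return (mIoU, oAcc, mAcc) for per-point integer class labels."""
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, predicted):
        confusion[t][p] += 1  # rows: true class, columns: predicted class

    ious, class_accuracies = [], []
    for k in range(num_classes):
        true_pos = confusion[k][k]
        false_neg = sum(confusion[k]) - true_pos
        false_pos = sum(row[k] for row in confusion) - true_pos
        if true_pos + false_pos + false_neg > 0:
            ious.append(true_pos / (true_pos + false_pos + false_neg))
        if true_pos + false_neg > 0:
            class_accuracies.append(true_pos / (true_pos + false_neg))

    correct = sum(confusion[k][k] for k in range(num_classes))
    overall_accuracy = correct / len(truth)
    mean_iou = sum(ious) / len(ious)
    mean_accuracy = sum(class_accuracies) / len(class_accuracies)
    return mean_iou, overall_accuracy, mean_accuracy


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    preds = [0, 1, 1, 1]
    print(segmentation_scores(labels, preds, num_classes=2))
```

Overall accuracy weights every point equally, whereas mIoU and mAcc average over classes, so gains on rare classes show up mainly in the latter two.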

Conclusion

Aiming to address the limitations of existing feature fusion strategies that fail to fully capture inter-instance semantic correlations， this paper proposes a novel semantic-instance joint segmentation network， EAJS-Net， based on an enhanced attention mechanism. A new semantic feature extraction module is designed to capture contextual relationships among points. Additionally， an enhanced attention module is introduced to effectively aggregate instance features into the semantic feature space. This improved feature fusion strategy boosts the performance of joint semantic-instance segmentation. Experimental results demonstrate that EAJS-Net effectively integrates semantic and instance features， substantially improving the accuracy of both segmentation tasks compared to SOTA methods.

Transformer attention-guided optimal view selection and classification for 3D models

Songle Chen, Ruyue Huang, Sixuan Huang, Yi Chen, Qian Li

Journal of Image and Graphics. 2025, 30(12): 3927-3940.

Objective

3D model classification is a fundamental problem in the fields of computer graphics and computer vision， with wide-ranging applications in areas such as computer-aided design， mixed reality， autonomous driving， and robotic navigation. The challenges associated with 3D model classification primarily arise from three key aspects： the difficulty in representing 3D surface geometric features， the diversity of 3D transformations and deformations， and the incompleteness of geometric and topological structures. Existing multi-view-based 3D model classification methods typically render 3D models from multiple preset viewpoints and input all rendered views into a neural network for classification. However， due to the presence of redundant and ineffective views， not all views contribute equally to the classification task. Selecting views that substantially enhance classification performance can not only improve the overall accuracy of multi-view 3D model classification but also help identify representative views that effectively capture the essential characteristics of the 3D model.

Method

This paper proposes a Transformer attention-guided approach for optimal view selection and classification of 3D models. The 3D model is first rendered from 20 viewpoints arranged on a regular icosahedron. A convolutional neural network is then employed to extract feature information from these multiple views， producing a sequence of local multi-view feature tokens. Aiming to retain spatial location information， position encoding is applied to the token sequence. Next， a learnable global classification token is introduced and concatenated with the multi-view feature tokens， forming the input to a Transformer encoder that performs global view feature fusion and generates an initial global classification feature. Subsequently， the optimal view selection module calculates the contribution of each view to the initial global classification token using the attention score matrix from the feature fusion process. The highest-scoring views are selected as the optimal views. These optimal view feature tokens are then concatenated with the initial global classification token and input into the Transformer encoder for a second round of feature fusion， producing the final global classification token. This final token is passed through a classifier to generate the classification probabilities and simultaneously output the selected optimal views. Aiming to enhance generalization during training， the model incorporates random view dropping and contrastive learning strategies.
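The view selection step can be illustrated with a small sketch. It assumes that the contribution of each view is read from the attention row of the classification token, averaged over heads, and that the top-k views are kept; the abstract does not specify the exact scoring rule or the value of k, so `select_optimal_views`, the head averaging, and the toy weights are all assumptions.

```python
def select_optimal_views(cls_attention, k):
    """Return the indices of the k views with the highest mean attention.

    cls_attention[h][v] = attention weight from the classification token
    to view v in head h. Averaging over heads is an ASSUMPTION made for
    this sketch, not a detail confirmed by the abstract.
    """
    num_heads = len(cls_attention)
    num_views = len(cls_attention[0])
    mean_scores = [
        sum(cls_attention[h][v] for h in range(num_heads)) / num_heads
        for v in range(num_views)
    ]
    # Rank view indices from highest to lowest mean attention.
    ranked = sorted(range(num_views), key=lambda v: mean_scores[v], reverse=True)
    return ranked[:k], mean_scores


# Toy attention weights: 2 heads, 4 views (made-up numbers).
attn = [
    [0.10, 0.40, 0.20, 0.30],
    [0.20, 0.50, 0.10, 0.20],
]
top_views, scores = select_optimal_views(attn, k=2)
```

In this toy case, views 1 and 3 receive the largest mean attention and would be the ones passed to the second round of feature fusion.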

Result

Experiments are conducted on the ModelNet40 dataset， which comprises 40 object categories and is widely used for benchmarking 3D object recognition algorithms. Evaluation metrics include overall accuracy （OA）， average accuracy （AA）， and speed. OA measures classification accuracy across the entire dataset， while AA calculates the mean accuracy across all categories， addressing issues related to class imbalance. First， the Transformer-based multi-view selection and 3D model classification method proposed in this paper is compared with other state-of-the-art deep learning-based 3D model classification methods to validate its effectiveness. Subsequently， ablation experiments are conducted to analyze the impact of different parameter settings on the performance of the proposed method， including multi-view representation， feature extraction backbone， Transformer hidden layer dimension， number of attention heads， contrastive learning strategy， and random view dropout module. On the ModelNet40 benchmark dataset， the proposed method achieves an overall recognition accuracy of 97.61% and an average recognition accuracy of 96.36%. In addition to reaching state-of-the-art classification performance， the optimal views selected based on the Transformer attention score matrix are shown to be highly representative.

Conclusion

The proposed method leverages the Transformer architecture to perform feature fusion across different views. By employing mechanisms such as self-attention， residual connections， and multi-layer stacking， the Transformer effectively learns complex features and captures global contextual relationships among different views. Furthermore， the attention score matrix generated by the Transformer serves as a basis for optimal view selection， enabling efficient classification while identifying the most representative views.

Adaptive ground-truth heatmap generation for bottom-up human pose estimation

Ling Jiang, Zhuocheng Liu, Yuan Xiong, Wei Wu, Kaige Li

Journal of Image and Graphics. 2025, 30(12): 3870-3883.

Objective

Human pose estimation aims to locate skeletal keypoints of individuals in a given image. As a fundamental task in computer vision， human pose estimation has wide applications in human activity recognition， person re-identification， pose tracking， and related fields. Two main approaches for human pose estimation are available： top-down and bottom-up. Top-down methods first detect human bodies in the image， crop out each person， and then estimate the keypoint coordinates. While effective， these methods perform poorly in cases of occlusion， and their computation cost increases with the number of people in the image. In contrast， bottom-up methods detect all identity-independent keypoints simultaneously and then group them into individual poses. These methods are typically lightweight and fast but must handle varying human scales. Bottom-up human pose estimation methods commonly use 2D Gaussian kernels to generate keypoint heatmaps as regression targets because they provide rich spatial information. However， conventional approaches apply Gaussian kernels with a fixed variance across all keypoints， resulting in uniform heatmap structures. This uniformity is problematic given the existing scale variability in bottom-up methods. On the one hand， different keypoints cover different pixel areas in images， and using large Gaussian kernels may introduce semantic ambiguity， particularly for small joints. On the other hand， differences in keypoint scale imply different levels of annotation uncertainty， which the heatmap variance should ideally reflect. The variance of the Gaussian kernel represents uncertainty； thus， it should be proportional to the scale and ambiguity associated with each keypoint. Aiming to address these issues， an adaptive heatmap generation network （AHGNet） for bottom-up human pose estimation is proposed. AHGNet estimates the appropriate radius of the Gaussian kernel for each keypoint by integrating inherent scale information and geometric relationships. 
Through formula derivation， the relationship between the radius and the Gaussian kernel variance is established， enabling the creation of customized， scale-adaptive ground-truth heatmaps. This approach improves localization accuracy by effectively aligning the heatmap structure with the spatial characteristics of each keypoint.
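To make the effect of a per-keypoint kernel size concrete, the sketch below renders two ground-truth heatmaps for the same keypoint with different radii. The mapping from radius to standard deviation used here (a three-sigma rule, `sigma = radius / 3`) is an assumption chosen only for illustration; the paper derives its own relationship, which the abstract does not state. The function names are hypothetical.

```python
import math


def radius_to_sigma(radius):
    # ASSUMPTION: three-sigma rule, so the kernel has mostly decayed at `radius`.
    # The relation derived in the paper is not given in the abstract.
    return radius / 3.0


def gaussian_heatmap(height, width, center, sigma):
    """Render an unnormalized 2D Gaussian heatmap with peak value 1 at `center`."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


# Same keypoint location, two hypothetical radii (small joint vs. large joint).
small = gaussian_heatmap(9, 9, center=(4, 4), sigma=radius_to_sigma(3.0))
large = gaussian_heatmap(9, 9, center=(4, 4), sigma=radius_to_sigma(6.0))
```

Two pixels away from the center, the small-radius heatmap has already dropped well below the large-radius one, which is the kind of scale-dependent supervision a fixed-variance kernel cannot express.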

Method

First， an adaptive heatmap generation module is introduced. Keypoint scale is defined by the semantic coverage area of a keypoint in the image. However， in real scenes， accurately annotating the pixel area occupied by each keypoint is almost impossible， and determining the potential relationship between Gaussian kernels and coverage areas is difficult. Interestingly， the area occupied by a keypoint is found to be related to its geometric distance from adjacent keypoints. Therefore， the module combines the geometric relationship between adjacent keypoints with the inherent scale information from image features to constrain the coverage areas of the kernels and generate kernel scale maps of keypoints. Second， a local probabilistic consistency loss is presented to define the distance between the predicted and ground-truth heatmaps globally and locally. Most methods based on heatmap regression use L₂ loss for supervised learning. However， as the loss function for heatmap regression， L₂ loss assumes that each pixel is independent and overlooks the local structural correlation， making it difficult to describe the probability distribution of heatmaps. A keypoint heatmap is a probability distribution that describes how likely each pixel is to belong to a certain joint. Thus， KL divergence is added to describe local probability consistency. Moreover， samples with large prediction errors are difficult to predict； thus， the weight of difficult samples should be increased. Similarly， the weight of easily detected samples should be reduced. Therefore， a dynamic weight is added to balance the contribution of different samples. 
Inspired by focal loss， which allows the model to actively focus on hard-to-detect samples， this paper utilizes dynamic weights to reduce the contribution of easily detected samples while enhancing the contribution of hard-to-detect samples.
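One way such a loss could be assembled is sketched below: a squared-error term with a focal-style weight that grows with the per-pixel error, plus a KL divergence between the normalized target and predicted heatmaps. The exact form of the weighting, the balance factor `lam`, the exponent `gamma`, and the use of a single flattened patch instead of local windows are all assumptions; the abstract does not give the formula.

```python
import math


def normalize(heatmap, eps=1e-8):
    """Turn non-negative heatmap values into a probability distribution."""
    total = sum(heatmap) + eps * len(heatmap)
    return [(v + eps) / total for v in heatmap]


def consistency_loss(pred, target, lam=1.0, gamma=2.0):
    """Hypothetical loss: dynamically weighted L2 plus KL divergence.

    pred, target: flattened heatmap patches with values in [0, 1].
    lam, gamma:   ASSUMED hyperparameters, not taken from the paper.
    """
    n = len(pred)
    # Focal-style dynamic weight: pixels with a larger error get a larger weight.
    weighted_l2 = sum(
        (abs(p - t) ** gamma) * (p - t) ** 2 for p, t in zip(pred, target)
    ) / n
    # KL(target || pred) over the normalized patch.
    p_dist, t_dist = normalize(pred), normalize(target)
    kl = sum(t * math.log(t / p) for t, p in zip(t_dist, p_dist))
    return weighted_l2 + lam * kl


target = [0.0, 0.5, 1.0, 0.5, 0.0]
close = [0.0, 0.4, 0.9, 0.6, 0.1]
far = [0.9, 0.1, 0.0, 0.1, 0.9]
loss_close = consistency_loss(close, target)
loss_far = consistency_loss(far, target)
```

The loss vanishes when prediction and target coincide and penalizes the badly misplaced prediction far more than the nearly correct one, which is the behavior the dynamic weighting is meant to produce.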

Result

HrHRNet is used as the baseline to establish AHGNet for bottom-up human pose estimation. The model is tested on two public datasets： MS COCO and CrowdPose. Experimental results reveal that AHGNet surpasses HrHRNet in terms of average precision （AP）， achieving 72.1% AP and 74.1% AP on the COCO test-dev and CrowdPose datasets， improvements of 1.6% AP and 6.5% AP， respectively. In addition， the substantial improvement on the CrowdPose dataset， which contains crowded scenes， indicates that AHGNet helps alleviate the problem of human scale variation in complex crowded scenes. The ablation experiments also verify the effectiveness of the proposed method.

Conclusion

AHGNet leverages geometric features between adjacent keypoints and inherent scale information within the image to generate adaptive heatmaps as ground truth. This network further employs a local probability consistency loss function to address the challenges posed by various human scales， effectively improving the accuracy of bottom-up human pose estimation. AHGNet provides a new paradigm for optimizing supervision signals in bottom-up pose estimation. By dynamically adjusting the Gaussian kernel scale and enforcing local probability constraints， it effectively reduces multiscale ambiguity in complex scenarios.

Dual-stage guided weakly supervised semantic segmentation with Gaussian correction

Xuefei Bai, Yuanhui Wang, Wenjie Xu, Gaoxia Jiang, Wenjian Wang

Journal of Image and Graphics. 2025, 30(12): 3855-3869.

Objective

Weakly supervised semantic segmentation （WSSS） aims to reduce the cost associated with annotating “strong” pixel-level labels by using “weak” labels， such as points， bounding boxes， image-level class labels， and scribbles. Among these， image-level class labels are the most cost-effective and readily available； however， leveraging them for precise segmentation remains a considerable challenge. A widely used WSSS approach based on image-level class labels generally comprises the following steps： 1） training a neural network for image classification using the class labels； 2） using the trained network to generate class activation maps （CAMs）， which serve as seed regions for the segmentation task； and 3） refining these CAMs into pseudo-labels， which are then used as the ground truth to supervise a segmentation network. These steps can be integrated into a single collaborative stage； typically， single-stage frameworks are highly efficient due to their simplified training pipeline. However， the quality of pseudo-labels is crucial to the overall performance of semantic segmentation. High-quality pseudo-labels result in superior segmentation outcomes， whereas noisy or inaccurate pseudo-labels hinder the capability of the model to learn meaningful features. WSSS based on image-level labels faces considerable challenges due to the absence of precise positional and shape-related information， making it difficult to generate accurate segmentation maps. These challenges have led to the development of various approaches， which can be broadly categorized into two types： single-stage methods and multistage methods. Although single-stage methods offer greater efficiency and simplify the overall training process， they often produce less accurate pseudo-labels. This condition is due to the limited refinement of CAMs， resulting in imprecise supervision signals that ultimately degrade segmentation performance. 
Aiming to alleviate these limitations， a simple yet novel single-stage WSSS framework that incorporates knowledge distillation is introduced to enhance pseudo-label quality without relying on any additional external supervision. The framework enhances the feature learning process within the teacher-student network using a dual-stage knowledge distillation module. This module allows the student network to acquire more dynamic and informative knowledge from the teacher network while preserving key features， thereby enhancing the overall robustness of the student model. Moreover， to further improve segmentation accuracy， a pseudo-label correction module based on a Gaussian mixture model （GMM） is introduced. This module refines the pseudo-labels by modeling the distribution of the CAMs， resulting in highly accurate and reliable supervision signals. The combination of dual-stage knowledge distillation and the Gaussian correction module ensures accurate learning and improved segmentation results， even under weak supervision signals such as image-level labels. Ultimately， the proposed method effectively mitigates the impact of noise during training and enhances the accuracy of the generated pseudo-labels， resulting in superior semantic segmentation outcomes in WSSS tasks.

Method

A novel weakly-supervised semantic segmentation method， aimed at addressing the challenges posed by noisy data points and weak supervision， is proposed. First， a dual-stage knowledge interaction module is introduced to enhance the feature learning process of the teacher and student networks. By enabling highly effective knowledge exchange between the two networks， the proposed approach notably reduces the impact of noise during training， leading to robust feature extraction. Additionally， a Gaussian correction module is proposed to enhance the quality of pseudo-labels. This module refines the pseudo-labels by modeling the distribution of class activation maps. By fitting the distribution more accurately， the module corrects potential errors in the pseudo-labels， ensuring that the model learns from high-quality， refined labels. Therefore， the method boosts the overall performance of weakly-supervised semantic segmentation， making it more robust to noise and improving segmentation accuracy. This method provides a promising solution for weakly-supervised segmentation tasks.
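The idea of correcting pseudo-labels by modeling the distribution of CAM activations can be sketched with a two-component one-dimensional Gaussian mixture fitted by expectation-maximization. This is only an illustration under stated assumptions: the number of components, the per-image 1D fit, the initialization, and the relabeling rule are not described in the abstract, and `fit_gmm_1d` and `correct_labels` are hypothetical names.

```python
import math


def component_densities(v, weights, means, variances):
    """Weighted Gaussian density of value v under each of the two components."""
    return [
        weights[k]
        * math.exp(-((v - means[k]) ** 2) / (2.0 * variances[k]))
        / math.sqrt(2.0 * math.pi * variances[k])
        for k in range(2)
    ]


def fit_gmm_1d(values, iters=50):
    """Fit a two-component 1D Gaussian mixture with EM (assumed setup)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                                  # start at the extremes
    variances = [((hi - lo) / 4.0) ** 2 + 1e-6] * 2
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each activation value.
        resp = []
        for v in values:
            dens = component_densities(v, weights, means, variances)
            total = sum(dens) + 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            variances[k] = (
                sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
                + 1e-6                                # floor to avoid collapse
            )
    return weights, means, variances


def correct_labels(values, params):
    """Label a pixel foreground (1) if the high-mean component explains it better."""
    weights, means, variances = params
    fg = 0 if means[0] > means[1] else 1
    labels = []
    for v in values:
        dens = component_densities(v, weights, means, variances)
        labels.append(1 if dens[fg] >= dens[1 - fg] else 0)
    return labels


# Toy CAM activations for nine pixels (made-up values); 0.45 is ambiguous.
cam_scores = [0.05, 0.10, 0.12, 0.08, 0.45, 0.85, 0.90, 0.80, 0.88]
params = fit_gmm_1d(cam_scores)
labels = correct_labels(cam_scores, params)
```

Compared with a fixed threshold on the CAM, the decision boundary here follows the fitted distribution, so ambiguous activations are assigned according to how the low and high activation groups are actually spread in the image.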

Result

The mIoU values of this method on the PASCAL VOC 2012 and MS COCO 2014 datasets were 74.8% and 42.3%， respectively， surpassing other comparative methods. Specifically， on the PASCAL VOC 2012 dataset， the proposed method achieved a 3.7% improvement over ToCo， an 8.8% enhancement compared to AFA， a 7.5% increase relative to TSCD， and 1.1% compared to BECO. On the MS COCO 2014 dataset， the method improved performance by 2.2% compared to TSCD， 3.4% compared to AFA， and 5.3% compared to AuxSegNet+. Additionally， the mIoU values of different categories are compared on the PASCAL VOC 2012 validation set. The experimental results showed that the method outperformed the competing methods in 16 categories. Notably， for the background class， the method achieved an mIoU of 92.4%， the highest among all methods evaluated. This result indicates that the method effectively leverages the Gaussian correction module to reduce misclassification of background regions， thereby improving segmentation performance. Furthermore， the method achieved notable improvements in categories such as bird， bottle， car， chair， and cow， further demonstrating its effectiveness.

Conclusion

The proposed method effectively mitigates the impact of noise during training and addresses the issue of incomplete pseudo-label generation through the integration of a dual-stage knowledge distillation module and a Gaussian correction module. This approach achieves remarkable performance improvements compared to existing methods. Overall， the results demonstrate notable advantages in end-to-end weakly supervised semantic segmentation， and the method holds considerable research value.

Video question answering with large language models： a survey

Junlin Xie, Ruifei Zhang, Guanbin Li

Journal of Image and Graphics. 2025, 30(12): 3760-3781.

In recent years， large language models （LLMs） have achieved remarkable progress in natural language processing （NLP）， demonstrating exceptional capabilities in language understanding and generation. These advancements have driven widespread applications in tasks such as text generation， machine translation， question answering， text summarization， and text classification. However， despite their impressive performance in handling and generating text， LLMs face notable limitations when handling highly complex multimodal tasks， particularly in the domain of video question answering （Video QA）. Video QA is a particularly challenging task that requires models to comprehend and generate responses based on dynamic visual content， which often includes temporal and auditory information. Unlike static images or purely textual contents， video data contains inherent temporal dependencies， where the meaning of events and actions unfolds over time. This temporal dimension adds substantial complexity to the understanding process because models must not only interpret individual frames but also maintain coherent understanding across sequences of frames within the broader video context. Consequently， effective Video QA demands advanced temporal information processing capabilities that many LLMs， primarily designed for static text， often struggle to handle adequately. Moreover， the multimodal nature of video， which often involves the integration of visual， auditory， and occasionally textual cues， further complicates the task. Effective Video QA requires the model to seamlessly fuse information across these different modalities， ensuring accurate interpretation and response to questions regarding video content. This process involves understanding visual scenes， recognizing speech or background sounds， and correlating them with the corresponding textual information. 
The challenge lies not only in processing each modality independently but also in establishing meaningful connections between them to generate coherent and contextually appropriate responses. This paper presents a comprehensive review of the current state of research on Video QA models based on large language models. The technical characteristics， strengths， and weaknesses of non-real-time and real-time Video QA models are also investigated. Non-real-time Video QA models typically operate on pre-recorded video content， allowing them to access and analyze the entire video sequence before generating responses. These models can leverage global contextual information， making such models particularly effective for tasks that require video content analysis， such as video summarization or detailed scene interpretation. However， they may struggle with efficiency and scalability， particularly when handling long videos or large datasets. In contrast， real-time Video QA models are designed to process video streams as they are received， increasing their suitability for applications requiring immediate responses， such as live video monitoring or interactive video systems. However， these models must maintain a balance between processing speed and accuracy due to their frequently limited access to the full temporal context of the video. The paper discusses the challenges encountered by these models in maintaining performance under real-time constraints， including efficient computation and prediction capability based on partial information. Additionally， the paper explores the commonly used datasets in Video QA research， highlighting their features， limitations， and the types of tasks they are designed to address. The evaluation of Video QA models is also examined， focusing on the metrics and benchmarks used to assess their performance. 
Understanding the strengths and weaknesses of different datasets is crucial for advancing the field， helping in the identification of gaps in current research and guiding the development of robust and versatile models. Finally， the paper addresses the extensive challenges and bottlenecks in the field of Video QA， including the difficulties in scaling models to handle large and diverse video datasets， the need for efficient multimodal fusion techniques， and the computational demands associated with video data processing in real-time. The discussion is further extended to consider the potential future research directions in Video QA， with particular emphasis on improving the temporal reasoning capabilities of LLMs， enhancing their multimodal integration， and developing efficient model architectures that can operate effectively under resource constraints. Overall， while large language models have presented new possibilities in the field of video interpretation， considerable challenges remain in adapting these models to the specific demands of Video QA. Through the systematic review of the current advancements and the presentation of the key obstacles and future directions， this paper aims to contribute to the ongoing efforts to develop highly capable and intelligent multimodal AI systems. The field must continue innovations in the following areas： temporal modeling， where novel architectures that can effectively capture long-range dependencies in video sequences are needed； multimodal representation learning， where sophisticated approaches for integrating visual， auditory， and textual features could yield substantial improvements. Furthermore， the development of highly efficient training paradigms that can address the computational intensity of video processing while retaining model performance is essential for practical applications. 
Another critical area for future work focuses on the creation of highly comprehensive and challenging benchmark datasets that effectively reflect real-world scenarios， pushing the boundaries of what current models can achieve. As research in this area progresses， addressing these challenges will be crucial for realizing the full potential of LLMs in video interpretation applications. Achieving this goal will require AI systems that can interpret and reason about dynamic visual content with a level of proficiency comparable to human cognition. The integration of advanced techniques from computer vision， speech processing， and natural language understanding will be pivotal in developing truly multimodal systems capable of managing the complexity and variability in real-world video data. Through continued innovation and interdisciplinary collaboration， the field can overcome current limitations and drive the development of next-generation video understanding technologies with broad applicability across domains such as education， entertainment， surveillance， and human-computer interaction.

Recent progress in rotation-invariant point cloud networks

Zhengbao Wang, Zhenxuan Zeng, Xuan Ouyang, Haozhe Chen, Linjie Li, Jiaqi Yang

Journal of Image and Graphics. 2025, 30(12): 3782-3803.

In recent years， deep learning networks for point clouds have achieved remarkable advancements， with their robust semantic understanding capabilities propelling research across the entire field of three-dimensional （3D） computer vision. These advancements have enabled accurate and efficient processing of 3D data， supporting applications in autonomous driving， robotics， remote sensing and mapping， and augmented reality. However， 3D point clouds often exhibit complex transformation symmetries， with rotation being a particularly challenging yet critical factor. The spatial coordinates of point clouds， which are the fundamental input to point cloud networks， undergo substantial changes， resulting in feature output variations. However， the semantic information embedded within point clouds theoretically remains consistent under various rotational transformations. This spatial variability substantially impacts the stability and reliability of conventional point cloud deep learning networks in semantic perception tasks， such as recognition， classification， and segmentation， reducing their effectiveness in real-world scenarios characterized by arbitrary orientations and poses. Early studies primarily relied on rotational data augmentation to enhance the robustness of point cloud networks against rotational variations. While data augmentation can improve generalization to some extent， it falls short of addressing the fundamental issue posed by the infinite and continuous nature of the rotation group. Acknowledging these limitations， an increasing number of researchers have shifted their focus toward designing rotation-invariant point cloud deep learning networks， which aim to mitigate the impact of rotation on feature extraction at the architectural level. 
Therefore， researchers seek to achieve consistent semantic perception regardless of point cloud orientation， thereby enhancing the applicability of deep learning models in real-world scenarios where data can be encountered in arbitrary poses. This paper presents a comprehensive survey of the current state of research on rotation-invariant point cloud networks. The research background is first outlined to highlight the importance of rotation invariance in 3D vision tasks and the challenges posed by rotational symmetries in point cloud data. Then， a systematic categorization of the prevailing mainstream methods is presented. Particularly， rotation-invariant point cloud networks can be broadly classified into the following three categories： 1） Geometric-guided rotation-invariant methods： Using traditional geometric analysis algorithms， these methods extract rotation-invariant geometric representations such as relative distances， angles， local reference frames， and canonical poses. These representations are then integrated into point cloud networks， facilitating the learning of high-level semantic features while maintaining robustness to rotational transformations. 2） Feature-guided rotation-invariant methods： These methods employ rotation-equivariant point cloud networks to extract point cloud representations that contain shape and pose information. Leveraging the inherent principles of equivariant networks， they subsequently remove the pose information from the rotation-equivariant representations， obtaining rotation-invariant point cloud features. 3） Training-guided rotation-invariant methods： These methods focus on designing sophisticated and highly generalizable rotational data augmentation training schemes， allowing non-rotation-invariant point cloud networks to gradually acquire robustness to rotations while achieving stable performance. 
An in-depth analysis of the core concepts and algorithmic improvements that support these methods is provided for each category. The current research content on this issue and methodologies within the academic community are outlined， and the advantages and disadvantages of each method are summarized and compared. Subsequently， a comprehensive overview of the prevalent downstream tasks in the research of rotation-invariant point cloud networks is presented. These tasks include point cloud classification， point cloud segmentation， and point cloud retrieval. For each of these tasks， an in-depth discussion of the commonly employed datasets and evaluation metrics， which are essential for assessing network performance， is provided. Additionally， the quantitative performance metrics of mainstream rotation-invariant point cloud networks applied to these tasks are summarized and analyzed， offering a comparative perspective on their efficacy and robustness under rotational variations. Afterward， the downstream application prospects of rotation-invariant point cloud deep learning networks， including point cloud self-supervised representation learning， end-to-end point cloud registration， and point cloud completion， are examined and summarized. Finally， an outlook on future developments and research hotspots is presented. In addition to the ongoing development of new rotation-invariant point cloud networks， three primary issues warrant further research： 1） discrimination of effective geometric attributes. Current approaches are limited by the design of geometric attribute extraction algorithms. An in-depth discussion and determination of the effectiveness of different rotation-invariant geometric attributes within deep learning frameworks could yield novel insights and foster the development of innovative strategies to advance this field. 2） Highly integratable rotation-invariant mechanism. 
On the one hand， existing non-rotation-invariant point cloud networks continue to demonstrate strong performance on aligned data. The challenge lies in incorporating rotation invariance into these networks in a straightforward manner without degrading their original performance. This challenge remains a key research topic because seamless integration requires innovative architectural designs and methodological approaches. On the other hand， rotation-invariant point cloud networks should also exhibit simplicity and reusability， enabling their direct application to downstream tasks with minimal adaptation. 3） High computational efficiency in invariant feature extraction modules. Although many existing methods demonstrate commendable performance， they often incur substantial time and computational costs， making it challenging to efficiently process large-scale point cloud data. Therefore， designing more efficient rotation-invariant point cloud networks that maintain robust feature extraction capabilities while minimizing computational overhead is crucial. Addressing the aforementioned challenges will notably enhance the effectiveness and practicality of rotation-invariant point cloud deep learning networks， facilitating their widespread adoption in complex 3D environments. This survey aims to provide researchers in 3D computer vision with a foundational understanding of current methodologies， highlight key challenges， and suggest potential avenues for future research.
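The geometric-guided category surveyed above rests on quantities such as relative distances that do not change when a point cloud is rotated. The sketch below checks this for one simple descriptor, the distance of each point to the cloud centroid, under a rotation about the z-axis. It is a toy illustration of the invariance property only, not a method from any surveyed paper; the function names and the sample cloud are assumptions.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def centroid_distances(points):
    """Distance of each point to the centroid: a rotation-invariant descriptor."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    return [
        math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
        for x, y, z in points
    ]


# Toy point cloud (made-up coordinates) and a rotated copy.
cloud = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
rotated = rotate_z(cloud, 0.7)

before = centroid_distances(cloud)
after = centroid_distances(rotated)
max_diff = max(abs(a - b) for a, b in zip(before, after))
```

The raw coordinates change under the rotation while the descriptor stays the same up to floating-point error, which is why networks fed with such representations need no rotational data augmentation for this input.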