Adaptive ground-truth heatmap generation for bottom-up human pose estimation
Ling Jiang¹, Zhuocheng Liu², Yuan Xiong², Wei Wu², Kaige Li²,*
Journal of Image and Graphics | 2025, 30(12): 3870-3883
Image Understanding and Computer Vision
Affiliations
  • ¹ School of Computer Science and Engineering, Anhui University of Science & Technology, Huainan 232001, China
  • ² State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
Published: 2025-12-16 doi: 10.11834/jig.240615
Objective

Human pose estimation aims to locate the skeletal keypoints of each person in a given image. As a fundamental task in computer vision, it has wide applications in human activity recognition, person re-identification, pose tracking, and related fields. There are two main approaches: top-down and bottom-up. Top-down methods first detect human bodies in the image, crop out each person, and then estimate the keypoint coordinates. While effective, these methods perform poorly under occlusion, and their computational cost grows with the number of people in the image. In contrast, bottom-up methods detect all identity-independent keypoints simultaneously and then group them into individual poses. These methods are typically lightweight and fast but must handle varying human scales.

Bottom-up methods commonly use 2D Gaussian kernels to generate keypoint heatmaps as regression targets because heatmaps provide rich spatial information. However, conventional approaches apply Gaussian kernels with a fixed variance to all keypoints, which yields heatmaps with a uniform structure. This uniformity is problematic given the scale variability that bottom-up methods face. On the one hand, different keypoints cover different pixel areas in the image, and a large Gaussian kernel may introduce semantic ambiguity, particularly for small joints. On the other hand, differences in keypoint scale imply different levels of annotation uncertainty. Because the variance of the Gaussian kernel represents this uncertainty, it should be proportional to the scale and ambiguity of each keypoint. To address these issues, an adaptive heatmap generation network (AHGNet) for bottom-up human pose estimation is proposed. AHGNet estimates an appropriate Gaussian kernel radius for each keypoint by integrating inherent scale information and geometric relationships. Through formula derivation, the relationship between the radius and the Gaussian kernel variance is established, enabling the creation of customized, scale-adaptive ground-truth heatmaps. This approach improves localization accuracy by aligning the heatmap structure with the spatial characteristics of each keypoint.
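
The idea of a per-keypoint kernel can be illustrated with a minimal NumPy sketch. It is not the AHGNet implementation: the function names are hypothetical, the radii are passed in by hand rather than predicted by a network, and the radius-to-sigma mapping uses the common three-sigma rule as a stand-in, because the abstract does not state the relationship derived in the paper.

```python
import numpy as np


def gaussian_heatmap(height, width, center_xy, sigma):
    """Render one unnormalized 2D Gaussian whose peak value is 1 at center_xy."""
    xs = np.arange(width, dtype=np.float64)[None, :]   # shape (1, W)
    ys = np.arange(height, dtype=np.float64)[:, None]  # shape (H, 1)
    cx, cy = center_xy
    squared_dist = (xs - cx) ** 2 + (ys - cy) ** 2     # broadcasts to (H, W)
    return np.exp(-squared_dist / (2.0 * sigma ** 2))


def radius_to_sigma(radius, k=3.0):
    # ASSUMPTION: three-sigma rule (sigma = radius / 3). The paper derives its
    # own radius-variance relationship, which is not given in the abstract.
    return radius / k


def adaptive_heatmap(height, width, keypoints, radii):
    """Combine one Gaussian per keypoint, each with its own radius.

    keypoints: list of (x, y) pixel coordinates.
    radii: list of per-keypoint kernel radii in pixels (hand-picked here).
    Overlapping kernels are merged with an element-wise maximum, which is a
    design choice of this sketch.
    """
    heatmap = np.zeros((height, width), dtype=np.float64)
    for (x, y), radius in zip(keypoints, radii):
        kernel = gaussian_heatmap(height, width, (x, y), radius_to_sigma(radius))
        heatmap = np.maximum(heatmap, kernel)
    return heatmap


# Two keypoints with different assumed scales: a small joint and a large one.
hm = adaptive_heatmap(64, 64, keypoints=[(16, 16), (48, 40)], radii=[3.0, 9.0])
```

With a fixed variance both peaks would have the same width; here the response three pixels away from the small joint has almost vanished, while the large joint still has a broad peak.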

Method

First, an adaptive heatmap generation module is introduced to generate kernel scale maps for the keypoints. Keypoint scale is defined by the semantic area that a keypoint covers in the image. In real scenes, however, accurately annotating the pixel area occupied by each keypoint is almost impossible, and the underlying relationship between Gaussian kernels and coverage areas is difficult to determine. Interestingly, the area occupied by a keypoint is found to be related to its geometric distance from adjacent keypoints. The module therefore combines the geometric relationship between adjacent keypoints with the inherent scale information from image features to constrain the coverage areas of the kernels.

Second, a local probability consistency loss is presented to measure the distance between the predicted and ground-truth heatmaps both globally and locally. Most heatmap regression methods use L2 loss for supervised learning. However, L2 loss assumes that pixels are independent and overlooks local structural correlation, which makes it ill-suited to describing the probability distribution of a heatmap. Because a keypoint heatmap is a probability distribution describing how likely each pixel is to belong to a given joint, KL divergence is added to enforce local probability consistency. Moreover, samples with large prediction errors are hard to predict, so their weight should be increased, while the weight of easily detected samples should be reduced. Inspired by focal loss, which lets the model actively focus on hard-to-detect samples, a dynamic weight is added to balance the contributions of different samples.
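
A rough NumPy sketch of how an L2 term, a local KL term, and an error-dependent weight might be combined is given below. The exact formulation in the paper is not stated in the abstract, so everything here is an assumption made for illustration: the function names, the square window around each keypoint, the per-pixel weight `(error / max error) ** gamma`, and the mixing factor `lam`.

```python
import numpy as np


def local_kl(pred, target, center_xy, half_window=3, eps=1e-8):
    """KL(target || pred) inside a square window around one keypoint.

    Both patches are normalized to sum to 1 so that they can be read as
    local probability distributions. The window size is an assumption.
    """
    cx, cy = (int(round(v)) for v in center_xy)
    h, w = target.shape
    y0, y1 = max(cy - half_window, 0), min(cy + half_window + 1, h)
    x0, x1 = max(cx - half_window, 0), min(cx + half_window + 1, w)
    p = np.clip(pred[y0:y1, x0:x1], 0.0, None) + eps
    q = np.clip(target[y0:y1, x0:x1], 0.0, None) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(q * np.log(q / p)))


def weighted_heatmap_loss(pred, target, keypoints, gamma=2.0, lam=1.0, eps=1e-8):
    """Dynamically weighted L2 term plus the mean local KL term.

    ASSUMPTION: the dynamic weight is applied per pixel and grows with the
    current absolute error, in the spirit of focal loss. This is not
    necessarily the weighting used in the paper.
    """
    err = np.abs(pred - target)
    weights = (err / (err.max() + eps)) ** gamma  # in [0, 1]; small for easy pixels
    l2_term = float(np.mean(weights * err ** 2))
    if keypoints:
        kl_term = float(np.mean([local_kl(pred, target, kp) for kp in keypoints]))
    else:
        kl_term = 0.0
    return l2_term + lam * kl_term


# Toy example: the prediction is the target peak shifted two pixels to the right.
ys, xs = np.mgrid[0:32, 0:32]
target = np.exp(-((xs - 16) ** 2 + (ys - 16) ** 2) / (2 * 2.0 ** 2))
shifted = np.exp(-((xs - 18) ** 2 + (ys - 16) ** 2) / (2 * 2.0 ** 2))
loss_perfect = weighted_heatmap_loss(target, target, [(16, 16)])
loss_shifted = weighted_heatmap_loss(shifted, target, [(16, 16)])
```

A perfect prediction gives zero for both terms, while the shifted prediction is penalized by the weighted L2 term and by the mismatch between the two local distributions.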

Result

HrHRNet is used as the baseline on which AHGNet is built for bottom-up human pose estimation. The model is evaluated on two public datasets, MS COCO and CrowdPose. Experimental results show that AHGNet surpasses HrHRNet in average precision (AP), achieving 72.1% AP on COCO test-dev and 74.1% AP on CrowdPose, improvements of +1.6% AP and +6.5% AP, respectively. The substantial improvement on CrowdPose, which contains crowded scenes, indicates that AHGNet helps alleviate the problem of human scale variation in complex crowded scenes. Ablation experiments further verify the effectiveness of the proposed method.

Conclusion

AHGNet leverages geometric features between adjacent keypoints and inherent scale information within the image to generate adaptive heatmaps as ground truth. The network further employs a local probability consistency loss function to address the challenges posed by varying human scales, effectively improving the accuracy of bottom-up human pose estimation. AHGNet provides a new paradigm for optimizing supervision signals in bottom-up pose estimation. By dynamically adjusting the Gaussian kernel scale and enforcing local probability constraints, it reduces multiscale ambiguity in complex scenarios.

human pose estimation  /  adaptive scale  /  bottom-up  /  heatmap regression  /  dynamic weight
Ling Jiang, Zhuocheng Liu, Yuan Xiong, Wei Wu, Kaige Li. Adaptive ground-truth heatmap generation for bottom-up human pose estimation[J]. Journal of Image and Graphics, 2025, 30(12): 3870-3883. DOI: 10.11834/jig.240615
History
  • Received: 2024-10-21
  • Revised: 2025-05-22
  • Online: 2026-04-09
  • Published: 2025-12-16
https://castjournals.cast.org.cn/joweb/zgtxtxxb/EN/10.11834/jig.240615