Cross-modal feature fusion and detail-enhanced RGB-D salient object detection


Xiaogang Song¹,²,*, Yuping Tan¹, Fuqiang Guo¹, Xiaofeng Lu¹,², Xinhong Hei¹,²

Journal of Image and Graphics | 2025, 30(12): 3838-3854


• Image Understanding and Computer Vision •


Affiliations

¹School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China

²Human Machine Integration Intelligent Robot Shaanxi Provincial University Engineering Research Center, Xi’an 710048, China

Published: 2025-12-16 doi: 10.11834/jig.240653


Abstract


Objective

RGB-D salient object detection (SOD) combines complementary information from RGB and depth images and performs substantially better than RGB-only models in complex, challenging scenes. The technique has attracted considerable attention in the academic community because it captures salient objects by leveraging both visual and spatial information. However, existing RGB-D detection models face two key challenges. First, efficiently utilizing and fusing multi-modal information from RGB and depth inputs remains difficult because of the inherent differences between the two modalities: RGB images provide rich color and texture details but lack depth information, whereas depth maps offer spatial cues but are often noisy or of low quality. Second, accurate boundary detection is particularly challenging in cluttered or noisy environments, where noisy depth maps and cluttered backgrounds obscure object contours and make sharp, precise boundaries hard to predict. These challenges call for a robust model that integrates RGB and depth information effectively while suppressing noise and enhancing boundary precision.

Method

To address these challenges, a novel method, the cross-modal feature fusion and detail-enhanced RGB-D salient object detection network (CFADNet), is introduced. The proposed network incorporates two modules: the cross-modal attention fusion enhancement module (CAFEM) and the boundary feature extraction module (BFEM). The CAFEM enhances the integration of RGB and depth features through attention mechanisms that emphasize the most informative aspects of each modality. Specifically, channel attention is applied to the RGB features to suppress noise and enhance critical color and texture details, and spatial attention is applied to the depth features to emphasize the spatial regions relevant to salient object detection. This attention-based fusion retains global semantic information from the depth map while preserving fine-grained details from the RGB image. The fusion is performed at multiple layers, progressively integrating features at different scales to exploit the complementary strengths of the two modalities. The BFEM, in turn, is designed to improve the accuracy of salient object boundaries. Because accurate contour detection is crucial for generating high-quality saliency maps, the BFEM draws on low-level CNN features, which are rich in edge and texture information. These features are refined through channel attention, which filters out noise and irrelevant details and sharpens boundary-related cues. The refined features then guide cross-modal feature decoding so that the final saliency maps exhibit sharp and accurate boundaries. By combining the edge-extraction capability of low-level CNN features with the semantic richness of cross-modal features, the BFEM notably improves boundary precision in RGB-D salient object detection.
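The attention pattern described for the CAFEM (channel attention on RGB features, spatial attention on depth features, then fusion) can be illustrated with a minimal, parameter-free sketch. This is not the authors' implementation: the abstract does not give the layer definitions, so the learned convolutions and MLPs are omitted, the gates are plain sigmoids of pooled activations, and element-wise addition is assumed as the fusion step. All function names are hypothetical, and features are nested lists of shape (channels, height, width).

```python
import math
from typing import List

Feature = List[List[List[float]]]  # (channels, height, width)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat: Feature) -> Feature:
    """Scale each channel by a gate computed from its global average.

    Assumption: a real module would pass the pooled vector through
    learned layers before the sigmoid; here the gate is parameter-free.
    """
    out = []
    for channel in feat:
        values = [v for row in channel for v in row]
        gate = sigmoid(sum(values) / len(values))
        out.append([[v * gate for v in row] for row in channel])
    return out


def spatial_attention(feat: Feature) -> Feature:
    """Scale each pixel by a gate computed from its mean across channels.

    Assumption: a real module would apply a learned convolution to the
    pooled map; here the gate is parameter-free.
    """
    n_ch, height, width = len(feat), len(feat[0]), len(feat[0][0])
    mask = [
        [sigmoid(sum(feat[c][i][j] for c in range(n_ch)) / n_ch)
         for j in range(width)]
        for i in range(height)
    ]
    return [
        [[feat[c][i][j] * mask[i][j] for j in range(width)]
         for i in range(height)]
        for c in range(n_ch)
    ]


def fuse_rgb_depth(rgb_feat: Feature, depth_feat: Feature) -> Feature:
    """Fuse channel-attended RGB with spatially attended depth features.

    Assumption: element-wise addition stands in for the fusion operator.
    """
    rgb_att = channel_attention(rgb_feat)
    depth_att = spatial_attention(depth_feat)
    return [
        [[r + d for r, d in zip(r_row, d_row)]
         for r_row, d_row in zip(r_ch, d_ch)]
        for r_ch, d_ch in zip(rgb_att, depth_att)
    ]
```

In a multi-layer design such as the one described, a function like `fuse_rgb_depth` would be applied once per encoder stage, with the per-stage outputs passed on to the decoder.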

Result

To evaluate CFADNet, extensive experiments are conducted on four widely used RGB-D salient object detection datasets: NJU2K, NLPR, STERE, and SIP. These datasets cover a wide range of diverse and challenging scenes, making them well suited to evaluating the generalization capability of the proposed model. CFADNet is compared against 16 state-of-the-art RGB-D salient object detection methods, including DCF, CIRNet, and CAVER, using standard quantitative metrics such as mean absolute error (MAE), F-measure (F_β), and structural similarity (S_α). CFADNet achieves superior performance on all four datasets and excels particularly on the MAE metric, where it outperforms the second-best method by 6.9%, 10.5%, 9.7%, and 2.4% on NJU2K, NLPR, STERE, and SIP, respectively. These improvements highlight the effectiveness of the attention-based fusion strategy and the edge refinement mechanism. CFADNet also consistently achieves higher F_β and S_α scores, indicating that the model not only reduces pixel-level errors but also preserves the overall structure and shape of salient objects more accurately than competing methods. In addition to the quantitative evaluation, qualitative comparisons visually assess CFADNet in various challenging scenarios. The proposed method generates saliency maps with sharp and accurate boundaries, even when salient objects have complex edges or are embedded in cluttered and noisy backgrounds. This demonstrates the robustness of CFADNet in difficult scenes: it separates salient objects from their background while preserving fine boundary details. The visual results further confirm that CFADNet captures both global semantic information and local detail, ensuring accurate identification and clear isolation of salient objects.
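For readers unfamiliar with the metrics, the following sketch shows how MAE and a thresholded F-measure are commonly computed for a single saliency map. It is illustrative only and is not the authors' evaluation code: the fixed threshold of 0.5 and the weight β² = 0.3 are assumptions based on common practice in the SOD literature, the function names are hypothetical, and `relative_reduction` reflects only one plausible reading of the reported percentages (the abstract does not state the formula).

```python
from typing import List

SaliencyMap = List[List[float]]  # values in [0, 1], shape (height, width)


def mean_absolute_error(pred: SaliencyMap, gt: SaliencyMap) -> float:
    """Average per-pixel absolute difference between prediction and ground truth."""
    flat_pred = [v for row in pred for v in row]
    flat_gt = [v for row in gt for v in row]
    return sum(abs(p - g) for p, g in zip(flat_pred, flat_gt)) / len(flat_pred)


def f_measure(pred: SaliencyMap, gt: SaliencyMap,
              threshold: float = 0.5, beta_sq: float = 0.3) -> float:
    """Weighted harmonic mean of precision and recall after binarization.

    Assumptions: a single fixed threshold and beta_sq = 0.3; published
    benchmarks often use adaptive thresholds or sweep over thresholds.
    """
    flat_pred = [v >= threshold for row in pred for v in row]
    flat_gt = [v >= 0.5 for row in gt for v in row]
    true_pos = sum(p and g for p, g in zip(flat_pred, flat_gt))
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(flat_pred)
    recall = true_pos / sum(flat_gt)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def relative_reduction(ours: float, baseline: float) -> float:
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline
```

Lower MAE is better, whereas higher F-measure is better, so a relative gain on MAE corresponds to a reduction in the metric's value.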

Conclusion

This paper presents CFADNet, a cross-modal feature fusion and detail-enhancement network for RGB-D SOD, designed to address two major challenges: effective multimodal feature fusion and accurate boundary detection. CFADNet introduces two novel modules, the CAFEM and the BFEM, and with them integrates RGB and depth information effectively while notably enhancing the precision of salient object boundaries. The attention mechanisms in the CAFEM enable the network to fully leverage the complementary information from the RGB and depth modalities, while the BFEM refines edge details, resulting in sharper and more accurate saliency predictions. Extensive experiments on four benchmark datasets demonstrate that CFADNet consistently outperforms existing state-of-the-art methods across key evaluation metrics, including MAE, F-measure, and structural similarity. These findings highlight the robustness and strong generalization capability of CFADNet in diverse and challenging environments. By combining attention-based feature fusion with effective edge refinement, CFADNet emerges as a powerful and reliable solution for RGB-D salient object detection in complex scenarios. Future research could extend this approach to other multi-modal tasks, such as RGB-Thermal or multi-spectral image processing, where multi-modal fusion and boundary detection pose similar challenges. Optimizing the computational efficiency of CFADNet for real-time deployment is another potential research direction, which would enable its use in time-sensitive applications such as autonomous driving and robotics.

Key words

salient object detection (SOD) / attention mechanism / cross-modal / feature fusion / edge detail enhancement

Cite this Article

Xiaogang Song, Yuping Tan, Fuqiang Guo, Xiaofeng Lu, Xinhong Hei. Cross-modal feature fusion and detail-enhanced RGB-D salient object detection[J]. Journal of Image and Graphics, 2025, 30(12): 3838-3854. DOI: 10.11834/jig.240653


Article Info

doi: 10.11834/jig.240653

Received: 2024-11-07
Revised: 2025-04-21
Online: 2026-04-09
Published: 2025-12-16


https://castjournals.cast.org.cn/joweb/zgtxtxxb/EN/10.11834/jig.240653


