RGB-D salient object detection (SOD) combines complementary information from RGB and depth images, offering substantially enhanced performance in complex and challenging scenes compared to RGB-only models. This technique has gained considerable attention in the academic community due to its capability to effectively capture salient objects by leveraging visual and spatial information. However, existing RGB-D detection models face several key challenges. First, efficiently utilizing and fusing multi-modal information from RGB and depth inputs remains a difficult task due to the inherent differences between the two modalities. RGB images provide rich color and texture details but lack depth information, whereas depth maps offer spatial cues but are often noisy or of low quality. Second, achieving accurate boundary detection is particularly challenging in cluttered or noisy environments. Noisy depth maps and cluttered backgrounds can obscure object contours, making it difficult to predict sharp and precise boundaries. These challenges highlight the urgent need for a robust model that can effectively integrate RGB and depth information while simultaneously addressing noise and enhancing boundary precision.
To address these challenges, this paper introduces a novel method, the cross-modal feature fusion and detail-enhanced RGB-D salient object detection network (CFADNet). The network incorporates two modules: the cross-modal attention fusion enhancement module (CAFEM) and the boundary feature extraction module (BFEM). The CAFEM strengthens the integration of RGB and depth features through attention mechanisms that emphasize the most informative aspects of each modality. Specifically, channel attention is applied to the RGB features to suppress noise and enhance critical color and texture details, and spatial attention is applied to the depth features to emphasize the spatial regions relevant to salient object detection. This attention-based fusion retains global semantic information from the depth map while preserving fine-grained details from the RGB image. Fusion is performed in multiple layers, progressively integrating features at different scales to exploit the complementary strengths of the two modalities. The BFEM, by comparison, targets the accuracy of salient object boundaries. Because accurate contour detection is crucial for high-quality saliency maps, the BFEM draws on low-level CNN features, which are rich in edge and texture information. These features are refined through channel attention, which filters out noise and irrelevant details and sharpens boundary-related cues. The refined features then guide cross-modal feature decoding so that the final saliency maps exhibit sharp and accurate boundaries. By combining the edge-extraction capability of low-level CNN features with the semantic richness of cross-modal features, the BFEM notably improves boundary precision in RGB-D salient object detection.
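The fusion idea described for the CAFEM (channel attention on RGB features, spatial attention on depth features, then a merge) can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `channel_attention`, `spatial_attention`, and `fuse_rgb_depth` are hypothetical, the gates are parameter-free (a sigmoid of a mean, where a trained network would use learned layers), and the merge is a plain element-wise sum at a single scale.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feat):
    """Scale each channel by a gate computed from its global average.

    feat is a list of C channels, each an H x W nested list.
    Assumption: the gate is simply sigmoid(channel mean); a trained
    module would pass the pooled values through learned layers.
    """
    gated = []
    for channel in feat:
        values = [v for row in channel for v in row]
        gate = sigmoid(sum(values) / len(values))
        gated.append([[v * gate for v in row] for row in channel])
    return gated

def spatial_attention(feat):
    """Scale each spatial position by a gate computed across channels.

    Assumption: the gate is sigmoid(mean over channels at that pixel).
    """
    num_channels = len(feat)
    height, width = len(feat[0]), len(feat[0][0])
    mask = [
        [
            sigmoid(sum(feat[c][i][j] for c in range(num_channels)) / num_channels)
            for j in range(width)
        ]
        for i in range(height)
    ]
    return [
        [[feat[c][i][j] * mask[i][j] for j in range(width)] for i in range(height)]
        for c in range(num_channels)
    ]

def fuse_rgb_depth(rgb_feat, depth_feat):
    """Element-wise sum of channel-gated RGB and spatially gated depth features."""
    rgb_gated = channel_attention(rgb_feat)
    depth_gated = spatial_attention(depth_feat)
    return [
        [[r + d for r, d in zip(r_row, d_row)] for r_row, d_row in zip(r_ch, d_ch)]
        for r_ch, d_ch in zip(rgb_gated, depth_gated)
    ]

# Toy feature maps with 2 channels of size 2 x 2 (made-up values).
rgb = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
depth = [[[2.0, 0.0], [0.0, 0.0]], [[2.0, 0.0], [0.0, 0.0]]]
fused = fuse_rgb_depth(rgb, depth)
```

In this toy input the second RGB channel receives a larger gate than the first, and the depth response is kept only at the single position where both depth channels are active. A multi-scale variant would presumably repeat the same step on the features of each encoder layer.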
To evaluate the performance of CFADNet, extensive experiments were conducted on four widely used RGB-D salient object detection datasets: NJU2K, NLPR, STERE, and SIP. These datasets cover a wide range of diverse and challenging scenes, making them well suited to evaluating the generalization capability of the proposed model. CFADNet was compared against 16 state-of-the-art RGB-D salient object detection methods, including DCF, CIRNet, and CAVER, using standard quantitative metrics: mean absolute error (MAE), F-measure (Fβ), and structure measure (Sα). CFADNet demonstrated superior performance across all datasets, particularly in the MAE metric, where it improved on the second-best method by 6.9%, 10.5%, 9.7%, and 2.4% on the NJU2K, NLPR, STERE, and SIP datasets, respectively. These substantial improvements highlight the effectiveness of the attention-based fusion strategy and the edge refinement mechanism. Furthermore, CFADNet consistently achieved higher F-measure and Sα scores, indicating that the model not only reduces pixel-level errors but also preserves the overall structure and shape of salient objects more accurately than competing methods. In addition to the quantitative evaluation, qualitative comparisons were conducted to visually assess the performance of CFADNet in various challenging scenarios. The results show that the proposed method generates saliency maps with sharp and accurate boundaries, even when salient objects have complex edges or are embedded in cluttered and noisy backgrounds. This finding demonstrates the robustness of CFADNet in difficult scenes: it separates salient objects from their background while preserving fine boundary details. The visual results further confirm that CFADNet captures both global semantic information and local detail, ensuring accurate identification and clear isolation of salient objects.
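For readers unfamiliar with the metrics, the following is a minimal sketch of how MAE and a fixed-threshold F-measure are commonly computed for a single saliency map, plus the arithmetic behind a relative MAE improvement. Several points are assumptions rather than details from the paper: the function names are hypothetical, the weight β² = 0.3 follows the usual convention in the salient object detection literature, the binarization threshold of 0.5 is arbitrary (papers often report adaptive-threshold or maximum F-measure instead), and the reported percentages are read here as relative MAE reductions. The structure measure is omitted because its definition is considerably longer.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth.

    Both inputs are H x W nested lists with values in [0, 1].
    """
    flat_pred = [v for row in pred for v in row]
    flat_gt = [v for row in gt for v in row]
    return sum(abs(p - g) for p, g in zip(flat_pred, flat_gt)) / len(flat_pred)

def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure of a binarized saliency map.

    Assumptions: a fixed threshold of 0.5 and beta^2 = 0.3.
    """
    flat_pred = [v >= threshold for row in pred for v in row]
    flat_gt = [v >= 0.5 for row in gt for v in row]
    true_pos = sum(1 for p, g in zip(flat_pred, flat_gt) if p and g)
    pred_pos = sum(flat_pred)
    gt_pos = sum(flat_gt)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gt_pos if gt_pos else 0.0
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0

def relative_reduction(ours, runner_up):
    """Percentage by which `ours` is lower than `runner_up` (lower MAE is better)."""
    return (runner_up - ours) / runner_up * 100.0

# Made-up 2 x 2 example, not data from the paper.
pred = [[0.9, 0.2], [0.6, 0.1]]
gt = [[1.0, 0.0], [0.0, 0.0]]
example_mae = mae(pred, gt)
example_f = f_measure(pred, gt)
# Hypothetical MAE values chosen only to show the arithmetic.
example_gain = relative_reduction(ours=0.036, runner_up=0.040)
```

With these made-up inputs, the prediction has one true positive and one false positive after thresholding, so precision is 0.5 and recall is 1.0. A hypothetical MAE of 0.036 against a runner-up MAE of 0.040 corresponds to a relative reduction of about 10%.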
This paper presents CFADNet, a cross-modal feature fusion and detail-enhancement network for RGB-D SOD, designed to address two major challenges: effective multi-modal feature fusion and accurate boundary detection. CFADNet introduces two novel modules, the CAFEM and the BFEM, which together integrate RGB and depth information effectively while notably enhancing the precision of salient object boundaries. The attention mechanisms used in the CAFEM enable the network to fully leverage the complementary information from the RGB and depth modalities, while the BFEM refines edge details, resulting in sharper and more accurate saliency predictions. Extensive experiments on four benchmark datasets demonstrate that CFADNet consistently outperforms existing state-of-the-art methods across key evaluation metrics, including MAE, F-measure, and structure measure. These findings highlight the robustness and strong generalization capability of CFADNet in diverse and challenging environments. By combining attention-based feature fusion with effective edge refinement, CFADNet offers a powerful and reliable solution for RGB-D salient object detection in complex scenarios. Future research could extend this approach to other multi-modal tasks, such as RGB-Thermal or multi-spectral image processing, where challenges related to multi-modal fusion and boundary detection are also prevalent. Optimizing the computational efficiency of CFADNet for real-time deployment is another potential research direction, which would enable its use in time-sensitive applications such as autonomous driving and robotics.
| Family | Number of genera | Number of species | Percentage of total species (%) | Genus | Number of species | Percentage of total species (%) |
|---|---|---|---|---|---|---|
| Amanitaceae | 2 | 11 | 5.26 | Amanita | 10 | 4.78 |
| Mycenaceae | 2 | 12 | 5.74 | Inocybe | 5 | 2.39 |
| Polyporaceae | 8 | 14 | 6.70 | Laccaria | 5 | 2.39 |
| Russulaceae | 3 | 23 | 11.00 | Marasmius | 6 | 2.87 |
| | | | | Mycena | 11 | 5.26 |
| | | | | Pluteus | 5 | 2.39 |
| | | | | Russula | 17 | 8.13 |
| | | | | Trametes | 5 | 2.39 |