In recent years, single-object tracking in satellite videos has gained substantial attention and plays a pivotal role in military and civilian domains. It has found applications in urban-scale disaster relief, public security surveillance, and the monitoring of emergency events, among others. However, factors such as small target size, interference from similar targets, motion blur, and complex backgrounds make single-object tracking in satellite videos highly challenging. To promote further exploration of this domain by researchers at home and abroad, this paper comprehensively reviews and critically analyzes the current state of the art in satellite video-based single-object tracking.

In terms of challenges and advantages, video satellites offer an expansive field of view, but targets such as vehicles typically occupy only a few to a dozen pixels in satellite videos and have few distinguishing features or textures. Additionally, satellite videos contain many targets, and targets of interest are difficult to distinguish from highly similar interfering objects. Moreover, targets may appear blurred because of rapid target movement or satellite platform jitter. When the moving target is inconspicuous and background information overshadows target features, tracking failure is likely to occur. However, compared to ground-based or low-altitude videos, satellite video-based object tracking offers certain advantages. For example, factors external to the target, such as the camera perspective, are relatively stable, which helps tracking algorithms maintain a consistent lock on the target. Most objects in satellite videos are rigid and rarely undergo substantial deformation during tracking. Additionally, the aspect ratios of targets remain approximately constant across video frames, reducing the potential for algorithmic confusion.
The motion of targets is typically simple, with trajectories generally following straight lines or smooth curves, so target positions can be predicted from historical motion data.

Regarding the development of tracking methods, this paper reviews the evolution of single-object tracking methods for satellite videos and highlights typical tracking paradigms, including generative approaches, correlation filter-based methods, and deep learning-based techniques. Deep learning-based tracking methods can be further classified into convolutional neural network (CNN)-based and Transformer-based methods. In contrast to the hand-crafted features employed in correlation filter-based methods, CNNs can extract more comprehensive and robust features, thereby enhancing target tracking performance. In recent years, an increasing number of scholars have applied CNNs to satellite video object tracking tasks. However, CNNs exhibit certain limitations when processing the high-resolution images, long time-series data, and complex backgrounds that are common in satellite videos. To address these limitations, Transformers have been gradually introduced into satellite video object tracking. Transformers can capture global spatial information and long-term temporal dependencies, offering a promising alternative for improving tracking accuracy in complex scenarios.

Regarding datasets and evaluation metrics, this study compiles existing single-object tracking datasets for satellite videos, along with commonly adopted performance evaluation metrics. Prominent datasets in this field include XDU-BDSTU, video satellite objects (VISO), SatSOT, and the oriented object tracking benchmark (OOTB). Among them, the VISO dataset is the largest in scale, comprising training and test subsets. The XDU-BDSTU dataset features images with a large swath width, making it suitable for long-term tracking tasks.
The OOTB dataset provides annotations using rotated bounding boxes, which accurately represent the actual target geometry. The main performance evaluation metrics include precision, success rate, and frame rate, which collectively assess tracking methods in terms of tracking accuracy and speed.

To evaluate the applicability of various tracking algorithms across different scenarios, this paper selects 18 algorithms for performance evaluation and analysis on a self-constructed test set. Experimental results highlight the critical roles of motion estimation, temporal information utilization, and background information exploitation in satellite video object tracking. Specifically, the correlation filter with motion estimation (CFME) algorithm leverages historical motion information of the target to enhance tracking performance, while the TrDiMP algorithm incorporates temporal and background information, yielding favorable outcomes. When a vehicle makes a turn, the hand-crafted features employed by the correlation filter-based method CFME lack rotational invariance and cannot adapt to the rotation-induced changes in the target's bounding box, resulting in suboptimal tracking performance. Conversely, methods such as TrDiMP and TrSiam directly estimate the target's bounding box, while approaches such as the Siamese region proposal network (SiamRPN) and SiamRPN++ predefine anchor boxes with different aspect ratios, effectively addressing the challenge of in-plane rotation.

Finally, in terms of future perspectives, this paper outlines the anticipated trajectory of single-object tracking algorithms for satellite videos across several key dimensions: standardizing evaluation metrics for tracking results, developing large-scale and high-quality satellite video object tracking datasets, devising models specifically tailored to satellite video tracking challenges, and enabling robust long-term tracking capabilities.
In the domain of general video target tracking, commonly used evaluation metrics include those from the OTB and VOT benchmarks. For satellite video target tracking, scholars predominantly adopt the precision and success rate metrics defined by the OTB evaluation framework. In the OTB metrics for general videos, the precision threshold is customarily set to 20 pixels, and the success rate is evaluated based on the area under the curve (AUC) of the success plot, which is computed from the overlap score. However, in satellite video target tracking, researchers often adopt varying threshold settings, which hinders the objective evaluation of algorithms under a unified standard. Thus, standardizing evaluation metrics for tracking results is essential for the advancement of satellite video single-object tracking.

Before the emergence of large-scale test datasets, most studies in satellite video object tracking verified algorithms using only a few targets, which prevented a comprehensive assessment of algorithm performance. Moreover, the use of different test datasets across studies has further hindered direct comparisons between algorithms. Consequently, the development of large-scale, high-quality satellite video object tracking datasets is urgently needed, not only for effective model training but also for model testing and performance benchmarking.

Future research could benefit from rapidly assimilating the latest advancements in the general video object tracking domain and adapting them to the unique characteristics of satellite videos. Given the rich background information and the continuous, linear nature of target motion trajectories between adjacent frames in satellite videos, these priors can be fully leveraged to explore global spatial and temporal information, thereby enhancing tracking accuracy.
Furthermore, techniques such as knowledge distillation, network pruning, and neural architecture search hold considerable potential for automatically constructing streamlined, low-complexity models specifically tailored to satellite video single-object tracking. These approaches can enable high-precision, real-time target tracking under constrained computational resources.

In contrast to ground-based surveillance videos, satellite videos offer broad coverage, making it possible to track trajectories across entire urban areas. However, in such large-scale scenarios, multiple challenges, such as occlusion, interference from similar objects, motion blur, illumination variation, and target rotation, often occur simultaneously. To meet the demands of real-world applications, it is imperative to develop satellite video tracking algorithms capable of handling these challenges simultaneously.
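
As a rough illustration of the OTB-style precision and success rate metrics discussed above, the following Python sketch computes both from per-frame boxes. The function names, the axis-aligned `(x, y, w, h)` box layout, and the 21 evenly spaced overlap thresholds are assumptions made for this example rather than details taken from the reviewed benchmarks. The 20-pixel default follows the OTB convention mentioned in the text.

```python
import numpy as np


def center_error(pred, gt):
    """Per-frame Euclidean distance between box centers; boxes are (x, y, w, h)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred_center = pred[:, :2] + pred[:, 2:] / 2.0
    gt_center = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_center - gt_center, axis=1)


def iou(pred, gt):
    """Per-frame overlap score (intersection over union) of axis-aligned boxes."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    # Guard against degenerate boxes with zero area.
    return np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)


def precision(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    return float(np.mean(center_error(pred, gt) <= threshold))


def success_auc(pred, gt, num_thresholds=21):
    """Area under the success plot, approximated as the mean success rate
    over evenly spaced overlap thresholds in [0, 1] (an assumed convention)."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    curve = [np.mean(overlaps > t) for t in thresholds]
    return float(np.mean(curve))
```

Because satellite targets often span only a few pixels, the same tracker output can score very differently when `threshold` is lowered (for instance to a hypothetical 5 pixels), which is one way to see why results reported under different threshold settings are not directly comparable.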
| Family | Number of genera | Number of species | Percentage of total species (%) | Genus | Number of species | Percentage of total species (%) |
|---|---|---|---|---|---|---|
| Amanitaceae | 2 | 11 | 5.26 | Amanita | 10 | 4.78 |
| Mycenaceae | 2 | 12 | 5.74 | Inocybe | 5 | 2.39 |
| Polyporaceae | 8 | 14 | 6.70 | Laccaria | 5 | 2.39 |
| Russulaceae | 3 | 23 | 11.00 | Marasmius | 6 | 2.87 |
| | | | | Mycena | 11 | 5.26 |
| | | | | Pluteus | 5 | 2.39 |
| | | | | Russula | 17 | 8.13 |
| | | | | Trametes | 5 | 2.39 |