Deep Learning-Enabled Integration of Histology and Transcriptomics for Tissue Spatial Profile Analysis


Yongxin Ge¹,†,*, Jiake Leng¹,†, Ziyang Tang²,†, Kanran Wang³,⁴,†, Kaicheng U⁵,⁶, Sophia Meixuan Zhang⁷,⁸, Sen Han⁹, Yiyan Zhang¹⁰,¹¹,¹², Jinxi Xiang¹³, Sen Yang¹³, Xiang Liu¹⁴, Yi Song¹⁵,*, Xiyue Wang¹³,*, Yuchen Li¹³,*, Junhan Zhao¹²,¹⁶,*

Research. Vol 8 Article ID 0568


• Research Article •


Affiliations

¹ School of Big Data and Software Engineering, Chongqing University, Chongqing, China.

² Department of Computer and Information Technology, Purdue University, West Lafayette, IN, USA.

³ Radiation Oncology Center, Chongqing University Cancer Hospital, Chongqing, China.

⁴ Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital, Chongqing, China.

⁵ Tri-Institutional Computational Biology & Medicine, Weill Cornell Medicine, New York, NY, USA.

⁶ Department of Computational Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, USA.

⁷ College of Agriculture and Life Sciences, Cornell University, Ithaca, NY, USA.

⁸ Harvard College, Harvard University, Cambridge, MA, USA.

⁹ Division of Genetics, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.

¹⁰ Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.

¹¹ Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.

¹² Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.

¹³ Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, USA.

¹⁴ Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN, USA.

¹⁵ Department of Neurosurgery, Chongqing University Three Gorges Hospital, Chongqing, China.

¹⁶ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Published: 2025-01-17 doi: 10.34133/research.0568


Abstract


Spatially resolved transcriptomics enables comprehensive measurement of gene expression at subcellular resolution while preserving the spatial context of the tissue microenvironment. While deep learning has shown promise in analyzing single-cell spatial transcriptomics (SCST) datasets, most efforts have focused on sequence data and spatial localization, with limited emphasis on leveraging rich histopathological insights from staining images. We introduce GIST, a deep learning-enabled gene expression and histology integration for spatial cellular profiling. GIST employs histopathology foundation models pretrained on millions of histology images to enhance feature extraction and a hybrid graph transformer model to integrate them with transcriptome features. Validated with datasets from human lung, breast, and colorectal cancers, GIST effectively reveals spatial domains and substantially improves the accuracy of segmenting the microenvironment after denoising transcriptomics data. This enhancement enables more accurate gene expression analysis and aids in identifying prognostic marker genes, outperforming state-of-the-art deep learning methods with a total improvement of up to 49.72%. GIST provides a generalizable framework for integrating histology with spatial transcriptome analysis, revealing novel insights into spatial organization and functional dynamics.

Cite this Article

Yongxin Ge, Jiake Leng, Ziyang Tang, Kanran Wang, Kaicheng U, Sophia Meixuan Zhang, Sen Han, Yiyan Zhang, Jinxi Xiang, Sen Yang, Xiang Liu, Yi Song, Xiyue Wang, Yuchen Li, Junhan Zhao. Deep Learning-Enabled Integration of Histology and Transcriptomics for Tissue Spatial Profile Analysis[J]. Research, 2025, 8(1): 0568. DOI: 10.34133/research.0568


Introduction


Advances in spatial molecular imaging have enabled the examination of spatial transcriptional profiles within complex tissues at subcellular resolution [1–3]. Exploring the spatial coordinates and transcriptional profiles of individual cells within the tissue microenvironment deepens our understanding of the spatial diversity in cellular interactions. Commercially available technologies for single-cell spatial profiling, such as the NanoString CosMx Spatial Molecular Imager (SMI) [4] and Vizgen MERSCOPE/MERFISH platforms [4,5], have demonstrated promising results in providing transcriptional profiles, cell locations and boundaries, and multichannel imaging modalities. For example, the NanoString CosMx platform can simultaneously interrogate up to 1,000 genes and analyze 100,000 to 600,000 cells per slide, surpassing the prevailing single-cell omics methodologies. These emerging single-cell spatial transcriptomics (SCST) platforms, along with accurate and timely histopathological assessments [6], are catalyzing a paradigm shift in biomedical research, advancing our understanding of complex tissue architecture, both spatially and functionally, and of disease mechanisms with unprecedented resolution [7,8].

Refining spatial gene expression data remains a substantial challenge. Spatial transcriptomics (ST) profiles are affected by issues such as missing values [9], data sparsity [10], low coverage [2], and noise [11], which complicate effective biological exploration and, in particular, the creation of precise training datasets for artificial neural networks [12]. Meanwhile, multiplex immunofluorescence images within single-cell spatial data capture high-resolution, detailed features observed in tissue samples, including cell types, cellular compartment morphologies, and spatial cell distributions. The integration of such imaging attributes with transcriptomic data holds promise for mitigating challenges arising from missing values and data noise. Because spatial relationships between individual cells and their neighbors can naturally be represented through a spatial adjacency graph, graph-based artificial intelligence is an intuitive approach for spatial data modeling. Notably, graph-based models augmented with attention mechanisms, such as the graph attention network (GAT) and graph transformer models [13,14], have shown promising advancements and enhanced investigation outcomes.
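To make the idea of a spatial adjacency graph concrete, the following minimal sketch links each cell to its nearest neighbors based on centroid coordinates. It is an illustration only: the function name `knn_spatial_graph`, the choice of k-nearest neighbors as the linking rule, and the value of `k` are assumptions, not the graph construction used by any particular method discussed here.

```python
import math

def knn_spatial_graph(coords, k=3):
    """Return an undirected adjacency dict linking each cell to its k nearest cells.

    coords: list of (x, y) cell centroids; k: neighbors per cell (assumed value).
    """
    n = len(coords)
    adjacency = {i: set() for i in range(n)}
    for i, (xi, yi) in enumerate(coords):
        # Distance from cell i to every other cell (brute force, fine for a sketch).
        dists = [
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        ]
        for _, j in sorted(dists)[:k]:
            # Add the edge in both directions so the graph is undirected.
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency

# Toy example: two tight groups of cells located far apart from each other.
cells = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
graph = knn_spatial_graph(cells, k=2)
```

In this toy example, cells in the same group become neighbors and no edge crosses between the two groups. A graph neural network would then pass messages along these edges.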

Accurate identification of spatial domains of tissues is critical but challenging for understanding distinct anatomical and functional regions. Current methods using SCST data focus on revealing spatial clusters with integrated tools such as Seurat [15] and Scanpy [16]. These clustering techniques were originally proposed for processing nonspatial single-cell RNA-sequencing data and therefore use only gene expression data as input. Researchers have attempted to integrate gene expression data with basic spatial and cellular information to refine the identification of spatial domains. stLearn [17] exploits both gene expression profiles and features extracted from tissue images. BayesSpace [18] employs a Bayesian statistical framework, which analyzes the gene expression matrix and spatial proximity information. Additionally, SpaGCN [19] uses a graph convolutional network to identify spatial domains by constructing a spatial graph of gene expression based on the geospatial information in the images. STAGATE [20] leverages the graph attention network (GAT) [21] to dynamically consider nearby gene expressions. By integrating morphological and spatially resolved transcription data, MUSE [22] uses a multimodal structured embedding approach to find tissue subgroups that are missed by either modality and to compensate for modality-specific noise. PROST [23] has optimized the integration of spatial information and gene expression profiles through 2 key modules, PROST Index (PI) and PROST Neural Network (PNN). CellCharter [24] leverages variational autoencoders (VAE) to enhance the merging of cellular characterization with histopathology. Despite the demonstrated efficacy of these methods, the potential to harness cellular morphological information embedded within spatial imaging profiles, extending beyond analogous cell localization, remains underutilized.

Many advanced deep learning-based methods have been proposed to better extract image features. STACI [25] analyzes spatial transcriptomics gene data and chromatin imaging data by employing overparameterized graph-based autoencoders. To reduce the interference of noise in ST data, TIST [26] extracts complementary cellular phenotypic information in high-resolution histopathology images by comprehensively analyzing transcriptomic data and images. Leng et al. [27] proposed a label-efficient approach that leverages curriculum learning and confidence learning to detect noise for the analysis of ST data. To decipher intercellular communication within spatial transcriptomics graphs, BLEEP [28] constructs paired images and expression profiles simultaneously using contrastive learning at micrometer resolution, thus mapping the original dataset to a low-dimensional joint embedding space. TCGN [29] combines convolutional neural networks (CNNs), a transformer encoder, and graph neural networks (GNNs) to process the pathology images in ST data. SiGra [30] utilizes graph transformers to achieve state-of-the-art performance by aggregating morphology features from surrounding cells. However, these methods were not designed to fully exploit histology image features or to extract the unique morphological characteristics of single-cell spatial data. Instead, they rely on vision models trained on natural images, which primarily treat histological images as general image data, or apply basic image-processing techniques such as segmentation. As a result, they lack the trained histopathological perception and domain-specific intelligence necessary for fully interpreting the images.

Deep learning-enabled digital pathology has uncovered quantitative morphological signals in histology images that are indicative of diagnostics and prognostic prediction [31–35]. PhaseFIT [36] improves image generation by utilizing a segmentation algorithm that precisely executes image translation while integrating channel-wise and spatial-wise attention to concentrate on the most influential feature maps. The use of self-supervised learning (SSL) to train pathology foundation models [35,37] with millions of histology images has advanced significantly in recent years. CTransPath [38], as a pioneering histology foundation model, employed a CNN and vision transformer [39] hybrid trained on 15.6 million tiles from 32,220 whole-slide images spanning 25 anatomic sites and over 32 cancer subtypes. This model has been independently evaluated for different tasks, such as image retrieval, disease classification, mitosis detection, and lesion segmentation. Subsequently, UNI [37], trained directly on 100 million tiles using the DINOv2 [40] framework, was validated on 33 pathology analytical tasks. These methods highlight the potential of SSL to enhance visual features without incurring high dataset labeling costs. Similarly, Virchow2 [41], a vision transformer with 632 million parameters, was trained on 3.1 million histopathological whole-slide images using a domain-inspired training approach. Prov-GigaPath [42] was pretrained on 1.3 billion 256 × 256 pathology image tiles derived from 171,189 whole slides. To capture both local and global patterns across entire slides, Prov-GigaPath tiles each slide and represents it as a long sequence of visual tokens. UNI, Virchow2, and Prov-GigaPath all utilize the DINOv2 framework, each with a distinct pretraining strategy tailored to its dataset.

In this paper, we established a novel deep learning framework for multimodal SCST data analysis, named GIST (Gene expression and histology Integration for SpaTial cellular profiling). GIST leverages self-supervised histology image foundation models to extract detailed morphological features of tissues and cells. By integrating multimodal data through hybrid graph encoding, GIST efficiently combines morphological information with transcriptomic data to precisely identify cell types and analyze spatial expression patterns. We showed that GIST effectively denoises ST data and excels in downstream tasks, including spatial domain identification, amplification of specific marker gene detection, and differential expression gene analysis. We validated the generalizable performance of GIST using human lung, breast, and colorectal cancer datasets collected using different ST platforms. GIST outperformed the state-of-the-art deep learning methods and improved the accuracy of segmenting the microenvironment and denoising transcriptomics data by up to 49.72%. GIST potentially serves as a robust framework for integrating histology and spatial gene expression data, offering a scalable approach for analyzing spatial transcriptomic data and understanding complex diseases.

Results


Overview of the GIST

We developed GIST, a deep learning-enabled gene expression and histology integration method for spatial cellular profiling. GIST leverages histopathology image foundation models for extracting image features and employs hybrid graph transformers to fuse features from both transcriptomics and tissue images (Fig. 1). To demonstrate the generalizability of GIST, we applied it to diverse tissue sections, including human lung, breast, and colorectal cancer tissues, achieving notable results in spatial domain identification and differential gene expression analysis.

Our GIST framework consists of 3 main components: (a) feature extraction from both transcriptomic data (gene expression) and histology images, (b) a hybrid graph transformer model for fusing the multimodal features, and (c) downstream approaches for spatial domain identification through denoising gene expression data (Fig. 1A). In the data preprocessing stage, we obtained gene expression profiles and cell spatial location information from spatial transcriptomic data and selected histology images containing cell morphology as multimodal input to GIST. In the feature extraction and hybrid graph transformer stage, we first identified cell positions in the larger histological images using the spatial location information of cells and then extracted image features from the corresponding smaller image patches using foundation models. This approach enhances the discriminative power of the learned representations (Fig. 1B and C). The extracted image features and the transcriptome features processed from the gene expression data were then input into the hybrid graph transformer model to obtain the final enhanced representations. In the downstream analysis stage, we analyzed the transcriptomic alterations within the enhanced datasets generated by GIST, facilitating various downstream tasks.
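The patch extraction step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the GIST implementation: the helper `crop_patch`, the nested-list image, the patch size, and the policy of shifting patches that would cross the image border are all hypothetical choices made for the example.

```python
def crop_patch(image, cx, cy, size):
    """Crop a size x size patch centered on (cx, cy), shifted to stay inside the image.

    image: 2D list of pixel rows (a stand-in for a histology image array).
    """
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("patch size exceeds image dimensions")
    half = size // 2
    # Clamp the top-left corner so the patch never leaves the image (assumed policy).
    left = min(max(cx - half, 0), width - size)
    top = min(max(cy - half, 0), height - size)
    return [row[left:left + size] for row in image[top:top + size]]

# Toy 6 x 6 "image" whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(6)] for r in range(6)]
cell_centroids = [(3, 3), (0, 0)]  # hypothetical (x, y) cell positions
patches = [crop_patch(image, x, y, size=2) for x, y in cell_centroids]
```

In a full pipeline, each patch would then be passed to a pretrained histology foundation model to obtain a feature vector for the corresponding cell.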

Data sources

We validated GIST using spatial transcriptomic data from 3 anatomic sites: (a) formalin-fixed, paraffin-embedded (FFPE) non-small cell lung cancer (NSCLC) tissue samples obtained by NanoString CosMx SMI [43]; (b) FFPE human breast tissue and fresh-frozen invasive ductal carcinoma breast tissue from BioIVT Asterand, obtained by 10x Genomics; and (c) FFPE human colorectal cancer tissue from Discovery Life Sciences, obtained by 10x Genomics. The NanoString FFPE NSCLC dataset encompasses 8 NSCLC tissue samples. Each NSCLC sample is associated with 20 to 45 high-resolution images. Samples labeled lungs 5-1, 5-2, and 5-3 originate from a single patient, and lungs 9-1 and 9-2 are also from a single patient. The remaining samples are derived from individual patients, each contributing to the dataset's diversity. The FFPE tissues contain diverse cellular populations, with 18 distinct cell types identified. These are further grouped into 8 primary cell types: endothelial, epithelial, fibroblast, lymphocyte, mast, myeloid, neutrophil, and tumor cells. Regarding the human breast cancer datasets, we utilized 2 spatial gene expression datasets from human breast cancer specimens, each processed with a different version of Space Ranger: Version 1.0 and Version 1.3. The dataset processed with Space Ranger Version 1.0 consists of samples from fresh-frozen invasive ductal carcinoma breast tissue. The dataset processed with Space Ranger Version 1.3 is derived from FFPE human breast tissue specimens. For human colorectal cancer, the spatial gene expression dataset was prepared using Space Ranger Version 2.0.1. The examination of hematoxylin and eosin (H&E) images revealed colorectal cancers exhibiting a connective tissue proliferative response.

GIST deciphers cell types in the single-cell spatial landscapes of lung cancer

To quantitatively evaluate cell type identification in lung cancer, we applied GIST to the CosMx SMI dataset of 8 FFPE NSCLC specimens. We benchmarked the clustering accuracy of GIST against 8 state-of-the-art spatial clustering methods using the adjusted Rand index (ARI) as the evaluation metric. Compared to other methods, GIST exhibited significantly better ARI values (Fig. S1A and Table S2). GIST with UNI reached an average ARI of 0.61, which surpasses the performance of the state-of-the-art models: CellCharter (ARI = 0.50), followed by SiGra (ARI = 0.47), stLearn (ARI = 0.42), Seurat (ARI = 0.33), Scanpy (ARI = 0.31), BayesSpace (ARI = 0.27), SpaGCN (ARI = 0.25), and STAGATE (ARI = 0.22). The ARI distribution across all FFPE NSCLC samples markedly improved with GIST. Notably, GIST's ability to address an outlier underscores its efficacy in resolving aberrant data points.
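For readers unfamiliar with the metric, the ARI compares a predicted clustering with reference annotations by counting pairs of cells that are grouped consistently, corrected for chance agreement. The sketch below implements the standard pair-counting formula in plain Python. The function name and toy labels are illustrative assumptions and do not reproduce the benchmarking code or data of this study.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pair counts within contingency cells, true classes, and predicted clusters.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. a single cluster in both labelings.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = ["tumor", "tumor", "tumor", "myeloid", "myeloid", "myeloid"]
perfect = [1, 1, 1, 0, 0, 0]      # same partition, different label names
imperfect = [1, 1, 0, 0, 0, 0]    # one tumor cell assigned to the wrong cluster

ari_perfect = adjusted_rand_index(truth, perfect)
ari_imperfect = adjusted_rand_index(truth, imperfect)
```

An ARI of 1 indicates identical partitions regardless of label names, values near 0 indicate chance-level agreement, and a single misassigned cell in this toy example already lowers the score substantially.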

GIST also enhanced the spatial domain detection and clustering results of the CosMx SMI dataset. For example, in the FFPE NSCLC slice lung13, GIST with UNI achieved the highest clustering accuracy in predicting spatial domains among all tested methods in Fig. 2A (ARI = 0.62). The FFPE lung13 sample comprises 77,043 cells and 960 genes, organized into 20 fields of view (FOVs). Results for other samples are available in the Supplementary Materials (Figs. S5 and S6). Spatial clustering results at the FOV level showed that GIST's predictions matched the ground truth (Fig. 2B). In FOV1 and FOV2, characterized by heightened tumor concentrations, GIST accurately discerned these focal areas and effectively delineated adjacent regions intermixed with diverse cellular constituents. Whether encountering tumors juxtaposed with lymphocytes (FOV3) or myeloid cells (FOV4), GIST precisely categorized these compositions, further exemplifying its efficacy in spatial domain analysis. Conversely, CellCharter misidentified myeloid cells as mast cells and fibroblasts. While CellCharter performed adequately in tumor-dominated FOVs (FOV2), its accuracy degraded in FOVs containing multiple cell types (FOV3 and FOV4), indicating a limited capability to distinguish intricate biological scenarios involving the mixing of multiple cell types. SiGra misidentified myeloid cells as neutrophils in almost all FOVs. In contrast, GIST demonstrated proficiency in both scenarios, underscoring its expansive utility in discerning cell types.

GIST enhances lung cancer-based gene expression of NanoString CosMx SMI

GIST also enhances the detection and characterization of clinically relevant gene markers in downstream analysis by refining data quality and resolution. We applied Uniform Manifold Approximation and Projection (UMAP) to reduce the dimensionality of both the original SCST dataset and the GIST-enhanced dataset and to visualize the clusters of cell types based on feature similarities (Fig. 2C). The enhanced dataset from GIST revealed a more discernible separation in the reduced-dimensional space. Notably, the enhanced dataset facilitated tumor segmentation and also improved the differentiation of cell types that were previously merged in the original dataset, such as fibroblast and endothelial cells.

Preclinical and clinical studies have identified ERBB2, also commonly known as HER2, as a targetable driver mutation in NSCLC [44]. We visualized the expression of the tumor-specific gene ERBB2 in Fig. 2D and found that its expression was significantly enhanced in the GIST dataset (t test, P = 5.3 × 10⁻⁷). GIST enabled more accurate detection of ERBB2 in tumor regions, aiding in the assessment of the functional consequences of the mutation. Comparing cell type-specific gene expression between the original dataset and the GIST-enhanced dataset, GIST amplified the visibility of certain genes of interest while maintaining the general expression trends of the original dataset (Fig. 2E). For instance, KRAS, a frequently mutated oncogene in NSCLC, is implicated in predicting clinical outcomes for patients undergoing diverse treatments [45]. Additionally, ERBB2 represents a therapeutic target mutation in NSCLC patients, and ERBB2-directed therapies can be effective in managing disease progression in individuals with metastatic ERBB2-mutated NSCLC [46]. To elucidate the expression profiles of these genes in lung cancer tissues, we visualized the expression patterns of KRAS and ERBB2, revealing high expression in tumor regions. These genes cannot be identified in the raw data, demonstrating GIST's capability to improve gene expression analysis in lung cancer tissues.

GIST effectively identifies additional prognostic marker genes from differentially expressed genes in human breast cancer

GIST was further evaluated on human breast tissues sampled using BioIVT Asterand. The first dataset, FFPE human breast tissue (Space Ranger 1.3.0), contains 2,518 cells and 17,943 genes with 4 annotation classes: desmoplastic changes, lymphocytes, necrosis and hemorrhage, and tumor. The second dataset, human breast cancer (Space Ranger 1.0.0), consists of fresh-frozen invasive ductal breast cancer tissue, comprising 3,813 cells and 33,538 genes annotated in 3 classes: desmoplastic changes, lymphocytes, and tumors.

The spatial regions of human breast cancer (Space Ranger 1.3.0) were depicted in comparison with the ground truth (Fig. 3A). The spatial regions predicted by GIST matched the ground truth more closely than those of baseline models such as CellCharter and PROST, especially within the tumor region. We further conducted a comparative analysis using both raw and enhanced datasets (Fig. 3B). In the raw dataset, cells associated with tumor types are clustered together with cells indicative of necrosis and hemorrhage. In contrast, the data enhanced by GIST separated tumor cells from the other types more clearly. Our enhancement was significantly improved by the feature extraction process, resulting in better separation and identification of cellular subpopulations. In human breast cancer, the overexpression of ERBB2 has been suggested to be strongly associated with poor prognosis [47]. Therefore, we visualized the expression of ERBB2 in both the raw and the enhanced datasets (Fig. 3C). In the raw dataset, the expression pattern for ERBB2 appeared noisy and differed significantly among distinct tissues. After enhancement by GIST, the high and low expression regions of ERBB2 were more obvious, and the separation of different expression levels was significantly clearer than before (t test, P = 0.00084). We further illustrated the results of GIST-enhanced gene expression using violin plots (Fig. 3D). Figure 3E shows the changes in the expression of the specific gene ESR1 [48] before and after enhancement. The original expression of ESR1 was relatively low and difficult to identify in the raw data. However, GIST enhances its identification, which is critical for tailoring hormone therapy strategies and predicting treatment response in breast cancer patients. GIST identified more differentially expressed genes (DEGs) in specific cell populations, revealing expression variations undetected in the raw data and providing deeper insights into gene expression under varying conditions.

We evaluated the efficacy of GIST using the Human Breast Cancer (Space Ranger 1.0.0) Spatial Gene Expression Dataset. Figure 4A presents the spatial regions as determined by ground truth, by GIST with CTransPath, by CellCharter, and by PROST. Overall, the tumor regions predicted by GIST are more accurate. Although this particular dataset has fewer classes, resulting in a slightly lower ARI compared to other datasets, the overall accuracy of spatial region identification remains high. Concurrently, Fig. 4B illustrates the effect of GIST-enhanced tumor segmentation. ESR1 mutations, as emerging clinical biomarkers in metastatic hormone receptor-positive breast cancer, may help monitor disease progression and are associated with treatment resistance [49]. The GIST-enhanced gene expression of ESR1 (t test, P = 0.036) improved tumor cell detection (Fig. 4C). The violin plot in Fig. 4D quantitatively demonstrates the impact of GIST on enhancing marker gene expression. For each cell type, DEGs were better identified after GIST enhancement (Fig. 4E). These results demonstrate GIST's ability to reduce noise and improve gene expression patterns in breast cancer datasets.

GIST improves the expression pattern of specific genes in cell types in colorectal cancer

We also conducted further evaluation using human colorectal cancer datasets. The colorectal cancer samples were obtained from Discovery Life Sciences by 10x Genomics. The H&E images were acquired through sectioning, dewaxing, H&E staining, and imaging. The H&E images reveal colorectal cancer with a proliferative connective tissue response, as well as infiltrating tumor areas with a large amount of tumor stroma. This dataset contains 9,080 cells and 18,085 genes, categorized into 5 groups: desmoplastic changes, muscularis propria, tumor, tumor necrosis, and vessel.

Figure 5A shows the visualization of the original and the predicted spatial domains using GIST, PROST, and CellCharter. The GIST-predicted tumor (green) and necrosis (red) regions matched the ground truth with high accuracy. In the UMAP plot (Fig. 5B), GIST clustered tumor cells that were originally dispersed. Additionally, the tumor and necrosis regions were located near each other. MKI67 is a potential diagnostic and prognostic biomarker in stage II/III microsatellite instability-high colorectal cancer [50]. The enhanced visualization of MKI67 gene expression showed significantly (t test, P = 0.00013) clearer areas of high MKI67 expression with greater spatial continuity, indicating that the enhancement effectively captured the spatial pattern of MKI67 expression. Similarly, regions with low MKI67 expression appeared to be more uniform in the enhanced visualization (Fig. 5C). Additional downstream analyses, visualized using violin plots (Fig. 5D) and the detection of DEGs (Fig. 5E), suggested that GIST improved the interpretation of gene expression patterns by effectively denoising the gene expression data before integrating it with staining images.

Discussion


Spatial transcriptomic techniques consistently provide high-resolution tissue histology images. While histology examination remains the gold standard for cancer diagnostics and disease understanding due to its rich cell morphology information, current methods for processing ST data have not fully utilized this morphological information, mostly relying on it solely for localization or for comparing basic cellular similarity. In contrast, the field of digital pathology has rapidly advanced, with deep learning-enabled microscopic image analysis showing promising applications in computer-aided diagnostics. Recent developments in histopathology image foundation models have enabled accurate extraction of tissue-level cellular image features. In this work, we present GIST, a novel approach that leverages pretrained self-supervised histology image foundation models to extract features and employs a hybrid graph transformer to efficiently fuse these image features with transcriptomic features. In our experimental setup, we used multiple state-of-the-art pretrained histology foundation models as backbones, including CTransPath, Virchow2, Gigapath, and UNI. All models were trained on millions of diagnostic H&E-stained images and used for extracting cellular morphological features. Despite differences in their architectures and pretraining datasets, these backbones efficiently capture both local and global cellular features from processed patches for aiding downstream histopathology diagnostic tasks. CTransPath employs contrastive learning to generate and refine feature vectors via parallel network processing and an exponential moving average (EMA). In contrast, Virchow2, Gigapath, and UNI use a self-supervised student–teacher network framework, where the student progressively learns to extract meaningful features under the guidance of the teacher network. These complementary strengths shaped the selection of 2 foundation model families for our study.
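As a generic illustration of the EMA mechanism mentioned above, the sketch below shows how a slowly changing "target" set of weights can track an "online" set that is updated by training. It is a minimal sketch under stated assumptions and not the training code of any of the foundation models: the function name `ema_update`, the flat weight lists, and the momentum values are hypothetical, and real models apply the same rule to full parameter tensors.

```python
def ema_update(target_weights, online_weights, momentum=0.99):
    """Move each target weight a small step toward the matching online weight."""
    if len(target_weights) != len(online_weights):
        raise ValueError("weight lists must have the same length")
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_weights, online_weights)
    ]

# Toy example: the online weights stay fixed, so the target converges toward them.
target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(100):
    target = ema_update(target, online, momentum=0.9)
```

A momentum close to 1 makes the target network change slowly, which gives the online network a stable reference during self-supervised training.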

The performance of different backbone variants in the framework is reported in Table S1, including GIST with UNI, GIST with CTransPath, GIST with Virchow2, and GIST with Gigapath for processing SCST data. Among these, GIST with UNI demonstrated superior overall performance in the lung cancer dataset, achieving a consistent score of 0.64 across the lung5-1, lung5-2, and lung5-3 samples. In contrast, GIST with CTransPath excelled in both the breast cancer and colorectal cancer datasets, showcasing its superior adaptability to these data types. Overall, models based on the GIST framework demonstrate notable improvements in the accuracy and reliability of spatial transcriptome analyses. In the main text, we present the results of the GIST with UNI model for visualization and detailed analysis in the lung cancer dataset. For the breast and colorectal cancer datasets, the GIST with CTransPath model outperformed the baseline models and showed the advantages of pathology-informed approaches in enhancing spatial transcriptome analyses.

GIST employed a multimodal strategy to fuse histological image features with spatial transcriptomic features, effectively leveraging the strengths of both modalities. By integrating the local and global context from staining images with spatial transcriptome data at the cellular level, GIST enhanced precision in distinguishing cellular structures. GIST's hybrid graph transformer efficiently addressed the challenge of dropout events, where certain genes may remain undetected despite active expression. This integration enhances gene expression data by mitigating dropouts and reducing noise, enabling better domain segmentation and biomarker identification, even with incomplete data.

The identification of genes and spatial domains are 2 essential tasks in SCST data analysis for understanding tissue architecture and disease microenvironments [51]. In this study, we focused on assessing GIST's generalizability in these 2 tasks. Although these tasks aim to learn different information (domain recognition focuses on classifying cell types across an image, while gene identification focuses on identifying and classifying specific genes), both require models capable of extracting meaningful insights from complex biomedical images. In our experiments, we visualized and compared spatial domain detection and clustering results before and after GIST enhancement. We found that GIST was more accurate in delineating structures with different anatomical contours. Moreover, we demonstrated the efficacy of GIST in identifying differentially expressed marker genes by visualizing their enhanced expression in spatial maps.

In addition to GIST's usability in tumor-related tasks, we also explored its use in recognizing noncancerous tissue structures. Our analysis of the dorsolateral prefrontal cortex (DLPFC) dataset showed that GIST achieved acceptable accuracy in brain region recognition (Figs. S2 to S4). However, because all of these histology image foundation models were predominantly developed using large-scale datasets of neoplastic histopathology images, they often fail to encode noncancerous phenotypes, which hinders accurate spatial mapping of gene expression in the human cerebral cortex. Brain-related datasets are also relatively scarce compared to other fields, given the challenges in obtaining brain tissue samples and the smaller dataset sizes.

Despite its strong performance and technical advantages, GIST can be further improved. First, current histology foundation models are not yet single-cell specific when encoding morphological features. Further improvements, such as better cell segmentation, could be evaluated [52]. Our investigation also did not explicitly explore the potential of integrating foundation models in transcriptomics or genomics, which may require more computational resources for pretraining but could extract more biologically meaningful features from the gene expression data [53]. Additionally, given the broad scope of the assessment, our focus has been primarily on publicly available cancer datasets, with limited consideration of other diseases in anatomical pathology or normal tissue structures. Future work will aim to further evaluate the model's generalizability to other diseases, such as neurological disorders in anatomical pathology [20].

Methods


Data preprocessing for gene expression and histology images

SCST contains multimodal features including gene expression, spatial location, and histological information [54]. We obtained gene expression profiles and gene location matrices from the dataset. We used the var_names_make_unique function from the Scanpy library to ensure that variable names were unique. Then, we normalized the gene expression profiles to ensure a consistent data scale across cells, mitigating the impact of sample size differences in subsequent analyses. A logarithmic transformation was then applied to each data element so that the data more closely approximated a normal distribution. For image processing, the same systematic procedure was adopted for all datasets. Each cell was cropped into a patch of 3 × 240 × 240 pixels, with the cell positioned at the center. Spatial coordinates were used to identify neighboring cells within a Euclidean distance of ≤80 pixels, constructing a spatial neighborhood map and ensuring consistency in preprocessing across datasets.
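The normalization, log transformation, and patch cropping described above can be sketched in plain NumPy. This is a minimal illustration and not the authors' code: the target sum of 10,000 counts per cell, the zero padding at image borders, and the function names `normalize_log1p` and `crop_patch` are all assumptions. In a Scanpy workflow, `sc.pp.normalize_total` and `sc.pp.log1p` would be the likely counterparts of the first function, although the text names only `var_names_make_unique`.

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e4):
    """Scale every cell (row) to the same total count, then apply log(1 + x).

    `target_sum` is an assumed value; the source does not state one.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def crop_patch(image, cx, cy, size=240):
    """Crop a size x size patch centered on pixel (cx, cy).

    `image` is an H x W x 3 array (channels last). Zero padding at the
    borders is an assumption made here so that every patch has the same shape.
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    # After padding, original pixel (cy, cx) sits at (cy + half, cx + half),
    # so this slice places it at index (half, half) of the patch.
    return padded[cy:cy + size, cx:cx + size, :]


# Toy usage: 2 cells x 3 genes, and one patch from a synthetic image.
expr = normalize_log1p([[1, 3, 0], [10, 0, 10]])
image = np.zeros((300, 400, 3), dtype=np.uint8)
patch = crop_patch(image, cx=5, cy=290)
patch_chw = patch.transpose(2, 0, 1)  # 3 x 240 x 240, the layout quoted in the text
```

The channels-first transpose at the end matches the 3 × 240 × 240 layout quoted in the text; whether cropping happens before or after any resizing is not specified in the source.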

Feature extraction

We employed both contrastive learning-based and knowledge distillation-based histology image foundation models to extract image features. For the contrastive learning-based image feature extractor (i.e., CTransPath), two views, $x_1$ and $x_2$, were created from the original histology image patch $x$. These views, along with the original image, were fed into 3 parallel branches of the CTransPath backbone, with the branches for $x_1$ and $x_2$ sharing the same model parameters. Following MoCo v3 [55], $x_1$ was processed by the backbone of an online network, while $x_2$ was passed through a target network, yielding the feature vectors $z_1$ and $z_2$. To increase the number of positive samples, the original image $x$ was also processed through an additional branch to generate a feature vector $z$. The target branch and the original image branch queried the memory bank to retrieve semantically similar samples. The samples in the memory bank were ranked by their similarity, with the top $S$ samples designated as positives and the remaining ones as negatives. During contrastive learning, the feature vector $z_1$ from the online branch served as an anchor: the positive samples were pulled closer to it and the negative samples were pushed further away from it in the embedding space.
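The positive selection and the pull/push objective can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the CTransPath training code: cosine similarity as the ranking score, the InfoNCE-style form of the loss, the temperature of 0.07, and the names `select_positives` and `contrastive_loss` are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def select_positives(query, memory_bank, top_s):
    """Rank memory-bank entries by cosine similarity to `query`.

    The top S entries are marked as positives; all others are negatives.
    """
    sims = l2_normalize(memory_bank) @ l2_normalize(query)
    order = np.argsort(-sims)
    positive_mask = np.zeros(len(memory_bank), dtype=bool)
    positive_mask[order[:top_s]] = True
    return positive_mask


def contrastive_loss(anchor, memory_bank, positive_mask, temperature=0.07):
    """InfoNCE-style loss with several positives (assumed form).

    Minimizing it raises the anchor's similarity to the positives relative
    to every other entry in the memory bank.
    """
    logits = l2_normalize(memory_bank) @ l2_normalize(anchor) / temperature
    logits = logits - logits.max()  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[positive_mask].mean()


# Toy usage: z_2 (target branch) queries the bank, z_1 (online branch) is the anchor.
rng = np.random.default_rng(0)
bank = rng.normal(size=(64, 16))
z_2 = bank[3] + 0.01 * rng.normal(size=16)
z_1 = bank[3] + 0.01 * rng.normal(size=16)
mask = select_positives(z_2, bank, top_s=4)
loss = contrastive_loss(z_1, bank, mask)
```

In the described method both the target branch and the original-image branch query the memory bank; the sketch shows a single query for brevity.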

CTransPath replaced the patch partition of the Swin Transformer with a CNN-based nonlinear mapper, which improved the stability of network training and captured more local features of the image. The adjusted feature extractor scanned local information from histology images using the CNN module and used the transformer module to obtain global features of histology images (Fig. S1C). In each image, a patch of size $H \times W \times 3$ was sampled, centered on a cell-of-interest. This image patch was fed into the CNN module to obtain the local feature map $F \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$. The CNN module consisted of 3 consecutive convolution layers with kernel sizes $3 \times 3$, $3 \times 3$, and $1 \times 1$. Convolution kernels of different sizes capture features at different scales in the image patch, i.e., local features. The local feature map $F$ was then input into the Swin Transformer, and hierarchical features were obtained through 4 Swin Transformer blocks. In each layer of the Swin Transformer (Fig. S1C), the input feature was downsampled. Two consecutive Swin Transformer blocks [39] consist of a window-based multi-head self-attention (W-MSA) layer and a shifted-window-based multi-head self-attention (SW-MSA) layer, each followed by a 2-layer multilayer perceptron (MLP) with a GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP module, and a residual connection is applied after each module. Two consecutive Swin Transformer blocks are computed as follows:

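$$
\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}, \\
z^{l} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}, \\
z^{l+1} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}
\end{aligned}
\tag{1}
$$

where $\hat{z}^{l}$ and $\hat{z}^{l+1}$ denote the outputs of the W-MSA and SW-MSA modules, and $z^{l}$ and $z^{l+1}$ denote the outputs of the MLP modules, for blocks $l$ and $l+1$, respectively.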
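The residual wiring of Eq. 1 can be traced with a compact NumPy sketch. It is deliberately simplified and is not the Swin or CTransPath implementation: attention is single-head with no learned query, key, or value projections, the border mask used with shifted windows is omitted, LayerNorm has no learned scale or shift, and the names `window_attention` and `swin_block_pair` are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """LayerNorm over the channel axis, without learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    """Tanh approximation of the GELU nonlinearity."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def mlp(x, w1, w2):
    """Two linear layers with a GELU in between (biases omitted)."""
    return gelu(x @ w1) @ w2


def window_attention(x, window, shift=0):
    """Self-attention inside non-overlapping windows of an H x W x C map.

    H and W must be divisible by `window`. A nonzero `shift` cyclically
    rolls the map first, which gives the shifted-window variant.
    """
    h, w, c = x.shape
    if shift:
        x = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    # Partition into (num_windows, window * window, C) token groups.
    wins = x.reshape(h // window, window, w // window, window, c)
    wins = wins.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)
    scores = wins @ wins.transpose(0, 2, 1) / np.sqrt(c)
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ wins
    # Undo the window partition, then undo the cyclic shift.
    out = out.reshape(h // window, w // window, window, window, c)
    out = out.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
    if shift:
        out = np.roll(out, shift=(shift, shift), axis=(0, 1))
    return out


def swin_block_pair(z_prev, mlp_l, mlp_l_plus_1, window=4):
    """Apply the four lines of Eq. 1 in order."""
    z_hat = window_attention(layer_norm(z_prev), window) + z_prev
    z = mlp(layer_norm(z_hat), *mlp_l) + z_hat
    z_hat_next = window_attention(layer_norm(z), window, shift=window // 2) + z
    z_next = mlp(layer_norm(z_hat_next), *mlp_l_plus_1) + z_hat_next
    return z_next


rng = np.random.default_rng(0)
c = 8
weights = [(rng.normal(scale=0.1, size=(c, 4 * c)),
            rng.normal(scale=0.1, size=(4 * c, c))) for _ in range(2)]
features = rng.normal(size=(8, 8, c))
out = swin_block_pair(features, weights[0], weights[1])
print(out.shape)  # (8, 8, 8)
```

Each of the four assignments in `swin_block_pair` corresponds to one line of Eq. 1, so the pre-norm placement and the residual additions can be checked term by term.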

For the knowledge distillation-based image feature extractors, we chose a family of models (i.e., UNI, Virchow2, and Gigapath) built on DINOv2 [56], a self-supervised framework based on a teacher–student network architecture. The patches from the original image were augmented to create 2 views, $x_1$ and $x_2$. These augmented views were then fed into the student network. The parameters of the teacher network were updated using an exponential moving average (EMA) of the student network's parameters. The knowledge distillation-based image feature extractor also applied a random mask to the augmented views of the original image. The unmasked portion of these augmented views was used in the teacher network, while the masked portion was used in the student network. In addition, alignment losses were used to assess the consistency between the feature representations generated by the student and teacher networks.
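The EMA teacher update and the random masking step can be sketched as follows. This is an illustration under stated assumptions rather than the DINOv2 code: the momentum of 0.996, the mask ratio, the dictionary-of-arrays parameter layout, and the names `ema_update` and `random_token_mask` are hypothetical.

```python
import numpy as np


def ema_update(teacher, student, momentum=0.996):
    """Return new teacher parameters as an exponential moving average.

    new_teacher = momentum * teacher + (1 - momentum) * student
    Both arguments are dicts mapping parameter names to arrays.
    """
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}


def random_token_mask(num_tokens, mask_ratio, rng):
    """Boolean mask with round(mask_ratio * num_tokens) tokens set to True."""
    mask = np.zeros(num_tokens, dtype=bool)
    num_masked = int(round(mask_ratio * num_tokens))
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask


# Toy usage: the teacher drifts slowly toward the student over many steps.
rng = np.random.default_rng(0)
student = {"w": rng.normal(size=(4, 4))}
teacher = {"w": np.zeros((4, 4))}
for _ in range(100):
    teacher = ema_update(teacher, student)
mask = random_token_mask(num_tokens=256, mask_ratio=0.3, rng=rng)
```

Because the teacher is never updated by gradients, its parameters change only through this averaging, which is what keeps its targets stable while the student trains.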

Hybrid graph transformer model

The hybrid graph transformer model contained 3 major components: the construction of the cellular spatial graph, the graph transformer model, and the loss function. In the construction phase of the single-cell spatial graph, we obtained gene expression profiles and position matrices of cells from the spatial transcriptome data. In the single-cell spatial graph, each node $v_i$ represents a cell, and an edge between 2 nodes indicates that the 2 cells are neighbours of each other. Adjacency was determined by Euclidean distance. Specifically, each cell includes gene expression data from the gene expression profiles and an $N \times N$ image patch centered on the cell-of-interest.
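A distance-thresholded spatial graph of this kind can be built as in the sketch below. It is a minimal dense-matrix version for illustration, with the hypothetical function name `build_spatial_graph`; the radius of 80 is the lung cancer threshold quoted elsewhere in the Methods, and excluding self-loops is an assumption.

```python
import numpy as np


def build_spatial_graph(coords, radius):
    """Connect every pair of cells whose Euclidean distance is <= radius.

    `coords` is an (n_cells, 2) array of cell centers. Returns a (2, n_edges)
    array of directed edges (source, target); each undirected neighbor pair
    therefore appears twice. Self-loops are excluded (an assumption).
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adjacency = dist <= radius
    np.fill_diagonal(adjacency, False)
    src, dst = np.nonzero(adjacency)
    return np.stack([src, dst], axis=0)


# Three cells on a line: only the first two are within 80 pixels of each other.
edges = build_spatial_graph([[0, 0], [50, 0], [200, 0]], radius=80)
print(edges.T.tolist())  # [[0, 1], [1, 0]]
```

The dense pairwise matrix needs memory quadratic in the number of cells, so a tree-based neighbor search such as `scipy.spatial.cKDTree.query_pairs` would be the more practical choice for full slides.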

The graph transformer model contains 3 graph transformer autoencoders (Fig. S1B) for different purposes: an image autoencoder, a transcriptome autoencoder, and a hybrid autoencoder. The image autoencoder took the image features $z_{M,i}$ as input. Similarly, the transcriptome autoencoder took the raw cell expression features $z_{g,i}$. Then, these two features were concatenated as the hybrid features and fed into the hybrid autoencoder to obtain the hybrid embedding features $z_{h,i}$. To better preserve the spatial information of the data, the features of neighbouring cells were also fed into the graph transformer model through the single-cell spatial graph. A mean squared error (MSE) self-supervised loss was applied to guide the model in learning how to integrate image embedding features, gene embedding features, and hybrid embedding features. The overall loss function is defined as follows:

$$
L = \sum_{i=1}^{N} \left( \lambda_1 L_{M,i} + \lambda_2 L_{g,i} + L_{h,i} \right) \tag{2}
$$

where $N$ is the total number of cells, $L_{M,i}$ is the image embedding loss, $L_{g,i}$ is the gene embedding loss, and $L_{h,i}$ is the hybrid embedding loss, consistent with the subscripts of the features $z_{M,i}$, $z_{g,i}$, and $z_{h,i}$. The hyperparameters $\lambda_1$ and $\lambda_2$ balance the weights of the different loss components and should satisfy $\lambda_1, \lambda_2 \geq$
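The weighted sum in Eq. 2 can be written out directly, as in the sketch below. It is illustrative only: the reconstruction targets for each branch are not specified in the text, so the arrays here are placeholders, the per-cell averaging over features is an assumption, and the names `per_cell_mse` and `total_loss` are hypothetical. The default weights of 0.1 follow the loss weights reported in the experiment parameters.

```python
import numpy as np


def per_cell_mse(reconstruction, target):
    """Mean squared error for each cell (row), averaged over features."""
    diff = np.asarray(reconstruction, dtype=float) - np.asarray(target, dtype=float)
    return (diff ** 2).mean(axis=1)


def total_loss(loss_M, loss_g, loss_h, lambda_1=0.1, lambda_2=0.1):
    """Eq. 2: sum over cells of lambda_1 * L_M + lambda_2 * L_g + L_h."""
    return float(np.sum(lambda_1 * loss_M + lambda_2 * loss_g + loss_h))


# Toy usage with placeholder reconstructions for 5 cells and 10 features.
rng = np.random.default_rng(0)
target = rng.normal(size=(5, 10))
loss_M = per_cell_mse(target + 0.3 * rng.normal(size=(5, 10)), target)
loss_g = per_cell_mse(target + 0.2 * rng.normal(size=(5, 10)), target)
loss_h = per_cell_mse(target + 0.1 * rng.normal(size=(5, 10)), target)
loss = total_loss(loss_M, loss_g, loss_h)
```

With both weights below 1, the hybrid term dominates the objective, so the image-only and gene-only branches act as auxiliary regularizers in this sketch.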

The state-of-the-art models for baseline comparisons

To benchmark spatial domain segmentation performance, we compared GIST against 9 state-of-the-art methods, each developed using different model architectures and technical focuses, including STAGATE [20], SpaGCN [19], BayesSpace [18], Scanpy [16], Seurat [15], stLearn [17], SiGra [30], PROST [23], and CellCharter [24]. STAGATE, SpaGCN, and SiGra utilize graph-based methods to capture cell spatial correlations and uncover spatial domains. BayesSpace employs a comprehensive Bayesian framework to enhance clustering resolution. PROST uses unsupervised clustering to identify spatial domains and quantitatively assess spatial variations in gene expression patterns. CellCharter excels in identifying biologically meaningful cell niches. Additionally, Scanpy is a tool specialized in single-cell gene expression analysis and in managing annotated transcriptomic data. stLearn maps cell progression within tissues and explores regions exhibiting significant intercellular interactions.

Comparison measurement and experiment parameters

The adjusted Rand index (ARI) is a metric for measuring the similarity between 2 data partitions and is commonly used to evaluate the performance of clustering algorithms in ST data analysis. We used the ARI to compare each method's domain segmentation result with the reference annotation and thereby assess the accuracy and reliability of each method. The ARI is defined as follows:

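$$
\text{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2} \right] - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{n}{2}} \tag{3}
$$

where $n$ is the total number of samples, $n_{ij}$ is the number of samples belonging to both $A_i$ and $B_j$, $a_i$ is the number of samples in $A_i$, and $b_j$ is the number of samples in $B_j$. The value of the ARI usually ranges from −1 to 1, where 1 means that the 2 partitions are in perfect agreement, 0 means that the agreement is equivalent to random assignment, and negative values mean that the 2 partitions agree less than expected under random assignment.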
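Eq. 3 can be evaluated from the contingency counts with the standard library alone, as in the following sketch. The function name `adjusted_rand_index` and the convention of returning 1.0 for degenerate inputs are choices made here for illustration; in practice `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples (Eq. 3)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no sample pairs to compare (convention chosen here)

    n_ij = Counter(zip(labels_a, labels_b))  # contingency table entries
    a_i = Counter(labels_a)                  # cluster sizes in partition A
    b_j = Counter(labels_b)                  # cluster sizes in partition B

    sum_ij = sum(comb(count, 2) for count in n_ij.values())
    sum_a = sum(comb(count, 2) for count in a_i.values())
    sum_b = sum(comb(count, 2) for count in b_j.values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g., both partitions are a single cluster
    return (sum_ij - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

The first call shows that the ARI ignores label names (the clusters are identical up to renaming), and the second shows a partition pair that agrees less than chance, giving a negative value.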

We optimized several parameters to enhance experimental performance, employing a grid-search approach informed by insights from previous studies [30]. During image feature extraction, we determined the optimal patch size with reference to the previous empirical results from SiGra. The comparison included 3 configurations: a size of 3 × 120 × 120 as used by SiGra, a size of 3 × 240 × 240 augmented by SiGra, and a standard size of 3 × 224 × 224 commonly used in histology image foundation models. Our experiments suggest that the size of 3 × 240 × 240 pixels achieved the best performance. During model training, we used a learning rate of 0.001, 1,000 epochs, a maximum gradient norm of 5, and the Adam optimizer with a weight decay of 0.0001. The loss weight on both gene and image features was set to 0.1. To construct spatial networks among cells, a Euclidean distance threshold was adjusted for each dataset to ensure comparable average neighbor counts across spatial maps. Specifically, we set the Euclidean distance threshold to 80 for lung cancer, 300 for breast cancer, and 20 for colorectal cancer.
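The reported settings can be collected in one configuration object, with gradient-norm clipping shown explicitly. This is a sketch, not the authors' training script: the class name `TrainConfig`, the field names, and the NumPy clipping helper are hypothetical, and the deep learning framework is not stated in the text. In PyTorch, `torch.optim.Adam(params, lr=..., weight_decay=...)` and `torch.nn.utils.clip_grad_norm_` would be the usual counterparts.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; names are illustrative."""
    learning_rate: float = 1e-3
    epochs: int = 1000
    max_grad_norm: float = 5.0
    weight_decay: float = 1e-4
    lambda_image: float = 0.1
    lambda_gene: float = 0.1
    patch_size: int = 240
    # Euclidean distance threshold per dataset, in the units of the cell coordinates.
    distance_threshold: dict = field(
        default_factory=lambda: {"lung": 80, "breast": 300, "colorectal": 20}
    )


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient arrays so their global L2 norm is <= max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


config = TrainConfig()
grads = [np.array([30.0, 40.0]), np.zeros(3)]
clipped, norm_before = clip_grad_norm(grads, config.max_grad_norm)
radius = config.distance_threshold["lung"]
```

Keeping the per-dataset thresholds next to the optimizer settings makes it explicit that only the graph radius changes between the lung, breast, and colorectal experiments.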

References


1. Moffitt, Lundberg, Heyn. The emerging landscape of spatial profiling technologies. Nat Rev Genet. 2022;23(12):741–759.

2. Moses, Pachter. Museum of spatial transcriptomics. Nat Methods. 2022;19(5):534–546.

3. Palla, Fischer, Regev, Theis. Spatial components of molecular tissue biology. Nat Biotechnol. 2022;40(3):308–318.

4. …, Bhatt, Brown, Buhr, Chantranuvatana, Danaher, Dunaway, Garrison, Geiss, et al. High-plex imaging of RNA and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nat Biotechnol. 2022;40(12):1794–1806.

5. Crosetto, Bienko, Van Oudenaarden. Spatially resolved transcriptomics and beyond. Nat Rev Genet. 2015;16(1):57–66.

6. Nasrallah, Zhao, Tsai, Meredith, Marostica, Ligon, Golden, Yu. Machine learning for cryosection pathology predicts the 2021 WHO classification of glioma. Med. 2023;4(8):526–540.

7. Giacomello, Salmén, Terebieniec, Vickovic, Navarro, Alexeyenko, Reimegård, McKee, Mannapperuma, Bulone, et al. Spatially resolved transcriptome profiling in model plant species. Nat Plants. 2017;3:17061.

8. Berglund, Maaskola, Schultz, Friedrich, Marklund, Bergenstråhle, Tarish, Tanoglidi, Vickovic, Larsson, et al. Spatial maps of prostate cancer transcriptomes reveal an unexplored landscape of heterogeneity. Nat Commun. 2018;9(1):2419.

9. Lopez R, Nazaret A, Langevin M, Samaran J, Regier J, Jordan MI, Yosef N. A joint model of unpaired data from scRNA-seq and spatial transcriptomics for imputing missing gene expression measurements. arXiv. 2019. https://doi.org/10.48550/arXiv.1905.02269

10. Piwecka, Rajewsky, Rybak-Wolf. Single-cell and spatial transcriptomics: Deciphering brain complexity in health and disease. Nat Rev Neurol. 2023;19(6):346–362.

11. Liu, Tran, Vemuri VNP, Byrne, Borja, Kim, Agarwal, Wang, Awayan, Murti, et al. Concordance of MERFISH spatial transcriptomics with bulk and single-cell RNA sequencing. Life Sci Alliance. 2023;6(1):e202201701.

12. Naghizadeh A, Xu H, Mohamed M, Metaxas DN, Liu D. Semantic aware data augmentation for cell nuclei microscopical images with artificial neural networks. Paper presented at: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021; Montreal, QC, Canada.

13. Niu, Zhong, Yu. A review on the attention mechanism of deep learning. Neurocomputing. 2021;452:48–62.

14. Shi Y, Huang Z, Feng S, Zhong H, Wang W, Sun Y. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv. 2020. https://doi.org/10.48550/arXiv.2009.03509

15. Hao, Hao, Andersen-Nissen, Mauck WM III, Zheng, Butler, Lee, Wilk, Darby, Zager, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–3587.

16. Wolf, Angerer, Theis. Scanpy: Large-scale single-cell gene expression data analysis. Genome Biol. 2018;19(1):15.

17. Pham D, Tan X, Xu J, Grice LF, Lam PY, Raghubar A, Vukovic J, Ruitenberg MJ, Nguyen Q. stLearn: Integrating spatial location, tissue morphology and gene expression to find cell types, cell-cell interactions and spatial trajectories within undissociated tissues. bioRxiv. 2020. https://doi.org/10.1101/2020.05.31.125658

18. Zhao, Stone, Ren, Guenthoer, Smythe, Pulliam, Williams, Uytingco, Taylor SEB, Nghiem, et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat Biotechnol. 2021;39:1375–1384.

19. …, Li, Coleman, Schroeder, Ma, Irwin, Lee, Shinohara, Li. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat Methods. 2021;18(11):1342–1351.

20. Dong, Zhang. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nat Commun. 2022;13(1):1739.

21. Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph attention networks. arXiv. 2017. https://doi.org/10.48550/arXiv.1710.10903

22. Bao, Deng, Wan, Shen, Wang, Dai, Altschuler, Wu. Integrative spatial analysis of cell morphologies and transcriptional states with MUSE. Nat Biotechnol. 2022;40(8):1200–1209.

23. Liang, Shi, Cai, Yuan, Xie, Yu, Huang, Shi, Wang, Li, et al. PROST: Quantitative identification of spatially variable genes and domain detection in spatial transcriptomics. Nat Commun. 2024;15(1):600.

24. Varrone, Tavernari, Santamaria-Martinez, Walsh, Ciriello. CellCharter reveals spatial cell niches associated with tissue remodeling and cell plasticity. Nat Genet. 2024;56:74–84.

25. Zhang, Wang, Shivashankar, Uhler. Graph-based autoencoder integrates spatial transcriptomics with chromatin images and identifies joint biomarkers for Alzheimer's disease. Nat Commun. 2022;13:7480.

26. Shan, Zhang, Guo, Wu, Miao, Xin, Lian, Gu. TIST: Transcriptome and histopathological image integrative analysis for spatial transcriptomics. Genomics Proteomics Bioinformatics. 2022;20(5):974–988.

27. Leng, Zhang, Liu, Cui, Zhao, Ge. Error-robust and label-efficient deep learning for understanding tumor microenvironment from spatial transcriptomics. IEEE Trans Circuits Syst Video Technol. 2023;34(8):6785–6796.

28. Xie R, Pang K, Chung SW, Perciani CT, MacParland SA, Wang B, Bader GD. Spatially resolved gene expression prediction from histology images via bi-modal contrastive learning. arXiv. 2024. https://doi.org/10.48550/arXiv.2306.01859

29. Xiao, Kong, Li, Wang, Lu. Transformer with convolution and graph-node co-embedding: An accurate and interpretable vision backbone for predicting gene expressions from local histopathological image. Med Image Anal. 2024;91: Article 103040.

30. Tang, Li, Hou, Zhang, Yang, Su, Song. SiGra: Single-cell spatial elucidation through an image-augmented graph transformer. Nat Commun. 2023;14:5618.

31. Wang, Cai, Yang, Cui, Zhu, Wang, Zhao. Sac-net: Enhancing spatiotemporal aggregation in cervical histological image classification via label-efficient weakly supervised learning. IEEE Trans Circuits Syst Video Technol. 2023;34(8):6774–6784.

32. Perez-Lopez, Ghaffari Laleh, Mahmood, Kather. A guide to artificial intelligence for cancer researchers. Nat Rev Cancer. 2024;24(6):427–441.

33. Cai, Chen, Zhao, Xue, Yang, Yuan, Feng, Weng, Liu, Peng, et al. Hicervix: An extensive hierarchical dataset and benchmark for cervical cytology classification. IEEE Trans Med Imaging. 2024.

34. Verghese, Lennerz, Ruta, Ng, Thavaraj, Siziopikou, Naidoo, Rane, Salgado, Pinder, et al. Computational pathology in cancer diagnosis, prognosis, and prediction—Present day and prospects. J Pathol. 2023;260:551–563.

35. Wang, Zhao, Marostica, Yuan, Jin, Zhang, Li, Tang, Wang, Li, et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature. 2024;634(8035):970–978.

36. Zhao, Wang, Zhu, Chukwudi, Finebaum, Zhang, Yang, He, Saeidi. Phasefit: Live-organoid phase-fluorescent image transformation via generative AI. Light Sci Appl. 2023;12:297.

37. Chen, Ding, Lu, Williamson DFK, Jaume, Song, Chen, Zhang, Shao, Shaban, et al. Towards a general-purpose foundation model for computational pathology. Nat Med. 2024;30(3):850–862.

38. Wang, Yang, Zhang, Wang, Zhang, Yang, Huang, Han. Transformer-based unsupervised contrastive learning for histopathological image classification. Med Image Anal. 2022;81: Article 102559.

39. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin Transformer: Hierarchical vision transformer using shifted windows. Paper presented at: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021; Montreal, QC, Canada.

40. Oquab M, Darcet T, Moutakanni T, Vo HV, Szafraniec M, Khalidov V, Fernandez P, Haziza D, Massa F, El-Nouby A, et al. DINOv2: Learning robust visual features without supervision. arXiv. 2023. https://doi.org/10.48550/arXiv.2304.07193

41. Zimmermann E, Vorontsov E, Viret J, Casson A, Zelechowski M, Shaikovski G, Tenenholtz N, Hall J, Klimstra D, Yousfi R, et al. Virchow2: Scaling self-supervised mixed magnification models in pathology. arXiv. 2024. https://doi.org/10.48550/arXiv.2408.00738

42. …, Usuyama, Bagga, Zhang, Rao, Naumann, Wong, Gero, Gonzalez, Gu, et al. A whole-slide foundation model for digital pathology from real-world data. Nature. 2024;630:181–188.

43. …, Bhatt, Brown, Buhr, Chantranuvatana, Danaher, Dunaway, Garrison, Geiss, et al. High-plex multiomic analysis in FFPE at subcellular level by spatial molecular imaging. Nat Biotechnol. 2022;40(12):1794–1806.

44. Minami, Shimamura, Shah, LaFramboise, Glatt, Liniker, Borgman, Haringsma, Feng, Weir, et al. The major lung cancer-derived mutants of ERBB2 are oncogenic and are associated with sensitivity to the irreversible EGFR/ERBB2 inhibitor HKI-272. Oncogene. 2007;26:5023–5027.

45. Riely, Marks, Pao. KRAS mutations in non–small cell lung cancer. Proc Am Thorac Soc. 2009;6(2):201–205.

46. Chuang, Stehr, Liang, Das, Huang, Diehn, Wakelee, Neal. ERBB2-mutated metastatic non–small cell lung cancer: Response and resistance to targeted therapies. J Thorac Oncol. 2017;12(5):833–842.

47. Revillion, Bonneterre, Peyrat. ERBB2 oncogene in human breast cancer and its clinical significance. Eur J Cancer. 1998;34(6):791–808.

48. Jeselsohn, Buchwalter, De Angelis, Brown, Schiff. ESR1 mutations—A mechanism for acquired endocrine resistance in breast cancer. Nat Rev Clin Oncol. 2015;12(10):573–583.

49. Brett, Spring, Bardia, Wander. ESR1 mutation as an emerging clinical biomarker in metastatic hormone receptor-positive breast cancer. Breast Cancer Res. 2021;23(1):85.

50. …, Li, Wang, Guo, Wang, Zhang, Yu, Chen, Niu, Wang, et al. Identification of MKI67, TPR, and TCHH mutations as prognostic biomarkers for patients with defective mismatch repair colon cancer stage II/III. Dis Colon Rectum. 2023;66(11):1481–1491.

51. Zhao, Liu, Tang, Wang, Yang, Liu, Chen. Mesoscopic structure graphs for interpreting uncertainty in non-linear embeddings. Comput Biol Med. 2024;182: Article 109105.

52. Wang, Zhao, Xu, Han, Tao, Zhou, Geng, Liu, Ji. A systematic evaluation of computational methods for cell segmentation. Brief Bioinform. 2024;25:bbae407.

53. Hao, Gong, Zeng, Liu, Guo, Cheng, Wang, Ma, Zhang, Song. Large-scale foundation model on single-cell transcriptomics. Nat Methods. 2024;21:275–285.

54. Williams, Lee, Asatsuma, Vento-Tormo, Haque. An introduction to spatial transcriptomics for biomedical research. Genome Med. 2022;14(1):68.

55. Chen X, Xie S, He K. An empirical study of training self-supervised vision transformers. Paper presented at: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021; Montreal, QC, Canada.

56. Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, Fernandez P, Haziza D, Massa F, El-Nouby A, et al. DINOv2: Learning robust visual features without supervision. arXiv. 2023. https://doi.org/10.48550/arXiv.2304.07193

Appendix


Year 2025 volume 8 Issue 1


Article Info

doi: 10.34133/research.0568

Received: 2024-09-23
Revised: 2024-11-28
Accepted: 2024-12-11
Published: 2025-01-17
Online Date: 2025-07-23