
Instance-level semantic segmentation of nuclei based on multimodal structure encoding

Abstract

Background

Accurate segmentation and classification of cell nuclei are crucial for histopathological image analysis. However, existing deep neural network-based methods often struggle to capture complex morphological features and global spatial distributions of cell nuclei due to their reliance on local receptive fields.

Methods

This study proposes a graph neural structure encoding framework based on a vision-language model. The framework incorporates: (1) A multi-scale feature fusion and knowledge distillation module utilizing the Contrastive Language-Image Pre-training (CLIP) model’s image encoder; (2) A method to transform morphological features of cells into textual descriptions for semantic representation; and (3) A graph neural network approach to learn spatial relationships and contextual information between cell nuclei.

Results

Experimental results demonstrate that the proposed method significantly improves the accuracy of cell nucleus segmentation and classification compared to existing approaches. The framework effectively captures complex nuclear structures and global distribution features, leading to enhanced performance in histopathological image analysis.

Conclusions

By deeply mining the morphological features of cell nuclei and their spatial topological relationships, our graph neural structure encoding framework achieves high-precision nuclear segmentation and classification. This approach shows significant potential for enhancing histopathological image analysis, potentially leading to more accurate diagnoses and improved understanding of cellular structures in pathological tissues.


Background

Precise segmentation of cell nuclei is recognized as a critical step in computational pathology for understanding disease mechanisms and assisting in diagnosis [1]. The importance of this process stems from the characteristics of cell nuclei as the core of life activities, with their morphology, distribution, and type directly reflecting the physiological and pathological states of tissues [2]. However, traditional nuclear segmentation methods still face numerous challenges when processing complex histopathological images, particularly exhibiting significant limitations when dealing with situations such as nuclear density, morphological diversity, and blurred boundaries [3]. Furthermore, segmentation results are susceptible to human factors, such as diagnostic differences between pathologists and inconsistencies in judgment by the same pathologist at different times or under different conditions, which further increase the uncertainty of analytical results. While convolutional neural network-based methods have demonstrated tremendous potential in cell nucleus segmentation, their generalization performance and robustness still need to be improved when faced with complex and diverse real pathological images.

Achieving accurate nuclear classification based on nuclear segmentation not only significantly improves diagnostic accuracy but also provides crucial information for personalized treatment and research into disease mechanisms [4]. Accurate nuclear classification requires the simultaneous capture of nuclear morphological features and spatial relationships between nuclei [5]. However, existing methods still face numerous challenges in this regard. Firstly, while traditional convolutional neural networks can effectively extract local features, they often struggle to fully utilize global semantic information and topological relationships between cell nuclei [6]. This limitation leads to models being prone to misjudgments when processing complex scenarios, particularly performing poorly in distinguishing morphologically similar but different types of cell nuclei [7]. Secondly, existing methods typically treat segmentation and classification as independent tasks, neglecting the intrinsic connection between the two, which further limits the overall performance of the model [8]. In the face of these challenges, recent technological advancements have provided new approaches to address these issues. Vision-language models such as Contrastive Language-Image Pre-training (CLIP) and Graph Neural Networks (GNNs) have made breakthrough progress in cross-modal learning and processing structured data, offering new possibilities for integrating visual and language information and capturing inter-nuclear relationships [9]. Nevertheless, how to effectively combine these technologies to enhance the accuracy of cell nucleus classification while maintaining model interpretability and generalization ability remains an urgent challenge to be addressed [10].

To address the aforementioned problems, a multimodal structure encoding framework is proposed in this paper, aimed at improving nuclear segmentation and classification tasks. This method is designed to effectively address the shortcomings of traditional approaches in handling complex scenarios and lacking global information by integrating visual features of digital pathology images, semantic descriptions of nuclear morphology, and spatial relationships between nuclei. Firstly, a multi-scale feature fusion and knowledge distillation module is designed. The visual encoder of CLIP is utilized as a teacher model to guide a Visual Transformer (ViT) student model in extracting visual features of cell nuclei, thereby enhancing the model’s perceptual capability for complex nuclear structures. Secondly, to capture high-level semantic information of nuclei, morphological features of nuclei are transformed into textual descriptions. The text encoder of CLIP is then employed to obtain high-level semantic representations of nuclear node features. Finally, a graph structure with nuclei as nodes is constructed. Spatial relationships and contextual information between nuclei are learned through graph neural networks, achieving fusion of semantic information and visual features. Through this process, more comprehensive and precise representations of cell nuclear features are obtained to support more accurate classification. The main contributions of this work are as follows:

  • Multimodal structure encoding framework is proposed, integrating the CLIP model’s cross-modal representation capabilities with graph neural networks’ structured learning characteristics. Local visual features and global semantic information are fused through multi-scale feature fusion and knowledge distillation.

  • Semantic representation method for cell nuclear morphology is developed. Morphological features of nuclei are transformed into structured textual descriptions, and semantic representations are obtained using the text encoder of CLIP.

  • Graph structure-based approach is proposed to model cell nuclear relationships. Spatial relationships and contextual information among nuclei are learned through graph neural networks.

Related work

Cell nuclear segmentation

Morphological image-based approaches have been extensively employed for nuclear instance segmentation in current literature. For instance, Majanga et al. achieved high-precision automatic nuclear segmentation in breast cancer histopathological images by combining data augmentation, staining normalization, and morphological processing [11]. Kaushal et al. utilized grayscale conversion, median filtering, and bottom-hat filtering as preprocessing steps to enhance image contrast, while post-processing morphological techniques including dilation, region opening, and hole filling were applied to improve the final cellular segmentation results [12]. Kumar et al. [13], Veta et al. [14], and Wienert et al. [15] implemented various classical image processing techniques, encompassing mathematical morphology processing, level set methods, graph-based segmentation algorithms, color threshold segmentation, and active contour models. However, these segmentation techniques rely heavily on consistent intensity differences between nuclei and background, which proves inadequate for more complex images, often yielding unreliable results. Furthermore, these methods typically exhibit high sensitivity to manually selected parameters.

Recently, deep learning methods have garnered significant attention due to their exceptional performance in various computer vision tasks [16, 17]. Han et al. proposed an edge-aware ensemble approach specifically designed for segmenting nuclei with abnormal morphologies [18]. Jin et al. developed a feature aggregation-based semi-supervised model that improves segmentation accuracy by considering both inter-class and intra-class uncertainties [19]. Li et al. designed a Transformer-based Nuclear Segmentation method (NST) specifically targeting gastrointestinal cancer pathology images [20]. In another study, Jin et al. achieved semi-supervised histological image segmentation through hierarchical consistency constraints [21]. Imtiaz et al. introduced the Boundary-Aware Wavelet-Guided Network (BAWGNet), which enhanced the precision of nuclear segmentation in histopathological images [22]. However, these methods continue to follow natural image processing paradigms without fully considering the unique biological characteristics of cell nuclei, such as morphological constraints and size distribution patterns. Particularly when dealing with closely connected nuclei, there is a lack of modeling and utilization of spatial topological relationships between nuclei, where such biological prior knowledge could provide crucial guidance for segmentation tasks. Therefore, how to effectively integrate cell-specific knowledge and spatial position constraints into deep learning frameworks remains a research direction worthy of further exploration.

Cell nuclear classification

Beyond instance segmentation, determining the type of each nucleus is crucial for facilitating and improving downstream analysis. Bagchi et al. significantly improved breast cancer grading accuracy by integrating morphological and deep features [23]. Javed et al. enhanced cancer tissue classification through multi-scale feature extraction [24]. Wang et al. improved cell classification generalization by combining contrastive learning with a multi-task framework [25]. Mi et al. proposed a two-stage architecture based on deep learning for four-class breast cancer pathology image classification (normal tissue, benign lesions, ductal carcinoma in situ, and invasive cancer), achieving high classification accuracy and validating its generalization performance across multiple public datasets [26]. However, in nuclear classification, conventional feature extraction methods may fail to fully capture the complex variations and subtle differences in nuclear morphology. Moreover, existing methods often overlook the spatial relationships and contextual information between nuclei, which are crucial for accurate determination of nuclear types.

Visual-language model

Vision-language models have achieved significant advancements in recent years, introducing new possibilities for nuclear segmentation and classification tasks. These models are designed to establish connections between images and text, paving new paths for multimodal learning and cross-modal understanding. The Contrastive Language-Image Pre-training (CLIP) model, proposed in literature [27], marks a significant breakthrough in vision-language pre-training. Through large-scale contrastive learning, CLIP demonstrates powerful zero-shot learning capabilities, offering new insights for medical image analysis. Subsequently, researchers have begun to explore the potential of applying CLIP to the medical domain. Med-CLIP, introduced in literature [28], significantly improves model performance across various medical visual tasks by fine-tuning CLIP on medical image-report pairs. In the realm of nuclear segmentation and classification, a CLIP-based method is developed in literature [29]. This approach achieves more precise nuclear classification by transforming cell nuclear morphological features into textual descriptions. Recent research, such as CPLIP proposed in literature [30], further extends this concept by combining CLIP with pathological knowledge, enhancing model performance in complex pathological image analysis. These advancements not only improve the accuracy of nuclear segmentation and classification but also enhance model interpretability, providing new directions for the development of computational pathology.

However, existing methods primarily focus on single-modal feature extraction, lacking comprehensive utilization of multi-dimensional nuclear information. Furthermore, existing methods often overlook the comprehensive diagnostic approach of pathologists, who integrate multiple dimensional features including nuclear morphology and spatial relationships, creating a considerable gap between computational methods and actual clinical diagnostic reasoning. To overcome these limitations, a novel representation method is needed that can integrate local visual features, spatial position information, and high-level semantic descriptions. This approach should better simulate the diagnostic reasoning patterns of pathologists while providing more comprehensive and precise nuclear feature representations.

Graph neural network

Graph Neural Networks (GNNs) have made significant advancements in recent years in processing non-Euclidean data structures, providing effective tools for modeling complex relationships and dependencies. The Graph Convolutional Network (GCN) proposed in [31] is considered a seminal work in this field, as it defines convolution operations on the adjacency matrix of graphs, enabling the processing of graph-structured data. Subsequently, GraphSAGE was introduced in [32], which is recognized as an inductive learning framework capable of generating embeddings for new nodes. Graph Attention Networks (GAT), presented in [33], are characterized by their attention mechanism that allows nodes to assign different levels of importance to their neighbors. In the domain of biomedical image analysis, GNNs have demonstrated considerable potential for application. For instance, the use of GNNs in analyzing cellular images is explored in [34], where the spatial relationships between cells are modeled to enhance segmentation and classification processes. As a result, new possibilities are being opened up for various tasks, including nuclear segmentation and classification.

Clinical applications of instance-level nuclear segmentation and classification

The computer-aided analysis of nuclear morphology, particularly in segmentation and classification tasks, has shown promising advances in clinical applications, providing pathologists with efficient tools for quantitative assessment and diagnosis support. Chan et al. proposed a whole-slide image analysis method based on heterogeneous graph representation learning [35]. In breast cancer research, Zhao et al. developed a deep learning framework for molecular subtyping and prognostic stratification of triple-negative breast cancer [36]. Yang et al. utilized graph deep learning to predict ovarian cancer patients’ prognosis and treatment response from histopathological images [37]. In another study, Zhao et al. mapped the single-cell morphological and topological atlas of breast cancer [38]. In kidney cancer research, Hu et al. revealed metabolic reprogramming during disease progression through multi-omics analysis [39]. Li et al. mapped the single-cell transcriptome atlas of kidney cancer, revealing heterogeneity within tumors and associated regions [40]. Braun et al. discovered gradual immune function decline during disease progression [41]. Obradovic et al. identified tumor-associated macrophages related to recurrence [42]. In other cancer studies, Jackson et al. mapped the single-cell pathology atlas of breast cancer [43]. Kather et al. utilized deep learning to analyze gastrointestinal tumors for microsatellite instability prediction [44]. Bulten et al. developed an AI system for Gleason grading of prostate cancer [45]. AbdulJabbar et al. analyzed immune variation characteristics in lung adenocarcinoma [46]. Schürch et al. studied the immune microenvironment of colorectal cancer [47]. These studies underscore the clinical significance of integrating computer-aided nuclear analysis into routine pathological practice.

Methods

Overview of framework

As illustrated in Fig. 1, a multimodal structure encoding-based framework for cell nucleus segmentation and classification is proposed in this study. This framework is designed to achieve precise nucleus segmentation and classification by exploiting morphological features of cell nuclei and their topological relationships. Initially, a cell nucleus instance segmentation branch is constructed (as shown in Fig. 1a). This branch utilizes histopathological image slices as input and employs a pre-trained Vision Transformer (ViT) as the encoder. The selection of ViT is based on its exceptional performance in capturing global image features. To further enhance feature representation capabilities, a CLIP image encoder is introduced as a teacher model, and its feature extraction abilities are transferred to the ViT through knowledge distillation (KD). In the decoder section, ConvNeXt [48] is adopted to achieve pixel-level classification of cell nuclei. ConvNeXt effectively captures nucleus features at different scales through multi-scale feature fusion. The decoder's output maintains the same dimensions as the input image, ensuring complete preservation of information during the feature extraction process. The encoder-decoder (E-D) structure follows a top-down pathway, where high-level feature maps are upsampled to match the resolution of low-level feature maps, thereby constructing feature maps with different spatial information hierarchies.

Furthermore, a graph neural network incorporating cell nucleus morphological information is constructed (as illustrated in Fig. 1b) to enhance nucleus feature representation. This network initially acquires nucleus segmentation maps and the final layer feature map from the nucleus instance segmentation branch as key inputs. Subsequently, semantic feature vectors of nucleus morphological information are extracted based on the nucleus segmentation map, and a CLIP text encoder is introduced to perform secondary encoding on these features, generating high-level semantic representations. The construction of the cell nucleus graph is based on three sets of features: cell nucleus morphological information feature vectors, high-level semantic representations generated by the CLIP text encoder, and the feature map from the final layer of the nucleus instance segmentation branch. Upon completion of construction, a graph neural network is applied to extract features from the cell nucleus graph, resulting in enhanced feature representations. These features are then fused and input into a classifier composed of fully connected (FC) layers and a Softmax layer. To optimize model performance, a combination of cross-entropy loss, Dice loss, and focal loss is employed to train the nucleus classification model, thereby improving classification accuracy.

Fig. 1
figure 1

A framework for cell nucleus segmentation and classification based on multi-modal structure encoding. a Network architecture for nucleus instance segmentation based on multimodal feature fusion and knowledge distillation, b graph neural network-based architecture for nucleus classification

Network architecture

The proposed network architecture processes cropped 270×270 pixel pathology images through the following steps. First, the input image (B,3,270,270) is encoded by the ViT into a feature map of (B,768,17,17), while the CLIP image encoder generates features of (B,512,17,17); these are combined by the multi-scale feature fusion module, which outputs fused features of (B,768,17,17). The decoder upsamples the fused features to the original resolution and outputs a nuclear segmentation map of (B,2,270,270). Based on the fused features and the segmentation map, a graph structure is generated, consisting of node features (N,256+2), edge indices (2,E), and nucleus information (N,3). Subsequently, text descriptions are generated from the nuclear segmentation results and processed by the CLIP text encoder to obtain text features (N,512), which are fused with the node features to produce enhanced node features (N,256). Finally, the graph data are processed by the graph neural network, which outputs cell classification results of \((N,nr\_types)\) that are mapped back to the image space as \((B,nr\_types,270,270)\). Through multimodal feature fusion, graph structure representation, and cell-level classification, this architecture achieves comprehensive image analysis from the pixel to the semantic level.
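For reference, the tensor shapes described above can be summarized in the following minimal PyTorch sketch; all module outputs are replaced by placeholder tensors, and the variable names are illustrative rather than taken from the actual implementation.

```python
import torch

# Shape-flow sketch of the pipeline described above; modules are placeholders.
B, N, E, nr_types = 2, 100, 500, 3          # batch, nuclei, edges, classes (illustrative)

x = torch.randn(B, 3, 270, 270)             # input pathology patches
vit_feat = torch.randn(B, 768, 17, 17)      # ViT encoder output
clip_feat = torch.randn(B, 512, 17, 17)     # frozen CLIP image-encoder output
fused = torch.randn(B, 768, 17, 17)         # multi-scale feature fusion output
seg_map = torch.randn(B, 2, 270, 270)       # decoder output: nuclear segmentation map

node_feat = torch.randn(N, 256 + 2)         # per-nucleus visual feature + normalized (x, y)
edge_index = torch.randint(0, N, (2, E))    # KNN edges between nucleus centroids
text_feat = torch.randn(N, 512)             # CLIP text features of morphology prompts
enhanced = torch.randn(N, 256)              # node features after fusing text features
logits = torch.randn(N, nr_types)           # GNN classifier output per nucleus
```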

Feature fusion and knowledge distillation

As shown in Fig. 1(a), ViT is adopted as the primary feature extractor in this study, and a pre-trained CLIP model is introduced to extract high-level semantic features. ViT is utilized to capture the global structure and contextual information of cells, while CLIP provides semantic understanding of cell functions and types. The weights of CLIP are kept fixed during the training process to ensure the stability of semantic representations. The key innovation lies in the design of a multi-scale feature fusion module, which integrates features from four different scales of ViT with those extracted by CLIP. This integration achieves a comprehensive synthesis of multi-scale visual and semantic information, thereby generating more discriminative feature representations.
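As an illustration only, the following sketch shows one way such a multi-scale fusion could be realized under assumed channel widths; the text does not specify the exact fusion operator, so simple 1×1 projections followed by concatenation are used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Illustrative fusion of four ViT feature scales with a frozen CLIP feature map.
    Channel counts and the fusion operator are assumptions, not the authors' exact design."""
    def __init__(self, vit_dims=(96, 192, 384, 768), clip_dim=512, out_dim=768):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_dim, 1) for c in vit_dims])
        self.proj_clip = nn.Conv2d(clip_dim, out_dim, 1)
        self.fuse = nn.Conv2d(out_dim * (len(vit_dims) + 1), out_dim, 1)

    def forward(self, vit_feats, clip_feat):
        h, w = clip_feat.shape[-2:]
        # project each scale to a common width and resample to the CLIP grid
        feats = [F.interpolate(p(f), size=(h, w), mode="bilinear", align_corners=False)
                 for p, f in zip(self.proj, vit_feats)]
        feats.append(self.proj_clip(clip_feat))
        return self.fuse(torch.cat(feats, dim=1))   # (B, out_dim, 17, 17)
```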

Although CLIP demonstrates excellent performance in natural image understanding, directly applying or fine-tuning CLIP for pathological images presents several challenges. Firstly, the domain gap between natural and pathological images may lead to suboptimal feature extraction. Secondly, fine-tuning on relatively small pathological datasets may result in catastrophic forgetting and overfitting. To address these challenges, a knowledge distillation method is designed to transfer knowledge from the CLIP image encoder to the ViT student encoder. Specifically, the feature layers of the CLIP image encoder and the ViT are defined as \(\mathrm{Layer}_T = \left[ L_T^1, L_T^2, \ldots, L_T^M \right]\) and \(\mathrm{Layer}_S = \left[ L_S^1, L_S^2, \ldots, L_S^M \right]\), where \(L_T^k\) is a 2D tensor from CLIP and \(L_S^k\) is a 3D tensor from the ViT. For the ViT, learnable linear projections map its features into the same dimensional space as CLIP; for CLIP, the classification token is removed and the features are reshaped. After reshaping, \(\widetilde{\mathrm{Layer}}_T\) and \(\widetilde{\mathrm{Layer}}_S\) are obtained. Finally, the similarity between the two is calculated as:

$$\begin{aligned} W = \widetilde{\mathrm{Layer}}_T \times \widetilde{\mathrm{Layer}}_S \end{aligned}$$
(1)

where W is a similarity tensor reflecting the correlation between each layer's features of the teacher network (CLIP) and the student network (ViT), thus enabling efficient knowledge distillation.
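A minimal PyTorch sketch of this layer-wise projection and similarity computation is given below; the tensor layouts (batched token sequences, with a leading class token for CLIP) and dimensions are assumptions, since the text specifies only the projections and the product in Eq. (1).

```python
import torch
import torch.nn as nn

class LayerSimilarityKD(nn.Module):
    """Sketch of the teacher-student layer similarity of Eq. (1) under assumed layouts."""
    def __init__(self, student_dims, teacher_dim):
        super().__init__()
        # learnable projections mapping ViT (student) features to CLIP's width
        self.proj = nn.ModuleList([nn.Linear(d, teacher_dim) for d in student_dims])

    def forward(self, teacher_layers, student_layers):
        sims = []
        for proj, t, s in zip(self.proj, teacher_layers, student_layers):
            # teacher: drop the classification token, keep patch tokens -> (B, N, C_t)
            t = t[:, 1:, :]
            # student: flatten spatial grid (B, C_s, H, W) -> (B, H*W, C_s), then project
            s = proj(s.flatten(2).transpose(1, 2))
            # similarity tensor between teacher and student tokens -> (B, N, H*W)
            sims.append(torch.bmm(t, s.transpose(1, 2)))
        return sims
```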

Pyramid pooling attention module

As shown in Fig. 2, applying the ViT is challenged by the high computational cost introduced by the long sequence of image tokens. This study leverages the context extraction capability of the pyramid pooling module (PPM) and incorporates it into the multi-head self-attention (MHSA) mechanism of the Vision Transformer. This integration not only reduces computational cost but also extracts richer spatial hierarchical information. The input feature map X is first reshaped into a two-dimensional spatial layout. Multiple average pooling layers of different scales are then applied to the reshaped feature map X to generate a feature pyramid, as follows:

$$\begin{aligned} \mathrm{PPA}_1&= \mathrm{AvgPool}_1\left( X \right) ,\\ \mathrm{PPA}_2&= \mathrm{AvgPool}_2\left( X \right) ,\\&\;\;\vdots \\ \mathrm{PPA}_n&= \mathrm{AvgPool}_n\left( X \right) , \end{aligned}$$
(2)

where \(\mathrm{PPA}_i\) represents the generated feature pyramid map, and n is the number of pooling layers.

Fig. 2
figure 2

Attention mechanism module based on spatial pyramid pooling for feature encoding

Furthermore, each \(\mathrm{PPA}_i\) undergoes relative position encoding, and the position-encoded feature maps are flattened and concatenated to obtain the token sequence \(\mathrm{P}\). The length of \(\mathrm{P}\) can be controlled through the pooling scales, making it shorter than the input sequence X, while \(\mathrm{P}\) still retains the contextual information of the input X and thus provides a better representation of the input data. Denoting the query, key, and value tensors in the MHSA as Q, K, and V, the traditional formulation and the improved formulation are shown in Eqs. (3) and (4), respectively:

$$\begin{aligned} \left( \mathrm{Q}, \mathrm{K}, \mathrm{V} \right) = \left( \mathrm{X}\mathrm{W}^q, \mathrm{X}\mathrm{W}^k, \mathrm{X}\mathrm{W}^v \right) \end{aligned}$$
(3)
$$\begin{aligned} \left( \mathrm{Q}, \mathrm{K}_{\mathrm{PPM}}, \mathrm{V}_{\mathrm{PPM}} \right) = \left( \mathrm{X}\mathrm{W}^q, \mathrm{P}\mathrm{W}^k, \mathrm{P}\mathrm{W}^v \right) \end{aligned}$$
(4)

Since the lengths of \(\mathrm{K}_{\mathrm{PPM}}\) and \(\mathrm{V}_{\mathrm{PPM}}\) are smaller than that of X, the improved PPA module is more efficient. Moreover, as \(\mathrm{K}_{\mathrm{PPM}}\) and \(\mathrm{V}_{\mathrm{PPM}}\) contain multi-scale information of the input image, the improved PPA module is better able to model global contextual dependencies.
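A minimal sketch of this pooling-reduced attention, following Eqs. (3) and (4), is given below; the pool sizes are illustrative, and the relative position encoding mentioned above is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingReducedAttention(nn.Module):
    """Sketch of the PPA attention: keys and values come from the pooled sequence P
    rather than the full token sequence X, shortening the attended sequence."""
    def __init__(self, dim, num_heads=8, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        self.num_heads = num_heads
        self.pool_sizes = pool_sizes
        self.q = nn.Linear(dim, dim)        # Q = X W^q
        self.kv = nn.Linear(dim, 2 * dim)   # K_PPM = P W^k, V_PPM = P W^v

    def forward(self, x, h, w):
        # x: (B, H*W, C) token sequence; reshape back to a 2-D feature map
        B, L, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, h, w)
        # multi-scale average pooling -> flatten -> concatenate into the short sequence P
        pooled = [F.adaptive_avg_pool2d(feat, s).flatten(2).transpose(1, 2)
                  for s in self.pool_sizes]
        p = torch.cat(pooled, dim=1)                         # (B, sum(s*s), C), shorter than L
        d = C // self.num_heads
        q = self.q(x).view(B, L, self.num_heads, d).transpose(1, 2)
        k, v = self.kv(p).chunk(2, dim=-1)
        k = k.view(B, -1, self.num_heads, d).transpose(1, 2)
        v = v.view(B, -1, self.num_heads, d).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)        # attention over pooled tokens
        return out.transpose(1, 2).reshape(B, L, C)
```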

Nuclear instance segmentation decoder

As shown in Fig. 3, a feature decoder for cell nucleus instance segmentation is designed to process the feature maps extracted from CLIP and ViT. The decoder adopts a dual-path architecture comprising a feature extraction backbone and feature volume fusion, with ConvNeXt as the backbone network and a multi-scale feature processing strategy. In the feature extraction backbone path, \({v_4}\) (768 channels) from the multi-scale features \({{\{v_1,v_2,v_3,v_4\}}}\) output by the ViT-CLIP fusion module serves as the primary input; its spatial information is enhanced by \({3 \times 3}\) convolution layers, and the features are normalized to 256 channels with layer normalization to stabilize training. The preprocessed features then pass through four cascaded stages. The first stage applies 3 consecutive ConvNeXt blocks (\({x_1}\)) followed by \({2\times }\) bilinear upsampling, where each ConvNeXt block consists of a \({7 \times 7}\) depthwise separable convolution (expanding the receptive field), layer normalization (feature standardization), two \({3 \times 3}\) convolution layers with GELU activation (non-linear feature transformation), learnable layer-scale parameters (adjusting the residual contribution), and a residual connection (preserving identity information). The second stage similarly applies 3 ConvNeXt blocks (\({x_2}\)) with upsampling; the third stage applies 9 ConvNeXt blocks (\({x_3}\)) to obtain a larger receptive field and richer semantic information, followed by upsampling; the final stage generates the \({x_4}\) features through 3 ConvNeXt blocks, adaptive average pooling, and layer normalization. In the feature volume fusion path, 4 feature volume blocks are employed: features are first processed by \({3 \times 3 \times 3}\) convolutions for three-dimensional spatial modeling, multi-scale spatial context is then captured through pyramid pooling, and feature resolution is finally restored through 3D upsampling. These feature volume blocks progressively process and fuse the multi-scale features \({{\{v_1,v_2,v_3,v_4\}}}\) from ViT-CLIP with the corresponding ConvNeXt outputs \({{\{x_1,x_2,x_3,x_4\}}}\) from high to low levels, achieving coarse-to-fine feature integration. Finally, the outputs of all feature volume blocks are fused through a \({1 \times 1}\) convolution to generate precise cell nucleus segmentation masks. This dual-path design fully exploits the complementarity of multi-scale features and enhances the model's perception of nuclear spatial structure through three-dimensional feature volume construction, effectively improving segmentation accuracy.
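A minimal sketch of one ConvNeXt decoder block, following the components listed above (7×7 depthwise convolution, layer normalization, two 3×3 convolutions with GELU, a learnable layer scale, and a residual connection), is given below; the channel width is illustrative.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of one decoder block as described in the text; widths are assumptions."""
    def __init__(self, dim, layer_scale_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))  # layer scale

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        # LayerNorm over channels: permute to (B, H, W, C) and back
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = self.act(self.conv1(x))
        x = self.conv2(x)
        x = self.gamma.view(1, -1, 1, 1) * x
        return residual + x
```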

Fig. 3
figure 3

Multi-stage Feature Fusion Decoder Based on ConvNeXt Module

Graph neural network classifier

As shown in Fig. 4, a cell classification method based on a Graph Neural Network (GNN) is constructed to better capture the spatial relationships and contextual information between cells. Each cell is represented as a node in the graph, with node features comprising 256-dimensional visual features extracted from the fused features and 2-dimensional normalized spatial coordinates. Edges are constructed using the K-nearest neighbor algorithm (K=5) based on the Euclidean distance between cell nucleus centers, preserving local structural information and spatial proximity. The classifier employs a two-layer Graph Convolutional Network (GCN): the first layer maps node features to 512-dimensional hidden features, and the second layer maps the hidden features to the final classification space. Each GCN layer is followed by layer normalization and a ReLU activation, enhancing the model's expressive power and training stability.
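Assuming PyTorch Geometric is available, the graph construction and two-layer GCN classifier described above can be sketched as follows; the hidden width of 512 and K=5 follow the text, while the remaining details are illustrative.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, knn_graph

class NucleusGraphClassifier(nn.Module):
    """Sketch of the two-layer GCN nucleus classifier described in the text."""
    def __init__(self, in_dim=256 + 2, hidden_dim=512, num_classes=3):
        super().__init__()
        self.gcn1 = GCNConv(in_dim, hidden_dim)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.gcn2 = GCNConv(hidden_dim, num_classes)

    def forward(self, node_feat, centroids):
        # node_feat: (N, 258) visual features + normalized (x, y); centroids: (N, 2)
        edge_index = knn_graph(centroids, k=5)        # edges to the 5 nearest nuclei
        h = torch.relu(self.norm1(self.gcn1(node_feat, edge_index)))
        return self.gcn2(h, edge_index)               # (N, num_classes) logits
```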

Fig. 4
figure 4

Nucleus Graph Construction. This includes feature maps extracted by the nuclear instance semantic segmentation branch and the inferred nuclear masks, local and spatial positional information of the nuclei, and high-level representations of the morphological information of the nuclei

Furthermore, the morphological features of each cell are transformed into a text description, such as "a label of [shape] cell nucleus located at [position] with [size] diameter", which is then input into the text encoding module of the CLIP model. Through this process, high-level semantic representations of cell nucleus node features are obtained, enabling better capture of the essential characteristics of cells and their potential biological significance.
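Assuming the OpenAI CLIP package and pre-computed per-nucleus morphology attributes, this prompt construction and text encoding step can be sketched as follows; the attribute values and their extraction from the segmentation masks are not shown.

```python
import torch
import clip  # OpenAI CLIP package, used here only for illustration

def encode_morphology_prompts(nuclei, device="cpu"):
    """Build text prompts in the form described above and encode them with CLIP's
    frozen text encoder. `nuclei` is an assumed list of attribute dictionaries."""
    model, _ = clip.load("ViT-B/32", device=device)
    prompts = [
        f"a label of {n['shape']} cell nucleus located at {n['position']} "
        f"with {n['size']} diameter"
        for n in nuclei
    ]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():                      # text encoder stays frozen
        text_feat = model.encode_text(tokens)  # (N, 512) semantic representations
    return text_feat

# usage sketch with illustrative attribute values
nuclei = [{"shape": "round", "position": "upper left", "size": "small"}]
features = encode_morphology_prompts(nuclei)
```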

Loss function

In our proposed framework, we adopt a multi-task learning strategy, simultaneously optimizing three components: pixel-level feature extraction, instance-level classification, and knowledge transfer from CLIP to ViT. To ensure that the model can learn rich texture features and accurate classification information, we combine multiple complementary loss functions during the training process. Specifically, we employ the Dice loss function (Eq. 5) and the cross-entropy loss function (Eq. 6) for instance segmentation. The Dice loss helps address class imbalance issues and improves segmentation accuracy, while the cross-entropy loss facilitates the model’s learning of discriminative features.

$$\begin{aligned} L_{\mathrm{Dice}} = 1 - \frac{2 \times \sum _{i = 1}^{H \times W}\sum _{q = 1}^Q\left( y_{i,q}^S \times x_{i,q}^S \right) + \varepsilon }{\sum _{i = 1}^{H \times W}\sum _{q = 1}^Q y_{i,q}^S + \sum _{i = 1}^{H \times W}\sum _{q = 1}^Q x_{i,q}^S + \varepsilon } \end{aligned}$$
(5)
$$\begin{aligned} {L_{CE}} = - \frac{1}{{H \times W}}\sum \limits _{i = 1}^{H \times W} {\sum \limits _{q = 1}^Q {y_{i,q}^S\log x_{i,q}^S} } \end{aligned}$$
(6)

where \({x^S}\) is the prediction map of size \(HW \times Q\) for the instance segmentation task, \({y^S}\) is the corresponding ground truth map, Q is the number of classes, and H and W are the height and width of the prediction and ground truth maps. \(\varepsilon\) is a smoothing constant, set to 1e-8.
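Under the assumption that the predictions are class probabilities of shape (B, Q, H, W) and the targets are one-hot maps of the same shape, Eqs. (5) and (6) can be sketched as:

```python
import torch

def dice_ce_loss(pred, target, eps=1e-8):
    """Sketch of Eqs. (5)-(6): pred and target are (B, Q, H, W), target is one-hot."""
    dims = (0, 2, 3)                                   # sum over batch and spatial dims
    intersection = (pred * target).sum(dims)
    dice = 1 - ((2 * intersection + eps) /
                (pred.sum(dims) + target.sum(dims) + eps)).mean()
    ce = -(target * torch.log(pred.clamp_min(eps))).sum(1).mean()
    return dice, ce
```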

In cell nucleus classification tasks, the distribution of samples across different categories is typically imbalanced. This often leads to the dominance of easily classified samples (such as low-grade tumor or non-tumor cells) in the training process. This imbalance may hinder model training, resulting in insufficient attention and learning ability for samples that are difficult to classify. To address this issue, Focal Loss is introduced:

$$\begin{aligned} {L_{Focal}} = - \frac{1}{N}\sum \limits _{i = 1}^N {\sum \limits _{q = 1}^Q {{\tau _q}{{\left( {1 - {t_{i,q}}} \right) }^\gamma }y_{i,q}^o\log {t_{i,q}}} } \end{aligned}$$
(7)

where t contains the predicted probabilities for N cell nuclei, and \({y^O}\) is the true label. \(\gamma\) is a hyperparameter used to make the network focus on hard samples. \({\tau _q}\) is the weight of each category, set as the reciprocal of the proportion of class q in the training set.
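A corresponding sketch of Eq. (7), assuming per-nucleus class probabilities and one-hot labels, is:

```python
import torch

def focal_loss(probs, target_onehot, class_weights, gamma=2.0):
    """Sketch of Eq. (7): probs (N, Q) per-nucleus class probabilities,
    target_onehot (N, Q), class_weights (Q,) set to inverse class frequencies."""
    eps = 1e-8
    weight = class_weights.unsqueeze(0) * (1 - probs) ** gamma   # tau_q * (1 - t)^gamma
    loss = -(weight * target_onehot * torch.log(probs.clamp_min(eps))).sum(1)
    return loss.mean()
```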

To facilitate effective knowledge transfer from CLIP to ViT, we introduce a knowledge distillation loss:

$$\begin{aligned} L_{KD} = \frac{1}{M}\sum _{m=1}^M ||F_t^m - F_s^m||_2 \end{aligned}$$
(8)

where \(F_t^m\) and \(F_s^m\) represent the normalized feature maps from the m-th selected layer of teacher (CLIP) and student (ViT) models respectively.

The total loss function combining all components is defined as:

$$\begin{aligned} L_{total} = \alpha (L_{Dice} + L_{CE}) + \beta L_{Focal} + \lambda L_{KD} \end{aligned}$$
(9)

where \(\alpha\), \(\beta\), and \(\lambda\) are balancing weights for different loss components. Through this comprehensive approach, our model can simultaneously optimize instance segmentation and classification performance while effectively transferring semantic knowledge from CLIP to ViT in an end-to-end training process.
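A sketch of Eqs. (8) and (9), assuming lists of normalized teacher and student feature maps and the 1:1:1 weighting reported in the ablation study, is:

```python
import torch

def kd_loss(teacher_feats, student_feats):
    """Sketch of Eq. (8): mean L2 distance between normalized feature maps of the
    M matched CLIP (teacher) and ViT (student) layers."""
    losses = [torch.linalg.norm(t - s) for t, s in zip(teacher_feats, student_feats)]
    return torch.stack(losses).mean()

def total_loss(l_dice, l_ce, l_focal, l_kd, alpha=1.0, beta=1.0, lam=1.0):
    """Sketch of Eq. (9) combining the segmentation, classification, and KD terms."""
    return alpha * (l_dice + l_ce) + beta * l_focal + lam * l_kd
```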

Experimental results

Datasets

As shown in Fig. 5, we evaluated our proposed framework on four public datasets and one self-built dataset. The public datasets include CPM-15 [13], CPM-17 [49], TNBC [50], and Kumar [13], primarily used to evaluate nuclear instance segmentation performance. Specifically, the CPM-15 dataset from the MICCAI 2015 cell segmentation challenge contains 15 images covering two types of cancer tissues; the CPM-17 dataset from the MICCAI 2017 cell segmentation challenge includes 32 training images and 32 test images, encompassing four different types of cancer tissues; the TNBC dataset focuses on triple-negative breast cancer tissue samples, containing 50 high-resolution nuclear segmentation annotated images, randomly divided into training, validation, and test sets in a 7:1:2 ratio; the Kumar dataset is a multi-organ dataset comprising 30 tissue images from 7 different organs, where each organ provides multiple high-power field (40\(\times\)) H&E stained images, evaluated using 5-fold cross-validation.

Fig. 5
figure 5

Samples of cropped nucleus instance segmentation public dataset

Additionally, as shown in Fig. 6, we selected 50 representative samples from the Whole Slide Images (WSI) of the TCGA Muscle-Invasive Bladder Cancer cohort. These samples were processed into 1024\(\times\)1024 pixel patches, from which 500 high-quality patches were carefully selected for nuclear classification annotation. The annotation was performed by experienced pathologists, categorizing nuclei into three classes: tumor cell nuclei, stromal cell nuclei, and immune cell nuclei. To ensure annotation accuracy and consistency, standardized annotation tools and protocols were employed, along with a double-review mechanism. These 500 patches were randomly divided into training (300 images), validation (100 images), and test (100 images) sets in a 6:2:2 ratio, ensuring balanced representation of each nuclear class across all sets.

Fig. 6
figure 6

Samples of cropped nucleus classification dataset

Implementation details

Our framework was evaluated on two tasks: nuclear segmentation using four public datasets (CPM-15, CPM-17, TNBC, and Kumar) and nuclear classification using our self-built TCGA-based cell classification dataset. All images were uniformly resized to \(1024\times 1024\) pixels. Experiments were conducted on a high-performance workstation equipped with 4 NVIDIA RTX 4090 GPUs (24GB memory each) and an Intel Xeon Platinum processor, using the PyTorch 2.1.0 framework. The model architecture consists of a ViT and CLIP-based feature extraction backbone, along with nuclear segmentation and classification modules. A multi-task learning strategy was adopted, combining Dice loss, cross-entropy loss, and Focal loss to optimize the model and effectively address class imbalance. Training used the AdamW optimizer with an initial learning rate of \(1e\text {-}4\), weight decay of 0.01, and a cosine annealing learning rate schedule with 5 epochs of warmup. The model was trained in parallel across 4 GPUs using DistributedDataParallel, with a batch size of 16 per GPU (total batch size 64) for 100 epochs. To enhance model generalization, comprehensive data augmentation strategies were implemented, including random flipping, rotation (\(\pm 90\) degrees), color jittering (brightness, contrast, and saturation all set to 0.2), random scaling (\(0.8\text {-}1.2\) times), and random cropping (\(768\times 768\)). Additionally, mixed precision training (AMP) and gradient accumulation were employed to improve training efficiency, with the complete training process taking approximately 9 hours.
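The optimizer and learning-rate schedule described above can be sketched as follows; the warmup implementation and step counts are assumptions, since only "cosine annealing with 5 epochs of warmup" is stated.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

def build_optimizer(model, epochs=100, steps_per_epoch=100):
    """Sketch of the reported training setup: AdamW (lr 1e-4, weight decay 0.01),
    5-epoch linear warmup followed by cosine annealing, and AMP via GradScaler."""
    optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5 * steps_per_epoch)
    cosine = CosineAnnealingLR(optimizer, T_max=(epochs - 5) * steps_per_epoch)
    scheduler = SequentialLR(optimizer, [warmup, cosine],
                             milestones=[5 * steps_per_epoch])
    scaler = torch.cuda.amp.GradScaler()   # mixed-precision training (AMP)
    return optimizer, scheduler, scaler
```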

Evaluation metrics

Multiple metrics were adopted to comprehensively evaluate the model's performance on the nuclear segmentation and classification tasks. For nuclear segmentation, three metrics were used: Ensemble Dice (DICE2) [49], Aggregated Jaccard Index (AJI) [13], and Panoptic Quality (PQ) [51], which respectively assess overall segmentation accuracy, instance-level segmentation quality, and a combined measure of detection and segmentation quality. In the nuclear classification task, cells were categorized into three types (immune cells, stromal cells, and tumor cells), with the F-score (Fc) [52] used as the primary evaluation metric. Fc accounts for both detection and classification performance: predicted nuclei are paired with ground truth nuclei; correctly detected, undetected, and incorrectly detected nuclei are counted; and classification accuracy among the correctly detected nuclei is evaluated, ultimately yielding an F-score for each cell type.
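For reference, the following sketch computes Panoptic Quality from lists of ground-truth and predicted instance masks using the standard definition (matches require IoU > 0.5); the exact matching code used in the evaluation may differ.

```python
import numpy as np

def panoptic_quality(gt_masks, pred_masks, iou_thresh=0.5):
    """Sketch of instance-level Panoptic Quality: sum of matched IoUs divided by
    (TP + 0.5*FP + 0.5*FN). Inputs are lists of boolean masks of equal image size."""
    matched_pred, iou_sum, tp = set(), 0.0, 0
    for g in gt_masks:
        for j, p in enumerate(pred_masks):
            if j in matched_pred:
                continue
            inter = np.logical_and(g, p).sum()
            union = np.logical_or(g, p).sum()
            iou = inter / union if union > 0 else 0.0
            if iou > iou_thresh:            # IoU > 0.5 matches are unique
                matched_pred.add(j)
                iou_sum += iou
                tp += 1
                break
    fp = len(pred_masks) - tp
    fn = len(gt_masks) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom > 0 else 0.0
```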

Comparison with other methods

Nuclear segmentation

The proposed method was compared with four existing advanced methods: HoVer-Net [53], Mask2former [54], Triple U-net [55], and TSFD-net [56]. Table 1 presents the performance of our proposed method on four datasets: Kumar, CPM-15, CPM-17, and TNBC. The results show that the proposed method performs strongly across multiple evaluation metrics. On the Kumar dataset, the proposed method ranked first in three key metrics (DICE, AJI, and PQ), with a notable improvement of about 1 percentage point in PQ over the second-best solution, highlighting its advantage in accurately identifying and segmenting individual cells. For the CPM-15 and CPM-17 datasets, the proposed method performed most prominently in the AJI and PQ metrics; although slightly inferior to Mask2former in DICE, it still maintained a lead in instance-level segmentation and identification. On the TNBC dataset, the proposed method outperformed all comparative methods across all metrics, achieving an improvement of about 1 percentage point in PQ and demonstrating its adaptability and robustness when dealing with highly complex and heterogeneous pathological images.

Table 1 Quantitative comparison of existing methods and our method on nuclear segmentation

To rigorously validate the statistical significance of these improvements, we conducted a detailed analysis with 95% confidence intervals, as shown in Table  2. The analysis reveals that our method’s superior performance is statistically significant across all datasets. Specifically, on the Kumar dataset, our method achieved a DICE score of 0.839 (95% CI: 0.832\(-\)0.846), which is significantly higher than the second-best method Mask2former’s 0.831 (95% CI: 0.824\(-\)0.838). On the CPM-15 &17 dataset, our method achieved 0.827 (95% CI: 0.820\(-\)0.834) for DICE and 0.749 (95% CI: 0.742\(-\)0.756) for PQ, consistently outperforming other methods. The most notable improvements were observed in the TNBC dataset, where our method reached a DICE score of 0.850 (95% CI: 0.843\(-\)0.857) and PQ of 0.780 (95% CI: 0.773\(-\)0.787), showing clear statistical superiority over the second-best performer Mask2former (DICE: 0.842, 95% CI: 0.835\(-\)0.849; PQ: 0.772, 95% CI: 0.765\(-\)0.779). This statistical analysis further strengthens our findings by demonstrating that the performance improvements are not due to random variation but represent genuine advances in segmentation capability.

Table 2 Statistical significance analysis of nuclear segmentation results with 95% confidence intervals for different methods across three datasets

As shown in Fig. 7, which displayed some of the nuclear segmentation results, the proposed method performed excellently in handling complex cellular structures and densely distributed nuclei. Particularly when dealing with overlapping cells and blurred boundaries, the proposed method could more accurately identify and segment individual cell nuclei. Compared to other methods, the segmentation results of the proposed method were closer to the ground truth, with clearer boundaries and better preservation of cell nuclear shapes. This advantage was especially evident when processing densely distributed cell areas, where the proposed method could effectively distinguish adjacent cell nuclei, reducing instances of over-segmentation and under-segmentation. Furthermore, while maintaining the overall morphology of cell nuclei, the proposed method could also capture the subtle structural features of the nuclei, which was crucial for subsequent classification tasks.

Fig. 7
figure 7

Visualization of nuclear instance segmentation results

Nuclear classification

As shown in Table  3, the proposed method was compared with eight existing advanced methods on the cell classification task, including HoVer-Net [53], Mask2former [54], UNet++ [57], Triple U-net [55], DeepLabV3 [58], TSFD-net [56], PSPNet [59], and SegNet [60]. F-score (F) was used as the evaluation metric, including overall F-score (\(F_{\text {c}\_\text {avg}}\)) and F-scores for immune cells (\(F_{\text {immune}}\)), stromal cells (\(F_{\text {stroma}}\)), and tumor cells (\(F_{\text {tumor}}\)). The experimental results demonstrated that the proposed method achieved good performance across all four metrics. For the overall F-score, the proposed method reached 0.858, about 2.3 percentage points higher than the second-best method, Mask2former (0.835). This advantage was consistently reflected in the classification of various cell types, with F-scores of 0.822, 0.875, and 0.877 for immune cells, stromal cells, and tumor cells, respectively, all significantly outperforming other methods. Other methods such as HoVer-Net and TSFD-net also showed good performance but were slightly inferior to the proposed method in all metrics. Notably, the proposed method made significant progress in handling immune cells, achieving an F-score of 0.822, 2.4 percentage points higher than the second-best method. These results proved the effectiveness and superiority of the proposed method in cell classification tasks, not only excelling in overall performance but also maintaining consistently high performance in handling different types of cells, demonstrating good generalization ability and adaptability to different cell types.

Table 3 Quantitative comparison of existing methods and our method on nuclear classification using TCGA-based Cell Classification Dataset

To further validate the statistical significance of these improvements, we conducted a detailed analysis with 95% confidence intervals (Table  4). The analysis confirms that our method’s superior performance is statistically significant across all metrics. For the overall F-score (\(F_{\text {c}\_\text {avg}}\)), our method’s confidence interval (0.851\(-\)0.865) shows no overlap with the second-best method Mask2former (0.828\(-\)0.842), indicating a statistically significant improvement. Similar patterns were observed in cell-type specific metrics, particularly for immune cell classification where our method achieved 0.822 (0.815\(-\)0.829), significantly outperforming the second-best result of 0.798 (0.791\(-\)0.805) from Mask2former. The statistical superiority is also evident in stromal and tumor cell classification, with our method achieving 0.875 (0.868\(-\)0.882) and 0.877 (0.870\(-\)0.884) respectively, consistently outperforming Mask2former’s corresponding scores of 0.852 (0.845\(-\)0.859) and 0.855 (0.848\(-\)0.862), providing strong statistical evidence for the claimed improvements across all cell types.

Table 4 Statistical analysis of nuclear classification performance with 95% confidence intervals on TCGA-based Cell Classification Dataset

Figure 8 visually demonstrates the performance of several top-performing methods on the cell nucleus classification task. Color coding was used to distinguish different types of cell nuclei: red represented immune cells, green represented stromal cells, and blue represented tumor cells. Visually, it was evident that the proposed method outperformed other methods in terms of accuracy and consistency in nuclear classification, especially when dealing with complex tissue structures. Compared to the ground truth, it showed a higher degree of similarity.

Fig. 8
figure 8

Visualization of nuclear classification results

Performance metrics and resource consumption

To evaluate computational efficiency, we selected representative methods that have demonstrated excellence in both segmentation and classification tasks for comparative testing. These methods include the classic end-to-end framework HoVer-Net [53], as well as the recently high-performing Mask2former [54] and Triple U-net [55]. As shown in Table 5, for end-to-end nuclear segmentation and classification tasks, our MSA-Net requires only \(45.6\text {ms}\) inference time and \(8.4\text {GB}\) GPU memory when processing \(270\times 270\) pixel images with a batch size of 16. In comparison, HoVer-Net requires \(82.3\text {ms}\) inference time and \(11.2\text {GB}\) memory consumption under the same configuration. Our method also outperforms Mask2former (\(63.7\text {ms}\), \(9.8\text {GB}\)) and Triple U-net (\(58.9\text {ms}\), \(10.5\text {GB}\)). All experiments were conducted on a single NVIDIA RTX 4090 GPU, with inference times averaged over 1000 runs to ensure measurement reliability. These results demonstrate that our proposed architecture achieves competitive computational efficiency while maintaining high segmentation and classification accuracy.

Table 5 Comparison of inference speed and GPU memory consumption for end-to-end nuclear segmentation and classification

Ablation study

We conducted comprehensive ablation experiments on the nuclear segmentation task across all four datasets (Kumar, CPM-15, CPM-17, and TNBC) to investigate the roles of feature fusion and knowledge distillation components, with results averaged and shown in Table 6. Regarding feature fusion, the transition from single-scale to multi-scale features brought significant improvements, with the DICE metric increasing from 0.805 to 0.825, confirming the effectiveness of multi-scale feature fusion. For knowledge distillation, the results demonstrate its crucial role in model performance - without knowledge distillation, the DICE score was 0.810, while incorporating our proposed CLIP-based knowledge distillation strategy achieved better performance across all evaluation metrics (DICE: 0.830, AJI: 0.670, PQ: 0.750). These results validate the effectiveness of both our multi-scale feature fusion strategy and knowledge distillation approach in enhancing model performance.

Table 6 Ablation Study of Feature Fusion and Knowledge Distillation (Averaged Results across Kumar, CPM-15, CPM-17, and TNBC Datasets)

To validate the adaptability of the CLIP model in the pathological image domain, we compared the effects of pre-training using natural images alone versus incorporating pathological images (as shown in Table 7). The results demonstrate that introducing pathological images for pre-training significantly improves model performance, with approximately 0.015 improvement across all metrics (DICE increased from 0.815 to 0.830, AJI from 0.655 to 0.670, and PQ from 0.735 to 0.750). These results confirm the importance of domain-adaptive pre-training, indicating that the CLIP model can better understand nuclear visual features through additional pathological image pre-training, thus providing more suitable feature representations for downstream tasks. This pre-training strategy effectively mitigates the domain gap between natural and pathological images.

Table 7 Impact of CLIP Pretraining Data (Averaged Results across Kumar, CPM-15, CPM-17, and TNBC Datasets)

To thoroughly evaluate our method’s performance on cell type classification, we conducted a series of experiments on our custom cell nucleus classification dataset based on TCGA. We first investigated different CLIP adaptation strategies, followed by detailed ablation studies on feature configurations and loss components. As shown in Table 8, different strategies for direct CLIP application demonstrate notable limitations. CLIP’s zero-shot transfer performs the poorest, with an average F-score of only 0.730, reflecting significant domain differences between natural and pathological images. While directly fine-tuning CLIP and linear probing show better results than zero-shot, their performance improvements are limited (0.785 and 0.790 respectively), indicating that simple adaptation strategies struggle to effectively overcome domain gaps. In contrast, our proposed knowledge distillation approach achieves superior performance across all metrics (\({\mathrm{{F}}_{\text {c}\_\text {avg}}}\)=0.858, \({\mathrm{{F}}_{\text {tumor}}}\)=0.877, \({\mathrm{{F}}_{\text {stroma}}}\)=0.875, \({\mathrm{{F}}_{\text {immune}}}\)=0.822). These results confirm that our method not only successfully avoids the issues of direct CLIP usage but also effectively leverages CLIP’s visual understanding capabilities to enhance pathological image analysis.

Table 8 Comparison of Different CLIP Adaptation Strategies on TCGA-based Cell Classification Dataset

To verify the impact of different input features on model performance, we conducted systematic comparative experiments on node feature configurations (as shown in Table 9). The basic visual features derived from multi-scale feature fusion achieved an average classification F1 score of 0.840. After incorporating spatial features (normalized x,y coordinates), the performance slightly improved (\({\mathrm{{F}}_{\text {c}\_\text {avg}}}\) increased to 0.845), indicating that positional information provides some benefit for nuclear classification. When CLIP text features were introduced, model performance improved significantly (\({\mathrm{{F}}_{\text {c}\_\text {avg}}}\) reached 0.850), suggesting that semantic information extracted from nuclear morphology descriptions effectively guides the classification task. Finally, the complete scheme combining visual features, spatial features, and CLIP text features achieved optimal performance (\({\mathrm{{F}}_{\text {c}\_\text {avg}}}\) reached 0.858), showing particularly strong performance in identifying stromal cells (improvement of 0.045) and tumor cells (improvement of 0.037). These results confirm the complementarity of multimodal features while also demonstrating that the CLIP text encoder can effectively transform nuclear morphological descriptions into meaningful feature representations.

Table 9 Ablation Study on Node Feature Configurations on TCGA-based Cell Classification Dataset

We conducted systematic ablation experiments on different components of the loss function and key hyperparameters (as shown in Table 10). The experimental results show that when using only the Dice loss, the model achieves baseline performance (\({\mathrm{{F}}_{\text {c}\_\text {avg}}}=0.820\)); after adding cross-entropy loss, the performance improves significantly (\({\mathrm{{F}}_{\text {c}\_\text {avg}}}=0.835\)), confirming the complementarity of the two loss functions; with the introduction of Focal Loss, model performance further improves (\({\mathrm{{F}}_{\text {c}\_\text {avg}}}=0.845\)), showing particularly significant effects when handling difficult-to-classify tumor cells. Finally, after adding knowledge distillation loss, the model achieves optimal performance (\({\mathrm{{F}}_{\text {c}\_\text {avg}}}=0.858\)), validating the effectiveness of CLIP semantic knowledge transfer. Regarding hyperparameters, the focusing parameter \(\gamma =2.0\) of Focal Loss achieved the best results, effectively balancing the contributions of easy and hard samples. For loss function weights, we found that a balanced ratio of 1:1:1 for segmentation loss (\(\alpha\)), classification loss (\(\beta\)), and knowledge distillation loss (\(\lambda\)) works best, ensuring balanced optimization of segmentation accuracy and classification performance while fully utilizing the semantic enhancement from knowledge distillation. The experimental results thoroughly validate the effectiveness of our proposed multi-task loss function design.

Table 10 Ablation Study on Loss Function Components and Weights on TCGA-based Cell Classification Dataset

Discussion

This paper proposes a novel multimodal fusion method for cell nucleus segmentation and classification tasks, combining the CLIP model with visual transformers. Our multi-scale feature fusion module integrates visual information from different spatial scales and high-level semantic features, aiming to enhance the understanding of cell morphology and context. This fusion strategy attempts to address the challenges of fine-grained feature extraction and semantic understanding in biomedical image analysis.

Our proposed multi-task decoder design is another significant innovation. By simultaneously performing nucleus segmentation, boundary detection, and type prediction, our method provides comprehensive cell analysis results. This multi-task learning strategy not only improves computational efficiency but also allows knowledge transfer between different tasks, thereby enhancing overall performance. Notably, we map the classification results of GNN back to the image space, ensuring spatial consistency of classification results while preserving rich contextual information. The proposed multi-task decoder not only improves model performance but also enhances its practicality, supporting a wide range of biomedical research needs from basic cell morphology studies to complex disease diagnosis and drug screening. Through extensive experiments, we found that despite CLIP’s powerful visual understanding capabilities, directly fine-tuning it for pathological images is not the optimal choice. Our knowledge distillation framework successfully addresses this challenge by transferring useful visual features while avoiding the pitfalls of direct fine-tuning.

A novel cell graph construction method is designed to transform complex cell images into structured graph data. This representation method not only preserves the features of individual cells but also captures spatial relationships and contextual information between cells. By representing each cell nucleus as a node in the graph and constructing edge connections using the K-nearest neighbor algorithm, our method effectively simulates interactions within cell populations. Based on the constructed cell graph, we designed a two-layer graph convolutional network (GCN) classifier. This graph-based classification method allows the model to consider local and global contextual information of cells when making classification decisions. Through message passing on the graph structure, each cell node can aggregate information from neighboring cells, thus making classification decisions in a broader context.

Through systematic experiments and analysis, our method achieves strong performance on multiple public datasets. In the nucleus segmentation task, it outperforms existing methods on the Kumar, CPM-15, CPM-17, and TNBC datasets in terms of the Dice, AJI, and PQ metrics, and is particularly strong on the challenging TNBC dataset. In the nucleus classification task, it achieves competitive performance in overall F-score as well as in the per-class F-scores for immune, stromal, and tumor cells. These results demonstrate the effectiveness of the proposed method for cell nucleus segmentation and classification.

Conclusions

This paper presents an innovative multi-modal structure encoding framework for addressing the challenging task of automatic cell nucleus segmentation and classification in H&E stained multi-organ tissue pathology images. Firstly, we designed a novel feature fusion and knowledge distillation module that combines the advantages of ViT and the CLIP model, significantly enhancing the model’s ability to understand cell morphology and semantic information. Although our method effectively leverages CLIP’s knowledge through distillation, there may still be room for improvement in bridging the domain gap between natural and pathological images. Future work could explore more sophisticated distillation strategies. Secondly, we developed a cell graph construction module that effectively captures spatial relationships and contextual information between cells through graph neural networks, providing rich structured representations for nuclear classification. Thirdly, we proposed a multi-task decoder that simultaneously performs nuclear segmentation, boundary detection, and type prediction, offering comprehensive cell analysis results. Experimental results demonstrate that our method significantly outperforms existing approaches in nuclear segmentation and classification tasks across multiple public datasets, exhibiting excellent performance and good generalization ability. In the future, we plan to extend this framework into a unified graph-based nuclear detection and classification model and apply it to pathological analysis of more cancer types and different organs, further validating its potential and value in clinical practice.

Data availability

Not applicable.

References

  1. Basu A, Senapati P, Deb M, Rai R, Dhal KG. A survey on recent trends in deep learning for nucleus segmentation from histopathology images. Evol Syst. 2024;15(1):203–48.

  2. Morales S, Engan K, Naranjo V. Artificial intelligence in computational pathology-challenges and future directions. Digital Signal Process. 2021;119: 103196.

  3. Huang Y, Yang X, Liu L, Zhou H, Chang A, Zhou X, Chen R, Yu J, Chen J, Chen C. Segment anything model for medical images? Med Image Anal. 2024;92: 103061.

  4. Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D. Swin unetr: swin transformers for semantic segmentation of brain tumors in mri images. In: International MICCAI brainlesion workshop, 2021;272–84.

  5. Zhou Y, Graham S, Alemi Koohbanani N, Shaban M, Heng P-A, Rajpoot N. Cgc-net: cell graph convolutional network for grading of colorectal cancer histology images. In: Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019.

  6. Atabansi CC, Nie J, Liu H, Song Q, Yan L, Zhou X. A survey of transformer applications for histopathological image analysis: new developments and future directions. Biomed Eng Online. 2023;22(1):96.

  7. Mo Y, Han C, Liu Y, Liu M, Shi Z, Lin J, Zhao B, Huang C, Qiu B, Cui Y. Hover-trans: anatomy-aware hover-transformer for roi-free breast cancer diagnosis in ultrasound images. IEEE Trans Med Imaging. 2023;42(6):1696–706.

  8. Park H-C, Ghimire R, Poudel S, Lee S-W. Deep learning for joint classification and segmentation of histopathology image. J Internet Technol. 2022;23(4):903–10.

  9. Hafner M, Katsantoni M, Köster T, Marks J, Mukherjee J, Staiger D, Ule J, Zavolan M. Clip and complementary methods. Nat Rev Methods Primers. 2021;1(1):1–23.

  10. Lee HH, Gu Y, Zhao T, Xu Y, Yang J, Usuyama N, Wong C, Wei M, Landman BA, Huo Y et al. Foundation models for biomedical image segmentation: a survey. arXiv preprint arXiv:2401.07654 2024.

  11. Majanga V, Mnkandla E. Automatic watershed segmentation of cancerous lesions in unsupervised breast histology images. Appl Sci. 2024;14(22):10394.

  12. Kaushal C, Singla A. Automated segmentation technique with self-driven post-processing for histopathological breast cancer images. CAAI Trans Intell Technol. 2020;5(4):294–300.

  13. Kumar N, Verma R, Sharma S, Bhargava S, Vahadane A, Sethi A. A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Trans Med Imaging. 2017;36(7):1550–60.

  14. Veta M, Van Diest PJ, Kornegoor R, Huisman A, Viergever MA, Pluim JP. Automatic nuclei segmentation in H&E stained breast cancer histopathology images. PLoS ONE. 2013;8(7):70221.

  15. Wienert S, Heim D, Saeger K, Stenzinger A, Beil M, Hufnagl P, Dietel M, Denkert C, Klauschen F. Detection and segmentation of cell nuclei in virtual microscopy images: a minimum-model approach. Sci Rep. 2012;2(1):503.

  16. Zhang J, Xiong H, Jin Q, Feng T, Ma J, Xuan P, Cheng P, Ning Z, Ning Z, Li C. A multi-information dual-layer cross-attention model for esophageal fistula prognosis. In: International conference on medical image computing and computer-assisted intervention, 2024;25–35.

  17. Jin Q, Cui H, Sun C, Huang J, Xuan P, Xu Y, Wang L, Cao L, Wei L, Su R. Shape-aware contrastive deep supervision for esophageal tumor segmentation from ct scans. In: 2023 IEEE international conference on bioinformatics and biomedicine (BIBM), 2023;1188–93.

  18. Han Y, Lei Y, Shkolnikov V, Xin D, Auduong A, Barcelo S, Allebach J, Delp EJ. An ensemble method with edge awareness for abnormally shaped nuclei segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023;4315–25.

  19. Jin Q, Cui H, Sun C, Song Y, Zheng J, Cao L, Wei L, Su R. Inter-and intra-uncertainty based feature aggregation model for semi-supervised histopathology image segmentation. Expert Syst Appl. 2024;238: 122093.

  20. Li Z, Tang Z, Hu J, Wang X, Jia D, Zhang Y. Nst: a nuclei segmentation method based on transformer for gastrointestinal cancer pathological images. Biomed Signal Process Control. 2023;84: 104785.

  21. Jin Q, Cui H, Sun C, Zheng J, Wei L, Fang Z, Meng Z, Su R. Semi-supervised histological image segmentation via hierarchical consistency enforcement. In: International conference on medical image computing and computer-assisted intervention, 2022;3–13.

  22. Imtiaz T, Fattah SA, Kung S-Y. Bawgnet: boundary aware wavelet guided network for the nuclei segmentation in histopathology images. Comput Biol Med. 2023;165: 107378.

  23. Bagchi A, Pramanik P, Sarkar R. A multi-stage approach to breast cancer classification using histopathology images. Diagnostics. 2022;13(1):126.

  24. Javed S, Mahmood A, Dias J, Werghi N, Rajpoot N. Spatially constrained context-aware hierarchical deep correlation filters for nucleus detection in histology images. Med Image Anal. 2021;72: 102104.

  25. Wang X, Yang S, Zhang J, Wang M, Zhang J, Yang W, Huang J, Han X. Transformer-based unsupervised contrastive learning for histopathological image classification. Med Image Anal. 2022;81: 102559.

  26. Mi W, Li J, Guo Y, Ren X, Liang Z, Zhang T, Zou H. Deep learning-based multi-class classification of breast digital pathology images. Cancer Manag Res. 2021;10:4605–17.

  27. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J. Learning transferable visual models from natural language supervision. In: International conference on machine learning, 2021;8748–63.

  28. Wang Z, Wu Z, Agarwal D, Sun J. Medclip: contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163 2022.

  29. Liu J, Zhang Y, Chen J-N, Xiao J, Lu Y, Landman AB, Yuan Y, Yuille A, Tang Y, Zhou Z. Clip-driven universal model for organ segmentation and tumor detection. Proceedings of the IEEE/CVF international conference on computer vision. 2023;21152–64.

  30. Javed S, Mahmood A, Ganapathi II, Dharejo FA, Werghi N, Bennamoun M. Cplip: zero-shot learning for histopathology with comprehensive vision-language alignment. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024;11450–9.

  31. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 2016.

  32. Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs. Advances in neural information processing systems 30 2017.

  33. Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y. Graph attention networks. arXiv preprint arXiv:1710.10903 2017.

  34. Bahade S, Edwards M, Xie X. Graph convolution networks for cell segmentation. In: ICPRAM, 2021;620–7.

  35. Chan TH, Cendra FJ, Ma L, Yin G, Yu L. Histopathology whole slide image analysis with heterogeneous graph representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023;15661–70.

  36. Zhao S, Yan C-Y, Lv H, Yang J-C, You C, Li Z-A, Ma D, Xiao Y, Hu J, Yang W-T. Deep learning framework for comprehensive molecular and prognostic stratifications of triple-negative breast cancer. Fundam Res. 2024;4(3):678–89.

  37. Yang Z, Zhang Y, Zhuo L, Sun K, Meng F, Zhou M, Sun J. Prediction of prognosis and treatment response in ovarian cancer patients from histopathology images using graph deep learning: a multicenter retrospective study. Eur J Cancer. 2024;199: 113532.

  38. Zhao S, Chen D-P, Fu T, Yang J-C, Ma D, Zhu X-Z, Wang X-X, Jiao Y-P, Jin X, Xiao Y. Single-cell morphological and topological atlas reveals the ecosystem diversity of human breast cancer. Nat Commun. 2023;14(1):6796.

  39. Hu J, Wang S-G, Hou Y, Chen Z, Liu L, Li R, Li N, Zhou L, Yang Y, Wang L. Multi-omic profiling of clear cell renal cell carcinoma identifies metabolic reprogramming associated with disease progression. Nat Genet. 2024;56(3):442–57.

  40. Li R, Ferdinand JR, Loudon KW, Bowyer GS, Laidlaw S, Muyas F, Mamanova L, Neves JB, Bolt L, Fasouli ES. Mapping single-cell transcriptomes in the intra-tumoral and associated territories of kidney cancer. Cancer Cell. 2022;40(12):1583–99.

  41. Braun DA, Street K, Burke KP, Cookmeyer DL, Denize T, Pedersen CB, Gohil SH, Schindler N, Pomerance L, Hirsch L. Progressive immune dysfunction with advancing disease stage in renal cell carcinoma. Cancer Cell. 2021;39(5):632–48.

  42. Obradovic A, Chowdhury N, Haake SM, Ager C, Wang V, Vlahos L, Guo XV, Aggen DH, Rathmell WK, Jonasch E. Single-cell protein activity analysis identifies recurrence-associated renal tumor macrophages. Cell. 2021;184(11):2988–3005.

  43. Jackson HW, Fischer JR, Zanotelli VR, Ali HR, Mechera R, Soysal SD, Moch H, Muenst S, Varga Z, Weber WP. The single-cell pathology landscape of breast cancer. Nature. 2020;578(7796):615–20.

  44. Kather JN, Pearson AT, Halama N, Jäger D, Krause J, Loosen SH, Marx A, Boor P, Tacke F, Neumann UP. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat Med. 2019;25(7):1054–6.

  45. Bulten W, Pinckaers H, Boven H, Vink R, Bel T, Ginneken B, Laak J, Kaa C, Litjens G. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. Lancet Oncol. 2020;21(2):233–41.

  46. AbdulJabbar K, Raza SEA, Rosenthal R, Jamal-Hanjani M, Veeriah S, Akarca A, Lund T, Moore DA, Salgado R, Al Bakir M. Geospatial immune variability illuminates differential evolution of lung adenocarcinoma. Nat Med. 2020;26(7):1054–62.

  47. Schürch CM, Bhate SS, Barlow GL, Phillips DJ, Noti L, Zlobec I, Chu P, Black S, Demeter J, McIlwain DR. Coordinated cellular neighborhoods orchestrate antitumoral immunity at the colorectal cancer invasive front. Cell. 2020;182(5):1341–59.

  48. Yu W, Zhou P, Yan S, Wang X. Inceptionnext: when inception meets convnext. In: Proceedings of the IEEE/cvf conference on computer vision and pattern recognition. 2024;5672–83.

  49. Vu QD, Graham S, Kurc T, To MNN, Shaban M, Qaiser T, Koohbanani NA, Khurram SA, Kalpathy-Cramer J, Zhao T. Methods for segmentation and classification of digital microscopy tissue images. Front Bioeng Biotechnol. 2019;7: 433738.

  50. Naylor P, Laé M, Reyal F, Walter T. Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Trans Med Imaging. 2018;38(2):448–59.

  51. Kirillov A, He K, Girshick R, Rother C, Dollár P. Panoptic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019;9404–13.

  52. Taha AA, Hanbury A. Metrics for evaluating 3d medical image segmentation: analysis, selection, and tool. BMC Med Imaging. 2015;15:1–28.

  53. Graham S, Vu QD, Raza SEA, Azam A, Tsang YW, Kwak JT, Rajpoot N. Hover-net: simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med Image Anal. 2019;58: 101563.

  54. Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R. Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022;1290–9.

  55. Zhao B, Chen X, Li Z, Yu Z, Yao S, Yan L, Wang Y, Liu Z, Liang C, Han C. Triple u-net: hematoxylin-aware nuclei segmentation with progressive dense feature aggregation. Med Image Anal. 2020;65: 101786.

  56. Ilyas T, Mannan ZI, Khan A, Azam S, Kim H, De Boer F. Tsfd-net: tissue specific feature distillation network for nuclei segmentation and classification. Neural Netw. 2022;151:1–15.

  57. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. Unet++: a nested u-net architecture for medical image segmentation. In: Deep learning in medical image analysis and multimodal learning for clinical decision support: 4th international workshop, DLMIA 2018, and 8th international workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, pp. 3–11 (2018).

  58. Yurtkulu SC, Şahin YH, Unal G. Semantic segmentation with extended deeplabv3 architecture. In: 2019 27th signal processing and communications applications conference (SIU), 2019;1–4.

  59. Zhou J, Hao M, Zhang D, Zou P, Zhang W. Fusion pspnet image segmentation based method for multi-focus image fusion. IEEE Photonics J. 2019;11(6):1–12.

  60. Badrinarayanan V, Kendall A, Cipolla R. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(12):2481–95.

Acknowledgements

Not applicable.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 52475040, and in part by the Research Project on Teaching Reform in Ordinary Undergraduate Universities in Hunan Province under Grant 202401000364.

Author information

Authors and Affiliations

Contributions

BG conceived and designed the study, performed the experiments, and drafted the manuscript. GC contributed to data analysis and interpretation. ZW assisted with the experiments and data collection. JL and BY supervised the project, provided critical feedback, and revised the manuscript. All authors reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Jianmin Li or Bo Yi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Guan, B., Chu, G., Wang, Z. et al. Instance-level semantic segmentation of nuclei based on multimodal structure encoding. BMC Bioinformatics 26, 42 (2025). https://doi.org/10.1186/s12859-025-06066-8

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-025-06066-8

Keywords