MDFGNN-SMMA: prediction of potential small molecule-miRNA associations based on multi-source data fusion and graph neural networks

Li, Jianwei; Zhang, Xukun; Li, Bing; Li, Ziyu; Chen, Zhenzhen

doi:10.1186/s12859-025-06040-4

Research
Open access
Published: 13 January 2025

BMC Bioinformatics volume 26, Article number: 13 (2025)

Abstract

Background

MicroRNAs (miRNAs) are pivotal in the initiation and progression of complex human diseases and have been identified as targets for small molecule (SM) drugs. However, the expensive and time-intensive characteristics of conventional experimental techniques for identifying SM-miRNA associations highlight the necessity for efficient computational methodologies in this field.

Results

In this study, we proposed a deep learning method called Multi-source Data Fusion and Graph Neural Networks for Small Molecule-MiRNA Association (MDFGNN-SMMA) to predict potential SM-miRNA associations. Firstly, MDFGNN-SMMA extracted features of Atom Pairs fingerprints and Molecular ACCess System fingerprints to derive fusion feature vectors for small molecules (SMs). The K-mer features were employed to generate the initial feature vectors for miRNAs. Secondly, cosine similarity measures were computed to construct the adjacency matrices for SMs and miRNAs, respectively. Thirdly, these feature vectors and adjacency matrices were input into a model comprising GAT and GraphSAGE, which were utilized to generate the final feature vectors for SMs and miRNAs. Finally, the averaged final feature vectors were utilized as input for a multilayer perceptron to predict the associations between SMs and miRNAs.

Conclusions

The performance of MDFGNN-SMMA was assessed using 10-fold cross-validation, and the model outperformed four state-of-the-art models in terms of both AUC and AUPR. Moreover, the experimental results on an independent test set confirmed the model’s generalization capability. Additionally, the efficacy of MDFGNN-SMMA was substantiated through three case studies. Among the top 50 predicted miRNAs associated with Cisplatin, 5-Fluorouracil, and Doxorubicin, 42, 36, and 36 miRNAs, respectively, were corroborated by existing literature and the RNAInter database.

Background

MicroRNAs (miRNAs) represent a class of endogenous small non-coding RNAs (ncRNAs) that are approximately 22 nucleotides in length [1]. A substantial body of research has established a strong correlation between the dysregulated expression of miRNAs and the initiation and progression of various diseases [2]. In oncology research, nearly 50% of the identified miRNAs have been found to be located in genomic regions or fragile sites that exhibit strong associations with cancers [3, 4]. As crucial regulatory molecules, miRNAs exert their influence via two primary mechanisms. First, a miRNA can bind to its target mRNA and thereby block translation. Second, a miRNA can promote mRNA degradation through the RNA-induced silencing complex (RISC) [5]. Together, these mechanisms enable miRNAs to regulate the expression of multiple genes.

Many studies indicate that mature miRNAs may serve as potential targets for small molecule (SM) drugs [6]. SMs can regulate the functionality of miRNAs via two distinct mechanisms: by altering the expression levels of miRNAs and by modulating their interactions with mRNAs. In contemporary biomedical research, the precise regulation of miRNAs by SMs has emerged as a promising strategy for the treatment of tumors and a variety of other diseases [7]. Consequently, the accurate identification of potential small molecule-miRNA (SM-miRNA) associations is of paramount importance. With ongoing advances in SM-miRNA association prediction research, numerous related databases have been developed, including SM2miR [8], RNAInter [9], and NoncoRNA [10]. These databases serve as invaluable data resources for exploring potential SM-miRNA associations and establish a robust foundation for developing more accurate prediction models. The computational models for SM-miRNA prediction are generally classified into three categories: models based on biological networks, those employing machine learning algorithms, and alternative predictive models.

The first category comprises biological network-based models, which use the structure and dynamic characteristics of biological networks to predict potential SM-miRNA associations. Li et al. [11] proposed SMiR-NBI, a network-based model constructed by systematically collecting and organizing relevant information on SMs, miRNAs, and genes; network inference algorithms were then employed to reveal potential SM-miRNA associations. Guan et al. [12] introduced the GISMMA model, which is founded on graphlet interactions. By calculating the graphlet interaction heterostructures and applying linear regression to the association scores within the SM and miRNA similarity networks, it predicted SM-miRNA associations. Chen et al. [13] proposed the BNNRSMMA model, which constructed a heterogeneous network using SM and miRNA similarity. This network was represented in matrix form and optimized by minimizing its nuclear norm, thereby uncovering potential associations between SMs and miRNAs. The limitations of this category of models are primarily due to their dependence on the reliability and comprehensiveness of the biological network for accurate prediction. Additionally, these models are vulnerable to errors in network modeling and to incomplete data.

The second category of computational models identifies SM-miRNA associations through the application of machine learning algorithms. These models require a substantial corpus of known data for training and optimization [14]. Researchers extract meaningful features from SMs and miRNAs and subsequently apply diverse machine learning algorithms. For instance, Wang et al. [15] proposed the RFSMMA model based on the Random Forest algorithm. A filtering method was employed to extract reliable features from the similarity data of SM-miRNA pairs. These extracted features were subsequently used to train a Random Forest model for predicting potential SM-miRNA associations. In 2022, Wang et al. [16] proposed the EKRRSMMA model, which employed an ensemble learning approach based on multiple kernel ridge regression algorithms. Although traditional machine learning algorithms possess some utility in predicting SM-miRNA associations under specific conditions, deep learning algorithms have consistently demonstrated superior performance in tasks such as link prediction and node classification. Furthermore, traditional methods may struggle to fully capture the complex biological interaction patterns and potential associations inherent in such datasets.

Several researchers have investigated other types of models to predict potential SM-miRNA associations. Jiang et al. [17] constructed a specific SM-miRNA interaction network for each of 23 different cancer types through the collection and integration of pertinent data. They used a functional enrichment scoring method to predict the associations between cancer-related miRNAs and SMs. In 2022, Li et al. [18] proposed the SMMA-HNRL model for predicting SM-miRNA associations. This model employed two heterogeneous network representation learning algorithms, HeGAN and HIN2Vec, to derive feature vectors for the SM and miRNA nodes. These vectors were then concatenated and fed into a LightGBM classifier to predict potential SM-miRNA associations. Nonetheless, a notable limitation of these methods is their insufficient capacity to capture the node feature information underlying SM-miRNA associations, which may constrain further improvement of prediction accuracy.

Considering that the integration of multi-source data can represent SM-miRNA association information more comprehensively and that deep learning methods are adept at capturing complex associations between SMs and miRNAs, we proposed a novel end-to-end deep learning model: Multi-source Data Fusion and Graph Neural Networks for Small Molecule-MiRNA Association (MDFGNN-SMMA). It integrated multi-source data fusion and graph neural networks to improve the accuracy of SM-miRNA association prediction. Firstly, the fingerprint feature vectors of SMs were extracted based on their SMILES sequences, encompassing both Atom Pairs fingerprint vectors and MACCS molecular fingerprint vectors. For the miRNAs, K-mer feature vectors were constructed based on their base sequences. To compress redundant information and retain key features, Principal Component Analysis (PCA) was applied to reduce each high-dimensional SM feature vector to 32 dimensions. Secondly, adjacency matrices were constructed by computing the cosine similarity among the SM fingerprint feature vectors and among the miRNA K-mer feature vectors, respectively. Thirdly, the different types of SM feature vectors were concatenated, and a logical OR operation was applied to the corresponding SM adjacency matrices. These representations were fed into the model, which comprised a graph attention network and a graph sampling and aggregation network, to derive the final feature vectors for the input SMs and miRNAs. Finally, the SM and miRNA feature vectors were averaged and input into a multilayer perceptron (MLP) consisting of fully connected layers to produce the predicted scores for SM-miRNA associations. To comprehensively assess the predictive performance of the MDFGNN-SMMA model, we performed a comparative analysis against four mainstream models using 10-fold cross-validation. Additionally, case studies were employed to further demonstrate the efficacy of MDFGNN-SMMA in predicting potential SM-miRNA associations.

Materials and methods

MDFGNN-SMMA model

In this study, we introduced MDFGNN-SMMA, an SM-miRNA association prediction model utilizing multi-source data fusion and graph neural networks, comprising four main sections.

In Section A, feature extraction was conducted for both SMs and miRNAs. For SMs, the RDKit software was utilized to compute the Atom Pairs fingerprint feature vectors and the MACCS molecular fingerprint feature vectors based on their SMILES sequences. For miRNAs, the K-mer feature vectors were calculated based on their base sequences.

Section B encompassed the computation and integration of feature vectors and adjacency matrices. Firstly, two feature vectors pertaining to the SMs were dimensionally reduced to 32 dimensions using PCA. This technique optimized the retention of essential information while mitigating noise and redundant data within high-dimensional datasets, thereby decreasing computational complexity [19]. Subsequently, cosine similarity was applied to each feature to derive the cosine similarity matrices. The dimensionally reduced Atom Pairs feature vectors were concatenated with the Molecular ACCess System (MACCS) feature vectors, culminating in the formation of fused SM feature vectors. The two SM similarity matrices were then combined using a logical OR operation, leading to the fused SM adjacency matrix. For the miRNAs, the K-mer features were utilized directly as the miRNA feature vector and input into the subsequent graph neural network. The miRNA adjacency matrix was derived by calculating the cosine similarity of the K-mer feature vectors.

Section C of the study concentrated on the implementation of the Graph Attention Network (GAT) and the Graph Sample and Aggregated Network (GraphSAGE). In this section, the fused SM feature vectors, along with the adjacency matrices derived in Section B, were input into the GAT. Within the multiple layers of the GAT, feature concatenation was employed on the computed features, culminating in a feature vector denoted as $V_{GAT}$ with dimensions $N_{head} \times N_{dim - GAT}$. This method effectively extracts and retains more important information from the input vectors. Next, the feature vector $V_{GAT}$ was input into GraphSAGE. In this study, two SAGEconv layers were employed to aggregate the information from the target node and its neighbors up to the second order, thereby deriving the final SM feature vector $V_{SM}$. Similarly, the miRNA feature vector and adjacency matrix were processed by the combined GAT and GraphSAGE module to obtain the final miRNA feature vector $V_{miRNA}$.

Section D pertained to the final prediction phase of the model. In this section, a multilayer perceptron (MLP) comprising three fully connected layers was developed. The SM feature vector $V_{SM}$ and miRNA feature vector $V_{miRNA}$, derived in Section C, were averaged and input into the MLP to predict the final SM-miRNA association score.

The sections of the MDFGNN-SMMA model flowchart shown in Fig. 1 are as follows: (A) SM and miRNA feature extraction stage. (B) Feature vectors and adjacency matrices computation and fusion stage. (C) Implementation of the GAT and GraphSAGE frameworks. (D) Association score prediction.

Datasets

To obtain comprehensive information on SMs and miRNAs, this study gathered and curated data from multiple databases. This included data on SMs and miRNAs, as well as existing SM-miRNA associations. The relevant information on SMs was obtained from the DrugBank [20] and PubChem [21] databases, where the corresponding SMILES sequences can be queried using the DrugBank ID or CID of each SM. The information on miRNAs was obtained from the miRBase database [22], with the corresponding miRNA base sequences accessible by querying miRBase with the miRNA name (for the SMILES sequence data and miRNA base sequence data, refer to supplementary files 1 and 2). The experimentally validated SM-miRNA associations were collected from the SM2miR [8], RNAInter [9], NoncoRNA [10], and ncDR [23] databases. Following the standardization of the SM and miRNA nomenclature derived from these databases, and the elimination of duplicate SM-miRNA associations, we obtained a total of 7632 unique SM-miRNA associations, encompassing 985 distinct miRNAs and 325 SMs (refer to supplementary files 3 and 4).

The Atom Pairs fingerprint feature vectors of SMs

In cheminformatics and drug discovery, molecular fingerprinting is a prevalent technique for the characterization and representation of molecular structures. The Atom Pairs fingerprint is a molecular descriptor that encodes, for each pair of atoms, the types of the two atoms and the distance between them [24]. When calculating the Atom Pairs fingerprint, the first step is to obtain the structural information of the molecule, typically from the SMILES sequence or another molecular file format, to extract basic information such as atoms, bonds, and rings. Next, all potential atom pairs within the molecular graph are traversed exhaustively. For each atom pair, a range of features is computed, encompassing atom types, distances, and topological properties. These computed features are then aggregated into a high-dimensional vector, which constitutes the Atom Pairs fingerprint of the molecule (See supplementary file 5).

In this study, the SMILES sequences of SMs were transformed into a 1024-dimensional hashed Atom Pairs fingerprint vector with the RDKit toolkit. This fingerprint feature vector captures the local chemical environment within the drug molecule and the interrelationships between atom pairs, better reflecting the common characteristics. Employing a hash function to generate a fixed-length feature vector proved advantageous for subsequent graph neural network applications and data analysis processes.

The MACCS molecular fingerprint feature vectors of SMs

The Molecular ACCess System (MACCS) molecular fingerprint is a widely utilized molecular descriptor for the characterization of the chemical structures of drugs [25]. It employs a binary coding scheme, valued for its simplicity, intuitive nature, and ease of calculation and comparison. The MACCS molecular fingerprint is constructed based on specific substructure definitions and consists of 166 distinct structural keys. Each key is indicative of a distinct chemical substructure or property within the molecule, including but not limited to hydroxyl groups, benzene rings, and nitrogen atoms. The presence of a particular substructure is represented by a bit value of 1; otherwise, the bit is 0. As implemented in RDKit, the fingerprint is returned as a 167-bit vector in which bits 1 to 166 correspond to the 166 keys and bit 0 is an unused placeholder that keeps the bit indices aligned with the key numbers. The encoding scheme employed in MACCS molecular fingerprints provides a high level of specificity and sensitivity, enabling accurate depiction of the structural characteristics of molecules (See supplementary file 6).

In this study, the SMILES sequences of SMs were converted into molecule objects with the MolFromSmiles function in RDKit for subsequent structural analysis and computation. Next, the GenMACCSKeys function of the RDKit MACCSkeys module was employed to compute the MACCS molecular fingerprint feature vector for SMs, which captured structural and property information specifically associated with SMs.

The K-mer feature vectors of miRNAs

In the disciplines of genomics and bioinformatics, K-mer features are integral to sequence representation and analysis. The K-mer approach utilizes a contiguous fragment of DNA or RNA sequence, composed of k consecutive nucleotides, as the fundamental unit to encapsulate local sequence information [26]. K-mer features are derived from a statistically based approach to sequence representation, in which the parameter ‘k’ determines the length of each K-mer fragment. In this study, the parameter ‘k’ was set to 3. The K-mer in a miRNA sequence could be represented by combinations such as ‘AUC’, ‘GAU’ or ‘UUC’, resulting in 64 possible sub-sequences.

To extract K-mers from a given miRNA sequence, consecutive sub-sequences of length ‘k’ were obtained sequentially, starting from the first base of the sequence. Trailing bases that were too few to form a complete K-mer were ignored. The frequency of each generated K-mer in the sequence was counted, and a normalized table of K-mer frequency distributions was created, linking each K-mer to its frequency in the sequence. Because miRNA sequences are short (approximately 20–25 bases), K-mer features have unique advantages in miRNA sequence analysis. A fixed 64-dimensional feature vector for the miRNA was obtained for further analysis and processing (See supplementary file 7). Figure 2 provides a comprehensive illustration of the K-mer feature extraction process.
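The extraction procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `kmer_features`, the overlapping window with a stride of one base, the lexicographic ordering of the 64 positions, and normalization by the total K-mer count are all assumptions made for the example.

```python
from itertools import product

def kmer_features(seq, k=3, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    # All 4**k possible k-mers define the feature positions (64 for k = 3).
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    seq = seq.upper().replace("T", "U")  # tolerate DNA-style input
    # Slide a window of width k one base at a time (stride 1 is an assumption).
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # skip windows that contain ambiguous bases
            counts[index[kmer]] += 1
    total = sum(counts)
    # Normalize counts to frequencies; an empty profile stays all zeros.
    return [c / total if total else 0.0 for c in counts]

# Example with an arbitrary 22-nt RNA sequence.
features = kmer_features("UGAGGUAGUAGGUUGUAUAGUU")
```

Whatever the sequence length, the output has 64 entries for k = 3, which is what allows miRNAs of different lengths to share one feature space.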

Fusion of the feature vectors

To maintain the critical information and unique characteristics present in the original data while minimizing noise and removing redundancy in high-dimensional datasets, we employed PCA on the Atom Pairs fingerprint feature vectors and the MACCS molecular fingerprint feature vectors. By reducing each feature vector to 32 dimensions, the complexity of model training and analysis was reduced, thereby enhancing processing efficiency. Subsequently, the two dimensionality-reduced feature vectors were concatenated to form the fused feature vector for SMs. The concatenation operation is shown in Eq. 1:

$$\begin{array}{*{20}c} {SM = \left[ {X\parallel Y} \right] = \left[ {\begin{array}{*{20}c} {X_{1,1} } & \cdots & { X_{1,m} } \\ \vdots & \cdots & \vdots \\ {X_{n,1} } & \cdots & { X_{n,m} } \\ \end{array} \begin{array}{*{20}c} { Y_{1,1} } & \cdots & { Y_{1,k} } \\ \vdots & \cdots & \vdots \\ { Y_{n,1} } & \cdots & { Y_{n,k} } \\ \end{array} } \right]} \\ \end{array}$$

(1)

In the equation, $X \in R^{n \times m}$ represents the reduced-dimensional Atom Pairs fingerprint feature vector, and $Y \in R^{n \times k}$ represents the reduced-dimensional MACCS molecular fingerprint feature vector, where $n$ is the number of $SM$ samples, $m$ is the dimension of the Atom Pairs fingerprint feature vector, and $k$ is the dimension of the MACCS molecular fingerprint feature vector. By merging the two feature matrices $X$ and $Y$ in the column dimension, a fused feature vector $SM \in R^{n \times \left( m + k \right)}$ containing the features of multiple SMs was finally obtained.
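As a rough illustration of Eq. 1, the sketch below reduces two fingerprint matrices to 32 columns each with an SVD-based PCA and concatenates them column-wise. It assumes NumPy is available; the helper `pca_reduce`, the sample count of 50, and the random binary matrices are placeholders for the example and do not reproduce the study's data or its PCA implementation.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # center each feature column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # (n_samples, n_components)

rng = np.random.default_rng(0)
n = 50                                            # toy number of small molecules
# Random stand-ins for the 1024-bit Atom Pairs and 167-bit MACCS fingerprints.
atom_pairs = rng.integers(0, 2, size=(n, 1024)).astype(float)
maccs = rng.integers(0, 2, size=(n, 167)).astype(float)

X = pca_reduce(atom_pairs, 32)                    # X in R^{n x m}, m = 32
Y = pca_reduce(maccs, 32)                         # Y in R^{n x k}, k = 32
SM = np.concatenate([X, Y], axis=1)               # SM = [X || Y], n x (m + k)
```

With m = k = 32, each small molecule ends up with a 64-dimensional fused vector, which matches the 64-dimensional K-mer vector used for miRNAs.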

Calculation of the similarity matrices

Cosine similarity is a widely utilized metric for quantifying the similarity between two vectors, particularly in the context of high-dimensional spaces [27]. It determines the degree of similarity by evaluating the cosine value of the angle formed between the vectors. When applied to compute the similarity matrix for SMs and miRNAs, cosine similarity evaluates the similarity based on the directional alignment of the vectors, thereby remaining invariant to their absolute magnitudes. Consequently, cosine similarity captures the relative relationship between the features rather than simple quantitative differences [28]. This approach effectively handles high-dimensional data, simplifies comparisons in complex feature spaces, and is robust to sparse and noisy data. Cosine similarity is calculated as shown in Eq. 2:

$$\begin{array}{*{20}c} {S\left( {A,B} \right) = \frac{A \cdot B}{{\|A\| \times \|B\|}} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {A_{i} \times B_{i} } \right)}}{{\sqrt {\Sigma_{{{\text{i}} = 1}}^{n} A_{i}^{2} } \times \sqrt {\Sigma_{{{\text{i}} = 1}}^{n} B_{i}^{2} } }}} \\ \end{array}$$

(2)

where $S\left( {A,B} \right)$ represents the cosine similarity between vector $A$ and vector $B$, $A \cdot B$ denotes the dot product of vector $A$ and vector $B$, $\|A\|$ and $\|B\|$ denote the magnitudes (Euclidean norms) of vector $A$ and vector $B$, and $A_{i}$ and $B_{i}$ denote the $i$-th components of vector $A$ and vector $B$.
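The following sketch shows one way Eq. 2 could be turned into adjacency matrices and how two such matrices could be fused with a logical OR. It assumes NumPy is available. The text does not state how similarity values are binarized, so the fixed cut-off `THRESHOLD = 0.9` and the helper names are assumptions made purely for illustration, and the small feature matrices are invented toy data.

```python
import numpy as np

def cosine_similarity_matrix(F):
    """Pairwise cosine similarity (Eq. 2) between the rows of F."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # avoid division by zero for empty rows
    U = F / norms                      # unit-length rows
    return U @ U.T                     # S[i, j] = cos(angle between rows i, j)

def to_adjacency(S, threshold):
    """Binarize a similarity matrix; the threshold rule is an assumption."""
    return S >= threshold

# Two toy feature views of the same three molecules.
F1 = np.array([[1.0, 0.0, 1.0],
               [1.0, 0.0, 0.9],
               [0.0, 1.0, 0.0]])
F2 = np.array([[1.0, 1.0],
               [0.0, 1.0],
               [0.1, 1.0]])

THRESHOLD = 0.9                        # hypothetical value, not from the paper
A1 = to_adjacency(cosine_similarity_matrix(F1), THRESHOLD)
A2 = to_adjacency(cosine_similarity_matrix(F2), THRESHOLD)
A = np.logical_or(A1, A2)              # an edge exists if either view has it
```

In this toy case the first view links molecules 0 and 1 and the second links molecules 1 and 2, so the fused graph contains both edges while molecules 0 and 2 remain unconnected.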

GAT and GraphSAGE framework

The swift advancement of deep learning has popularized graph neural networks across fields like bioinformatics, social network analysis, and recommender systems. These models excel in graph machine learning tasks like node classification and link prediction. Considering the inherent structure of the input data, the ability of Graph Neural Networks (GNNs) to model complex relationships, and the extensive application of GNNs in bioinformatics, the MDFGNN-SMMA model adopted a GNN-based approach to further extract SM and miRNA features.

Among these approaches, the Graph Convolutional Network (GCN) captures the feature information of neighboring nodes through spectral graph convolution and has achieved outstanding performance in several benchmark tests. The Graph Attention Network (GAT) utilizes attention mechanisms to dynamically assign weights to neighboring nodes, thereby enhancing the model’s expressiveness. The Graph Sample and Aggregated Network (GraphSAGE) demonstrates robust scalability on large graph data by using random sampling and aggregation of neighbor features. The Graph Isomorphism Network (GIN), based on graph isomorphism theory, excels at capturing graph structures and performs exceptionally well in benchmark tests. After comparative experiments, this study chose to combine two graph neural network models, GAT and GraphSAGE.

GAT has been shown to offer distinct advantages in node-level tasks, such as node classification and node attribute prediction, due to its innovative attention mechanism [29]. By dynamically assigning attention weights to each node in the graph, GAT enables the model to focus on the nodes and relationships that are more important for specific tasks [30]. This attention mechanism not only facilitates an in-depth exploration of complex interactions and dependencies between nodes but also enables the extraction of richer and more informative feature representations.

Like other attention mechanisms, the computation of GAT can be divided into two parts: the calculation of the attention coefficient and the weighted summation. Initially, for a given node $i$, we computed the similarity coefficients with its neighboring nodes ($j \in N_{i}$) and itself. The calculation formula is shown in Eq. 3:

$$e_{ij} = \text{a}\left( {\left[ {Wh_{i} \parallel Wh_{j} } \right]} \right),\;j \in N_{i}$$

(3)

where $e_{ij}$ denotes the importance of node $j$ to node $i$, and $N_{i}$ is the set of neighboring nodes of node $i$. $W$ is a shared weight matrix that linearly maps the node features into a higher-dimensional space, which is a common feature augmentation method. $\left[ { \cdot \parallel \cdot } \right]$ denotes the concatenation of the transformed features of node $i$ and node $j$, and ${\text{a}}\left( \cdot \right)$ maps the concatenated high-dimensional features to a real number. Next, the values are normalized using the softmax function to obtain the attention coefficient $a_{ij}$. The calculation formula is shown in Eq. 4:

$$a_{ij} = \text{softmax}_{j}\left( {\text{LeakyReLU}\left( {e_{ij} } \right)} \right) = \frac{{{\text{exp}}\left( {{\text{LeakyReLU}}\left( {e_{ij} } \right)} \right)}}{{\mathop \sum \nolimits_{{k \in N_{i} }} {\text{exp}}\left( {{\text{LeakyReLU}}\left( {e_{ik} } \right)} \right)}}$$

(4)

The steps for calculating the attention coefficient $a_{ij}$ can be understood by referring to supplementary file 8.

The attention coefficients are used to weight and sum features, creating new node features that include neighborhood information. The weighted summation formula is shown in Eq. 5:

$$\begin{array}{*{20}c} {h_{i}^{\prime } = \sigma \left( {\mathop \sum \limits_{{j \in N_{i} }} a_{ij} Wh_{j} } \right)} \\ \end{array}$$

(5)

where $h_{i}^{\prime }$ denotes the new feature of node $i$ obtained by the weighted summation in GAT, $W$ is the dimension transformation matrix, and $\sigma \left( \cdot \right)$ is the activation function; the ReLU activation function was used in this study.

To improve the stability and generalization of the model, GAT introduces a multi-head attention mechanism. In GAT, the original features are split into subspaces, each using a distinct attention mechanism to aggregate neighboring node features. These new node representations are then combined to form the final representation. The computational formula is shown in Eq. 6:

$$h_{i}^{\prime \prime } = \mathop{\Vert}\limits_{k = 1}^{K} h_{i}^{\prime k} = \mathop{\Vert}\limits_{k = 1}^{K} \sigma \left( {\mathop \sum \limits_{{j \in N_{i} }} a_{ij}^{k} W^{k} h_{j} } \right)$$

(6)

where $h_{i}^{\prime \prime }$ denotes the final feature vector obtained after concatenation, $K$ denotes the number of attention heads, $h_{i}^{\prime k}$ is the new feature of node $i$ computed by the $k$-th head, and $\parallel$ denotes the concatenation of the $K$ head outputs, resulting in a final feature vector of dimension $K \times {\text{dim}}\left( {h_{i}^{\prime } } \right)$. This process is illustrated in supplementary file 9.
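Eqs. 3 to 6 can be traced step by step in a small NumPy sketch. This is not the authors' implementation: it assumes that the scoring function a(·) is a single learnable weight vector applied to the concatenated features, that the LeakyReLU negative slope is 0.2, and that the adjacency matrix already contains self-loops. The function names, toy graph, dimensions, and random weights are placeholders.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(H, A, W, a):
    """One attention head. H: (N, F) features, A: (N, N) boolean adjacency
    with self-loops, W: (F, F') weights, a: (2F',) scoring vector."""
    Z = H @ W                                    # W h_j for every node
    Fp = Z.shape[1]
    # Eq. 3: a([W h_i || W h_j]) splits into a source and a target part.
    e = (Z @ a[:Fp])[:, None] + (Z @ a[Fp:])[None, :]
    e = leaky_relu(e)
    e = np.where(A, e, -np.inf)                  # keep neighbors only
    e = e - e.max(axis=1, keepdims=True)         # numerical stability
    att = np.exp(e)
    att = att / att.sum(axis=1, keepdims=True)   # Eq. 4: row-wise softmax
    return np.maximum(att @ Z, 0.0), att         # Eq. 5 with ReLU

def gat_layer(H, A, Ws, a_vecs):
    """Eq. 6: concatenate the outputs of K independent heads."""
    heads = [gat_head(H, A, W, a)[0] for W, a in zip(Ws, a_vecs)]
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(0)
N, F, Fp, K = 4, 5, 3, 2                         # toy sizes
H = rng.normal(size=(N, F))
A = np.eye(N, dtype=bool)                        # self-loops
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = True     # edges 0-1 and 1-2
Ws = [rng.normal(size=(F, Fp)) for _ in range(K)]
a_vecs = [rng.normal(size=2 * Fp) for _ in range(K)]

H_out = gat_layer(H, A, Ws, a_vecs)              # shape (N, K * F')
_, att = gat_head(H, A, Ws[0], a_vecs[0])
```

Each row of `att` sums to one over the node's neighborhood, and the isolated node 3 attends only to itself, which mirrors the normalization in Eq. 4.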

Graph Sample and Aggregated Network (GraphSAGE) is an inductive learning framework that leverages node features and structural data to create effective graph embeddings [31]. It captures the local topological properties in the graph structure through local neighborhood sampling and feature aggregation of the nodes. By synergistically combining both the local graph structure and the intrinsic attributes of nodes, GraphSAGE skillfully derives highly discriminative node representations.

GraphSAGE uses adaptive sampling to lower computational complexity, allowing efficient handling of larger graphs. It creates node representations that generalize well to unknown nodes, addressing scalability and generalization issues found in traditional graph neural networks. These unique advantages enable GraphSAGE to excel in common graph machine learning tasks.

The GraphSAGE algorithm consists of three main stages: neighbor node sampling, feature aggregation and feature updating. Initially, for a given node, GraphSAGE samples a subset of its neighboring nodes as the targets for aggregation. Random sampling can reduce the computational load and increase the efficiency of the model. The features of the sampled neighboring nodes are then aggregated. The aggregation process is shown in Eq. 7:

$$\begin{array}{*{20}c} {h_{{N_{v} }}^{k} = \text{MEAN}\left( {\left\{ {h_{u}^{k - 1} ,\forall u \in N\left( v \right)} \right\}} \right)} \\ \end{array}$$

(7)

For each node $v$, its initial embedding $h_{v}^{0}$ is set to its input feature $x_{v}$. Thereafter, $k$ iterations are performed, where $k$ represents the search depth from the target node $v$: $k = 1$ covers the direct neighbors of $v$, and $k = 2$ additionally covers its second-order neighbors, that is, the neighbors of those neighbors. $N\left( v \right)$ represents the set of all neighboring nodes of node $v$, and $u$ is a neighboring node of node $v$. The node representation $h_{u}^{k - 1}$ generated in the previous iteration is aggregated using the ${\text{MEAN}}$ function, producing the neighborhood representation $h_{{N_{v} }}^{k}$ of the current node. Finally, the node features are updated as shown in Eq. 8:

$$h_{v}^{k} = \sigma \left( {W \cdot {\text{CONCAT}}\left( {h_{v}^{k - 1} ,h_{{N_{v} }}^{k} } \right)} \right)$$

(8)

where $W$ denotes a learnable weight matrix and $\sigma$ denotes an activation function; the ReLU activation function was used in this study. The CONCAT function is used to concatenate $h_{v}^{k - 1}$ and $h_{{N_{v} }}^{k}$. After a linear transformation and activation function, the updated feature representation $h_{v}^{k}$ of node $v$ is obtained. The overall process of GraphSAGE is shown in supplementary file 10.
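A compact NumPy sketch of Eqs. 7 and 8 is given below to show how stacking two layers brings in second-order neighbor information. It is an illustration under stated assumptions, not the study's code: the function `sage_mean_layer`, the optional uniform sampling, the four-node path graph, and the hand-picked weight matrix are all invented for the example.

```python
import numpy as np

def sage_mean_layer(H, neighbors, W, num_samples=None, rng=None):
    """One GraphSAGE layer with a mean aggregator.
    H: (N, F) features, neighbors: list of neighbor index lists,
    W: (F_out, 2F) weights applied to CONCAT(h_v, h_N(v))."""
    H_new = np.zeros((H.shape[0], W.shape[0]))
    for v in range(H.shape[0]):
        nbrs = neighbors[v]
        # Optional neighbor sampling (uniform, without replacement).
        if num_samples is not None and len(nbrs) > num_samples:
            nbrs = rng.choice(nbrs, size=num_samples, replace=False)
        # Eq. 7: mean of the neighbors' previous-layer representations.
        h_nv = H[nbrs].mean(axis=0) if len(nbrs) else np.zeros(H.shape[1])
        # Eq. 8: linear map of the concatenation, followed by ReLU.
        H_new[v] = np.maximum(W @ np.concatenate([H[v], h_nv]), 0.0)
    return H_new

# Path graph 0 - 1 - 2 - 3 with one-hot input features.
neighbors = [[1], [0, 2], [1, 3], [2]]
H0 = np.eye(4)
# Toy weights that simply add a node's own features to the neighbor mean.
W = np.hstack([np.eye(4), np.eye(4)])

H1 = sage_mean_layer(H0, neighbors, W)   # depth 1: direct neighbors
H2 = sage_mean_layer(H1, neighbors, W)   # depth 2: second-order neighbors
```

After one layer, node 0 carries no signal from node 2; after two layers it does, while node 3 (three hops away) is still invisible to it.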

Classifier and training

During the model’s training and classification, we averaged the SM and miRNA features from the graph neural networks. These SM-miRNA association vectors were fed into a multilayer perceptron (MLP) with fully connected layers to learn a non-linear mapping to predict SM-miRNA associations. The MLP was trained using backpropagation and gradient-based optimization to minimize a loss function. The formula for MLP is shown in Eq. 9:

$$Z = W_{output} \cdot \frac{{y_{sm} + y_{miRNA} }}{2} + b_{output}$$

(9)

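where $y_{sm}$ and $y_{miRNA}$ denote the final SM and miRNA feature vectors produced by the graph neural networks, $W_{output}$ denotes the learnable weight matrix, and $b_{output}$ denotes the bias vector. The prediction $Z$ was obtained after this linear transformation.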
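As a small worked instance of Eq. 9, the sketch below averages one SM embedding and one miRNA embedding and applies a single linear output layer. It assumes NumPy is available; the four-dimensional vectors, the weights, and the function name `association_logit` are illustrative placeholders, and the hidden layers of the MLP are omitted.

```python
import numpy as np

def association_logit(y_sm, y_mirna, W_output, b_output):
    """Eq. 9: average the two embeddings, then apply a linear output layer."""
    pair = (y_sm + y_mirna) / 2.0          # requires equal embedding sizes
    return W_output @ pair + b_output

y_sm = np.array([0.2, 0.8, 0.0, 0.4])      # toy SM embedding
y_mirna = np.array([0.6, 0.0, 0.4, 0.4])   # toy miRNA embedding
W_output = np.array([[1.0, -1.0, 0.5, 2.0]])   # placeholder weights
b_output = np.array([0.1])                 # placeholder bias

Z = association_logit(y_sm, y_mirna, W_output, b_output)
```

Averaging, unlike concatenation, keeps the pair representation at the same width as each embedding, which is why both embeddings must have the same dimension.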

The model used BCEWithLogitsLoss as its loss function. This function combines a sigmoid activation with the Binary Cross Entropy (BCE) loss in a single step, which makes it well suited to binary classification tasks such as association prediction. The sigmoid activation first applies a non-linear mapping to the model’s raw output, converting it into a probability value between 0 and 1. The formula for the sigmoid activation function is shown in Eq. 10:

$$\text{Sigmoid}\left( x \right) = \frac{1}{{1 + e^{ - x} }}$$

(10)

The BCE loss function was then applied to calculate the difference between the probability value and the target label. This difference value was the loss value, which measured the accuracy of the model’s predictions. The formula for the BCE loss function is shown in Eq. 11:

$$\text{Loss} = - \frac{1}{N}\sum\limits_{i = 1}^{N} \left[ y_{i} \cdot \log\left( p_{i} \right) + \left( 1 - y_{i} \right) \cdot \log\left( 1 - p_{i} \right) \right]$$

(11)

where $y_{i}$ is the target true label 0 or 1, $p_{i}$ is the predicted value calculated by the model and sigmoid function, and $N$ denotes the number of predicted objects.
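Eqs. 10 and 11 can be checked by hand on a few values. The sketch below is a naive, pure-Python version written only for illustration; the logits and labels are made up, and the function names are assumptions. PyTorch’s `BCEWithLogitsLoss` computes the same quantity but, as far as its documentation states, in a more numerically stable fused form, so this sketch should not be used for extreme logits.

```python
import math


def sigmoid(x):
    """Eq. 10: map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Eq. 11 applied to sigmoid outputs, averaged over N samples."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(logits)


# Three hypothetical SM-miRNA pairs: raw MLP outputs and true labels.
logits = [2.0, -1.0, 0.0]
targets = [1, 0, 1]

loss = bce_with_logits(logits, targets)
print(round(loss, 4))  # 0.3778
```

The first two pairs are classified on the correct side of 0.5 and contribute little to the loss; the third, with a logit of 0, contributes $\log 2$ and dominates the average.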

Configuration and parameters

The MDFGNN-SMMA model was deployed on a CentOS 6.5 platform with an Intel Xeon CPU and an NVIDIA Tesla V100S-PCI graphics card. It used up to 189 GB of RAM and was developed using PyTorch and Python 3.7.13.

In this study, an instance of the ArgumentParser class was used to create a command line argument parser, enabling convenient configuration and execution of the model. The BCEWithLogitsLoss function was used as the loss function, combining the sigmoid activation and binary cross entropy loss computations in a single step. This allowed a direct comparison between the predicted association probabilities and the true binary labels, facilitating optimization of the model’s classification performance. The Adam optimizer, an adaptive learning rate algorithm, was used to enhance training efficiency and model performance; it dynamically adjusts each parameter’s learning rate through first- and second-order moment estimation, leading to faster convergence and greater robustness to hyperparameters [32]. To reduce overfitting and improve generalization, a weight decay coefficient (L2 regularization) was set in the Adam optimizer. The ablation experiments on the relevant parameters are detailed in supplementary file 17, and the specific hyperparameter settings used for this model are provided in supplementary file 11.
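A command line configuration of this kind might look like the following sketch. The flag names and the default learning rate and weight decay are placeholders chosen for illustration, since the actual settings are only given in supplementary file 11; the defaults of 2 attention heads and 10 folds follow the values reported in this article.

```python
import argparse


def build_parser():
    """Hypothetical training options; not the authors' actual CLI."""
    parser = argparse.ArgumentParser(description="MDFGNN-SMMA style training options (sketch)")
    # Placeholder values: the real ones are listed in supplementary file 11.
    parser.add_argument("--lr", type=float, default=1e-3,
                        help="Adam learning rate (placeholder)")
    parser.add_argument("--weight-decay", type=float, default=5e-4,
                        help="L2 weight decay passed to Adam (placeholder)")
    # Values reported in the article.
    parser.add_argument("--heads", type=int, default=2,
                        help="number of GAT attention heads")
    parser.add_argument("--folds", type=int, default=10,
                        help="number of cross-validation folds")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--lr", "0.005"])
print(args.lr, args.weight_decay, args.heads, args.folds)
```

In a PyTorch training script, these values would typically be handed to the optimizer as `torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)`, where `model` stands for whatever network object the script defines.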

Results

Performance evaluation of the 10-fold cross-validation

To thoroughly assess MDFGNN-SMMA’s performance, the study used 7797 known SM-miRNA associations as positive samples. Since experimentally confirmed negative samples were unavailable, a balanced sample set was created through stratified sampling, with negative samples randomly selected from the unknown associations in equal number to the positive samples. To assess the generalizability and stability of the model, one-tenth of the sample set was randomly selected as an independent test set, ensuring equal numbers of positive and negative samples. A 10-fold cross-validation was then performed on the remaining nine-tenths of the samples. In each cross-validation fold, the model was trained on the training set and its performance was evaluated on the validation set. As shown in Fig. 3, the corresponding ROC (Receiver Operating Characteristic) and PR (Precision-Recall) curves were plotted from the results of the 10-fold cross-validation. The curves display the performance in each validation fold together with the average ROC and PR curves over the ten folds, giving a fuller picture of the overall performance of MDFGNN-SMMA.
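The sampling and splitting procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the toy index pairs, the function names and the random seed are hypothetical, and the fold assignment here is a plain shuffle rather than whatever fold logic the released code uses.

```python
import random


def build_balanced_samples(positives, n_sm, n_mirna, rng):
    """Label known pairs 1 and draw an equal number of unknown pairs as 0."""
    known = set(positives)
    unknown = [(s, m) for s in range(n_sm) for m in range(n_mirna)
               if (s, m) not in known]
    negatives = rng.sample(unknown, len(positives))
    return [(p, 1) for p in positives] + [(n, 0) for n in negatives]


def split_by_class(samples, test_fraction, rng):
    """Hold out the same fraction of each class as an independent test set."""
    train, test = [], []
    for label in (0, 1):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        cut = round(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


def k_folds(samples, k, rng):
    """Yield (train, validation) lists for k-fold cross-validation."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    for i in range(k):
        val = shuffled[i::k]
        trn = [s for j, s in enumerate(shuffled) if j % k != i]
        yield trn, val


rng = random.Random(0)
# 20 toy "known associations" among 10 SMs and 10 miRNAs (hypothetical).
positives = [(s, m) for s in range(10) for m in range(10) if (s + m) % 5 == 0]
samples = build_balanced_samples(positives, n_sm=10, n_mirna=10, rng=rng)
train, test = split_by_class(samples, 0.1, rng)
folds = list(k_folds(train, 10, rng))
print(len(samples), len(train), len(test), len(folds))  # 40 36 4 10
```

Because the negatives are drawn from unknown rather than confirmed non-associations, some of them may be undiscovered true associations, which is the labeling limitation noted in the Discussion.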

The generalization ability of the MDFGNN-SMMA model was assessed using the independent test set, comprising one-tenth of the total samples, which the model had not seen during training. This set was used to evaluate the model’s performance after training with 10-fold cross-validation. The following key metrics were selected to assess the model’s performance: AUC, AUPR, Precision, Recall, and F1-score. The 10-fold cross-validation results and their comparison with the results on the independent test set are shown in Fig. 4. On the independent test set, the MDFGNN-SMMA model achieved AUC and AUPR values of 0.9614 and 0.9572, respectively. These values differ only minimally from the average results of 10-fold cross-validation, indicating strong generalization and consistent performance on unseen data.

Contrast experiments on PCA dimensionality reduction

PCA, an unsupervised feature extraction method, identifies key components that effectively represent the original features, creating a concise and robust feature set. It reduces feature dimensions, enhancing model efficiency and mitigating overfitting, which improves the model’s ability to generalize to new data [33]. To assess the impact of PCA on model outcomes, contrast experiments were conducted with the PCA target dimensions set to powers of 2 for computational efficiency. Specifically, the SM features and the miRNA features were each reduced to 64 dimensions, reduced to 32 dimensions, or left unreduced. The AUC and AUPR obtained for each combination of SM and miRNA dimensions are shown in Fig. 5.

In the graph, “None” means that PCA was not used for dimensionality reduction, and the numbers 32 and 64 indicate the dimensions after applying PCA. The horizontal axis shows the PCA dimensions for SMs, and the vertical axis shows the PCA dimensions for miRNAs. The model performs best when PCA reduces the SM features to 32 dimensions and no PCA is applied to the miRNA features. This indicates that the SM features had a high dimension, and that moderate dimensionality reduction compresses redundant information while preserving essential features. In contrast, the miRNA features were already represented in a compact 64-dimensional space, so preserving them in full ensures a comprehensive expression of their intrinsic characteristics. By adopting this selective feature engineering strategy, the MDFGNN-SMMA model captures the intricate associations between SMs and miRNAs more effectively, resulting in improved prediction accuracy and robustness.

Ablation experiments

Ablation experiments using Atom Pairs and MACCS fingerprint features, along with their fusion, are detailed in Table 1. The table highlights the best results in bold. The findings indicate that fusing these features optimizes model performance for SMs, validating the efficacy of multi-source data fusion in predictions. The Atom Pairs fingerprint captures the topological relationships between atoms, while the MACCS fingerprint highlights the backbone structure and functional groups of a molecule. Combining these fingerprints provides a comprehensive representation of chemical structures, enhancing the accuracy of predicting associations between SMs and miRNAs.

Table 1 Comparison results of multi-source data fusion

Contrast experiments on combination of graph neural networks

This study conducted comparative experiments to assess the effectiveness of combining GAT and GraphSAGE for node feature learning. We compared this combination with other graph neural network models that perform well in node classification and link prediction tasks, namely GAT, GraphSAGE, GCN, GIN, and their various combinations. The experimental results are shown in Table 2, where the maximum value of each indicator is shown in bold. The combined use of GAT and GraphSAGE outperforms the other graph neural network models. GAT uses attention mechanisms to learn node importance, while GraphSAGE efficiently aggregates neighborhood information. Together, they can leverage their respective strengths to capture nodes’ hidden semantic features more accurately. This study also analyzed the number of graph neural network layers and compared different feature extraction methods; for details, please refer to supplementary file 17.

Table 2 Comparative experiment results on combining GAT and GraphSAGE

To further investigate the impact of the order in which the GAT and GraphSAGE components are combined, we conducted a comparative experiment with their original sequence reversed. After this architectural modification, the model achieved AUC and AUPR scores of 0.9542 and 0.9523, respectively. While these results are satisfactory, they remain slightly lower than those of the original model configuration proposed in this study, which supports the chosen ordering.

Performance evaluation of attention heads

Using multi-head attention, GAT improves node feature learning by attending to various input segments simultaneously, allowing for more detailed information extraction. The number of attention heads significantly impacts model performance. Therefore, this study compared the performance of the model under different attention head numbers through experiments. The specific experimental results are shown in supplementary file 12, where the number of attention heads was set to 1, 2, and 3 respectively. From the experimental results in supplementary file 12, it can be observed that when the number of attention heads was set to 2, the model achieved the best performance for most of the evaluation indicators. This observation suggests that increasing the number of attention heads from 1 to 2 helped the model acquire a more comprehensive representation of the node features. However, when the number of attention heads was further increased to 3, most of the evaluation indicators showed a decrease, which was probably due to the problem of overfitting. Based on the above findings, this study finally decided to set the number of attention heads to 2. This hyperparameter setting not only enables the model to learn a more comprehensive representation of node features, but also avoids overfitting problems, thus ensuring the robustness of the model in practical applications.

Vector function for constructing SM-miRNA pairs

After obtaining node vectors of the SMs and the miRNAs through the two graph neural networks, SM-miRNA pair vectors were computed using vector functions. We investigated five common functions, Concat, Hadamard, Average, Minus, and Absolute Minus, each of which merges one SM vector and one miRNA vector into one SM-miRNA pair vector. To evaluate the effectiveness of these functions, we used 10-fold cross-validation for comparative analysis. Supplementary file 13 provides a detailed representation of these five functions and their comparative results, where the maximum value of each indicator is shown in bold and underlined. The results demonstrate that the choice of vector function has a significant impact on model performance. By analyzing the comparative results, researchers can select the most appropriate vector function to capture the complex interactions between SMs and miRNAs, thereby optimizing the overall model performance.

As shown in supplementary file 13, the performance of the Average function is superior to the other four functions. The Average function optimally balanced information in SM and miRNA feature vectors, reducing volatility and noise effects for more stable and robust predictions. This underscores the importance of selecting suitable vector operations for combining node-level features into pair-level representations.
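The five pair-vector functions compared above can be sketched as follows. Their exact definitions are given in supplementary file 13, which is not reproduced here, so the versions below follow the usual element-wise meanings of these names and should be read as assumptions; in particular, the direction of the subtraction in Minus (SM minus miRNA) is a guess. The example vectors are made up.

```python
def concat(a, b):
    """Concat: join the two vectors end to end (doubles the dimension)."""
    return a + b


def hadamard(a, b):
    """Hadamard: element-wise product."""
    return [x * y for x, y in zip(a, b)]


def average(a, b):
    """Average: element-wise mean, the function adopted in Eq. 9."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def minus(a, b):
    """Minus: element-wise difference (assumed SM minus miRNA)."""
    return [x - y for x, y in zip(a, b)]


def absolute_minus(a, b):
    """Absolute Minus: element-wise absolute difference."""
    return [abs(x - y) for x, y in zip(a, b)]


sm = [1.0, -2.0, 0.5]      # hypothetical final SM feature vector
mirna = [3.0, 2.0, 0.5]    # hypothetical final miRNA feature vector

pairs = {f.__name__: f(sm, mirna)
         for f in (concat, hadamard, average, minus, absolute_minus)}
for name, vec in pairs.items():
    print(name, vec)
```

All functions except Concat require the SM and miRNA vectors to have the same length, and all except Concat and Minus are symmetric in their two arguments.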

Model contrast experiment

To comprehensively evaluate the performance of the MDFGNN-SMMA model in predicting SM-miRNA associations, this study compared it with four state-of-the-art SM-miRNA association prediction models: RWR [34], GISMMA [12], RFSMMA [15] and SMMA-HNRL [18]. The comparative experimental results of the five models are shown in Table 3, where the maximum value of each indicator is shown in bold. The experimental results show that the MDFGNN-SMMA model achieved the best predictive performance across the evaluated metrics.

Table 3 Comparative experiment results on different models

The model’s excellent performance can be attributed to two key factors. Firstly, it integrated Atom Pairs fingerprint features and MACCS molecular fingerprint features. These features comprehensively captured the chemical and physical characteristics of the SMs, providing the model with a rich source of information. By integrating these features, a deeper understanding of SM properties was achieved, resulting in more accurate predictions of SM-miRNA associations. Secondly, the MDFGNN-SMMA model integrated GAT and GraphSAGE, two advanced graph neural networks. GAT’s attention mechanism enhances the model’s ability to capture complex node relationships and detailed features, improving the representation of local and global interactions in graph data [35]. The GraphSAGE component, in turn, aggregated neighboring node features, preserving rich contextual and structural information. Importantly, it employed inductive learning to generate effective feature representations for unknown nodes. The integration of both graph neural networks combined their strengths, boosting the MDFGNN-SMMA model’s performance in predicting SM-miRNA associations.

Case studies

To further evaluate the practical applicability of the MDFGNN-SMMA model, three small molecules (SMs) with close associations to cancer treatment were selected for case studies in this work. The chosen SMs were Cisplatin (DB00515), 5-Fluorouracil (DB00544) and Doxorubicin (DB00997).

For the evaluation process, known SM-miRNA associations were used as positive samples, and an equal number of unknown associations were selected as negative samples. Initially, all associations containing the three SMs of interest were removed from the dataset, leaving the remaining samples for model training. Subsequently, the association vectors for these three clinically relevant SMs and their corresponding miRNAs were input into the trained MDFGNN-SMMA model, and the prediction scores were calculated. The potential SM-miRNA associations were then ranked in descending order of prediction score. To verify the validity of the predictions, we manually consulted PubMed and the RNAInter database to validate the top-ranked predicted associations. The results of this validation are presented in Tables 4, 5 and 6 (for the full results, see supplementary files 14–16).

Table 4 Validation of the top 50 predicted miRNAs related to Cisplatin (DB00515)

Table 5 Validation of the top 50 predicted miRNAs related to 5-Fluorouracil (DB00544)

Table 6 Validation of the top 50 predicted miRNAs related to Doxorubicin (DB00997)

Cisplatin is a platinum-containing anticancer drug that is widely used to treat a variety of malignant tumor types [36]. Its primary mechanism of action involves incorporation into the helical structure of DNA molecules, disrupting their normal function and ultimately leading to the death of tumor cells. Cisplatin is frequently employed as a key component in combination chemotherapy regimens due to its broad spectrum of anticancer activity, potent effects, synergistic interactions with many other anticancer drugs, and lack of cross-resistance to other cytotoxic agents. Table 4 presents the top 50 miRNA associations predicted by the MDFGNN-SMMA model for the small molecule Cisplatin. Upon manual validation against the existing literature and the RNAInter database, it was found that 8 out of the top 10 predicted miRNAs, 24 out of the top 30 predicted miRNAs, and 42 out of the top 50 predicted miRNAs were confirmed to have documented associations with Cisplatin.

5-Fluorouracil is a chemotherapeutic agent that is widely used to treat a variety of cancers, including those originating in the digestive system, breast, and cervix [37, 38]. This drug belongs to the class of thymidylate synthase inhibitors, exerting its anticancer effects by inhibiting cell proliferation. Once inside the cells, 5-Fluorouracil is converted into 5-fluorodeoxyuridine monophosphate, which then inhibits the enzyme thymidylate synthase. This inhibition blocks the conversion of deoxyuridine monophosphate to deoxythymidine monophosphate, disrupting DNA synthesis. Moreover, 5-Fluorouracil can form nucleosides that interfere with RNA and protein synthesis. Table 5 shows the top 50 miRNAs predicted to be associated with 5-Fluorouracil. Of these, 6 of the top 10, 21 of the top 30 and 36 of the top 50 miRNAs were confirmed by the literature and the RNAInter database.

Doxorubicin is a widely used and highly effective antineoplastic agent for the treatment of a variety of cancer types [39]. Its primary mechanism of action involves intercalation into the DNA of tumor cells, which subsequently inhibits both DNA and RNA synthesis. This disruption of nucleic acid metabolism leads to the suppression of topoisomerase II activity and the generation of free radicals, ultimately resulting in the inhibition of tumor cell proliferation. Additionally, Doxorubicin has been shown to alter the structural integrity of the cell membrane, thereby affecting overall cellular function. Table 6 presents the top 50 miRNAs predicted to be associated with Doxorubicin. Of these, 7 of the top 10, 21 of the top 30 and 36 of the top 50 miRNAs were confirmed by the literature and the RNAInter database.
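The validation counts reported for the three drugs can be expressed as hit rates among the top-k predictions, which makes them easier to compare. The short sketch below only restates the counts given in the text; the variable names are illustrative.

```python
# Number of validated miRNAs among the top-k predictions, as reported above.
validated = {
    "Cisplatin": {10: 8, 30: 24, 50: 42},
    "5-Fluorouracil": {10: 6, 30: 21, 50: 36},
    "Doxorubicin": {10: 7, 30: 21, 50: 36},
}

# Hit rate at k = validated count divided by k.
hit_rate = {
    drug: {k: hits / k for k, hits in counts.items()}
    for drug, counts in validated.items()
}

for drug, rates in hit_rate.items():
    summary = ", ".join(f"top {k}: {rate:.0%}" for k, rate in rates.items())
    print(f"{drug}: {summary}")
```

At k = 50 this corresponds to 84% for Cisplatin and 72% for both 5-Fluorouracil and Doxorubicin. Predictions that were not validated are not necessarily false, since they may simply lack published evidence so far.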

In summary, the case studies on Cisplatin, 5-Fluorouracil, and Doxorubicin highlight the MDFGNN-SMMA model’s strong ability to predict SM-miRNA associations. Most top-ranked predictions were validated by existing literature and the RNAInter database, supporting many high-probability novel associations with empirical evidence. Specifically, for Cisplatin, 8 out of the top 10 predicted miRNAs, 24 out of the top 30 predicted miRNAs, and 42 out of the top 50 predicted miRNAs were documented in the literature and the RNAInter database. Similarly high validation rates were observed for the top predictions associated with 5-Fluorouracil and Doxorubicin, with 36 of the top 50 predicted miRNAs confirmed in each case.

These findings indicate that the MDFGNN-SMMA model can accurately identify clinically relevant SM-miRNA associations, even for widely used chemotherapeutic agents.

Discussion

Accurate SM-miRNA association prediction is crucial for understanding disease mechanisms and for drug development. The MDFGNN-SMMA model performs well in predicting SM-miRNA associations, but there is room for improvement. A major issue is the difficulty of accurately labeling negative samples due to current biological experimental limitations, which hinders the model’s optimal performance. Future studies should explore more effective strategies for negative sample screening in order to enhance the generalization capability of the proposed approach. In addition, the study mainly uses one-dimensional data, namely SMILES for small molecules and base sequences for miRNAs. Future research should explore adding multidimensional features, such as structural and functional details, to deep learning models for a more comprehensive understanding. Such enriched input representations have the potential to significantly improve the predictive power of the models. Finally, the MDFGNN-SMMA model integrates two sophisticated graph neural network algorithms, GAT and GraphSAGE, to harness the distinct advantages of each method. Although this integration has enhanced performance, it has also introduced challenges associated with increased model complexity, prolonged training durations, and heightened memory usage. To mitigate these issues, future research should aim to optimize the model architecture to decrease both temporal and spatial complexity, while preserving or potentially enhancing the model’s predictive capabilities.

Conclusions

In this study, a deep learning model called MDFGNN-SMMA was proposed, which leverages multi-source data fusion and graph neural networks to predict SM-miRNA associations. The attention mechanism inherent to GAT enables the model to capture the dependency relationships between nodes more accurately, thereby facilitating precise feature aggregation. The sampling and aggregation strategy employed by GraphSAGE, in turn, allows for the learning of node embeddings with enhanced generalization capabilities. The integration of these complementary graph neural network architectures, coupled with the multi-source data fusion approach, enables MDFGNN-SMMA to make robust and accurate predictions of SM-miRNA associations. This methodological advancement holds promise for furthering our understanding of the complex regulatory networks governing disease pathogenesis and drug response.

Availability of data and materials

The source code and dataset of MDFGNN-SMMA are publicly available at https://github.com/MDFGNN/MDFGNN-SMMA.

Abbreviations

miRNAs: MicroRNAs
SM: Small molecule
ncRNAs: Non-coding RNAs
RISC: RNA-induced silencing complex
MDFGNN-SMMA: Multi-source Data Fusion and Graph Neural Networks for Small Molecule-MiRNA Association
SMs: Small molecules
SM-miRNA: Small Molecule-miRNA
MLP: Multilayer perceptron
MACCS: Molecular ACCess System
GAT: Graph Attention Network
GraphSAGE: Graph Sample and Aggregated Network
GCN: Graph Convolutional Network
GIN: Graph Isomorphism Network
BCE: Binary Cross Entropy
ROC: Receiver Operating Characteristic
PR: Precision–Recall

References

1. Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004;116(2):281–97.
2. Rupaimoole R, Slack FJ. MicroRNA therapeutics: towards a new era for the management of cancer and other diseases. Nat Rev Drug Discov. 2017;16(3):203–22.
3. Chakraborty C, Sharma AR, Sharma G, Doss CGP, Lee SS. Therapeutic miRNA and siRNA: moving from bench to clinic as next generation medicine. Mol Ther Nucleic Acids. 2017;8:132–43.
4. Hayes J, Peruzzi PP, Lawler S. MicroRNAs in cancer: biomarkers, functions and therapy. Trends Mol Med. 2014;20(8):460–9.
5. Krol J, Loedige I, Filipowicz W. The widespread regulation of microRNA biogenesis, function and decay. Nat Rev Genet. 2010;11(9):597–610.
6. Fu Z, Wang L, Li S, Chen F, Au-Yeung KK, Shi C. MicroRNA as an important target for anticancer drug development. Front Pharmacol. 2021;12:736323.
7. Hanna J, Hossain GS, Kocerha J. The potential for microRNA therapeutics and clinical research. Front Genet. 2019;10:478.
8. Liu X, Wang S, Meng F, Wang J, Zhang Y, Dai E, Yu X, Li X, Jiang W. SM2miR: a database of the experimentally validated small molecules’ effects on microRNA expression. Bioinformatics. 2013;29(3):409–11.
9. Kang J, Tang Q, He J, Li L, Yang N, Yu S, Wang M, Zhang Y, Lin J, Cui T, Hu Y, Tan P, Cheng J, Zheng H, Wang D, Su X, Chen W, Huang Y. RNAInter v4.0: RNA interactome repository with redefined confidence scoring system and improved accessibility. Nucleic Acids Res. 2022;50(D1):D326–32.
10. Li L, Wu P, Wang Z, Meng X, Zha C, Li Z, Qi T, Zhang Y, Han B, Li S, Jiang C, Zhao Z, Cai J. NoncoRNA: a database of experimentally supported non-coding RNAs and drug targets in cancer. J Hematol Oncol. 2020;13(1):15.
11. Li J, Lei K, Wu Z, Li W, Liu G, Liu J, Cheng F, Tang Y. Network-based identification of microRNAs as potential pharmacogenomic biomarkers for anticancer drugs. Oncotarget. 2016;7(29):45584–96.
12. Guan NN, Sun YZ, Ming Z, Li JQ, Chen X. Prediction of potential small molecule-associated microRNAs using graphlet interaction. Front Pharmacol. 2018;9:1152.
13. Chen X, Zhou C, Wang CC, Zhao Y. Predicting potential small molecule-miRNA associations based on bounded nuclear norm regularization. Brief Bioinform. 2021;22(6):bbab328.
14. Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol. 2022;23(1):40–55.
15. Wang CC, Chen X, Qu J, Sun YZ, Li JQ. RFSMMA: a new computational model to identify and prioritize potential small molecule-MiRNA associations. J Chem Inf Model. 2019;59(4):1668–79.
16. Wang CC, Zhu CC, Chen X. Ensemble of kernel ridge regression-based small molecule-miRNA association prediction in human disease. Brief Bioinform. 2022;23(1):bbab431.
17. Jiang W, Chen X, Liao M, Li W, Lian B, Wang L, Meng F, Liu X, Chen X, Jin Y, Li X. Identification of links between small molecules and miRNAs in human cancers based on transcriptional responses. Sci Rep. 2012;2:282.
18. Li J, Lin H, Wang Y, Li Z, Wu B. Prediction of potential small molecule-miRNA associations based on heterogeneous network representation learning. Front Genet. 2022;13:1079053.
19. Hussain S, Ferzund J, Ul-Haq R. Prediction of drug target sensitivity in cancer cell lines using apache spark. J Comput Biol. 2019;26(8):882–9.
20. Knox C, Wilson M, Klinger CM, Franklin M, Oler E, Wilson A, Pon A, Cox J, Chin NEL, Strawbridge SA, Garcia-Patino M, Kruger R, Sivakumaran A, Sanford S, Doshi R, Khetarpal N, Fatokun O, Doucet D, Zubkowski A, Rayat DY, Jackson H, Harford K, Anjum A, Zakir M, Wang F, Tian S, Lee B, Liigand J, Peters H, Wang RQR, Nguyen T, So D, Sharp M, da Silva R, Gabriel C, Scantlebury J, Jasinski M, Ackerman D, Jewison T, Sajed T, Gautam V, Wishart DS. DrugBank 6.0: the DrugBank knowledgebase for 2024. Nucleic Acids Res. 2024;52(D1):D1265–75.
21. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L, Zhang J, Bolton EE. PubChem 2023 update. Nucleic Acids Res. 2023;51(D1):D1373–80.
22. Kozomara A, Birgaoanu M, Griffiths-Jones S. miRBase: from microRNA sequences to function. Nucleic Acids Res. 2019;47(D1):D155–62.
23. Dai E, Yang F, Wang J, Zhou X, Song Q, An W, Wang L, Jiang W. ncDR: a comprehensive resource of non-coding RNAs involved in drug resistance. Bioinformatics. 2017;33(24):4010–1.
24. Carhart RE, Smith DH, Venkataraghavan R. Atom pairs as molecular features in structure-activity studies: definition and applications. J Chem Inf Comput Sci. 1985;25(2):64–73.
25. He K. Pharmacological affinity fingerprints derived from bioactivity data for the identification of designer drugs. J Cheminform. 2022;14(1):35.
26. Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS ONE. 2021;16(10):e0258693.
27. Zhou S, Sun W, Zhang P, Li L. Predicting pseudogene-miRNA associations based on feature fusion and graph auto-encoder. Front Genet. 2021;12:781277.
28. Xie J, Wang M, Xu S, Huang Z, Grant PW. The unsupervised feature selection algorithms based on standard deviation and cosine similarity for genomic data analysis. Front Genet. 2021;12:684100.
29. Zhao X, Zhao X, Yin M. Heterogeneous graph attention network based on meta-paths for lncRNA-disease association prediction. Brief Bioinform. 2022;23(1):bbab407.
30. Keicher M, Burwinkel H, Bani-Harouni D, Paschali M, Czempiel T, Burian E, Makowski MR, Braren R, Navab N, Wendler T. Multimodal graph attention network for COVID-19 outcome prediction. Sci Rep. 2023;13(1):19539.
31. Chami I, Ying R, Ré C, Leskovec J. Hyperbolic graph convolutional neural networks. Adv Neural Inf Process Syst. 2019;32:4869–80.
32. Yaqub M, Jinchao F, Zia MS, Arshid K, Jia K, Rehman ZU, Mehmood A. State-of-the-art CNN optimizer for brain tumor segmentation in magnetic resonance images. Brain Sci. 2020;10(7):427.
33. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. 2016;374(2065):20150202.
34. Lv Y, Wang S, Meng F, Yang L, Wang Z, Wang J, Chen X, Jiang W, Li Y, Li X. Identifying novel associations between small molecules and miRNAs based on integrated molecular networks. Bioinformatics. 2015;31(22):3638–44.
35. Gu W, Gao F, Lou X, Zhang J. Discovering latent node Information by graph attention network. Sci Rep. 2021;11(1):6967.
36. Ghosh S. Cisplatin: the first metal based anticancer drug. Bioorg Chem. 2019;88:102925.
37. Longley DB, Harkin DP, Johnston PG. 5-fluorouracil: mechanisms of action and clinical strategies. Nat Rev Cancer. 2003;3(5):330–8.
38. Wigmore PM, Mustafa S, El-Beltagy M, Lyons L, Umka J, Bennett G. Effects of 5-FU. Adv Exp Med Biol. 2010;678:157–64.
39. Tacar O, Sriamornsak P, Dass CR. Doxorubicin: an update on anticancer molecular action, toxicity and novel drug delivery systems. J Pharm Pharmacol. 2013;65(2):157–70.

Acknowledgements

Not applicable.

Funding

This research was supported by the National Natural Science Foundation of China under Grant No. 62072154 & 62202330 and the Hebei Province Key Research and Development Plan Project under Grant No. 22342001D.

Author information

Authors and Affiliations

School of Artificial Intelligence, Hebei University of Technology, Tianjin, 300401, China
Jianwei Li, Xukun Zhang, Bing Li & Ziyu Li
Beijing Institute of Heart Lung and Blood Vessel Diseases, Beijing Anzhen Hospital of Capital Medical University, Beijing, 101100, China
Zhenzhen Chen

Contributions

JL and ZC conceived, led the project, evaluated the methods, suggested improvements and analyzed the results. JL and XZ conducted the experiments and wrote the manuscript. BL and ZL collected, organized data and modified the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Zhenzhen Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary files 1–17

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, J., Zhang, X., Li, B. et al. MDFGNN-SMMA: prediction of potential small molecule-miRNA associations based on multi-source data fusion and graph neural networks. BMC Bioinformatics 26, 13 (2025). https://doi.org/10.1186/s12859-025-06040-4

Received: 22 October 2024
Accepted: 06 January 2025
Published: 13 January 2025
DOI: https://doi.org/10.1186/s12859-025-06040-4
