MIPPIS: protein–protein interaction site prediction network with multi-information fusion
BMC Bioinformatics volume 25, Article number: 345 (2024)
Abstract
Background
The prediction of protein–protein interaction sites plays a crucial role in biochemical processes. Investigating the interaction between viruses and receptor proteins through biological techniques aids in understanding disease mechanisms and guides the development of corresponding drugs. While various methods have been proposed in the past, they often suffer from drawbacks such as long processing times, high costs, and low accuracy.
Results
Addressing these challenges, we propose a novel protein–protein interaction site prediction network based on multi-information fusion. In our approach, the initial amino acid features are depicted by the position-specific scoring matrix, hidden Markov model, dictionary of protein secondary structure, and one-hot encoding. Simultaneously, we adopt a multi-channel approach to extract deep-level amino acid features from different perspectives. The graph convolutional network channel effectively extracts spatial structural information. The bidirectional long short-term memory channel treats the amino acid sequence as natural language, capturing the protein’s primary structure information. The ProtT5 protein large language model channel outputs a more comprehensive amino acid embedding representation, providing a robust complement to the two aforementioned channels. Finally, the obtained amino acid features are fed into the prediction layer for the final prediction.
Conclusion
Compared with six protein structure-based methods and six protein sequence-based methods, our model achieves optimal performance across evaluation metrics, including accuracy, precision, F1, Matthews correlation coefficient, and area under the precision recall curve, which demonstrates the superiority of our model.
Introduction
The interactions between proteins or with other biological structures directly impact the realization of physiological functions within the organism [1,2,3]. Indeed, protein interactions play a vital role in almost all cellular activities, such as DNA transcription and translation, RNA splicing, catalytic reactions, and more [4, 5]. Therefore, analyzing and understanding the sequence and structural characteristics, as well as the physicochemical mechanisms, of protein interactions are crucial for comprehending cellular functions and the operational mechanisms of living organisms.
Currently, methods for predicting Protein–Protein Interaction (PPI) sites include both biological experimental methods and computational approaches. Biological experimental methods, such as two-hybrid assay and mass spectrometry (MS), are often time-consuming and resource-intensive [6]. Fortunately, over the past two decades, various computational prediction methods [7, 8] have complemented biological techniques to some extent. Existing computational methods for predicting PPI sites can be categorized into two types: sequence-based methods, which consider the relationships between amino acid sequences on a given protein chain, and structure-based methods, which consider the spatial relationships between amino acids, crucial for determining protein functionality.
For sequence-based methods, the protein is represented as an ordered sequence. The order of the sequence reflects the specific relationships between amino acid residues, which form the protein’s primary structure. Convolutional Neural Network (CNN) [9], Recurrent Neural Network (RNN) [10] and Long Short-Term Memory network (LSTM) [11] have been applied to handle protein sequence data. DELPHI [12] introduced three novel features (high-scoring segment pair, position information and 3-mer embedding) and implemented an ensemble framework with a CNN and an RNN component. DLPred [13] proposed a sequence-based PPI site prediction method using a simplified LSTM network. In summary, these sequence-based models have achieved excellent performance in their prediction tasks due to their ability to consider the global information contained within protein sequences. However, sequence-based methods have limitations, as they overlook the spatial structural information of proteins, which is crucial for protein function. To address this, structure-based methods have been developed.
For structure-based methods, incorporating amino acid residue structures into the corresponding network structure is essential, and graph learning provides an effective means [14, 15]. Amino acid residues can be modeled as nodes in a graph, with their structural interactions forming the edges. Most GCN-based models [16] adopt shallow architectures, which limits their ability to extract information from high-order neighbors, while deeper ones are prone to the oversmoothing problem [17]. However, recent advancements, such as residual connections and identity mappings [18], have effectively addressed the oversmoothing problem, enabling the application of deep graph learning to PPI site prediction tasks [14]. RGN [19] has a structure that combines a residue-based graph convolutional network and a graph attention network to extract deeper features. EquiPPIS [20] employed symmetry-aware graph convolutions that transform equivariantly with translation, rotation, and reflection in 3D space, providing richer representations for molecular data compared to invariant convolutions. Structure-based methods focus on learning the spatial relationships between amino acid residues. However, they do not effectively capture the global information of proteins and overlook the high-level semantic information in protein sequences.
To overcome the limitations of sequence-based and structure-based models, we incorporated both structure-based methods and sequence-based methods to extract complementary features from amino acid residues. Additionally, our network structure utilized the ProtT5 [21] large language model to capture high-level semantic features, which was incorporated as a supplementary channel alongside Bi-LSTM and GCN channels.
For amino acid features, we utilized the original protein sequence, evolutionary information, and secondary structure [9]. Unlike previous deep learning methods, our approach considered the interrelationships between these features, such as the influence of secondary structure on evolutionary information, using a Multi-Layer Perceptron (MLP).
Based on this analysis, we proposed a novel protein–protein interaction prediction method. We extracted amino acid features from the GCN, Bi-LSTM, and ProtT5 channels, fused these features, and fed them into the prediction layer (MLP) for binary PPI site prediction.
Materials and evaluation metrics
Three benchmark datasets were utilized to evaluate the model performance: Dset_186 [22], Dset_72 [22], and Dset_164 [23], which are publicly available and widely recognized from prior research. All of these datasets were curated based on a compilation of known protein–protein complexes in the Protein Data Bank (PDB) [24, 25]. A six-step refinement process was applied to each dataset, involving the exclusion of structures with over 30% missing residues, removal of chains with identical UniprotKB/Swiss-Prot accessions, elimination of transmembrane proteins, exclusion of oligomeric structures (beyond dimeric), and removal of proteins with buried surface. We merged the three datasets into a unified dataset. Subsequently, BLASTClust [26] was utilized to eliminate redundant proteins displaying sequence similarities exceeding 25%. This process yielded 395 protein chains, of which 335 were allocated for training (Train_335), leaving the remaining 60 chains for independent testing (Test_60). To assess the robustness of our model and the impact of conformational changes, we used the corresponding unbound structures for proteins in the independent Test_60. Specifically, out of the 60 proteins, 31 have known monomeric structures in PDB, forming an additional unbound test set (UBtest_31). The specific details of these datasets are presented in Table 1.
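To comprehensively assess the efficacy of the proposed approach, we employed a range of evaluation metrics, including accuracy (ACC), precision, recall, F1, Matthews correlation coefficient (MCC), area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). The formulas for calculating these metrics are provided below:
\[ \text{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]
\[ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]
\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5} \]
where true positives (TP) and true negatives (TN) represent the correctly identified number of interacting and non-interacting sites, respectively. False positives (FP) and false negatives (FN) represent the incorrectly predicted number of interacting and non-interacting sites, respectively. AUROC and AUPRC are threshold-independent metrics, providing an overall evaluation of a model’s performance. The remaining metrics require a threshold to convert predicted probabilities into binary predictions; this threshold was determined by maximizing the F1 for each model.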
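As a minimal sketch of how the threshold-dependent metrics and the F1-maximizing threshold could be computed, the following uses only the standard library. The function names and the choice of candidate thresholds (the distinct predicted probabilities) are assumptions made for illustration, not the authors' implementation.

```python
import math

def confusion_counts(labels, probs, threshold):
    """Binarize probabilities at `threshold` and count TP, TN, FP, FN."""
    tp = tn = fp = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """ACC, precision, recall, F1 and MCC; undefined ratios fall back to 0."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "precision": precision, "recall": recall,
            "f1": f1, "mcc": mcc}

def best_f1_threshold(labels, probs):
    """Assumed search strategy: try every distinct predicted probability."""
    candidates = sorted(set(probs))
    return max(candidates,
               key=lambda t: metrics(*confusion_counts(labels, probs, t))["f1"])
```

Because the threshold is tuned per model, the reported ACC, precision, recall, F1, and MCC are each model's values at its own F1-optimal operating point.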
Method
Model architecture
The comprehensive network architecture of the proposed model is shown in Fig. 1. In order to learn protein features more comprehensively, we designed three channels, the GCN channel, the Bi-LSTM channel, and the ProtT5 channel, to compose the whole model architecture. The inputs of the GCN channel and the Bi-LSTM channel are based on the amino acid embedding module, which describes the amino acid features. The first channel utilized a GCN network with initial residual connections, taking the feature matrix \({\varvec{X}}_1\) of the undirected graph and the adjacency matrix as inputs. The feature matrix \({\varvec{X}}_1\) represents the combination of the relevant attribute features of amino acid residues, including PSSM, HMM, and DSSP features, and their relational features. This channel yielded information about the spatial structure of amino acids. The second channel employed a Bi-LSTM network, taking the feature matrix \({\varvec{X}}_2\) composed of one-hot encoded amino acid sequences as input. This channel extracted features related to the primary structure sequence of amino acids. The third channel was a ProtT5 protein large language model channel, which acquired high-level semantic features and obtained a more comprehensive amino acid embedding representation through pre-training on a large dataset of protein sequences. Finally, the features from these three channels were concatenated and fed into the prediction layer (MLP) to predict whether an amino acid residue is a PPI site.
Amino acids attribute features
Given that the secondary structure of a protein plays a crucial role in its function, and that function can in turn affect binding sites with other proteins or compounds, we incorporated the Dictionary of Protein Secondary Structure (DSSP) attribute feature, which represents the secondary structure information of amino acids, as a component of the GCN channel node feature matrix \({\varvec{X}}_1 \in \mathbb{R}^{n \times 68}\). The DSSP program [27] calculates a protein’s secondary structure; each residue in the protein sequence was represented by a 14-dimensional vector, denoted as DSSP. Nine dimensions correspond to the secondary structure states encoded as a one-hot vector. Another four dimensions are derived by applying sine and cosine transformations to the peptide backbone torsion angles, PHI and PSI. The final dimension represents the relative solvent accessibility, calculated from the solvent accessible surface area.
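The 14-dimensional DSSP vector described above can be sketched as follows. The ordering of the nine states (eight DSSP letters plus one slot for an unknown state), the maximum accessible surface area passed in by the caller, and the clamping of the relative accessibility to 1 are all assumptions made for illustration; the text does not specify them.

```python
import math

# Assumption: eight DSSP secondary structure letters, with a ninth slot
# reserved for any unknown or missing state.
SS_INDEX = {state: i for i, state in enumerate("HBEGITS-")}
UNKNOWN_SLOT = 8

def dssp_feature(ss, phi_deg, psi_deg, asa, max_asa):
    """Return a 14-dim vector: 9 one-hot + 4 angle terms + 1 RSA."""
    one_hot = [0.0] * 9
    one_hot[SS_INDEX.get(ss, UNKNOWN_SLOT)] = 1.0

    # Sine and cosine make the periodic torsion angles continuous.
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    angles = [math.sin(phi), math.cos(phi), math.sin(psi), math.cos(psi)]

    # Relative solvent accessibility: ASA divided by a per-residue maximum.
    # `max_asa` is a hypothetical reference value supplied by the caller.
    rsa = min(1.0, asa / max_asa)
    return one_hot + angles + [rsa]
```

Encoding each angle as a sine and cosine pair avoids the artificial jump between -180 and 180 degrees that a raw angle value would introduce.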
Additionally, the evolutionary information of amino acid residues is closely associated with the binding tendencies of proteins. Therefore, attributes such as PSSM and HMM, capturing the evolutionary information of amino acid residues, were included in \({\varvec{X}}_1\). PSSM is generated by running PSI-BLAST v2.10.1 [26] to search the query sequence against the UniRef90 database [6] with three iterations and an E-value of 0.001. For each amino acid residue, this yields a 20-dimensional PSSM feature vector. The Hidden Markov Model (HMM) profile is produced by running HHblits v3.0.3 [28] to align the query sequence against the UniClust30 database [29] with default parameters. Similarly, each amino acid residue is represented by a 20-dimensional HMM feature vector. Values in both the PSSM and the HMM were normalized to scores between 0 and 1 using Formula (6), where z represents the original value, and Min and Max are the minimum and maximum values of the feature type in the training set.
\[ z_{\text{norm}} = \frac{z - \text{Min}}{\text{Max} - \text{Min}} \tag{6} \]
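The min-max normalization of Formula (6) can be illustrated with a short sketch. The helper names are hypothetical, and the handling of a constant feature (Max equal to Min) is an assumption, since the text does not cover that case.

```python
def fit_min_max(train_values):
    """Min and Max are taken from the training set only."""
    return min(train_values), max(train_values)

def min_max_normalize(z, lo, hi):
    """Scale z using training-set bounds; constant features map to 0 (assumed)."""
    if hi == lo:
        return 0.0
    return (z - lo) / (hi - lo)
```

Since Min and Max come from the training set, a test value outside that range would fall slightly outside [0, 1]; whether such values are clipped is not stated in the text.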
The relational features \({\varvec{Y}}\), extracted by the proposed Multi-Layer Perceptron module and described in Section 3.3, were subsequently incorporated into \({\varvec{X}}_1\).
The inherent protein sequence serves as a precise representation of each amino acid and its corresponding positional information, thereby encapsulating the primary structure of the protein. The Bi-LSTM channel extracted the primary structure features of amino acids along the protein chain. The majority of proteins can be described using the 20 fundamental amino acids, thus forming a 20-dimensional one-hot encoding for each amino acid and constructing the feature matrix \({\varvec{X}}_2 \in \mathbb{R}^{n \times 20}\). Here, \(n\) denotes the length of the protein sequence. The two amino acid feature matrices are summarized in Table 2.
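A minimal sketch of the one-hot feature matrix for the Bi-LSTM channel is given below. The alphabetical residue ordering and the all-zero row for non-standard residues are assumptions made for illustration.

```python
# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_sequence(sequence):
    """Build an n x 20 matrix with one row per residue."""
    matrix = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        index = AA_INDEX.get(residue)
        if index is not None:  # non-standard residues stay all-zero (assumed)
            row[index] = 1.0
        matrix.append(row)
    return matrix
```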
The multi-layer perceptron module for relational features
Our MLP module for relational feature extraction has a streamlined structure, depicted in Fig. 2, consisting of an input layer, a hidden layer, and an output layer. Specifically, for a protein chain containing \(n\) amino acid residues, \({\varvec{X}}_0\) constituted the attribute feature matrix (DSSP, PSSM, HMM) of amino acids. After passing through the 2-layer MLP module, a relational feature matrix \({\varvec{Y}}\) was obtained for the amino acids. This relational feature matrix \({\varvec{Y}}\) was then concatenated with \({\varvec{X}}_0\) to form a new feature matrix \({\varvec{X}}_1\), which served as the input for the graph convolutional channel.
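In our network framework, the attribute features of amino acid residues, such as DSSP, PSSM, and HMM, were concatenated as inputs to a Multi-Layer Perceptron (MLP), generating the novel relational features \({\varvec{Y}}\), to explore their interrelationships:
\[ {\varvec{X}}^{(l+1)} = {\varvec{\sigma }}\left( {\varvec{X}}^{(l)} {\varvec{W}}^{(l)} + {\varvec{b}} \right), \quad {\varvec{X}}^{(0)} = {\varvec{X}}_0 \]
\[ {\varvec{Y}} = {\varvec{X}}^{(2)} \]
\[ {\varvec{X}}_1 = {\varvec{X}}_0 \,\Vert\, {\varvec{Y}} \]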
where \({\varvec{X}}^{(l)}\) is the input of the \((l+1)\)-th layer in MLP, \({\varvec{X}}^1\) is the input to the second layer; \({\varvec{X}}^{(l+1)}\) is the output of the \((l+1)\)-th layer in MLP, \({\varvec{X}}^2\) represents the output of the second layer; \({\varvec{W}}^{(l)}\) is the weight matrix, \({\varvec{W}}^1\) denotes the weight matrix of the second layer; \({\varvec{b}} \in \mathbb{R}\) is the bias term; Relational feature matrix \({\varvec{Y}} \in \mathbb{R}^{n \times m_1}\) is the output of the MLP; \({\varvec{X}}_0 \in \mathbb{R}^{n \times m}\) is the attribute features of amino acid residues; \({\varvec{X}}_1 \in \mathbb{R}^{n \times m_2}\) is the node feature of GCN channel; \(n\) is the length of the protein; \(m\), \(m_1\), \(m_2\) are the dimensions of input for MLP, the relational features, and the input for the graph convolutional channel, respectively, with \(m_2\) being the sum of \(m\) and \(m_1\); \({\varvec{\sigma }}\) is the ReLU function; and \(\Vert\) denotes the concatenation operation.
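The relational feature module can be sketched in plain Python as a two-layer perceptron followed by concatenation. The hidden width, the random initialization, and the use of ReLU on the output layer are assumptions; the text only fixes the input width (14 + 20 + 20 = 54) and, through \(m_2 = 68\), the relational width of 14.

```python
import random

def relu(vector):
    return [max(0.0, v) for v in vector]

def linear(x, weights, bias):
    """x has m_in entries; weights is m_in x m_out; bias is a scalar."""
    m_out = len(weights[0])
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + bias
            for j in range(m_out)]

def init_weights(m_in, m_out, rng):
    """Hypothetical small uniform initialization, for illustration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(m_out)]
            for _ in range(m_in)]

def relational_features(X0, W0, W1, b=0.0):
    """Two-layer MLP applied to every residue independently."""
    Y = []
    for row in X0:
        hidden = relu(linear(row, W0, b))
        Y.append(relu(linear(hidden, W1, b)))
    return Y

def concat_features(X0, Y):
    """X1 = X0 || Y, row by row."""
    return [x + y for x, y in zip(X0, Y)]
```

With a 54-dimensional input and a 14-dimensional output, the concatenated matrix has the 68 columns that the GCN channel expects.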
The graph convolutional network (GCN) with initial residual connections
A Graph Convolutional Network (GCN) considers the relationships between each node and its adjacent nodes, facilitating information propagation within a graph. A protein chain containing n amino acids can be described as an undirected graph G(V, E). Amino acids constitute the nodes V of the graph, and the spatial relationships between amino acid nodes are represented as edges E. An adjacency matrix, denoted as A, encodes these spatial relationships. The connectivity between amino acids is quantified by calculating the Euclidean distance between their C\(\alpha\) atoms and comparing it with a predefined cutoff distance. If the distance between the C\(\alpha\) atoms of two amino acids is less than this cutoff value, they are considered connected, and the corresponding entry \(A_{ij}\) in the adjacency matrix A is set to 1; otherwise, it is set to 0, where i and j index the two amino acids.
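The adjacency rule above can be written as a short sketch. The coordinate format (one 3D tuple per residue) is an assumption, and the default of 14 Å is the cutoff the article later reports as optimal. Applying the rule literally also sets the diagonal to 1, because each residue is at distance 0 from itself.

```python
import math

def adjacency_from_ca(ca_coords, cutoff=14.0):
    """Binary adjacency matrix from C-alpha coordinates (in angstroms)."""
    n = len(ca_coords)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Connected when the Euclidean distance is below the cutoff.
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                A[i][j] = A[j][i] = 1
    return A
```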
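Consequently, the GCN produces new embedding representations for amino acids. GCNII [18] addressed the oversmoothing problem by introducing initial residual connections and identity mappings. GraphPPIS [14] introduced this network into PPI site prediction. Our GCN had 8 hidden layers, each with initial residual connections:
\[ {\varvec{H}}^{(l+1)} = \sigma \left( \left( (1 - \alpha ) {\varvec{P}} {\varvec{H}}^{(l)} + \alpha {\varvec{H}}^{(0)} \right) \left( (1 - \beta _l) {\varvec{I}}_n + \beta _l {\varvec{W}}^{(l)} \right) \right) \]
\[ {\varvec{P}} = {\varvec{D}}^{-1/2} {\varvec{A}} {\varvec{D}}^{-1/2} \]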
where \({\varvec{A}} \in \mathbb{R}^{n \times n}\) is the adjacency matrix; \({\varvec{D}}\) is the diagonal degree matrix of \({\varvec{A}}\); \({\varvec{H}}^{(l+1)}\) and \({\varvec{H}}^{(l)}\) are the hidden states after and before the convolution operation of the \((l+1)\)-th layer; \({\varvec{H}}^{(0)} \in \mathbb{R}^{n \times m_2}\) is the input of the GCN channel; \(\alpha\) and \(\beta _l\) are hyperparameters; \({\varvec{I}}_n\) is an identity matrix; \({\varvec{W}}^{(l)}\) is the weight matrix; \(l\) is the number of hidden layers; \(n\) is the length of the protein; \(m_2\) is 68; \(\sigma\) is the ReLU function. The use of initial residual connections combines the representation \({\varvec{PH}}^{(l)}\) with the first layer \({\varvec{H}}^{(0)}\). The purpose of these initial residual connections is to ensure that the final representation of each node retains at least a portion of the input features. Additionally, an identity matrix \({\varvec{I}}_n\) is added to the weight matrix \({\varvec{W}}^{(l)}\) of the \(l\)-th layer. This process captured spatial structural relationships between amino acid residues, updating the feature representations \({\varvec{F}}_{\text{GCN}}\in \mathbb{R}^{n \times 256}\) of amino acids in the GCN channel.
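One propagation step with an initial residual connection and identity mapping, in the style of GCNII [18], might look like the sketch below. It assumes symmetric normalization of the adjacency matrix by the degree matrix, treats \(\alpha\) and \(\beta_l\) as plain arguments, and assumes the hidden state and the initial representation already share the same width (for example after a linear projection that is not shown). It is an illustration, not the authors' code.

```python
import math

def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize_adjacency(A):
    """P = D^{-1/2} A D^{-1/2}, with D the diagonal degree matrix of A."""
    degree = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(degree[i] * degree[j]) if A[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def gcnii_layer(P, H, H0, W, alpha, beta):
    """ReLU(((1 - alpha) P H + alpha H0) ((1 - beta) I + beta W))."""
    PH = matmul(P, H)
    n, d = len(H), len(H[0])
    # Initial residual connection: mix propagated features with H0.
    mixed = [[(1 - alpha) * PH[i][j] + alpha * H0[i][j] for j in range(d)]
             for i in range(n)]
    # Identity mapping: blend the weight matrix with an identity matrix.
    M = [[(1 - beta) * (1.0 if r == c else 0.0) + beta * W[r][c]
          for c in range(d)] for r in range(d)]
    return [[max(0.0, v) for v in row] for row in matmul(mixed, M)]
```

Setting \(\alpha = 1\) returns the (rectified) input features regardless of depth, which is the sense in which each node retains at least part of its input representation.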
Bi-LSTM
The primary structure information of a protein forms the foundation for studying more complex structural and functional characteristics of proteins. Bi-LSTM allows the network to capture context for each position from both directions, providing a better understanding of dependencies within the sequence. In this study, a Bi-LSTM with 2 hidden layers was employed.
where \({\varvec{x}}_i\) represents the input at amino acid position \(i\), \({\varvec{W}}\) denotes the weights of the Bi-LSTM cell gates, \({\varvec{h}}_i\) and \({\varvec{h}}'_i\) are the outputs from the forward and backward passes of the Bi-LSTM, respectively. The memory cell state \(C_i\) captures the long-term dependencies in the input sequence, and \(\tanh (C_i)\) is the activation of the memory cell state. The output gate \({\varvec{og}}_i\) maintains the information from both directional passes.
In the first layer of the Bi-LSTM, the input is \({\varvec{X}}_2 \in \mathbb{R}^{n \times 20}\). After 2 layers of learning in the Bi-LSTM, the primary structure information of the amino acid sequence was further extracted, resulting in a new representation of amino acid features \({\varvec{F}}_{\text{Bi-LSTM}}\in \mathbb{R}^{n \times 64}\).
ProtT5
Due to the limited size of the dataset used during training, the feature information obtained may not be sufficiently comprehensive. The high-level semantic features of protein representations derived from a large dataset serve as a valuable complement to evolutionary information, structural information, and other feature information. We therefore employed the ProtT5-XL-UniRef50 protein large language model, a widely utilized model in the field [21], referred to as ProtT5. ProtT5 is based on the t5-3b model and underwent pretraining on an extensive corpus of protein sequences in a self-supervised manner, resulting in the feature representation \({\varvec{F}}_{\text{ProtT5}}\in \mathbb{R}^{n \times 1024}\).
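The article does not describe how sequences are prepared for ProtT5. As a hedged sketch, the convention documented on the public ProtT5-XL-UniRef50 model card is to map the rare residues U, Z, O, and B to X and to separate residues with spaces before tokenization. The function name below is hypothetical, and whether the authors followed this exact step is an assumption.

```python
import re

def prepare_for_prott5(sequence):
    """Map rare residues to X and insert spaces between residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)
```

The resulting string would then be tokenized and passed through the encoder, yielding one 1024-dimensional embedding per residue.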
The multi-layer perceptron for prediction
A Multi-Layer Perceptron (MLP) was employed to predict the interaction probabilities of all \(n\) amino acid residues within the protein, based on the concatenated features \({\varvec{F}}_{\text{GCN}}\), \({\varvec{F}}_{\text{Bi-LSTM}}\), and \({\varvec{F}}_{\text{ProtT5}}\):
\[ {\varvec{X}}_c^{(l+1)} = \sigma \left( {\varvec{X}}_c^{(l)} {\varvec{W}}_c^{(l)} + {\varvec{b}}_c \right) \]
where \({\varvec{X}}_c^{(l+1)}\) represents the output of the \((l+1)\)-th layer after applying the activation function \(\sigma\); \({\varvec{X}}_c^{(l)}\) is the input of the \((l+1)\)-th layer; \({\varvec{W}}_c^{(l)}\) denotes the weight matrix of the \((l+1)\)-th layer; \({\varvec{b}}_c\) is the bias term.
The input to the first layer, denoted as \({\varvec{X}}_c^{(0)}\), is the concatenation of \({\varvec{F}}_{\text{GCN}}\), \({\varvec{F}}_{\text{Bi-LSTM}}\), and \({\varvec{F}}_{\text{ProtT5}}\). The activation function \(\sigma\) used in the first layer is ReLU. The output \(\in \mathbb{R}^{n \times 2}\) of the final layer is the predicted class probabilities, with the activation function \(\sigma\) being Softmax.
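The fusion and prediction step can be sketched as a per-residue concatenation followed by a ReLU layer and a Softmax output. The hidden width and the weights are placeholders supplied by the caller; only the channel widths (256, 64, and 1024) and the two-class output come from the text.

```python
import math

def concat_channels(f_gcn, f_lstm, f_prott5):
    """Concatenate the three channel outputs residue by residue."""
    return [g + l + p for g, l, p in zip(f_gcn, f_lstm, f_prott5)]

def dense(x, W, b):
    """x has m_in entries; W is m_in x m_out; b has m_out entries."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_residue(x, W1, b1, W2, b2):
    """ReLU hidden layer, then Softmax over {non-site, site}."""
    hidden = [max(0.0, v) for v in dense(x, W1, b1)]
    return softmax(dense(hidden, W2, b2))
```

For a chain of \(n\) residues the fused input has shape \(n \times 1344\), since 256 + 64 + 1024 = 1344.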
Results and discussion
Experiment details
For the training dataset, 5-fold cross-validation was conducted to determine hyperparameters. The final model was trained using the complete training dataset. Specific configurations are presented in Table 3.
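A 5-fold split of the training chains could be produced as in the following sketch. Splitting at the level of whole protein chains, the shuffling seed, and the chain identifiers are assumptions made for illustration; the text does not state how the folds were formed.

```python
import random

def k_fold_splits(chain_ids, k=5, seed=0):
    """Yield (train, validation) lists of chain identifiers for each fold."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, validation
```

With the 335 chains of Train_335, each fold holds out 67 chains and trains on the remaining 268.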
Performance comparison with other methods
We compared the performance of our model with other protein–protein interaction site (PPIS) prediction methods on the independent test set Test_60, including PSIVER [22], ProNA2020 [30], SCRIBER [31], DLPred [13], DELPHI [12], EnsemPPIS [32], DeepPPISP [9], SPPIDER [33], MaSIF-site [34], GraphPPIS [14], RGN [19] and EquiPPIS [20]. The first six methods are sequence-based, while the latter six utilize protein structure information.
As shown in Table 4, our model achieves superior performance in terms of Accuracy (ACC), Precision, F1, Matthews Correlation Coefficient (MCC), and Area Under the Precision-Recall curve (AUPRC). On imbalanced datasets, ACC is not a critical evaluation metric, as it may be dominated by the predictions for the majority class (negative samples). Given that the dataset is imbalanced with a majority of negative samples, we pay closer attention to metrics such as MCC, AUROC, and AUPRC. Among the six sequence-based methods, EnsemPPIS generally demonstrates the best performance. Our model outperforms EnsemPPIS comprehensively, particularly in the key metrics MCC, AUROC, and AUPRC, with improvements of 10.5%, 7.6%, and 11.9%, respectively. Compared with the other five structure-based methods, the effectiveness of EquiPPIS’s symmetry-aware graph convolutions is evident. Our proposed model outperforms EquiPPIS in MCC and AUPRC, with improvements of 0.5% and 0.4%, respectively. Integrating data from three channels achieved the fusion of feature information from three perspectives: primary protein structure, amino acid spatial structure, and protein high-level semantic information. This integration results in superior predictive performance compared to the aforementioned twelve models.
Multi-channel performance
We proposed MIPPIS for predicting PPI sites, considering various aspects of amino acid information on the protein and employing multiple channels for feature extraction. To investigate the contributions of different channels to the model performance, we conducted channel ablation experiments, comparing the model effects of different channel combinations. The considered channel combinations in the experiments include the first channel (GCN channel), the second channel (Bi-LSTM channel), the third channel (ProtT5 channel), the combination of the GCN channel and Bi-LSTM channel, the combination of the GCN channel and ProtT5 channel, the combination of the Bi-LSTM channel and ProtT5 channel, and the combination of all three channels. These models share the same network architecture.
As indicated in Table 5, the ProtT5 channel attains the highest performance among the single channels, illustrating that effective amino acid embedding representations can be achieved through extensive data training. Simultaneously, compared to the Bi-LSTM channel representing the primary structure of amino acids along the protein chain, the GCN channel also performs well, with an AUROC of 0.717 and an AUPRC of 0.323. This validates that amino acid spatial structure information has a more significant impact on PPI sites compared to the primary structure information. In two-channel combinations, the ProtT5 channel and the GCN channel achieve the best performance, indicating that this combination more effectively extracts amino acid spatial structure features and high-level semantic features. Due to the imbalanced nature of our dataset, we focus more on MCC, AUPRC and AUROC. Our three-channel model achieves the best performance on all these metrics, except for the Recall. This imbalance significantly impacts the Recall because Recall is calculated as the number of true positives divided by the sum of true positives and false negatives. With fewer positive samples, the likelihood of higher false negatives increases, leading to a lower Recall. The F1 provides a balance between Precision and Recall. On the F1, our model achieves the best performance, demonstrating its effectiveness in handling the imbalanced data. Ultimately, such outstanding results robustly demonstrate the feasibility of the combination of three channels, leveraging the extraction of amino acid spatial structure (GCN channel), amino acid primary structure (Bi-LSTM channel), and ProtT5 channel with high-level semantic information, each complementing the others.
Model performance of robustness
To evaluate the robustness of our model, we employed 5-fold cross-validation (CV) and independent testing to assess the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). Specifically, on Train_335 (cross-validation) and Test_60 (independent testing), our model achieves AUPRC values of 0.499 and 0.471, and AUROC values of 0.818 and 0.802, respectively. As shown in Fig. 3, the consistency of our model’s performance in both cross-validation and independent testing demonstrates its robustness. Considering that our method’s training set was initially curated based on native complex structures, it is pertinent to investigate how unbound structures affect predictive efficacy. We conducted comparative experiments on the dataset UBtest_31. As shown in Figs. 4, 5 and Table 6, our proposed model achieved the best AUPRC and Matthews Correlation Coefficient (MCC) performance. The experimental results on the UBtest_31 dataset not only validate the effectiveness of our model but also confirm its robustness.
Multi-layer perceptron module performance for relational features
In this study, to explore the interrelations among amino acid attribute features, we employed the Multi-Layer Perceptron module to generate the relational features. These relational features, along with HMM, PSSM, and DSSP, were utilized as the amino acid features for the GCN channel. To investigate the impact of the relational features, we conducted two sets of experiments, comparing the final predictive performance of the model. One set involved the model proposed in this paper, while the other set excluded the Multi-Layer Perceptron module extracting relational features from the graph convolutional channel. As shown in Table 7, the results reveal that the inclusion of the Multi-Layer Perceptron module for extracting relational features improved MCC, AUROC, and AUPRC by 0.5%, 0.1%, and 0.2%, respectively. This demonstrates the role of the Multi-Layer Perceptron module in extracting relational features.
Different feature performances
The impact of different features (PSSM, DSSP, HMM) on the model performance varies. To thoroughly investigate the contribution of feature combinations to the final prediction results, we conducted a series of feature ablation experiments. We arranged and combined these three features, creating six different combinations to construct comparative models. As shown in Table 8, when using a single feature, DSSP outperforms PSSM and HMM, indicating that secondary structure information has a greater impact on PPI sites than evolutionary information to some extent. For features containing evolutionary information, HMM performs better when used individually, demonstrating that HMM can more effectively characterize evolutionary information. When two features are combined, almost all evaluation metrics improve as long as the combination includes DSSP, reaffirming the influence of DSSP on PPI sites. The performance of using only PSSM features is the poorest. However, among the two-feature combinations, DSSP and PSSM exhibit the best performance, with corresponding AUROC and AUPRC values of 0.795 and 0.455, indicating that PSSM has a certain effect. When using all features, the model achieves the best performance, with AUROC 0.802 and AUPRC 0.471. The detailed results are presented in Table 8.
The effect of protein distance map
As discussed in Section 3.4, the construction of the adjacency matrix A involves determining the connectivity between nodes in the amino acid graph based on a predefined cutoff Euclidean distance. The chosen cutoff value dictates which amino acids are included within the spatial proximity of the target amino acid. The minimum cutoff is 3.8 Å, as it corresponds to the shortest chemical bond length between two C\(\alpha\) atoms. The maximum cutoff is chosen as \(\infty\), which indicates that all the other amino acids are connected with the target amino acid.
As shown by the blue line in Fig. 6, with an increasing cutoff value, our model’s AUPRC rapidly increases. This is attributed to introducing edges with greater informational content into the protein graph, and the model achieves optimal performance at a cutoff value of 14 Å. As the cutoff distance continues to increase, the performance slowly decreases, indicating that excessively large cutoff values introduce redundant information. Alternatively, there exists a second computational approach where protein distance maps are transformed into continuous matrices with values ranging from 0 to 1 [19]. This is achieved by employing a normalization formula.
The yellow line in Fig. 6 illustrates the performance of the continuous distance map, showing a consistent trend with the discrete distance map. We opt for the discrete distance map because it is faster to compute.
Conclusion
In this study, we proposed a multi-channel protein–protein interaction site prediction network. Specifically, we utilized the Graph Convolutional Network (GCN) channel for spatial structural characteristics, the Bidirectional Long Short-Term Memory (Bi-LSTM) channel for primary structure sequence features, and the ProtT5 protein large language model for high-level semantic features derived from extensive training data. The integration of amino acid features from three different perspectives enables our model to achieve the best overall performance, surpassing other sequence and structure models discussed in the article. However, there is room for improvement in our model, such as incorporating physicochemical properties of amino acids.
Availability of data and materials
The datasets supporting the conclusions of this article are available in the Github repository, https://github.com/DKY121212/MIPPIS/tree/master/Dataset.
Abbreviations
- PPI: Protein–protein interaction
- PSSM: Position-specific scoring matrix
- HMM: Hidden Markov model
- DSSP: Dictionary of protein secondary structure
- GCN: Graph convolutional network
- Bi-LSTM: Bidirectional long short-term memory
- MS: Mass spectrometry
- CNN: Convolutional neural network
- RNN: Recurrent neural network
- LSTM: Long short-term memory network
- MLP: Multi-layer perceptron
- PDB: The protein data bank
- ACC: Accuracy
- MCC: Matthews correlation coefficient
- AUROC: Area under the receiver operating characteristic curve
- AUPRC: Area under the precision-recall curve
- TP: True positives
- TN: True negatives
- FP: False positives
- FN: False negatives
- PPIS: Protein–protein interaction site
- CV: Cross-validation
References
De Las Rivas J, Fontanillo C. Protein-protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol. 2010;6(6): e1000807.
Li X, Li W, Zeng M, Zheng R, Li M. Network-based methods for predicting essential genes or proteins: a survey. Brief Bioinform. 2020;21(2):566–83.
Xia L, Xu L, Pan S, Niu D, Zhang B, Li Z. Drug-target binding affinity prediction using message passing neural network and self supervised learning. BMC Genom. 2023;24(1):557.
Zhang J, Kurgan L. Review and comparative assessment of sequence-based predictors of protein-binding residues. Brief Bioinform. 2018;19(5):821–37.
Pan S, Xia L, Xu L, Li Z. SubMDTA: drug target affinity prediction based on substructure extraction and multi-scale features. BMC Bioinform. 2023;24(1):334.
Shoemaker BA, Panchenko AR. Deciphering protein–protein interactions. Part I. Experimental techniques and databases. PLoS Comput Biol. 2007;3(3): e42.
Xu L, Pan S, Xia L, Li Z. Molecular property prediction by combining LSTM and GAT. Biomolecules. 2023;13(3):503.
Niu D, Xu L, Pan S, Xia L, Li Z. SRR-DDI: a drug–drug interaction prediction model with substructure refined representation learning based on self-attention mechanism. Knowl Based Syst. 2024;285: 111337.
Zeng M, Zhang F, Wu FX, Li Y, Wang J, Li M. Protein–protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics. 2020;36(4):1114–20.
Zhang J, Ma Z, Kurgan L. Comprehensive review and empirical analysis of hallmarks of DNA-, RNA-and protein-binding residues in protein chains. Brief Bioinform. 2019;20(4):1250–68.
Shi H, Gao S, Tian Y, Chen X, Zhao J. Learning bounded context-free-grammar via LSTM and the transformer: difference and the explanations. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36; 2022. p. 8267–8276.
Li Y, Golding GB, Ilie L. DELPHI: accurate deep ensemble model for protein interaction sites prediction. Bioinformatics. 2021;37(7):896–904.
Zhang B, Li J, Quan L, Chen Y, Lü Q. Sequence-based prediction of protein-protein interaction sites by simplified long short-term memory network. Neurocomputing. 2019;357:86–100.
Yuan Q, Chen J, Zhao H, Zhou Y, Yang Y. Structure-aware protein–protein interaction site prediction using deep graph convolutional network. Bioinformatics. 2022;38(1):125–32.
Zhang L, Niu D, Zhang B, Zhang Q, Li Z. FSRM-DDIE: few-shot learning methods based on relation metrics for the prediction of drug–drug interaction events. Appl Intell. 2024;p. 1–14.
Wang S, Liang D, Wang J, Dong K, Zhang Y, Liang H, et al. FraHMT: a fragment-oriented heterogeneous graph molecular generation model for target proteins. J Chem Inf Model. 2024;64(9):3718–32.
Li Q, Han Z, Wu XM. Deeper insights into graph convolutional networks for semi-supervised learning. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32; 2018.
Chen M, Wei Z, Huang Z, Ding B, Li Y. Simple and deep graph convolutional networks. In: International conference on machine learning. PMLR; 2020. p. 1725–1735.
Wang S, Chen W, Han P, Li X, Song T. RGN: residue-based graph attention and convolutional network for protein–protein interaction site prediction. J Chem Inf Model. 2022;62(23):5961–74.
Roche R, Moussad B, Shuvo MH, Bhattacharya D. E(3) equivariant graph neural networks for robust and accurate protein–protein interaction site prediction. PLoS Comput Biol. 2023;19(8): e1011435.
Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. Prottrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2021;44(10):7112–27.
Murakami Y, Mizuguchi K. Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein–protein interaction sites. Bioinformatics. 2010;26(15):1841–8.
Dhole K, Singh G, Pai PP, Mondal S. Sequence-based prediction of protein–protein interaction sites with L1-logreg classifier. J Theor Biol. 2014;348:47–54.
Hwang H, Pierce B, Mintseris J, Janin J, Weng Z. Protein–protein docking benchmark version 3.0. Prot Struct Funct Bioinform. 2008;73(3):705–9.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolym Origin Res Biomol. 1983;22(12):2577–637.
Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2012;9(2):173–5.
Mirdita M, Von Den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45(D1):D170–6.
Qiu J, Bernhofer M, Heinzinger M, Kemper S, Norambuena T, Melo F, et al. ProNA2020 predicts protein–DNA, protein–RNA, and protein–protein binding proteins and residues from sequence. J Mol Biol. 2020;432(7):2428–43.
Zhang J, Kurgan L. SCRIBER: accurate and partner type-specific prediction of protein-binding residues from proteins sequences. Bioinformatics. 2019;35(14):i343–53.
Mou M, Pan Z, Zhou Z, Zheng L, Zhang H, Shi S, et al. A transformer-based ensemble framework for the prediction of protein–protein interaction sites. Research. 2023;6:0240.
Porollo A, Meller J. Prediction-based fingerprints of protein–protein interactions. Prot Struct Funct Bioinform. 2007;66(3):630–45.
Gainza P, Sverrisson F, Monti F, Rodola E, Boscaini D, Bronstein MM, et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods. 2020;17(2):184–92.
Acknowledgements
Not applicable.
Funding
This research is supported by the Natural Science Foundation of China (Nos. 62202498, 62272479, 62372469), Shandong Provincial Natural Science Foundation (No. ZR2021QF023), Shandong Province Youth Innovation and Technology Program Innovation Team (2023KJ070).
Author information
Contributions
SW and KD developed the algorithm, did the computation, and wrote the manuscript. KD, SW and DL designed the project, collected the data and revised the manuscript. YZ, XL and TS revised the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, S., Dong, K., Liang, D. et al. MIPPIS: protein–protein interaction site prediction network with multi-information fusion. BMC Bioinformatics 25, 345 (2024). https://doi.org/10.1186/s12859-024-05964-7
DOI: https://doi.org/10.1186/s12859-024-05964-7