HPOseq: a deep ensemble model for predicting the protein-phenotype relationships based on protein sequences

Zhao, Kai; Ji, Zhuocheng; Zhang, Linlin; Quan, Na; Li, Yuheng; Yu, Guanglei; Bi, Xuehua

doi:10.1186/s12859-025-06122-3

Research
Open access
Published: 22 April 2025

BMC Bioinformatics volume 26, Article number: 110 (2025)

Abstract

Background

Understanding the relationships between proteins and specific disease phenotypes contributes to the early detection of diseases and advances the development of personalized medicine. The acquisition of a large amount of proteomics data has facilitated this process. To improve discovery efficiency and reduce the time and financial costs associated with biological experiments, various computational methods have yielded promising results. However, the lack of rich and reliable protein-related information still presents challenges in this process.

Results

In this paper, we propose an ensemble prediction model, named HPOseq, which predicts human protein-phenotype relationships based only on sequence information. HPOseq establishes two base models to achieve this objective. One directly extracts internal information from amino acid sequences as protein features to predict the associated phenotypes. The other builds a protein-protein network based on sequence similarity and extracts information between proteins for phenotype prediction. Ultimately, an ensemble module integrates the predictions from both base models to produce the final prediction.

Conclusion

The results of 5-fold cross-validation reveal that HPOseq outperforms seven baseline methods for predicting protein-phenotype relationships. Moreover, we conduct case studies from the perspectives of phenotype annotation and protein analysis to verify the practical significance of HPOseq.

Introduction

Proteins play pivotal roles in biological systems, encompassing tasks such as catalyzing biochemical reactions, transmitting signals, and upholding cell structures [1, 2]. Gaining insight into the proteins linked to specific diseases is instrumental in revealing the root causes of these conditions [3, 4]. In the era of precision medicine, the analysis of proteins associated with patient-specific phenotypes contributes to early disease diagnosis and personalized treatment [5,6,7]. Traditional biological experimental methods, although capable of providing reliable relationships between proteins and disease phenotypes, require a significant amount of time and expenditure.

With the advancement of biological and computer technologies, numerous databases have been established [8,9,10], offering a robust data foundation for the analysis of proteins and diseases. UniProt [11] is a global protein database designed to provide comprehensive, accurate, and reliable protein information. The Human Phenotype Ontology (HPO) [9] is a resource aimed at systematically describing and standardizing human disease phenotypes. In addition to providing a common and standardized vocabulary to describe clinical features and phenotype information related to human diseases, it also offers validated relationships between proteins and phenotypes [12]. Based on these databases, and with the help of modeling techniques and computing power, a number of computational methods have been proposed to predict the phenotypes related to proteins [13, 14]. These methods leverage protein information from two aspects: the information obtained from individual protein sequences and the similarity or correlation information between different proteins.

Protein information, including the amino acid sequence [1], domains [14], structure [15], and function [13], provides a comprehensive understanding of a protein’s inherent biological attributes. This knowledge is widely applied in diverse bioinformatics challenges, such as drug-target prediction [16], disease gene prediction [17], protein function prediction [1], and protein structure prediction [18]. Tasks without prior association information can be accomplished by employing this type of information. Li et al. [19] designed a protein sequence-based computational model to effectively predict drug-target interactions using a rotation forest classifier. Researchers have already utilized this idea to infer disease phenotypes related to proteins. For example, Doğan et al. [13] utilized the gene ontology (GO) terms associated with the target amino acid sequences as input, constructing a reliable HPO-GO association mapping to obtain co-annotation scores between HPO and GO terms. They employed these scores to predict HPO terms for proteins that possess GO terms but lack HPO terms. Similarly, Liu et al. [14] proposed a learning-to-rank framework integrating GO, amino acid alignment order, and domain information to generate a ranked list of HPO terms. These approaches highlight the reliability of intra-protein information as a valuable data source for characterizing proteins.

In addition to the intrinsic properties of individual proteins, researchers also place significant emphasis on the information derived from the relationships between proteins. Inter-protein information refers to the interactions and associations among multiple proteins, including similarities [20] and interactions between proteins [21]. Most biological traits result from the coordinated actions of multiple proteins, rather than being solely attributed to a specific protein [22]. Based on this idea, protein-protein interaction (PPI) data are often utilized in the prediction of protein-phenotype relationships. For example, Gao et al. [23] proposed the HPOAnnotator model, which utilizes non-negative matrix factorization to aggregate information from multiple PPI networks, thus enabling the prediction of protein-HPO annotations. Liu et al. [24] employed a graph convolutional network to aggregate protein information in PPI networks and utilized a neural network to generate prediction scores. They demonstrated that inter-protein information is useful for protein-phenotype relationship prediction.

Despite the significant progress made by these methods, the number of phenotypically annotated proteins is still limited [25]. Most proteins currently have only sequence information available; other information is either unavailable or requires additional biological experiments, which limits the use of these methods [13, 14]. Nowadays, the UniProt database [11] comprises over 249 million amino acid sequences. However, among these sequences, only 1.85% have reliable functional annotations, and approximately 23.56% have experimentally validated PPI information [26]. This large gap in available knowledge poses significant challenges for predicting the pathogenicity of proteins.

Ensemble models effectively mitigate the overfitting problem encountered with individual classifiers by combining results from different classifiers [27]. For example, Yu et al. [28] used the LightGBM model to learn protein features, thus improving the accuracy of peptide and protein functional identification. Li [29] improved the robustness and accuracy of the model in the protein interaction site prediction task by integrating 12 protein features through an ensemble network. Ensemble methods leverage the diversity of multiple models to reduce overfitting and improve prediction performance [30].

Inspired by these methods, we propose an ensemble deep learning method, named HPOseq, for predicting the relationships between proteins and HPO terms solely based on amino acid sequences. Our approach represents protein features from both intra-sequence and inter-sequence perspectives. Specifically, we develop two distinct base models that make separate predictions. First, we employ a 1D CNN to integrate contextual information, providing a local feature representation within the amino acid sequence of each protein. Second, the amino acid sequence similarities, calculated using BLAST, are utilized to construct a network. A Variational Graph Autoencoder (VGAE) is applied to the network to generate low-dimensional vectors of proteins, and a neural network is employed for prediction. Finally, we use an ensemble module to integrate the predictions from the two base models, resulting in the final protein-HPO prediction. Our experimental results validate that HPOseq provides multiple perspectives for predicting protein-phenotype relationships, mitigating the limitations of single-model predictions and enhancing the overall accuracy of the model.

Materials and methods

Dataset

The Human Phenotype Ontology database (https://hpo.jax.org/app/) provides relationships between human phenotype ontology terms and genes. We download the October 2021 release of the database and use the mapping tool provided by the UniProt database to map genes to their corresponding proteins. To ensure the high quality of the dataset, we include only proteins listed in Swiss-Prot. Following the true path rule of the directed acyclic graph of HPO, we propagate the annotations of each protein from its annotated HPO terms to all of their ancestor terms. To ensure the reliability of our model’s outputs, we employ a filtering approach similar to that of Liu et al. [24], excluding HPO terms associated with fewer than 10 genes. The final dataset includes 4647 proteins, 4575 HPO terms, and 717,516 relationships between them.

The proposed methods

The ensemble model HPOseq utilizes only amino acid sequences to predict protein-phenotype relationships and consists of three modules: (1) prediction based on intra-sequence features, which extracts features within the protein amino acid sequence to predict phenotypes; (2) prediction based on inter-sequence features, which extracts features between sequences for protein-phenotype relationship prediction; (3) an ensemble module, which integrates the predictions from the two base models to achieve a fused outcome. The framework of HPOseq is shown in Fig. 1.

Prediction based on intra-sequence features

Each protein has a distinctive amino acid sequence, and the sequence’s specific structure influences the protein’s functionality. First, we download the amino acid sequences of proteins from the UniProt database. Over 99% of the amino acid sequences are less than or equal to 2000 in length [11]. To standardize the sequence length for each protein, following the preprocessing strategy of [1], we retain the first 2000 amino acids for sequences exceeding 2000 in length. For sequences shorter than 2000, we pad with zeros to achieve a length of 2000. Next, we employ a triplet method to encode the standardized-length amino acid sequences [31]. Each protein sequence is composed of 20 types of amino acids. With 20 amino acids forming ordered combinations of three ($20^{3}=8000$), plus one triple of all zeros, there are a total of 8001 combinations to form a dictionary. We use a window of size 3 to scan the amino acid sequence, encoding it as 1 or 0 based on whether it is present in the dictionary. A window of size 3 sliding over 2000 positions yields 1998 windows, so we finally obtain the sequence feature representation, denoted as $S\in {\mathbb {R}}^{1998}$, for each protein.
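The encoding steps above can be sketched in Python as follows. This is only a minimal illustration: the function names, the residue ordering, the choice to represent each window by its dictionary index, and the mapping of windows that are not in the dictionary (for example, windows mixing residues with padding) to index 0 are all assumptions made here, not details confirmed by the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (ordering is an assumption)
MAX_LEN = 2000
PAD = "0"


def build_triplet_dictionary():
    """Return a dict with 20**3 residue triplets plus the all-zero triplet (8001 entries)."""
    vocab = {PAD * 3: 0}
    for index, triple in enumerate(product(AMINO_ACIDS, repeat=3), start=1):
        vocab["".join(triple)] = index
    return vocab


def encode_sequence(seq, vocab, max_len=MAX_LEN):
    """Truncate or zero-pad to max_len, then scan with a window of size 3.

    Windows missing from the dictionary are mapped to index 0 (an assumption).
    The result has max_len - 2 entries, i.e. 1998 for max_len = 2000.
    """
    seq = seq[:max_len].ljust(max_len, PAD)
    return [vocab.get(seq[i:i + 3], 0) for i in range(max_len - 2)]


vocab = build_triplet_dictionary()
encoded = encode_sequence("MKTAYIAKQR", vocab)
print(len(vocab), len(encoded))
```

The two printed values are the dictionary size and the length of the encoded vector, which can be checked against the 8001 combinations and the 1998-dimensional representation described above.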

1D-CNN is an effective method to extract features from sequences [32]. We employ 1D-CNN to capture contextual and semantic knowledge within the amino acid sequence of each protein. Features are obtained by sliding a convolutional kernel, denoted as $f(\cdot )$, along the input sequence and calculating the weighted sum of local features. The encoding of the $i\text {-}th$ position in the sequence can be defined as follows:

$$\begin{aligned} C(i)=\sum _{m=0}^{K-1}S(i+m)\cdot f(m) , \end{aligned}$$

(1)

where f(m) denotes the convolutional kernel of size K.
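For readers who prefer code, Eq. 1 can be written as a short NumPy sketch. The function names are hypothetical, and the sketch covers a single kernel without padding, stride, or bias, which a deep learning framework's convolution layer would normally add.

```python
import numpy as np


def conv1d_position(S, f, i):
    """C(i) = sum_{m=0}^{K-1} S(i+m) * f(m) for a kernel f of size K."""
    K = len(f)
    return float(np.dot(S[i:i + K], f))


def conv1d(S, f):
    """Slide the kernel over every position where it fits entirely inside S."""
    return np.array([conv1d_position(S, f, i) for i in range(len(S) - len(f) + 1)])


S = np.array([1.0, 2.0, 3.0, 4.0])
f = np.array([1.0, 0.0, -1.0])
print(conv1d(S, f))
```

Note that this is the cross-correlation form used by most neural network libraries; it matches `np.correlate(S, f, mode="valid")`.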

To capture multi-scale features from the sequence, we incorporate three layers of 1D-CNNs for feature extraction with convolutional kernel sizes of 256, 128, and 64, respectively. We apply a max-pooling layer with a window size of 64 to reduce dimensionality and utilize a flattening layer to transform the resulting feature maps into feature vectors. Finally, we employ a two-layer fully connected neural network (FNN) to predict the association between human proteins and phenotypes, and the formula of the FNN is as follows:

$$\begin{aligned} Y_{intra} = sigmoid(W^{2}(ReLU(W^{1} C+b^{1}))+b^{2}) , \end{aligned}$$

(2)

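where $Y_{intra}$ is the output of the FNN, C is the feature vector obtained from the convolutional, pooling, and flattening layers, and $W^{l}$ and $b^{l}$ ($l=1,2$) are the weights and biases of the $l\text {-}th$ layer. We use the output of the last layer, $Y_{intra}$, as the prediction score of the model.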
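A minimal NumPy sketch of the forward pass in Eq. 2 is given below. The layer sizes and random weights are placeholders chosen for illustration only; they are not the values used in the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fnn_forward(C, W1, b1, W2, b2):
    """Y_intra = sigmoid(W2 @ ReLU(W1 @ C + b1) + b2), as in Eq. 2."""
    hidden = np.maximum(0.0, W1 @ C + b1)  # first layer with ReLU
    return sigmoid(W2 @ hidden + b2)       # output layer, one score per HPO term


# Toy dimensions (assumptions): 8 input features, 4 hidden units, 3 HPO terms.
rng = np.random.default_rng(0)
C = rng.standard_normal(8)
W1, b1 = rng.standard_normal((4, 8)), np.zeros(4)
W2, b2 = rng.standard_normal((3, 4)), np.zeros(3)
scores = fnn_forward(C, W1, b1, W2, b2)
print(scores.shape)
```

Because the output activation is a sigmoid, every score lies strictly between 0 and 1 and can be read as an independent per-term probability, which suits a multi-label setting.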

Prediction based on inter-sequence features

To capture the correlation and influence among different sequences, we construct a graph based on sequence similarity, defined as $G=\{V,E,X\}$, where V is the node set of proteins, E describes the relationships between nodes in the graph, and X is the node attribute matrix. We first utilize the BLAST tool [33] to calculate the similarities of protein sequences. The more similar two sequences are, the lower their E-value calculated by BLAST. Then, we set a cutoff, denoted cut, for the similarity score between nodes in the graph. For sequences i and j, if their E-value is below cut, then there exists an edge $e_{ij} \in E$ between nodes i and j in the graph G. The adjacency matrix $A \in {\mathbb {R}}^{n\times n}$ can be defined using similarity scores, where the value in the $i\text {-}th$ row and $j\text {-}th$ column is defined as follows:

$$\begin{aligned} {a_{ij}} = {\left\{ \begin{array}{ll} 1-e_{ij},& {\text {if}}\ e_{ij}<cut \\ {0,}& {\text {otherwise,}} \end{array}\right. } \end{aligned}$$

(3)

where cut is the optimal cutoff obtained through experiments, and $e_{ij}$ is the E-value of sequence i and sequence j.
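Eq. 3 can be sketched as follows, assuming the pairwise BLAST E-values have already been collected into a dense matrix. The function name, the use of `inf` for pairs without a BLAST hit, and the toy values are assumptions for illustration.

```python
import numpy as np


def build_adjacency(evalues, cut):
    """a_ij = 1 - e_ij if e_ij < cut, otherwise 0 (Eq. 3).

    evalues: (n, n) array of pairwise E-values; pairs without a BLAST hit
    are assumed to be stored as np.inf so that they never form an edge.
    """
    evalues = np.asarray(evalues, dtype=float)
    return np.where(evalues < cut, 1.0 - evalues, 0.0)


E = np.array([
    [0.0,   0.005, 5.0],
    [0.005, 0.0,   np.inf],
    [5.0,   np.inf, 0.0],
])
A = build_adjacency(E, cut=0.01)
print(A)
```

With a small cutoff, every retained weight $1-e_{ij}$ is close to 1, so the resulting matrix behaves almost like a binary adjacency matrix.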

We use the conjoint triad (CT) method [34] to encode the sequences as node features in the network. The CT method divides the 20 amino acids into 7 groups based on their dipole characteristics and side chain volume. Therefore, the dimension of the CT feature vector is 7 $\times$ 7 $\times$ 7, totaling 343 dimensions. In the encoding process, we treat every three consecutive amino acids in the protein sequence as a triad. The frequency of each triad’s occurrence is counted and used as the feature value for that triad. In this way, protein sequences of different lengths are converted into fixed-length vector representations. The attribute matrix of nodes in the graph G can be defined as: $X \in {\mathbb {R}}^{|V|\times 343}$.
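The triad counting described above can be illustrated with the following sketch. The seven residue groups listed here are the grouping commonly used with the conjoint triad method; the paper does not list them, so treat them as an assumption. The sketch returns raw counts, and any normalization step is omitted.

```python
# Seven residue classes commonly used with the conjoint triad method
# (assumed here; the paper does not enumerate them).
CT_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_GROUP = {aa: g for g, members in enumerate(CT_GROUPS) for aa in members}


def conjoint_triad(seq):
    """Count how often each of the 7*7*7 = 343 group triads occurs in seq."""
    counts = [0] * 343
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        # Skip windows containing non-standard residues (an assumption).
        if all(aa in RESIDUE_TO_GROUP for aa in window):
            a, b, c = (RESIDUE_TO_GROUP[aa] for aa in window)
            counts[a * 49 + b * 7 + c] += 1
    return counts


features = conjoint_triad("MKTAYIAKQRQISFVK")
print(len(features), sum(features))
```

Because every sequence maps to the same 343 slots regardless of its length, the vectors can be stacked directly into the node attribute matrix X.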

VGAE [35] is an unsupervised feature extraction technique widely utilized for network data, consisting of an encoder and a decoder. We employ a two-layer Graph Convolutional Network (GCN) to learn the latent representation of nodes in the graph. The two-layer GCN is defined as follows:

$$\begin{aligned} {\bar{X}} = GCN(X,A)= {\widetilde{A}}ReLU({\widetilde{A}}XW^{0} ) W^{1} , \end{aligned}$$

(4)

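where ${\widetilde{A}} = D^{-\frac{1}{2}} AD^{-\frac{1}{2} }$ is the symmetric normalized adjacency matrix, D is the degree matrix of A, and $W^{0}$ and $W^{1}$ are the weight matrices of the two layers. We use this two-layer GCN as the encoder, which defines the variational posterior over the latent variables: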

$$\begin{aligned} q(Z |X,A)=\prod _{i=1}^{n}q(z_{i}|X,A) , \end{aligned}$$

(5)

$$\begin{aligned} q(z_{i} |X,A)=N(z_{i}|\mu _{i},diag(\sigma _{i}^{2} )) . \end{aligned}$$

(6)

Then, we can get the latent variables z as follows:

$$\begin{aligned} z_{i} = \mu _{i}+ \sigma _{i}\epsilon , \end{aligned}$$

(7)

where $\mu _{i}$ and $\sigma _{i}$ are the mean and standard deviation of the Gaussian distribution of $z_{i}$ (so that $\sigma _{i}^{2}$ is the variance in Eq. 6), and $\epsilon \sim N(0,1)$.
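The encoder side of Eqs. 4 and 7 can be sketched in NumPy as shown below. The helper names are hypothetical, nodes without neighbours are given a zero normalization factor (an assumption, since the paper does not discuss isolated nodes), and how $\mu$ and $\sigma$ are produced from the GCN outputs is left open because the paper does not specify it.

```python
import numpy as np


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes keep zero rows."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return A * inv_sqrt[:, None] * inv_sqrt[None, :]


def gcn_two_layer(X, A_norm, W0, W1):
    """Eq. 4: A_norm @ ReLU(A_norm @ X @ W0) @ W1."""
    return A_norm @ np.maximum(0.0, A_norm @ X @ W0) @ W1


def sample_latent(mu, sigma, rng):
    """Eq. 7 (reparameterization): z = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


A = np.array([[0.0, 2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
A_norm = normalize_adjacency(A)
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
mu = gcn_two_layer(X, A_norm, rng.standard_normal((5, 4)), rng.standard_normal((4, 2)))
z = sample_latent(mu, sigma=np.full_like(mu, 0.1), rng=rng)
print(z.shape)
```

Sampling through $\mu + \sigma \epsilon$ keeps the random draw outside the learnable quantities, which is what allows gradients to flow back to the encoder during training.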

We reconstruct the adjacency matrix from the latent variables Z using the inner product decoder. This decoder computes the probability of edges between nodes through inner products:

$$\begin{aligned} p(A|Z)=\prod _{i=1}^{n} \prod _{j=1}^{n}p(A_{ij}|z_{i},z_{j} ) , \end{aligned}$$

(8)

$$\begin{aligned} p(A_{ij}|z_{i},z_{j} )=sigmoid(z_{i}^{\top }z_{j}) . \end{aligned}$$

(9)

We compare the network generated by the decoder with the original network and calculate the distance between network distributions as an error to train the model:

$$\begin{aligned} Loss_{CE} = {\mathbb {E}} _{q(Z|X,A)}[\log {p(A|Z)}] , \end{aligned}$$

(10)

$$\begin{aligned} Loss = Loss_{CE}-KL[q(Z|X,A)||p(Z)] . \end{aligned}$$

(11)

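$Loss_{CE}$ is the reconstruction term, which corresponds to the cross entropy between the original network and the generated network, and $KL[q(\cdot )||p(\cdot )]$ is the KL divergence between $q(\cdot )$ and $p(\cdot )$. As written, Eq. 11 has the form of a variational lower bound, so training the model amounts to maximizing it or, equivalently, minimizing its negative.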
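A single-sample estimate of the quantities in Eqs. 9 to 11 can be sketched as follows. The standard normal prior $p(Z)$ and the closed-form KL term for diagonal Gaussians are assumptions taken from the usual VGAE formulation, since the paper does not spell them out; `sigma` denotes the scale parameter from Eq. 7.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def vgae_objective(A, Z, mu, sigma, eps=1e-10):
    """Return E_q[log p(A|Z)] - KL[q(Z|X,A) || p(Z)] using one sample Z.

    Assumes p(Z) is a standard normal prior and sigma > 0.
    """
    P = np.clip(sigmoid(Z @ Z.T), eps, 1.0 - eps)              # Eq. 9 edge probabilities
    log_lik = np.sum(A * np.log(P) + (1.0 - A) * np.log(1.0 - P))  # Eq. 10
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))
    return log_lik - kl                                         # Eq. 11


A = np.array([[0.0, 1.0], [1.0, 0.0]])
mu = np.zeros((2, 3))
sigma = np.ones((2, 3))
value = vgae_objective(A, Z=mu, mu=mu, sigma=sigma)
print(value)
```

An optimizer that minimizes a loss would be given the negative of this value. The weighted entries $1-e_{ij}$ of the adjacency matrix can be used directly as soft targets in the reconstruction term.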

Finally, we utilize the Z obtained from VGAE as the feature representation for the sequences. With this representation, we employ a three-layer FNN to generate the prediction results $Y_{inter}$ for the inter-sequence feature model; the computation is identical to that of the FNN used in the intra-sequence feature model.

Ensemble module

Inspired by Cai et al. [30], we use an ensemble module based on a weighted fusion strategy. This module adds a mask matrix to the FNN. Masking irrelevant neurons during the learning process ensures that the model is not affected by other HPO annotation results when predicting the target HPO term. The predicted outputs from the two base models are fed into this module as inputs. These inputs are weighted and fused through a hidden layer to derive the probability associated with each HPO term. The fusion is computed by the following equation:

$$\begin{aligned} \begin{aligned} P = ReLU(W_{intra}Y_{intra}^{HPOn}+W_{inter}Y_{inter}^{HPOn}) , \end{aligned} \end{aligned}$$

(12)

where P is the output vector of the weighted fusion layer, and $W_{intra}$ and $W_{inter}$ are the learnable weights assigned to the two base models.

Specifically, $Y_{intra}^{HPOn}$ and $Y_{inter}^{HPOn}$ represent the prediction results of the $n\text {-}th$ HPO term derived from the two base models, respectively. Ultimately, the aggregated results from the weighted fusion process are conveyed to the output layer of the model. In this final stage, we generate predicted scores for the respective HPO terms:

$$\begin{aligned} Score = sigmoid(WP+b) . \end{aligned}$$

(13)
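One way to read Eqs. 12 and 13 together with the mask matrix is that each HPO term has its own pair of fusion weights and its own output weight and bias, so that no term influences another. The sketch below implements that reading with element-wise products; this per-term (diagonal) interpretation, the function names, and the toy weights are assumptions rather than the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def ensemble_scores(Y_intra, Y_inter, w_intra, w_inter, w_out, b_out):
    """Per-term weighted fusion (Eq. 12) followed by the output layer (Eq. 13).

    Y_intra, Y_inter: (n_proteins, n_terms) base-model scores.
    w_intra, w_inter, w_out, b_out: (n_terms,) per-term parameters, which is
    equivalent to masking all off-diagonal entries of full weight matrices.
    """
    P = np.maximum(0.0, w_intra * Y_intra + w_inter * Y_inter)
    return sigmoid(w_out * P + b_out)


Y_intra = np.array([[0.2, 0.8]])
Y_inter = np.array([[0.6, 0.4]])
scores = ensemble_scores(
    Y_intra, Y_inter,
    w_intra=np.array([0.5, 1.0]), w_inter=np.array([0.5, 0.0]),
    w_out=np.array([1.0, 1.0]), b_out=np.array([0.0, 0.0]),
)
print(scores)
```

In this toy setting the second term relies entirely on the intra-sequence model, illustrating how the module can trust different base models for different HPO terms.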

For the training of our model, we adopt the gradient descent algorithm as the foundational optimization technique. Additionally, we employ cross-entropy as the loss function to quantitatively measure the disparity between our model’s predicted results and the actual labels.

Experiment settings

In our study, we conduct 5-fold cross-validation to evaluate the performance of our method. The dataset containing 4647 proteins is divided into 5 parts, with each part consisting of 742 proteins. In each fold, one part is used as the test set, and the remaining proteins are used as the training set. During the experiment, we use grid search to determine the optimal hyperparameters for each sub-model. For both the intra-sequence and inter-sequence base models, we set the batch size to 1024 and the number of epochs to 80. To avert overfitting, dropout is applied to both the input and hidden layers with a dropout rate of 0.3. In the ensemble module, the batch size is set to 1024 and the number of epochs is set to 3500. We employ the Adam optimizer for model optimization and use the sigmoid function as the output activation.

In this paper, we use two metrics commonly used for such tasks, AUPR and $F_{max}$, to assess the performance of the models. AUPR refers to the area under the Precision-Recall curve, which shows the relationship between Precision and Recall at different thresholds. $F_{max}$ is the maximum, over all thresholds, of the harmonic mean of Precision and Recall, and is used to comprehensively evaluate the accuracy of classification models. It is defined as follows:

$$\begin{aligned} rc(t)=\frac{1}{n} \sum _{i=1}^{n}rc_{i}(t), \end{aligned}$$

(14)

$$\begin{aligned} pr(t)=\frac{1}{m(t)} \sum _{i=1}^{m(t)}pr_{i}(t), \end{aligned}$$

(15)

$$\begin{aligned} F_{max}=max_{t}\left\{ \frac{2 pr(t) rc(t)}{pr(t)+rc(t)} \right\} , \end{aligned}$$

(16)

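where $t\in [0,1]$ represents the threshold value, and pr(t) and rc(t) represent the precision and recall corresponding to t, respectively. Here, $rc_{i}(t)$ and $pr_{i}(t)$ are the recall and precision for protein i at threshold t, n is the number of proteins, and m(t) is the number of proteins with at least one HPO term predicted at threshold t. The calculation of evaluation metrics in this paper is based on a protein-centric approach.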
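The protein-centric $F_{max}$ of Eqs. 14 to 16 can be sketched as below. The threshold grid, the function name, and the convention of averaging recall over proteins that have at least one true annotation are assumptions for illustration; the paper's evaluation code may differ in these details.

```python
import numpy as np


def fmax(y_true, y_score, thresholds=None):
    """Protein-centric F_max over a grid of thresholds.

    y_true:  (n_proteins, n_terms) binary annotation matrix.
    y_score: (n_proteins, n_terms) predicted scores in [0, 1].
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)
    n_true = y_true.sum(axis=1)
    has_truth = n_true > 0
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        tp = (pred & y_true).sum(axis=1)
        n_pred = pred.sum(axis=1)
        has_pred = n_pred > 0            # the m(t) proteins with a prediction
        if not has_pred.any() or not has_truth.any():
            continue
        rc = np.mean(tp[has_truth] / n_true[has_truth])  # Eq. 14
        pr = np.mean(tp[has_pred] / n_pred[has_pred])    # Eq. 15
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))    # Eq. 16
    return best


y_true = [[1, 0], [0, 1]]
y_score = [[0.9, 0.1], [0.2, 0.8]]
print(fmax(y_true, y_score))
```

Averaging precision only over proteins with at least one prediction means that a model is not penalized on precision for abstaining at high thresholds, while recall still accounts for every annotated protein.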

Experiments and results

Performance comparison with baselines

We compare our model with seven baseline methods. Among them, HPOLabeler [14], HPODNets [24], Naive [36], and Blast [33] are methods for predicting protein-phenotype relationships. HPOLabeler and HPODNets are methods that utilize multiple data types to extract protein features. Naive uses the number of occurrences of HPO terms as the HPO annotation score for all proteins. Blast assigns to each query protein all annotations of its most similar sequence, as identified by the BLAST tool. Because gene ontology (GO) prediction and phenotype ontology prediction rely on similar technical implementations, we select three GO prediction methods, DeepGOPlus [1], DeepFRI [37], and Struct2GO [38], as baselines. Like our method HPOseq, these methods only employ amino acid sequences as input data. To ensure a fair comparison, we replace the PPI network component of the HPOLabeler and HPODNets methods with the sequence similarity network. This allows us to evaluate the models’ performance in predicting protein-phenotype relationships solely using sequence-based information, without additional protein-based information.
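The Naive baseline described above is simple enough to sketch in a few lines. The function name is hypothetical, and dividing the occurrence counts by the number of training proteins, so that scores fall in [0, 1], is an assumption made here for compatibility with threshold-based metrics.

```python
import numpy as np


def naive_scores(train_labels, n_test):
    """Give every test protein the same score per HPO term: its training frequency."""
    freq = np.asarray(train_labels, dtype=float).mean(axis=0)
    return np.tile(freq, (n_test, 1))


train_labels = [[1, 0, 1], [1, 1, 0]]
print(naive_scores(train_labels, n_test=2))
```

Since all proteins receive identical scores, this baseline only measures how far term frequency alone can go, which makes it a useful lower bound for the other methods.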

The comparison results of the 5-fold cross-validation experiment between HPOseq and the baseline methods are shown in Fig. 2. HPOseq achieves the best overall performance among all methods, with AUPR and $F_{max}$ values of 0.3244 and 0.3869, respectively, improving by 1.7% and 1.8%, respectively, over the second-best method. HPOseq, HPOLabeler, and HPODNets extract feature representations of proteins from data sources to predict the phenotypic annotations of proteins. The performance improvement of HPOseq may be attributed to a more comprehensive feature representation method, which extracts features from both the composition of protein amino acid sequences and sequence similarity. Compared with DeepGOPlus, DeepFRI, and Struct2GO, which also rely solely on amino acid sequences to extract features, HPOseq still has a distinct advantage. This suggests that the feature extraction strategy proposed in this paper is more robust.

Parameter analysis

This paper conducts experimental analysis on important parameters within the model, including the feature embedding dimension of VGAE, the hidden layer dimension of the FNN, and the E-value threshold of similarity network construction.

Embedding dimension analysis

We perform a sensitivity analysis on the feature embedding dimensions of VGAE and the hidden layer dimensions of the FNN. In our experiments, we compare the performance of the model based on inter-sequence features with FNN and VGAE dimensions set to 256, 512, 1024, 2048, and 3072. Figure 3 illustrates the AUPR and $F_{max}$ results for different embedding dimensions in intra-sequence feature-based prediction and inter-sequence feature-based prediction. Our findings indicate that varying the hidden layer dimension has an impact on HPOseq prediction results. Notably, when the hidden layer dimension of both the FNN and VGAE is set to 2048, HPOseq achieves optimal prediction results.

Network construction analysis

The topology of the similarity network changes with different E-value thresholds. To illustrate the impact of different thresholds on model performance, we construct protein similarity networks at various thresholds. As shown in Table 1, the number of edges in the similarity network varies with the E-value threshold in the range of [0,1]. We execute HPOseq on different similarity networks, and the results, as shown in Table 2, indicate that the model achieves the highest AUPR and $F_{max}$ values when the E-value threshold is 0.01.

Table 1 The number of edges in the similarity network with different E-value thresholds

Table 2 Prediction performance of HPOseq with different E-values

Ablation study

Module ablation analysis

We conduct ablation experiments to assess the significance of different modules within HPOseq. In these experiments, we systematically select various modules from HPOseq and assess the contribution of each module by comparing the resulting AUPR and $F_{max}$ values.

Figure 4a illustrates the ablation experiment results. The findings demonstrate that integrating multiple features for prediction and then fusing the prediction outputs is significantly superior to relying solely on a single sequence feature for prediction. Interestingly, simply averaging the predictions of different features may not directly improve the final prediction. This may be because averaging cannot flexibly exploit the advantages of different features in predicting the corresponding HPO annotations. In contrast, the ensemble module can effectively integrate predictions from different features. The experimental results highlight the significance of result fusion in model design while also emphasizing the limitations of simple mean fusion.

Comparison of different similarity network construction methods

Different networks can be constructed using various similarity calculation methods. The choice of similarity calculation method affects the quality of features extracted by VGAE. We select the sequence similarity calculation method by comparing the performance of HPOseq under different similarity measures, including cosine similarity [39], Pearson similarity [40], and Smith-Waterman similarity [41]. The results are shown in Fig. 4b. Pearson similarity and cosine similarity focus on correlation or directional similarity of numerical data, while Smith-Waterman similarity focuses more on optimal local sequence alignment. The similarity networks generated by these methods cannot accurately and comprehensively capture the similarity between amino acid sequences. The BLAST tool, on the other hand, prioritizes sequence global alignment information when calculating similarity between sequences, and can provide high-quality networks for downstream prediction tasks.

Sensitivity analysis of fusion methods

The key to ensemble modeling methods is to improve the accuracy of the final prediction by fusing the prediction results of multiple features. We use weighted summation, an FNN, averaging, an AUPR-based weight set, and an $F_{max}$-based weight set, respectively, to fuse the results and generate the final prediction. By comparing these fusion methods, we aim to identify the most effective way to fuse multiple predictions into one highly accurate prediction.

Figure 4c illustrates the comparative results of the various fusion strategies, highlighting the accuracy of the ensemble module in aggregating and generating final predictions from the various feature predictions. Unlike other result fusion strategies, the ensemble module has the advantage of being able to independently learn the prediction weights for different HPO terms in each feature prediction result. This allows the model to assign maximum weights to the most accurate feature prediction results while mitigating the impact of erroneous prediction models on the results.

Case studies

Protein prediction centered on HPO term

To further validate the practicality of our model, we adopt the approach proposed by Liu et al. [24] to assess the prediction quality. In making predictions using the October 2021 dataset, we focus on proteins associated with pneumonia (HP:0002090) and identify the top ten highest-scoring proteins from newly predicted candidates for validation. Subsequently, we compare these predictions with information from the recent literature and the latest publications in the database. Notably, six of the top ten predictions are validated in the subsequent correlation data, as outlined in Table 3.

Table 3 Top proteins predicted as most related to pneumonia (HP:0002090), with supporting literature

The work of Han et al. [42] shows that UGRP1-PDPN signaling augments inflammatory cytokine secretion during Streptococcus pneumoniae infection. Disrupting the UGRP1-PDPN interaction has been identified as a prospective therapy for combating Streptococcus pneumoniae. MyD88 is a pivotal adaptor protein in Toll-like and IL-1 receptor family signaling, crucial for regulating innate immune responses and inflammation. Mutations in the TRNT1 gene could substantially reduce anti-SARS-CoV-2 antibody levels, thereby elevating the risk of lung infections. McGonagle et al. [43] conclude that a pathological correlation exists between IL6 and pneumonia, and variations in IL6 levels among patients can influence their viral immunity. A mutation at the chrX locus of the CD40LG gene is a potential contributor to patients’ interstitial pneumonia and liver injury. Furthermore, Novelli et al. [44] establish an association between HLA-DRB1 and severe novel coronavirus (COVID-19) pneumonia symptoms. The validation of our results demonstrates that HPOseq holds promise for the discovery of novel candidate genes/proteins associated with aberrant phenotypes.

HPO term prediction centered on protein

We use the trained model to predict HPO terms for three proteins (V9HW98, A0PJI1, O14997) that are not included in the training data. Meanwhile, we reference the protein gene mapping relationship downloaded from the HPO database in 2021 to map the protein to the corresponding gene and search for literature support from the PubMed database. We rank the newly predicted HPO annotation scores of these three proteins from high to low according to the model, as shown in Table 4.

Table 4 New protein prediction and literature support

The YWHAE gene encodes the 14-3-3$\epsilon$ protein, which is a multifunctional molecule involved in many key biological processes. It is closely associated with a variety of diseases, including but not limited to breast cancer, gastric cancer, and colorectal cancer [45]. 14-3-3$\epsilon$ is involved in signal transduction, affecting MMP-3 and MMP-13 levels, with CD13 playing a crucial role. Additionally, Ikeda et al. [46] suggested YWHAE as a potential protective gene in schizophrenia. The HIC1 gene, also known as HIC ZBTB transcriptional repressor 1, encodes protein A0PJI1 [47]. The supporting literature describes the effect of HIC1 absence on proliferation in tissue-resident mesenchymal promoters, as well as the embryonic origins and potential implications of HIC1 in conditions like age-related macular degeneration. Lastly, the CCDC88A gene, encoding protein O14997, has been linked to various neurological and joint conditions. CCDC88A has been associated with degenerative joint diseases and plays a critical role in neural development and in disorders like hyperrhythmicity and optic atrophy syndrome. These specific cases confirm the practicality of our method in identifying novel protein HPO term annotations.

Conclusion

Unveiling the relationships between human proteins and phenotypes can assist in disease prevention, diagnosis, and treatment. In this paper, we introduce HPOseq, a deep ensemble model capable of predicting protein-phenotype relationships using only amino acid sequences. We build two sub-models based on amino acid sequences to predict the phenotype annotations of proteins. One represents protein features based on the composition of amino acid sequences. The other sub-model constructs a network using protein sequence similarity and extracts features. Finally, the integrated model is used to fuse the prediction results from these two perspectives to achieve the final prediction goal. Extensive experimental validation has shown that the model proposed in this paper can effectively predict human protein-phenotype relationships.

Although HPOseq significantly enhances the performance of protein-phenotype relationship prediction, we still treat this problem as a traditional multi-label classification task, only discussing in depth the extraction of protein features from amino acid sequences. In fact, the Human Phenotype Ontology database provides abundant phenotype ontology data, which could provide useful information for protein-phenotype relationship prediction. Future research can explore more effective methods to leverage the relationships between HPO term nodes to serve protein-phenotype relationship prediction.

Availability of data and materials

The code and data used in this study are freely downloadable at https://github.com/LabBioMedCoder/HPOseq.

References

1. Kulmanov M, Hoehndorf R. Deepgoplus: improved protein function prediction from sequence. Bioinformatics. 2020;36(2):422–9.
2. Bao W, Yang B. Protein acetylation sites with complex-valued polynomial model. Front Comput Sci. 2024;18:183904.
3. Bao W, Liu Y, Chen B. Oral_voting_transfer: classification of oral microorganisms’ function proteins with voting transfer model. Front Microbiol. 2024;14:1277121.
4. Bao W, Chen B, Zhang Y. WSHNN: A weakly supervised hybrid neural network for the identification of DNA-protein binding sites. Curr Comput-aided Drug Design. 2024. https://doi.org/10.2174/0115734099277249240129114123.
5. Yu G, Zhang L, Zhang Y, Zhou J, Zhang T, Bi X. Prediction and risk stratification from hospital discharge records based on hierarchical sLDA. BMC Med Inform Decis Mak. 2022;22(1):1–12.
6. Lapitz A, Azkargorta M, Milkiewicz P, Olaizola P, Zhuravleva E, Grimsrud MM, Schramm C, Arbelaiz A, O’rourke CJ, La Casta A, et al. Liquid biopsy-based protein biomarkers for risk prediction, early diagnosis, and prognostication of cholangiocarcinoma. J Hepatol. 2023;79(1):93–108.
7. Bi X, Jiang C, Yan C, Zhao K, Zhang L, Wang J (2023) Identifying mirna-disease associations based on simple graph convolution with dropmessage and jumping knowledge. In: International symposium on bioinformatics research and applications, Springer, pp 45–57.
8. Pomaznoy M, Ha B, Peters B. Gonet: a tool for interactive gene ontology analysis. BMC Bioinf. 2018;19(1):1–8.
9. Köhler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, Danis D, Balagura G, Baynam G, Brower AM. The human phenotype ontology in 2021. Nucleic Acids Res. 2021;49(D1):1207–17.
10. Franz M, Rodriguez H, Lopes C, Zuberi K, Montojo J, Bader GD, Morris Q. Genemania update 2018. Nucleic Acids Res. 2018;46(W1):60–4.
11. Uniprot: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51(D1):523–31.
12. Bi X, Liang W, Zhao Q, Wang J. Sslpheno: a self-supervised learning approach for gene-phenotype association prediction using protein-protein interactions and gene ontology data. Bioinformatics. 2023;39(11):662.
13. Doğan T. Hpo2go: prediction of human phenotype ontology term associations for proteins using cross ontology annotation co-occurrences. PeerJ. 2018;6:5298.
14. Liu L, Huang X, Mamitsuka H, Zhu S. Hpolabeler: improving prediction of human protein-phenotype associations by learning to rank. Bioinformatics. 2020;36(14):4180–8.
15. Caldararu O, Blundell TL, Kepp KP. A base measure of precision for protein stability predictors: structural sensitivity. BMC Bioinf. 2021;22:1–14.
16. Wang L, You Z-H, Chen X, Yan X, Liu G, Zhang W. Rfdt: a rotation forest-based predictor for predicting drug-target interactions using drug structure and protein sequence information. Curr Protein Pept Sci. 2018;19(5):445–54.
17. Zhang L, Lu D, Bi X, Zhao K, Yu G, Quan N. Predicting disease genes based on multi-head attention fusion. BMC Bioinf. 2023;24(1):162.
18. Zhang Y. I-tasser server for protein 3d structure prediction. BMC Bioinf. 2008;9:1–8.
19. Li Y, Huang Y-A, You Z-H, Li L-P, Wang Z. Drug-target interaction prediction based on drug fingerprint information and protein sequence. Molecules. 2019;24(16):2999.
20. Gerlt JA, Bouvier JT, Davidson DB, Imker HJ, Sadkhin B, Slater DR, Whalen KL. Enzyme function initiative-enzyme similarity tool (EFI-EST): a web tool for generating protein sequence similarity networks. Biochim Et Biophys Acta-Proteins Proteom. 2015;1854(8):1019–37.
21. Athanasios A, Charalampos V, Vasileios T, Md Ashraf G. Protein-protein interaction (PPI) network: recent advances in drug discovery. Curr Drug Metab. 2017;18(1):5–10.
22. Casey PJ. Protein lipidation in cell signaling. Science. 1995;268(5208):221–5.
23. Gao J, Liu L, Yao S, Huang X, Mamitsuka H, Zhu S. Hpoannotator: improving large-scale prediction of HPO annotations by low-rank approximation with HPO semantic similarities and multiple PPI networks. BMC Med Genom. 2019;12(10):1–14.
24. Liu L, Mamitsuka H, Zhu S. Hpodnets: deep graph convolutional networks for predicting human protein-phenotype associations. Bioinformatics. 2022;38(3):799–808.
25. Liu L, Zhu S. Computational methods for prediction of human protein-phenotype associations: a review. Phenomics. 2021;1(4):171–85.
26. Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, Gable AL, Fang T, Doncheva NT, Pyysalo S. The string database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023;51(D1):638–46.
27. Xiao Y, Wu J, Lin Z, Zhao X. A deep learning-based multi-model ensemble method for cancer prediction. Comput Methods Programs Biomed. 2018;153:1–9.
28. Yu H, Luo X. IPPF-FE: an integrated peptide and protein function prediction framework based on fused features and ensemble models. Brief Bioinform. 2023;24(1):476.
29. Li Y, Golding GB, Ilie L. Delphi: accurate deep ensemble model for protein interaction sites prediction. Bioinformatics. 2021;37(7):896–904.
30. Cai Y, Wang J, Deng L. Sdn2go: an integrated deep learning model for protein function prediction. Front Bioeng Biotechnol. 2020;8:391.
31. Du Z, He Y, Li J, Uversky VN. Deepadd: protein function prediction from k-mer embedding and additional features. Comput Biol Chem. 2020;89:107379.
32. Villegas-Morcillo A, Gomez AM, Morales-Cordovilla JA, Sanchez V. Protein fold recognition from sequences using convolutional and recurrent neural networks. IEEE/ACM Trans Comput Biol Bioinf. 2020;18(6):2848–54.
33. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
34. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H. Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci. 2007;104(11):4337–41.
35. Kipf TN, Welling M. Variational graph auto-encoders. Stat. 2016;1050:21.
36. Clark WT, Radivojac P. Analysis of protein function and its prediction from amino acid sequence. Proteins Struct Function Bioinf. 2011;79(7):2086–96.
37. Gligorijević V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, Chandler C, Taylor BC, Fisk IM, Vlamakis H. Structure-based protein function prediction using graph convolutional networks. Nat Commun. 2021;12(1):3168.
38. Moyano JM, Gibaja EL, Ventura S. Mlda: a tool for analyzing multi-label datasets. Knowl-Based Syst. 2017;121:1–3.
39. Lee B, Lee D (2009) Protein comparison at the domain architecture level. In: BMC Bioinformatics, vol. 10, pp 1–9. BioMed Central.
40. Mana SC, Sasipraba T. Research on cosine similarity and pearson correlation based recommendation models. J Phys Conf Series. 2021;1770:012014.
41. Zhao M, Lee W-P, Garrison EP, Marth GT. SSW library: an SIMD smith-waterman C/C++ library for use in genomic applications. PLoS ONE. 2013;8(12):82138.
42. Han L, Zhang F, Liu Y, Yu J, Zhang Q, Ye X, Song H, Zheng C, Han B. Uterus globulin associated protein 1 (UGRP1) binds podoplanin (PDPN) to promote a novel inflammation pathway during streptococcus pneumoniae infection. Clin Transl Med. 2022;12(6):850.
43. McGonagle D, Sharif K, O’Regan A, Bridgewood C. The role of cytokines including interleukin-6 in COVID-19 induced pneumonia and macrophage activation syndrome-like disease. Autoimmun Rev. 2020;19(6):102537.
44. Novelli A, Andreani M, Biancolella M, Liberatoscioli L, Passarelli C, Colona VL, Rogliani P, Leonardis F, Campana A, Carsetti R. HLA allele frequencies and susceptibility to COVID-19 in a group of 99 Italian patients. Hla. 2020;96(5):610–4.
45. Nefla M, Sudre L, Denat G, Priam S, Andre-Leroux G, Berenbaum F, Jacques C. The pro-inflammatory cytokine 14-3-3$\varepsilon$ is a ligand of cd13 in cartilage. J Cell Sci. 2015;128(17):3250–62.
46. Ikeda M, Hikita T, Taya S, Uraguchi-Asaki J, Toyo-Oka K, Wynshaw-Boris A, Ujike H, Inada T, Takao K, Miyakawa T. Identification of YWHAE, a gene encoding 14-3-3epsilon, as a possible susceptibility gene for schizophrenia. Hum Mol Genet. 2008;17(20):3212–22.
47. Scott RW, Arostegui M, Schweitzer R, Rossi FM, Underhill TM. Hic1 defines quiescent mesenchymal progenitor subpopulations with distinct functions and fates in skeletal muscle regeneration. Cell Stem Cell. 2019;25(6):797–813.

Funding

This work is supported by the National Natural Science Foundation of China (No. 62366052), the Key R&D Program of Xinjiang Uygur Autonomous Region (2022B03023), and the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2024D01C126, 2022D01C427).

Author information

Authors and Affiliations

School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China
Kai Zhao, Zhuocheng Ji, Na Quan & Yuheng Li
School of Software, Xinjiang University, Urumqi, 830011, China
Linlin Zhang
College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, 830011, China
Guanglei Yu & Xuehua Bi
School of Computer Science and Engineering, Central South University, Changsha, 410083, China
Guanglei Yu & Xuehua Bi

Corresponding author

Correspondence to Xuehua Bi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhao, K., Ji, Z., Zhang, L. et al. HPOseq: a deep ensemble model for predicting the protein-phenotype relationships based on protein sequences. BMC Bioinformatics 26, 110 (2025). https://doi.org/10.1186/s12859-025-06122-3

Received: 14 January 2024
Accepted: 27 March 2025
Published: 22 April 2025
DOI: https://doi.org/10.1186/s12859-025-06122-3
