- Research
- Open access
- Published:
Mammalian piRNA target prediction using a hierarchical attention model
BMC Bioinformatics volume 26, Article number: 50 (2025)
Abstract
Background
Piwi-interacting RNAs (piRNAs) are well established for monitoring and protecting the genome from transposons in germline cells. Recently, numerous studies provided evidence that piRNAs also play important roles in regulating mRNA transcript levels. Despite their significant role in regulating cellular RNA levels, the piRNA targeting rules are not well defined, especially in mammals, which poses obstacles to the elucidation of piRNA function.
Results
Given the complexity and current limitation in understanding the mammalian piRNA targeting rules, we designed a deep learning model by selecting appropriate deep learning sub-networks based on the targeting patterns of piRNA inferred from previous experiments. Additionally, to alleviate the problem of insufficient data, a transfer learning approach was employed. Our model achieves a good discriminatory power (Accuracy: 98.5%) in predicting an independent test dataset. Finally, this model was utilized to predict the targets of all mouse and human piRNAs available in the piRNA database.
Conclusions
In this research, we developed a deep learning framework that significantly advances the prediction of piRNA targets, overcoming the limitations posed by insufficient data and current incomplete targeting rules. The piRNA target prediction network and results can be downloaded from https://github.com/SofiaTianjiaoZhang/piRNATarget.
Background
Piwi-interacting RNAs (piRNAs) are 21–35 nucleotide small noncoding RNAs present in animals [1, 2]. piRNAs were initially found in germline cells as important post-transcriptional regulators in transposon silencing [3, 4]. Subsequently, piRNAs were found to have essential function in the regulation of mRNAs [5, 6] which have been highlighted by their roles in axon regeneration [7], embryonic development [8, 9], memory [10, 11], foraging behavior [12] and human diseases [13, 14].
Previous work has suggested that piRNAs guide PIWI-clade argonaute proteins to recognize and regulate mRNA through complementary base pairing [15,16,17]. However, unlike microRNAs (miRNAs), which direct AGO-clade argonaute proteins to repress the target RNA via 7- nucleotide seed (2-8nt) complementarity recognition [18, 19], the targeting rules of piRNAs seem to be distinct and more intricate. In Caenorhabditis elegans (C. elegans), complementarity in both the seed region (2-7nt) and non-seed region (8-21nt) appears to be essential. Specifically, the targeting rules dicate zero non-GU mismatches and up to two GU pairs in the non-seed region, up to three non-GU mismatches in non-seed, and up to six mismatches in seed and non-seed regions combined [15]. This rule is widely used in nematodes but does not appear to be suitable in evolutionarily divergent species. For instance, the piRNA targeting rules in Drosophila melanogaster demonstrate a lower tolerance for mismatches compared to the rule observed in C. elegans [20]. In mammals, the targeting rule of piRNAs remains more ambiguous. Recently, several mammalian piRNA targeting-related studies were conducted in mice to address the targeting rules. A slicer assay of MIWI (Murine homolog of PIWI)-piRNA complexes indicated that extensive complementarity between nucleotides 2-22nt is required for targeting, while several mismatches at 3′-terminal nucleotides were tolerated [21]. Meanwhile, a study leveraging 5′RACE-seq and CLIP-seq methodologies concluded that MIWI-piRNA complexes targeting mRNAs prefer strict base pairing at the 5′ end of piRNAs, accompanied by a minimal 16nt stretch of continuous complementarity between piRNAs and mRNAs, with less than three mismatches in the subsequent 20nt sequences [22]. A more recent study similarly suggests that MIWI-piRNA complexes favor a longer extent of continuous complementarity, and that extending pairing allows MIWI proteins to tolerate guide-target mismatches. This study also suggests that complementary pairing of the 5′ can accelerate the targeting efficiency, yet is not essential [23].
These diverse conclusions are likely caused by different experimental and inductive methods yet imply the significant complexity in the targeting rules of mammalian piRNAs compared to those of C. elegans. However, based upon these studies, it appears that mammalian piRNA targeting rules have two main features: (1) targeting relies on complementary base pairings, especially the number of consecutive matches at the 5′ end; (2) complementary base pairings at different positions have varying effects on the results, for example, continuous pairing at the 3′ end can increase the tolerance for mismatches at the 5′ end to a certain extent. Additionally, the targeting rules in C. elegans and mice exhibit some degree of overlap, as both organisms require a minimum level of base-pairing between piRNA and mRNA sequences as a fundamental prerequisite for effective targeting. This establishes the foundation for exploring the application of transfer learning from C. elegans to mammals.
Recent structural analyses of the PIWI-piRNA complex have revealed striking similarities in the piRNA targeting rules among mammals, yet notable divergences from nematodes [17, 24]. While piRNA targeting prediction models have been proposed, there is no agreement on their general applicability. One computational model, piRNAPre, employs a support vector machine classifier that relies on CLIP-seq-derived and position-based features. While this model exhibits satisfactory performance in predicting piRNAs and their corresponding targets in mouse testes, the generalizability to other tissues and species is limited by the availability of CLIP-seq-derived features [25]. Therefore, a model that predicts target sites solely based on base sequences is imperative [26]. With the mammalian piRNA targeting patterns inferred from previous experiments, we propose to improve mammalian piRNA targeting prediction by training a devised deep learning model that integrates Convolutional Neural Network (CNN) with hierarchical attention layers. Specifically, CNN is suitable for extracting features from piRNA and mRNA sequences, such as potential preferences for PIWI protein binding motifs, perfectly matched seed regions, and other relevant characteristics. The first attention sub-network (Interaction Layer) focus on base-level analysis, which plays a pivotal role in recognizing the complementary between piRNA and mRNA bases. A second attention sub-network (Global Dominant Base-Pair Matching Extraction Layer) globally identifies dominant base-pair matching positions across the entire sequence, which complements the base-level analysis in first attention sub-network by offering a broader perspective on the sequence-level interactions. Furthermore, given the scarcity of mammal-specific datasets compared to the abundance of C. elegans data, and the observed similarities in piRNA targeting rules between mammals and C. elegans, we have devised a transfer learning strategy. This approach involves pre-training our model on C. elegans data, followed by fine-tuning with mice data. By doing so, we maximize the utilization of existing resources while ensuring the relevance and specificity of our model for mammalian piRNAs.
Methods
Dataset collection and preprocessing
Certain advanced sequencing methodologies facilitate the efficient identification of potential piRNA targeting sites in a high-throughput manner. Notably, CLASH-seq [27] stands out by providing invaluable insights into the intricate piRNA-PIWI-mRNA interaction dynamics. On the other hand, CLIP-seq [28] offers the information of the complex interplay between piRNA-PIWI and mRNA-PIWI complexes, shedding light on their functional associations. Furthermore, 5′-RACE sequencing excels in precisely capturing potential piRNA-mediated cleavage sites, thereby reinforcing the evidence for the direct interaction between piRNA and mRNA.
Considering these characteristics, we mainly collected relevant data focused on these experiments that could contain such piRNA-mRNA sequence pairs and used them to construct training dataset. For example: CLASH, CLIP-Seq, RIP-Seq, etc. Search using the keyword in GEO: (HIWI OR MIWI OR PIWI OR piRNA) AND (CLASH OR CLIP OR RIP). Finally, the datasets as shown in Table 1 were collected. The positive and negative C. elegans data was obtained from the CLASH dataset, which provides sequences of piRNA and its targeting mRNA [29, 30]. The positive mice data were also retrieved from sequences of piRNA and its target mRNA, where these interactions were detected using 5′-RACE sequencing and validated by CLIP-seq [6, 22].
Due to the lack of negative data in the mice training dataset, we employed the following method to simulate an authentic negative dataset. Specifically, we first analyzed the mismatch frequency distributions of positive and negative piRNA-mRNA pairs in C. elegans, calculating the mean and variance differences between them (Fig. 1a). Based on this analysis, we inferred the mismatch frequency characteristics of potential negative piRNA-mRNA interactions in mice (Fig. 1b).
Mismatch Frequency Distribution in piRNA-mRNA pairs in training dataset. a The mismatch frequency distribution of piRNA-mRNA pairs in C. elegans positive and negative dataset. b The mismatch frequency distribution of piRNA-mRNA pairs in mice positive training dataset and estimated negative training dataset. The simulation of the negative dataset commences with an analysis of the mismatch frequency distributions for both positive and negative piRNA-mRNA pairs in the model organism C. elegans. We calculate the mean (\(\mu_{c\_pos}\), \(\mu_{c\_neg}\)) and standard deviation (\(\sigma_{c\_pos}\), \(\sigma_{c\_neg}\)) of the mismatch frequencies for both positive (\(\mu_{c\_pos}\), \(\sigma_{c\_pos}\)) and negative (\(\mu_{c\_neg}\), \(\sigma_{c\_neg}\)) pairs. The mean difference (\(\Delta \mu_{c} = \mu_{c\_neg} - \mu_{c\_pos}\)) and the standard deviation ratio (\(\frac{{\sigma_{c\_neg} }}{{\sigma_{c\_pos} }}\)) between the negative and positive piRNA-mRNA pairs are determined. We assume that these differences are crucial for understanding how the negative interactions deviate from the positive ones. Next, the mean (\(\mu_{m\_pos}\)) and standard deviation (\(\sigma_{m\_pos}\)) of the mismatch frequencies for the positive piRNA-mRNA pairs in mice are calculated. The estimated standard deviation of the negative dataset in mice (\(\sigma_{est\_m\_neg}\)) is calculating by multiplying the standard deviation of the positive dataset by the standard deviation ratio derived from C. elegans:
The mean of the negative dataset in mice (\(\mu_{est\_m\_neg}\))is estimated using the mean and the standard deviation of the positive dataset, adjusted by the ratio of the maximum length of mice piRNA (35nt) to C. elegan piRNA length (21nt):
Based on \(\sigma_{est\_m\_neg}\) and \(\mu_{est\_m\_neg}\), we estimate the mismatch frequency distribution in negative mice piRNA-mRNA pairs.
Next, we collected mRNAs that exhibited unaffected expression levels in the testes of MIWI null mutant mice [21]. To do this, we retrieved the RNA-seq data and applied the differential analysis methodology described in the original publication [21]. The rationale for this selection is that, in these mutants, MIWI-piRNA complexes are dysfunctional. If an mRNA were regulated by piRNAs, its expression level would be expected to change in the absence of functional MIWI-piRNA complexes. Conversely, mRNAs whose expression levels remain stable in the MIWI null mutants are highly unlikely to be targeted by piRNAs. We note that another mouse PIWI protein MILI, while its role in post-transcriptional modification of coding genes is not fully elucidated, is known to have slicer activity for silencing the retrotransposon LINE1 in spermatocytes through piRNA guidance [31]. To mitigate any potential impact from MILI, we analyzed both wide-type and MILI mutant RNA-seq data to identify mRNAs that did not exhibit differential expression in MILI mutants [32]. By interacting this set with the unchanged mRNAs from MIWI mutants, we have been able to select mRNAs that remained unaffected in both mutants. These mRNAs are considered less likely to be targeted by piRNAs, as they do not show differential expression even when the PIWI proteins known to interact with piRNAs are nonfunctional therefore were considered ideal candidates for our negative dataset. Subsequently, we aligned piRNAs to these mRNAs, allowing a certain degree of mismatch, to generate a potential negative training dataset. Finally, we stratified the sampling of the piRNA-mRNA site pairs within this potential negative training dataset according to previously estimated negative piRNA-mRNA mismatch frequency (Fig. 1b), resulting in a reliable simulated negative dataset for our subsequent analyses.
For test data, we utilized the data from an independent experiment that detected piRNAs and their targets using 5′-RACE and CLIP-seq as positive test data [6] (Table 1). Given the data scarcity, we fixed these 99 positive samples and paired it with five different negative sets, generated using the methods described for mice negative training dataset. This approach allowed us to generate five varied test sets, each with the same positive interactions but different negative interactions. This strategy allows us to compute multiple AUCs across these test sets, which gives more comprehensive evaluation of the model’s performance.
Deep learning model construction
In this study, we used Pytorch (1.12.0 + cu113, python 3.9) [33] to construct a deep learning model including CNN layers and hierarchical attention layers (Fig. 2a). The detailed structure is presented in Fig. 2b.
Summary diagrams of the piRNA target prediction model. a The graph outlines the basic framework of our piRNA targeting prediction model, comprising five essential components. Commencing with the Embedding Layer, it transforms base sequences into numerical representations for subsequent processing. Then, the Feature Extraction Layer harnesses the power of 1D-CNN (One-Dimensional Convolutional Neural Network) to extract potential features from both piRNA and mRNA sequences, capturing their inherent characteristics for the next step. Moving forward, the Interaction Layer simulates the interaction between piRNA and mRNA, incorporating attention mechanisms to dynamically assess and emphasize the binding affinity for each base in the piRNA sequence. Following this, the Global Dominant Base-Pair Matching Extraction Layer leverages attention mechanisms again to identify globally significant base-pair matching patterns that are crucial for piRNA targeting. Finally, in the binary classification layer, the model generates piRNA targeting predictions. Notably, during the pre-training phase, the entire network undergoes concerted training. Conversely, in the fine-tuning stage, the initial three components of the model are strategically frozen, permitting a focused refinement of the Global Dominant Base-Pair Matching Extraction Layer and the classification layer. The pre-training and fine-tuning strategy leverage the shared patterns identified across C. elegans and mice, while subsequently refining the model to better suit the intricate specifics of the mammalian task. b Detailed structure of the model including values of parameters used
Embedding
The specificity of piRNA targeting within RNA sequences is primarily governed by regions of base pair complementarity, with a certain degree of mismatch being permissible (however, insertions and deletions are generally not allowed) [17, 23]. By focusing on these complementarity regions, we can capture the essence of piRNA-mRNA interactions while accommodating variations in mRNA lengths and minimizing the number of training parameters in our model. Therefore, the input data for our model consists of piRNAs and their corresponding mRNA target site sequences. To prepare these sequences for input into our deep learning model, we performed a one-hot encoding transformation on the nucleotide characters. This process involves converting each nucleotide in the sequences into a unique numeric vector, allowing the model to interpret the biological data in a format it can process. For a sequence \(S = \left[ {s_{1} , s_{2} , \ldots , s_{n - 1} , s_{n} } \right]\), we have its corresponding embedded matrix \(X\) as
where
This one-hot encoding ensures that each nucleotide is represented by a unique binary vector, facilitating the subsequent layers of the neural network to identify and learn patterns associated with different nucleotides. Following the one-hot encoding, we applied additional pre-processing steps to the embedded matrix \(X\). We standardized the length of all sequences by padding shorter ones with \(\left[ {0, 0,0,0} \right]^{T}\) to match the length of the longest piRNA sequence, essential for batch processing in deep learning models. A masking mechanism was introduced to differentiate between actual sequence data and padding tokens during training, ensuring the attention mechanism focuses only on valid sequence data.
CNN layer
CNN model is suitable for extracting features from piRNA and mRNA sequences, such as potential PIWI protein binding motif preferences, perfect matched seed regions, etc. The embedded piRNA sequence vector and mRNA sequence vector were individually processed through a 1D-CNN operation [34].
\(C_{n} = E \otimes K_{n}\), where \(C_{n} \left( j \right) = \mathop \sum \limits_{q = j - m}^{j + m} \mathop \sum \limits_{1}^{4} K_{n} \left( {q - j + m + 1, r} \right)*E\left( {q,r} \right)\).
E is piRNA/mRNA sequence embedding matrix; \(C_{n} \left( j \right)\) is the jth element of the output matrix \(C_{n}\); \(K_{n}\) is the n-th 1 by (2 m + 1) kernel filter with depth 4, iterating over four ribonucleotides (r) and kernel positions (q).
The 1D-CNN operation is further enhanced by a Squeeze-and-Excitation (SE) Block [35], which utilizes layer weighting to dynamically recalibrate and refine the extracted patterns based on global context.
where \(E_{len}\) is the length of the embedded sequence vector, \(s\left( n \right)\) is the output of 1D- adaptive average pooling over the 1D-CNN output, \(W_{s,1}\) and \(W_{s,2}\) are the trainable weight parameter matrices with a reduction factor of 4, \(\circ\) is the element-wise product between matrices.
Hierarchical attention layers
The base-level attention sub-network (Interaction Layer) was designed to recognize the interaction patterns between piRNA sequences and mRNA sequences. This structure locally determines the significance of binding affinity between each site in the piRNA sequence and every site in the mRNA sequence.
where \(H_{t}\) is the output of t-th head, for \(t = 1,...,k\). P and M represent the extracted motif feature matrices of piRNA and mRNA sequences from the previous step, respectively. \(W_{t}^{{Q_{1} }}\),\(W_{t}^{{K_{1} }}\),\(W_{t}^{{V_{1} }}\),\(W^{{H_{1} }}\) are the matrices of trainable weights.
Considering base pairing at different positions contributes unequally to the piRNA-mRNA target interaction, we implemented a sequence-level attention sub-network (Global Dominant Base-Pair Matching Extraction Layer) to globally extract the dominance base-pair matching positions across the entire sequence. \(O\) represents the output from the previous layer:
Each attention subnetwork is further followed by a Position-Wise Feed-Forward Network (FFN) Block:
where x is layer normalized \(H^{1}\) or \(H^{2}\).
Pre-training and fine-tuning
To address the data scarcity in mice piRNA-targeting training dataset, we propose a two-stage approach that leverages pre-training on C. elegans data and subsequent fine-tuning using mice data. This approach aims to harness the knowledge learned from C. elegans, whose data is more readily accessible and exhibits shared characteristics with mice, thereby bolstering the performance of our mammal-specific model. Specifically, we initially employed a tenfold cross-validation strategy to train the entire deep learning model using C. elegans data. The model parameters were optimized by minimizing the cross-entropy loss with Adam optimizer. The grid method was utilized to search best hyper-parameters across the split validation sets over 100 iterations. The final hyperparameters chosen for the devised model are as follows: learning rate: 0.001; learning scheduler: reduce-on-plateau, patience 5, factor 0.1, minimum learning rate 1e-6; Adam optimizer; batch size: 64; training epoch: 200 (without early stopping); label smoothing: 0.1; CNN output dimension: 64; attention heads: 4; dropout rates: 0.3 for dropout layers after the SE block, 0.1 for attention sub-networks; 0.5 for dropout layers in FFN blocks, 0.3 for dropout layers in the classification sub-network.
Subsequently, to prevent overfitting and to facilitate the model's adaptation to the new mice dataset while preserving the general features learned from the C. elegans data, we implemented a selective fine-tuning strategy. Specifically, we froze the parameters of the layers preceding the Global Dominant Base-Pair Matching Extraction Layer, which captures more generic sequence features, and fine-tuned only the remaining parameters that are more sensitive to specific dataset characteristics (Fig. 2). This approach ensures that the model retains the beneficial knowledge acquired during pre-training while adjusting to the nuances of the new data, thus balancing the risk of overfitting and enhancing the model's generalization capabilities. We adopted the same hyperparameters as the pre-training part, except for the number of epochs: 100 (without early stopping).
Investigation of the targets of piRNAs
To investigate the targets of millions of piRNAs, we aligned piRNAs [36] to the transcript sequences [37] with bwa (version 0.7.17-r1188) aln -n 6 -N -o 0 [38]. Then, aligned piRNA-mRNA pairs were predicted by our devised deep learning model. We utilized this methodology to compute the potential targets of all mouse and human piRNAs listed in piRNAclusterDB 2.0 [36], and uploaded the results to GitHub for public access.
In the investigation of gastric cancer cells, the small RNA-seq, HIWI-RIP-Seq were retrieved from PRJNA633677. Trim_galore (version 0.6.10) was utilized for quality control and adapter trimming. The refined RIP-seq reads were aligned to the genome with HISAT2 (version 2.2.1) [39]. The resulting fragment distributions were visualized using Drukbam (https://github.com/StephanHolgerD/DrukBam). Gene enrichment analyses were performed utilizing clusterProfiler (version 4.6.2) [40].
Results
The performance of the devised deep learning model
To assess the performance of our deep learning model, we compared it with several established methods using test data. In pirScan, the process requires perfect complementarity in the seed region (2-7nt) with zero non-GU and ≤ 2 GU mismatches, and high complementarity in the non-seed region (8-21nt) with ≤ 2 non-GU and ≤ 3 GU mismatches, allowing no more than 6 mismatches overall [41]. The Zhang’s rule requires a minimal 16nt continuous complementarity between piRNAs and mRNAs, and demands the base pairing at the 5′ end of piRNAs with mismatches of less than 3 in the subsequent 19-nucleotide sequence [22]. The RNA-binding energy were calculated by RNAcofold (2.6.4) [42]. As shown in Fig. 3, the fine-tuned devised model (AUC mean = 0.957) outperforms pre-trained devised model (AUC mean = 0.836), piRScan (AUC mean = 0.716), Zhang (AUC mean = 0.692) and RNA-binding energy (AUC mean = 0.734).
Comparison of the devised model with existing methods. The fine-tuned model presented here was compared to the pre-trained and other existing models to benchmark its performance. Bar graph illustrating the mean Area Under the Curve (AUC) values for the model's performance across five test sets. Each bar represents the average AUC, with the error bars indicating the standard deviation (SD), which quantifies the variability of the AUC scores and highlights the consistency of our model's predictions
In the proposed network architecture, the attention mechanism is tailored to mimic the site-by-site base-pair interaction between piRNA and mRNA sequences, while simultaneously capturing global base-pair binding patterns. Given that prior research has highlighted the preference for longer stretches of continuous complementary pairings in the piRNA targeting process, and that extensive matches towards the 3′ end can bolster the resilience to mismatches occurring at the 5′ end, we seek to determine whether our model can accurately detect such patterns [22, 24]. We present two examples in Fig. 4, piRNA Mmus_2680332 and its cognate targeting site on the mRNA Pdss2, along with piRNA Mmus_2441329 and its cognate targeting site on the mRNA RMER17A2. In base-level attention matrix, significant weight values were allocated to the base-pairing occurrences between piRNA and mRNA, while the sequence level attention matrix offered an adjustment of prediction based on the overall base-pair binding situation. Specifically, in the instance of piRNA Mmus_2680332 and its cognate targeting site on mRNA Pdss2, the initial region displays a remarkable degree of complementarity, while the subsequent sequence segments exhibit fewer matches. The sequence-level attention mechanism to assign greater significance to the initial region, thus facilitating enhanced refinement and adjustment during the predictive analysis. Similarly, in the instance of piRNA Mmus_2441329 and mRNA RMER17A2, despite the lack of prominent long-distance matches in the initial sections, the subsequent regions exhibit a strong degree of complementarity, thereby enhancing the tolerance for mismatches in the 5′ end. The sequence-level attention mechanism redirects greater weight to these latter regions, facilitating focused refinement and adjustment in the overall predictive analysis.
Visualization of the attention weight matrix. In these matrices, darker positions signify higher attention scores, indicating increased focus by the model. Left panel presents the base-level attention matrix generated by base-level attention sub-network (x-axis: piRNA Bases; y-axis: mRNA Bases). Notably, the model demonstrates an enhanced attention response when a complementary base pairing occurs between a piRNA and an mRNA at corresponding positions, resulting in a notable increase in the attention score. Right panel presents sequence-level attention matrix, offering an adjustment of prediction based on the overall binding situation (axis: piRNA-mRNA complex sites). This view integrates information across the entire sequence to ensure a balanced allocation of attention, optimizing the overall matching result within the broader context. a Mmus_2680332 and its targeting mRNA Pdss2. b Mmus_2441329 and its targeting mRNA RMER17A2
piRNAs from gastric cancer cells in human
The human PIWI-protein, HIWI, has been confirmed to exhibit significant overexpression in gastric cancer cell line SNU-1. This overexpression has been implicated in promoting critical cellular processes in gastric cancer, including cell proliferation, migration, tumorigenesis and metastasis [43]. We postulate that during cancer progression, the PIWI protein interacts with piRNA, thereby executing a crucial role in the targeted cleavage of specific mRNAs by the PIWI-piRNA complex. To investigate this hypothesis, we conducted the analysis of expression profiles of small RNA-seq, RNA-seq and HIWI-RIP-Seq in both wide-type SNU-1 cells and HIWI knockout SNU-1 cells. Utilizing our deep learning model, we predicted 426 mRNA targets of the 1873 piRNAs identified through small RNA-seq analysis. Many of these mRNA are from histone charperones genes, and gene enrichment analysis revealed that these predicted mRNA targets are enriched in nucleosome assembly-related pathways (Fig. 5a), a finding that aligns with the growing body of research highlighting the significance of nucleosome assembly in cancer biology [44]. The piRNA-guided cleavage-related degradation pattern was observed in numerous targeted mRNAs (Fig. 5b).
The analysis of the piRNA targets predicted by our model in gastric cell (SNU-1). a The biological process gene ontology enrichment analysis of mRNA targeted by piRNAs in gastric cells. This figure illustrates the enrichment patterns of genes whose mRNAs are specifically targeted by piRNAs in gastric cells. GeneRatio is the ratio between genes of interest in the gene set and total genes of interest. Dot color represents the adjusted p-value (Benjamini–Hochberg method) of enrichment analysis. b The Hsap-543472 guided cleavage-related degradation pattern of H2BC15 mRNA. This figure contrasts the distribution of sequencing reads from H2BC15 mRNA in the context of HIWI-RIP-Seq experiments performed on SNU-1 cancer cells, comparing conditions with (Left Panel) and without (Right Panel) HIWI protein expression. The left panel shows wild-type SNU-1 cells, where HIWI overexpression enhances piRNA-guided cleavage of H2BC15 mRNA, leading to significant degradation. In contrast, the right panel displays HIWI knockout SNU-1 cells, where the absence of HIWI results in reduced degradation of H2BC15 mRNA, highlighting the role of the HIWI-piRNA complex in regulating mRNA stability
Discussion
In this study, we carefully selected and integrated appropriate deep learning sub-networks based on the existing knowledge of piRNA targeting rules. This approach enabled us to develop a model that effectively captures the intricate nuances of piRNA-mediated interactions, subsequently leading to accurate predictions of piRNA targets. Another strength of our approach lies in the utilization of transfer learning. This approach allowed us to leverage the extensive understanding of piRNA biology in C. elegans, where data is more abundant, and apply that knowledge to the prediction of piRNA targets in mammals. The conservation of certain piRNA targeting mechanisms across species facilitated this transfer, enabling us to overcome the challenge of limited mammalian data. In practical application, our model complements initial predictions of piRNA alignment results made with alignment algorithms. Researchers can then input the derived piRNA-mRNA pairs into our model to assess the likelihood of piRNA targeting specific mRNAs. This approach may reveal the roles of differentially expressed piRNAs in disease pathogenesis and biological processes. By pinpointing piRNA targets within disease-related pathways, our model assists in identifying potential therapeutic targets and developing biomarkers, which are crucial for early disease detection and prognosis.
Previous research suggested the link between HIWI overexpression and gastric cancer development, yet the precise role of piRNAs in this context remains a mystery [43]. By integrating our model's predictive outcomes with rigorous validation using techniques like HIWI-RIP-Seq, we gained a new understanding of the piRNA landscape within gastric cancer cells. Our results revealed that in these cancerous cells, overexpressed HIWI, under the guidance of piRNAs, participates in the post-transcriptional cleavage of mRNAs, especially those histone chaperones associated with nucleosome function. The dysregulation of these chaperones can lead to changes in chromatin structure, thereby influencing gene expression patterns that are often aberrant in cancer cells. Our findings, coupled with the existing literature, suggest that the piRNAs we have identified may be integral to the regulatory mechanisms governing nucleosome assembly in gastric cancer, potentially impacting the progression and prognosis of the disease [45]. This enrichment of nucleosome assembly pathways in our gene set analysis provides a compelling rationale for further exploration of intricate interplay between piRNA and mRNAs in gastric cancer, particularly in the context of nucleosome dynamics and the influence of histone chaperone.
However, it is important to acknowledge that our study has limitations. The scarcity of high-quality, comprehensive mammalian datasets for piRNA-mRNA interactions remains a challenge, which may impact the performance of our model. Despite these limitations, our work represents an important step forward in the understanding of piRNA biology and its potential applications in disease research by providing a novel tool to this field. By continuing to refine our model and explore its applications in other contexts, we hope to uncover new insights into the complex networks of piRNAs and their roles in health and disease.
Conclusions
In summary, we developed a model for predicting piRNA targets. This model utilizes deep learning sub-networks, which are specifically designed for the intricate targeting patterns of piRNA that have been elucidated through extensive prior experimentation. Additionally, we employed a pre-training and fine-tuning transfer learning approach to mitigate the insufficient mammalian training dataset. The application of this model to explore the potential role of piRNA in gastric cancer cells, revealing their possible involvement in nucleosome-related processes.
Availability of data and materials
The original data utilized in this study can be sourced from the supplementary files from the research projects detailed in Table 1. The positive and negative datasets, which were either sorted from the original data or generated through our simulation as detailed in the Methods section, as well as the code for our deep learning model and the corresponding results, are all available for download at the following URL: https://github.com/SofiaTianjiaoZhang/piRNATarget.
Abbreviations
- piRNA:
-
Piwi-interacting RNA
- miRNA:
-
MicroRNA
- C. elegans :
-
Caenorhabditis elegans
- MIWI:
-
Murine homolog of PIWI
- 5′RACE-seq:
-
5′ Rapid amplification of cDNA ends sequencing
- CLIP-seq:
-
Cross-linking immunoprecipitation sequencing
- CLASH:
-
Cross-linking, ligation and sequencing of hybrids
- CNN:
-
Convolutional neural network
- SE block:
-
Squeeze-and-excitation block
- FFN block:
-
Position-wise feed-forward network block
- AUC:
-
Area under the curve
- HIWI:
-
Human homolog of PIWI
- RIP-seq:
-
RNA immunoprecipitation sequencing
References
Wang X, Ramat A, Simonelig M, Liu M-F. Emerging roles and functional mechanisms of PIWI-interacting RNAs. Nat Rev Mol Cell Biol. 2023;24(2):123–41.
Thomson T, Lin H. The biogenesis and function of PIWI proteins and piRNAs: progress and prospect. Annu Rev Cell Dev. 2009;25:355–76.
Siomi MC, Sato K, Pezic D, Aravin AA. PIWI-interacting small RNAs: the vanguard of genome defence. Nat Rev Mol Cell Biol. 2011;12(4):246–58.
Parhad SS, Tu S, Weng Z, Theurkauf WE. Adaptive evolution leads to cross-species incompatibility in the piRNA transposon silencing machinery. Dev Cell. 2017;43(1):60–70.
Peng JC, Lin H. Beyond transposons: the epigenetic and somatic functions of the Piwi-piRNA mechanism. Curr Opin Cell Biol. 2013;25(2):190–4.
Goh WSS, Falciatori I, Tam OH, Burgess R, Meikar O, Kotaja N, Hammell M, Hannon GJ. piRNA-directed cleavage of meiotic transcripts regulates spermatogenesis. Genes Dev. 2015;29(10):1032–44.
Kim KW, Tang NH, Andrusiak MG, Wu Z, Chisholm AD, Jin Y. A neuronal piRNA pathway inhibits axon regeneration in C. elegans. Neuron. 2018;97(3):511–9.
Barckmann B, Pierson S, Dufourt J, Papin C, Armenise C, Port F, Grentzinger T, Chambeyron S, Baronian G, Desvignes J-P. Aubergine iCLIP reveals piRNA-dependent decay of mRNAs involved in germ cell development in the early embryo. Cell Rep. 2015;12(7):1205–16.
Du W, Yang W, Xuan J, Gupta S, Krylov S, Ma X, Yang Q, Yang B. Reciprocal regulation of miRNAs and piRNAs in embryonic development. Cell Death Differ. 2016;23(9):1458–70.
Rajasethupathy P, Antonov I, Sheridan R, Frey S, Sander C, Tuschl T, Kandel ER. A role for neuronal piRNAs in the epigenetic control of memory-related synaptic plasticity. Cell. 2012;149(3):693–707.
Leighton LJ, Wei W, Marshall PR, Ratnu VS, Li X, Zajaczkowski EL, Spadaro PA, Khandelwal N, Kumar A, Bredy TW. Disrupting the hippocampal Piwi pathway enhances contextual fear memory in mice. Neurobiol Learn Mem. 2019;161:202–9.
Lee D, Yang H, Kim J, Brady S, Zdraljevic S, Zamanian M, Kim H. Paik Y-k, Kruglyak L, Andersen EC: the genetic basis of natural variation in a phoretic behavior. Nat Commun. 2017;8(1):273.
Zhang T, Wong G. Dysregulation of human somatic piRNA expression in Parkinson’s disease subtypes and stages. Int J Mol Sci. 2022;23(5):2469.
Li M, Yang Y, Wang Z, Zong T, Fu X, Aung LHH, Wang K. Wang J-x, Yu T: Piwi-interacting RNAs (piRNAs) as potential biomarkers and therapeutic targets for cardiovascular diseases. Angiogenesis. 2021;24:19–34.
Zhang D, Tu S, Stubna M, Wu W-S, Huang W-C, Weng Z, Lee H-C. The piRNA targeting rules and the resistance to piRNA silencing in endogenous genes. Science. 2018;359(6375):587–92.
Vourekas A, Zheng Q, Alexiou P, Maragkakis M, Kirino Y, Gregory BD, Mourelatos Z. Mili and Miwi target RNA repertoire reveals piRNA biogenesis and function of Miwi in spermiogenesis. Nat Struct Mol Biol. 2012;19(8):773–81.
Anzelon TA, Chowdhury S, Hughes SM, Xiao Y, Lander GC, MacRae IJ. Structural basis for piRNA targeting. Nature. 2021;597(7875):285–9.
Bartel DP. Metazoan MicroRNAs. Cell. 2018;173(1):20–51.
Kim D, Sung YM, Park J, Kim S, Kim J, Park J, Ha H, Bae JY, Kim S, Baek D. General rules for functional microRNA targeting. Nat Genet. 2016;48(12):1517–26.
McEnany J, Meir Y, Wingreen NS. piRNAs of Caenorhabditis elegans broadly silence nonself sequences through functionally random targeting. Nucleic Acids Res. 2022;50(3):1416–29.
Reuter M, Berninger P, Chuma S, Shah H, Hosokawa M, Funaya C, Antony C, Sachidanandam R, Pillai RS. Miwi catalysis is required for piRNA amplification-independent LINE1 transposon silencing. Nature. 2011;480(7376):264–7.
Zhang P, Kang J-Y, Gou L-T, Wang J, Xue Y, Skogerboe G, Dai P, Huang D-W, Chen R, Fu X-D. MIWI and piRNA-mediated cleavage of messenger RNAs in mouse testes. Cell Res. 2015;25(2):193–207.
Gainetdinov I, Vega-Badillo J, Cecchini K, Bagci A, Colpan C, De D, Bailey S, Arif A, Wu P-H, MacRae IJ, et al. Relaxed targeting rules help PIWI proteins silence transposons. Nature. 2023;619(7969):394–402.
Li Z, Li Z, Zhang Y, Zhou L, Xu Q, Li L, Zeng L, Xue J, Niu H, Zhong J et al. Mammalian PIWI-piRNA-target complexes reveal features for broad and efficient target silencing. Nat Struct Mol Biol 2024.
Yuan J, Zhang P, Cui Y, Wang J, Skogerbø G, Huang DW, Chen R, He S. Computational identification of piRNA targets on mouse mRNAs. Bioinformatics. 2016;32(8):1170–7.
Zhang T, Chen L, Li R, Liu N, Huang X, Wong G. PIWI-interacting RNAs in human diseases: databases and computational models. Brief Bioinform. 2022;23(4):bbac217.
Kudla G, Granneman S, Hahn D, Beggs JD, Tollervey D. Cross-linking, ligation, and sequencing of hybrids reveals RNA–RNA interactions in yeast. Proc Natl Acad Sci. 2011;108(24):10010–5.
Ule J, Jensen KB, Ruggiu M, Mele A, Ule A, Darnell RB. CLIP identifies Nova-regulated RNA networks in the brain. Science. 2003;302(5648):1212–5.
Yang T-H, Shiue S-C, Chen K-Y, Tseng Y-Y, Wu W-S. Identifying piRNA targets on mRNAs in C. elegans using a deep multi-head attention network. BMC Bioinform. 2021;22:1–23.
Shen E-Z, Chen H, Ozturk AR, Tu S, Shirayama M, Tang W, Ding Y-H, Dai S-Y, Weng Z, Mello CC. Identification of piRNA binding sites reveals the argonaute regulatory landscape of the C. elegans germline. Cell. 2018;172(5):937–51.
De Fazio S, Bartonicek N, Di Giacomo M, Abreu-Goodger C, Sankar A, Funaya C, Antony C, Moreira PN, Enright AJ, O’Carroll D. The endonuclease activity of Mili fuels piRNA amplification that silences LINE1 elements. Nature. 2011;480(7376):259–63.
Manakov SA, Pezic D, Marinov GK, Pastor WA, Sachidanandam R, Aravin AA. MIWI2 and MILI have differential effects on piRNA biogenesis and DNA methylation. Cell Rep. 2015;12(8):1234–43.
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L. Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019;32.
Kiranyaz S, Avci O, Abdeljaber O, Ince T, Gabbouj M, Inman DJ. 1D convolutional neural networks and applications: a survey. Mech Syst Signal Process. 2021;151: 107398.
Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018;7132–7141.
Rosenkranz D, Zischler H, Gebert D. piRNAclusterDB 2.0: update and expansion of the piRNA cluster database. Nucleic Acids Res. 2021;50(D1):D259–64.
Martin FJ, Amode MR, Aneja A, Austine-Orimoloye O, Azov Andrey G, Barnes I, Becker A, Bennett R, Berry A, Bhai J, et al. Ensembl 2023. Nucleic Acids Res. 2022;51(D1):D933–41.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:13033997 2013.
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37(8):907–15.
Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, Feng T, Zhou L, Tang W, Zhan L, et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation (Camb). 2021;2(3):100141.
Wu WS, Huang WC, Brown JS, Zhang D, Song X, Chen H, Tu S, Weng Z, Lee HC. pirScan: a webserver to predict piRNA targeting sites and to avoid transgene silencing in C. elegans. Nucleic Acids Res. 2018;46(W1):W43-w48.
Bernhart SH, Tafer H, Mückstein U, Flamm C, Stadler PF, Hofacker IL. Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol Biol. 2006;1(1):3.
Shi S, Yang Z-Z, Liu S, Yang F, Lin H. PIWIL1 promotes gastric cancer via a piRNA-independent mechanism. Proc Natl Acad Sci. 2020;117(36):22390–401.
Burgess RJ, Zhang Z. Histone chaperones in nucleosome assembly and human disease. Nat Struct Mol Biol. 2013;20(1):14–22.
Zhao Z, Cai Z, Jiang T, Han J, Zhang B. Histone chaperones and digestive cancer: a review of the literature. Cancers. 2022;14(22):5584.
Acknowledgements
We express our gratitude to researchers who have shared their data online. This project benefits from the computing resources from the Big Data Center for Biomedical Research in Wuyi University.
Funding
This study was supported by the grant 508170020342 from Wuyi University. This research was supported in part by the Faculty of Health Sciences, University of Macau.
Author information
Authors and Affiliations
Contributions
TJZ, LC and GW conceived the research topic. TJZ and HBZ designed and implemented the method. TJZ and GW wrote the manuscript. TJZ and LC performed the evaluation. All authors have read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
No applicable.
Consent for publication
The authors declare that they consent to publication.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, T., Chen, L., Zhu, H. et al. Mammalian piRNA target prediction using a hierarchical attention model. BMC Bioinformatics 26, 50 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12859-025-06068-6
Received:
Accepted:
Published:
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12859-025-06068-6