Mammalian piRNA target prediction using a hierarchical attention model

Zhang, Tianjiao; Chen, Liang; Zhu, Haibin; Wong, Garry

doi:10.1186/s12859-025-06068-6

Research
Open access
Published: 11 February 2025

Mammalian piRNA target prediction using a hierarchical attention model

Tianjiao Zhang¹^na1,
Liang Chen²^na1,
Haibin Zhu³ &
…
Garry Wong⁴

BMC Bioinformatics volume 26, Article number: 50 (2025) Cite this article

916 Accesses
Metrics details

Abstract

Background

Piwi-interacting RNAs (piRNAs) are well established for monitoring and protecting the genome from transposons in germline cells. Recently, numerous studies provided evidence that piRNAs also play important roles in regulating mRNA transcript levels. Despite their significant role in regulating cellular RNA levels, the piRNA targeting rules are not well defined, especially in mammals, which poses obstacles to the elucidation of piRNA function.

Results

Given the complexity and current limitation in understanding the mammalian piRNA targeting rules, we designed a deep learning model by selecting appropriate deep learning sub-networks based on the targeting patterns of piRNA inferred from previous experiments. Additionally, to alleviate the problem of insufficient data, a transfer learning approach was employed. Our model achieves a good discriminatory power (Accuracy: 98.5%) in predicting an independent test dataset. Finally, this model was utilized to predict the targets of all mouse and human piRNAs available in the piRNA database.

Conclusions

In this research, we developed a deep learning framework that significantly advances the prediction of piRNA targets, overcoming the limitations posed by insufficient data and current incomplete targeting rules. The piRNA target prediction network and results can be downloaded from https://github.com/SofiaTianjiaoZhang/piRNATarget.

Peer Review reports

Background

Piwi-interacting RNAs (piRNAs) are 21–35 nucleotide small noncoding RNAs present in animals [1, 2]. piRNAs were initially found in germline cells as important post-transcriptional regulators in transposon silencing [3, 4]. Subsequently, piRNAs were found to have essential function in the regulation of mRNAs [5, 6] which have been highlighted by their roles in axon regeneration [7], embryonic development [8, 9], memory [10, 11], foraging behavior [12] and human diseases [13, 14].

Previous work has suggested that piRNAs guide PIWI-clade argonaute proteins to recognize and regulate mRNA through complementary base pairing [15,16,17]. However, unlike microRNAs (miRNAs), which direct AGO-clade argonaute proteins to repress the target RNA via 7- nucleotide seed (2-8nt) complementarity recognition [18, 19], the targeting rules of piRNAs seem to be distinct and more intricate. In Caenorhabditis elegans (C. elegans), complementarity in both the seed region (2-7nt) and non-seed region (8-21nt) appears to be essential. Specifically, the targeting rules dicate zero non-GU mismatches and up to two GU pairs in the non-seed region, up to three non-GU mismatches in non-seed, and up to six mismatches in seed and non-seed regions combined [15]. This rule is widely used in nematodes but does not appear to be suitable in evolutionarily divergent species. For instance, the piRNA targeting rules in Drosophila melanogaster demonstrate a lower tolerance for mismatches compared to the rule observed in C. elegans [20]. In mammals, the targeting rule of piRNAs remains more ambiguous. Recently, several mammalian piRNA targeting-related studies were conducted in mice to address the targeting rules. A slicer assay of MIWI (Murine homolog of PIWI)-piRNA complexes indicated that extensive complementarity between nucleotides 2-22nt is required for targeting, while several mismatches at 3′-terminal nucleotides were tolerated [21]. Meanwhile, a study leveraging 5′RACE-seq and CLIP-seq methodologies concluded that MIWI-piRNA complexes targeting mRNAs prefer strict base pairing at the 5′ end of piRNAs, accompanied by a minimal 16nt stretch of continuous complementarity between piRNAs and mRNAs, with less than three mismatches in the subsequent 20nt sequences [22]. A more recent study similarly suggests that MIWI-piRNA complexes favor a longer extent of continuous complementarity, and that extending pairing allows MIWI proteins to tolerate guide-target mismatches. This study also suggests that complementary pairing of the 5′ can accelerate the targeting efficiency, yet is not essential [23].

These diverse conclusions are likely caused by different experimental and inductive methods yet imply the significant complexity in the targeting rules of mammalian piRNAs compared to those of C. elegans. However, based upon these studies, it appears that mammalian piRNA targeting rules have two main features: (1) targeting relies on complementary base pairings, especially the number of consecutive matches at the 5′ end; (2) complementary base pairings at different positions have varying effects on the results, for example, continuous pairing at the 3′ end can increase the tolerance for mismatches at the 5′ end to a certain extent. Additionally, the targeting rules in C. elegans and mice exhibit some degree of overlap, as both organisms require a minimum level of base-pairing between piRNA and mRNA sequences as a fundamental prerequisite for effective targeting. This establishes the foundation for exploring the application of transfer learning from C. elegans to mammals.

Recent structural analyses of the PIWI-piRNA complex have revealed striking similarities in the piRNA targeting rules among mammals, yet notable divergences from nematodes [17, 24]. While piRNA targeting prediction models have been proposed, there is no agreement on their general applicability. One computational model, piRNAPre, employs a support vector machine classifier that relies on CLIP-seq-derived and position-based features. While this model exhibits satisfactory performance in predicting piRNAs and their corresponding targets in mouse testes, the generalizability to other tissues and species is limited by the availability of CLIP-seq-derived features [25]. Therefore, a model that predicts target sites solely based on base sequences is imperative [26]. With the mammalian piRNA targeting patterns inferred from previous experiments, we propose to improve mammalian piRNA targeting prediction by training a devised deep learning model that integrates Convolutional Neural Network (CNN) with hierarchical attention layers. Specifically, CNN is suitable for extracting features from piRNA and mRNA sequences, such as potential preferences for PIWI protein binding motifs, perfectly matched seed regions, and other relevant characteristics. The first attention sub-network (Interaction Layer) focus on base-level analysis, which plays a pivotal role in recognizing the complementary between piRNA and mRNA bases. A second attention sub-network (Global Dominant Base-Pair Matching Extraction Layer) globally identifies dominant base-pair matching positions across the entire sequence, which complements the base-level analysis in first attention sub-network by offering a broader perspective on the sequence-level interactions. Furthermore, given the scarcity of mammal-specific datasets compared to the abundance of C. elegans data, and the observed similarities in piRNA targeting rules between mammals and C. elegans, we have devised a transfer learning strategy. This approach involves pre-training our model on C. elegans data, followed by fine-tuning with mice data. By doing so, we maximize the utilization of existing resources while ensuring the relevance and specificity of our model for mammalian piRNAs.

Methods

Dataset collection and preprocessing

Certain advanced sequencing methodologies facilitate the efficient identification of potential piRNA targeting sites in a high-throughput manner. Notably, CLASH-seq [27] stands out by providing invaluable insights into the intricate piRNA-PIWI-mRNA interaction dynamics. On the other hand, CLIP-seq [28] offers the information of the complex interplay between piRNA-PIWI and mRNA-PIWI complexes, shedding light on their functional associations. Furthermore, 5′-RACE sequencing excels in precisely capturing potential piRNA-mediated cleavage sites, thereby reinforcing the evidence for the direct interaction between piRNA and mRNA.

Considering these characteristics, we mainly collected relevant data focused on these experiments that could contain such piRNA-mRNA sequence pairs and used them to construct training dataset. For example: CLASH, CLIP-Seq, RIP-Seq, etc. Search using the keyword in GEO: (HIWI OR MIWI OR PIWI OR piRNA) AND (CLASH OR CLIP OR RIP). Finally, the datasets as shown in Table 1 were collected. The positive and negative C. elegans data was obtained from the CLASH dataset, which provides sequences of piRNA and its targeting mRNA [29, 30]. The positive mice data were also retrieved from sequences of piRNA and its target mRNA, where these interactions were detected using 5′-RACE sequencing and validated by CLIP-seq [6, 22].

Table 1 Data used for training and testing the devised deep learning model

Full size table

Due to the lack of negative data in the mice training dataset, we employed the following method to simulate an authentic negative dataset. Specifically, we first analyzed the mismatch frequency distributions of positive and negative piRNA-mRNA pairs in C. elegans, calculating the mean and variance differences between them (Fig. 1a). Based on this analysis, we inferred the mismatch frequency characteristics of potential negative piRNA-mRNA interactions in mice (Fig. 1b).

$$\sigma_{est\_m\_neg} = \sigma_{m\_pos} \times \frac{{\sigma_{c\_neg} }}{{\sigma_{c\_pos} }}$$

The mean of the negative dataset in mice ($\mu_{est\_m\_neg}$)is estimated using the mean and the standard deviation of the positive dataset, adjusted by the ratio of the maximum length of mice piRNA (35nt) to C. elegan piRNA length (21nt):

$$\mu_{est\_m\_neg} = \frac{{\Delta \mu_{c} }}{{\sigma_{c\_pos} }} \times \sigma_{m\_pos} \times \frac{35}{{21}} + \mu_{m\_pos}$$

Based on $\sigma_{est\_m\_neg}$ and $\mu_{est\_m\_neg}$, we estimate the mismatch frequency distribution in negative mice piRNA-mRNA pairs.

Next, we collected mRNAs that exhibited unaffected expression levels in the testes of MIWI null mutant mice [21]. To do this, we retrieved the RNA-seq data and applied the differential analysis methodology described in the original publication [21]. The rationale for this selection is that, in these mutants, MIWI-piRNA complexes are dysfunctional. If an mRNA were regulated by piRNAs, its expression level would be expected to change in the absence of functional MIWI-piRNA complexes. Conversely, mRNAs whose expression levels remain stable in the MIWI null mutants are highly unlikely to be targeted by piRNAs. We note that another mouse PIWI protein MILI, while its role in post-transcriptional modification of coding genes is not fully elucidated, is known to have slicer activity for silencing the retrotransposon LINE1 in spermatocytes through piRNA guidance [31]. To mitigate any potential impact from MILI, we analyzed both wide-type and MILI mutant RNA-seq data to identify mRNAs that did not exhibit differential expression in MILI mutants [32]. By interacting this set with the unchanged mRNAs from MIWI mutants, we have been able to select mRNAs that remained unaffected in both mutants. These mRNAs are considered less likely to be targeted by piRNAs, as they do not show differential expression even when the PIWI proteins known to interact with piRNAs are nonfunctional therefore were considered ideal candidates for our negative dataset. Subsequently, we aligned piRNAs to these mRNAs, allowing a certain degree of mismatch, to generate a potential negative training dataset. Finally, we stratified the sampling of the piRNA-mRNA site pairs within this potential negative training dataset according to previously estimated negative piRNA-mRNA mismatch frequency (Fig. 1b), resulting in a reliable simulated negative dataset for our subsequent analyses.

For test data, we utilized the data from an independent experiment that detected piRNAs and their targets using 5′-RACE and CLIP-seq as positive test data [6] (Table 1). Given the data scarcity, we fixed these 99 positive samples and paired it with five different negative sets, generated using the methods described for mice negative training dataset. This approach allowed us to generate five varied test sets, each with the same positive interactions but different negative interactions. This strategy allows us to compute multiple AUCs across these test sets, which gives more comprehensive evaluation of the model’s performance.

Deep learning model construction

In this study, we used Pytorch (1.12.0 + cu113, python 3.9) [33] to construct a deep learning model including CNN layers and hierarchical attention layers (Fig. 2a). The detailed structure is presented in Fig. 2b.

Embedding

The specificity of piRNA targeting within RNA sequences is primarily governed by regions of base pair complementarity, with a certain degree of mismatch being permissible (however, insertions and deletions are generally not allowed) [17, 23]. By focusing on these complementarity regions, we can capture the essence of piRNA-mRNA interactions while accommodating variations in mRNA lengths and minimizing the number of training parameters in our model. Therefore, the input data for our model consists of piRNAs and their corresponding mRNA target site sequences. To prepare these sequences for input into our deep learning model, we performed a one-hot encoding transformation on the nucleotide characters. This process involves converting each nucleotide in the sequences into a unique numeric vector, allowing the model to interpret the biological data in a format it can process. For a sequence $S = \left[ {s_{1} , s_{2} , \ldots , s_{n - 1} , s_{n} } \right]$, we have its corresponding embedded matrix $X$ as

$$X = \left[ {x_{1} , x_{2} , \ldots , x_{n - 1} , x_{n} } \right],$$

where

$$x_{i} = \left\{ {\begin{array}{*{20}l} {\left[ {1, 0,0,0} \right]^{T} ,} \hfill & {s_{i} = \text{"} A\text{"} ;} \hfill \\ {\left[ {0, 1,0,0} \right]^{T} ,} \hfill & {s_{i} = \text{"} T\text{"} ;} \hfill \\ {\left[ {0, 0,1,0} \right]^{T} ,} \hfill & {s_{i} = \text{"} C\text{"} ;} \hfill \\ {\left[ {0, 0,0,1} \right]^{T} ,} \hfill & {s_{i} = \text{"} G\text{"} .} \hfill \\ \end{array} } \right.$$

This one-hot encoding ensures that each nucleotide is represented by a unique binary vector, facilitating the subsequent layers of the neural network to identify and learn patterns associated with different nucleotides. Following the one-hot encoding, we applied additional pre-processing steps to the embedded matrix $X$. We standardized the length of all sequences by padding shorter ones with $\left[ {0, 0,0,0} \right]^{T}$ to match the length of the longest piRNA sequence, essential for batch processing in deep learning models. A masking mechanism was introduced to differentiate between actual sequence data and padding tokens during training, ensuring the attention mechanism focuses only on valid sequence data.

CNN layer

CNN model is suitable for extracting features from piRNA and mRNA sequences, such as potential PIWI protein binding motif preferences, perfect matched seed regions, etc. The embedded piRNA sequence vector and mRNA sequence vector were individually processed through a 1D-CNN operation [34].

$C_{n} = E \otimes K_{n}$, where $C_{n} \left( j \right) = \mathop \sum \limits_{q = j - m}^{j + m} \mathop \sum \limits_{1}^{4} K_{n} \left( {q - j + m + 1, r} \right)*E\left( {q,r} \right)$.

E is piRNA/mRNA sequence embedding matrix; $C_{n} \left( j \right)$ is the jth element of the output matrix $C_{n}$; $K_{n}$ is the n-th 1 by (2 m + 1) kernel filter with depth 4, iterating over four ribonucleotides (r) and kernel positions (q).

The 1D-CNN operation is further enhanced by a Squeeze-and-Excitation (SE) Block [35], which utilizes layer weighting to dynamically recalibrate and refine the extracted patterns based on global context.

$$s\left( n \right) = \frac{1}{{E_{len} }}\mathop \sum \limits_{q = 1}^{{E_{len} }} C_{n}^{ } \left( q \right),$$

$$F = sigmoid \; (ReLU \left( {\left( {sW_{S,1}^{T} } \right)W_{S,2}^{T} } \right) \circ C,$$

where $E_{len}$ is the length of the embedded sequence vector, $s\left( n \right)$ is the output of 1D- adaptive average pooling over the 1D-CNN output, $W_{s,1}$ and $W_{s,2}$ are the trainable weight parameter matrices with a reduction factor of 4, $\circ$ is the element-wise product between matrices.

Hierarchical attention layers

The base-level attention sub-network (Interaction Layer) was designed to recognize the interaction patterns between piRNA sequences and mRNA sequences. This structure locally determines the significance of binding affinity between each site in the piRNA sequence and every site in the mRNA sequence.

$$H_{t}^{1} = softmax\left( {\frac{{\left( {MW_{t}^{{Q_{1} }} } \right)\left( {PW_{t}^{{K_{1} }} } \right)^{T} }}{\sqrt d }} \right)\left( {MW_{t}^{{V_{1} }} } \right),$$

$$H^{1} = concat\left( {H_{1}^{1} ,H_{2}^{1} ,...,H_{k}^{1} } \right)(W^{{H_{1} }} )^{T} ,$$

where $H_{t}$ is the output of t-th head, for $t = 1,...,k$. P and M represent the extracted motif feature matrices of piRNA and mRNA sequences from the previous step, respectively. $W_{t}^{{Q_{1} }}$,$W_{t}^{{K_{1} }}$,$W_{t}^{{V_{1} }}$,$W^{{H_{1} }}$ are the matrices of trainable weights.

Considering base pairing at different positions contributes unequally to the piRNA-mRNA target interaction, we implemented a sequence-level attention sub-network (Global Dominant Base-Pair Matching Extraction Layer) to globally extract the dominance base-pair matching positions across the entire sequence. $O$ represents the output from the previous layer:

$$H_{t}^{2} = softmax\left( {\frac{{\left( {OW_{t}^{{Q_{2} }} } \right)\left( {OW_{t}^{{K_{2} }} } \right)^{T} }}{\sqrt d }} \right)\left( {OW_{t}^{{V_{2} }} } \right),$$

$$H^{2} = concat\left( {H_{1}^{2} ,H_{2}^{2} ,...,H_{k}^{2} } \right)(W^{{H_{2} }} )^{T} .$$

Each attention subnetwork is further followed by a Position-Wise Feed-Forward Network (FFN) Block:

$$FFN\left( x \right) = ReLu\left( {xW_{1} + b_{1} } \right)W_{2} ,$$

where x is layer normalized $H^{1}$ or $H^{2}$.

Pre-training and fine-tuning

To address the data scarcity in mice piRNA-targeting training dataset, we propose a two-stage approach that leverages pre-training on C. elegans data and subsequent fine-tuning using mice data. This approach aims to harness the knowledge learned from C. elegans, whose data is more readily accessible and exhibits shared characteristics with mice, thereby bolstering the performance of our mammal-specific model. Specifically, we initially employed a tenfold cross-validation strategy to train the entire deep learning model using C. elegans data. The model parameters were optimized by minimizing the cross-entropy loss with Adam optimizer. The grid method was utilized to search best hyper-parameters across the split validation sets over 100 iterations. The final hyperparameters chosen for the devised model are as follows: learning rate: 0.001; learning scheduler: reduce-on-plateau, patience 5, factor 0.1, minimum learning rate 1e-6; Adam optimizer; batch size: 64; training epoch: 200 (without early stopping); label smoothing: 0.1; CNN output dimension: 64; attention heads: 4; dropout rates: 0.3 for dropout layers after the SE block, 0.1 for attention sub-networks; 0.5 for dropout layers in FFN blocks, 0.3 for dropout layers in the classification sub-network.

Subsequently, to prevent overfitting and to facilitate the model's adaptation to the new mice dataset while preserving the general features learned from the C. elegans data, we implemented a selective fine-tuning strategy. Specifically, we froze the parameters of the layers preceding the Global Dominant Base-Pair Matching Extraction Layer, which captures more generic sequence features, and fine-tuned only the remaining parameters that are more sensitive to specific dataset characteristics (Fig. 2). This approach ensures that the model retains the beneficial knowledge acquired during pre-training while adjusting to the nuances of the new data, thus balancing the risk of overfitting and enhancing the model's generalization capabilities. We adopted the same hyperparameters as the pre-training part, except for the number of epochs: 100 (without early stopping).

Investigation of the targets of piRNAs

To investigate the targets of millions of piRNAs, we aligned piRNAs [36] to the transcript sequences [37] with bwa (version 0.7.17-r1188) aln -n 6 -N -o 0 [38]. Then, aligned piRNA-mRNA pairs were predicted by our devised deep learning model. We utilized this methodology to compute the potential targets of all mouse and human piRNAs listed in piRNAclusterDB 2.0 [36], and uploaded the results to GitHub for public access.

In the investigation of gastric cancer cells, the small RNA-seq, HIWI-RIP-Seq were retrieved from PRJNA633677. Trim_galore (version 0.6.10) was utilized for quality control and adapter trimming. The refined RIP-seq reads were aligned to the genome with HISAT2 (version 2.2.1) [39]. The resulting fragment distributions were visualized using Drukbam (https://github.com/StephanHolgerD/DrukBam). Gene enrichment analyses were performed utilizing clusterProfiler (version 4.6.2) [40].

Results

The performance of the devised deep learning model

To assess the performance of our deep learning model, we compared it with several established methods using test data. In pirScan, the process requires perfect complementarity in the seed region (2-7nt) with zero non-GU and ≤ 2 GU mismatches, and high complementarity in the non-seed region (8-21nt) with ≤ 2 non-GU and ≤ 3 GU mismatches, allowing no more than 6 mismatches overall [41]. The Zhang’s rule requires a minimal 16nt continuous complementarity between piRNAs and mRNAs, and demands the base pairing at the 5′ end of piRNAs with mismatches of less than 3 in the subsequent 19-nucleotide sequence [22]. The RNA-binding energy were calculated by RNAcofold (2.6.4) [42]. As shown in Fig. 3, the fine-tuned devised model (AUC mean = 0.957) outperforms pre-trained devised model (AUC mean = 0.836), piRScan (AUC mean = 0.716), Zhang (AUC mean = 0.692) and RNA-binding energy (AUC mean = 0.734).

In the proposed network architecture, the attention mechanism is tailored to mimic the site-by-site base-pair interaction between piRNA and mRNA sequences, while simultaneously capturing global base-pair binding patterns. Given that prior research has highlighted the preference for longer stretches of continuous complementary pairings in the piRNA targeting process, and that extensive matches towards the 3′ end can bolster the resilience to mismatches occurring at the 5′ end, we seek to determine whether our model can accurately detect such patterns [22, 24]. We present two examples in Fig. 4, piRNA Mmus_2680332 and its cognate targeting site on the mRNA Pdss2, along with piRNA Mmus_2441329 and its cognate targeting site on the mRNA RMER17A2. In base-level attention matrix, significant weight values were allocated to the base-pairing occurrences between piRNA and mRNA, while the sequence level attention matrix offered an adjustment of prediction based on the overall base-pair binding situation. Specifically, in the instance of piRNA Mmus_2680332 and its cognate targeting site on mRNA Pdss2, the initial region displays a remarkable degree of complementarity, while the subsequent sequence segments exhibit fewer matches. The sequence-level attention mechanism to assign greater significance to the initial region, thus facilitating enhanced refinement and adjustment during the predictive analysis. Similarly, in the instance of piRNA Mmus_2441329 and mRNA RMER17A2, despite the lack of prominent long-distance matches in the initial sections, the subsequent regions exhibit a strong degree of complementarity, thereby enhancing the tolerance for mismatches in the 5′ end. The sequence-level attention mechanism redirects greater weight to these latter regions, facilitating focused refinement and adjustment in the overall predictive analysis.

piRNAs from gastric cancer cells in human

The human PIWI-protein, HIWI, has been confirmed to exhibit significant overexpression in gastric cancer cell line SNU-1. This overexpression has been implicated in promoting critical cellular processes in gastric cancer, including cell proliferation, migration, tumorigenesis and metastasis [43]. We postulate that during cancer progression, the PIWI protein interacts with piRNA, thereby executing a crucial role in the targeted cleavage of specific mRNAs by the PIWI-piRNA complex. To investigate this hypothesis, we conducted the analysis of expression profiles of small RNA-seq, RNA-seq and HIWI-RIP-Seq in both wide-type SNU-1 cells and HIWI knockout SNU-1 cells. Utilizing our deep learning model, we predicted 426 mRNA targets of the 1873 piRNAs identified through small RNA-seq analysis. Many of these mRNA are from histone charperones genes, and gene enrichment analysis revealed that these predicted mRNA targets are enriched in nucleosome assembly-related pathways (Fig. 5a), a finding that aligns with the growing body of research highlighting the significance of nucleosome assembly in cancer biology [44]. The piRNA-guided cleavage-related degradation pattern was observed in numerous targeted mRNAs (Fig. 5b).

Discussion

In this study, we carefully selected and integrated appropriate deep learning sub-networks based on the existing knowledge of piRNA targeting rules. This approach enabled us to develop a model that effectively captures the intricate nuances of piRNA-mediated interactions, subsequently leading to accurate predictions of piRNA targets. Another strength of our approach lies in the utilization of transfer learning. This approach allowed us to leverage the extensive understanding of piRNA biology in C. elegans, where data is more abundant, and apply that knowledge to the prediction of piRNA targets in mammals. The conservation of certain piRNA targeting mechanisms across species facilitated this transfer, enabling us to overcome the challenge of limited mammalian data. In practical application, our model complements initial predictions of piRNA alignment results made with alignment algorithms. Researchers can then input the derived piRNA-mRNA pairs into our model to assess the likelihood of piRNA targeting specific mRNAs. This approach may reveal the roles of differentially expressed piRNAs in disease pathogenesis and biological processes. By pinpointing piRNA targets within disease-related pathways, our model assists in identifying potential therapeutic targets and developing biomarkers, which are crucial for early disease detection and prognosis.

Previous research suggested the link between HIWI overexpression and gastric cancer development, yet the precise role of piRNAs in this context remains a mystery [43]. By integrating our model's predictive outcomes with rigorous validation using techniques like HIWI-RIP-Seq, we gained a new understanding of the piRNA landscape within gastric cancer cells. Our results revealed that in these cancerous cells, overexpressed HIWI, under the guidance of piRNAs, participates in the post-transcriptional cleavage of mRNAs, especially those histone chaperones associated with nucleosome function. The dysregulation of these chaperones can lead to changes in chromatin structure, thereby influencing gene expression patterns that are often aberrant in cancer cells. Our findings, coupled with the existing literature, suggest that the piRNAs we have identified may be integral to the regulatory mechanisms governing nucleosome assembly in gastric cancer, potentially impacting the progression and prognosis of the disease [45]. This enrichment of nucleosome assembly pathways in our gene set analysis provides a compelling rationale for further exploration of intricate interplay between piRNA and mRNAs in gastric cancer, particularly in the context of nucleosome dynamics and the influence of histone chaperone.

However, it is important to acknowledge that our study has limitations. The scarcity of high-quality, comprehensive mammalian datasets for piRNA-mRNA interactions remains a challenge, which may impact the performance of our model. Despite these limitations, our work represents an important step forward in the understanding of piRNA biology and its potential applications in disease research by providing a novel tool to this field. By continuing to refine our model and explore its applications in other contexts, we hope to uncover new insights into the complex networks of piRNAs and their roles in health and disease.

Conclusions

In summary, we developed a model for predicting piRNA targets. This model utilizes deep learning sub-networks, which are specifically designed for the intricate targeting patterns of piRNA that have been elucidated through extensive prior experimentation. Additionally, we employed a pre-training and fine-tuning transfer learning approach to mitigate the insufficient mammalian training dataset. The application of this model to explore the potential role of piRNA in gastric cancer cells, revealing their possible involvement in nucleosome-related processes.

Availability of data and materials

The original data utilized in this study can be sourced from the supplementary files from the research projects detailed in Table 1. The positive and negative datasets, which were either sorted from the original data or generated through our simulation as detailed in the Methods section, as well as the code for our deep learning model and the corresponding results, are all available for download at the following URL: https://github.com/SofiaTianjiaoZhang/piRNATarget.

Abbreviations

piRNA:: Piwi-interacting RNA
miRNA:: MicroRNA
C. elegans :: Caenorhabditis elegans
MIWI:: Murine homolog of PIWI
5′RACE-seq:: 5′ Rapid amplification of cDNA ends sequencing
CLIP-seq:: Cross-linking immunoprecipitation sequencing
CLASH:: Cross-linking, ligation and sequencing of hybrids
CNN:: Convolutional neural network
SE block:: Squeeze-and-excitation block
FFN block:: Position-wise feed-forward network block
AUC:: Area under the curve
HIWI:: Human homolog of PIWI
RIP-seq:: RNA immunoprecipitation sequencing

References

Wang X, Ramat A, Simonelig M, Liu M-F. Emerging roles and functional mechanisms of PIWI-interacting RNAs. Nat Rev Mol Cell Biol. 2023;24(2):123–41.
Article PubMed CAS Google Scholar
Thomson T, Lin H. The biogenesis and function of PIWI proteins and piRNAs: progress and prospect. Annu Rev Cell Dev. 2009;25:355–76.
Article CAS Google Scholar
Siomi MC, Sato K, Pezic D, Aravin AA. PIWI-interacting small RNAs: the vanguard of genome defence. Nat Rev Mol Cell Biol. 2011;12(4):246–58.
Article PubMed CAS Google Scholar
Parhad SS, Tu S, Weng Z, Theurkauf WE. Adaptive evolution leads to cross-species incompatibility in the piRNA transposon silencing machinery. Dev Cell. 2017;43(1):60–70.
Article PubMed PubMed Central CAS Google Scholar
Peng JC, Lin H. Beyond transposons: the epigenetic and somatic functions of the Piwi-piRNA mechanism. Curr Opin Cell Biol. 2013;25(2):190–4.
Article PubMed PubMed Central CAS Google Scholar
Goh WSS, Falciatori I, Tam OH, Burgess R, Meikar O, Kotaja N, Hammell M, Hannon GJ. piRNA-directed cleavage of meiotic transcripts regulates spermatogenesis. Genes Dev. 2015;29(10):1032–44.
Article PubMed PubMed Central CAS Google Scholar
Kim KW, Tang NH, Andrusiak MG, Wu Z, Chisholm AD, Jin Y. A neuronal piRNA pathway inhibits axon regeneration in C. elegans. Neuron. 2018;97(3):511–9.
Article PubMed PubMed Central CAS Google Scholar
Barckmann B, Pierson S, Dufourt J, Papin C, Armenise C, Port F, Grentzinger T, Chambeyron S, Baronian G, Desvignes J-P. Aubergine iCLIP reveals piRNA-dependent decay of mRNAs involved in germ cell development in the early embryo. Cell Rep. 2015;12(7):1205–16.
Article PubMed PubMed Central CAS Google Scholar
Du W, Yang W, Xuan J, Gupta S, Krylov S, Ma X, Yang Q, Yang B. Reciprocal regulation of miRNAs and piRNAs in embryonic development. Cell Death Differ. 2016;23(9):1458–70.
Article PubMed PubMed Central CAS Google Scholar
Rajasethupathy P, Antonov I, Sheridan R, Frey S, Sander C, Tuschl T, Kandel ER. A role for neuronal piRNAs in the epigenetic control of memory-related synaptic plasticity. Cell. 2012;149(3):693–707.
Article PubMed PubMed Central CAS Google Scholar
Leighton LJ, Wei W, Marshall PR, Ratnu VS, Li X, Zajaczkowski EL, Spadaro PA, Khandelwal N, Kumar A, Bredy TW. Disrupting the hippocampal Piwi pathway enhances contextual fear memory in mice. Neurobiol Learn Mem. 2019;161:202–9.
Article PubMed CAS Google Scholar
Lee D, Yang H, Kim J, Brady S, Zdraljevic S, Zamanian M, Kim H. Paik Y-k, Kruglyak L, Andersen EC: the genetic basis of natural variation in a phoretic behavior. Nat Commun. 2017;8(1):273.
Article PubMed PubMed Central Google Scholar
Zhang T, Wong G. Dysregulation of human somatic piRNA expression in Parkinson’s disease subtypes and stages. Int J Mol Sci. 2022;23(5):2469.
Article PubMed PubMed Central CAS Google Scholar
Li M, Yang Y, Wang Z, Zong T, Fu X, Aung LHH, Wang K. Wang J-x, Yu T: Piwi-interacting RNAs (piRNAs) as potential biomarkers and therapeutic targets for cardiovascular diseases. Angiogenesis. 2021;24:19–34.
Article PubMed CAS Google Scholar
Zhang D, Tu S, Stubna M, Wu W-S, Huang W-C, Weng Z, Lee H-C. The piRNA targeting rules and the resistance to piRNA silencing in endogenous genes. Science. 2018;359(6375):587–92.
Article PubMed PubMed Central CAS Google Scholar
Vourekas A, Zheng Q, Alexiou P, Maragkakis M, Kirino Y, Gregory BD, Mourelatos Z. Mili and Miwi target RNA repertoire reveals piRNA biogenesis and function of Miwi in spermiogenesis. Nat Struct Mol Biol. 2012;19(8):773–81.
Article PubMed PubMed Central CAS Google Scholar
Anzelon TA, Chowdhury S, Hughes SM, Xiao Y, Lander GC, MacRae IJ. Structural basis for piRNA targeting. Nature. 2021;597(7875):285–9.
Article PubMed PubMed Central CAS Google Scholar
Bartel DP. Metazoan MicroRNAs. Cell. 2018;173(1):20–51.
Article PubMed PubMed Central CAS Google Scholar
Kim D, Sung YM, Park J, Kim S, Kim J, Park J, Ha H, Bae JY, Kim S, Baek D. General rules for functional microRNA targeting. Nat Genet. 2016;48(12):1517–26.
Article PubMed CAS Google Scholar
McEnany J, Meir Y, Wingreen NS. piRNAs of Caenorhabditis elegans broadly silence nonself sequences through functionally random targeting. Nucleic Acids Res. 2022;50(3):1416–29.
Article PubMed PubMed Central CAS Google Scholar
Reuter M, Berninger P, Chuma S, Shah H, Hosokawa M, Funaya C, Antony C, Sachidanandam R, Pillai RS. Miwi catalysis is required for piRNA amplification-independent LINE1 transposon silencing. Nature. 2011;480(7376):264–7.
Article PubMed CAS Google Scholar
Zhang P, Kang J-Y, Gou L-T, Wang J, Xue Y, Skogerboe G, Dai P, Huang D-W, Chen R, Fu X-D. MIWI and piRNA-mediated cleavage of messenger RNAs in mouse testes. Cell Res. 2015;25(2):193–207.
Article PubMed PubMed Central CAS Google Scholar
Gainetdinov I, Vega-Badillo J, Cecchini K, Bagci A, Colpan C, De D, Bailey S, Arif A, Wu P-H, MacRae IJ, et al. Relaxed targeting rules help PIWI proteins silence transposons. Nature. 2023;619(7969):394–402.
Article PubMed PubMed Central CAS Google Scholar
Li Z, Li Z, Zhang Y, Zhou L, Xu Q, Li L, Zeng L, Xue J, Niu H, Zhong J et al. Mammalian PIWI-piRNA-target complexes reveal features for broad and efficient target silencing. Nat Struct Mol Biol 2024.
Yuan J, Zhang P, Cui Y, Wang J, Skogerbø G, Huang DW, Chen R, He S. Computational identification of piRNA targets on mouse mRNAs. Bioinformatics. 2016;32(8):1170–7.
Article PubMed CAS Google Scholar
Zhang T, Chen L, Li R, Liu N, Huang X, Wong G. PIWI-interacting RNAs in human diseases: databases and computational models. Brief Bioinform. 2022;23(4):bbac217.
Article PubMed Google Scholar
Kudla G, Granneman S, Hahn D, Beggs JD, Tollervey D. Cross-linking, ligation, and sequencing of hybrids reveals RNA–RNA interactions in yeast. Proc Natl Acad Sci. 2011;108(24):10010–5.
Article PubMed PubMed Central CAS Google Scholar
Ule J, Jensen KB, Ruggiu M, Mele A, Ule A, Darnell RB. CLIP identifies Nova-regulated RNA networks in the brain. Science. 2003;302(5648):1212–5.
Article PubMed CAS Google Scholar
Yang T-H, Shiue S-C, Chen K-Y, Tseng Y-Y, Wu W-S. Identifying piRNA targets on mRNAs in C. elegans using a deep multi-head attention network. BMC Bioinform. 2021;22:1–23.
Article Google Scholar
Shen E-Z, Chen H, Ozturk AR, Tu S, Shirayama M, Tang W, Ding Y-H, Dai S-Y, Weng Z, Mello CC. Identification of piRNA binding sites reveals the argonaute regulatory landscape of the C. elegans germline. Cell. 2018;172(5):937–51.
Article PubMed PubMed Central CAS Google Scholar
De Fazio S, Bartonicek N, Di Giacomo M, Abreu-Goodger C, Sankar A, Funaya C, Antony C, Moreira PN, Enright AJ, O’Carroll D. The endonuclease activity of Mili fuels piRNA amplification that silences LINE1 elements. Nature. 2011;480(7376):259–63.
Article PubMed Google Scholar
Manakov SA, Pezic D, Marinov GK, Pastor WA, Sachidanandam R, Aravin AA. MIWI2 and MILI have differential effects on piRNA biogenesis and DNA methylation. Cell Rep. 2015;12(8):1234–43.
Article PubMed PubMed Central CAS Google Scholar
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L. Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019;32.
Kiranyaz S, Avci O, Abdeljaber O, Ince T, Gabbouj M, Inman DJ. 1D convolutional neural networks and applications: a survey. Mech Syst Signal Process. 2021;151: 107398.
Article Google Scholar
Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018;7132–7141.
Rosenkranz D, Zischler H, Gebert D. piRNAclusterDB 2.0: update and expansion of the piRNA cluster database. Nucleic Acids Res. 2021;50(D1):D259–64.
Article PubMed Central Google Scholar
Martin FJ, Amode MR, Aneja A, Austine-Orimoloye O, Azov Andrey G, Barnes I, Becker A, Bennett R, Berry A, Bhai J, et al. Ensembl 2023. Nucleic Acids Res. 2022;51(D1):D933–41.
Article PubMed Central Google Scholar
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:13033997 2013.
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37(8):907–15.
Article PubMed PubMed Central CAS Google Scholar
Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, Feng T, Zhou L, Tang W, Zhan L, et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation (Camb). 2021;2(3):100141.
PubMed CAS Google Scholar
Wu WS, Huang WC, Brown JS, Zhang D, Song X, Chen H, Tu S, Weng Z, Lee HC. pirScan: a webserver to predict piRNA targeting sites and to avoid transgene silencing in C. elegans. Nucleic Acids Res. 2018;46(W1):W43-w48.
Article PubMed PubMed Central CAS Google Scholar
Bernhart SH, Tafer H, Mückstein U, Flamm C, Stadler PF, Hofacker IL. Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol Biol. 2006;1(1):3.
Article PubMed PubMed Central Google Scholar
Shi S, Yang Z-Z, Liu S, Yang F, Lin H. PIWIL1 promotes gastric cancer via a piRNA-independent mechanism. Proc Natl Acad Sci. 2020;117(36):22390–401.
Article PubMed PubMed Central CAS Google Scholar
Burgess RJ, Zhang Z. Histone chaperones in nucleosome assembly and human disease. Nat Struct Mol Biol. 2013;20(1):14–22.
Article PubMed PubMed Central CAS Google Scholar
Zhao Z, Cai Z, Jiang T, Han J, Zhang B. Histone chaperones and digestive cancer: a review of the literature. Cancers. 2022;14(22):5584.
Article PubMed PubMed Central CAS Google Scholar

Download references

Acknowledgements

We express our gratitude to researchers who have shared their data online. This project benefits from the computing resources from the Big Data Center for Biomedical Research in Wuyi University.

Funding

This study was supported by the grant 508170020342 from Wuyi University. This research was supported in part by the Faculty of Health Sciences, University of Macau.

Author information

Tianjiao Zhang and Liang Chen have contributed equally to this work.

Authors and Affiliations

School of Pharmacy and Food Engineering, Wuyi University, Jiangmen, 529020, China
Tianjiao Zhang
Department of Computer Science and Technology, College of Mathematics and Computer, Shantou University, Shantou, 515821, China
Liang Chen
Department of Statistics and Data Science, School of Economics, Jinan University, Guangzhou, 510632, China
Haibin Zhu
Faculty of Health Sciences, University of Macau, Taipa, 999078, Macau SAR, China
Garry Wong

Authors

Tianjiao Zhang
View author publications
You can also search for this author inPubMed Google Scholar
Liang Chen
View author publications
You can also search for this author inPubMed Google Scholar
Haibin Zhu
View author publications
You can also search for this author inPubMed Google Scholar
Garry Wong
View author publications
You can also search for this author inPubMed Google Scholar

Contributions

TJZ, LC and GW conceived the research topic. TJZ and HBZ designed and implemented the method. TJZ and GW wrote the manuscript. TJZ and LC performed the evaluation. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Tianjiao Zhang or Garry Wong.

Ethics declarations

Ethics approval and consent to participate

No applicable.

Consent for publication

The authors declare that they consent to publication.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Zhang, T., Chen, L., Zhu, H. et al. Mammalian piRNA target prediction using a hierarchical attention model. BMC Bioinformatics 26, 50 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12859-025-06068-6

Download citation

Received: 05 September 2024
Accepted: 29 January 2025
Published: 11 February 2025
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12859-025-06068-6

Mammalian piRNA target prediction using a hierarchical attention model