- Research
- Open access
- Published:
Shrinkage estimation of gene interaction networks in single-cell RNA sequencing data
BMC Bioinformatics volume 25, Article number: 339 (2024)
Abstract
Background
Gene interaction networks are graphs in which nodes represent genes and edges represent functional interactions between them. These interactions can be at multiple levels, for instance, gene regulation, protein-protein interaction, or metabolic pathways. To analyse gene interaction networks at a large scale, gene co-expression network analysis is often applied on high-throughput gene expression data such as RNA sequencing data. With the advance in sequencing technology, expression of genes can be measured in individual cells. Single-cell RNA sequencing (scRNAseq) provides insights of cellular development, differentiation and characteristics at the transcriptomic level. High sparsity and high-dimensional data structures pose challenges in scRNAseq data analysis.
Results
In this study, a sparse inverse covariance matrix estimation framework for scRNAseq data is developed to capture direct functional interactions between genes. Comparative analyses highlight high performance and fast computation of Stein-type shrinkage in high-dimensional data using simulated scRNAseq data. Data transformation approaches also show improvement in performance of shrinkage methods in non-Gaussian distributed data. Zero-inflated modelling of scRNAseq data based on a negative binomial distribution enhances shrinkage performance in zero-inflated data without interference on non zero-inflated count data.
Conclusion
The proposed framework broadens application of graphical model in scRNAseq analysis with flexibility in sparsity of count data resulting from dropout events, high performance, and fast computational time. Implementation of the framework is in a reproducible Snakemake workflow https://github.com/calathea24/ZINBGraphicalModel and R package ZINBStein https://github.com/calathea24/ZINBStein.
Introduction
In a biological system, specific cellular activities or functions are often carried out by interactions between genes and their products. These interactions form dynamic, complex networks which can result in changes in higher-level biological systems such as tissues and organs. Constructing and characterising these gene interaction networks are of paramount importance for guiding experimental designs, biomarker identification in diagnostic and prognostic research, target in drug discovery and development, and understanding the biological processes in an organism [1].
In gene network analysis, nodes represent genes or gene products and edges represent pair-wise relationship between them. Data from high-throughput experiments such as RNA sequencing (RNAseq) are often used to construct global gene co-expression networks which might provide insights of gene interaction networks [2]. RNA sequencing uses next-generation sequencing to quantify the amount of RNA molecules in a biological sample and reveal differences in gene expression between different samples [3]. The expression value of each gene in RNAseq data represents average gene expression of a large population of cells [4]. With advances in sequencing technologies, gene expression data are able to be extracted from individual cells in specific cell types [5]. The power and potential of single-cell RNA sequencing (scRNAseq) have been demonstrated in cell development and differentiation research [6, 7]. Computational methods to harness this potential are still in their infancy as methods developed for bulk RNA-seq data are often applied [3].
Differences in properties of scRNAseq data from bulk RNAseq data have been highlighted, including the potential for dropout events which introduce technical noise in the data, stronger overdispersion between individual cells, heterogeneity of sequenced cells due to presence of different cell populations, and different cell states and cell cycles [4]. These differences can hinder performance of existing computational tools for sequencing data analysis [3]. Therefore, computational methods specific for scRNAseq data with high performance are essential to facilitate applications of single-cell RNA sequencing in research.
In scRNAseq data analysis, the term dropout is used to describe the prevalence of excessive zero counts [8]. Dropout events are defined as the situation when expression of a gene is detected in some cells but absent in other cells of the same cell types in the same experiment [9]. Multiple factors are suggested to cause dropout events such as failure in capturing and amplifying mRNAs or limitation in sequencing depth [10]. For example, in a quantitative assessment of scRNAseq methods, a low-expressed gene TERT was only detected by quantitative polymerase chain reaction (qPCR) compared to scRNAseq [5]. Stochastic gene expression or existence of a new underlying cell subpopulation are also potential dropout-event factors [11, 12]. In the current literature, there are debates about whether dropout events occur in scRNAseq protocols using unique molecular identifiers (UMI) or whether the usage of zero-inflated modelling in scRNAseq data analysis improves performance [10, 13].
Covariance or correlation matrices from Pearson or Spearman correlation tests are used widely to generate connectivity graphs of gene co-expression. One main limitation is correlation analysis not only captures pairwise correlation between genes but also associations contributed by third-party or global effects [14]. To overcome this limit, graphical models can be applied to measure the direct interaction between genes or variables. A class of these, called Gaussian Graphical Models, make use of the partial correlation matrix, which can be calculated from the inverse covariance (precision) matrix. However, when the number of samples is comparable or smaller than the number of variables, the sample covariance matrix inherits a lot of estimation errors [15]. To solve this issue, shrinkage approaches can be adopted in which the precision matrix is assumed to be sparse.
In this study, we build on the existing Stein-type shrinkage approach of [16] by integrating a novel zero-inflated negative binomial mixture modelling approach to account for dropout in the data. We also use benchmarking approaches to identify the optimal data-transformation scheme for scRNAseq counts, and use this to construct a workflow for network inference from scRNAseq data. The main contribution of our work is the development of this workflow tailored to scRNAseq data which integrates our novel approach to accounting for dropout in the data. In our results we compare the performance and computational time of Stein-type shrinkage methods to Lasso-type shrinkage methods using simulated scRNAseq data. Finally, our suggested workflow of zero-inflated Stein-type shrinkage is applied in experimental scRNAseq data of Schizosaccharomyces pombe, Saccharomyces cerevisiae, Plasmodium falciparum and Mus musculus.
Methods
Overview
Graphical models are graphs in which nodes are random variables and the presence of edges represent conditional dependence or “direct links” between them [17]. Two random variables, X and Y, are considered to be independent conditionally on variable Z if their conditional probability meets:
Gaussian Graphical Models (GGMs) are a popular form of undirected graphical model in which a random variable X follows a multivariate Gaussian distribution \({\mathcal {N}}(0,\Sigma )\). The covariance matrix \(\Sigma\) is unknown and nonsingular [17].
A GGM is represented by a partial correlation matrix P, as zero values in P correspond to conditional independence between variables [18]. Partial correlation coefficients measure linear relationships between pairs of random variables (i and j) which corrects for the effect from other variables (conditional dependence) [19]. The partial correlation coefficient \(\rho _{ij}\) can be calculated based on \(\Omega\), the inverse of covariance matrix \(\Sigma\):
In practice, the most common covariance estimator is the sample covariance matrix. However, when the number of features (p) exceeds the number of samples (n), the sample covariance matrix becomes singular and non-invertible [20]. In this situation, shrinkage techniques are often adopted to generate an invertible estimated covariance matrix for calculating the partial correlation matrix.
There are two common approaches for estimation of sparse partial correlation matrices, using \(L_1\)/lasso regularization in a maximum likelihood framework (Lasso-type) and linear shrinkage to shrink the sample covariance matrix towards a structured target matrix (Stein-type). In Lasso-type shrinkage, let S be the sample covariance matrix, and \(\Theta\) be the estimated inverse covariance matrix, then the penalized log-likelihood function to maximize is [21]:
where \(\lambda\) is the amount of shrinkage applied and specified by users.
The choice of regularization parameter \(\lambda\) is critical in Lasso-type shrinkage as different \(\lambda\) values lead to different sparsity in the resulting graphs and therefore the scientific conclusions drawn. Multiple methods and algorithms have been developed to choose \(\lambda\) which includes Akaike information criterion (AIC) and Bayesian information criterion (BIC) as standard methods for low-dimensional data. In high-dimensional data when number of features is relatively large compared to the sample size, stability approaches to regularization selection (StARS) or rotation information criterion (RIC) algorithms can be used [22, 23].
On the other hand, Stein-type shrinkage comprises three components, an empirical covariance matrix S, a shrinkage target matrix T and a shrinkage constant \(\alpha\). The estimated covariance matrix is computed through a linear combination [15]:
The identity matrix is often chosen as target matrix, which leads to the shrinkage of the off-diagonal covariance coefficients towards zero, as the identity matrix is only non-zero on the diagonal of the matrix.
Similarly to Lasso-type shrinkage, the shrinkage intensity \(\alpha\) controls the intensity of shrinkage. However, the value of \(\alpha\) is not chosen by the user or selected using a grid search over possible values, but is derived directly by minimising a loss function which calculates the distance between the estimated covariance matrix and the population covariance matrix for a given value of \(\alpha\) [24]. The equations for the optimal values of \(\alpha\) are given in Table 1.
In this study, Lasso-type shrinkage is implemented with the graphical lasso (glasso) [25] and Meinshausen–Buhlmann (mb) [26] algorithms, using the huge R package [27]. Stein-type shrinkage is applied using identity matrix as target matrix and shrinkage intensity (\(\alpha\)) is calculated based on results of the studies [16, 28] which are named as GeneNet and Fisher2011 algorithms, respectively (Table 1, Supplementary Figure S1, S2, S3 & S4).
Data transformation and p-value calculation for stein-type shrinkage workflows
In Gaussian graphical models, normality is a standard assumption about distribution of count data [29]. However, RNAseq data has specific properties such as skewness, extreme values, mean-variance dependency which deviates from the normal distribution assumption [30, 31]. Furthermore, scRNAseq can potentially deviate from a normal distribution due to other factors such as mixtures of cell subpopulations or ambient contamination. To alleviate these effects during scRNAseq data analysis, other developed methods such as cell clustering [32] to separate cells into roughly homogeneous groups to be analysed individually, and cell outlier detection can be applied in a data pre-processing step to ensure the data are amenable to graphical modelling techniques.
To improve the performance of shrinkage methods, data transformation approaches are recommended [30]. Three data transformation methods are compared in terms of enhancing performance of Stein-type shrinkage. We consider log transformation as \(\log (count+1)\), which is highly used in biological data analysis to convert skew distribution towards normally distributed data [30]. We also apply a semiparametric Gaussian copula or nonparanormal transformation, which was developed for high-dimensional graphical modelling by using a Gaussian copula with Winsorized truncation [33]. This algorithm is implemented using huge.npn function from huge R package [27]. Lastly, we also consider the empirical copula \(\Phi ^{-1}(\,ECDF(\,count)\,)\,\) which is similar to the nonparanormal approach and includes a Gaussian copula based on the empirical cumulative distribution of the data without truncation.
Unlike Lasso-type shrinkage methods, Stein-type shrinkage returns an estimated partial correlation matrix which requires significance testing to identify edges and generate a final graph. Without significance testing there is no way to identify non-zero entries of the partial correlation matrix from the estimated matrix, other than using heuristics based on the resulting graph structure, or by setting arbitrary thresholds, both of which could lead to large numbers of false positive predictions, especially for noisy data. Generally, partial correlation coefficients are treated as Pearson’s correlation coefficients and tested using t-statistics [34, 35]. This has been applied in the GeneNet R package for Stein-type shrinkage. To take into account the effect of shrinkage and its intensity on partial correlation matrix, Bernal et al. proposed two new probability densities for significance test which are referred to as shrunkv1 and shrunkv2 in this study [19, 35].
Modifications in shrinkage frameworks for zero-inflated counts
To fully exploit the potential of scRNAseq data, one main challenge to overcome is the presence of excessive zeros in count data due to dropout events. Dropout events in scRNAseq data are defined as the situation where the expression of a gene is detected in some cells but absent in other cells of the same cell type [9]. Therefore, to reduce estimation errors in the sample covariance matrix and stratify biological zeros from dropout zero counts, we propose to use zero-inflated z score calculation based on zero-inflated negative binomial (ZINB) modelling of the count data in shrinkage framework. Details of the process is illustrated in Supplementary Figure S5.
In the first step, a ZINB model is fitted for expression counts of each gene j in cell i (\(Y_{ij}\)) by the L-BFGS-B algorithm using the optim R function to derive maximum likelihood estimates of parameters of the model:
where \(\mu\) is mean, s is vector of constant scaling factors (see Supplementary Figure S5) for each cell and r is the size parameter of the negative binomial distribution. Parameter \(w_j\) balances negative binomial model accounting for true expression counts and delta function. Importantly, the mean of the negative binomial distribution in our zero-inflated negative binomial modelling step is the product of parameters \(\mu\) and scaling factors s. Scaling factors are commonly used in data normalization for sequencing data to account for differences in sequencing depths between samples. They are calculated using the scran R package [36].
For count of gene j in cell i, the delta function returns 0 value for all non-zero counts:
Next, a non-detection rate (\(d_{ij}\)) is calculated based on the optimized ZINB model of each gene. For each count of gene j in cell i:
To select zero counts in the data resulting from dropout events, a threshold t with default value 0.5 is applied. After zero counts potentially resulting from dropout events are identified, a nonparanormal transformation is applied to the count data [33].
Zero-inflated z-scores are then calculated in which z-scores of zero counts identified as dropout events are set to 0. To calculate zero-inflated z-scores of nonparanormal transformed data X of gene j in cell i we use:
Our proposed methodology, referred to below as ZIGeneNet, fits the ZINB mixture model given above to each gene, and then uses nonparanormal transformation of the data and the derived non-detection rates to calculate the zero-inflated z-scores \(Z_{ij}\) of each count which are then used in empirical covariance matrix S calculation of a Stein-type shrinkage approach. The shrinkage coefficient \(\alpha\) is chosen using the methodology of [16]. Significant edges are detected using correlation test statistics from fdrtool R package [34]. The workflow is outlined in figure 1.
Zero-inflated covariance matrix shrinkage workflow. To account for presence of excessive zero counts in scRNAseq data, UMI count data is transformed into zero-inflated z scores for partial correlation matrix estimation. The zero-inflated z-score calculation step can be integrated in all shrinkage workflows, whereby a zero-inflated negative binomial is fitted to stratify zero counts from dropout event and “biological” zero counts. After the fitting, a non-detection rate (\(d_{ij}\)) is estimated in each cell for each gene. Thresholding (default t=0.5) is applied in which counts with \(d_{ij}\ge t\) are considered missing or zero-inflated values. Z scores are calculated for all counts and scores of zero-inflated values are set to 0
Performance evaluation metrics
As gene regulatory networks are sparse relative to the total possible number of edges, the task of gene regulatory network inference has a large class imbalance, with the negative class (non-edges) being much larger than the positive (edges). As a result, for the evaluation of the network inference schemes on simulated data, the Matthews correlation coefficient (MCC) is used, which is a contingency matrix method to calculate the Pearson product-moment correlation coefficient between prediction and reference values [37]. The MCC score is able to take into account the class imbalance and so gives a better measure of performance than metrics like accuracy or ROC curves. The MCC score is calculated as follows:
where TP is the count of true positive predictions, TN is the count of true negative predictions, FP represents the count of false positives, and FN the number of false negative predictions. The MCC ranges from \(-1\) to 1, which indicates the worst and perfect classification respectively. The MCC is commonly applied in binary classification evaluation for imbalanced data [37].
When benchmarking gene network inference methods it is common to rely only on synthetic data, as there is almost never a known ground truth network against which to benchmark networks inferred from experimental data. However to illustrate that the approach we propose is able to predict edges that have some supporting experimental evidence, we choose to report the numbers of predicted edges that exist in the databases listed below, both in raw figures and as positive predictive value (PPV)..
In the experimental data analysis, gene interaction databases (IntAct, STRING, BioGRID) are used as references [38,39,40]. In terms of gene regulatory databases, information of interactions between transcription factors(TFs) and their targets is collected from Pombase (Schizosaccharomyces pombe) [41], YEASTRACT (Saccharomyces cerevisiae) [42], and TFLink (Mus musculus) [43]. An integrated database of TFs and their target for Plasmodium falciparum is unavailable and is not included in the experimental data analysis step.
However it is important to note that the interactions in these databases are collected from multiple different experimental conditions, that are unlikely to occur simultaneously. For instance, interactions of transcription factors to initiate transcription of genes in stress response cannot be expected to be present in cells cultured in growing medium with no stress factors. Therefore, as it is not possible to reliably identify false negative edge predictions in this scenario, we only consider precision or PPV. This measures the fraction of predicted edges in the inferred network that have corresponding experimental evidence of their existence, and does not consider false negatives. PPV is calculated as follows:
Results and discussion
Simulating scrnaseq data from ESCO model
Simulated data for scRNAseq was generated using an algorithm based on the ESCO simulation tool [44]. A simplified simulation version with incorporated partial correlation matrix is illustrated in Supplementary Figure S5. The simplified scRNAseq model is for single cell group. Genes in the gene network are simulated to be expressed in at least 60% of cells. This matches guidelines in experimental scRNAseq data analysis in which genes not expressed in a certain number of cells are excluded for downstream analysis [45]. To simulate regulatory interactions between genes, a correlation matrix is calculated based on a partial correlation matrix provided as an input. Simulation of zero inflation in ESCO is based on mean of gene expression in which genes with higher expression mean are less likely to have zero-inflated counts [44]. This is consistent with current reports about relation between dropout event and level of gene expression [9].
Data transformation and p-value estimation approaches in stein-type shrinkage
Comparative analysis of different approaches in data transformation and p-value calculations on partial correlation matrix. A and B, Three transformation methods were applied on simulated scRNAseq data with p = 200, n = 200, sequencing depth = 50,000 reads and partial correlation matrix was estimated by Fisher2011 algorithm. Compared to no transformation (NoTrans), performance of Fisher2011 was improved most by nonparanormal approach (Nonparanormal) from [33] and log transformation (LogTrans). C Different models to calculate p-value from shrunk partial correlation matrix were applied on simulated data (p = 200, sequencing depth = 50,000 reads, GeneNet shrinkage). Models used were t-statistics, Monte Carlo method (MonteCarlo), and shrunk probability density models for shrunk partial correlation matrices (shrunkv1 and shrunkv2 from [19] and [35], respectively)
Firstly, different methods of data transformation (pre-shrinkage step) and p-value calculation (post-shrinkage step) are explored to construct an optimal Stein-type shrinkage workflow which can be used to implement in gene network analysis of scRNAseq data. Results of the comparison between three transformation methods we consider, versus no transformation in a Stein-type shrinkage workflow when the number of genes and cells are equal to 200 and sequencing depth is 50000 reads in scRNAseq data simulation are highlighted in Fig. 2A&B. The nonparanormal approach outperforms and increases performance of Fisher2011 from MCC \(\simeq\) 0.346 to MCC \(\simeq\) 0.681 while MCC value is improved from 0.642 to 0.683 in GeneNet case. The effect of the three data transformation methods on scRNAseq data is visualized as histograms in Supplementary figure S6 & S7. The strong assumption of normal distribution in Fisher2011 is re-highlighted by the larger difference in performance of Fisher2011 with and without data transformation compared to GeneNet.
Secondly, performance of shrunkv1 and shrunkv2 p-value tests are compared against a significance test using t-statistics and Monte Carlo p-value calculation in scRNAseq simulated data with 200 genes. Ranging from n=60 to n=200, the t-statistics method performs best in all tested methods using GeneNet shrinkage algorithm (Fig. 2C). Therefore, nonparanormal and t-statistics approaches are used in data transformation and p-value calculation steps of the optimised Stein-type shrinkage workflow.
Performance comparison of shrinkage methods on non-zero-inflated data
Performance of Stein-type versus Lasso-type shrinkage estimators in scRNAseq simulated data. Different sequencing depths (25,000, 50,000, 500,000 reads) were simulated in the study (iterations: 50). Used methods: GeneNet [16], Fisher2011 [28], mb - Meinshausen-Buhlmann algorithm [26], glasso - graphical lasso algorithm [25]
Performance of shrinkage algorithms are then analysed on simulated scRNAseq data, applying a nonparanormal transformation. Matthews correlation cofficient (MCC) is used for evaluation. Different sequencing depth (25,000, 50,000, 500,000 reads) are simulated and performance results are summarized in Fig. 3. We find that for low numbers of samples (n = 60 to 200) and high sequencing depth (500 000 reads), Stein-type algorithms have better performance than Lasso-type algorithms. Noticeably, Fisher2011 has slightly lower performance than GeneNet in case of 25 000 and 50 000 depths.
When designing single-cell RNA sequencing experiments, two main considerations are the number of cells and sequencing depth to sequence individual cells [46]. Increasing sequencing depth can potentially provide more information of gene transcription and reduce technical noise [47]. Interestingly, in simulation, when the number of cells is above 1000, performance of all methods approaches plateau, for example, MCC \(\ge\) 0.8 at sequencing depth of 500 000 reads. To ensure the application of graphical models when the number of cells is smaller than the number of analysed genes, high-dimensional data where number of features is larger than number of observations are main focus of this study [48]. Performance of Stein-type shrinkage is also considered in another high-dimensional setting when p = 2000 and n = 20 to 200 (Supplementary Figure S8). Compared to smaller network (p=200), Stein-type shrinkage approaches retain their performance in larger network with p = 2000. At n = 100 in scRNAseq simulation, MCC of both networks (p=200 and p=2000) are around 0.5 and increases towards 0.7 when n = 200.
Regarding computational cost, computing time of each algorithms is measured using workflows from Supplementary Figure S1 & S2 without a data transformation step on normally distributed data. The simulated data is constructed so that the number of cells is half of the number of genes. Results of average computational time after 10 iterations are shown in Table 2. These results highlight that Stein-type shrinkage methods can perform over 1000 times faster the Lasso-type approaches in some cases.
Zero-inflated approach for stein-type sparse inverse covariance estimation
Presence of zero counts in experimental scRNAseq data of cells from the same cell type is highlighted in Supplementary Figure S9 and S10 in which gene mean is calculated based on observed/non-zero counts. The plots illustrate that genes with the same gene mean have different proportion of zero counts. This suggests that there might be a subset of genes being affected by dropouts during single-cell experiments.
To consider the impact of modelling zero-inflation of counts for genes that are not affected by dropout, performance of the zero-inflated Stein-type shrinkage framework is tested in simulated scRNAseq data with and without zero inflation (Fig. 5 & S11). Using GeneNet and Fisher2011 with 200 genes and 100 cells, there is no significant difference in performance of ZI-integrated and non-ZI shrinkage frameworks (p-values of t-test: 0.072 and 0.443, respectively).
The zero-inflated integrated Stein-type shrinkage framework is also compared to an existing graphical model method for sparse gene co-expression network analysis of scRNAseq, scLink [49]. In summary, scLink transforms count data, and uses a Gamma-Normal mixture distribution model to detect missing-value zero counts. Pearson correlation coefficients are calculated based on high-confidence counts and used as input for Lasso-type shrinkage. The Bayesian Information Criterion (BIC) is recommended to choose the value of the regularization parameters [49]. We apply scLink using the built-in functions for data transformation and network estimation (scLink R package), and the optimal graph is chosen by selecting the smallest BIC value. Compared to our Stein-type shrinkage approach, scLink differs mainly in the zero-inflated modelling step (Gamma-Normal versus Negative binomial mixture model), and sparse inverse covariance matrix estimation step (Lasso-type versus Stein-type shrinkage estimator). Performance of scLink versus the proposed ZI shrinkage model is shown in figure 4 using ZI simulated scRNAseq data. In all the cases, GeneNet and Fisher2011 shrinkage framework taking into account zero-inflation outperforms non-ZI shrinkage workflows and scLink.
Implementation in gene network analysis of experimental scrnaseq data
To validate interaction networks estimated from the proposed zero-inflated covariance shrinkage framework, databases curating experimental evidence of gene regulation and protein interactions from fission yeasts Schizosaccharomyces pombe, budding yeasts Saccharomyces cerevisiae, malaria parasite Plasmodium falciparum and house mice Mus musculus are used. We reiterate here that the results presented here are only intended to demonstrate how well the methods considered can recover known edges present in databases, as it is generally not possible to derive a ground truth network for a given experimental scRNAseq dataset.
For gene regulatory networks, databases of transcription factors and their targets are utilised except for Plasmodium falciparum [41,42,43]. For protein-protein interaction networks, IntAct, STRING and BioGRID are used to analysed reported interactions of gene products [38,39,40]. For the STRING database, only interactions having evidences in experiments and databases are considered as reported interactions.
For experimental scRNAseq data, we use high-quality count data of 108 S.pombe cells (strain 972 h-) [50], 127 wild-type S.cerevisiae cells (strain BY4741) [51], 1064 cells of ring-stage P.falciparum (strain DCJ) [52] and 8715 mature oligodendrocytes (MOL1 cluster) of M.musculus [53] are used to contruct gene regulatory networks. Genes with expression over 50 cells in each scRNAseq data set and present in interaction networks of each database are selected. Precision, or positive predictive value (PPV) is adopted as the performance metric in experimental data analysis as only a fraction of reported interactions in database are expected to be present.
In the experimental data analysis, ZI (ZIGeneNet) and non-ZI GeneNet (GeneNet) shrinkage workflows are compared with scLink and Pearson correlation (Pearson). Pearson correlation is widely used to analyse and construct gene co-expression network [54]. The Pearson correlation matrix is calculated directly from count matrix using rcorr function from Hmisc R package [55]. Pearson correlation coefficients with Benjamini-Hochberg adjusted p-values smaller than 0.01 are selected for final graph.
The number of estimated edges which are reported in each database and total estimated edges are illustrated in table 3. Mostly, ZIGeneNet estimates fewer edges than scLink and PearCorr except on the analysis of the S.Pombe data [50]. In the S.pombe and S.cerevisiae scRNAseq data, the sample covariance matrices are singular as the number of genes is larger than the number of cells (570 genes versus 108 cells in S.pombe data and 2608 genes versus 127 cells in S.cerevisiae data). In these cases, Stein-type shrinkage approaches including ZIGeneNet and GeneNet have higher precision compared to scLink and Pearson methods. Specifically, around 20% of estimated edges from ZIGeneNet and GeneNet have been confirmed by other experiments in the database for S.cerevisiae, while positive predictive values are 4.2% and 6.2% using scLink and Pearson methods, respectively. However the scLink and Pearson methods predict more edges, so may be preferred if the goal is detecting as many edges as possible with less concern for precision. In the P.falciparum scRNAseq dataset, the number of cells is relatively large compared to the number of genes (1064 cells versus 1010 genes), and performances of all methods are similar, with approximately 15% precision.
To demonstrate the potential of graphical models in large-scale network inference, we apply Stein-type shrinkage workflows, scLink and Pearson correlation matrix in scRNAseq data from the mouse brain. In the study of [53], harvested brain cells are clustered and annotated using reported biomarkers. To reduce the heterogeneity in cells, network estimation methods were applied on MOL1 cluster which is a subtype of mature oligodendrocytes and has been validated from a different study [56]. Compared to scLink and Pearson correlation methods, Stein-type shrinkage workflows (ZIGeneNet and GeneNet) outperform in term of precision, although scLink and Pearson correlation predict a larger number of edges.
To demonstrate the impact of covariance matrix shrinkage on network sparsity, the four network inference methods (ZIGeneNet, GeneNet, scLink, and Pearson) were applied to 200 highly variable genes (HVGs) from the S.cerevisiae scRNAseq data of [51]. The result is shown in table S1 in which ZIGeneNet has highest precision (\(\simeq\) 30%) compared to 12% precision in other methods. Estimated networks of this analysis from ZIGeneNet and Pearson were visualized in figure 6 for comparison. It can be seen that the network inferred by ZIGeneNet is considerably sparser that that of the Pearson correlation approach.
Estimated networks from zero-inflated covariance shrinkage workflow (ZIGeneNet) versus Pearson correlation workflow with coefficients (r) over 0.6 on 200 highly variable genes of S.cerevisiae scRNAseq data. Edges reported in databases are highlighted in red for transcription factor and gene target interactions, and in blue for protein-protein interactions
Conclusions
Gene regulatory network inference algorithms and allow the extraction of network information from single-cell gene expression data that can be used to further our understanding of molecular networks and potentially act as a screening step to select potential proteins for downstream analysis.
We demonstrate that our proposed approach that uses a network inference method specific to scRNAseq data gives higher performance on simulated data, and when applied to experimental scRNAseq data recovers more interactions that have supporting evidence of their existence present in the databases used, than those that do not explicitly model zero-inflation. We also show that by using Stein type shrinkage and a negative binomial model of scRNAseq data that takes into account zero-inflation of counts, we can improve performance over the previously developed scLink methodology.
Due to the computational speed of Stein type shrinkage approaches, in future work we will explore the possibility of performing large numbers of local network inference tasks in scRNAseq data, to make this approach applicable to data where there are potentially continuous trajectories of cell types present, such as in developmental studies [57]. We also plan to investigate the possibility of including prior information in the target matrices of the shrinkage schemes proposed, so that known edges from databases can be integrated into the analysis. The Stein-type shrinkage method can also integrate with other sequencing analysis methods which can extend its application to heterogeneous samples when multiple cell types exists. For instance, in single-cell sequencing study of malaria parasites, cell clustering shows distinct cell clusters consistently with distinct stages of cell collection [52]. With the ability to recover different cell groups in biological samples, cell clustering can facilitate Stein-type shrinkage to implement local network inference analysis in one cluster in which most of cells are from the same cell group.
Availability of data and materials
The simulation and benchmarking code used to prepare the manuscript is available as a Snakemake workflow https://github.com/calathea24/ZINBGraphicalModel, and an R package implementing the method is also available https://github.com/calathea24/ZINBStein. Data used are publicly available from the Gene Expression Omnibus (GSE122392), NCBI Sequence Read Archive (SRP116718, SRP135960) and ArrayExpress (E-MTAB-6825).
References
Emmert-Streib F, Dehmer M, Haibe-Kains B. Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks. Front Cell Dev Biol. 2014;2:38.
Ballouz S, Verleyen W, Gillis J. Guidance for RNA-Seq co-expression network construction and analysis: safety in numbers. Bioinformatics. 2015;31(13):2123–30.
Wang T, Li B, Nelson CE, Nabavi S. Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC Bioinf. 2019;20(1):1–16.
Yu X, Abbas-Aghababazadeh F, Chen YA, Fridley BL. Statistical and bioinformatics analysis of data from bulk and single-cell RNA sequencing experiments. Transl Bioinf Therap Dev. 2021;143–75.
Wu AR, Neff NF, Kalisky T, Dalerba P, Treutlein B, Rothenberg ME, Mburu FM, Mantalas GL, Sim S, Clarke MF, et al. Quantitative assessment of single-cell RNA-sequencing methods. Nat Methods. 2014;11(1):41–6.
Saliba A-E, Westermann AJ, Gorski SA, Vogel J. Single-cell RNA-Seq: advances and future challenges. Nucleic Acids Res. 2014;42(14):8845–60.
Hedlund E, Deng Q. Single-cell RNA sequencing: technical advancements and biological applications. Mol Aspects Med. 2018;59:36–46.
Sarkar A, Stephens M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet. 2021;53(6):770–7.
Qiu P. Embracing the dropouts in single-cell RNA-Seq analysis. Nat Commun. 2020;11(1):1169.
Jiang R, Sun T, Song D, Li JJ. Statistics or biology: the zero-inflation controversy about SCRNA-Seq data. Genome Biol. 2022;23(1):1–24.
Kim JK, Kolodziejczyk AA, Ilicic T, Teichmann SA, Marioni JC. Characterizing noise structure in single-cell RNA-Seq distinguishes genuine from technical stochastic allelic expression. Nat Commun. 2015;6(1):8687.
Jindal A, Gupta P, Jayadeva, Sengupta D. Discovery of rare cells from voluminous single cell expression data. Nat Commun. 2018;9(1):4719.
Kim TH, Zhou X, Chen M. Demystifying “drop-outs’’ in single-cell UMI data. Genome Biol. 2020;21(1):196.
Huang S, Li J, Sun L, Ye J, Fleisher A, Wu T, Chen K, Reiman E, Initiative ADN, et al. Learning brain connectivity of Alzheimer’s disease by sparse inverse covariance estimation. Neuroimage. 2010;50(3):935–49.
Ledoit O, Wolf M. Honey, i shrunk the sample covariance matrix. UPF economics and business working paper 2003;(691).
Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. In: Statistical applications in genetics and molecular biology, vol. 4(1), 2005;
Giraud C. Introduction to high-dimensional statistics. New York: CRC Press; 2021.
Whittaker J. Graphical models in applied multivariate statistics. Chichester: Wiley Publishing; 2009.
Bernal V, Bischoff R, Guryev V, Grzegorczyk M, Horvatovich P. Exact hypothesis testing for shrinkage-based gaussian graphical models. Bioinformatics. 2019;35(23):5011–7.
Ledoit O, Wolf M. The power of (non-) linear shrinking: a review and guide to covariance matrix estimation. J Financ Economet. 2022;20(1):187–218.
Banerjee O, El Ghaoui L, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. J Mach Learn Res. 2008;9:485–516.
Liu H, Roeder K, Wasserman L. Stability approach to regularization selection (stars) for high dimensional graphical models. In: Advances in neural information processing systems, vol. 23, 2010;.
Lysen S. Permuted inclusion criterion: a variable selection technique. Publicly accessible Penn Dissertations, 28, 2009;
Ledoit O, Wolf M. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J Empir Financ. 2003;10(5):603–21.
Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–41.
Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso 2006.
Zhao T, Liu H, Roeder K, Lafferty J, Wasserman L. The huge package for high-dimensional undirected graph estimation in r. J Mach Learn Res. 2012;13(1):1059–62.
Fisher TJ, Sun X. Improved stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Comput Stat Data Anal. 2011;55(5):1909–18.
Mar JC. The rise of the distributions: why non-normality is important for understanding the transcriptome and beyond. Biophys Rev. 2019;11(1):89–94.
Zwiener I, Frisch B, Binder H. Transforming RNA-Seq data to improve the performance of prognostic gene signatures. PLoS ONE. 2014;9(1):85150.
Church BV, Williams HT, Mar JC. Investigating skewness to understand gene expression heterogeneity in large patient cohorts. BMC Bioinf. 2019;20(24):1–14.
Becht E, McInnes L, Healy J, Dutertre C-A, Kwok IW, Ng LG, Ginhoux F, Newell EW. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37(1):38–44.
Liu H, Lafferty J, Wasserman L. The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J Mach Learn Res. 2009;10(10)
Strimmer K. A unified approach to false discovery rate estimation. BMC Bioinf. 2008;9(1):1–14.
Bernal V, Soancatl-Aguilar V, Bulthuis J, Guryev V, Horvatovich P, Grzegorczyk M. Genenettools: tests for gaussian graphical models with shrinkage. Bioinformatics. 2022;38(22):5049–54.
Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-Seq data with bioconductor. F1000Research 2016;5.
Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):1–13.
Kerrien S, Aranda B, Breuza L, Bridge A, Broackes-Carter F, Chen C, Duesbury M, Dumousseau M, Feuermann M, Hinz U, et al. The intact molecular interaction database in 2012. Nucleic Acids Res. 2012;40(D1):841–6.
Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, Doncheva NT, Legeay M, Fang T, Bork P, et al. The string database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 2021;49(D1):605–12.
Oughtred R, Rust J, Chang C, Breitkreutz B-J, Stark C, Willems A, Boucher L, Leung G, Kolas N, Zhang F, et al. The biogrid database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 2021;30(1):187–200.
Lock A, Rutherford K, Harris MA, Hayles J, Oliver SG, Bähler J, Wood V. Pombase 2018: user-driven reimplementation of the fission yeast database provides rapid and intuitive access to diverse, interconnected information. Nucleic Acids Res. 2019;47(D1):821–7.
Teixeira MC, Monteiro PT, Palma M, Costa C, Godinho CP, Pais P, Cavalheiro M, Antunes M, Lemos A, Pedreira T, et al. Yeastract: an upgraded database for the analysis of transcription regulatory networks in saccharomyces cerevisiae. Nucleic Acids Res. 2018;46(D1):348–53.
Liska O, Bohár B, Hidas A, Korcsmáros T, Papp B, Fazekas D, Ari E. Tflink: an integrated gateway to access transcription factor-target gene interactions for multiple species. Database. 2022;2022:083.
Tian J, Wang J, Roeder K. Esco: single cell expression simulation incorporating gene co-expression. Bioinformatics. 2021;37(16):2374–81.
Luecken MD, Theis FJ. Current best practices in single-cell RNA-Seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):8746.
Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16(3):133–45.
Zhang MJ, Ntranos V, Tse D. Determining sequencing depth in a single-cell RNA-Seq experiment. Nat Commun. 2020;11(1):774.
Salehi H, Gorodetsky A, Solhmirzaei R, Jiao P. High-dimensional data analytics in civil engineering: a review on matrix and tensor decomposition. Eng Appl Artif Intell. 2023;125: 106659.
Li WV, Li Y. sclink: inferring sparse gene co-expression networks from single-cell expression data. Genomics Proteomics Bioinf. 2021;19(3):475–92.
Saint M, Bertaux F, Tang W, Sun X-M, Game L, Köferle A, Bähler J, Shahrezaei V, Marguerat S. Single-cell imaging and RNA sequencing reveal patterns of gene expression heterogeneity during fission yeast growth and adaptation. Nat Microbiol. 2019;4(3):480–91.
Nadal-Ribelles M, Islam S, Wei W, Latorre P, Nguyen M, Nadal E, Posas F, Steinmetz LM. Sensitive high-throughput single-cell rna-seq reveals within-clonal transcript correlations in yeast populations. Nat Microbiol. 2019;4(4):683–92.
Poran A, Nötzel C, Aly O, Mencia-Trinchant N, Harris CT, Guzman ML, Hassane DC, Elemento O, Kafsack BF. Single-cell RNA sequencing reveals a signature of sexual commitment in malaria parasites. Nature. 2017;551(7678):95–9.
Zeisel A, Hochgerner H, Lönnerberg P, Johnsson A, Memic F, Van Der Zwan J, Häring M, Braun E, Borm LE, La Manno G, et al. Molecular architecture of the mouse nervous system. Cell. 2018;174(4):999–1014.
Hou J, Ye X, Feng W, Zhang Q, Han Y, Liu Y, Li Y, Wei Y. Distance correlation application to gene co-expression network analysis. BMC Bioinf. 2022;23(1):1–24.
Hmisc FHJ. Harrell Miscellaneous. R package version 5.1-2, https://hbiostat.org/R/Hmisc/
Marques S, Zeisel A, Codeluppi S, Van Bruggen D, Mendanha Falcão A, Xiao L, Li H, Häring M, Hochgerner H, Romanov RA, et al. Oligodendrocyte heterogeneity in the mouse juvenile and adult central nervous system. Science. 2016;352(6291):1326–9.
Wang X, Choi D, Roeder K. Constructing local cell-specific networks from single-cell data. Proc Natl Acad Sci. 2021;118(51):2113178118.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
TT and DHTV conceptualised the study. DHTV implemented the methods and analysed data. DHTV wrote the initial manuscript draft, TT and DHTV revised and finalised the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Vo, D.H.T., Thorne, T. Shrinkage estimation of gene interaction networks in single-cell RNA sequencing data. BMC Bioinformatics 25, 339 (2024). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12859-024-05946-9
Received:
Accepted:
Published:
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12859-024-05946-9