Tisslet tissues-based learning estimation for transcriptomics

Miloudi, Ahmed; Al-Qahtani, Aisha; Hashir, Thamanna; Chikri, Mohamed; Bensmail, Halima

doi:10.1186/s12859-024-06025-9

Research
Open access
Published: 25 February 2025

Tisslet tissues-based learning estimation for transcriptomics

Ahmed Miloudi¹^na1,
Aisha Al-Qahtani²^na1,
Thamanna Hashir³,
Mohamed Chikri¹ &
…
Halima Bensmail²

BMC Bioinformatics volume 26, Article number: 65 (2025) Cite this article

649 Accesses
1 Altmetric
Metrics details

Abstract

In the context of multi-omics data analytics for various diseases, transcriptome-wide association studies leveraging genetically predicted gene expression hold promise for identifying novel regions linked to complex traits. However, existing methods for multi-tissue gene expression prediction often fail to account for tissue-tissue expression interactions, limiting their accuracy and effectiveness. This research addresses the challenge of predicting gene expression across multiple tissues by incorporating tissue-tissue expression correlations based on a nonlinear multivariate model. Our findings demonstrate that this model excels in estimating tissue-tissue interactions and accurately predicting missing data. These results have significant implications for multi-omics data analytics and transcriptome-wide association studies, suggesting a novel approach for identifying regions associated with complex traits.

Peer Review reports

Introduction

eQTL (expression Quantitative Trait Loci) studies explore the relationship between genetic variants, such as SNPs (Single Nucleotide Polymorphisms), and gene expression levels. By identifying how specific SNPs influence gene expression, researchers can gain insights into the genetic basis of complex traits and diseases. In the context of transcriptomics, which involves the comprehensive analysis of RNA transcripts, eQTL mapping is particularly valuable (Fig. 1). When considering multiple tissues, incorporating tissue-tissue correlations becomes essential, as gene expression can be influenced by interactions across different tissues. This approach allows for a more holistic understanding of gene regulation and the identification of novel regions associated with complex traits, enhancing the precision and effectiveness of transcriptome-wide association studies. Transcriptome-wide association studies (TWAS) have therefore gained prominence in the field of genomics and genetics for their potential to uncover genetic variants associated with complex traits. These studies leverage gene expression data to bridge the gap between genetic variations and phenotypic traits. Although TWAS holds great promise, the accuracy of gene expression prediction across multiple tissues remains a challenge. Previous research addressed the challenge of multi-tissue gene expression (Grinberg and Wallace, 2021) [1]

Machine learning algorithms have been used to predict eQTL regulation of a gene by analyzing adjacent genetic variants [2, 3]. Researchers have used genetically predicted gene expression models to conduct transcriptome-wide association studies (TWAS) and identify novel areas linked to complex traits. However, many of these regions lack a GWAS relationship within 1Mb.

There are various benefits to such analyses: Leveraging gene expression enriches possible trait-associated SNPs, whereas joint eQTL modeling improves overall association strength and reduces the number of tests from millions to roughly 20,000 genes.

Leveraging shared eQTLs across tissues enhances eQTL discovery and gene expression imputation accuracy, leading to more powerful transcriptome-wide association analyses [4,5,6]. Hu et al. [7] and Molstad et al. [8] developed a penalized regression strategy for collaborative modeling of eQTLs. The penalty encourages shared eQTLs between tissues. Using genotype and expression data from the Genotype-Tissue Expression (GTEx) project, multi-tissue eQTL models significantly increase gene association identification and imputation accuracy compared to single-tissue techniques.

While Hu et al.’s technique [7] does not account for tissue-specific gene expression correlation in combined eQTL modeling, Molstad et al. [8] uses cross-tissue imputation with EM algorithm. Both approach uses linear model to derive their algorithm. Recent research suggests that accounting for metric based on skewness improves variable selection and prediction accuracy and can identify regulators or genes in large patient cohorts. A recent study showed that significant correlation was detected between expression skewness and the top 500 genes corresponding to the most significant differential DNA methylation occurring in the promotor regions in TCGA [9]

Recent research suggests that accounting for tissue-tissue correlation in high-dimensional penalized multivariate response linear regression improves variable selection and prediction accuracy. This phenomena may be explained by a seemingly unrelated regression interpretation of high-dimensional sparse multivariate response linear regression. Moreover, certain tissue types are more difficult to obtain due to biological and cost constraints. Using tissue-tissue correlation with skewness assumption enhances gene expression prediction accuracy, particularly for small data numbers.

In this paper, we propose an approach (TISSLET) which not only impute missing gene expression using cross information from several tissue, but also estimate tissue-tissue correlation. We calculate a joint eQTL weights while imputing missing gene expression using a skewed normal modeling. The technique by which our approach works is straightforward. Measuring expression in one tissue can provide a reliable estimate of expression in the other, especially if their expression is substantially correlated. Ignoring tissue-tissue correlation can significantly reduce gene expression prediction accuracy. Our methodology offers several advantages over previous approaches for multi-tissue joint eQTL mapping, including:

1.
Incorporation of a full tissue-tissue correlation matrix in the model, rather than assuming a diagonal matrix, which reveals cross-tissue expression dependencies that eQTLs alone cannot explain.
2.
Efficient estimation of eQTL weights by modeling cross-tissue associations.
3.
Relaxation of the normality assumption by allowing for a heavy-tailed error distribution, assuming that errors follow a multivariate skewed distribution.

Figure 1 gives an overview of TISSLET method. This figure demonstrates the training of the imputation model. The inputs required for TISSLET is a matrix of gene expression for several tissues. We also use the subjects genotypes with sample matched measured expression. TISSLET’s weights output are calculated based on the CEM algorithm using measured gene expression and provide the weights and covariance structure of tissues. For full details, see the Materials and Methods section.

Materials and methods.

Regression models are commonly used to map the relationship between SNPs and gene expression levels in eQTL studies. Traditional models often assume normality of errors, which may not capture the true distribution of gene expression data. Introducing a skewed model with a cross-tissue expression based on SNP genotypes can provide a more accurate representation by allowing for heavy-tailed error distributions, improving the precision of eQTL mapping and the identification of genetic influences on gene expression.

Let $\textbf{x}_{i} \in {\mathbb {R}}^{p}$ represent the genotypes of p SNPs (both centered and normalized) and let $\textbf{y}_{i} \in {\mathbb {R}}^{q}$ represent the vector of centered and normalized measured expression in q tissues for the $i^{th}$ subject for a specific gene within a certain distance (e.g. 500 kb) or less away from the gene of interest. We assume that gene expression represents a realization of the random vector for the $i^{th}$ subject:

$$\begin{aligned} \textbf{y}_{i} \sim \varvec{\beta }^{t} \textbf{x}_{i} + \varvec{\epsilon }_{i}, \qquad \varvec{\epsilon }_{i} \sim {\mathbb{S}\mathbb{N}}(\varvec{0}, \varvec{\Sigma }, \varvec{\Lambda }). \end{aligned}$$

(2.1)

where ${\mathbb{S}\mathbb{N}}_{q}$ denotes the q-dimensional multivariate skewed normal distribution, where $\varvec{\beta } \in \mathbb {R}^{p \times q}$ is the unknown regression coefficient matrix (i.e., eQTL weights), and $\varvec{\Sigma ^{-1}} \in \mathbb {R}^{q \times q}$ is the cross-tissue error precision (inverse covariance) matrix. We further assume that $\varvec{\epsilon }_{i}$ is independent of $\varvec{\epsilon }_{j}$ for all $i \ne j$.

The skewed distribution ${\mathbb{S}\mathbb{N}}$ can reveal the asymmetric information when the observations such as gene expression are skewed [10]. Asymmetrical distributions have an additional shape parameter $\varvec{\Lambda } \in \mathbb {R}^{q \times q}$ to represent the direction of the asymmetry of the density. If the skewness in observations is ignored, inferences with symmetric distributions may result in biased or even misleading conclusions.

Penalized skew-normal log-likelihood

In appendix (7.2) we outline in details the derivation of the model parameters (weights and cross-correlation matrices) under the normality assumption. This approach uses the ECM algorithm for parameter estimation. Algorithm 1 gives the pseudocode of the approach using normality assumption.

Similarly, using equation (2.1) and assuming that the errors $\epsilon$’s $\overset{\mathrm {i.i.d}}{\sim } {\mathbb{S}\mathbb{N}}(\varvec{0}, \varvec{\Sigma }, \varvec{\Lambda })$, for $1 \le i \le n$, and assuming that gene expression represents a realization of the random vector for the $i^{th}$ subject:

$$\begin{aligned} \begin{aligned} \left( \textbf{y}_{i}|\varvec{\beta }, \varvec{\Lambda }, \varvec{\Sigma } \right)&\sim \mathbb {N}_{p} \left( \varvec{\beta }^{t} \textbf{x}_{i}+\varvec{\Lambda } \varvec{\tau }, \varvec{\Sigma } \right) \\ \varvec{\tau }&\sim \mathbb{T}\mathbb{N}_{p} \left( \varvec{0}, \mathbb {I}_{p} \right) \end{aligned} \end{aligned}$$

where $\textbf{x}_{i}=(\textbf{x}_{i,1}, \ldots , \textbf{x}_{i,p})$ is the vector of covariates (here $\textbf{x}_{i}$ is the $i^{th}$ individus genotype), $\varvec{\beta }^{t} = (\varvec{\beta }_{1}, \ldots , \varvec{\beta }_{p})^{t}$, is an unknown matrix of mean regression coefficient and $\mathbb{T}\mathbb{N}$ is a truncated normal distribution.

On the other hand, since gene expression matrix has random missing values, we consider that $\textbf{y}_{i}'s$ are partially observed with an arbitrary missing pattern. In order to set up estimating equations for multivariate data with possible missing values, we separate $\textbf{y}_{i}(q \times 1)$ into two components ($\textbf{y}_{i}^{o}, \textbf{y}_{i}^{m}$) accordingly, where $\textbf{y}_{i}^{o}(q_{i}^{o} \times 1)$ is the observed component and $\textbf{y}_{i}^{m}((q - q_{i}^{o}) \times 1)$ is the missing component. Further, we introduce two missingness indicator matrices, denoted by $\textbf{O}_{i}$ and $\textbf{M}_{i}$ henceforth, corresponding to $\textbf{y}_{i}$ such that $\textbf{y}_{i}^{o} = \textbf{O}_{i} \textbf{y}_{i}$ and $\textbf{y}_{i}^{m} = \textbf{M}_{i} \textbf{y}_{i}$, respectively. More specifically, $\textbf{O}_{i}(q_{i}^{o} \times q)$ and $\textbf{M}_{i}((q - q_{i}^{o}) \times q)$ are sub-matrices extracted from the rows of an identity matrix of order q, $\textbf{I}_{q}$, corresponding to row positions of $\textbf{y}_{i}^{o}$ and $\textbf{y}_{i}^{m}$ in $\textbf{y}_{i}$, respectively. When $\textbf{y}_{i}$ is fully observed, $\textbf{O}_{i} = \textbf{I}_{q}$ and $\textbf{M}_{i}$ is null. Meanwhile, it is easy to verify that: $\textbf{y}_{i} = \textbf{O}^{T}_{i} \textbf{y}_{i}^{o} + \textbf{M}^{T}_{i} \textbf{y}_{i}^{m}$ and $\textbf{O}^{T}_{i}\textbf{O}_{i} + \textbf{M}^{T}_{i} \textbf{M}_{i} = \textbf{I}_{q}$. Using appendix (7.3) and Bayes Theorem the negative log-likelihood for the observations $\textbf{y}_{1}, \textbf{y}_{2}, \ldots , \textbf{y}_{n}$

$$\begin{aligned} & \mathcal {L}\left( \Theta | \textbf{y}^{o}, \textbf{y}^{m}, \mathbf {\tau } \right) = \log p(\textbf{y}^{m}|\textbf{y}^{o}, \varvec{\tau }) p(\varvec{\tau }|\textbf{y}^{o}) = \nonumber \\ & \frac{1}{2} \sum _{i=1}^{n} \left\{ \log |\varvec{\Omega }|) \; + \left( \textbf{y}_{i} -\varvec{\beta }^{t} \textbf{x}_{i}-\varvec{\Lambda } \varvec{\tau }\right) \varvec{\Omega } \left( \textbf{y}_{i}-\varvec{\beta }^{t} \textbf{x}_{i} -\varvec{\Lambda } \varvec{\tau } \right) + \varvec{\tau }_{i} \varvec{\tau }_{i}^{t} \right\} \end{aligned}$$

(2.2)

Then the expected conditional log likelihood $\textbf{Q}(\varvec{\beta }, \varvec{\Omega }, \varvec{\Lambda })$ is expressed as

$$\begin{aligned} \textbf{E}[\mathcal {L}\left( \varvec{\beta }, \varvec{\Omega }, \varvec{\Lambda }| \textbf{y}^{o}, \textbf{y}^{m}, \varvec{\tau } \right) ] \propto \frac{1}{n} \sum _{i=1}^{n} \texttt {tr} \left\{ \textbf{R}_{i}(\varvec{\beta }, \varvec{\Sigma }) \varvec{\Omega } \right\} -\log |\varvec{\Omega }| \end{aligned}$$

where

$$\begin{aligned} \textbf{R}_{i} \left( \varvec{\beta }, \varvec{\Sigma }, \varvec{\Lambda } \right) = \mathbb {E}\biggl [ \left( \textbf{y}_{i}-\varvec{\xi }-\varvec{\Lambda } \varvec{\tau }_{i})(\textbf{y}_{i}-\varvec{\xi }-\varvec{\Lambda } \varvec{\tau }_{i}\right) ^{t} | \textbf{y}^{o}_{i}, \varvec{\Theta } \biggr ] \end{aligned}$$

(2.3)

Regularization for precision matrix $\varvec{\Omega }$

We regularize the entries of $\varvec{\Omega }$ using $\ell _{1}$ penalties, then the regularized conditional log likelihood $\textbf{Q}$ is expressed as:

$$\begin{aligned} \textbf{Q}(\varvec{\beta }, \varvec{\Omega }, \varvec{\Lambda }) + {\mathcal {R}} ( \varvec{\Omega } ) = \textbf{Q}(\varvec{\beta }, \varvec{\Omega }, \varvec{\Lambda }) + \lambda _{1} \sum _{j=1}^{q} \sum _{k=1}^{p} |\varvec{\Omega }_{j,k}| \end{aligned}$$

For sufficiently large tuning parameter values $\lambda _{1}$, the penalty results in precision matrix estimations $\varvec{\Omega }$ with all off-diagonal entries equal to zero. The penalty assumes that some entries are equal to zero. In multi-tissue joint eQTL mapping, a zero in the (j,k)$^{th}$ entry of $\varvec{\Omega }$ indicates independent expression in the j$^{th}$ and k$^{th}$ tissues, given expression in all other tissues and p SNP genotypes (1). We do this for two reasons: first, when $p > n$ (which is the case for practically every gene we test, more SNPs than observations), without penalizing the diagonals, a perfect fit can occur in the M-step and not in E step (see Algorithm 2): Recent literature, has shown that precision matrix estimators employed in predictive models exhibit this behavior [11]. Next we summarize the main ECM Algorithm 2. Algorithm 3 gives a detailed version of Algorithm in appendix 7.3 ):

Remark 1: Step 4 of Algorithm 2 can be expressed

$$\begin{aligned} \varvec{\Omega }^{(k+1)} = \underset{ \varvec{\Omega } \,\in \, {\mathbb {R}}^{q\times q} }{\mathop {\mathrm {arg\,min}}\limits } \biggl [\texttt {tr} (\textbf{R}^{(k)} \varvec{\Omega }) - \log |\varvec{\Omega }| + \lambda _{1} \sum _{j=1}^{q} \sum _{k=1}^{p} |\varvec{\Omega }_{j,k}| \biggr ] \end{aligned}$$

(2.4)

Conveniently, (2.4) is exactly the optimization problem for computing the $\ell _{1}$-penalized normal log-likelihood precision matrix estimator with input sample covariance matrix $\textbf{R}(.)$. Many efficient algorithms and software packages exist for computing (2.4). We can use our R software BiGQUIC to solve (2.4) which is available at BiGQUIC [12]

Remark 2. Step 5 of Algorithm 2 can be expressed as:

$$\begin{aligned} \varvec{\xi }^{(k+1)} =\underset{ \varvec{\xi } \,\in \, {\mathbb {R}}^{p\times q} }{\mathop {\mathrm {arg\,min}}\limits } \frac{1}{n} \sum _{i=1}^{n} \biggl \{(\textbf{y}^{(k)}_{i}-\varvec{\xi }-\varvec{\Lambda } \varvec{\tau }_{i}) \varvec{\Omega }^{(k+1)} (\textbf{y}^{(k)}_{i}-\varvec{\xi }-\varvec{\Lambda } \varvec{\tau }_{i})^{t} \biggr \} \end{aligned}$$

Using appendix (7.3), we have

$$\begin{aligned}\hat{\varvec{\xi }}^{(k+1)} \,=\, \frac{1}{n} \biggl ( \sum _{i=1}^{n} \hat{\textbf{y}}_{i}^{(k)} - \hat{\varvec{\Lambda }}^{(k)} \sum _{i=1}^{n} \hat{\varvec{\eta }}_{i}^{(k)} \biggr )\end{aligned}$$

where

$$\begin{aligned}\hat{\textbf{y}}_{i} \,=\, \varvec{\Omega }^{-1} \textbf{S}^{oo}_{i} \textbf{y}_{i} + \Bigl (\textbf{I} - \varvec{\Omega }^{-1} \textbf{S}^{oo}_{i} \Bigr ) \Bigl ( \hat{\varvec{\xi }} + \hat{\varvec{\Lambda }} \hat{\varvec{\eta }_{i}} \Bigr ) \end{aligned}$$

Illustrations

Simulation study

In this section, we utilize 80 replications of data generated from multivariate regression, where the sample size is $n=50,150$, $p=22$, and $q=24$. The choice of q is made to align with the dimension of the regression models applied to the GTEx data. In each replication, a sparse matrix $\textbf{B}$ is generated using the element-wise product of three matrices: $\textbf{B}=\textbf{W} \odot \textbf{K} \odot \textbf{Q}$ where $\odot$ is the elementwise product, $(\textbf{W})_{ij} \sim N(0,1)$ and $(\textbf{K})_{ij} \sim Bernouilli(s_{1})$ and each row of $\textbf{Q}$ consists of either all 1’s or all 0’s with a success probability of 1’s equal to $s_{2}$.

By generating $\textbf{B}$ in this manner, we anticipate that $(1-s_{2})p$ predictors will be irrelevant for all q responses, and each predictor will be relevant for $s_{1}q$ of all the response variables. An $n \times p$ predictor matrix $\textbf{X}$ is also generated with rows drawn independently from $N(\textbf{0},\varvec{\Sigma }_{\textbf{X}})$, where $(\varvec{\Sigma }_{\textbf{X}})_{ij}=0.7^{|i-j|}$, following the approach of Yuan and Lin (2007) [13] and Peng et al. (2010) [14]. We consider an AR(1) covariance structure for the scale matrix of the errors, which is $\varvec{\Sigma }= \varvec{\rho }^{|i-j|}$.

Lastly, every row of the error matrix $\textbf{E}$ is independently sampled from a multivariate skew-Normal distribution $\mathbb{S}\mathbb{N}(\textbf{0}, \varvec{\Sigma }, \varvec{\Lambda })$, and the response matrix $\textbf{Y}$ is formed as $\textbf{Y} = \textbf{X} \textbf{B} + \textbf{E}$. To reduce computation time, we independently generate validation data (sample size $n=50$) within each replication to estimate the prediction error for the algorithms, akin to performing a K-fold cross-validation for the algorithm, as described in Rothman et al. (2010) [15].

We consider 36 different combinations of $\varvec{\Lambda }$; $\varvec{\rho }$; $s_{1}$ and $s_{2}$ from the following ranges:(1) $\varvec{\rho } = \{ 0, 0.4, 0.8 \}$, (2) $\varvec{\Lambda }=diag (-1, 1, -1, \ldots ,1)$ or $\mathbbm {1}_{q}$, where $\mathbbm {1}_{q}$ is a column vector of ones, (3) $s_{1} = 0.1, 0.5$, and (4) $s_{2} = 1$. Tuning parameters is selected from the set $\{ 2^{a}, a \in 0; \pm 1, \ldots , \pm 5\}$.

Results on simulation study

In the simulation study, we measure the overall performance of various methods in terms of the mean squared prediction error (PE). We have computed the prediction errors (PE) values of the entropy loss functions of the estimators of $\varvec{\Omega }$ for the 80 simulated datasets using the k-NN, Normal, the G-lasso and the TISSLET algorithm (Fig. 2). To assess the significance of the difference in prediction error between the Normal and Tisslet algorithms, we conducted a Wilcoxon signed-rank test ($R_{i}=Sign (D_{i}) - rank (|D_{i}|)$ where $D_{i}$ is the difference of both predictions. The results indicated a statistically significant difference, with a p-value of 0.0124 at the 5% significance level ($\alpha =0.05$).

While the $\ell _{1}$ loss is used to evaluate the performance of $\varvec{\Omega }$, the sparsity recognition performance of $\varvec{\Omega }$ is measured by the true positive rate (TPR) as well as the true negative rate (TNR) defined as

$$\begin{aligned} & \text {TPR}(\hat{\varvec{\Omega }}, \varvec{\Omega }) = \frac{\# \{ (i,j); \hat{\varvec{\Omega }}_{ij} \ne 0 \; \text {and}\; \varvec{\Omega }_{ij} \ne 0 \}}{ \#\{ (i,j); \varvec{\Omega }_{ij} \ne 0 \}}\\ & \text {TNR}(\hat{\varvec{\Omega }}, \varvec{\Omega }) = \frac{\# \{ (i,j); \hat{\varvec{\Omega }}_{ij} = 0 \; \text {and}\; \varvec{\Omega }_{ij} = 0 \}}{ \#\{ (i,j); \varvec{\Omega }_{ij} = 0 \}}\end{aligned}$$

The corresponding true positive rates (TPR) and true negative rates (TNR) for ${\varvec{\Omega }}$ are reported in Tables 1 and 2. From these tables it is evident that a slightly higher TPR is accompanied by a lower TNR for our algorithm Tisslet algorithm versus Molstad et al. algorithm (normal assumption). We have also compared the numerical performance of the algorithms and the G-Lasso method obtained from Friedman et al. (2022) [16] by setting $\Lambda$ to be $\textbf{0}$ for the 36 combinations. In general, it turns out that for higher error correlations ($\varvec{\rho }=0.8$) the mean square error of these methods is somewhat lower compared to the K-NN method, but the estimators of $\varvec{\Omega }$ are improved considerably. Finally, the computational times of the two algorithms are computed.

The computational times for the four algorithms were evaluated and compared. On average, the CPU time ratios of Normal, K-NN, and Lasso to our algorithm, Tisslet, were 3.79, 4.72 and 2.17 respectively. These computations were performed on the Panther cluster, equipped with 2.5 GHz processors. The differences in computational time are attributed to their computational complexities, which are generally O(np), for all except for K-NN, which has a complexity of $O(kn^{2}p)$. Lasso exhibits a lower ratio compared to the other algorithms because it penalizes small values in the covariance matrix, effectively reducing the time spent on its computation.

Table 1 TPR/TNR for the matrix $\varvec{\Omega }$ averaged over 80 replications with $s_{1}=0.1$, $s_{2}= 1$ and $\lambda = (1; 1; 1;\ldots ;1)^{T}$

Full size table

Table 2 TPR/TNR for the matrix $\varvec{\Omega }$ averaged over 80 replications with $s_{1}=0.5$, $s_{2}= 1$ and $\lambda = (1; 1; 1;\ldots ;1)^{T}$

Full size table

Data generating study for SPATC1L gene

We conducted extensive numerical experiments to examine how the number of shared eQTLs, population R$^{2}$ (also known as narrow-sense heritability), and how tissue-tissue correlation structure affect the performance of various methods for estimating eQTL weights across multiple tissues. To closely replicate the conditions of joint eQTL mapping in GTEx data, we obtained whole-genome sequencing SNP genotype data for all SNPs within 500kb of the SPATC1L gene for 620 subjects from the GTEx dataset [17]. After removing highly correlated SNPs (see Data Preparation section), we were left with $p = 1178$ SNP genotypes. For each replication, we then generated $n = 620$ subjects’ expressions in $q = 29$ tissues. Denoting $\textbf{x}_{i} \in \textbf{R}^{p}$ as the SNP genotypes for the i$^{th}$ subject, we generated $\textbf{y}_{i} \in \mathbb {R}^{q}$ as a realization of the random vector $\varvec{\beta }^{t}_{*} \textbf{x}_{i}+\varvec{\epsilon }$ for $i =[n]$, where $\varvec{\beta } \in \mathbb {R}^{p \times q}$ are the eQTL weights and errors are independent and identically distributed as $\mathbb{S}\mathbb{N}_{q}(0, \varvec{\Lambda }, \varvec{\Omega }^{-1}_{*})$ (Same protocol as [8] and [7]). For five hundred independent replications in each scenario, we randomly divided the $n = 620$ subjects into a training set of size $n_{\text {train}} = 400$, a validation set of size 110, and a testing set of size 110.

To create missingness, in each independent replication, we generated $\textbf{y}$ as follows: first, we created a matrix $\textbf{B} \in \mathbb {R}^{p \times q}$ with entries that were independent $\mathcal {N}(0, 1)$. Then, we generated a matrix $\textbf{S} \in \mathbb {R}^{p \times q}$ to be a matrix whose rows are either all zero or all one: we randomly selected s rows to be nonzero, where $s \in [20]$. Given $\textbf{S}$, we then generated a matrix $\textbf{U} \in \mathbb {R}^{p \times q}$ so that each of the q columns has $20-s$ randomly selected entries equal to one only from entries which are zero in S and all others equal to zero. We calculated $\textbf{y} = \textbf{B} \odot \textbf{S} + \textbf{B} \odot \textbf{U}$, where $\odot$ denotes the element-wise product. This construction ensured that each tissue has twenty total eQTLs, s of which are shared across all tissue types. We considered $s = \{5, 10, 15, 18, 20\}$ in the simulations presented in this section. It is important to note that due to high linkage disequilibrium, many SNPs are highly correlated, resulting in a larger number of SNPs associated with gene expression. Figure 3 gives a summary of the protocol for generating data.

In addition, we randomly introduced missing values to both the training and validation set responses with a missing probability of 0.55, which corresponds to the missing rate in the GTEx gene expression data. For each method, we trained the model on the training data, chose tuning parameters using the validation data, and evaluated prediction and variable selection accuracy on the testing data.

To construct the covariance matrix $\varvec{\Omega }$, we build it to have a block-diagonal structure and to control the R$^{2}$. Specifically, we consider a covariance structure for the scale matrix of the errors, which is $\varvec{\Sigma }_{ij} = \rho ^{|i-j|}, \; \text {for} \; i \in [20,20]$ and $\varvec{\Sigma }_{ij} = (\rho +0.2)^{|i-j|}, \; \text {for} \; i \in [10,10]$.

Results on SPATC1L gene study

Comparative Analysis of Covariance Approaches: We evaluated the s-TISSLET imputation method’s performance at different values of s for imputing missing data generated in section (3.3). Figure 4 shows that despite the varying s values, the imputed distributions closely align with the original distribution. This indicates that the imputation method maintains the overall structure of the gene expression data, which is critical for the reliability of subsequent analyses.

Dataset size and prediction accuracy. In the assessment of the TISSLET framework’s performance, a nuanced relationship between dataset size (N) and R$^2$ scores became apparent (Fig. 5a). An increase in N was correlated with a decline in R$^2$ scores, underscoring the challenges in sustaining prediction accuracy with the expansion of data volume inherent in multi-omics studies. Despite this, the TISSLET framework displayed a remarkable resilience in R$^2$ scores across varying levels of skewness in gene expression data (Fig. 5b), attesting to its robustness and adaptability to the diverse data distributions encountered in multi-omics research. Further illustrating its efficacy, the TISSLET method demonstrated superior predictive accuracy when compared with other imputation methods such as MEANimputer, k-NN and iterative imputation for multi-tissue gene expression prediction (Fig. 5c), reinforcing its suitability for complex genomic analyses.

Impact of the number of causal variants in simulation study: We extended our simulation study to assess the impact of a larger number of causal variants, ranging from 5 to 125, across a wide range of heritability levels. We found that increasing the number of causal variants had very little effect on the predictive performance of TISSLET. We believe this is because the expected imputation accuracy primarily depends on the total heritability explained by the causal SNPs (Fig. 6).

Performance of fitting the skewed normal versus normal on predicted gene: Finally the density plot provides a visual comparison between the distributions of gene expression predictions using two [8] and our method of imputation. The curves are closely aligned and have a similar shape, peaking around a value of 1, which suggests that both methods yield relatively similar predictions in terms of their distribution. However, one curve is slightly shifted towards the right indicating that our approach may predict slightly higher expression values compared to the one that uses Normality assumption (Fig. 5d).

Results of for SPATC1L gene study: In this section, we present results with tissue-tissue correlation for several errors, where $\varvec{\rho } \in \{0; 0.1; 0.3; 0.5; 0.7 \}$ varying. In this setting, we observe that our method, TISSLET performs better than all realistic competitors: only G-Lasso, outperforms slightly our method when $\rho$ is less or equal than 0.015. As one would expect, when expression is nearly uncorrelated ($\rho = 0$), our method TISSLET performs better than all the others 3 methods that does implicitly assumes no tissue-tissue correlation. Remarkably, when $\rho$ is greater than or equal to 0.3, TISSLET outperforms even the normal method which assume normality and missingness. In fact, the prediction accuracy of TISSLET increases as $\rho$ increases, whereas all other methods, which do not explicitly model tissue-tissue correlation, have prediction accuracy remaining constant or slightly decreasing as $\rho$ increases. This demonstrates the benefit of not only accounting for tissue-tissue correlation in multi-tissue joint eQTL mapping but also assume the correct distribution of the gene expression when expression across tissue types can be reasonably assumed to be conditionally dependent (Fig. 7).

Furthermore, the heatmap of gene expressions presented in Fig. 8 illustrates a detailed example of a specific gene. In this case, out of the eight eQTLs (expression quantitative trait loci) that have been identified for this gene, seven of these eQTLs are consistently shared across all 29 examined tissues, demonstrating a uniform regulatory effect. Additionally, one of the eQTLs is shared across 28 of the 29 tissues, indicating a high level of commonality with a single tissue exception. This comprehensive visualization underscores the widespread influence of these eQTLs on gene expression across a broad range of tissue types, highlighting their significant role in the regulatory network.

Conclusion and discussion

Our results suggest that the proposed TISSLET framework can robustly handle missing data in multi-tissue gene expression studies. The framework’s ability to retain accuracy across different imputation parameters, such as ‘k’, and its superior performance compared to other imputation methods, underscores its potential in improving complex trait predictions in multi-omics studies. The consistency of R² scores regardless of skewness and dataset size provides evidence of the model’s stability. However, the observed decline in prediction accuracy with increasing dataset size suggests that further optimization of the framework may be necessary to manage the complexities of large-scale multi-omics data.

In conclusion, our findings advocate for the TISSLET framework’s application in multi-omics studies to improve the prediction of complex traits. Its ability to accurately account for tissue-tissue correlations and withstand data distribution asymmetries positions it as a powerful tool in the advancement of personalized medicine. As the complexity of multi-omics studies grows, the development of robust statistical machine learning like TISSLET is crucial for unveiling the genetic foundations of complex traits and advancing our understanding of gene expression dynamics.

While the TISSLET framework marks a significant advancement in gene expression prediction across multiple tissues, this study demonstrates its precision and stability, showcasing its high accuracy in predictions. However, the framework is still in its foundational stages and requires further refinement. The next steps involve optimizing the model to enhance its performance, particularly for large-scale datasets, thereby fully realizing its potential in complex multi-omics research and personalized medicine.

Availability of data and materials

Not Applicable.

References

Grinberg NF, Wallace C. Multi-tissue transcriptome-wide association studies. Genetic Epidemiology. 2021;45(3):324–37.
Article CAS PubMed Google Scholar
Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC, Consortium G, Nicolae DL, et al. A gene-based association method for mapping traits using reference transcriptome data. Nature genetics. 2015;47(9):1091–8.
Article CAS PubMed PubMed Central Google Scholar
Manor O, Segal E. Genoexp: a web tool for predicting gene expression levels from single nucleotide polymorphisms. Bioinformatics. 2015;31(11):1848–50.
Article CAS PubMed PubMed Central Google Scholar
Lage K, Hansen NT, Karlberg EO, Eklund AC, Roque FS, Donahoe PK, Szallasi Z, Jensen TS, Brunak S. A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proceedings of the National Academy of Sciences. 2008;105(52):20870–5.
Article CAS Google Scholar
Bai Y, Wang X, Hou J, Geng L, Liang X, Ruan Z, Guo H, Nan K, Jiang L. Identification of a five-gene signature for predicting survival in malignant pleural mesothelioma patients. Frontiers in Genetics. 2020;11:899.
Article CAS PubMed PubMed Central Google Scholar
Zhang N, Li P-c, Liu H, Huang T-c, Liu H, Kong Y, Dong Z-c, Yuan Y-h, Zhao L-l, Li J-h. Water and nitrogen in-situ imaging detection in live corn leaves using near-infrared camera and interference filter. Plant Methods. 2021;17:1–11.
Article CAS Google Scholar
Hu Y, Li M, Lu Q, Weng H, Wang J, Zekavat SM, Yu Z, Li B, Gu J, Muchnik S, et al. A statistical framework for cross-tissue transcriptome-wide association analysis. Nature genetics. 2019;51(3):568–76.
Article CAS PubMed PubMed Central Google Scholar
Molstad AJ, Sun W, Hsu L. A covariance-enhanced approach to multi-tissue joint eqtl mapping with application to transcriptome-wide association studies. The annals of applied statistics. 2021;15(2):998.
Article PubMed PubMed Central Google Scholar
Church BV, Williams HT, Mar JC. Investigating skewness to understand gene expression heterogeneity in large patient cohorts. BMC bioinformatics. 2019;20:1–14.
Article Google Scholar
Hosking JRM, Wallis JR, Wood EF. Estimation of the generalized extreme-value distribution by the method of probability-weighted moments. Technometrics. 1985;27(3):251–61.
Article Google Scholar
Neal RM, Hinton GE. A view of the em algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models, 1998;pp. 355–368. Springer, ???
Hsieh C-J, Sustik MA, Dhillon IS, Ravikumar PK, Poldrack R. Big & quic: Sparse inverse covariance estimation for a million variables. Advances in neural information processing systems 2013;26
Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007;94(1):19–35.
Article Google Scholar
Peng C. Estimating and testing quantile-based process capability indices for processes with skewed distributions. Journal of Data Science. 2010;8(2):253–68.
Article Google Scholar
Rothman AJ, Levina E, Zhu J. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics. 2010;19(4):947–62.
Article PubMed PubMed Central Google Scholar
Friedman J TR. Hastie T: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2007;9(3)
Morgan A, Vuckovic D, Krishnamoorthy N, Rubinato E, Ambrosetti U, Castorina P, Franzè A, Vozzi D, La Bianca M, Cappellani S, et al. Next-generation sequencing identified spatc1l as a possible candidate gene for both early-onset and age-related hearing loss. European Journal of Human Genetics. 2019;27(1):70–9.
Article CAS PubMed Google Scholar
Cai W.L. Tony, Luo X. A constrained 1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106(494):594–607.
Article CAS Google Scholar
Elanbari M, Rawi R, Ceccarelli M, Bouhali O, Bensmail H. Advanced computation of a sparse precision matrix hadap: A hadamard-dantzig estimation of a sparse precision matrix. In: Proceedings of the Sixth International Conference on Computational Logics, Algebras, Programming, Tools, and Benchmarking 2015. Citeseer

Download references

Acknowledgements

We extend our gratitude to Dr. Molstad from the University of Minnesota for his invaluable assistance in providing access to his code and offering detailed explanations.

Funding

Not Applicable.

Author information

Ahmed Miloudi and Aisha Al-Qahtani These authors equally contributed to this work.

Authors and Affiliations

Faculty of Medicine and Pharmacy-FUSMBA, Fes, Morocco
Ahmed Miloudi & Mohamed Chikri
Qatar Computing Research Institute, HBKU, Doha, Qatar
Aisha Al-Qahtani & Halima Bensmail
Carnegie Mellon University-Qatar, Doha, Qatar
Thamanna Hashir

Authors

Ahmed Miloudi
View author publications
You can also search for this author inPubMed Google Scholar
Aisha Al-Qahtani
View author publications
You can also search for this author inPubMed Google Scholar
Thamanna Hashir
View author publications
You can also search for this author inPubMed Google Scholar
Mohamed Chikri
View author publications
You can also search for this author inPubMed Google Scholar
Halima Bensmail
View author publications
You can also search for this author inPubMed Google Scholar

Contributions

The mathematical model was developed by H.B., writing the theoretical part and overview revision of the paper. M.C. Revised the paper. The author A.M. worked on running the mathematical model and producing the results. The author T.H. worked on producing the results in the earlier stage. The author A.A. worked on writing the results and discussion part with H.B. and A.M.; The corresponding author for submitting the article is A.A..

Corresponding authors

Correspondence to Aisha Al-Qahtani, Mohamed Chikri or Halima Bensmail.

Ethics declarations

Ethics approval and consent to participate

Not Applicable.

Competing interests

The authors have no Conflict of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Definition and notation

$\varvec{\beta }_{.,m_{i}}= \textbf{M}_{i} \varvec{\beta }_{.,i}^{t} \quad \quad \quad \quad \; \varvec{\beta }_{.,o_{i}} = \textbf{O}_{i} \varvec{\beta }_{.i}^{t}$
$\varvec{\Sigma }_{o_{i}} = \textbf{O}_{i} \varvec{\Sigma } O_{i}^{t} \quad \quad \quad \quad \; \; \varvec{\Sigma }_{o_{i}} \in \textbf{R}^{(q_{0} \times q_{0})}$
$\varvec{\Sigma }_{m_{i}} = \textbf{M}_{i} \varvec{\Sigma } \textbf{M}_{i}^{t} \quad \quad \quad \; \; \varvec{\Sigma }_{m_i} \in \textbf{R}^{(q-q_{0}) \times (q-q_{0})}$
$\varvec{\Sigma }_{o_{i},m_{i}} = \textbf{O}_{i} \varvec{\Sigma } \textbf{M}_{i}^{t} \quad \quad \quad \varvec{\Sigma }_{o_i,m_i} \in \textbf{R}^{(q_{0} \times (q-q_{0}))}$
$\varvec{\Sigma }_{m_{i},o_{i}} = \textbf{M}_{i} \varvec{\Sigma } \textbf{O}_{i}^{t} \quad \quad \quad \varvec{\Sigma }_{m_i,o_i} \in \textbf{R}^{(q-q_{0}) \times q_{0}}$
$\textbf{y}_{i}^{m_i} \sim \textbf{N}_{q-q_{0}} (\varvec{\beta }_{.,m_i}^{T} \textbf{x}_{i}, \varvec{\Sigma }_{m_{i}})$
$\textbf{y}_{i}^{o_i} \sim \textbf{N}_{q_{0}} (\varvec{\beta }_{.,o_i}^{T} \textbf{x}_{i}, \varvec{\Sigma }_{o_{i}})$

Penalized normal log-likelihood

Using Aaron et al. (2020), then can summarize the derivation of EM for estimating missing gene expression $\hat{y}_{i}$, eQTL weights $\beta$ and cross-tissue correlation $\Omega$.

$$\begin{aligned} \textbf{y}_{i}^{m_i}|y_{i}^{o_i}, \varvec{\Theta } \sim \textbf{N} (\mathbf {\mu _{i}}, \textbf{V}_{i}) \end{aligned}$$

where

$$\begin{aligned} \varvec{\mu }_{i}= & \varvec{\beta }_{.,m_i}^{T} \textbf{x}_{i} + \varvec{\Sigma }_{m_i, o_i} \varvec{\Sigma }_{o_i}^{-1} \left( \textbf{y}^{o_i} - \varvec{\beta }_{.,o_i}^{T} \mathbf {x_{i}} \right) \end{aligned}$$

(5.1)

$$\begin{aligned} \textbf{V}_{i}= & \varvec{\Sigma }_{m_i} - \varvec{\Sigma }_{m_i,o_i} \varvec{\Sigma }_{o_i}^{-1} \varvec{\Sigma }_{o_i,m_i} \end{aligned}$$

(5.2)

Then the negative log-likelihood of the conditional multivariate normal distribution for $\textbf{y}_{i, m_i} \mid \textbf{y}_{i, o_i}$ is proportional to

$$\begin{aligned} \frac{1}{n} \sum _{i} \left( \mathbf {y_i}^{o_i}-\varvec{\beta }_{.,o_i}^{T} \textbf{x}_{i} \right) ^{T} \left( \mathbf {y_i}^{m_i} - \varvec{\beta }_{.,m_i}^{T} \textbf{x}_{i} \right) \varvec{\Sigma }^{-1} \left( \mathbf {y_i}^{o_i}-\varvec{\beta }_{.,o_i}^{T} \textbf{x}_{i} \right) ^{T} \left( \mathbf {y_i}^{m_i} - \varvec{\beta }_{.,m_i}^{T} \textbf{x}_{i} \right) ^{T} + \log |\varvec{\Sigma }| \end{aligned}$$

The negative log likelihood is useful in the EM steps because we needed to calculate the expectation of the log-likelihood $Q(\Theta )$:

$$\begin{aligned} \textbf{Q}(\varvec{\Theta }) = \texttt {tr} \left\{ \textbf{S}_{(\varvec{\beta }, \varvec{\Sigma })} \varvec{\Omega } \right\} -\log \det \varvec{\Omega } \end{aligned}$$

We regularize the entries of $\varvec{\beta }$ and $\varvec{\Omega }$ using ${l}_{1}$ penalties, and estimate them by minimizing the penalties likelihood

$$\begin{aligned} \textbf{Q}(\varvec{\Theta }) = \texttt {tr} \left\{ \textbf{S}_{(\varvec{\beta }, \varvec{\Sigma })} \varvec{\Omega } \right\} -\log \det \varvec{\Omega } + \lambda _{1}\Vert \Omega \Vert _{1} + \lambda _{2} \sum _{jk} |\beta _{jk}| \end{aligned}$$

where the empirical covariance matrix

$$\begin{aligned} & \textbf{S}( _{(\varvec{\beta }, \varvec{\Sigma })} = \frac{1}{n} \sum _{i=1}^{n} \varvec{\Gamma }_i \nonumber \\ & \varvec{\Gamma }_i = \begin{pmatrix} \varvec{\Gamma }_{o_i,o_i} & \quad & \varvec{\Gamma }_{o_i,m_i} \\ \varvec{\Gamma }_{m_i, o_i} & \quad & \varvec{\Gamma }_{m_i, m_i} \end{pmatrix} \nonumber \\ & \varvec{\Gamma }_{o_i,o_i}= (\textbf{y}_{i, o_i}^{T} - \textbf{x}_{i}^{T} \varvec{\beta }_{\cdot , o_i} )^T (\textbf{y}_{i, o_i}^{T} - \textbf{x}_{i}^{T} \varvec{\beta }_{\cdot , o_i}) \end{aligned}$$

(5.3)

$$\begin{aligned} & \varvec{\Gamma }_{m_i,m_i}= (\varvec{\mu }_{i, m_i}^{T} - \textbf{x}_{i}^{T} \varvec{\beta }_{\cdot , m_i} )^T (\varvec{\mu }_{i, m_i}^{T} - \textbf{x}_{i}^{T} \varvec{\beta }_{\cdot , m_i}) + \textbf{V}_{m_i} \end{aligned}$$

(5.4)

$$\begin{aligned} & \varvec{\Gamma }_{m_i, o_i}= (\textbf{y}_{i, o_i}^{T} - \textbf{x}_{i}^{T} \varvec{\beta }_{\cdot , o_i} )^T (\varvec{\mu }_{i} - \textbf{x}_{i}^{T} \varvec{\beta }_{\cdot , m_i}) \end{aligned}$$

(5.5)

Then the algorithm is summarized as the following: