BiomiX, a user-friendly bioinformatic tool for democratized analysis and integration of multiomics data

Abstract

Background

Understanding changes in biological systems requires interpreting vast amounts of multi-omics data. While user-friendly tools exist for single-omics analysis, integrating multiple omics still requires bioinformatics expertise, limiting accessibility for the broader scientific community.

Results

BiomiX tackles the bottleneck in high-throughput omics data analysis, enabling efficient and integrated analysis of multiomics data obtained from two cohorts. BiomiX incorporates diverse omics data, using the DESeq2/Limma packages for transcriptomics and quantifying metabolomics peak differences, evaluated via the Wilcoxon test with False Discovery Rate correction. Annotation of untargeted Liquid Chromatography-Mass Spectrometry metabolomics data is additionally supported, using the mass-to-charge ratio in the CEU Mass Mediator database and fragmentation spectra in the TidyMass package. Methylomics analysis is performed using the ChAMP R package. Finally, Multi-Omics Factor Analysis (MOFA) integration identifies shared sources of variation across omics data. BiomiX also generates statistics and report figures, and integrates EnrichR and GSEA for biological process exploration, as well as subgroup analysis based on user-defined gene panels to enhance condition subtyping. BiomiX fine-tunes MOFA models to optimize the number of factors, distinguishes between cohorts, and provides tools to interpret discriminative MOFA factors. The interpretation relies on an innovative bibliography search on PubMed, which retrieves the articles most related to the top contributors of the discriminant factor. Furthermore, discriminant MOFA factors are correlated with clinical data, and the top contributing pathways are explored, all with the aim of guiding the user in factor interpretation.

Conclusions

The analysis of single-omics and multi-omics integration in a standalone tool, along with MOFA implementation and its interpretability via literature, represents significant progress in the multi-omics field in line with the “Findable, Accessible, Interoperable, and Reusable” data principles. BiomiX offers a wide range of parameters and interactive data visualization, allowing for personalized analysis tailored to user needs. This R-based, user-friendly tool is compatible with multiple operating systems and aims to make multi-omics analysis accessible to non-experts in bioinformatics.

Background

The rise of high-throughput technologies has enabled the generation of vast amounts of data on multiple levels of biological organization, as observed in the autoimmunity field with the European PRECISESADS database [1, 2], which collected multiomics data from more than 2000 individuals, comprising patients with seven autoimmune diseases and controls. This revolution brought new tools for analyzing single omics with high efficiency. The most common packages are DESeq2 [3], EdgeR [4], and Limma [5] for transcriptomics RNA sequencing, while ChAMP [6] and IMA [7] are R packages for methylomics analysis. For metabolomics, given the complexity of the workflows, particularly untargeted approaches, various tools have been developed to address the different stages of the process (e.g. peak deconvolution, alignment, normalization, data curation, statistical analyses, peak annotation, etc.) [8]. To meet these needs, numerous tools cover one or several workflow stages, available as both R packages (e.g. XCMS [9], batchCorr [10], notame [11], MetID [12], etc.) and user-friendly software platforms (Metaboanalyst [13], mzMine [14], MS-DIAL [15], Sirius [16], etc.). This wide range of tools reflects the rapid advancements in metabolomics, offering researchers robust options to handle every step of the workflow.

Similarly, the integration of metabolomics with other omics has become increasingly feasible and appealing over the past decade, fueling the current revolution in multi-omics integration. State-of-the-art approaches for multi-omics integration fall into early, middle, and late integration families. Late integration focuses on identifying overlapping significant results across different omics layers, while early integration involves concatenating and imputing missing data before analyzing the unified multi-omics matrix. However, early integration does not account for the distinct data distributions of the various omics, unlike middle methods, which address this by transforming and processing omics data according to their specific distributions [17]. This advantage has made middle methods the most widely used and versatile integration approaches. Various algorithms belong to this family, including matrix factorization–regression and association methods such as Multi-Omics Factor Analysis (MOFA) [18], Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies (DIABLO) [19], other matrix factorization methods [20, 21], iClusterPlus [22], and network analysis. These include Bayesian networks such as PAthway Recognition Algorithm using Data Integration on Genomic Models (PARADIGM) [23] and matrix factorization-based methods such as NEighborhood based Multi-Omics clustering (NEMO) [24] and Similarity Network Fusion (SNF) [25]. However, each available tool was developed to solve a specific task, such as disease subtyping, disease insight, or biomarker prediction. These tools require expertise in coding and bioinformatics, making them difficult to access for specialized biologists and clinicians who do not have coding skills.

The suggestion to shift biological research towards a multi-omics approach is supported by the availability of databases that provide cross-analysis of multi-omics data, such as the Cancer Genome Atlas (https://cancergenome.nih.gov/) and the Omics Discovery Index (https://www.omicsdi.org). Multi-omics studies can alternatively be found by consulting single-omics repositories, including Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) for transcriptomics and methylation, Proteomics Identification Database PRIDE (https://www.ebi.ac.uk/pride/) for proteomics [26], and MetaboLights (https://www.ebi.ac.uk/metabolights/) for metabolomics data [27]. Bioinformaticians and data scientists must provide access to novel multi-omics integration resources and tools, especially to specialists in fields such as biology and clinical research. This is both an ethical and pragmatic necessity, as these experts are best equipped to fully understand pathological or biological alterations, such as those seen in diseases. Only a few tools, like MixOmics, have attempted to democratize these omics methods [28]. Unfortunately, the integration tools available to those without bioinformatics expertise remain limited, and comprehensive bioinformatics toolkits often require the concatenation of multiple tools even for single-omics analysis.

This need has driven the development of BiomiX, a solution to ease access for users without bioinformatics expertise and our contribution to democratizing multi-omics integration methods for the scientific community. To our knowledge, BiomiX is the first bioinformatics tool to include both single-omics analyses and their multiomics integration. BiomiX utilizes MOFA, a middle integration method that offers a more intuitive interpretation compared to other integration approaches (e.g. early and late). It stands out by selecting relevant factors through regularization, capturing variability across omics, and identifying key contributing variables. The BiomiX implementation of MOFA allows tuning of the total number of factors and identification of the biological processes behind the factors of interest through clinical data correlation and pathway analysis. BiomiX implements, for the first time, factor identification through a bibliography search on PubMed, underlining the importance of integrating literature knowledge in the interpretation of MOFA factors. BiomiX also provides robust, validated pipelines in single omics with additional functions, such as sample subgrouping analysis, gene ontology, annotation, and summary figures. The graphical user interface of BiomiX ensures user-friendliness and flexibility, and handles transcriptomics, metabolomics, methylomics, and unlabelled data, as well as their integration. All this provides a wide choice of parameters and interactive data visualization, supported by tutorials on an instructive website.

Implementation

BiomiX graphics, parameters and data manipulation

Graphical user interface and R environment

The BiomiX interface was developed using the Python toolkit PyQt5. It allows the choice of the analysis and desired parameters for each omics (transcriptomics, metabolomics, and methylomics) to provide the output results and prepare the data for the integration analysis. The global script structure of the BiomiX system is shown in Fig. 1. The interface runs on Windows, Linux, and Mac. The download and tutorial are available on the following BiomiX GitHub pages, respectively: https://github.com/IxI-97/BiomiX and https://ixi-97.github.io. The installation takes place in a conda environment.

Fig. 1

BiomiX Pipeline and Script Schema. This illustration outlines the general BiomiX structure and its scripts. The upper part depicts the input table and the three main scripts for analyzing single omics, including the output results and transformed matrices. These matrices are then used as inputs for MOFA integration, where users can either tune or arbitrarily select the total number of factors. In either case, a distinct script is executed. The bottom part represents the definition of the discriminant factor and the extraction of its top features. These discriminant factors are correlated with clinical data to identify significant correlations using Pearson correlation (A), while the top features are analysed through pathway analysis (B) and explored via a text-mining/PubMed search approach (C)

BiomiX interface parameters

BiomiX aims to provide a simple, intuitive user interface, as shown in Fig. 2. The program launcher prompts users to upload a metadata file containing samples for analysis in the omics databases. It then generates the main interface, which allows users to select a detected group as a control and a condition/disease group for analysis. The interface displays all groups available in the provided database. It consists of six rows, each representing a slot for omics data, and multiple columns that help users define the input and the analysis to be performed. Users are prompted to specify whether the data should be analysed or integrated, the type of omics data, and a label to name the output folder. This label can also be used to generate a regex for filtering samples by sample names. Single omics analysis and integration are independent processes, so neither needs to be completed before the other.

Fig. 2

BiomiX Interface. The illustration of the BiomiX main interface on the left and advanced options windows on the right. The advanced options are divided into four sections: general, metabolomics annotation, metadata, and MOFA

An input button allows users to upload the matrix file to BiomiX. They are then asked if they wish to modify the matrix format. If so, an assisted format converter guides them through the conversion process to the BiomiX format. Transcriptomics, metabolomics, methylomics and undefined data can be added, analysed, and integrated in any combination. In this manuscript, the term "analysis" in single omics refers to the statistical comparison of variables between two groups of interest. Specifically, this includes differential gene expression (DGE) analysis for transcriptomics, differential metabolite abundance analysis for metabolomics, differential methylation analysis for methylomics, and t-tests or Wilcoxon tests for undefined omics. Once the databases for integration are selected, the parameters for MOFA integration and advanced options can be set in the lower section of the interface. Users can define an arbitrary number of MOFA factors or choose an automatic tuning option to determine the optimal number of factors in the MOFA model. One factor can be selected from the interface as the focus of the final report, which displays omics contributions, clustering and heatmaps. MOFA supports the integration of samples not shared across all omics datasets, though this can introduce bias if it applies to the majority of the data. To address this, BiomiX includes a parameter that allows users to filter the samples in the integration analysis based on a minimum number of shared omics.
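
The sample filter described above can be illustrated with a short Python sketch. It is not BiomiX code: the function name `filter_by_shared_omics`, the dictionary layout and the sample IDs are hypothetical, and the only behaviour taken from the text is that a sample is kept when it appears in at least a user-defined number of omics layers.

```python
from collections import Counter

def filter_by_shared_omics(samples_by_omics, min_shared):
    """Keep samples present in at least `min_shared` omics layers.

    samples_by_omics: dict mapping an omics name to its list of sample IDs
    (hypothetical layout, chosen for illustration only).
    """
    counts = Counter()
    for sample_ids in samples_by_omics.values():
        counts.update(set(sample_ids))  # set() guards against duplicated IDs
    keep = {s for s, n in counts.items() if n >= min_shared}
    # Preserve the original sample order within each layer
    return {omics: [s for s in ids if s in keep]
            for omics, ids in samples_by_omics.items()}

# Made-up example: P4 is only present in one layer
layers = {
    "transcriptomics": ["P1", "P2", "P3"],
    "metabolomics": ["P1", "P3", "P4"],
    "methylomics": ["P1", "P2"],
}
filtered = filter_by_shared_omics(layers, min_shared=2)
```

With `min_shared=2`, sample P4 is dropped from the metabolomics layer, while samples measured in two or three layers are retained.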

Advanced options enable deeper customization of the analysis and are divided into five sections. The first is the “general” section, which includes Log2FC, adjusted p-value threshold, CPU usage, the number of input variables for MOFA, the gene panel, and the criteria for panel positivity.

The second section focuses on metabolomics, allowing users to select the type of metabolomics annotation and, primarily, to configure settings related to metabolite identification. One option covers targeted metabolomics or untargeted datasets whose peaks have previously been annotated using external resources. This option supports metabolomics data obtained from any analytical platform, such as Liquid Chromatography Mass Spectrometry (LC–MS), Gas Chromatography coupled to Mass Spectrometry (GC–MS), Capillary Electrophoresis coupled to Mass Spectrometry (CE-MS) or Nuclear Magnetic Resonance (NMR), where the metabolites' biological identities are available. The difference between targeted and untargeted metabolomics lies in the precise quantification of predefined metabolites for the former, whereas the latter provides a broad profile of all detectable metabolites in a sample. Furthermore, since high resolution mass spectrometry (HRMS) is the most widely used platform in untargeted metabolomics and annotation is a workflow bottleneck [29, 30, 31], BiomiX offers annotation at both the MS1 and MS2 levels. HRMS generally refers to techniques providing the highest precision in measuring molecules' mass-to-charge ratio (m/z). Users can upload MS1 files containing the mass-to-charge ratios (m/z), and indicate the directories of the .mzML or .mgf files for Data Dependent Acquisition (DDA)-MS2 annotation. DDA consists of collecting data from metabolites fragmented within a specified mass range in tandem mass spectrometry. Users can also prioritize metabolomics databases, such as Human Metabolome Database (HMDB), Kyoto Encyclopedia of Genes and Genomes (KEGG), LipidMap, Metlin, MassBank, and Mass Bank of North America (MoNA).

The third section allows users to filter samples based on the provided metadata information, either by a threshold or by a group within a selected metadata column (e.g. cell purity, ethnicity and proteinemia). The fourth section customizes the MOFA analysis by adjusting the model, iteration settings, and contribution weight thresholds. It also affects MOFA interpretation by setting the number of articles considered in the bibliography search, the type of clinical data available, and the p-value threshold in pathway analysis. The final section allows users to save the selected parameters.

The BiomiX-assisted format converter and BiomiX toolkit

The BiomiX-assisted format converter is a simple functionality that allows users to modify a matrix directly in the BiomiX interface. It can perform transposition, remove columns or rows, and identify the features column to facilitate the conversion of any data table to the BiomiX format. The BiomiX toolkit also allows users to manipulate the matrix before uploading it. Specifically, it supports imputation methods such as random forest, lasso, and NIPALS (Mixomics) [28], or simply replacing missing values with 0 or the mean/median of the variable. Additionally, variable or sample filtering can be applied based on a user-defined threshold for missing values.
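
As an illustration of the simplest options listed above (missing-value filtering followed by replacement with 0, the mean or the median), the following Python sketch may help. It is a minimal example under stated assumptions, not the BiomiX toolkit itself: the function name, the `max_missing` parameter and the dictionary layout are hypothetical, and the random forest, lasso and NIPALS methods are not covered.

```python
from statistics import mean, median

def filter_and_impute(matrix, max_missing=0.5, method="median"):
    """Drop features with too many missing values, then impute the rest.

    matrix: dict mapping a feature name to its list of values, with None
    marking a missing value (hypothetical layout for illustration).
    """
    fill = {"mean": mean, "median": median, "zero": lambda _: 0.0}[method]
    out = {}
    for feature, values in matrix.items():
        observed = [v for v in values if v is not None]
        missing_fraction = 1 - len(observed) / len(values)
        if not observed or missing_fraction > max_missing:
            continue  # feature exceeds the missing-value threshold
        replacement = fill(observed)
        out[feature] = [replacement if v is None else v for v in values]
    return out

# Made-up peak table: peak_2 is missing in 3 of 4 samples
peaks = {
    "peak_1": [10.0, None, 14.0, 12.0],
    "peak_2": [None, None, None, 5.0],
}
clean = filter_and_impute(peaks, max_missing=0.5, method="median")
```

Here `peak_2` is removed because 75% of its values are missing, and the single gap in `peak_1` is filled with the median of the observed values.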

Preview-QC visualization

To ensure a well-informed use of the uploaded data for both single-omics and integration analyses, BiomiX opens a Shiny interface. In the first tab, the data are pre-explored providing summary figures that visualize normalization status and the expression of key variables, as well as offering Principal component analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and correlation heatmap views. Users can visualize QC samples and their loading order from metabolomics data to assess matrix quality and detect batch effects using PCA and UMAP. Users can apply different data normalization methods, such as Z-Score, Median Absolute Deviation (MAD), Quantile Normalization, Loess Normalization, Variance Stabilizing Transformation (VST), or transform them by Logarithmic, Median Centralization, and Mean Centralization. A comprehensive guideline within the Shiny interface and on the website helps users to choose the most appropriate method. Once transformed, the data are updated regenerating the figures, to show how the modifications affect the dataset. The second tab displays the data matrix, while the third tab allows users to remove a percentile of features with the highest variance and outlier samples, based on the squared sum of the distance of samples from the centroid in the principal components. The p-value corresponds to the quantile of the empirical distance distribution (e.g., a p-value threshold of 0.05 corresponds to the 95th percentile of distances). Finally, the fourth tab enables users to download the normalized or transformed data according to BiomiX format requirements.
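
The outlier criterion of the third tab can be sketched in a few lines of Python. This is only one possible reading of the description above: each sample receives an empirical p-value equal to the fraction of samples lying at least as far from the centroid as it does, and samples whose p-value does not exceed the threshold are flagged. The function name and input layout are hypothetical, and the exact quantile convention used by BiomiX may differ.

```python
def flag_outliers(pc_scores, p_threshold=0.05):
    """Flag samples far from the centroid of the principal-component space.

    pc_scores: dict mapping a sample ID to its list of PC coordinates
    (hypothetical layout for illustration).
    """
    samples = list(pc_scores)
    n = len(samples)
    n_dims = len(pc_scores[samples[0]])
    centroid = [sum(pc_scores[s][d] for s in samples) / n for d in range(n_dims)]
    # Squared sum of the distances from the centroid across components
    dist = {s: sum((x - c) ** 2 for x, c in zip(pc_scores[s], centroid))
            for s in samples}
    flagged = []
    for s in samples:
        # Fraction of samples at least as distant as the current one
        empirical_p = sum(1 for other in samples if dist[other] >= dist[s]) / n
        if empirical_p <= p_threshold:
            flagged.append(s)
    return flagged

# Made-up scores: nine samples close together and one distant sample
scores = {f"S{i}": [float(i % 3), float(i % 2)] for i in range(9)}
scores["S_far"] = [40.0, -35.0]
outliers = flag_outliers(scores, p_threshold=0.15)
```

In this toy example only `S_far` is flagged, since it is the single most distant of ten samples (empirical p-value of 0.1).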

Single omics analysis

BiomiX transcriptomics input and pipeline

BiomiX requires the expression matrix Msg, where the columns “s” represent the samples and the rows “g” contain the genes, identified by Ensembl ID or gene symbol. The matrix should contain the raw counts in integer format, obtained by counting the reads aligned to each gene in the BAM files with tools such as HTSeq-count [32] or featureCounts [33]. Alternatively, transformed data can be used as input in floating-point format; these are automatically recognized and analysed by DGE with Limma. If sex and age are available in the metadata, they are used to correct the DESeq2 and Limma models. The raw counts are used as input for the DESeq2 R package, or the transformed counts for the Limma R package, to compare the two conditions in a DGE analysis. Differentially expressed genes are then sorted in the results files and separated by their significance and up- or down-regulation compared to the controls. The default thresholds for differentially expressed genes are |Log2FC| > 0.5 and p.adj < 0.05. The p-value is adjusted using the False Discovery Rate (FDR) method [34], and a heatmap and a volcano plot are generated. Users can choose the number of visualized genes. The enrichment of biological processes in the results is explored in the R version of EnrichR. Moreover, output files are produced as input for gene set enrichment analysis with GSEA [35] (http://www.gsea-msigdb.org/gsea/index.jsp) or the EnrichR web tool [36] (https://maayanlab.cloud/Enrichr/). The expression matrix of raw counts is by default transformed by variance-stabilizing transformation for data visualization and MOFA integration, unless another method is set up in the pre-QC by the user. A summary of these features and the pipeline is shown in Fig. 3.
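
To make the default thresholds concrete, the sketch below separates up- and down-regulated genes from a results table. It assumes the differential expression statistics have already been computed (for example by DESeq2 or Limma); the function name, tuple layout and all numeric values are made up for illustration.

```python
def classify_genes(results, lfc_cutoff=0.5, padj_cutoff=0.05):
    """Split significant genes into up- and down-regulated lists.

    results: iterable of (gene, log2fc, padj) tuples (hypothetical layout).
    A gene is retained when |log2fc| > lfc_cutoff and padj < padj_cutoff.
    """
    up, down = [], []
    for gene, log2fc, padj in results:
        if padj is None or padj >= padj_cutoff:
            continue  # not significant after FDR correction
        if log2fc > lfc_cutoff:
            up.append(gene)
        elif log2fc < -lfc_cutoff:
            down.append(gene)
    return up, down

# Made-up statistics, for illustration only
table = [
    ("IFI44L", 2.1, 0.001),   # significant and up-regulated
    ("MX1", 0.3, 0.001),      # significant but below the Log2FC cutoff
    ("CD19", -1.2, 0.02),     # significant and down-regulated
    ("ACTB", 1.5, 0.40),      # large change but not significant
]
up, down = classify_genes(table)
```

Both cutoffs correspond to parameters that the text describes as user-adjustable in the “general” section of the advanced options.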

Fig. 3

BiomiX Transcriptomics Pipeline. The illustration depicts the BiomiX transcriptomics pipeline, with preview-QC visualization to transform data for MOFA integration on the right and statistical analysis on the left. The statistical analysis includes a subpopulation step where additional differential gene expression analysis by DESeq2 or Limma is performed on subpopulations with the control (CTRL) as a reference. Positive (pos) and negative (neg) subgroups are created, and a gene panel file is required for the subpopulation analysis. The results folders from the analysis (pos vs CTRL and neg vs CTRL) contain the volcano plot and results table, including Log2FC, p-value, and False Discovery Rate (FDR). The up- and down-regulated genes are separated to facilitate exploration and copying to EnrichR for pathway analysis. Preliminary EnrichR analysis in R is already available in the results. Additionally, the expression matrix is converted into .gct format for GSEA analysis. The gene variance distribution is explored, with corresponding plots and values provided

Subpopulation of differential gene expression analysis based on a gene panel

For transcriptomic single-omics analysis, BiomiX enables the insertion of a panel of genes to identify subpopulations within the condition being analysed and to conduct separate DGE analyses. It compares the patients positive and negative for the gene panel with the control group, providing the same results files as for any transcriptomic analysis. It can confirm subgroups or define new subpopulations within known ones for diseases or treatments with well-known subpopulation markers (e.g. interferon or interleukin signaling genes in autoimmune diseases). Subpopulation recognition relies on the variation measured in standard deviation units compared to the mean expression in the chosen control. This method was inspired by similar approaches to measuring IFN-alpha signaling, such as the Kirou score [37] or similar methodologies [38]. Any panel of genes of interest could be similarly employed. Specifically, the standard deviation score (Z activity score) uses the counts normalized by the number of reads to compare the expression of each gene (g) in each disease or treated sample (s) with the mean expression of the controls, divided by the standard deviation of the controls, as in the following equation:

$$Z\_activity\_score = \frac{{gene\_expression_{gs} - mean\left( {gene\_expression\_control\_population} \right)}}{standard\_deviation (gene\_expression\_control\_population)}$$

The larger the shift in standard deviation units for a gene, the higher its expression in the condition compared to the controls. Subgrouping depends on the user-defined parameters and criteria and is fully customizable. By default, following the Kirou score, samples with three genes with a score > 2 or 10 genes with a score > 1 are labeled positive. A heatmap is built from the standard deviation scores using hierarchical clustering in the Complexheatmap package v2.12.022 in R. By default, Euclidean distance and Ward’s D2 method are used for hierarchical clustering.
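
The scoring and labeling rule can be illustrated with the Python sketch below. It is not the BiomiX implementation: the function names and data layout are hypothetical, the sample standard deviation (n − 1 denominator) is an assumption, and the default rule is read here as "at least three genes with a score > 2, or at least 10 genes with a score > 1".

```python
from statistics import mean, stdev

def z_activity_scores(sample_expr, control_expr):
    """Z activity score of each panel gene in one sample.

    sample_expr: dict gene -> normalized expression in the sample.
    control_expr: dict gene -> list of normalized expression values in controls.
    """
    scores = {}
    for gene, value in sample_expr.items():
        ctrl = control_expr[gene]
        # stdev() is the sample standard deviation (assumption)
        scores[gene] = (value - mean(ctrl)) / stdev(ctrl)
    return scores

def is_panel_positive(scores, rules=((2.0, 3), (1.0, 10))):
    """Positive if any (score threshold, minimum gene count) rule is met."""
    return any(sum(1 for z in scores.values() if z > threshold) >= n_genes
               for threshold, n_genes in rules)

# Made-up panel: controls have mean 10 and standard deviation 2 for every gene
controls = {g: [8.0, 10.0, 12.0] for g in ("A", "B", "C", "D")}
patient = {"A": 16.0, "B": 15.0, "C": 14.5, "D": 10.0}
z = z_activity_scores(patient, controls)
positive = is_panel_positive(z)
```

In this toy case genes A, B and C score 3.0, 2.5 and 2.25, so the first rule is satisfied and the sample is labeled positive. Passing a different `rules` tuple mimics the customizable criteria mentioned above.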

BiomiX metabolomics input and pipeline

For metabolomics data, BiomiX requires a peak signal matrix Msg, where the columns “s” represent the samples, and the rows “g” contain the arbitrary peak numbers. BiomiX is highly flexible, supporting the analysis of both targeted (annotated) and untargeted metabolomics data, with the option to analyse untargeted data either with or without annotation. The untargeted annotation is based on MS1 annotation (mass/charge ratio “m/z”) or MS2 data (MS1 annotation plus raw MS2 fragmentation files in .mzML or .mgf format). Users starting with .mzML files can generate the peak matrix by pre-processing raw data (peak deconvolution, RT alignment, and normalization) with user-friendly tools (MZMine, MS-DIAL and Metaboanalyst) or R package pipelines [14, 39, 40]. This matrix can then be used as input for BiomiX. The articles by Fernández-Ochoa and Shen [41, 42] provide examples of these tools. The peak signals from treated samples are compared to control samples by calculating the Log2FC, which is the log2 of the ratio between their median peak signals. The p-values are evaluated by the non-parametric Mann–Whitney test and corrected through the FDR method [34].
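
The two calculations named above can be written compactly in Python. This sketch assumes the FDR correction is the Benjamini–Hochberg procedure and that the Mann–Whitney p-values have been computed beforehand; the function names and example values are illustrative only.

```python
from math import log2
from statistics import median

def median_log2fc(condition, control):
    """Log2 of the ratio between the median signals of the two groups."""
    return log2(median(condition) / median(control))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up peak signals and p-values
fold_change = median_log2fc([40.0, 80.0, 120.0], [10.0, 20.0, 30.0])
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

The medians here are 80 and 20, giving a Log2FC of 2. Because the ratio uses medians rather than means, a single extreme peak intensity has little influence on the reported fold change.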

Then, the metabolomics peaks are annotated using the CEU Mass Mediator tool [30] through the CMMR R package [43]. The MS1 m/z match is set by default to a 15-ppm error for positive mode, but neutral and negative modes are also available. The adducts available in the positive mode include [M + H]+, [M + 2H]2+, [M + Na]+, [M + NH4]+, and [M + H-H2O]+, and in the negative mode, they include [M-H]−, [M + Cl]−, [M + FA-H]−, and [M-H-H2O]−. By default, all available MS1 databases are examined (the Human Metabolome Database [HMDB], Lipidmaps, Metlin, and Kegg), but their use is customizable. Both the databases and parameters should be carefully reviewed by the user according to the dataset. While the default options allow the analysis to proceed, they do not guarantee high-quality results without proper parameter selection. Therefore, we strongly recommend consulting the BiomiX tutorial and its parameters section before starting (https://ixi-97.github.io). When several candidate metabolites are associated with the same peak, BiomiX examines the HMDB [44] lists of previously identified or predicted metabolites and retains the candidates already identified or detected in a given type of specimen. These include plasma, urine, saliva, cerebrospinal fluid, feces, sweat, breast milk, bile and amniotic fluid samples. When MS/MS spectra are also available, BiomiX will upload all the .mzML or .mgf files in the indicated directory and verify the peak fragmentation spectra, looking for a match in the Mass Bank of North America (MoNA), MassBank, and HMDB. The user must prioritize these databases, using the first as a reference and the others to annotate the metabolomics peaks left unannotated by the higher-priority databases. Priority and use of these databases are fully customizable, but the default order of priority is HMDB, MoNA, and MassBank. The overlaps between the candidate spectra retrieved from the .mzML or .mgf files and those from the databases are saved in the output folder. Each peak annotation detected in MS2 will automatically replace the annotation obtained in MS1 because the former is more reliable. A summary of these features and the pipeline is shown in Fig. 4.
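
The principle of MS1 matching within a ppm tolerance can be sketched as follows. This does not call CEU Mass Mediator or CMMR: the candidate list is a hypothetical two-compound dictionary of monoisotopic masses, the function names are invented for illustration, and the adduct mass shifts are standard values that should be checked against the chosen database before any real use.

```python
PROTON = 1.007276  # mass of a proton in Da

# (mass shift in Da, charge) for the positive-mode adducts listed above
POSITIVE_ADDUCTS = {
    "[M+H]+": (PROTON, 1),
    "[M+2H]2+": (2 * PROTON, 2),
    "[M+Na]+": (22.989218, 1),
    "[M+NH4]+": (18.033823, 1),
    "[M+H-H2O]+": (PROTON - 18.010565, 1),
}

def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def match_peak(observed_mz, candidates, tolerance_ppm=15.0):
    """Return (compound, adduct, ppm error) for every match within tolerance.

    candidates: dict compound name -> neutral monoisotopic mass in Da.
    """
    hits = []
    for name, mono_mass in candidates.items():
        for adduct, (shift, charge) in POSITIVE_ADDUCTS.items():
            theoretical = (mono_mass + shift) / charge
            err = ppm_error(observed_mz, theoretical)
            if abs(err) <= tolerance_ppm:
                hits.append((name, adduct, round(err, 2)))
    return hits

# Hypothetical mini-database (monoisotopic masses in Da)
candidates = {"glucose": 180.06339, "caffeine": 194.08038}
hits = match_peak(181.0710, candidates)
```

The observed m/z of 181.0710 lies within 15 ppm of the protonated glucose ion only, so a single candidate is returned. In real data one m/z often matches several compounds, which is why the specimen-based filtering and MS2 confirmation described above matter.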

Fig. 4

BiomiX Metabolomics Pipeline. This illustration shows the BiomiX metabolomic pipeline after sample data acquisition and pre-processing. It highlights the generation of transformed data via the Preview-QC interface for MOFA integration on the right and statistical analysis on the left. The statistical analysis includes two pipelines based on the peak annotation status. If the annotation is already present (pipeline “A”), such as in targeted or untargeted metabolomics with annotated signals, metabolomic peaks from two conditions (disease and control) are directly analysed to produce a result folder including Log2FC, p-value, and False Discovery Rate (FDR). MetPath analyses the biological pathways of significant metabolites, which are also prepared for copying into Metaboanalyst for further analysis. For untargeted data acquired by HRMS, annotation occurs via pipeline “B”, which is divided into two sub-pipelines depending on the HRMS data type: MS level 1 (Pipeline B.1) or DDA-MS2 (Pipeline B.2). Both pipelines utilize the CEU Mass Mediator database to match the m/z values of metabolomics peaks with those in the database, providing the best matches for each peak. Exclusively for the B.2 pipeline, the fragmentation files (.mzML or .mgf) are compared with the HMDB, MassBank, and MoNA databases to match each fragmentation spectrum with those available in these databases, and the results are stored in the results folder. MS/MS (MS2) annotations replace MS1 annotations due to their higher reliability. For both pipelines (B.1 and B.2), as one m/z value can have multiple annotations, the candidates can be filtered based on reports or predictions of that metabolite in the sample type (i.e., human plasma, urine, feces, and saliva) by HMDB. The remaining pipeline includes statistical and pathway analyses using MetPath and Metaboanalyst, as described previously

The top significantly increased and reduced metabolites are displayed in a volcano plot and heatmap according to user choices. Transformation is applied to the peak signal matrix for MOFA integration by preview-QC, which allows users to visualize the distribution of QC samples in the PCA and UMAP space. To unveil the enrichment in biological pathways, BiomiX exploits the R package MetPath v1.0.5 from TidyMass v1.0.8 [41]. Ready-to-use input files for MetaboAnalyst [13] are generated for metabolomics analysis, including conventional metabolite set enrichment analysis and late integration analysis by joint pathway and network analyses. The late integration utilizes prior results from transcriptomics or methylomics data from the same dataset.

BiomiX methylomics input and pipeline

BiomiX requires the matrix “Msg”, where the columns “s” represent the samples, and the rows “g” contain CpG island annotation. CpG islands are DNA regions rich in cytosine and guanine, where the methylation of these nucleotides can exert epigenetic regulation. The matrix must contain beta values; if unavailable, the Minfi R package [45] can calculate them. BiomiX performs a differential methylation analysis using the ChAMP [6] package, providing the CpG island Δbeta value, the p-value adjusted by FDR, and a summarizing volcano plot. The default thresholds are a beta value change of |Δbeta| > 0.15 and p.adj corrected by the FDR method < 0.05 [34], but the user can customize them. Each methylomics single-omics analysis provides a volcano plot containing the names of the top CpG islands with increased and reduced methylation, as well as a heatmap including the top CpG islands with increased and reduced methylation between the two conditions. Users choose the number of CpG islands to visualize. A complete list of CpG islands with increased or reduced methylation is created. Each CpG is associated with the gene, chromosome, Log2FC, adjusted p-value, and the other ChAMP output columns. The genes associated with the CpG islands with increased or reduced methylation are listed and directly analysed in EnrichR, as for the transcriptomics results. A summary of these features and the pipeline is shown in Fig. 5.

Fig. 5

BiomiX Methylomics Pipeline. This illustration shows the BiomiX methylomics pipeline, with beta values preview-QC visualization for MOFA integration input on the right and statistical analysis on the left. The statistical analysis uses the ChAMP package to identify the CpG islands with the highest variation between the two groups. Volcano plots and summary files assist users in exploring the results. The biological pathway analysis converts CpG islands to their corresponding genes, if available, and uses them as input for further analysis. The pathway analysis results are provided in report form

BiomiX undefined input and pipeline

For undefined data, BiomiX requires a matrix Msg, where columns “s” represent the samples, and rows “g” contain the features. The features from the treated samples are compared with those from the control samples by calculating the Log2FC, which is, for each feature, the log2 of the ratio between the median feature value in the condition-treated samples and the median feature value in the control samples. To accommodate both Gaussian and non-Gaussian data distributions, p-values are calculated using both the non-parametric Mann–Whitney test and the t-test, each adjusted using the FDR method [34].

Multi omics integration analysis

MOFA analysis

MOFA [18] is used according to the developers' guidelines (https://biofam.github.io/MOFA2/) with transformed input data and reduced feature size. In the preview-QC guidelines, the transformation methods are recommended based on the omics type. For transcriptomics data, the variance-stabilizing transformation function in R is suggested to achieve approximate homoscedasticity, while log transformation is suggested for metabolomics data to bring their distribution closer to Gaussian. Methylomics beta values do not require any transformation [18]. The top genes and CpG islands with the highest variance in transformed data are selected for MOFA integration, except for metabolomics data, which typically include fewer than a few thousand peaks. MOFA can calculate any desired total number of factors to explain the shared variance between omics datasets. Other parameters customizable in the interface include convergence mode (speed of the convergence), freqELBO (frequency of evaluation of the Evidence Lower Bound training curve), and Maxiter (number of iterations of the MOFA model). The implementation of MOFA in BiomiX includes an automated optimization of the total number of factors. The tuning mode runs the MOFA algorithm with an increasing number of factors, stopping the iteration when at least three models show the last MOFA factor explaining less than 1% of the variability of the data. Only the top three models for separating the two conditions are retained. The statistical discrimination between the two groups is determined for each calculated factor in each model. A non-parametric Mann–Whitney test compares the factor value distributions between the two groups of samples; the p-values are then corrected using the FDR method [34]. The selected models have the highest number of discriminating MOFA factors. Of the MOFA models with the same number of discriminant factors, only those with the lowest adjusted p-values are selected.
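
The tuning logic described above can be summarized as a short Python sketch. It is a schematic reading of the text, not the BiomiX or MOFA2 code: `fit_model` is a hypothetical callable standing in for a MOFA run, the "at least three models" rule is counted cumulatively (an assumption), and a factor is called discriminant when its adjusted p-value is below 0.05 (also an assumption).

```python
def tune_factor_number(fit_model, start=2, max_factors=50,
                       min_variance=1.0, patience=3, alpha=0.05, keep=3):
    """Return the factor numbers of the `keep` best models.

    fit_model(k) is a hypothetical callable returning a dict with
    'variance' (per-factor % of variance explained) and 'padj'
    (per-factor adjusted p-values for the two-group comparison).
    """
    models, weak_last_factor = [], 0
    for k in range(start, max_factors + 1):
        fit = fit_model(k)
        models.append((k, fit))
        if fit["variance"][-1] < min_variance:
            weak_last_factor += 1  # last factor explains less than 1%
        if weak_last_factor >= patience:
            break

    def rank_key(item):
        _, fit = item
        n_discriminant = sum(p < alpha for p in fit["padj"])
        # More discriminant factors first; ties broken by the smallest adjusted p-value
        return (-n_discriminant, min(fit["padj"]))

    return [k for k, _ in sorted(models, key=rank_key)[:keep]]

def toy_fit(k):
    """Deterministic stand-in for a MOFA run (made-up numbers)."""
    variance = [20.0 / (i + 1) ** 2 for i in range(k)]
    padj = [0.01 / k if i < 3 else 0.5 for i in range(k)]
    return {"variance": variance, "padj": padj}

best_models = tune_factor_number(toy_fit)
```

With the toy function, the last factor drops below 1% of explained variance from five factors onward, so the search stops at seven factors, and the three retained models are ranked by their number of discriminant factors and then by their lowest adjusted p-value.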

The MOFA analysis provides a matrix containing the variance explained by each factor in a defined MOFA model of n factors. Two reports in PDF format summarize the loaded samples, the variance explained by the factors, and the contributions of genes, metabolomic peak signals and/or CpG islands to the selected MOFA factor, which can be explored by scatter plot. Furthermore, a file containing the condition separation performance of each factor, together with the top 5% of features with an absolute weight > 0.50 (by default, but customizable by the user), is saved as output.

Multi omics integration annotation

Extraction of MOFA factors: interpretation

The tuned and arbitrary MOFA integration includes three methods to ease the user’s interpretation of the discriminating MOFA factors. A summary of the MOFA interpretation pipeline in BiomiX is shown in Fig. 6.

Fig. 6

BiomiX MOFA Pipeline. This illustration depicts the BiomiX MOFA pipeline, with input from various normalized omics matrices on the left and their subsequent decomposition into factors. Discriminating MOFA factors and the features contributing most to them are identified and measured by their weights. BiomiX offers several tools to help better understand the nature of these factors. The first tool (A) integrates available clinical and biological data for factor identification. For numeric clinical data, Pearson correlation is used to assess significant correlations with each factor in the model. For binary labeled clinical data, the Wilcoxon test determines whether the factor value difference between the two groups is significant. The second tool (B) uses the most contributing features as input for biological pathway analysis, employing MetPath and EnrichR, depending on the type of omics. The third tool (C) retrieves relevant PubMed abstracts that closely match the most contributing features of the factor. The BiomiX PubMed search operates on three levels. First, it retrieves abstracts containing one or more features from all omics, then from each omics pair, and finally from each single omics. A final table is generated listing the total and unique matches of features within the abstract, along with the DOI, PubMed ID, and keywords. As keywords can be missing, novel keywords are extracted by a text-mining approach using the litsearchr package. Keyword filtering is done through the GSEA Biological Process “BP” and Human Phenotype Ontology “HPO” vocabularies

Correlation analysis

Users can upload a matrix containing binary or numerical clinical features to integrate into the MOFA model. The numerical data are correlated with each MOFA factor using Pearson correlation, while the binary clinical data are analysed using the Wilcoxon test after dividing the samples into positive and negative groups. The nominal p-values are corrected using the Benjamini–Hochberg method [34].
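To make the two cases concrete, the sketch below shows a Pearson correlation between a numerical clinical feature and a factor, and the split of factor values by a binary clinical feature before a two-group test. It is an illustrative Python sketch with hypothetical function names, not the BiomiX code; the significance tests themselves (correlation p-value, Wilcoxon test) are left out.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_by_label(factor_values, labels):
    """Divide factor values into positive and negative groups of a binary feature."""
    positive = [v for v, lab in zip(factor_values, labels) if lab]
    negative = [v for v, lab in zip(factor_values, labels) if not lab]
    return positive, negative
```

The two lists returned by `split_by_label` would then be passed to the two-group test, and all resulting p-values corrected together.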

Pathway analysis

BiomiX retrieves the top contributing genes, metabolites, and CpG islands for discriminating factors in each MOFA model. Depending on the type of omics data, an R package is selected to test whether a biological or metabolic pathway is enriched among the top contributing genes, metabolites, or CpG islands. The genes are analysed by EnrichR using the Reactome and Gene Ontology biological process libraries, together with the Encyclopedia of DNA elements (ENCODE) [46] and ChIP-X Enrichment Analysis (ChEA) [47] consensus transcription factors from ChIP-X libraries, while the metabolites are analysed through MetPath using the KEGG and HMDB databases. CpG islands are associated with their genes, if they exist, and are examined using EnrichR.

PubMed bibliography search

For each discriminating factor in each MOFA model, the top contributing genes, metabolites, and CpG island genes are used as input for a PubMed search. The aim is to retrieve the abstracts of articles associated with each discriminating factor, providing clues to each factor's identity. The search algorithm has three levels, which prioritize results that combine contributors from more omics. Initially, the algorithm selects the top contributors from each omics provided as input and retains only abstracts showing at least one out of ten contributors from every omics in the text. The second level does the same, but it selects article abstracts containing at least one out of ten contributors in omics pairs (e.g., transcriptomics–metabolomics, methylomics–transcriptomics, and metabolomics–methylomics). Finally, the last level selects article abstracts showing at least one out of ten contributors within a single omics in the text.
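The three levels can be illustrated with a small Python sketch that assigns an already retrieved abstract to a level according to how many omics have at least one contributor mentioned in it. The function names and the case-insensitive whole-word matching are assumptions made for illustration, at least two omics are assumed to be provided, and the PubMed querying itself is not shown.

```python
import re


def matched_omics(abstract, contributors_by_omics):
    """Return, per omics, the contributors mentioned in the abstract."""
    text = abstract.lower()
    hits = {}
    for omics, names in contributors_by_omics.items():
        found = [
            name for name in names
            # Whole-term, case-insensitive match (an assumed matching rule).
            if re.search(r"(?<!\w)" + re.escape(name.lower()) + r"(?!\w)", text)
        ]
        if found:
            hits[omics] = found
    return hits


def search_level(abstract, contributors_by_omics):
    """Level 1: all omics matched; level 2: an omics pair; level 3: one omics."""
    n_matched = len(matched_omics(abstract, contributors_by_omics))
    if n_matched == 0:
        return None
    if n_matched == len(contributors_by_omics):
        return 1
    return 2 if n_matched >= 2 else 3
```

When only two omics are integrated, the first two levels coincide under this reading.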

For these three levels, the output document includes a .tsv table containing the PubMed articles, the number of matched contributors out of the total number of contributors, and the number of times contributors are mentioned. Keywords, DOIs, and match information for each contributor are also available. Author-provided keywords are not always reliable because some journals do not include them. Therefore, BiomiX includes a further text-mining analysis. The abstracts retrieved at each level are extracted and parsed with litsearchr version 1.0.0 into two- to four-word combinations [48]. The vocabulary generated from each abstract is analysed to identify the most frequent word combinations. These combinations are filtered against another vocabulary comprising gene set names from Gene Ontology biological processes (7,751 gene sets) and the Human Phenotype Ontology (5,405 gene sets). The 15 most frequent combinations are included in the output .tsv file. Finally, a comprehensive word frequency analysis is performed on all the abstracts retrieved from all three levels.
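The keyword step can be approximated by the sketch below, which counts two- to four-word combinations in the abstracts and keeps those found in a reference vocabulary. It is a plain n-gram counter written in Python for illustration and is not the litsearchr algorithm; the function names and the exact-match filtering rule are assumptions.

```python
import re
from collections import Counter


def word_combinations(text, n_min=2, n_max=4):
    """Yield every run of n_min to n_max consecutive words, lower-cased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def top_terms(abstracts, vocabulary, top_n=15):
    """Count combinations present in the vocabulary; return the most frequent."""
    vocab = {term.lower() for term in vocabulary}
    counts = Counter(
        combo
        for abstract in abstracts
        for combo in word_combinations(abstract)
        if combo in vocab  # exact match against the vocabulary (assumed rule)
    )
    return counts.most_common(top_n)
```

Here `vocabulary` stands for the gene set names taken from the Gene Ontology biological process and Human Phenotype Ontology collections.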

Case studies

Two multi-omics datasets were used to test BiomiX. First, the FastQ files of the tuberculosis dataset were downloaded from ENA (https://www.ebi.ac.uk/ena/browser/home) with the project code PRJNA971365. The data were processed from the FastQ files: quality was checked using FastQC, and adapter-containing reads were trimmed using Trimmomatic v0.39. The reads were aligned to the Ensembl Homo sapiens reference genome (GRCh38) and annotated with GENCODE GRCh38.104 using STAR v2.7.11, running a two-pass mapping strategy with default parameters. Gene quantification was performed using Ht-seq count v0.13.538 with default parameters. At the end of the process, 13 samples per condition, i.e. healthy controls (HC), patients with tuberculosis (PTB) and patients with tuberculosis and diabetes (PTB_DM), were available. For clarity, only the comparison between HC and PTB is reported. The entire dataset is available as an example dataset in BiomiX. The parameters were set to be similar to those of the original work, including |log2FC| > 1 and p.adj < 0.05 for the transcriptomics single-omics analysis, and |log2FC| > 0.5 and p.adj < 0.1 for the metabolomics single-omics analysis. Second, the FastQ files for the Chronic Lymphocytic Leukemia (CLL) dataset were downloaded from http://pace.embl.de/ and analysed using a |log2FC| > 1 and p.adj < 0.05 threshold for transcriptomics analysis, and a |log2FC| > 0.5 and p.adj < 0.1 threshold for metabolomics analysis. Each dataset was analysed individually in BiomiX for single-omics analysis, and omics integration was performed within each dataset for multi-omics analysis.

Results

Implementations and comparison with other tools

BiomiX is designed to simplify the use of the middle integration method MOFA, enhancing factor interpretation through innovative analyses of bibliographic data, pathways, and clinical correlations. It integrates with established platforms like EnrichR and GSEA, and supports a late integration approach via MetaboAnalyst integration for combining single omics results. The tool's development was guided by two key principles. The first was the decision to use matrices as input, reflecting the widespread use of this format in laboratories, consortia, and major public repositories like the Sequence Read Archive (SRA) and MetaboLights. The latter database contains annotation details from Metabolite Annotation/Assignment Files (MAFs) and .mzML or .mgf fragmentation files for MS2 annotation. The second guiding principle was to offer a comprehensive omics toolkit that can run on a standard laptop. To achieve this, computationally intensive tasks, such as converting raw instrument acquisition data into matrices, were excluded from the analysis. To promote reproducibility and reusability, all selected parameters and input files are saved in a report, adhering to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. This ensures that analyses can be easily reproduced and shared with collaborators. Additionally, the use of the set.seed() function in R guarantees reproducibility for each analysis. The BiomiX-assisted format converter can fix compatibility issues by converting the matrix into the proper format through a simple user interface, while the BiomiX toolkit button allows the removal of samples or variables exceeding a user-selected threshold of missing values. Although matrices with missing data are fully compatible with BiomiX, a toolkit button provides a wide choice of imputation methods to treat the data before single omics analysis and integration.

To our knowledge, BiomiX is the first tool designed to both analyse single omics and integrate them using MOFA. The choice of this middle integration method was based on its ease of interpretation and faster computation compared to other methods in the same category. Additionally, its capability to handle missing data within omics and the absence of entire omics blocks in samples makes it well-suited for heterogeneous datasets, reflecting the complexity of real-world data. These characteristics render MOFA more adaptable than similar methods like DIABLO and iClusterBayes, which cannot handle missing omics or data without imputation [49]. Because it can handle missing omics and samples, which is uncommon among other middle integration methods, MOFA has become the primary integration method in BiomiX. Furthermore, since MOFA is an unsupervised integration method, it does not train a model to optimize group differences but rather explores them impartially. Making this method available to non-bioinformaticians could also provide an integration approach that complements those already available on other platforms, such as MixOmics [28].

Moreover, BiomiX has also implemented the MOFA algorithm to select the models containing factors that discriminate between the two conditions of interest. The advantage of this unsupervised approach is that differences between groups are identified through an unbiased analysis, which highlights only group-independent common omics changes. In this context, a group-discriminant MOFA factor must account for a considerable share of the explained variance, which supports its reliability and relevance. BiomiX’s key innovation lies in automatically tuning the number of factors in the MOFA model and in assisting users in identifying and annotating the discriminant MOFA factors. Specifically, it provides a bibliography search on PubMed based on the discriminant factor contributors, a biological pathway analysis of the contributors, and a correlation analysis between discriminant MOFA factors and clinical data. The BiomiX approach differs from X-omics ACTION [50], a novel Nextflow [51] pipeline implementation for multiomics analysis, which relies solely on simple correlation analysis with clinical data for MOFA interpretation. Moreover, X-omics ACTION was designed to integrate only metabolomics and methylomics data, without carrying out the statistics and biological interpretation of each of them. This is a limitation given the wide usage of transcriptomics data in integration [49]. The MixOmics project was, and still is, a milestone in multi-omics integration; however, it was not designed to perform single omics data analyses separately. Its supervised integration approach (DIABLO) for multi-omics data from the same patient, as previously mentioned, requires imputation, which could introduce bias into the data despite the impressive NIPALS methods developed. Other tools that do not integrate omics data but focus on single omics analysis are iDEP [52] and MetaboAnalyst [13], for transcriptomics and metabolomics, respectively. These tools provide a wide range of downstream analyses such as heatmaps, PCA, pathways, and network analysis to facilitate data exploration. Similar functionalities are provided in BiomiX. BiomiX takes advantage of existing single-omics tools by generating inputs for their parallel use alongside BiomiX. Furthermore, it extends beyond single-omics analysis by enabling the integration of multi-omics data. Compared with bioinformatics tools specifically designed for either single-omics analysis or multi-omics integration (Table 1), BiomiX bridges the gap between the two by offering both in a single tool.

Table 1 Comparison of BiomiX with single omics and multi omics integration tools

Case studies. BiomiX has been applied to two example datasets composed of multiomics data. The first case study included transcriptomics and metabolomics analysis on PTB [53]. Compared to controls, the BiomiX transcriptomics analysis showed the same biological pathways as in the original manuscript, including Arachidonic acid metabolic process (GO:0019369) and Classical antibody-mediated complement activation (R-HSA-173623), but also novel ones such as Hydrolysis of LPC (R-HSA-1483115) and B cell receptor signaling pathway (GO:0050853). The pathways were consistent in both the Gene Ontology and Reactome databases. Furthermore, to assess differences in patients’ immune responses, BiomiX enabled patients to be subgrouped according to a gene panel containing 26 IFN-induced genes. All samples having at least one IFN gene with a Z-score > 1 were considered positive, providing a clear separation (Figure S1). At first, all the PTB transcriptomes were compared with the HC ones (Figure S2A and B). The IFN-negative subgroup had differentially expressed genes enriched in Nitric oxide biosynthetic process (GO:0006809) in response to infection [54], but also in Arachidonic acid metabolic process (R-HSA-2142753), suggesting an IFN-independent activation. The IFN-positive subgroup was enriched in IFN signaling (R-HSA-877300) and B cell receptor signaling pathway (GO:0050853), with a reduction of IL-10 production (R-HSA-6783783), as expected. The metabolomics analysis on plasma revealed reductions in acylcarnitine, PC, LysoPE and TG, with significant enrichment for the sphingolipid signaling pathway, retrograde endocannabinoid signaling, and caffeine, purine, linoleic and glycerophospholipid metabolism, as in the original manuscript, but also novel ones such as phenylalanine metabolism and Fc gamma R-mediated phagocytosis (Figure S3A and B). Data integration was then carried out by the BiomiX implementation of MOFA, which consists of two steps. First, MOFA models were computed iteratively with an increasing total number of factors, stopping when at least three models had a last factor explaining less than 1% of the variance. Next, the three best-performing models in separating the two conditions by the Wilcoxon test were selected, based on the number of discriminant MOFA factors and the adjusted p-values (p.adj). Here, the MOFA models with 3, 4 and 5 factors were selected, with the first offering the best separation between PTB and HC and therefore being used for the following analysis. The three-factor MOFA model explained 5.05% and 45.77% of the total metabolomics and transcriptomics variance, respectively. Furthermore, only factor 1 significantly discriminated the two conditions (p.adj = 0.0011, sd = 0.08) (Fig. 7A, B), capturing 2.23% and 33.22% of the total metabolomics and transcriptomics variance, respectively, but its identity remained to be determined. BiomiX provided this interpretation of factor 1 through its bibliography search, which, by evaluating the transcriptomic contributors to factor 1, spotted articles related to inflammation-driven genes, bacterial inflammation and, even more specifically, Mycobacterium tuberculosis infections. The metabolomic contributors also suggested two articles related to bacteria, but these were less specific for inflammation and less informative. Pathway analyses of the positive contributors of factor 1 confirmed inflammation and interferon signaling as the main biological processes (R-HSA-913531, R-HSA-1280215, R-HSA-1169410), while only trilostane, an inhibitor of corticoid production, was identified among the positive metabolomics contributors. Consistently, the negative transcriptomics contributors were enriched in Interleukin-10 signaling and immunoregulatory pathways (R-HSA-6783783, R-HSA-198933), while 11-deoxycortisol, PC, sphingolipids and cholesterol were the main negative metabolomics contributors, enriched in sphingolipid metabolism and alpha linolenic acid and linoleic acid metabolism (p-value < 0.05). The inflammation-induced lipid alteration known in the literature [55] and the negative contribution of anti-inflammatory factors (IL-10 and corticoids) support the inflammation-interferon identity of factor 1. The results obtained from BiomiX are available in Supplementary Table 1.

Fig. 7

Whole blood transcriptomics and plasma metabolomics analysis from tuberculosis patients using BiomiX. A Violin plot showing the distribution of patients affected by tuberculosis (PTB; red) and control samples (HC; blue) based on each MOFA factor’s value. B Heatmap displaying the top 20 whole-blood genes contributing to Factor 1. PTB patients (red squares) and HC (blue squares) are shown on the top. Heatmap distance: “Euclidean”, clustering method “complete”. Gene expression was normalised using the variance stabilising transformation (VST) method

The second case studied the transcriptomics and methylomics differences in CLL patients with mutated and unmutated immunoglobulin heavy variable (IGHV) genes [56]. This dataset has been used for testing the MOFA algorithm. The data presented here will thus mainly focus on determining the identity behind the factors, only briefly citing the transcriptomics and methylomics single omics results. The transcriptomics (Figure S4A and B) and methylomics (Figure S5A and B) heatmaps and volcano plots successfully highlighted the genes related to CLL differences mentioned in the article (KANK2, DGKH, MYLK, PPP1R9A, SEPTIN10, SOWAHC, PLD1, and LPL). Moreover, the pathway analysis identified upregulation in proliferation (GO:0090267) and VDJ recombination (GO:0033152, GO:0033151) pathways in methylomics. The MOFA implementation identified the models with 8, 9 and 10 factors as the best performing for identifying differences between the two conditions. The ten-factor model was the best, identifying five factors (factors 1, 3, 4, 5 and 6) discriminating mutated and unmutated IGHV CLLs (Fig. 8A). Factor 1 was the most discriminating among them, and its analysis (Fig. 8B) identified articles related to DNA damage, ageing and cancer. Correlation analysis identified a significant correlation with 11q22.3 deletion, known to induce proliferation in CLL by ZAP-70. These results confirmed the proliferation and oxidative processes already identified by the MOFA article. Factor 3 and factor 4 were correlated with chromosome 12 trisomy, while the bibliography search identified articles related to B leukaemia, DNA damage, and autoimmunity, specifically for factor 3. Factor 4 was enriched in focal adhesion and membrane-ECM interactions and signaling (R-HSA-3000171, R-HSA-8874081), consistent with the role of trisomy 12 in activating adhesion signaling [57]. Factor 5 was correlated with gender differences in CLL (p.adj = 1.78e−26), and the bibliography search retrieved articles related to sex differences in human-primates [58], mice sex development [59] and autoimmune encephalomyelitis [60]. Furthermore, it is the only factor not correlated with any treatment response. Interestingly, the pathway analysis highlighted pathways linked to galactosyltransferase activity (R-HSA-4420332, R-HSA-3560801, R-HSA-3560783), suggesting a sex difference in CLL not targeted by any drug. Finally, factor 6 was associated with mutations (TP53, SF3B1) and a deletion (del17p13) associated with a poor outcome and unfavorable prognostic factors [61, 62]. The pathway analysis supported this hypothesis, as contributors were enriched for cytokine signaling such as interleukin 4, 13, 27 (GO:0070106, R-HSA-6785807) and type 1 interferon (GO:0032481), of which interleukin 4 and IFN-alpha are already known to be linked to a more severe condition [63]. The results obtained from BiomiX are available in Supplementary Table 2.

Fig. 8

Whole blood transcriptomics and methylomics analysis from chronic lymphocytic leukemia patients using BiomiX. A Violin plot representing the distribution of patients affected by chronic lymphocytic leukemia (CLL) with mutated (red) and unmutated (blue) IGHV based on each MOFA factor’s value. B Heatmaps showing the top 20 genes contributing to Factor 1 in the whole-blood transcriptome (left) and the whole-blood methylome (right). IGHV unmutated CLL (blue squares) and IGHV mutated CLL (red squares) are shown on the top of the heatmaps. Heatmap distance: “Euclidean”, clustering method “complete”. The gene expression was normalised using the variance stabilising transformation (VST) method

Discussion

BiomiX aims to help biologists, physicians, and scientists with no background in bioinformatics. Some tools, such as MixOmics, allow for data integration, while others perform single-omics analysis, but none of them allow for both simultaneously [28]. Currently, to our knowledge, the only way to compare two groups with an analysis similar to that of BiomiX is the Nextflow association of a single omics pipeline (metaboigniter [64] or other transcriptomics and methylomics modules) with integration tools such as X-omics ACTION. However, this pipeline does not provide intuitive access to MOFA integration, as MixOmics does with DIABLO, and does not contain implementations such as the optimization of the number of MOFA factors. Moreover, while X-omics ACTION only performs correlation analysis to annotate MOFA factors, BiomiX relies on three methods based on different approaches that converge to reveal the factor's identity, as shown in the two case studies. BiomiX is thus more precise in providing factor annotation and is the first tool to exploit a PubMed bibliography search to annotate hidden factors. It is also worth mentioning that although Nextflow has developed a user interface, most changes to the pipeline require coding skills that BiomiX does not require. BiomiX is not a definitive solution; it is a first attempt to combine the gold standard single-omics pipelines in bioinformatics with tuned and accessible integration using MOFA. BiomiX guarantees high interpretability of the common source of variation among omics, providing users with results from both single-omics analysis and multi-omics integration, with the prospect of application to single-cell data thanks to the MOFA method [18]. It allows for a complete overview of the changes occurring in biological systems, whether in the context of disease, treatment, or physiological conditions, enhancing the interpretability of the biological pathways and processes involved. In addition to improving MOFA factor interpretability and tuning the total number of factors in the model, BiomiX simplifies interactive data visualization through a Shiny interface, enabling users to track changes before and after data transformation, as well as to remove outliers and highly variable features. A separate interface addresses missing values, offering various imputation options and controls for problematic samples or variables. With flexible parameter settings, users have full control over their analysis. We strongly recommend that users follow the guidelines and carefully review the dataset before using BiomiX. BiomiX also provides output formats ready to copy and paste into widely used, user-friendly specialized websites or programs, such as GSEA, EnrichR, and MetaboAnalyst. Nevertheless, BiomiX has limitations. Although it can analyse multiple groups simultaneously, it offers analysis modules for only a limited number of omics types. Consequently, much work remains to be done to implement functionalities based on community needs and to include more integration methods and omics data, such as proteomics and genomics data from different technologies.

Conclusions

The current work aims to grow the community of users and developers in order to improve the accessibility of new bioinformatics algorithms and methods within the scientific community. BiomiX is an attempt to improve accessibility to bioinformatics tools by offering everyone access to bioinformatics methods and algorithms, with a focus on making multi-omics integration methods accessible, so that specialists in a wide range of fields can benefit from the Big Data revolution.

Availability of data and materials

The datasets generated and/or analysed during the current study are available from ENA (https://www.ebi.ac.uk/ena/browser/home) with the project code PRJNA971365 and from http://pace.embl.de/. Project name: BiomiX; Project home page: https://github.com/IxI-97/BiomiX2.2 (Github), https://ixi-97.github.io (website); Operating system(s): Windows, Linux and Mac OS; Programming language: BiomiX is implemented in R for the analysis and in Python for the user interface; Other requirements: Miniconda 24.4.0 or higher is required for the installation; License: GNU GPL; Any restrictions to use by non-academics: None.

Abbreviations

CE-MS:

Capillary Electrophoresis coupled to Mass Spectrometry

ChEA:

ChIP-X Enrichment Analysis

CLL:

Chronic Lymphocytic Leukemia

DDA:

Data Dependent Acquisition

DGE:

Differential Gene Expression

DIABLO:

Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies

ENCODE:

Encyclopedia of DNA elements

FDR:

False Discovery Rate

GC-MS:

Gas Chromatography coupled to Mass Spectrometry

GSEA:

Gene Set Enrichment Analysis

HC:

Healthy Controls

HMDB:

Human Metabolome Database

HRMS:

High resolution mass spectrometry

IGHV:

Immunoglobulin Heavy Variable

KEGG:

Kyoto Encyclopedia of Genes and Genomes

LC-MS:

Liquid Chromatography Mass Spectrometry

MAD:

Median absolute deviation

MAFs:

Metabolite Annotation/Assignment Files

MOFA:

Multi-Omics Factor Analysis

MoNA:

Mass Bank of North America

m/z:

Mass-to-charge ratio

NEMO:

NEighborhood based Multi-Omics clustering

NMR:

Nuclear Magnetic Resonance

PCA:

Principal component analysis

PTB:

Patient with Tuberculosis

PARADIGM:

PAthway Recognition Algorithm using Data Integration on Genomic Models

SNF:

Similarity Network Fusion

UMAP:

Uniform Manifold Approximation and Projection

VST:

Variance Stabilizing Transformation

References

  1. Barturen G, Babaei S, Català-Moll F, et al. Integrative analysis reveals a molecular stratification of systemic autoimmune diseases. Arthritis Rheumatol Hoboken NJ. 2021;73:1073–85.

  2. Fernández-Ochoa Á, Brunius C, Borrás-Linares I, et al. Metabolic disturbances in urinary and plasma samples from seven different systemic autoimmune diseases detected by HPLC-ESI-QTOF-MS. J Proteome Res. 2020;19:3220–9.

  3. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.

  4. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.

  5. Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43: e47.

  6. Tian Y, Morris TJ, Webster AP, et al. ChAMP: updated methylation analysis pipeline for Illumina BeadChips. Bioinformatics. 2017;33:3982–4.

  7. Wang D, Yan L, Hu Q, et al. IMA: an R package for high-throughput analysis of Illumina’s 450K Infinium methylation data. Bioinformatics. 2012;28:729–30.

  8. Perez de Souza L, Fernie AR. Computational methods for processing and interpreting mass spectrometry-based metabolomics. Essays Biochem. 2024;68(1):5–13.

  9. Smith CA, Want EJ, O’Maille G, et al. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem. 2006;78(3):779–87.

  10. Brunius C, Shi L, Landberg R. Large-scale untargeted LC-MS metabolomics data correction using between-batch feature alignment and cluster-based within-batch signal intensity drift correction. Metabolomics. 2016;12(11):173.

  11. Klåvus A, Kokla M, Noerman S, Koistinen VM, et al. “notame”: workflow for non-targeted LC-MS metabolic profiling. Metabolites. 2020;10(4):135.

  12. Shen X, Wu S, Liang L, et al. metID: an R package for automatable compound annotation for LC−MS-based data. Bioinformatics. 2022;38:568–9.

  13. Pang Z, Zhou G, Ewald J, et al. Using MetaboAnalyst 5.0 for LC–HRMS spectra processing, multi-omics integration and covariate adjustment of global metabolomics data. Nat Protoc. 2022;17:1735–61.

  14. Schmid R, Heuckeroth S, Korf A, et al. Integrative analysis of multimodal mass spectrometry data in MZmine 3. Nat Biotechnol. 2023;41(4):447–9.

  15. Tsugawa H, Cajka T, Kind T, et al. MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis. Nat Methods. 2015;12(6):523–6.

  16. Dührkop K, Fleischauer M, Ludwig M, et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods. 2019;16(4):299–302.

  17. Picard M, Scott-Boyer MP, Bodein A, et al. Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J. 2021;22(19):3735–46.

  18. Argelaguet R, Arnol D, Bredikhin D, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21(1):111.

  19. Singh A, Shannon CP, Gautier B, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019;35:3055–62.

  20. Subramanian I, Verma S, Kumar S, et al. Multi-omics data integration, interpretation, and its application. Bioinforma Biol Insights. 2020;14:1177932219899051.

  21. Vahabi N, Michailidis G. Unsupervised multi-omics data integration methods: a comprehensive review. Front. Genet. 2022; 13.

  22. Mo Q, Wang S, Seshan VE, et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci U S A. 2013;110:4245–50.

  23. Vaske CJ, Benz SC, Sanborn JZ, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics. 2010;26:i237–45.

  24. Rappoport N, Shamir R. NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics. 2019;35:3348–56.

  25. Wang B, Mezlini AM, Demir F, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11:333–7.

  26. Perez-Riverol Y, Bai J, Bandla C, et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 2022;50:D543–52.

  27. Yurekten O, Payne T, Tejera N, et al. MetaboLights: open data repository for metabolomics. Nucleic Acids Res. 2024;52:D640–6.

  28. Rohart F, Gautier B, Singh A, et al. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLOS Comput Biol. 2017;13:e1005752.

  29. Theodoridis G, Gika H, Raftery D, et al. Ensuring fact-based metabolite identification in liquid chromatography-mass spectrometry-based metabolomics. Anal Chem. 2023;95(8):3909–16.

  30. Gil-de-la-Fuente A, Godzien J, Saugar S, et al. CEU mass mediator 3.0: a metabolite annotation tool. J Proteome Res. 2019;18(2):797–802.

  31. Pezzatti J, Boccard J, Codesido S, et al. Implementation of liquid chromatography-high resolution mass spectrometry methods for untargeted metabolomic analyses of biological samples: a tutorial. Anal Chim Acta. 2020;8(1105):28–44.

  32. Anders S, Pyl PT, Huber W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166–9.

  33. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–30.

  34. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc. 1995;57(1):289–300.

  35. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102:15545–50.

  36. Chen EY, Tan CM, Kou Y, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14:128.

  37. Kirou KA, Lee C, George S, et al. Activation of the interferon-alpha pathway identifies a subgroup of systemic lupus erythematosus patients with distinct serologic features and active disease. Arthritis Rheum. 2005;52:1491–503.

  38. Panwar B, Schmiedel BJ, Liang S, et al. Multi–cell type gene coexpression network analysis reveals coordinated interferon response and cross–cell type correlations in systemic lupus erythematosus. Genome Res. 2021.

  39. Libiseller G, Dvorzak M, Kleb U, et al. IPO: a tool for automated optimization of XCMS parameters. BMC Bioinform. 2015;16:118.

  40. Broeckling CD, Afsar FA, Neumann S, et al. RAMClust: a novel feature clustering method enables spectral-matching-based annotation for metabolomics data. Anal Chem. 2014;86:6812–7.

  41. Shen X, Yan H, Wang C, et al. TidyMass an object-oriented reproducible analysis framework for LC–MS data. Nat Commun. 2022;13:4365.

  42. Fernández-Ochoa Á, Quirantes-Piné R, Borrás-Linares I, et al. A case report of switching from specific vendor-based to R-based pipelines for untargeted LC-MS metabolomics. Metabolites. 2020;10:28.

  43. CMMR - CEU Mass Mediator API in R. 2019 (https://github.com/YaoxiangLi/cmmr).

  44. Wishart DS, Tzur D, Knox C, et al. HMDB: the human metabolome database. Nucleic Acids Res. 2007;35:D521–6.

  45. Aryee MJ, Jaffe AE, Corrada-Bravo H, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–9.

  46. Luo Y, Hitz BC, Gabdank I, et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020;48:D882–9.

  47. Lachmann A, Xu H, Krishnan J, et al. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics. 2010;26:2438–44.

  48. Grames EM, Stillman AN, Tingley MW, et al. An automated approach to identifying search terms for systematic reviews using keyword co-occurrence networks. Methods Ecol Evol. 2019;10:1645–54.

  49. Athieniti E, Spyrou GM. A guide to multi-omics data collection and integration for translational medicine. Comput Struct Biotechnol J. 2022;1(21):134–49.

  50. Niehues A, de Visser C, Hagenbeek FA, et al. A multi-omics data analysis workflow packaged as a FAIR Digital Object. GigaScience. 2024;13:giad115.

  51. Di Tommaso P, Chatzou M, Floden EW, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35:316–9.

  52. Ge SX, Son EW, Yao R. iDEP: an integrated web application for differential expression and pathway analysis of RNA-Seq data. BMC Bioinform. 2018;19:534.

  53. Wang Y, He X, Zheng D, et al. Integration of metabolomics and transcriptomics reveals major metabolic pathways and potential biomarkers involved in pulmonary tuberculosis and pulmonary tuberculosis-complicated diabetes. Microbiol Spectr. 2023;11:e00577-e623. https://doi.org/10.1128/spectrum.00577-23.

  54. Schairer DO, Chouake JS, Nosanchuk JD, et al. The potential of nitric oxide releasing therapies as antimicrobial agents. Virulence. 2012;3:271–9.

  55. Caterino M, Gelzo M, Sol S, et al. Dysregulation of lipid metabolism and pathological inflammation in patients with COVID-19. Sci Rep. 2021;11:2941.

  56. Dietrich S, Oleś M, Lu J, et al. Drug-perturbation-based stratification of blood cancer. J Clin Invest. 2018;128:427–45. https://doi.org/10.1172/JCI93801.

  57. Riches JC, O’Donovan CJ, Kingdon SJ, et al. Trisomy 12 chronic lymphocytic leukemia cells exhibit upregulation of integrin signaling that is modulated by NOTCH1 mutations. Blood. 2014;123:4101–10.

  58. Shi X, Facemire L, Singh S, et al. UBA1-CDK16: a sex-specific chimeric RNA and its role in immune sexual dimorphism. bioRxiv. 2024;2024.02.13.580120.

  59. Rock KD, Folts LM, Zierden HC, et al. Developmental transcriptomic patterns can be altered by transgenic overexpression of Uty. Sci Rep. 2023;13:21082.

  60. Fazazi MR, Ruda GF, Brennan PE, et al. The X-linked histone demethylases KDM5C and KDM6A as regulators of T cell-driven autoimmunity in the central nervous system. Brain Res Bull. 2023;202:110748.

  61. Rossi D, Cerri M, Deambrogi C, et al. The prognostic value of TP53 mutations in chronic lymphocytic leukemia is independent of Del17p13: implications for overall survival and chemorefractoriness. Clin Cancer Res. 2009;15:995–1004.

  62. Wan Y, Wu CJ. SF3B1 mutations in chronic lymphocytic leukemia. Blood. 2013;121:4627–34.

  63. Yan X-J, Dozmorov I, Li W, et al. Identification of outcome-correlated cytokine clusters in chronic lymphocytic leukemia. Blood. 2011;118:5201–10.

  64. Ewels PA, Peltzer A, Fillinger S, et al. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020;38:276–8.

Acknowledgements

We acknowledge the 3TR and PRECISESADS consortia, which provided the data used in this article and allowed us to analyse them.

Funding

This work was supported by the Innovative Medicines Initiative Joint Undertaking under Grant Agreement Number 115565, the resources of which are composed of financial contributions from the European Union’s Seventh Framework Program (FP7/2007–2013) and in-kind contributions from EFPIA companies. CI was funded by the Université de Brest and the Région Bretagne. AFO acknowledges the funding received from “Ayudas al funcionamiento de los Grupos operativos de la Asociación Europea para la Innovación (AEI) en materia de productividad y sostenibilidad agrícolas en el sector del olivar, 2020” (Grant Number GOPO-GR-20–0001).

Author information

Contributions

C.I. was in charge of writing and planning the bioinformatic approaches. C.J. and A.B. contributed equally to supervising the work and ensuring the scientific relevance of the article, while the other authors evaluated, revised, and approved the article.

Corresponding author

Correspondence to Christophe Jamin.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Iperi, C., Fernández-Ochoa, Á., Barturen, G. et al. BiomiX, a user-friendly bioinformatic tool for democratized analysis and integration of multiomics data. BMC Bioinformatics 26, 8 (2025). https://doi.org/10.1186/s12859-024-06022-y

  • DOI: https://doi.org/10.1186/s12859-024-06022-y

Keywords