BiomiX, a user-friendly bioinformatic tool for democratized analysis and integration of multiomics data

Iperi, Cristian; Fernández-Ochoa, Álvaro; Barturen, Guillermo; Pers, Jacques-Olivier; Foulquier, Nathan; Bettacchioli, Eleonore; Alarcón-Riquelme, Marta; Cornec, Divi; Bordron, Anne; Jamin, Christophe

doi:10.1186/s12859-024-06022-y

Software
Open access
Published: 10 January 2025

Cristian Iperi¹,
Álvaro Fernández-Ochoa²,
Guillermo Barturen^3,4,
Jacques-Olivier Pers⁵,
Nathan Foulquier⁵,
Eleonore Bettacchioli⁵,
Marta Alarcón-Riquelme^3,6,
PRECISESADS Flow Cytometry Study Group, PRECISESADS Clinical Consortium,
Divi Cornec⁵,
Anne Bordron¹^na1 &
…
Christophe Jamin ORCID: orcid.org/0000-0002-9494-3223⁵^na1

BMC Bioinformatics volume 26, Article number: 8 (2025)

Abstract

Background

Understanding changes in biological systems requires interpreting vast amounts of multi-omics data. While user-friendly tools exist for single-omics analysis, integrating multiple omics still requires bioinformatics expertise, limiting accessibility for the broader scientific community.

Results

BiomiX tackles the bottleneck in high-throughput omics data analysis, enabling efficient and integrated analysis of multiomics data obtained from two cohorts. BiomiX incorporates diverse omics data, using the DESeq2/Limma packages for transcriptomics and quantifying metabolomics peak differences, which are evaluated via the Wilcoxon test with False Discovery Rate correction. Annotation of untargeted Liquid Chromatography-Mass Spectrometry metabolomics is additionally supported, using the mass-to-charge ratio in the CEU Mass Mediator database and fragmentation spectra in the TidyMass package. Methylomics analysis is performed using the ChAMP R package. Finally, Multi-Omics Factor Analysis (MOFA) integration identifies shared sources of variation across omics data. BiomiX also generates statistics and report figures, and integrates EnrichR and GSEA for biological process exploration, as well as subgroup analysis based on user-defined gene panels to enhance condition subtyping. BiomiX fine-tunes MOFA models to optimize the number of factors, distinguishes between cohorts, and provides tools to interpret discriminative MOFA factors. The interpretation relies on an innovative bibliography search on PubMed, which retrieves the articles most related to the contributors of each discriminant factor. Furthermore, discriminant MOFA factors are correlated with clinical data, and the top contributing pathways are explored, all with the aim of guiding the user in factor interpretation.

Conclusions

The analysis of single-omics and multi-omics integration in a standalone tool, along with MOFA implementation and its interpretability via literature, represents significant progress in the multi-omics field in line with the “Findable, Accessible, Interoperable, and Reusable” data principles. BiomiX offers a wide range of parameters and interactive data visualization, allowing for personalized analysis tailored to user needs. This R-based, user-friendly tool is compatible with multiple operating systems and aims to make multi-omics analysis accessible to non-experts in bioinformatics.

Background

The rise of high-throughput technologies has enabled the generation of vast amounts of data on multiple levels of biological organization, as observed in the autoimmunity field with the European PRECISESADS database [1, 2], which collected multiomics data from more than 2000 individuals, comprising patients with seven autoimmune diseases and controls. This revolution brought new tools for analyzing single omics with high efficiency. The most common packages are DESeq2 [3], edgeR [4], and Limma [5] for transcriptomics RNA sequencing, while ChAMP [6] and IMA [7] are R packages for methylomics analysis. For metabolomics, given the complexity of the workflows, particularly untargeted approaches, various tools have been developed to address the different stages of the process (e.g. peak deconvolution, alignment, normalization, data curation, statistical analyses, peak annotation, etc.) [8]. Numerous tools cover one or several workflow stages and are available both as R packages (e.g. XCMS [9], batchCorr [10], notame [11], MetID [12], etc.) and as user-friendly software platforms (Metaboanalyst [13], mzMine [14], MS-DIAL [15], Sirius [16], etc.). This wide range of tools reflects the rapid advancements in metabolomics, offering researchers robust options to handle every step of the workflow.

Similarly, the integration of metabolomics with other omics has become increasingly feasible and appealing over the past decade, fueling the current revolution in multi-omics integration. State-of-the-art approaches for multi-omics integration fall into early, middle and late families of methods. Late integration focuses on identifying overlapping significant results across different omics layers, while early integration involves concatenating and imputing missing data before analyzing the unified multi-omics matrix. However, early integration does not account for the distinct data distributions of the various omics, unlike middle methods, which address this by transforming and processing omics data according to their specific distributions [17]. This advantage has made middle methods the most widely used and versatile integration approaches. Various algorithms belong to this family, including matrix factorization–regression and association methods such as Multi-Omics Factor Analysis (MOFA) [18], Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies (DIABLO) [19], other matrix factorization methods [20, 21], IclusterPlus [22], and network analysis. Network approaches include Bayesian networks such as PAthway Recognition Algorithm using Data Integration on Genomic Models (PARADIGM) [23] and matrix factorization-based methods such as NEighborhood based Multi-Omics clustering (NEMO) [24] and Similarity Network Fusion (SNF) [25]. However, each available tool was developed to solve a specific task, such as disease subtyping, disease insight, or biomarker prediction. These tools require expertise in coding and bioinformatics, making them difficult to access for specialized biologists and clinicians who do not have coding skills.

The shift of biological research towards a multi-omics approach is supported by the availability of databases that provide cross-analysis of multi-omics data, such as the Cancer Genome Atlas (https://cancergenome.nih.gov/) and the Omics Discovery Index (https://www.omicsdi.org). Alternatively, multi-omics studies can be found by consulting single-omics repositories, including Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) for transcriptomics and methylation, Proteomics Identification Database PRIDE (https://www.ebi.ac.uk/pride/) for proteomics [26], and MetaboLights (https://www.ebi.ac.uk/metabolights/) for metabolomics data [27]. Bioinformaticians and data scientists must provide access to novel multi-omics integration resources and tools, especially to specialists in fields such as biology and clinical research. This is both an ethical and pragmatic necessity, as these experts are best equipped to fully understand pathological or biological alterations, such as those seen in diseases. Only a few tools, like MixOmics, have attempted to democratize these omics methods [28]. Unfortunately, the integration tools available to those without bioinformatics expertise remain limited, and comprehensive bioinformatics toolkits often require chaining multiple tools even for single-omics analysis.

This need has driven the development of BiomiX, a solution that eases access for users without bioinformatics expertise and is our contribution to democratizing multi-omics integration methods for the scientific community. To our knowledge, BiomiX is the first bioinformatics tool to include both single-omics analyses and their multiomics integration. BiomiX utilizes MOFA, a middle integration method that offers a more intuitive interpretation than other integration approaches (e.g. early and late). MOFA stands out by selecting relevant factors through regularization, capturing variability across omics, and identifying key contributing variables. The MOFA implementation in BiomiX allows tuning of the total number of factors and identification of the biological processes behind the factors of interest through clinical data correlation and pathway analysis. BiomiX implements, for the first time, factor identification through a bibliography search on PubMed, underlining the importance of integrating literature knowledge in the interpretation of MOFA factors. BiomiX also provides robust, validated pipelines for single omics with additional functions, such as sample subgrouping analysis, gene ontology, annotation, and summary figures. The graphical user interface of BiomiX ensures user-friendliness and flexibility and handles transcriptomics, metabolomics and methylomics data, unlabelled data, and their integration. All this provides a wide choice of parameters and interactive data visualization, supported by tutorials on an instructive website.

Implementation

BiomiX graphics, parameters and data manipulation

Graphical user interface and R environment

The BiomiX interface was developed using the Python toolkit PyQt5. It allows the choice of the analysis and desired parameters for each omics (transcriptomics, metabolomics, and methylomics) to provide the output results and prepare the data for the integration analysis. The global script of the BiomiX system is shown in Fig. 1. The interface is available on all major operating systems (Windows, Linux, and macOS). The download and tutorial are available on the following BiomiX GitHub pages, respectively: https://github.com/IxI-97/BiomiX and https://ixi-97.github.io. Installation takes place in a conda environment.

BiomiX interface parameters

BiomiX aims to provide a simple, intuitive user interface, as shown in Fig. 2. The program launcher prompts users to upload a metadata file containing samples for analysis in the omics databases. It then generates the main interface, which allows users to select a detected group as a control and a condition/disease group for analysis. The interface displays all groups available in the provided database. It consists of six rows, each representing a slot for omics data, and multiple columns that help users define the input and the analysis to be performed. Users are prompted to specify whether the data should be analysed or integrated, the type of omics data, and a label to name the output folder. This label can also be used to generate a regex for filtering samples by sample name. Single omics analysis and integration are independent processes, so neither needs to be completed before the other.

An input button allows users to upload the matrix file to BiomiX. They are then asked if they wish to modify the matrix format. If so, an assisted format converter guides them through the conversion process to the BiomiX format. Transcriptomics, metabolomics, methylomics and undefined data can be added, analysed, and integrated in any combination. In this manuscript, the term "analysis" in single omics refers to the statistical comparison of variables between two groups of interest. Specifically, this includes differential gene expression (DGE) analysis for transcriptomics, differential metabolite abundance analysis for metabolomics, differential methylation analysis for methylomics, and t-tests or Wilcoxon tests for undefined omics. Once the databases for integration are selected, the parameters for MOFA integration and advanced options can be set in the lower section of the interface. Users can define an arbitrary number of MOFA factors or choose an automatic tuning option to determine the optimal number of factors in the MOFA model. One factor can be selected from the interface as the focus of the final report, which displays omics contributions, clustering and heatmaps. MOFA supports the integration of samples not shared across all omics datasets, though this can introduce bias if it applies to the majority of the data. To address this, BiomiX includes a parameter that allows users to filter the samples in the integration analysis based on a minimum number of shared omics.
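The shared-omics filter described above can be illustrated with a short Python sketch. It is not BiomiX code: the function name `filter_by_shared_omics`, the dictionary layout and the sample IDs are hypothetical, and the sketch only assumes that each omics layer comes with a list of sample identifiers.

```python
from collections import Counter
from typing import Dict, Iterable, List


def filter_by_shared_omics(
    omics_samples: Dict[str, Iterable[str]], min_shared: int
) -> List[str]:
    """Return the sample IDs measured in at least `min_shared` omics layers."""
    # Count in how many omics layers each sample appears (duplicates within
    # one layer are counted once).
    counts = Counter(
        sample for samples in omics_samples.values() for sample in set(samples)
    )
    return sorted(sample for sample, n in counts.items() if n >= min_shared)


# Hypothetical cohort: only P2 and P3 are present in two or more layers.
cohort = {
    "transcriptomics": ["P1", "P2", "P3"],
    "metabolomics": ["P2", "P3", "P4"],
    "methylomics": ["P3", "P5"],
}
kept = filter_by_shared_omics(cohort, min_shared=2)
print(kept)
```

Raising `min_shared` trades sample size for completeness: fewer samples enter the integration, but fewer values are missing across views.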

Advanced options enable deeper customization of the analysis and are divided into five sections. The first is the “general” section, which includes the Log2FC and adjusted p-value thresholds, CPU usage, the number of input variables for MOFA, the gene panel, and the criteria for panel positivity.

The second section focuses on metabolomics, allowing users to select the type of metabolomics annotation and to configure settings related to metabolite identification. On the one hand, there is an option for targeted metabolomics or for untargeted datasets where peaks have been previously annotated using external resources. This option supports metabolomics data obtained from any analytical platform such as Liquid Chromatography Mass Spectrometry (LC–MS), Gas Chromatography coupled to Mass Spectrometry (GC–MS), Capillary Electrophoresis coupled to Mass Spectrometry (CE-MS) or Nuclear Magnetic Resonance (NMR), where the metabolites' biological identities are available. The difference between targeted and untargeted metabolomics lies in the precise quantification of predefined metabolites for the former, whereas the latter provides a broad profile of all detectable metabolites in a sample. Furthermore, since high resolution mass spectrometry (HRMS) is the most widely used platform in untargeted metabolomics and annotation is a workflow bottleneck [29,30,31], BiomiX offers annotation at both the MS1 and MS2 levels. HRMS generally refers to techniques providing the highest precision in measuring molecules' mass-to-charge ratio (m/z). Users can upload MS1 files containing the mass-to-charge ratio (m/z) and directories containing the .mzML or .mgf files for Data Dependent Acquisition (DDA)-MS2 annotation. Data Dependent Acquisition consists of collecting data from metabolites fragmented within a specified mass range in tandem mass spectrometry. Users can also prioritize metabolomics databases, such as Human Metabolome Database (HMDB), Kyoto Encyclopedia of Genes and Genomes (KEGG), LipidMap, Metlin, MassBank, and Mass Bank of North America (MoNA).

The third section allows users to filter samples based on the provided metadata, either by a threshold or by a group within a selected metadata column (e.g. cell purity, ethnicity and proteinemia). The fourth section customizes the MOFA analysis by adjusting the model, iteration settings and contribution weight thresholds. It also affects MOFA interpretation by setting the number of articles considered in the bibliography search, the type of clinical data available, and the p-value threshold in pathway analysis. The final section allows users to save the selected parameters.

The BiomiX-assisted format converter and BiomiX toolkit

The BiomiX-assisted format converter is a simple functionality that allows users to modify a matrix directly in the BiomiX interface. It can perform transposition, remove columns or rows, and identify the features column to facilitate the conversion of any data table to the BiomiX format. The BiomiX toolkit also allows users to manipulate the matrix before uploading it. Specifically, it supports imputation methods such as random forest, lasso, and NIPALS (mixOmics) [28], or simply replacing missing values with 0 or the mean/median of the variable. Additionally, variable or sample filtering can be applied based on a user-defined threshold for missing values.

Preview-QC visualization

To ensure a well-informed use of the uploaded data for both single-omics and integration analyses, BiomiX opens a Shiny interface. In the first tab, the data are pre-explored through summary figures that visualize normalization status and the expression of key variables, together with Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and correlation heatmap views. Users can visualize QC samples and their loading order from metabolomics data to assess matrix quality and detect batch effects using PCA and UMAP. Users can apply different data normalization methods, such as Z-Score, Median Absolute Deviation (MAD), Quantile Normalization, Loess Normalization, and Variance Stabilizing Transformation (VST), or transform the data by Logarithmic transformation, Median Centralization, or Mean Centralization. A comprehensive guideline within the Shiny interface and on the website helps users choose the most appropriate method. Once the data are transformed, the figures are regenerated to show how the modifications affect the dataset. The second tab displays the data matrix, while the third tab allows users to remove a percentile of features with the highest variance and outlier samples, based on the squared sum of the distance of samples from the centroid in the principal components. The p-value corresponds to the quantile of the empirical distance distribution (e.g., a p-value threshold of 0.05 corresponds to the 95th percentile of distances). Finally, the fourth tab enables users to download the normalized or transformed data according to BiomiX format requirements.
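The outlier rule in the third tab can be read as an empirical-quantile test on squared distances from the centroid. The Python sketch below shows one possible interpretation, assuming the principal component scores have already been computed; the function name `flag_outliers` and the toy coordinates are hypothetical, and the way BiomiX handles ties or small cohorts may differ.

```python
from typing import List, Sequence


def flag_outliers(
    pc_scores: Sequence[Sequence[float]], p_threshold: float = 0.05
) -> List[bool]:
    """Flag samples whose squared distance to the centroid is extreme.

    The empirical p-value of a sample is the fraction of samples lying at
    least as far from the centroid as that sample.
    """
    n = len(pc_scores)
    dims = len(pc_scores[0])
    centroid = [sum(row[d] for row in pc_scores) / n for d in range(dims)]
    sq_dist = [
        sum((row[d] - centroid[d]) ** 2 for d in range(dims)) for row in pc_scores
    ]
    p_values = [sum(1 for other in sq_dist if other >= dist) / n for dist in sq_dist]
    return [p < p_threshold for p in p_values]


# Toy example: 24 samples on a small grid plus one sample far away in PC space.
points = [(i % 5 * 0.1, i // 5 * 0.1) for i in range(24)] + [(100.0, 100.0)]
flags = flag_outliers(points, p_threshold=0.05)
print(flags.index(True))
```

With this definition, a threshold of 0.05 can only flag a sample when more than 20 samples are available, since the smallest attainable empirical p-value is 1/n.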

Single omics analysis

BiomiX transcriptomics input and pipeline

BiomiX requires the expression matrix M_sg, where the columns “s” represent the samples and the rows “g” contain the genes as Ensembl IDs or gene symbols. The matrix should contain the raw counts in integer format, obtained by counting the reads aligned to each gene in the BAM files, for example with HTSeq-count [32] or featureCounts [33]. Alternatively, transformed data can be used as input in floating-point format; these are automatically recognized and analysed for DGE with Limma. If sex and age are available in the metadata, they are used to correct the DESeq2 and Limma models. Raw counts are used as input for the DESeq2 R package, and transformed counts for the Limma R package, to compare the two conditions in a DGE analysis. Differentially expressed genes are then sorted in the results files and separated by their significance and up- or down-regulation compared to the controls. The default thresholds for differentially expressed genes are |Log2FC| > 0.5 and p.adj < 0.05. The p-value is adjusted using the False Discovery Rate (FDR) method [34], and a heatmap and a volcano plot are generated. Users can choose the number of visualized genes. The enrichment of biological processes in the results is explored with the R version of EnrichR. Moreover, output files are produced as input for Gene Set Enrichment Analysis (GSEA) [35] (http://www.gsea-msigdb.org/gsea/index.jsp) or the EnrichR web tool [36] (https://maayanlab.cloud/Enrichr/). The expression matrix of raw counts is by default transformed by variance-stabilizing transformation for data visualization and MOFA integration, unless another method is set up in the pre-QC by the user. A summary of these features and the pipeline is shown in Fig. 3.
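To make the default thresholds concrete, the following Python sketch applies a Benjamini–Hochberg FDR adjustment and then labels genes as up-regulated, down-regulated or not significant. It is only an illustration: in BiomiX the adjusted p-values come from DESeq2 or Limma themselves, and the function names and the four example genes are hypothetical.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def classify_genes(
    results: Dict[str, Tuple[float, float]],
    lfc_cutoff: float = 0.5,
    padj_cutoff: float = 0.05,
) -> Dict[str, str]:
    """Label each gene from its (log2FC, nominal p-value) pair."""
    genes = list(results)
    padj = benjamini_hochberg([results[g][1] for g in genes])
    labels = {}
    for gene, q in zip(genes, padj):
        lfc = results[gene][0]
        if q < padj_cutoff and lfc > lfc_cutoff:
            labels[gene] = "up"
        elif q < padj_cutoff and lfc < -lfc_cutoff:
            labels[gene] = "down"
        else:
            labels[gene] = "ns"
    return labels


# Hypothetical results: gene -> (log2FC, nominal p-value).
example = {"A": (1.2, 0.01), "B": (-0.8, 0.04), "C": (0.2, 0.03), "D": (-0.1, 0.005)}
labels = classify_genes(example)
print(labels)
```

Genes C and D pass the adjusted p-value cutoff but not the fold-change cutoff, which is why both thresholds are applied jointly.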

Subpopulation of differential gene expression analysis based on a gene panel

For transcriptomic single-omics analysis, BiomiX enables the insertion of a panel of genes to identify subpopulations within the condition being analysed and to conduct separate DGE analyses. It compares patients positive and negative for the gene panel with the control group, providing the same results files as any transcriptomic analysis. It can confirm subgroups or define new subpopulations within known ones for diseases or treatments with well-known subpopulation markers (e.g. interferon or interleukin signaling genes in autoimmune diseases). Subpopulation recognition relies on the variation, measured in standard deviation units, relative to the mean expression in the chosen control. This method was inspired by similar approaches to measuring IFN-alpha signaling, such as the Kirou score [37] or similar methodologies [38]. Any panel of genes of interest could be similarly employed. Specifically, the standard deviation score (Z activity score) uses the counts normalized by the number of reads to compare the expression of each gene (g) in each disease or treated sample (s) with the mean expression of the controls, divided by the standard deviation of the controls, as in the following equation:

$$Z\_activity\_score = \frac{{gene\_expression_{gs} - mean\left( {gene\_expression\_control\_population} \right)}}{standard\_deviation (gene\_expression\_control\_population)}$$

The higher the standard deviation shift of a gene, the higher its expression in the condition compared to the controls. Subgrouping depends on the user parameters and criteria and is completely customizable. By default, following the Kirou score, samples with three genes with a score > 2 or 10 genes with a score > 1 are labeled positive. A heatmap is built from the standard deviation scores, based on hierarchical clustering in the ComplexHeatmap package v2.12.022 in R. By default, Euclidean distance and Ward’s D2 method are used for hierarchical clustering.
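The Z activity score and the default positivity rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: the gene names and expression values are invented, the function names are hypothetical, and the sample standard deviation (n - 1 denominator) is assumed because the text does not specify which estimator is used.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def z_activity_scores(
    sample_expr: Dict[str, float], control_expr: Dict[str, List[float]]
) -> Dict[str, float]:
    """Z score of each panel gene in one sample relative to the controls."""
    return {
        gene: (value - mean(control_expr[gene])) / stdev(control_expr[gene])
        for gene, value in sample_expr.items()
    }


def is_panel_positive(
    scores: Dict[str, float],
    rules: Sequence[Tuple[int, float]] = ((3, 2.0), (10, 1.0)),
) -> bool:
    """Positive if any rule (minimum gene count, score cutoff) is satisfied."""
    return any(
        sum(1 for s in scores.values() if s > cutoff) >= n_genes
        for n_genes, cutoff in rules
    )


# Invented normalized expression: controls have mean 10 and SD 2 for every gene.
controls = {g: [8.0, 10.0, 12.0] for g in ("G1", "G2", "G3", "G4")}
patient_high = {"G1": 16.0, "G2": 15.0, "G3": 14.5, "G4": 10.0}
patient_low = {"G1": 13.0, "G2": 13.0, "G3": 13.0, "G4": 10.0}

print(is_panel_positive(z_activity_scores(patient_high, controls)))
print(is_panel_positive(z_activity_scores(patient_low, controls)))
```

The first patient has three genes above 2 standard deviations and is labeled positive; the second has only three genes above 1 and none above 2, so neither default rule is met.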

BiomiX metabolomics input and pipeline

For metabolomics data, BiomiX requires a peak signal matrix M_sg, where the columns “s” represent the samples, and the rows “g” contain the arbitrary peak numbers. BiomiX is highly flexible, supporting the analysis of both targeted (annotated) and untargeted metabolomics data, with the option to analyse untargeted data either with or without annotation. The untargeted annotation is based on MS1 annotation (mass/charge ratio “m/z”) or MS2 data (MS1 annotation plus raw MS2 fragmentation files in .mzML or .mgf format). Users starting with .mzML files can generate the peak matrix by pre-processing raw data (peak deconvolution, RT alignment, and normalization) with user-friendly tools (MZMine, MS-DIAL and Metaboanalyst) or R package pipelines [14, 39, 40]. This matrix can then be used as input for BiomiX. Fernández-Ochoa and Shen’s articles [41, 42] provide examples of these tools. The peak signals from treated samples are compared to control samples by calculating the Log2FC, which is the log2 of the ratio between their median peak signals. The p-values are evaluated by the non-parametric Mann–Whitney test and corrected through the FDR method [34].
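For a single peak, the two statistics described above can be sketched in Python as follows. The Log2FC follows the definition in the text (log2 of the ratio of medians). The Mann–Whitney p-value shown here uses a normal approximation without tie or continuity correction, which is an assumption made to keep the example dependency-free; an exact or corrected test, as provided by standard statistics packages, may give slightly different values. All function names and peak intensities are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence


def log2_fold_change(condition: Sequence[float], control: Sequence[float]) -> float:
    """log2 of the ratio between the median signals of the two groups."""
    return math.log2(median(condition) / median(control))


def _average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def mann_whitney_p(condition: Sequence[float], control: Sequence[float]) -> float:
    """Two-sided p-value from the normal approximation of the U statistic."""
    n1, n2 = len(condition), len(control)
    ranks = _average_ranks(list(condition) + list(control))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


# Invented peak intensities for one metabolite.
treated = [10.0, 11.0, 12.0, 13.0, 14.0]
control = [1.0, 2.0, 3.0, 4.0, 5.0]
print(log2_fold_change(treated, control), mann_whitney_p(treated, control))
```

Using medians rather than means makes the fold change less sensitive to a few extreme peak intensities, which matches the choice of a rank-based test.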

Then, the metabolomics peaks are annotated using the CEU Mass Mediator tool [30] through the CMMR R package [43]. The MS1 m/z match is set by default to a 15-ppm error for positive mode, but neutral and negative modes are also available. The adducts available in the positive mode include [M + H]⁺, [M + 2H]²⁺, [M + Na]⁺, [M + NH₄]⁺, and [M + H-H₂O]⁺, and in the negative mode, they include [M-H]⁻, [M + Cl]⁻, [M + FA-H]⁻, and [M-H-H₂O]⁻. By default, all available MS1 databases are examined (the Human Metabolome Database [HMDB], Lipidmaps, Metlin, and KEGG), but their use is customizable. Both the databases and parameters should be carefully reviewed by the user according to the dataset. While the default options allow the analysis to proceed, they do not guarantee high-quality results without proper parameter selection. Therefore, we strongly recommend consulting the BiomiX tutorial and its parameters section before starting (https://ixi-97.github.io). When several candidate metabolites are associated with the same peak, BiomiX consults the HMDB [44] lists of previously identified or predicted metabolites to retain those already detected in a given type of specimen. These include plasma, urine, saliva, cerebrospinal fluid, feces, sweat, breast milk, bile and amniotic fluid samples. When MS/MS spectra are also available, BiomiX will upload all the .mzML or .mgf files in the indicated directory and verify the peak fragmentation spectra, looking for a match in the Mass Bank of North America (MoNA), MassBank, and HMDB. The user must prioritize these databases, using the first as a reference and the others to annotate the metabolomics peaks not annotated by the higher-priority databases. Priority and use of these databases are fully customizable, but the default order of priority is HMDB, MoNA, and MassBank. The overlaps between the candidate spectra retrieved from the .mzML or .mgf files and those from the databases are saved in the output folder. Each peak annotation detected in MS2 will automatically replace the annotation obtained in MS1 because the former is more reliable. A summary of these features and the pipeline is shown in Fig. 4.
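The 15-ppm MS1 tolerance can be illustrated with a small Python sketch. This is not how BiomiX queries CEU Mass Mediator (that is done through the CMMR package); it only shows what a ppm window means for two positive-mode adducts. The candidate table, the function names and the observed m/z values are hypothetical, and the adduct mass shifts are the commonly used values for a proton and a sodium cation.

```python
from typing import Dict, List, Tuple

# Mass shifts (Da) added to the neutral monoisotopic mass for singly charged
# positive-mode adducts; values assumed from standard adduct tables.
ADDUCT_SHIFTS = {"[M+H]+": 1.007276, "[M+Na]+": 22.989218}


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def annotate_peak(
    observed_mz: float, candidates: Dict[str, float], tolerance_ppm: float = 15.0
) -> List[Tuple[str, str, float]]:
    """List (metabolite, adduct, ppm error) for every match within tolerance."""
    hits = []
    for name, neutral_mass in candidates.items():
        for adduct, shift in ADDUCT_SHIFTS.items():
            error = ppm_error(observed_mz, neutral_mass + shift)
            if abs(error) <= tolerance_ppm:
                hits.append((name, adduct, round(error, 2)))
    return hits


# Hypothetical candidate table: neutral monoisotopic masses in Da.
candidates = {"glucose": 180.06339, "caffeine": 194.08038}

print(annotate_peak(181.0710, candidates))   # close to glucose [M+H]+
print(annotate_peak(181.0800, candidates))   # about 50 ppm away: no match
```

A tighter tolerance reduces ambiguous annotations but requires well-calibrated instruments, which is one reason the text recommends reviewing the defaults for each dataset.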

The top increased and reduced significant metabolites are displayed in a volcano plot and heatmap according to user choices. Transformation is applied to the peak signal matrix for MOFA integration in the preview-QC, which also allows the distribution of QC samples to be visualized in the PCA and UMAP space. To unveil the enrichment in biological pathways, BiomiX uses the R package MetPath v1.0.5 from TidyMass v1.0.8 [41]. Ready-to-use input files for MetaboAnalyst [13] are generated for metabolomics analysis, including conventional metabolite set enrichment analysis and late integration analysis by joint pathway and network analyses. The late integration utilizes prior results from transcriptomics or methylomics data from the same dataset.

BiomiX methylomics input and pipeline

BiomiX requires the matrix “M_sg,” where the columns “s” represent the samples, and the rows “g” contain CpG island annotation. CpG islands are DNA regions rich in cytosine and guanine, where the methylation of these nucleotides can exert epigenetic regulation. The matrix must contain beta values; if unavailable, the Minfi R package [45] can calculate them. BiomiX performs a differential methylation analysis using the ChAMP [6] package, providing the CpG island Δbeta value, the p-value adjusted by FDR and a summarizing volcano plot. The default thresholds are a beta value change of |Δbeta| > 0.15 and p.adj < 0.05 after FDR correction [34], but the user can customize them. Each methylomics single-omics analysis provides a volcano plot containing the names of the top CpG islands with increased and reduced methylation, as well as a heatmap including the top CpG islands with increased and reduced methylation between the two conditions. Users choose the number of CpG islands to visualize. A complete list of CpG islands with increased or reduced methylation is created. Each CpG is associated with the gene, chromosome, Log2FC, adjusted p-value, and the other ChAMP output columns. The genes associated with the CpG islands with increased or reduced methylation are listed and directly analysed in EnrichR, as for transcriptomics results. A summary of these features and the pipeline is shown in Fig. 5.

BiomiX undefined input and pipeline

For undefined data, BiomiX requires a matrix M_sg, where columns “s” represent the samples, and rows “g” contain the features. The features from the treated samples are compared with those from the control samples by calculating the Log2FC, which is the log2 of the ratio between the median feature value in the condition-treated samples and the median feature value in the control samples for each feature. To accommodate both Gaussian and non-Gaussian data distributions, p-values are calculated using both the non-parametric Mann–Whitney test and the t-test, each adjusted using the FDR method [34].

Multi omics integration analysis

MOFA analysis

MOFA [18] is used according to the developers' online guidelines (https://biofam.github.io/MOFA2/) with transformed input data and reduced feature size. In the preview-QC guidelines, the transformation methods are recommended based on the omics type. For transcriptomics data, the variance-stabilizing transformation function in R is suggested to approximate homoscedasticity, while metabolomics data benefit from log transformation, which brings them closer to a Gaussian distribution. Methylomics beta values do not require any transformation [18]. The top genes and CpG islands with the highest variance in transformed data are selected for MOFA integration, except for metabolomics data, which typically include fewer than thousands of peaks. MOFA can calculate any desired total number of factors to explain the shared variance between omics datasets. Other parameters customizable in the interface include convergence mode (speed of the convergence), freqELBO (frequency of Evidence Lower Bound evaluation for the training curve), and Maxiter (number of iterations of the MOFA model). The implementation of MOFA in BiomiX includes an automated optimization of the total number of factors. The tuning mode runs the MOFA algorithm with an increasing number of factors, stopping the iteration when at least three models show the last MOFA factor explaining less than 1% of the variability of the data. Only the top three models for separating the two conditions are retained. The statistical discrimination between the two groups is determined for each calculated factor in each model. A non-parametric Mann–Whitney test compares the factor value distributions between the two groups of samples; the p-values are then corrected using the FDR method [34]. The selected models have the highest number of discriminating MOFA factors. Of the MOFA models with the same number of discriminant factors, only those with the lowest adjusted p-values are selected.
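The tuning loop described above can be sketched as follows in Python. This is one possible reading and not the BiomiX implementation: `fit_model` is a hypothetical callback standing in for a MOFA run that returns, for a given number of factors, the percentage of variance explained by each factor and the FDR-adjusted p-value of each factor for the two-group comparison. The sketch also assumes that the three low-variance models need not be consecutive and that ties are broken by the smallest adjusted p-value of each model; the text does not settle either point.

```python
from typing import Callable, Dict, List, Tuple

FitResult = Tuple[List[float], List[float]]  # (variance % per factor, adjusted p per factor)


def tune_factor_number(
    fit_model: Callable[[int], FitResult],
    start: int = 2,
    max_factors: int = 50,
    min_last_variance: float = 1.0,
    patience: int = 3,
    keep: int = 3,
    alpha: float = 0.05,
) -> List[Dict]:
    """Fit models with a growing number of factors and keep the best ones."""
    models = []
    low_variance_models = 0
    for k in range(start, max_factors + 1):
        variance, padj = fit_model(k)
        models.append({"k": k, "variance": variance, "padj": padj})
        # Stop once `patience` models have a last factor below the cutoff.
        if variance[-1] < min_last_variance:
            low_variance_models += 1
        if low_variance_models >= patience:
            break

    def rank_key(model: Dict) -> Tuple[int, float]:
        n_discriminant = sum(1 for q in model["padj"] if q < alpha)
        return (-n_discriminant, min(model["padj"]))

    return sorted(models, key=rank_key)[:keep]


# Fake fitter for demonstration only: variance decays as 8/i, and the
# 5-factor model is the only one with two discriminant factors.
calls: List[int] = []


def fake_fit(k: int) -> FitResult:
    calls.append(k)
    variance = [8.0 / (i + 1) for i in range(k)]
    if k == 5:
        padj = [0.001, 0.002] + [0.5] * (k - 2)
    else:
        padj = [0.001 * k] + [0.5] * (k - 1)
    return variance, padj


best = tune_factor_number(fake_fit)
print([m["k"] for m in best])
```

In this toy run the loop stops after the 11-factor model, the third one whose last factor explains less than 1% of the variance.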

The MOFA analysis provides a matrix containing the variance explained by each factor in a defined MOFA model of n factors. Two reports in PDF format recapitulate the loaded samples, the variance explained by the factors, and the genes, metabolomic peak signals and/or CpG island contributions of the selected MOFA factor, which can be explored by scatter plot. Furthermore, a file containing the condition separation performance of each factor, and the top 5% of features with an absolute weight > 0.50 (the default, customizable by the user), are saved as output.

Multi omics integration annotation

Extraction of MOFA factors: interpretation

The tuned and arbitrary MOFA integration includes three methods to ease the user’s interpretation of the discriminating MOFA factors. A summary of the MOFA interpretation pipeline in BiomiX is shown in Fig. 6.

Correlation analysis

Users can upload a matrix containing binary or numerical clinical features to integrate into the MOFA model. The numerical data are correlated with each MOFA factor using the Pearson correlation, while the binary clinical data are analysed using the Wilcoxon test after dividing the groups into positive and negative categories. The nominal p-values are corrected using the Benjamini–Hochberg method [34].

Pathway analysis

BiomiX retrieves the top contributing genes, metabolites, and CpG islands for discriminating factors in each MOFA model. Depending on the type of omics data, an R package is selected to test whether a biological or metabolic pathway is enriched among these genes, metabolites, or CpG islands. The genes are analysed by EnrichR using the Reactome and biological process libraries, together with the Encyclopedia of DNA Elements (ENCODE) [46] and ChIP-X Enrichment Analysis (ChEA) [47] consensus transcription factors from ChIP-X libraries, while the metabolites are analysed through MetPath using the KEGG and HMDB databases. CpG islands are associated with their genes, if they exist, and are examined using EnrichR.

Pubmed bibliography research

For each discriminating factor in each MOFA model, the top contributing genes, metabolites, and CpG island genes are used as input for a PubMed search. The aim is to retrieve the abstracts of articles associated with each discriminating factor to obtain clues about each factor's identity. The search algorithm has three levels that prioritize results combining contributors from more omics. Initially, the algorithm selects the top contributors from each omics provided as input and selects only abstracts showing at least one out of ten contributors from every omics in the text. The second level does the same, but it selects article abstracts containing at least one out of ten contributors in omics pairs (e.g., transcriptomics–metabolomics, methylomics–transcriptomics, and metabolomics–methylomics). Finally, the last level selects article abstracts showing at least one out of ten contributors within a single omics in the text.
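The three-level prioritization can be illustrated on abstracts that have already been retrieved. The Python sketch below is a simplified reading of the description, not the BiomiX search code: it does not query PubMed, the function name `match_level` is hypothetical, the contributor names are invented examples, and whole-word, case-insensitive matching is an assumption.

```python
import re
from typing import Dict, List, Optional


def match_level(abstract: str, contributors: Dict[str, List[str]]) -> Optional[int]:
    """Return 1 if every omics has a contributor in the abstract, 2 if at
    least two omics do, 3 if a single omics does, and None otherwise."""
    matched_omics = 0
    for names in contributors.values():
        for name in names:
            # Whole-word match so that "MX1" does not match "MX10".
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, abstract, flags=re.IGNORECASE):
                matched_omics += 1
                break
    if matched_omics == 0:
        return None
    if matched_omics == len(contributors):
        return 1
    if matched_omics >= 2:
        return 2
    return 3


# Invented top contributors of one discriminating factor.
top_contributors = {
    "transcriptomics": ["IFI44L", "ISG15"],
    "metabolomics": ["kynurenine"],
    "methylomics": ["MX1"],
}

abstracts = [
    "ISG15 and MX1 methylation correlate with kynurenine levels.",
    "Kynurenine rises with IFI44L expression.",
    "ISG15 conjugation in viral infection.",
    "MX10 is unrelated to this factor.",
]
levels = [match_level(text, top_contributors) for text in abstracts]
print(levels)
```

Abstracts at level 1 are the most informative for interpreting a factor because they connect contributors across all integrated omics.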

For these three levels, the output document includes a .tsv table containing the PubMed articles, the number of contributors matched out of the total number of contributors, and the number of times the contributors are mentioned. Keywords, DOIs, and match information for each contributor are also available. Author-provided keywords are not an optimal summary because some journals do not supply them. Therefore, BiomiX includes a further text-mining analysis. The abstracts retrieved at each level are extracted and parsed with litsearchr version 1.0.0 into two-to-four-word combinations [48]. The vocabulary generated from the abstracts is analysed to identify the most frequent word combinations. These combinations are filtered against a second vocabulary comprising gene set names from Gene Ontology biological processes (7,751 gene sets) and the Human Phenotype Ontology (5,405 gene sets). The 15 most frequent word combinations are included in the output .tsv file. Finally, a comprehensive word frequency analysis is performed on all the abstracts retrieved from all three levels.
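The text-mining step can be pictured as counting two-to-four-word combinations across abstracts and retaining those found in a reference vocabulary. The sketch below illustrates this idea in plain Python; it does not reproduce litsearchr, it assumes that filtering means keeping only combinations present in the vocabulary, and the abstracts and vocabulary are hypothetical.

```python
import re
from collections import Counter


def ngram_counts(abstracts, n_min=2, n_max=4):
    """Count every n_min- to n_max-word combination across abstracts."""
    counts = Counter()
    for text in abstracts:
        words = re.findall(r"[a-z0-9]+", text.lower())
        for n in range(n_min, n_max + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


def top_terms(abstracts, vocabulary, k=15):
    """Return the k most frequent combinations found in the vocabulary."""
    counts = ngram_counts(abstracts)
    kept = [(term, c) for term, c in counts.most_common() if term in vocabulary]
    return kept[:k]


# Hypothetical abstracts and a tiny stand-in for the ontology vocabulary.
texts = [
    "Type I interferon signaling in tuberculosis",
    "Interferon signaling and B cell activation",
    "B cell activation during infection",
]
vocabulary = {"interferon signaling", "b cell activation"}
terms = top_terms(texts, vocabulary)
```

The default k=15 mirrors the number of word combinations reported in the output file.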

Case studies

Two multi-omics datasets were used to test BiomiX. First, the FastQ files of the tuberculosis dataset were downloaded from ENA (https://www.ebi.ac.uk/ena/browser/home) under the project code PRJNA971365. Read quality was checked using FastQC, and adapter-containing reads were trimmed using Trimmomatic v0.39. The reads were aligned to the Ensembl Homo sapiens reference genome (GRCh38) and annotated with GENCODE GRCh38.104 using STAR v2.7.11 with a two-pass mapping strategy and default parameters. Gene quantification was performed using Ht-seq count v0.13.538 with default parameters. At the end of the process, 13 samples per condition were available, i.e. healthy controls (HC), patients with tuberculosis (PTB) and patients with tuberculosis and diabetes (PTB_DM). For clarity, only the comparison between HC and PTB is reported. The entire dataset is available as an example dataset in BiomiX. The parameters were set to be similar to those of the original work, including |log2FC| > 1 and p.adj < 0.05 for the transcriptomics single-omics analysis, and |log2FC| > 0.5 and p.adj < 0.1 for the metabolomics single-omics analysis. Second, the FastQ files for the Chronic Lymphocytic Leukemia (CLL) dataset were downloaded from http://pace.embl.de/ and analysed using |log2FC| > 1 and p.adj < 0.05 thresholds for the transcriptomics analysis, and |log2FC| > 0.5 and p.adj < 0.1 thresholds for the metabolomics analysis. Each dataset was analysed individually in BiomiX for single-omics analysis, and omics integration was performed within each dataset for multi-omics analysis.
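The significance thresholds quoted above amount to a simple filter on the differential analysis table. A minimal sketch follows, assuming a hypothetical result table with log2FC and padj fields; the feature names and values are invented for illustration.

```python
def significant_features(results, lfc_cutoff, padj_cutoff):
    """Names of features passing both the |log2FC| and p.adj cutoffs."""
    return [
        row["feature"]
        for row in results
        if abs(row["log2FC"]) > lfc_cutoff and row["padj"] < padj_cutoff
    ]


# Hypothetical differential analysis results.
results = [
    {"feature": "gene_a", "log2FC": 1.8, "padj": 0.01},
    {"feature": "gene_b", "log2FC": -1.2, "padj": 0.04},
    {"feature": "gene_c", "log2FC": 0.7, "padj": 0.001},
    {"feature": "gene_d", "log2FC": 2.5, "padj": 0.08},
]

# Stricter cutoffs, as used for transcriptomics in the case studies.
strict_hits = significant_features(results, lfc_cutoff=1.0, padj_cutoff=0.05)
# More permissive cutoffs, as used for metabolomics in the case studies.
permissive_hits = significant_features(results, lfc_cutoff=0.5, padj_cutoff=0.1)
```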

Results

Implementations and comparison with other tools

BiomiX is designed to simplify the use of the middle integration method MOFA, enhancing factor interpretation through innovative analyses of bibliographic data, pathways, and clinical correlations. It integrates with established platforms such as EnrichR and GSEA, and supports a late integration approach via MetaboAnalyst for combining single-omics results. The tool's development was guided by two key principles. The first was the decision to use matrices as input, reflecting the widespread use of this format in laboratories, consortia, and major public repositories such as the Sequence Read Archive (SRA) and MetaboLights. The latter database contains annotation details from Metabolite Annotation/Assignment Files (MAFs) and .mzML or .mgf fragmentation files for MS2 annotation. The second guiding principle was to offer a comprehensive omics toolkit that can run on a standard laptop. To achieve this, computationally intensive tasks, such as converting raw instrument acquisition data into matrices, were excluded from the analysis. To promote reproducibility and reusability, all selected parameters and input files are saved in a report, adhering to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. This ensures that analyses can be easily reproduced and shared with collaborators. Additionally, the set.seed() function in R is used to fix the random seed, supporting the reproducibility of each analysis. The BiomiX-assisted format converter can fix compatibility issues by converting the matrix into the proper format through a simple user interface, while the BiomiX toolkit button allows the removal of samples or variables exceeding a user-selected threshold of missing values. Although matrices containing missing data are fully compatible with BiomiX, a toolkit button provides a wide choice of imputation methods to treat the data before the single-omics analysis and integration.
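The removal of samples or variables by missing-value threshold can be illustrated with a small sketch. This is not the BiomiX toolkit code; it assumes a hypothetical matrix stored as a list of rows (variables) by columns (samples), with None marking a missing value.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_variables(rows, threshold):
    """Remove rows (variables) whose missing fraction exceeds the threshold."""
    return [row for row in rows if missing_fraction(row) <= threshold]


def drop_sparse_samples(rows, threshold):
    """Remove columns (samples) whose missing fraction exceeds the threshold."""
    n_cols = len(rows[0])
    keep = [
        j for j in range(n_cols)
        if missing_fraction([row[j] for row in rows]) <= threshold
    ]
    return [[row[j] for j in keep] for row in rows]


# Hypothetical 3-variable x 4-sample matrix with missing values.
matrix = [
    [1.0, None, 3.0, 4.0],
    [None, None, None, 2.0],
    [5.0, None, 6.0, 7.0],
]
kept_variables = drop_sparse_variables(matrix, threshold=0.5)
kept_samples = drop_sparse_samples(matrix, threshold=0.5)
```

With a threshold of 0.5, the second variable (75% missing) and the second sample (100% missing) are removed in the respective calls; remaining gaps could then be imputed.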

To our knowledge, BiomiX is the first tool designed both to analyse single omics and to integrate them using MOFA. The choice of this middle integration method was based on its ease of interpretation and faster computation compared to other methods in the same category. Additionally, its capability to handle missing data within omics and the absence of entire omics blocks in samples makes it well suited to heterogeneous datasets, reflecting the complexity of real-world data. These characteristics render MOFA more adaptable than similar methods such as DIABLO and iClusterBayes, which cannot handle missing omics or data without imputation [49]. Because it can handle missing omics and samples, which is uncommon among middle integration methods, MOFA was chosen as the primary integration method in BiomiX. Furthermore, since MOFA is an unsupervised integration method, it does not train a model to optimize group differences but rather explores them impartially. Making this method available to non-bioinformaticians could also provide an integration approach that complements those already available on other platforms, such as MixOmics [28].

Moreover, BiomiX has also built on the MOFA algorithm to select the models containing factors that discriminate between the two conditions of interest. The advantage of this unsupervised approach is that differences between groups are identified without bias, since the analysis highlights only group-independent common omics changes. A group-discriminant MOFA factor identified in this context must account for a considerable explained variance, supporting its reliability and tangibility. BiomiX’s key innovation lies in automatically tuning the number of factors in the MOFA model and in assisting users in identifying and annotating the discriminant MOFA factors. Specifically, the annotation relies on the PubMed bibliography search performed on the discriminant factor contributors, the biological pathway analysis of the contributors, and the correlation analysis between discriminant MOFA factors and clinical data. The BiomiX approach differs from X-omics ACTION [50], a novel Nextflow [51] pipeline implementation for multiomics analysis, which relies solely on simple correlation analysis with clinical data for MOFA interpretation. Moreover, X-omics ACTION was designed to integrate only metabolomics and methylomics data without carrying out the statistics and biological interpretation on each of them. This is a limitation given the wide usage of transcriptomics data in integration [49]. The MixOmics project was, and still is, a milestone in multi-omics integration; however, it was not designed to perform single-omics data analyses separately. Its supervised integration approach (DIABLO) for multi-omics data from the same patient, as previously mentioned, requires imputation, which could introduce bias into the data despite the impressive NIPALS methods developed. Other tools that do not integrate omics data but focus on single-omics analysis are iDEP [52] and MetaboAnalyst [13], for transcriptomics and metabolomics, respectively.
These tools provide a wide range of downstream analyses such as heatmaps, PCA, pathways, and network analysis to facilitate data exploration. Similar functionalities are provided in BiomiX. BiomiX takes advantage of existing single-omics tools by generating inputs for their parallel use alongside BiomiX. Furthermore, it extends beyond single-omics analysis by enabling the integration of multi-omics data. Compared with bioinformatics tools specifically designed for the analysis of a single omics or the integration of multi-omics (Table 1), BiomiX fills a gap left by the lack of tools that combine multi-omics analysis and integration.

Table 1 Comparison of BiomiX with single omics and multi omics integration tools


Case studies. BiomiX was applied to two example multi-omics datasets. The first case study included transcriptomics and metabolomics analyses of PTB [53]. Compared to controls, the BiomiX transcriptomics analysis showed the same biological pathways as in the original manuscript, including Arachidonic acid metabolic process (GO:0019369) and Classical antibody-mediated complement activation (R-HSA-173623), but also novel ones such as Hydrolysis of LPC (R-HSA-1483115) and B cell receptor signaling pathway (GO:0050853). The pathways were consistent in both the Gene Ontology and Reactome databases. Furthermore, to assess differences in patients’ immune responses, BiomiX enabled patients to be subgrouped according to a gene panel containing 26 IFN-induced genes. All samples having at least one IFN gene with a Z-score > 1 were considered positive, providing a clear separation (Figure S1). At first, all the PTB transcriptomes were compared with the HC ones (Figure S2A and B). The IFN-negative subgroup had differentially expressed genes enriched in Nitric oxide biosynthetic process (GO:0006809) in response to infection [54], but also in Arachidonic acid metabolic process (R-HSA-2142753), suggesting an IFN-independent activation. The IFN-positive subgroup was enriched in IFN signaling (R-HSA-877300) and B cell receptor signaling pathway (GO:0050853), with a reduction of IL-10 production (R-HSA-6783783), as expected. The metabolomics analysis on plasma revealed reductions in acylcarnitine, PC, LysoPE and TG, with significant enrichment for the sphingolipid signaling pathway, retrograde endocannabinoid signaling, and caffeine, purine, linoleic and glycerophospholipid metabolism, as in the manuscript, but also novel ones such as phenylalanine metabolism and Fc gamma R-mediated phagocytosis (Figure S3A and B). Data integration was then carried out with the BiomiX implementation of MOFA integration, which consists of two steps. 
First, MOFA models were computed iteratively with an increasing total number of factors, and the iteration stopped when at least three models had a last factor explaining less than 1% of the variance. Next, the three models that best separated the two conditions according to the Wilcoxon test were selected, based on the number of discriminant MOFA factors and the adjusted p-values (p.adj). Here, the MOFA models with 3, 4 and 5 factors were selected, with the first offering the best separation between PTB and HC and therefore being used for the following analysis. The three-factor MOFA model explained 5.05% and 45.77% of the metabolomics and transcriptomics total variances, respectively. Furthermore, only factor 1 significantly discriminated the two conditions (p.adj = 0.0011, sd = 0.08) (Fig. 7A, B), capturing 2.23% and 33.22% of the metabolomics and transcriptomics total variances, respectively, but its identity remained to be determined. BiomiX provided an interpretation of factor 1 through its bibliography search, which, by evaluating the transcriptomic contributors to factor 1, retrieved articles related to inflammation-driven genes, bacterial inflammation and, even more specifically, Mycobacterium tuberculosis infections. The metabolomic contributors also suggested two articles related to bacteria, but these were less specific for inflammation and less informative. Pathway analyses of the positive contributors of factor 1 confirmed inflammation and interferon signaling as the main biological processes (R-HSA-913531, R-HSA-1280215, R-HSA-1169410), while only trilostane, an inhibitor of corticoid production, was identified among the metabolomics positive contributors. 
Consistently, the negative transcriptomics contributors were enriched in Interleukin-10 signaling and immunoregulatory pathways (R-HSA-6783783, R-HSA-198933), while 11-deoxycortisol, PC, sphingolipids and cholesterol were the main negative metabolomics contributors, enriched in sphingolipid metabolism and in alpha-linolenic acid and linoleic acid metabolism (p-value < 0.05). The inflammation-induced lipid alteration known in the literature [55] and the negative contribution of anti-inflammatory factors (IL-10 and corticoids) support the inflammation-interferon identity of factor 1. The results obtained from BiomiX are available in Supplementary Table 1.
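The first of the two MOFA tuning steps described in this case study, in which models with an increasing number of factors are computed until at least three of them have a last factor explaining less than 1% of the variance, can be sketched as a simple loop. This is one reading of that stopping rule and not the BiomiX code: fit_model stands in for training a MOFA model and returning the percentage of variance explained by each factor, and toy_fit, the starting number of factors and the upper bound are hypothetical.

```python
def tune_factor_count(fit_model, start=2, min_variance=1.0,
                      required_models=3, max_factors=50):
    """Return the factor counts tried before the stopping rule is met.

    fit_model(k) is assumed to return a list with the percentage of
    variance explained by each of the k factors of a fitted model.
    """
    tried = []
    low_variance_models = 0
    for k in range(start, max_factors + 1):
        variances = fit_model(k)
        tried.append(k)
        # Count models whose last factor explains too little variance.
        if variances[-1] < min_variance:
            low_variance_models += 1
        if low_variance_models >= required_models:
            break
    return tried


def toy_fit(k):
    """Stand-in for model training: variance halves with each factor."""
    return [50 / 2 ** i for i in range(k)]


tried_models = tune_factor_count(toy_fit)
```

In this toy setting the last factor first drops below 1% in the seven-factor model, so the loop stops after the nine-factor model. The candidate models would then be compared on how well their factors separate the two conditions.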

The second case study examined the transcriptomics and methylomics differences between CLL patients with mutated and unmutated immunoglobulin heavy variable (IGHV) genes [56]. This dataset was previously used for testing the MOFA algorithm. The data presented here therefore focus mainly on identifying the identity behind the factors, with only a brief mention of the transcriptomics and methylomics single-omics results. The transcriptomics (Figure S4A and B) and methylomics (Figure S5A and B) heatmaps and volcano plots successfully highlighted the genes related to CLL differences mentioned in the article (KANK2, DGKH, MYLK, PPP1R9A, SEPTIN10, SOWAHC, PLD1, and LPL). Moreover, the pathway analysis identified upregulation in proliferation (GO:0090267) and VDJ recombination (GO:0033152, GO:0033151) pathways in methylomics. The MOFA implementation identified the models with 8, 9 and 10 factors as the best performing for identifying differences between the two conditions. The ten-factor model was the best, identifying five factors (factors 1, 3, 4, 5 and 6) discriminating mutated and unmutated IGHV CLLs (Fig. 8A). Factor 1 was the most discriminating among them, and its analysis (Fig. 8B) identified articles related to DNA damage, ageing and cancer. Correlation analysis identified a significant correlation with 11q22.3 deletion, known to induce proliferation in CLL by ZAP-70. These results confirmed the proliferation and oxidative processes already identified by the MOFA article. Factor 3 and factor 4 were correlated with chromosome 12 trisomy, while the bibliography search identified articles related to B leukaemia, DNA damage, and autoimmunity, specifically for factor 3. Factor 4 was enriched in focal adhesion and membrane-ECM interactions and signaling (R-HSA-3000171, R-HSA-8874081), consistent with the role of trisomy 12 in activating adhesion signaling [57]. 
Factor 5 was correlated with gender differences in CLL (p.adj = 1.78e−26), supported by articles related to sex differences in human-primates [58], mice sex development [59] and autoimmune encephalomyelitis [60]. Furthermore, it is the only factor not correlated with any treatment response. Interestingly, the pathway analysis highlighted pathways linked to galactosyltransferase activity (R-HSA-4420332, R-HSA-3560801, R-HSA-3560783), suggesting a sex difference in CLL not targeted by any drug. Finally, factor 6 was associated with mutations (TP53, SF3B1) and a deletion (del17p13) linked to a poor outcome and unfavorable prognostic factors [61, 62]. The pathway analysis supported this hypothesis, as contributors were enriched for cytokine signaling such as interleukin 4, 13, 27 (GO:0070106, R-HSA-6785807) and type 1 interferon (GO:0032481), of which interleukin 4 and IFN-alpha are already known to be linked to a more severe condition [63]. The results obtained from BiomiX are available in Supplementary Table 2.

Discussion

BiomiX aims to help biologists, physicians, and scientists with no background in bioinformatics. Some tools, such as MixOmics, allow for data integration, while others perform single-omics analysis, but none of them allow for both simultaneously [28]. Currently, to our knowledge, the only way to compare two groups with an analysis similar to that of BiomiX is to combine, in Nextflow, a single-omics pipeline (metaboigniter [64] or other transcriptomics and methylomics modules) with integration tools such as X-omics ACTION. However, this pipeline does not provide intuitive access to MOFA integration as MixOmics does with DIABLO, and does not include features such as the optimization of the number of MOFA factors. Moreover, while X-omics ACTION only performs correlation analysis to annotate MOFA factors, BiomiX relies on three methods based on different approaches that converge to reveal the factor's identity, as shown in the two case studies. BiomiX is thus more precise in providing factor annotation and is the first tool to exploit a PubMed bibliography search to annotate hidden factors. It is also worth mentioning that although Nextflow has developed a user interface, most changes to the pipeline require coding skills that the BiomiX solution does not require. BiomiX is not a definitive solution; it is a first attempt to combine gold-standard single-omics pipelines in bioinformatics with tuned and accessible integration using MOFA. BiomiX provides high interpretability of the common sources of variation among omics, giving users results from both single-omics analysis and multi-omics integration, with the prospect of application to single-cell data thanks to the MOFA method [18]. It allows for a complete overview of the changes occurring in biological systems, whether in the context of disease, treatment, or physiological conditions, enhancing the interpretability of the biological pathways and processes involved. 
In addition to improving MOFA factor interpretability and tuning the total number of factors in the model, BiomiX simplifies interactive data visualization through a Shiny interface, enabling users to track changes before and after data transformation, as well as to remove outliers and highly variable features. A separate interface addresses missing values, offering various imputation options and controls for problematic samples or variables. With flexible parameter settings, users have full control over their analysis. We strongly recommend that users follow the guidelines and carefully review the dataset before using BiomiX. BiomiX also provides output formats ready to copy and paste into widely used, user-friendly websites or programs, such as GSEA, EnrichR, and MetaboAnalyst. Nevertheless, BiomiX has limitations. Although it can analyse multiple groups simultaneously, it supports a limited number of omics types. Consequently, much work remains to be done to implement functionalities based on community needs and to include more integration methods and omics data, such as proteomics and genomics data from different technologies.

Conclusions

The current work aims to grow the community of users and developers in order to improve the accessibility of new bioinformatics algorithms and methods within the scientific community. BiomiX is an attempt to improve this accessibility by offering everyone access to bioinformatics methods and algorithms, focusing on the accessibility of multi-omics integration methods so that specialists in a wide range of fields can benefit from the Big Data revolution.

Availability of data and materials

The datasets generated and/or analysed during the current study are available from ENA (https://www.ebi.ac.uk/ena/browser/home) with the project code PRJNA971365 and from http://pace.embl.de/. Project name: BiomiX; Project home page: https://github.com/IxI-97/BiomiX2.2 (GitHub), https://ixi-97.github.io (website); Operating system(s): Windows, Linux and Mac OS; Programming language: R for the analysis and Python for the user interface; Other requirements: Miniconda 24.4.0 or higher is required for the installation; License: GNU GPL; Any restrictions to use by non-academics: None.

Abbreviations

CE-MS: Capillary Electrophoresis coupled to Mass Spectrometry
ChEA: ChIP-X Enrichment Analysis
CLL: Chronic Lymphocytic Leukemia
DDA: Data Dependent Acquisition
DGE: Differential Gene Expression
DIABLO: Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies
ENCODE: Encyclopedia of DNA elements
FDR: False Discovery Rate
GC-MS: Gas Chromatography coupled to Mass Spectrometry
GSEA: Gene Set Enrichment Analysis
HC: Healthy Controls
HMDB: Human Metabolome Database
HRMS: High resolution mass spectrometry
IGHV: Immunoglobulin Heavy Variable
KEGG: Kyoto Encyclopedia of Genes and Genomes
LC-MS: Liquid Chromatography Mass Spectrometry
MAD: Median absolute deviation
MAFs: Metabolite Annotation/Assignment Files
MOFA: Multiomics Factor Analysis
MoNA: Mass Bank of North America
m/z: Mass-to-charge ratio
NEMO: NEighborhood based Multi-Omics clustering
NMR: Nuclear Magnetic Resonance
PCA: Principal component analysis
PTB: Patient with Tuberculosis
PARADIGM: PAthway Recognition Algorithm using Data Integration on Genomic Models
SNF: Similarity Network Fusion
UMAP: Uniform Manifold Approximation and Projection
VST: Variance Stabilizing Transformation

References

Barturen G, Babaei S, Català-Moll F, et al. Integrative analysis reveals a molecular stratification of systemic autoimmune diseases. Arthritis Rheumatol Hoboken NJ. 2021;73:1073–85.
Fernández-Ochoa Á, Brunius C, Borrás-Linares I, et al. Metabolic disturbances in urinary and plasma samples from seven different systemic autoimmune diseases detected by HPLC-ESI-QTOF-MS. J Proteome Res. 2020;19:3220–9.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.
Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43: e47.
Tian Y, Morris TJ, Webster AP, et al. ChAMP: updated methylation analysis pipeline for Illumina BeadChips. Bioinformatics. 2017;33:3982–4.
Wang D, Yan L, Hu Q, et al. IMA: an R package for high-throughput analysis of Illumina’s 450K Infinium methylation data. Bioinformatics. 2012;28:729–30.
Perez de Souza L, Fernie AR. Computational methods for processing and interpreting mass spectrometry-based metabolomics. Essays Biochem. 2024;68(1):5–13.
Smith CA, Want EJ, O’Maille G, et al. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem. 2006;78(3):779–87.
Brunius C, Shi L, Landberg R. Large-scale untargeted LC-MS metabolomics data correction using between-batch feature alignment and cluster-based within-batch signal intensity drift correction. Metabolomics. 2016;12(11):173.
Klåvus A, Kokla M, Noerman S, Koistinen VM, et al. “notame”: workflow for non-targeted LC-MS metabolic profiling. Metabolites. 2020;10(4):135.
Shen X, Wu S, Liang L, et al. metID: an R package for automatable compound annotation for LC−MS-based data. Bioinformatics. 2022;38:568–9.
Pang Z, Zhou G, Ewald J, et al. Using MetaboAnalyst 5.0 for LC–HRMS spectra processing, multi-omics integration and covariate adjustment of global metabolomics data. Nat Protoc. 2022;17:1735–61.
Schmid R, Heuckeroth S, Korf A, et al. Integrative analysis of multimodal mass spectrometry data in MZmine 3. Nat Biotechnol. 2023;41(4):447–9.
Tsugawa H, Cajka T, Kind T, et al. MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis. Nat Methods. 2015;12(6):523–6.
Dührkop K, Fleischauer M, Ludwig M, et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods. 2019;16(4):299–302.
Picard M, Scott-Boyer MP, Bodein A, et al. Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J. 2021;22(19):3735–46.
Argelaguet R, Arnol D, Bredikhin D, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21(1):111.
Singh A, Shannon CP, Gautier B, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019;35:3055–62.
Subramanian I, Verma S, Kumar S, et al. Multi-omics data integration, interpretation, and its application. Bioinforma Biol Insights. 2020;14:1177932219899051.
Vahabi N, Michailidis G. Unsupervised multi-omics data integration methods: a comprehensive review. Front Genet. 2022;13.
Mo Q, Wang S, Seshan VE, et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci U S A. 2013;110:4245–50.
Vaske CJ, Benz SC, Sanborn JZ, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics. 2010;26:i237–45.
Rappoport N, Shamir R. NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics. 2019;35:3348–56.
Wang B, Mezlini AM, Demir F, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11:333–7.
Perez-Riverol Y, Bai J, Bandla C, et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 2022;50:D543–52.
Yurekten O, Payne T, Tejera N, et al. MetaboLights: open data repository for metabolomics. Nucleic Acids Res. 2024;52:D640–6.
Rohart F, Gautier B, Singh A, et al. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLOS Comput Biol. 2017;13: e1005752.
Theodoridis G, Gika H, Raftery D, et al. Ensuring fact-based metabolite identification in liquid chromatography-mass spectrometry-based metabolomics. Anal Chem. 2023;95(8):3909–16.
Gil-de-la-Fuente A, Godzien J, Saugar S, et al. CEU mass mediator 3.0: a metabolite annotation tool. J Proteome Res. 2019;18(2):797–802.
Pezzatti J, Boccard J, Codesido S, et al. Implementation of liquid chromatography-high resolution mass spectrometry methods for untargeted metabolomic analyses of biological samples: a tutorial. Anal Chim Acta. 2020;8(1105):28–44.
Anders S, Pyl PT, Huber W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166–9.
Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–30.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc. 1995;57(1):289–300.
Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102:15545–50.
Chen EY, Tan CM, Kou Y, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14:128.
Kirou KA, Lee C, George S, et al. Activation of the interferon-alpha pathway identifies a subgroup of systemic lupus erythematosus patients with distinct serologic features and active disease. Arthritis Rheum. 2005;52:1491–503.
Panwar B, Schmiedel BJ, Liang S, et al. Multi–cell type gene coexpression network analysis reveals coordinated interferon response and cross–cell type correlations in systemic lupus erythematosus. Genome Res. 2021.
Libiseller G, Dvorzak M, Kleb U, et al. IPO: a tool for automated optimization of XCMS parameters. BMC Bioinform. 2015;16:118.
Broeckling CD, Afsar FA, Neumann S, et al. RAMClust: a novel feature clustering method enables spectral-matching-based annotation for metabolomics data. Anal Chem. 2014;86:6812–7.
Shen X, Yan H, Wang C, et al. TidyMass an object-oriented reproducible analysis framework for LC–MS data. Nat Commun. 2022;13:4365.
Fernández-Ochoa Á, Quirantes-Piné R, Borrás-Linares I, et al. A case report of switching from specific vendor-based to R-based pipelines for untargeted LC-MS metabolomics. Metabolites. 2020;10:28.
CMMR - CEU Mass Mediator API in R. 2019 (https://github.com/YaoxiangLi/cmmr).
Wishart DS, Tzur D, Knox C, et al. HMDB: the human metabolome database. Nucleic Acids Res. 2007;35:D521-526.
Aryee MJ, Jaffe AE, Corrada-Bravo H, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–9.
Luo Y, Hitz BC, Gabdank I, et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020;48:D882–9.
Lachmann A, Xu H, Krishnan J, et al. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinforma Oxf Engl. 2010;26:2438–44.
Grames EM, Stillman AN, Tingley MW, et al. An automated approach to identifying search terms for systematic reviews using keyword co-occurrence networks. Methods Ecol Evol. 2019;10:1645–54.
Athieniti E, Spyrou GM. A guide to multi-omics data collection and integration for translational medicine. Comput Struct Biotechnol J. 2022;1(21):134–49.
Niehues A, de Visser C, Hagenbeek FA, et al. A multi-omics data analysis workflow packaged as a FAIR Digital Object. GigaScience. 2024;13:giad115.
Di Tommaso P, Chatzou M, Floden EW, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35:316–9.
Ge SX, Son EW, Yao R. iDEP: an integrated web application for differential expression and pathway analysis of RNA-Seq data. BMC Bioinform. 2018;19:534.
Wang Y, He X, Zheng D, et al. Integration of metabolomics and transcriptomics reveals major metabolic pathways and potential biomarkers involved in pulmonary tuberculosis and pulmonary tuberculosis-complicated diabetes. Microbiol Spectr. 2023;11:e00577-e623. https://doi.org/10.1128/spectrum.00577-23.
Schairer DO, Chouake JS, Nosanchuk JD, et al. The potential of nitric oxide releasing therapies as antimicrobial agents. Virulence. 2012;3:271–9.
Caterino M, Gelzo M, Sol S, et al. Dysregulation of lipid metabolism and pathological inflammation in patients with COVID-19. Sci Rep. 2021;11:2941.
Article CAS PubMed PubMed Central Google Scholar
Dietrich S, Oleś M, Lu J, et al. Drug-perturbation-based stratification of blood cancer. J Clin Invest. 2018;128:427–45. https://doiorg.publicaciones.saludcastillayleon.es/10.1172/JCI93801.
Article PubMed Google Scholar
Riches JC, O’Donovan CJ, Kingdon SJ, et al. Trisomy 12 chronic lymphocytic leukemia cells exhibit upregulation of integrin signaling that is modulated by NOTCH1 mutations. Blood. 2014;123:4101–10.
Article CAS PubMed PubMed Central Google Scholar
Shi X, Facemire L, Singh S, et al. UBA1-CDK16 : A Sex-Specific Chimeric RNA and Its Role in Immune Sexual Dimorphism. BioRxiv Prepr. Serv. Biol. 2024; 2024.02.13.580120
Rock KD, Folts LM, Zierden HC, et al. Developmental transcriptomic patterns can be altered by transgenic overexpression of Uty. Sci Rep. 2023;13:21082.
Article CAS PubMed PubMed Central Google Scholar
Fazazi MR, Ruda GF, Brennan PE, et al. The X-linked histone demethylases KDM5C and KDM6A as regulators of T cell-driven autoimmunity in the central nervous system. Brain Res Bull. 2023;202: 110748.
Article CAS PubMed Google Scholar
Rossi D, Cerri M, Deambrogi C, et al. The prognostic value of TP53 mutations in chronic lymphocytic leukemia is independent of Del17p13: implications for overall survival and chemorefractoriness. Clin Cancer Res. 2009;15:995–1004.
Article CAS PubMed Google Scholar
Wan Y, Wu CJ. SF3B1 mutations in chronic lymphocytic leukemia. Blood. 2013;121:4627–34.
Article CAS PubMed PubMed Central Google Scholar
Yan X-J, Dozmorov I, Li W, et al. Identification of outcome-correlated cytokine clusters in chronic lymphocytic leukemia. Blood. 2011;118:5201–10.
Article CAS PubMed PubMed Central Google Scholar
Ewels PA, Peltzer A, Fillinger S, et al. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020;38:276–8.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We acknowledge the 3TR and PRECISESADS consortia, which provided the data used in this article and allowed us to analyse them.

Funding

This work was supported by the Innovative Medicines Initiative Joint Undertaking under Grant Agreement Number 115565, the resources of which are composed of financial contributions from the European Union’s Seventh Framework Program (FP7/2007–2013) and in-kind contributions from EFPIA companies. CI was funded by the Université de Brest and the Région Bretagne. AFO acknowledges the funding received from “Ayudas al funcionamiento de los Grupos operativos de la Asociación Europea para la Innovación (AEI) en materia de productividad y sostenibilidad agrícolas en el sector del olivar, 2020” (Grant Number GOPO-GR-20–0001).

Author information

Anne Bordron and Christophe Jamin contributed equally to this work.

Authors and Affiliations

LBAI, UMR1227, Univ Brest, Inserm, Brest, France
Cristian Iperi & Anne Bordron
Department of Analytical Chemistry, University of Granada, Granada, Spain
Álvaro Fernández-Ochoa
GENYO, Centre for Genomics and Oncological Research Pfizer, University of Granada, Andalusian Regional Government, PTS Granada, Granada, Spain
Guillermo Barturen & Marta Alarcón-Riquelme
Department of Genetics, Faculty of Sciences, University of Granada, Granada, Spain
Guillermo Barturen
LBAI, UMR1227, Univ Brest, Inserm, Laboratory of Immunology, CHU Brest, Brest, France
Jacques-Olivier Pers, Nathan Foulquier, Eleonore Bettacchioli, Divi Cornec & Christophe Jamin
Institute for Environmental Medicine, Karolinska Institutet, 171 69, Stockholm, Sweden
Marta Alarcón-Riquelme

Authors

Cristian Iperi
Álvaro Fernández-Ochoa
Guillermo Barturen
Jacques-Olivier Pers
Nathan Foulquier
Eleonore Bettacchioli
Marta Alarcón-Riquelme
Divi Cornec
Anne Bordron
Christophe Jamin

Consortia

PRECISESADS Flow Cytometry Study Group, PRECISESADS Clinical Consortium

Contributions

C.I. was in charge of writing and planning the bioinformatic approaches. C.J. and A.B. contributed equally to supervising the work and ensuring the scientific relevance of the article. The other authors evaluated, revised, and approved the article.

Corresponding author

Correspondence to Christophe Jamin.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Additional file 2.

Additional file 3.

Additional file 4.

Additional file 5.

Additional file 6.

Additional file 7.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Iperi, C., Fernández-Ochoa, Á., Barturen, G. et al. BiomiX, a user-friendly bioinformatic tool for democratized analysis and integration of multiomics data. BMC Bioinformatics 26, 8 (2025). https://doi.org/10.1186/s12859-024-06022-y

Received: 27 June 2024
Accepted: 23 December 2024
Published: 10 January 2025
DOI: https://doi.org/10.1186/s12859-024-06022-y
