- Research
- Open access
CoMIT: a bioinformatic pipeline for risk-based prediction of COVID-19 test inclusivity
BMC Bioinformatics volume 26, Article number: 51 (2025)
Abstract
Background
The global Coronavirus Disease 2019 (COVID-19) pandemic highlighted the need to quickly diagnose infections to identify and prevent viral spread in the population. In response to the pandemic, BioFire Defense leveraged its PCR-based “lab-in-a-pouch” technology for expedited development of the BioFire® COVID-19 Test, a novel in vitro diagnostic detecting SARS-CoV-2 nucleic acid in human samples. Following clearance of an in vitro diagnostic device, regulatory bodies such as the U.S. Food and Drug Administration (FDA) require regular post market surveillance to monitor test performance against viral lineages circulating in the field, using predictive in silico inclusivity evaluations. Exponential increases in the number of sequences deposited in bioinformatic repositories such as GISAID, during the pandemic, impeded progress in meeting these post market requirements. In response, BioFire Defense developed a new bioinformatic tool to overcome scalability problems and the loss of accuracy encountered with the standard inclusivity method.
Results
The Coronavirus Monitoring for Inclusivity Tool (CoMIT) uses the Variant Sorter Algorithm to sidestep multiple sequence alignments, a significant barrier inherent in the standard inclusivity method. The implementation of CoMIT and its Variant Sorter Algorithm are described. Automated summary tables and visualizations from a typical inclusivity evaluation are presented. We report our approach to filter and display relevant information in the pipeline outputs using risk factors tied to test performance.
Conclusions
BioFire Defense has developed CoMIT, an automated bioinformatic pipeline for efficient processing and reporting of variant inclusivity from the GISAID EpiCoV™ repository. This tool ensures continuous and comprehensive post market evaluations of BioFire COVID-19 Test performance even from datasets large enough to impede standard inclusivity analyses. CoMIT’s low computational space complexity and modular code allow this tool to be generalized for inclusivity monitoring of multianalyte or single analyte tests with complex assay designs and/or highly variable targets. CoMIT’s databasing capabilities and metadata handling hold the potential for new investigations to improve readiness for future outbreaks.
Background
The Coronavirus Disease 2019 (COVID-19) pandemic brought challenges never faced in the modern era, necessitating accelerated timelines and processes to address urgent public health needs. The global outbreak spawned the rapid development of diagnostic tests capable of detecting the presence of SARS-CoV-2 (the etiological agent of COVID-19) in human samples [1,2,3,4,5,6]. In response, BioFire Defense leveraged its existing BioFire® FilmArray® PCR-based technology to develop a test specific for the identification of SARS-CoV-2 nucleic acid from patient samples. On March 24, 2020, BioFire Defense received initial Emergency Use Authorization (EUA) from the U.S. Food and Drug Administration (FDA) for the BioFire® COVID-19 Test. On November 1, 2021, 510(k) clearance was granted for the BioFire® COVID-19 Test 2, which has identical chemistry to the EUA version, making it the first single-analyte, PCR-based COVID-19 in vitro diagnostic (IVD) device to receive FDA clearance.
Over the course of the pandemic, the SARS-CoV-2 genome evolved rapidly, resulting in a burgeoning population of genomic variants. The Global Initiative on Sharing All Influenza Data (GISAID) EpiCoV™ database was quickly organized as the public sequence repository for compiling viral genomes from human cases [7, 8] and many countries undertook large scale sequencing efforts. The number of deposited sequences was beginning to accelerate when the BioFire® COVID-19 Test EUA was granted, just 13 days after the World Health Organization (WHO) declared the COVID-19 outbreak a global pandemic.
As more sequence data became available, the need to frequently assess SARS-CoV-2 variants and their potential impacts to test performance was quickly apparent [9]. Such evaluations of test inclusivity must consider a combination of genomic factors and lineage-associated clinical phenotypes (such as increased transmissibility) in order to make informed decisions regarding when corrective actions or mitigations may be needed. For example, mutations falling within primer binding regions can reduce template-primer affinity and extensibility to retard or prevent PCR amplification [9,10,11,12]. This risk may increase in the case of highly transmissible variants. Lineages harboring these mutations may quickly gain prominence in the population and escape detection in clinical specimens, especially with low viral titers. FDA and other regulatory bodies require regular viral sequence monitoring of authorized and cleared products to ensure these tests continue to identify positive cases, including emergent strains circulating both in the US and globally [13].
Challenges in the in silico inclusivity evaluation process
In silico inclusivity evaluations use publicly available sequence data to approximate risks to detection in the field. The process includes building a multiple sequence alignment (MSA) of intended sequence targets (inclusive sequences) and comparing nucleotide changes across primer binding regions against a reference sequence. Figure 1 reports high-level steps for standard in silico inclusivity evaluations of SARS-CoV-2 variants, similar to inclusivity processes reported in the literature [9]. Online tools such as the Basic Local Alignment Search Tool (BLAST) can also be used, where amplicons or primers are queried against the National Center for Biotechnology Information (NCBI) sequence library [14], although limitations exist with this method [15]. At BioFire Defense, datasets for inclusive sequences are generally small and their evaluations managed using the MSA-based approach outlined in Fig. 1.
As global outbreaks and subsequent sequence submissions caused a surge in publicly available SARS-CoV-2 sequence data, concerns grew for monitoring test performance. The combination of increasingly large data sets and the accelerated demand for new analyses exposed limitations in the standard in silico inclusivity evaluation process. Computational bottlenecks in the MSA step resulted in increased failures and reduced accuracy of large alignments, slowing the evaluation progress and complicating data interpretation. Figure 2 shows the number of complete, high coverage, human-host SARS-CoV-2 sequences collected from January 2020 through August 2022 and submitted to the GISAID EpiCoV™ database through 2 November 2022. The exponential growth of sequence datasets combined with the demands of responding to FDA monitoring requirements and customer inquiries necessitated development of a more scalable and reliable evaluation approach. Here, we describe a fully automated and adaptive bioinformatic pipeline for comprehensive monitoring and reporting BioFire COVID-19 Test performance using a predictive, risk-based strategy.
Global Initiative on Sharing All Influenza Data (GISAID) EpiCoV™ SARS-CoV-2 Sequence Submissions by Collection Month. Complete, high coverage, human-host viral sequences with submission dates through 2 November 2022 and collection dates from January 2020 through August 2022 are included. Vertical lines indicate the dates when emergency use authorization (EUA) and 510(k) clearance were granted to BioFire Defense
Implementation
The variant sorter algorithm
CoMIT is written as an R package. The code executes a pipeline that initially builds an empty database housing data for an evaluation (an existing database can also be updated). The pipeline takes GISAID EpiCoV™ sequence submissions (FASTA) and associated metadata (TSV) files as the input. Sequences are processed by the Variant Sorter Algorithm and the resulting data are added to the database. The BioFire COVID-19 Test uses seven nested and multiplexed SARS-CoV-2 target regions, or assays, requiring inclusivity surveillance of 30 individual primers. A diagram of the Variant Sorter Algorithm – the main portion of the CoMIT pipeline – with its inputs, high-level operations, and database file output is shown in Fig. 3.
The Variant Sorter Algorithm in the Coronavirus Monitoring for Inclusivity (CoMIT) Pipeline. Sequences containing a primer-spanning match to the reference sequence are identified in the first classification step (Match bin). The red symbol indicates the portion of the algorithm where a successful classification process stops. When no match is found, the algorithm compares the sequence region against previously characterized primer variants in the database, and sequences with exact matches to existing variants are assigned the same primer variant ID (Known Primer Variant bin). When no classification has been made, the algorithm generates a library of all possible single mismatch variations and compares them to the unknown primer variant to find a match. If still no match is found, the algorithm generates all possible double mismatches for the primer and searches the expected primer region for a match. After these steps if a match has not been found, a pairwise sequence alignment against the reference sequence is used to classify the mutation (New Primer Variant bin). The novel primer variant is given a new primer variant ID and logged in the database
The Variant Sorter Algorithm uses an iterative string-matching comparison to identify primer variants (i.e., mutations exclusively found within primer binding regions of the test). For each sequence in the submission set, a small search space is defined around the presumptive location of the primer binding region. Novel primer variants identified by the Variant Sorter Algorithm are assigned a unique identification number and their mutation characteristics (e.g., primer affected, position, type) are captured in the database. These steps are repeated for each primer binding region. This process ensures that only a small proportion of sequences require alignment for classification. A recent inclusivity run showed 0.2% of sequences required alignment (50/21947 sequences). Pairwise alignments are performed using the DECIPHER R package [16]. Sequences processed by the Variant Sorter Algorithm are stored in a relational database generated for the run (alternatively, an existing database can be specified to which new data is appended during the run). The database holds 11 tables storing sequence, assay, and primer variant data. A database schema is provided as an additional file (Additional File 1).
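The classification cascade described above can be illustrated with a minimal sketch. It is written in Python for illustration only (CoMIT itself is an R package), and the function names, bin labels, and example sequences are hypothetical. The sketch assumes an unambiguous A/C/G/T alphabet and a search window on the same strand as the primer; it does not handle reverse complements or ambiguity codes, and it only signals when a pairwise alignment would be needed rather than performing one.

```python
from itertools import combinations, product

BASES = "ACGT"


def mismatch_library(primer, n):
    """All sequences that differ from `primer` at exactly `n` positions."""
    library = set()
    for positions in combinations(range(len(primer)), n):
        # For each chosen position, allow only bases that differ from the reference.
        options = [[b for b in BASES if b != primer[p]] for p in positions]
        for substitution in product(*options):
            variant = list(primer)
            for p, base in zip(positions, substitution):
                variant[p] = base
            library.add("".join(variant))
    return library


def classify_region(region, primer, known_variants):
    """Bin one search window for one primer.

    `known_variants` maps a previously catalogued variant sequence to its ID.
    Returns (bin_name, matched_sequence), or ("needs_alignment", None).
    """
    k = len(primer)
    # Every primer-length substring of the search window.
    windows = {region[i:i + k] for i in range(len(region) - k + 1)}

    if primer in windows:                    # exact match to the reference
        return ("match", primer)
    known_hits = sorted(windows & set(known_variants))
    if known_hits:                           # previously catalogued variant
        return ("known_variant", known_hits[0])
    for n, label in ((1, "single_mismatch"), (2, "double_mismatch")):
        hits = sorted(windows & mismatch_library(primer, n))
        if hits:                             # generated mismatch libraries
            return (label, hits[0])
    return ("needs_alignment", None)         # fall back to pairwise alignment
```

For a 10-base primer the single-mismatch library holds 30 sequences and the double-mismatch library 405, so set lookups stay cheap. This is consistent with the observation that only a small share of sequences reach the alignment step.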
Structured query language database processing and visualization code
After algorithm processing, other CoMIT package functions can be run on a database to generate summary tables and visualizations. Risk criteria based on primer variant prevalence, mutation severity, co-occurrence, and variant lineage type help identify primer variants predicted to be the highest risk to inclusivity. These criteria are applied in different ways to filter, highlight, and stratify data and can be modified, as needed. Figure 4 shows a flowchart for a typical in silico inclusivity analysis using the CoMIT pipeline.
Flowchart for the Typical in silico Inclusivity Evaluation Using the Coronavirus Monitoring for Inclusivity (CoMIT) Pipeline. The CoMIT pipeline requires two inputs: GISAID sequence submissions restricted by complete, high coverage and human-host filters, and GISAID sequence metadata. The dotted red line highlights the portion of the pipeline involved in a Variant Sorter Run, including processing by the Variant Sorter Algorithm. Results are captured in a SQLite database where SQL queries are leveraged as inputs for the visualization and summary code (in R). Outputs of the pipeline are the visualizations. A variant mapping file is updated for each evaluation using two website sources [17, 18]. DBMS: database management system; GISAID: Global Initiative on Sharing All Influenza Data; SQL: structured query language
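To make the database-then-query step concrete, the following Python sketch builds a small in-memory SQLite database and runs one prevalence query. The three-table schema, column names, and records are invented for illustration; the actual CoMIT schema has 11 tables and is given in Additional File 1, and the package's own summary code is written in R.

```python
import sqlite3

# Hypothetical, simplified schema (not the CoMIT schema).
SCHEMA = """
CREATE TABLE sequence (seq_id TEXT PRIMARY KEY, lineage TEXT);
CREATE TABLE primer_variant (variant_id INTEGER PRIMARY KEY, assay TEXT, mismatch TEXT);
CREATE TABLE sequence_variant (
    seq_id TEXT REFERENCES sequence(seq_id),
    variant_id INTEGER REFERENCES primer_variant(variant_id)
);
"""

# Percent of all sequences in the database that carry each primer variant.
PREVALENCE_QUERY = """
SELECT pv.variant_id,
       pv.assay,
       COUNT(DISTINCT sv.seq_id) AS n_sequences,
       100.0 * COUNT(DISTINCT sv.seq_id) / (SELECT COUNT(*) FROM sequence) AS pct
FROM primer_variant AS pv
JOIN sequence_variant AS sv ON sv.variant_id = pv.variant_id
GROUP BY pv.variant_id, pv.assay
ORDER BY n_sequences DESC, pv.variant_id
"""


def build_demo_db():
    """Create an in-memory database holding a few invented records."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany(
        "INSERT INTO sequence VALUES (?, ?)",
        [("EPI_1", "XBB.1.5"), ("EPI_2", "XBB.1.5"), ("EPI_3", "BQ.1"), ("EPI_4", "BQ.1")],
    )
    con.executemany(
        "INSERT INTO primer_variant VALUES (?, ?, ?)",
        [(1, "2a", "A>G"), (2, "2c", "C>T")],
    )
    con.executemany(
        "INSERT INTO sequence_variant VALUES (?, ?)",
        [("EPI_1", 1), ("EPI_3", 1), ("EPI_2", 2)],
    )
    return con


def variant_prevalence(con):
    """Return (variant_id, assay, n_sequences, pct) rows, most prevalent first."""
    return con.execute(PREVALENCE_QUERY).fetchall()
```

Keeping the classified results in a database means each summary table or figure can be driven by its own query without reprocessing the sequence files.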
Risk-based reporting
Key factors for evaluating risk are described in this section, including mutational severity, co-occurrence, prevalence, and variant lineage. Figure 5 provides a summary of these considerations. Sequences harboring mutations to any test primers are identified in the evaluation, and characteristics of mutations (such as mutation position along the primer-spanning region) are leveraged to predict impacts at the individual assay level. Mismatches falling within the last five bases of the 3’ end of a primer binding region are more likely to interfere with amplification [11, 19,20,21,22]. Therefore, sequences carrying 3’ end mutations are labeled as a severity risk in the evaluation (Fig. 5, orange circle).
Risk Considerations for Evaluating SARS-CoV-2 Variant Detection. The diagram outlines factors with increased risk for affecting overall test performance. All mutations are tracked and reported based on risk criteria and elevated risks are identified when overlapping risk characteristics are present (labels 1–4). Highest risk to overall test performance is predicted when sequences carrying primer-spanning mutations both negatively impact performance of all or several individual assays and occur at a significant prevalence in the sequence dataset (e.g., 5% or higher) (label 4)
When considering the risk of complete test failure, inclusivity evaluations may be further complicated by complex test designs. The BioFire COVID-19 Test 2 leverages a nested, multiplex PCR approach, targeting five independent regions of the SARS-CoV-2 genome. Detection of the expected amplicon from only one region is required to successfully elicit a SARS-CoV-2 detected result. Any sequences with mismatches to all or multiple assay primers are identified as a co-occurrence risk in the assessments (Fig. 5, grey circle).
Genetic evolution of SARS-CoV-2 variants resulting in increased pathogenicity of the virus in human hosts can have significant public health impacts [23]. The Centers for Disease Control and Prevention and WHO evaluate and classify emerging variants based on potential or known impacts to effectiveness of medical treatments, severity of disease, and transmissibility [18, 24]. Variant lineages associated with official designations given by US and global health organizations (e.g., Variants of Concern) are considered a prevalence risk (Fig. 5, yellow circle). The prevalence risk is also assessed for unclassified variant lineages when represented at a significant frequency in the sequence dataset [13].
Sequences characterized by an overlap of any two risk factors (Fig. 5, regions indicated by 1–3) would be considered high risk, whereas sequences characterized by all risk factors (Fig. 5, area indicated by 4) are of the greatest concern due to the potential negative impacts on diagnostic accuracy. Sequences carrying primer spanning mutations flagged as high risk in these predictive evaluations are escalated for wet benchtop testing and/or thermodynamic modeling analysis [15, 25, 26].
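The overlap logic can be sketched as a small rule set. The Python below is an illustrative reading of the criteria in this section, not the CoMIT implementation. The five-base 3’ window and the 5% prevalence example come from the text; the two-assay co-occurrence cutoff, the tier labels, and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimerMismatch:
    assay: str          # assay whose primer carries the mismatch
    primer_length: int  # length of that primer in bases
    position: int       # 1-based mismatch position counted from the 5' end


def in_three_prime_window(m, window=5):
    """True when the mismatch lies within the last `window` bases of the 3' end."""
    return m.position > m.primer_length - window


def risk_flags(mismatches, designated_lineage, dataset_frequency,
               min_assays=2, prevalence_threshold=0.05):
    """Evaluate the three risk factors for one sequence.

    `min_assays` is an assumed cutoff for "multiple assays";
    `dataset_frequency` is a fraction (0.05 means 5%).
    """
    affected = {m.assay for m in mismatches}
    return {
        "severity": any(in_three_prime_window(m) for m in mismatches),
        "co_occurrence": len(affected) >= min_assays,
        "prevalence": bool(designated_lineage
                           or dataset_frequency >= prevalence_threshold),
    }


def risk_tier(flags):
    """All factors present: highest; any two: high; otherwise tracked only."""
    n = sum(flags.values())
    if n == len(flags):
        return "highest"
    if n >= 2:
        return "high"
    return "tracked"
```

Under these assumptions, a sequence from a designated lineage with a single 3’ end mismatch in one assay would be tiered "high", and it would move to "highest" only if mismatches also affected a second assay.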
Results
Five automated visualizations were developed to summarize processed sequence data and enable clear and concise reporting of results. Example visualization outputs for a candidate evaluation are shown as figures and tables (Tables 1, 2 and 3, Figs. 6 and 7) and as additional files (Additional Files 2–5). Each output in the pipeline features two or more risk indicators (i.e., co-occurrence, prevalence, lineage, and growth). All outputs (except Fig. 7) can be filtered based on mutational severity (i.e., when a primer-spanning mutation is positioned within 10 base pairs of the 3’ end).
Primer Variant Combinations in a Typical in silico Inclusivity Evaluation. The figure is divided into two sections (a heatmap and histogram) to summarize co-occurring primer variant combinations in the dataset. The number of assays affected is shown across the top with each column relating to a specific primer variant combination group. The top heatmap shows abbreviated assay names along the left y-axis (2a, 2c, 2d, 2e, 2f, 2g, and 2h). Purple shading indicates the specific assays impacted in the various primer variant combination groups. The number of assays affected increases from left to right. The bottom histogram reports the percent frequency of sequences in the dataset represented in each primer variant combination group
Primer Variant Characteristics and Trending. Assay names of affected primers are shown along the top with shared primer regions indicated by both assay names (A, C, C/D, D, E, E/F, G, H). The column indicated by All combines all assay data. Each row relates to the specific primer variant indicated by the assay name column. The top part of the figure shows a histogram with raw count frequencies of each primer variant. The middle section shows characteristics of each primer variant including prevalence, growth trend compared with the previous three-month dataset*, primer affected, mutation position(s), and primer/template mismatch (or DL## to indicate a deletion and number of base pairs spanning the deletion). The bottom section of the figure provides a bar chart showing the lineage distribution of the primer variant. *Up arrow indicates a 0.1 or greater delta increase, downward arrow indicates 0.1 or greater delta decrease, equivalent arrows indicate a less than 0.1 growth change
Database breakdown table
The Database Breakdown Table provides a summary of collection date, variant identity (Pangolin lineage and WHO label), sequence frequencies and frequency changes of variants included in the analysis. These data are taken from the GISAID metadata associated with each sequence analyzed and can be used to clearly summarize the dataset included in the evaluation.
Table 1 shows an example Database Breakdown table for a typical in silico inclusivity evaluation. Date columns refer to sample collection dates, which should include sequences from patient samples collected in the most recent three-month period. Sequence frequencies are reported for the entire dataset (All Sequences) and stratified by Pangolin lineage [27] and WHO label (i.e., Variants of Concern, Variants of Interest, Variants Under Monitoring); these annotations are updated for every evaluation using a variant mapping file sourced from publicly available information [17, 18]. Frequency changes are compared between the one-month sequence data (newest) and a superset of the most recent three months. Growth is represented as yellow shading when delta frequencies increase or decrease between three- and one-month sequence datasets. Variant lineages with notable frequency changes in the most recent month (i.e., greater than or equal to five percent change) are shaded in this example. Delta frequency thresholds can be modified, as needed.
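The growth comparison can be sketched as follows. This Python example assumes that the delta is the one-month percentage minus the three-month percentage, expressed in percentage points, and that the five percent threshold applies to its absolute value; the function names and the counts in the usage example are made up.

```python
from collections import Counter


def lineage_percentages(lineages):
    """Percent frequency of each lineage in a list of per-sequence lineage labels."""
    counts = Counter(lineages)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lineage: 100.0 * n / total for lineage, n in counts.items()}


def frequency_changes(three_month, one_month, threshold=5.0):
    """Compare the newest month against the three-month superset.

    Returns {lineage: {"three_month_pct", "one_month_pct", "delta", "shaded"}}.
    """
    base = lineage_percentages(three_month)
    recent = lineage_percentages(one_month)
    rows = {}
    for lineage in sorted(base):
        one_pct = recent.get(lineage, 0.0)
        # Rounding guards against floating-point noise at the threshold.
        delta = round(one_pct - base[lineage], 6)
        rows[lineage] = {
            "three_month_pct": base[lineage],
            "one_month_pct": one_pct,
            "delta": delta,
            "shaded": abs(delta) >= threshold,
        }
    return rows
```

For example, a lineage at 50% of the three-month set and 70% of the newest month would have a delta of +20 points and be shaded, whereas a lineage holding steady at 20% would not.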
Identifiable mutations in each assay region
As shown in Fig. 3, the CoMIT tool first bins data according to previously identified mutations in the assay primer regions. These mutations are summarized in a table like Table 2. The number and frequency of each lineage is recorded as an indicator of prevalence within the dataset, in this example over a 3-month period. Within these lineages, the frequencies of sequences with observed mutations are recorded in each assay column (e.g., assays 2a, 2c, etc.). This gives visibility to assays that may have reduced sensitivity with emerging lineage variation. The COVID-19 Test comprises seven assays (2a, 2c–2h); the Co-occurring Mutated Sequences column indicates the frequency of sequences within each lineage that contain, in this example, mutations in 5 or 6 of the assays on the Test and that could be at risk of missed or late detection.
Table 2 provides an example summary table for sequences containing identifiable primer spanning mutations across COVID-19 Test assays. The Sequences by Lineage column shows the most recent three-month period sequence frequencies (count and rate) stratified by lineage exactly matching the Database Breakdown table. The remaining columns report lineage-stratified frequencies of sequences harboring an assay-specific primer variant (i.e., a primer-spanning mutation or set of mutations) (Mutated Sequences by Assay) and sequences with co-occurring primer variants across multiple assays (Co-occurring Mutated Sequences). The bottom row shows sequence frequencies of primer variants by assay (Summary: All Sequences by Assays). Sequence frequencies below one percent are shaded in blue; frequencies equal to or greater than five percent are shaded yellow. A summary table filtering for 3’ end mutations can be generated to represent high-risk mutations (Additional File 2). A version of this table showing an expanded section for Co-occurring Mutated Sequences is also available as an additional file (Additional File 3).
Table 3 provides a detailed breakdown of sequences with mutations under multiple assay primers. Specifically, it shows the number of assays affected by a mutation under those primers, organized by lineage. The columns show the number of assays affected, increasing from left to right. Sequence frequencies (counts and rates) are reported by lineage (Sequences by Lineage) and by increasing co-occurrence risk based on the total number of assays impacted (# Assays Affected, columns 0 through ≥ 6). The Summary: All Sequences by Assays section shows sequence frequencies based on co-occurrence risk. Blue shading indicates sequence frequencies below one percent; frequencies equal to or greater than five percent are shaded yellow. A version of this table with filtering for 3’ end mutations is provided (Additional File 4).
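A tally of this kind might be computed as in the Python sketch below. The input format, function names, and example records are hypothetical; the cap at six assays and the one percent and five percent shading cutoffs follow the description of Table 3.

```python
from collections import Counter, defaultdict


def cooccurrence_table(records, cap=6):
    """Count sequences per lineage by number of assays affected.

    `records` is an iterable of (lineage, set_of_affected_assays);
    counts at or above `cap` are pooled into one column.
    """
    table = defaultdict(Counter)
    for lineage, assays in records:
        table[lineage][min(len(assays), cap)] += 1
    return {lineage: dict(counts) for lineage, counts in table.items()}


def shade(count, total):
    """Shading rule: below 1% is blue, 5% or higher is yellow, else unshaded."""
    pct = 100.0 * count / total
    if pct >= 5.0:
        return "yellow"
    if pct < 1.0:
        return "blue"
    return None
```

Pooling the right-most column keeps the table width fixed regardless of how many assays a test contains, at the cost of not distinguishing six from seven affected assays.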
Figure 6 visualizes the lineages with mutations under multiple sets of assay primers, along with their frequencies (counts and rate). All primer-spanning mutations are reported. Purple shading indicates the impact of primer variant combinations on each individual assay and combinations when they are compounded across assays (indicating co-occurrence risks). Sequence counts and percent frequency for each combination are shown at the bottom of the figure. This figure can be filtered for 3’ end mutations only (Additional File 5).
Trending of primer variants
Post-market surveillance not only tracks newly emerging sequence variants but also trends their frequencies. This monitoring helps assess the risk of missed detection based on prevalence. Variant prevalence becomes one of the risk criteria used to assess whether the diagnostic test is still functional in an evolving outbreak.
Figure 7 summarizes characteristics of individual primer variants at or above 0.1% frequency in the sequence dataset compared with the previous 3-month period (note: these datasets represent nonoverlapping time periods totaling six months). The figure consists of three sections: a histogram showing assay location of primer variants and their percent frequencies in the current dataset (top section), a table detailing the primer variant characteristics and trending based on a comparison with the previous 3-month period (middle section), and a stacked bar graph displaying the primer variant distribution across lineages (bottom section). Trending arrows indicate a frequency change of 0.1% or greater in the sequence dataset compared with the previous 3-month period. An equivalent symbol represents a delta frequency less than 0.1%. The delta frequency thresholds defining inclusion criteria and trending symbols can be adjusted, as needed.
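The inclusion filter and trending symbols can be expressed compactly. The Python sketch below assumes both thresholds are in percentage points and uses the text labels "up", "down", and "equal" in place of the figure's arrows; the names are illustrative rather than taken from the CoMIT package.

```python
def trend_symbol(current_pct, previous_pct, threshold=0.1):
    """Label the change in percent frequency between two nonoverlapping periods."""
    # Rounding guards against floating-point noise, e.g. 0.3 - 0.2 in binary floats.
    delta = round(current_pct - previous_pct, 9)
    if delta >= threshold:
        return "up"
    if delta <= -threshold:
        return "down"
    return "equal"


def trending_rows(current, previous, min_pct=0.1):
    """Keep variants at or above `min_pct` in the current period and label each.

    `current` and `previous` map a primer variant ID to its percent frequency;
    a variant absent from the previous period is treated as 0%.
    """
    return {
        variant_id: trend_symbol(pct, previous.get(variant_id, 0.0))
        for variant_id, pct in current.items()
        if pct >= min_pct
    }
```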
Taken together, the automated outputs of the CoMIT pipeline provide summary tables and visualizations with risk-based features and modifiable thresholds for added flexibility in reporting evaluation results.
Discussion
CoMIT was developed specifically for in silico inclusivity evaluations of the BioFire COVID-19 Test, a single analyte, PCR-based IVD designed for use with BioFire® FilmArray® Systems. Evaluating SARS-CoV-2 genomes as they evolve through human infection is required by regulatory bodies to ensure reliable detection of COVID-19 cases in the US and globally [28]. The standard inclusivity approach includes a sequence alignment step, which presented a computational bottleneck with the increasing volume and rate of sequence data needing to be analyzed. The Variant Sorter Algorithm identifies and catalogues primer variants using iterative string matching and binning functions, an efficient process to sidestep the predominance of MSAs in the standard approach. The bioinformatic analysis and visualization pipeline handles large volumes of sequence data with automated results reporting and databasing capabilities for regular comprehensive post market in silico inclusivity monitoring. CoMIT’s low computational space complexity requires minimal memory, allowing it to be run on a personal computer.
In silico inclusivity monitoring serves many purposes and its results inform different audiences: online to customers in the BioFire COVID-19 Test Reactivity Technical Note [29], to regulators in FDA submissions, and companywide as required for internal trending purposes. For added flexibility and clarity in reporting, the pipeline applies risk-based parameters to summary tables and visualizations, as primer variants with these characteristics pose the greatest risks to overall test performance. For example, figures and tables can be filtered to report only primer variants likely to disrupt the PCR reaction (i.e., 3’ end mutations). Co-occurrence, lineage, frequency, and delta frequency (i.e., growth) are also featured prominently in the outputs. The visualizations leverage auto-generated shading, data stratification, and symbols to identify prevalence risks in currently circulating variants using both lineage associations and growth characteristics of unclassified sequence populations. These risk criteria can be adjusted as needed to align with FDA or other post market requirements. High risk sequences can be flagged for wet benchtop testing to empirically confirm any predicted performance impacts.
Despite being built for in silico inclusivity testing of COVID-19 tests, CoMIT has been developed as an accessible and user-friendly R package. Researchers can easily download and utilize CoMIT to query and test the inclusivity of their own primer sequences against any organism in the GISAID database. Details on accessing and downloading the CoMIT R package can be found in the Availability and Requirements section.
CoMIT’s Variant Sorter Algorithm has been adapted at BioFire Defense for inclusivity monitoring of different pathogens. We developed a modified version of CoMIT to evaluate in silico inclusivity of the Lassa virus for the BioFire® Global Fever Special Pathogens Panel (an IVD cleared by the FDA). The Lassa assays have complex designs because of the genetic diversity of the Lassa virus species which can be as high as 24.6% between lineages [30]. The pipeline and algorithm are currently being expanded for processing sequences for multianalyte panels.
CoMIT is primarily designed to assess the potential impact of sequence variants on the efficacy of a diagnostic test, and its results can be used to issue a warning when a new variant is likely to escape detection. However, CoMIT could also be leveraged in epidemiology to monitor viral evolution, provide early detection of emerging variants, and inform outbreak response. CoMIT’s databasing capability allows for analyses of sequence datasets to gain new information. GISAID metadata contains details on sequence submissions, including variant information (i.e., Nextstrain clade, variant and Pangolin lineage), case data (i.e., location of exposure, demographics, reporting hospital/laboratory), and amino acid change constellation summaries. Tracking changes in viral properties such as rate and location of spread, disease severity, and variant lineage could aid in forecasting where new outbreaks may occur and when appropriate countermeasures may be needed.
The CoMIT pipeline is currently limited to evaluating only primer binding sites. Successful detection on the BioFire® FilmArray® Systems depends on post run melting temperature (Tm) analyses, and variants with mutations (such as large indels in the amplicon region) that fundamentally change characteristics of the amplified region could result in a missed detection. We are developing an amplicon tracking feature for monitoring changes to the inner amplicon region. Our inner amplicon tracker will leverage thermodynamic models to predict when sequence changes, such as indels or accumulation of single nucleotide polymorphisms, impact Tm. This new feature highlights the adaptability of CoMIT for improved predictions.
Conclusion
CoMIT leverages publicly available SARS-CoV-2 sequence data and metadata in the GISAID EpiCoV™ repository to predict BioFire COVID-19 Test performance in the field. The pipeline can process large datasets with low computational space complexity and leverages adjustable, risk-based summarization features for easily digestible reports of highly complex testing targets. Its flexible database design and improved metadata handling provide opportunities for new epidemiological investigations of both emerging and archived case data, with the potential to improve readiness for future outbreaks as an early warning system for new variants.
Availability and Requirements
Project name: The Coronavirus Monitoring for Inclusivity (CoMIT) Pipeline Project
Project home page: https://bitbucket.org/biofiredefense/comit/src/main/
Operating system: Platform independent
Programming language: R
Other requirements: Package dependencies
License: CC BY-NC 4.0
Any restrictions to use by non-academics: Yes
Availability of data and materials
The in silico inclusivity evaluation presented in this report is based on 24,911 SARS-CoV-2 sequences and associated metadata available from January 1, 2023, up to March 31, 2023, via gisaid.org/EPI_SET_230509yp. A Supplemental Table describing the sequence dataset is provided as an additional file (Additional File 6). The CoMIT R Package and instructions for its use are available at https://bitbucket.org/biofiredefense/comit/src/main/.
Abbreviations
- COVID-19: Coronavirus disease 2019
- CoMIT: Coronavirus Monitoring for Inclusivity Tool
- EUA: Emergency use authorization
- FDA: U.S. Food and Drug Administration
- IVD: In vitro diagnostic
- GISAID: Global Initiative on Sharing All Influenza Data
- WHO: World Health Organization
- MSA: Multiple sequence alignment
- Tm: Melting temperature
References
Gao J, Quan L. Current status of diagnostic testing for SARS-CoV-2 infection and future developments: a review. Med Sci Monit Int Med J Exp Clin Res. 2020;17(26):e928552.
Nguyen NNT, McCarthy C, Lantigua D, Camci-Unal G. Development of diagnostic tests for detection of SARS-CoV-2. Diagnostics. 2020;10(11):905.
Ravi N, Cortade DL, Ng E, Wang SX. Diagnostics for SARS-CoV-2 detection: a comprehensive review of the FDA-EUA COVID-19 testing landscape. Biosens Bioelectron. 2020;1(165):112454.
Jayamohan H, Lambert CJ, Sant HJ, Jafek A, Patel D, Feng H, et al. SARS-CoV-2 pandemic: a review of molecular diagnostic tools including sample collection and commercial response with associated advantages and limitations. Anal Bioanal Chem. 2021;413(1):49–71.
Jalandra R, Yadav AK, Verma D, Dalal N, Sharma M, Singh R, et al. Strategies and perspectives to develop SARS-CoV-2 detection methods and diagnostics. Biomed Pharmacother. 2020;1(129):110446.
Mitchell SL, St George K, Rhoads DD, Butler-Wu SM, Dharmarha V, McNult P, Miller MB. Understanding, verifying, and implementing emergency use authorization molecular diagnostics for the detection of SARS-CoV-2 RNA. J Clin Microbiol. 2020;58(8):e00796.
GISAID Initiative [Internet]. [cited 2022 Dec 20]. Available from: https://www.epicov.org/epi3/frontend#56093
Khare S, Gurry C, Freitas L, Schultz MB, Bach G, Diallo A, et al. GISAID’s Role in pandemic response. China CDC Wkly. 2021;3(49):1049–51.
Khan KA, Cheung P. Presence of mismatches between diagnostic PCR assays and coronavirus SARS-CoV-2 genome. R Soc Open Sci. 2020;7(6):200636.
Cha RS, Thilly WG. Specificity, efficiency, and fidelity of PCR. Genome Res. 1993;3(3):S18-29.
Bru D, Martin-Laurent F, Philippot L. Quantification of the detrimental effect of a single primer-template mismatch by real-time PCR using the 16S rRNA gene as an example. Appl Environ Microbiol. 2008;74(5):1660–3.
Rejali NA, Moric E, Wittwer CT. The effect of single mismatches on primer extension. Clin Chem. 2018;64(5):801–9.
Policy for Evaluating Impact of Viral Mutations on COVID-19 Tests (Revised) - Guidance for Test Developers and Food and Drug Administration Staff.
BLAST: Basic Local Alignment Search Tool [Internet]. [cited 2023 Feb 24]. Available from: https://blast.ncbi.nlm.nih.gov/Blast.cgi
SantaLucia J. Physical Principles and Visual-OMP Software for Optimal PCR Design. In: Yuryev A, editor. PCR Primer Design [Internet]. Totowa, NJ: Humana Press; 2007 [cited 2022 Sep 20]. p. 3–33. (Walker JM, editor. Methods in Molecular Biology™; vol. 402).
Wright ES. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics. 2015;16(1):322.
Pango designation [Internet]. CoV-lineages; 2023 [cited 2023 Feb 7]. Available from: https://github.com/cov-lineages/pango-designation/blob/106720cbb83f1cd10a55ab537f84967d8b6c2e7a/lineage_notes.txt
Tracking SARS-CoV-2 variants [Internet]. [cited 2022 Nov 8]. Available from: https://www.who.int/activities/tracking-SARS-CoV-2-variants
Rychlik W. Priming efficiency in PCR. Biotechniques. 1995;18(1):84–6.
Wu JH, Hong PY, Liu WT. Quantitative effects of position and type of single mismatch on single base primer extension. J Microbiol Methods. 2009;77(3):267–75.
Stadhouders R, Pas SD, Anber J, Voermans J, Mes THM, Schutten M. The effect of primer-template mismatches on the detection and quantification of nucleic acids using the 5′ nuclease assay. J Mol Diagn. 2010;12(1):109–17.
Kim M, Smith WA, Van Hollebeke H. Personal communication.
Aleem A, Akbar Samad AB, Slenker AK. Emerging Variants of SARS-CoV-2 And Novel Therapeutics Against Coronavirus (COVID-19). In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2022 [cited 2023 Feb 22]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK570580/
CDC. Centers for Disease Control and Prevention. 2020 [cited 2023 Feb 22]. Coronavirus Disease 2019 (COVID-19). Available from: https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-classifications.html
Mann T, Humbert R, Dorschner M, Stamatoyannopoulos J, Noble WS. A thermodynamic approach to PCR primer design. Nucleic Acids Res. 2009;37(13):e95–e95.
Howson ELA, Orton RJ, Mioulet V, Lembo T, King DP, Fowler VL. GoPrime: development of an in silico framework to predict the performance of real-time PCR primers and probes using foot-and-mouth disease virus as a model. Pathogens. 2020;9(4):303.
Rambaut A, Holmes EC, O’Toole Á, Hill V, McCrone JT, Ruis C, et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020;5(11):1403–7.
Center for Devices and Radiological Health. U.S. Food and Drug Administration. FDA; 2021 [cited 2022 Nov 14]. Policy for Evaluating Impact of Viral Mutations on COVID-19 Tests. Available from: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/policy-evaluating-impact-viral-mutations-covid-19-tests
BioFire® COVID-19 Test [Internet]. BioFire Defense. [cited 2022 Nov 13]. Available from: https://www.biofiredefense.com/covid-19test/
Bowen MD, Rollin PE, Ksiazek TG, Hustad HL, Bausch DG, Demby AH, et al. Genetic diversity among Lassa virus strains. J Virol. 2000;74(15):6992–7004.
Acknowledgements
We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. We thank members of the BioFire Defense Regulatory Affairs and Research and Development departments, including Kristin Casper, Dave Rabiger, and Jason Nielson for their thoughtful reviews of the manuscript. Thanks to Scott Glaittli for his technical expertise on the CoMIT software package.
Funding
This project was funded internally by BioFire Defense, LLC.
Author information
Contributions
Authors' contributions: Tool conceptualization: D.W., L.G., H.F.VH., W.A.S., and M.K.; software development: D.W., L.G., A.S., J.W., and C.H.; data curation: D.W., L.G., and A.S.; evaluation methodology: D.W., L.G., H.F.VH., W.A.S., and M.K.; visualizations: L.G., A.S., D.W., H.F.VH., and W.A.S.; project administration: D.W. and M.K.; database validation: L.G. and H.F.VH.; benchmarking: A.S.; code reviews: D.W., L.G., A.S., J.W., and C.H.; literature review: L.G., A.S., D.W., and H.F.VH.; writing – original draft: D.W.; writing – review and editing: D.W., L.G., H.F.VH., A.S., W.A.S., J.W., C.H., and M.K.; supervision: M.K. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
W.A. Smith, J.T. Wolff, C.P. Healy, H.F. Van Hollebeke, and M. Kim are current employees of BioFire Defense, LLC; D. Walker, L. Gale, and A. Stephenson are former employees of BioFire Defense, LLC; D. Walker, W.A. Smith, and M. Kim are shareholders of BioFire Defense, LLC.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Walker, D.M., Smith, W.A., Gale, L. et al. CoMIT: a bioinformatic pipeline for risk-based prediction of COVID-19 test inclusivity. BMC Bioinformatics 26, 51 (2025). https://doi.org/10.1186/s12859-025-06046-y
DOI: https://doi.org/10.1186/s12859-025-06046-y