Skip to main content

ChIPbinner: an R package for analyzing broad histone marks binned in uniform windows from ChIP-Seq or CUT&RUN/TAG data

Abstract

Background

The decreasing costs of sequencing, along with the growing understanding of epigenetic mechanisms driving diseases, have led to the increased application of chromatin immunoprecipitation (ChIP), Cleavage Under Targets & Release Using Nuclease (CUT&RUN) and Cleavage Under Targets and Tagmentation (CUT&TAG) sequencing—which are designed to map DNA or chromatin-binding proteins to their genome targets—in biomedical research. Existing software tools, namely peak-callers, are available for analyzing data from these technologies, although they often struggle with diffuse and broad signals, such as those associated with broad histone post-translational modifications (PTMs).

Results

To address this limitation, we present ChIPbinner, an open-source R package tailored for reference-agnostic analysis of broad PTMs. Instead of relying on pre-identified enriched regions from peak-callers, ChIPbinner divides (bins) the genome into uniform windows. Thus, users are provided with an unbiased method to explore genome-wide differences between two samples using scatterplots, principal component analysis (PCA), and correlation plots. It also facilitates the identification and characterization of differential clusters of bins, allowing users to focus on specific genomic regions significantly affected by treatments or mutations. We demonstrated the effectiveness of this tool through simulated datasets and a case study assessing H3K36me2 depletion following NSD1 knockout in head and neck squamous cell carcinoma, highlighting the advantages of ChIPbinner in detecting broad histone mark changes over existing software.

Conclusions

Binned analysis provides a more holistic view of the genomic landscape, allowing researchers to uncover broader patterns and correlations that may be missed when solely focusing on individual peaks. ChIPbinner offers researchers a convenient tool to perform binned analysis. It improves on previously published software by providing a clustering approach that is independent of each bin’s differential enrichment status and more precisely identifies differentially bound regions for broad histone marks, while also offering additional features for downstream analysis of these differentially enriched bins.

Peer Review reports

Background

Histone modifications refer to chemical modifications to the histone tails of nucleosomes, the protein complexes around which DNA is wrapped. These modifications, also known as histone marks, regulate gene expression by facilitating or hindering access to the DNA. Narrow histone marks, such as acetylation of lysine 27 at histone 3 (H3K27ac) are deposited in specific, focused genomic regions whereas broad histone marks, such as methylation of lysine 36 at histone 3 (H3K36me), are diffused across the genome, covering large genomic domains. Recent advancements and decreasing costs of sequencing technology, such as ChIP, CUT&RUN and CUT&TAG sequencing, which allow mapping of protein-DNA interactions, for example histone marks, have led to their increasing application in biomedical research. The widespread adoption of these technologies is further driven by our growing understanding of epigenetic mechanisms driving disease and cancer.

Model-based Analysis of ChIP-Seq (MACS) is a widely used tool for identifying genomic regions where the target ChIP-Seq signal is enriched relative to background noise from a control DNA input or non-targeting antibody experiment [1]. Commonly known as a “peak-caller”, MACS was originally designed to detect transcriptional factor binding sites, which are found in highly specific genomic regions [1]. Detection of diffuse enrichment covering extended genomic regions often suffers from high noise level and lack of saturation in sequencing coverage [2]. Thus, peak-callers designed for broad histone marks have been developed, such as EPIC2 for ChIP-Seq data [3], the “--broad” feature in MACS, and Sparse Enrichment Analysis for CUT&RUN (SEACR) [4]. However, despite these advances in peak calling strategies for broad histone marks, there is often a discordance amongst peak callers as to what constitutes true signal enrichment. Furthermore, diffuse, broad domains become fragmented into smaller, often biologically meaningless peaks. This situation is further confounded when the nature of the genomic distribution of chromatin modifications changes: H3K27me3 in the presence of the pediatric glioma H3K27M (histone 3 lysine-to-methionine mutation at position 27) mutation from a broad to promoter-focused distribution [5] or H3K36me2 deposited by NSD1/2 (broad), versus NSD3/ASH1L (narrow), making it difficult to apply a uniform approach for comparative analysis [6].

As an alternative to being limited to some set of pre-defined regions (peaks), the genome can be divided into uniform windows, which can be referred to as “binning the genome”. The advantage of this strategy compared to peak-calling is that it is reference-agnostic—the analysis is unbiased without a prior set of references needed. Bins that are differentially enriched would be identified, but it does not apply prior assumptions or biases to those bins, as opposed to peak callers which rely on an algorithm and prior assumptions. Here, we present ChIPbinner, an R package designed to compare genome-wide changes in broad histone marks between two samples without relying on pre-identified regions. It can be used to normalize raw signal that has been binned in uniform windows and combine replicates. Furthermore, exploratory analyses can be conducted: the distribution of bins in genic and intergenic regions can be assessed, and genome-wide profiles of samples can be evaluated using PCA or correlation plots to assess consistency of replicates and separation of samples according to treatment. A primary motivation for performing ChIP/CUT&RUN/TAG sequencing is to identify genomic regions where significant changes occur between different treatment conditions, also known as differential binding (DB) sites. This can uncover potential mechanisms through which changes in binding may contribute to the treatment effect. Software tools for detecting DB sites in ChIP-Seq data have been developed using DiffBind [7] and csaw [8]. However, DiffBind relies on peak-sets derived from peak-callers to identify DB sites between sample groups [7]. Consequently, there is a lack of independence between peak calling and DB detection—DiffBind is constrained by the same assumptions and biases associated with peak-callers. Conversely, csaw uses a window-based strategy to summarize read counts across the genome and is independent of peak-callers. It uses statistical methods in the edgeR package, which were designed for differential gene expression analysis [9], to test for significant differences in each window. Finally, it clusters windows into regions and controls the false discovery rate over all detected regions [8]. However, the default clustering procedures for csaw relies on independent filtering to remove irrelevant windows. This ensures that the regions that are differentially changing are reasonably narrow and can be easily interpreted, which is typically the case for transcription factors and narrow histone marks. However, enriched regions tend to be very large for more diffuse marks and such regions may be difficult to interpret when smaller DB subintervals is of interest and not necessarily well-separated, defined regions. To address this issue, csaw employs a post-hoc analysis for diffuse histone marks, using only significant windows clustering [8]. This approach defines and clusters significant windows while controlling the cluster-level false discovery rate (FDR) at 5%. This alleviates the need for stringent abundance filtering to achieve well-separated regions prior to clustering for diffuse marks. However, the calculation of the cluster-level FDR is not entirely rigorous, and the clusters are influenced by the DB status of the windows. Thus, csaw struggles to detect DB properly when dealing with the diffuse signal from broad histone marks. Overall, although csaw avoids the aggressive assumptions when defining peaks as no peak calling is performed, it introduces other assumptions by applying a fixed predefined statistical model to detect differentially bound sites prior to clustering.

In contrast to csaw, ChIPbinner cluster bins independent of the DB status of each bin—normalized read counts per window from the two samples are directly inputted into the clustering algorithm without any prior statistical comparisons. Furthermore, although designing sequencing experiments without replicates is discouraged, ChIPbinner can be used with only a single replicate per treatment, allowing cross-validation across cell lines as independent controls. If replicates are indeed included, ChIPbinner uses the ROTS (reproducibility-optimized test statistics) method [10] to assess DB between two groups for each bin. Unlike methods relying on a fixed predefined statistical model (e.g., edgeR's negative binomial generalized linear models with quasi-likelihood tests used in csaw), ROTS optimizes the test statistic directly from the data. This adaptive approach, based on t-type statistics, ranks genomic features based on the strength of evidence for differential binding in two-group comparisons. A priori assumptions about the data distribution or specific cutoffs for ranking do not need to be specified for ROTS [10]. The optimization process maximizes the overlap of top-ranked features in bootstrap datasets that maintain the original group structure. Previous performance comparisons demonstrated that ROTS, while exhibiting less power in datasets with a small proportion of differentially expressed features, surpasses other differential expression methods, such as edgeR and DESeq2, in datasets characterized by a large proportion of differentially expressed features and a skewed distribution of these features [11]—conditions frequently observed in ChIP-seq data following mutations affecting global histone levels.

ChIPbinner includes additional features that are not readily available in csaw, such as built-in functions for PCA, hierarchical clustering, annotating individual bins as genic or intergenic, and enrichment/depletion analysis for specified clusters. Lastly, ChIPbinner offers functionalities that enable users to easily normalize and quantitatively scale the signal in their bins, whereas csaw requires some manual coding to achieve the same results. Thus, in addition to identifying clusters of bins that remain unchanged between different treatment conditions, ChIPbinner can also identify clusters of bins that exhibit differential changes and assess whether these clusters are enriched or depleted in specific classes of functionally annotated regions.

Implementation

ChIPbinner is an open-source R package that can be installed using ‘remotes::install_github(“padilr1/ChIPbinner”,build_vignettes = TRUE)’ (latest version) in R. Alternatively, users can install ChIPbinner v0.99.2 by running ‘install.packages(“Additional_file_1_ChIPbinner_0.99.2.tar.gz”,repos = NULL,type = ”source”)’ in R [see Additional file 1: Additional_file_1_ChIPbinner_0.99.2.tar.gz]. ChIPbinner has been written to minimize the number of external dependencies required for installation. The ChIPbinner workflow is depicted in Fig. 1.

Fig. 1
figure 1

The ChIPbinner workflow requires that aligned reads from ChIP-Seq or CUT&RUN/TAG experiments, typically in BAM format, to be binned in uniform windows and converted to BED (Browser Extensible Data) format (minimum BED3 format with chromosome, start and end information) (red box). Users can normalize the raw binned signal. Pairwise comparisons, such as treated versus control (yellow box), can then be performed using genic/intergenic scatterplots, which annotates each bin as genic or intergenic. For comparing more than two samples, PCA or hierarchical clustering can be used to assess consistency of replicates or effect of treatment/mutation on genome-wide profiles of each sample. Next, users can combine replicates for use in downstream analysis (green boxes). After selecting two samples for pairwise comparison, users can identify and annotate clusters of similarly-behaving bins and visualize them. Finally, users can run differential bin analysis and/or select specific clusters of bins for enrichment and depletion analysis

Data pre-processing and input

ChIPbinner takes ChIP-Seq or CUT&RUN/TAG data binned in uniform windows in a BED format as input. Hence, users would need to firstly convert their aligned sequence reads in BAM format into BED format. One of the published tools that than can be used to perform this task is “bedtools bamtobed” [12]. The size of the predicted changes will determine the suitable window size for the analysis. For more granular, localized changes in specific regions, a smaller window (e.g., 1 or 10 kilobase (kb)) is recommended. In contrast, larger windows (e.g., 100 kb) are better suited for analyzing broader-scale changes, such as the loss of H3K9me3 across broad genomic domains. Users can also begin with smaller bin sizes and subsequently merge overlapping or adjacent features later in the analysis. Tools like “bedtools merge” [12] can facilitate this process.

Normalization and transformation into bigWig

Next, users can normalize the raw read counts in their BED file to adjust for library size and, in the case of ChIP-Seq data, background signal using the input samples. Additionally, ChIPbinner provides an option for users to normalize binned raw counts across samples using DESeq2’s median ratio method [13] or edgeR’s trimmed mean of M-values (TMM) [9]. Both normalization techniques are applied to the raw binned counts to obtain a set of normalization factors. When normalizing binned counts, DESeq2 calculates sample-specific size factors using the median ratio method and divides the raw counts by these factors. EdgeR, on the other hand, utilizes the TMM method to generate effective library sizes, which subsequently replaces the original library sizes for normalization. In addition to correcting for technical biases arising from differences in sequencing depth and library composition across samples, both methods utilize normalization factors to address outlier regions that could distort comparisons of read counts across samples. However, these methods assume most features are not differentially affected between samples, a condition that may not be met when mutations or knockouts cause global, widespread reductions in broad histone marks. In our simulations with downsampled intergenic or genic regions for four broad histone marks (H3K36me2, H3K4me1, H3K27me3 and H3K9me3), we found four times more true positives when we only normalized by library size compared to when we applied edgeR’s TMM normalization (Table 1). Thus, users should be careful when applying these normalization techniques, as they may reduce the statistical power to detect genuine biological changes in broad histone marks.

Table 1 Normalizing solely by library size outperformed edgeR normalization in true positive detection for simulated ChIP-Seq data

Nevertheless, users can also quantitatively scale the histone mark signals using techniques such as ChIP with reference exogenous genome (ChIP-Rx) [14] or the genome-wide modification percentage information obtained from mass spectrometry, as described in Farhangdoost et al. [15], to allow for direct quantitative comparisons and to correct for other confounding experimental variables, such as variations in genome fragmentation and immunoprecipitation efficiency. Additionally, users have options to exclude artifact signals from the ENCODE blacklist regions [16] and filter out bins with low raw read count. The normalized and scaled signals are outputted in a bigWig track format, which is useful for dense, continuous data (Fig. 1; see green and yellow boxes).

Handling replicates and global sample profiles

To ensure replicates for a given condition cluster together, users can generate plots of PCA and/or correlation matrix-based hierarchical clustering using the normalized bigWig files from the previous step. These plots also offer insights as to the effect of a treatment or mutation on the global profiles of samples for a given histone mark. If satisfied with the consistency of replicates, the users can proceed to merge replicates, which takes the average signal per bin between replicates.

Genic/intergenic scatterplot

Comparing two bigWig files, typically the treated versus untreated samples, users can generate a scatterplot for bins annotated as genic or intergenic (Fig. 1; refer to the scatterplot located at the center of the schematic diagram). Genic regions were taken as the union of any intervals having the “gene” annotations in the curated Ensembl database [17] and intergenic regions were thus defined as the complement of genic ones. Annotated genic and intergenic regions are available for the following assemblies: mm10 (mouse) and hg38 (human).

Identifying and annotating clusters

ChIPbinner utilizes HDBSCAN, a density-based hierarchical clustering method, to identify clusters of similarly-behaving bins [18]. HDBSCAN builds a comprehensive density-based clustering hierarchy, from which a simplified hierarchy containing only the most significant clusters can be easily extracted [18]. Users can specify the stringency of the clustering algorithm, such as the minimum size grouping for a cluster and how conservative the clustering will be—the more conservative, the more points will be declared as noise and clusters will be restricted to progressively more dense areas. Each cluster will be assigned a letter (A to Z) for reference in downstream analysis.

Density-based plots

Density-based scatterplots offer a genome-wide overview of clusters that show enrichment, depletion, or remain unchanged between two samples for a specific histone mark (Fig. 1; bottom-right section of the schematic diagram).

Differential binding analysis

The user can perform differential binding analysis on normalized and/or scaled counts from the bigWig files generated in previous steps. ChIPbinner uses the ROTS method [10] to assess DB between two groups for each bin. For each bin assigned to a specific cluster, the log fold change and adjusted p value will be generated.

Enrichment and depletion analysis

After identifying and annotating clusters of bins, a Fisher’s exact test can be used to determine whether a specified cluster of bins significantly overlap a specific class of annotated regions against a background of all bins (Fig. 1; refer to the rightmost plot as an example), as implemented in LOLA [19]. Additionally, the user can evaluate exclusively a specified cluster of genic or intergenic bins overlapping a specific class of annotated regions. In these cases, the background is stratified to only genic or intergenic regions to avoid spurious associations to annotations confounded by their predominantly genic or intergenic localization. Three curated datasets of annotated regions have been provided with the ChIPbinner package: Ensembl regulatory build [17], repeatMasker (www.repeatmasker.org) and ENCODE candidate cis-regulatory elements [20], which can be found at https://github.com/padilr1/ChIPbinner_database. Optionally, the user can provide their own dataset of annotated regions to test their cluster of bins against.

Results and discussion

Simulated ChIP-Seq datasets

To perform detailed performance comparisons between csaw and ChIPbinner, simulated datasets were created by down-sampling ChIP-Seq wildtype samples from human head and neck squamous cell carcinoma cell lines (H3K36me2 and H3K9me3) [6, 15] and from mouse mesenchymal stem cells (H3K4me1 and H3K27me3) [21, 22]. For H3K36me2, H3K27me3, and H3K9me3, intergenic regions covered by the histone mark (i.e., showing signal) were depleted in the wildtype samples. In contrast, for H3K4me1, covered genic regions were depleted. These down-sampled samples were then compared to the original, unaltered samples to evaluate precision, specificity, and sensitivity between ChIPbinner and csaw (Table 2). Bins in the computed clusters derived from the ChIPbinner analysis workflow were matched to their resulting statistics from running ROTS [10] on depth-normalized counts. We ran csaw with default parameters, which included performing normalization using edgeR's TMM method, fitting a negative binomial generalized log-linear model, and running quasi-likelihood F-tests [8]. For both software, only bins with an adjusted p value < 0.05 were analyzed. Precision, specificity, sensitivity and F1 scores were calculated as follows:

$${\text{Precision}} = TP/\left( {TP + FP} \right)$$
$${\text{Specificity}} = TN/\left( {TN + FP} \right)$$
$${\text{Sensitivity}}\;\left( {{\text{Recall}}} \right) = TP/\left( {TP + FN} \right)$$
$${\text{F1}}\;{\text{score}} = 2*\frac{precision*sensitivity}{{precision + sensitivity}}.$$
Table 2 ChIPbinner identifies differentially bound regions in broad histone marks with greater precision than csaw

True positives (TP) referred to the number of correctly identified depleted bins, false positives (FP) were referred to as incorrectly identified depleted or upregulated bins, true negatives (TN) were correctly identified invariant bins, and finally, false negatives (FN) were bins that were depleted but were identified by the software as invariant. Across all four histone marks, sensitivity and specificity were not significantly different between ChIPbinner and csaw (Table 2). However, For H3K36me2 and H3K4me1, csaw had significantly lower precision and consequently lower F1 scores than ChIPbinner. The difference in performance was due to csaw's identification of upregulated bins in the downsampled samples, despite the absence of any actual increase in absolute read counts. These incorrectly identified upregulated bins were classified as false positives.

Case study

The methods implemented in ChIPbinner are largely derived from our analyses in Farhangdoost et al. [15], where we identified and characterized genomic compartments that exhibited the greatest loss of the broad histone mark H3K36me2 following NSD1 knockout (NSD1-KO) in a head and neck squamous cell carcinoma (HNSCC) cell line Cal27. In that study, we subdivided the genome into 10 kb bins. Using an approach similar to what can be performed with ChIPbinner, we identified a cluster of bins, designated as Cluster B, with high initial levels of H3K36me2 in the parental cell lines (WT) and low levels in the NSD1-KO (Fig. 2a). For comparison, we used csaw with 10 kb bins and cluster-level FDR at 5% to detect a similar cluster of bins with downregulated levels of H3K36me2 in NSD1-KO compared to WT (Fig. 2b, c; purple boxes). 86% (51,607/59848) of downregulated DB bins identified using csaw were also detected by ChIPbinner whereas csaw only detected 55% (51,607/93,613) of the DB bins identified by ChIPbinner (Fig. 2c). Furthermore, csaw reported 36,038 upregulated bins in the comparison of NSD1-KO to WT, whereas ChIPbinner, combined with ROTS for differential analysis, detected only 12 upregulated bins (using a logFC > 0 and adjusted p value < 0.05 cutoff for upregulated bins for both software). Prior research has reported that when a large proportion of genes (e.g., 54%) are upregulated, the performance of methods like DESeq2 and edgeR declines significantly, leading to nearly random predictions [11]. Conversely, ROTS has been demonstrated to be more robust to the presence of numerous and imbalanced differentially expressed features [11], two conditions commonly encountered in ChIP-Seq data following a mutation or knockout affecting global histone levels, such as in samples where NSD1 is mutated. Indeed, in our NSD1-KO to WT comparison, 50% of 10 kb bins with signal were downregulated. Therefore, for broad histone marks, ChIPbinner with ROTS provides superior precision in detecting differentially bound regions compared to csaw. An added benefit of the ROTS method is it allows input of scaled counts (using ChIP-Rx or mass spectrometry-derived modification percentages for example), a feature absent in csaw.

Fig. 2
figure 2

A case study in a head and neck squamous cell carcinoma (HNSCC) cell line Cal27, which belong to the HPV(−) subgroup of HNSCC but harbor no endogenous mutations affecting H3K36me. a Scatterplots and genome-browser tracks demonstrating how clusters of 10-kb bins identified using ChIPbinner correspond to specific genomic regions. A comparison of the wildtype cell line (WT) (dark red) and an NSD1 knockout cell line (NSD1-KO) (green) reveals depletion of intergenic H3K36me2 in Cluster B (highlighted in blue), where high levels of H3K36me2 are present in WT and low levels are found in NSD1-KO. In clusters A (highlighted in pink) and C (highlighted in brown), there is low and high levels of H3K36me2 in both conditions respectively. ChIP-seq signals were normalized using genome-wide modification percentage values obtained from mass spectrometry (MS). b Genome-browser tracks showing an example locus where ChIPbinner (blue boxes) detected more regions of H3K36me2 loss compared to csaw (purple boxes). c Genome-wide overlap analysis revealed that out of the 10 kb bins with loss of H3K36me2, ChIPbinner detected 51,607 out of 59,848 (86%) bins identified by csaw, while csaw only detected 51,607 out of 93,613 (55%) bins identified by ChIPbinner. d ChIPbinner allows users to perform differential enrichment and depletion analysis for a specified cluster. In this case, cluster B bins were tested for overlap with a specific class of annotated regions from Ensembl. **** represents p value < 0.0001 based on Fisher’s exact test. The results show that cluster B bins are enriched in intergenic regions, particularly at enhancers and open chromatin regions, and depleted in promoter flanking regions, exons and introns

Nevertheless, as stated previously, ChIPbinner also allows users to perform downstream analysis following identification of clusters. In our case study, characterization of Cluster B indicated that these regions were predominantly intergenic and enriched for enhancers (Fig. 2a, d) [15]—both of which can be determined for a specific cluster using the genic/intergenic scatterplot and enrichment/depletion functions, respectively, available in the ChIPbinner package. Further characterization of cluster B revealed that these regions exhibited the greatest reduction in H3K27ac binding, and led to the strongest downregulation of target genes within the same topologically associating domain (TAD) [15]. More recently, we used ChIPbinner to identify genomic regions with significant loss of the broad, heterochromatic histone mark H3K9me3 following complete depletion of H3K36me in mouse mesenchymal stem cells [23]. These regions were predominantly intergenic, and we observed significant upregulation of genes and transposable elements within them. We replicated this binned analysis in HNSCC cell lines, where we similarly identified a cluster of bins significantly losing H3K9me3 following H3K36me depletion [23]. In another study from our group, we used intergenic/genic scatterplots for binned genomic regions—as can be generated with ChIPbinner—to demonstrate that the greatest reduction in H3K36me2 following NSD1 knockout in mouse embryonic stem cells occurs primarily at intergenic regions [24]. These studies highlight the value of ChIPbinner in analyzing broad histone marks, such as H3K36me2 and H3K9me3, for identifying and characterizing genomic compartments undergoing the most significant changes, or as a visualization tool for genome-wide comparisons between different samples.

Conclusions

ChIPbinner offers an alternative approach to peak-callers and equips researchers with an unbiased tool to analyze trends and patterns in epigenetic data, especially for broad histone marks. By providing a compiled workflow and dedicated functions, our software saves researchers time and effort in implementing binned analysis techniques. It also ensures that their analyses can be easily replicated and applied to new datasets, even for users with limited programming experience. ChIPbinner improves on previously published software by offering a clustering approach that is independent of DB and more precisely identifies differentially bound regions for broad histone marks. Furthermore, it includes additional features for downstream analysis.

In systems characterized by large global changes in the distribution or levels of broad histone marks, as observed here following NSD1-KO for H3K36me2 and NSD1/2-SETD2-TKO for H3K9me3, ChIPbinner is a useful tool for exploring differences of chromatin modifications between conditions. It allows users to analyze changes in the distribution of broad histone marks within genic and intergenic regions, without the assumptions or constraints imposed by peak callers or fixed predefined statistical models.

Availability of data and materials

The datasets analysed during the current study are available in the National Center for Biotechnology Information Gene Expression Omnibus (NCBI-GEO) under accession number GSE149670, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE149670 [15], number GSE69291, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE69291 [21] and number GSE160266, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE160266 [22]. Curated databases from Ensembl, RepeatMasker and ENCODE candidate cis-regulatory elements can be found at https://github.com/padilr1/ChIPbinner_database. Project name: ChIPbinner; Project home page: https://github.com/padilr1/ChIPbinner; Operating system: Platform independent; Programming language: R; Other requirements: None; License: GPL-3.0; Any restrictions to use by non-academics: None.

Abbreviations

ChIP-Seq:

Chromatin immunoprecipitation sequencing

CUT&RUN:

Cleavage Under Targets & Release Using Nuclease

CUT&TAG:

Cleavage Under Targets and Tagmentation

PTM:

Post-translational modifications

PCA:

Principal component analysis

H3K27ac:

Histone three lysine 27 acetylation

H3K36me:

Histone three lysine 36 methylation

MACS:

Model-based Analysis of ChIP-Seq

SEACR:

Sparse enrichment analysis for CUT&RUN

H3K27M:

Lysine-to-methionine mutations in histone 3 at position 27

NSD:

Nuclear receptor-binding SET domain protein

ASH1L:

Absent, small, or homeotic discs 1-like

DB:

Differential binding

BED:

Browser extensible data

DE:

Differential expression

TMM:

Trimmed mean of M-values

logFC:

Log fold change

ChIP-Rx:

ChIP with reference exogenous genome

KB:

Kilobase

WT:

Wildtype

NSD1-KO:

NSD1 knockout

HNSCC:

Head and neck squamous cell carcinoma

ROTS:

Reproducibility-optimized test statistics

TAD:

Topologically associating domain

References

  1. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137.

    Article  PubMed  PubMed Central  Google Scholar 

  2. Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W. A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009;25:1952–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Stovner EB, Sætrom P. epic2 efficiently finds diffuse domains in ChIP-seq data. Bioinformatics. 2019;35:4392–3.

    Article  CAS  PubMed  Google Scholar 

  4. Meers MP, Tenenbaum D, Henikoff S. Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling. Epigenetics Chromatin. 2019;12:42.

    Article  PubMed  PubMed Central  Google Scholar 

  5. Harutyunyan AS, Chen H, Lu T, Horth C, Nikbakht H, Krug B, et al. H3K27M in gliomas causes a one-step decrease in H3K27 methylation and reduced spreading within the constraints of H3K36 methylation. Cell Rep. 2020;33:108390.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Shipman GA, Padilla R, Horth C, Hu B, Bareke E, Vitorino FN, et al. Systematic perturbations of SETD2, NSD1, NSD2, NSD3 and ASH1L reveals their distinct contributions to H3K36 methylation. 2023.

  7. Stark R, Brown G. DiffBind: differential binding analysis of ChIP-Seq peak data.

  8. Lun ATL, Smyth GK. csaw: a bioconductor package for differential binding analysis of ChIP-seq data using sliding windows. Nucleic Acids Res. 2016;44:e45.

    Article  PubMed  Google Scholar 

  9. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.

    Article  CAS  PubMed  Google Scholar 

  10. Suomi T, Seyednasrollah F, Jaakkola MK, Faux T, Elo LL. ROTS: an R package for reproducibility-optimized statistical testing. PLoS Comput Biol. 2017;13:e1005562.

    Article  PubMed  PubMed Central  Google Scholar 

  11. Baik B, Yoon S, Nam D. Benchmarking RNA-seq differential expression analysis methods using spike-in and simulation data. PLoS ONE. 2020;15:e0232271.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Orlando DA, Chen MW, Brown VE, Solanki S, Choi YJ, Olson ER, et al. Quantitative ChIP-Seq normalization reveals global modulation of the epigenome. Cell Rep. 2014;9:1163–70.

    Article  CAS  PubMed  Google Scholar 

  15. Farhangdoost N, Horth C, Hu B, Bareke E, Chen X, Li Y, et al. Chromatin dysregulation associated with NSD1 mutation in head and neck squamous cell carcinoma. Cell Rep. 2021;34:108769.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Amemiya HM, Kundaje A, Boyle AP. The ENCODE blacklist: identification of problematic regions of the genome. Sci Rep. 2019;9:9354.

    Article  PubMed  PubMed Central  Google Scholar 

  17. Zerbino DR, Wilder SP, Johnson N, Juettemann T, Flicek PR. The ensembl regulatory build. Genome Biol. 2015;16:56.

    Article  PubMed  PubMed Central  Google Scholar 

  18. McInnes L, Healy J, Astels S. hdbscan: hierarchical density based clustering. J Open Source Softw. 2017;2:205.

    Article  Google Scholar 

  19. Sheffield NC, Bock C. LOLA: enrichment analysis for genomic region sets and regulatory elements in R and bioconductor. Bioinformatics. 2016;32:587–9.

    Article  CAS  PubMed  Google Scholar 

  20. Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, Adrian J, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710.

    Article  PubMed  PubMed Central  Google Scholar 

  21. Lu C, Jain SU, Hoelper D, Bechet D, Molden RC, Ran L, et al. Histone H3K36 mutations promote sarcomagenesis through altered histone methylation landscape. Science. 2016;352:844–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Rajagopalan KN, Chen X, Weinberg DN, Chen H, Majewski J, Allis CD, et al. Depletion of H3K36me2 recapitulates epigenomic and phenotypic changes induced by the H3.3K36M oncohistone mutation. Proc Natl Acad Sci. 2021;118:e2021795118.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Padilla R, Shipman GA, Horth C, Bareke E, Majewski J. H3K36 Methylation—a Guardian of Epigenome Integrity. 2024;:2024.08.10.607446.

  24. Chen H, Hu B, Horth C, Bareke E, Rosenbaum P, Kwon SY, et al. H3K36 dimethylation shapes the epigenetic interaction landscape by directing repressive chromatin modifications in embryonic stem cells. Genome Res. 2022;genome;gr.276383.121v2.

Download references

Funding

J.M. is supported by the Canadian Institutes of Health Research (CIHR) Grant CIHR PJT-183939 and the United States National Institutes of Health (NIH) Grant P01-CA196539. R.P. is supported by the Fonds de Recherche du Québec Santé.

Author information

Authors and Affiliations

Authors

Contributions

R.P. conceptualized the software tool with the help of J.M. B.H. provided the initial scripts. R.P. developed and compiled software package, with assistance from E.B. R.P. wrote the initial manuscript, with suggestions and contributions from J.M., E.B. and B.H. for the final manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jacek Majewski.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Padilla, R., Bareke, E., Hu, B. et al. ChIPbinner: an R package for analyzing broad histone marks binned in uniform windows from ChIP-Seq or CUT&RUN/TAG data. BMC Bioinformatics 26, 89 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12859-025-06103-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12859-025-06103-6

Keywords