Method Article
Our Bayesian Change Point (BCP) algorithm builds on state-of-the-art advances in modeling change-points via Hidden Markov Models and applies them to chromatin immunoprecipitation sequencing (ChIPseq) data analysis. BCP performs well in both broad and punctate data types, but excels in accurately identifying robust, reproducible islands of diffuse histone enrichment.
ChIPseq is a widely used technique for investigating protein-DNA interactions. Read density profiles are generated by next-generation sequencing of protein-bound DNA and alignment of the resulting short reads to a reference genome. Enriched regions appear as peaks, which often differ dramatically in shape depending on the target protein1. For example, transcription factors often bind in a site- and sequence-specific manner and tend to produce punctate peaks, while histone modifications are more pervasive and are characterized by broad, diffuse islands of enrichment2. Reliably identifying these regions was the focus of our work.
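To make the notion of a read density profile concrete, the following minimal sketch counts aligned reads in fixed-width bins. It is an illustration only, not part of the BCP software: the function name `read_density`, the 200 bp bin size, the midpoint assignment rule, and the toy read coordinates are all assumptions chosen for the example.

```python
from collections import defaultdict

def read_density(reads, bin_size=200):
    """Count aligned reads per fixed-width bin.

    reads: iterable of (chrom, start, end) tuples, as in the first three
           columns of a BED file (hypothetical toy input here).
    Returns a dict mapping (chrom, bin_index) -> read count.
    """
    counts = defaultdict(int)
    for chrom, start, end in reads:
        # Assign each read to the bin containing its midpoint (an assumption;
        # other conventions, such as using the 5' end, are equally plausible).
        mid = (start + end) // 2
        counts[(chrom, mid // bin_size)] += 1
    return dict(counts)

# Toy aligned reads, not real data.
reads = [
    ("chr1", 100, 136),
    ("chr1", 150, 186),
    ("chr1", 420, 456),
    ("chr2", 10, 46),
]
profile = read_density(reads)
for key in sorted(profile):
    print(key, profile[key])
```

A punctate peak would show up in such a profile as a few adjacent bins with very high counts, while a diffuse island would show up as a long run of moderately elevated bins.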
Algorithms for analyzing ChIPseq data have employed various methodologies, from heuristics3-5 to more rigorous statistical models, e.g. Hidden Markov Models (HMMs)6-8. We sought a solution that minimized the need for difficult-to-define, ad hoc parameters, which often compromise resolution and make a tool less intuitive to use. With respect to HMM-based methods, we aimed to reduce reliance on the parameter estimation procedures and simple, finite-state classifications that are often used.
Additionally, conventional ChIPseq data analysis involves categorizing the expected read density profile as either punctate or diffuse and then applying the appropriate tool. We further aimed to replace these two distinct models with a single, more versatile model that can address the entire spectrum of data types.
To meet these objectives, we first constructed a statistical framework that naturally models ChIPseq data structures using a recent advance in HMMs9 that relies only on explicit formulas, an innovation crucial to its performance advantages. More sophisticated than heuristic models, our HMM accommodates infinite hidden states through a Bayesian model. We applied it to identify reasonable change points in read density, which in turn define segments of enrichment. Our analysis showed that our Bayesian Change Point (BCP) algorithm has reduced computational complexity, evidenced by a shorter run time and a smaller memory footprint. The BCP algorithm was successfully applied to both punctate peak and diffuse island identification with robust accuracy and limited user-defined parameters, illustrating both its versatility and ease of use. Consequently, we believe it can be implemented readily across a broad range of data types and end users, with results that are easily compared, making it a practical tool for ChIPseq data analysis that can aid collaboration and corroboration between research groups. Here, we demonstrate the application of BCP to existing transcription factor10,11 and epigenetic data12 to illustrate its usefulness.
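The idea of locating a change point in read density with closed-form Bayesian formulas can be illustrated with a deliberately simplified toy model. The sketch below is not the BCP or BCMIX algorithm: it assumes exactly one change point, Poisson-distributed bin counts, a Gamma(a, b) prior on each segment's rate, and a uniform prior on the change point position. The function names, hyperparameters, and counts are hypothetical. Under these assumptions each segment's marginal likelihood and posterior mean are explicit, so no iterative parameter estimation is needed.

```python
import math

def log_marginal(total, n, a=1.0, b=1.0):
    """Log marginal likelihood of n Poisson counts summing to `total`
    under a Gamma(shape=a, rate=b) prior on the rate. The factorial term
    is omitted because it is identical for every split position."""
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + total) - (a + total) * math.log(b + n))

def change_point_posterior(counts, a=1.0, b=1.0):
    """Posterior over the split position k = 1..n-1 (left segment is
    counts[:k]), assuming a single change point and a uniform prior on k."""
    n = len(counts)
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    logs = []
    for k in range(1, n):
        left, right = prefix[k], prefix[n] - prefix[k]
        logs.append(log_marginal(left, k, a, b)
                    + log_marginal(right, n - k, a, b))
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    z = sum(weights)
    return [w / z for w in weights]

def posterior_mean(total, n, a=1.0, b=1.0):
    """Posterior mean read density of a segment (Gamma-Poisson conjugacy)."""
    return (a + total) / (b + n)

# Toy bin counts: low background followed by an enriched stretch.
counts = [2, 1, 3, 2, 1, 12, 15, 11, 14, 13]
post = change_point_posterior(counts)
k_best = 1 + max(range(len(post)), key=post.__getitem__)
left_mean = posterior_mean(sum(counts[:k_best]), k_best)
right_mean = posterior_mean(sum(counts[k_best:]), len(counts) - k_best)
print(k_best, round(left_mean, 2), round(right_mean, 2))
```

In this toy example the posterior concentrates on the boundary between the fifth and sixth bins, and the two segment means summarize background versus enriched density. A full change point model would allow an unknown number of change points along the genome, which this sketch does not attempt.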
1. Preparing Input Files for BCP Analysis
2a. Diffuse Read Profiles: Preprocessing ChIP Read Densities for Detection of Enriched Islands in Diffuse Data
2b. Punctate Read Profiles: Preprocessing ChIP and Input BED Files for Detection of Peaks in Punctate Data
3. Estimate the Posterior Mean Read Density of Each Block using our BCMIX Approximation
4a. Diffuse Read Profiles: Post-process Posterior Means into Segments of Diffuse Enrichment
4b. Punctate Read Profiles: Post-process Posterior Means into Peak Candidates
BCP excels at identifying regions of broad enrichment in histone modification data. As a point of reference, we previously compared our results to those of SICER3, an existing tool which has demonstrated strong performance. To best illustrate BCP's advantages, we examined a histone modification that had been well studied to establish a foundation for assessing success rates. With this in mind, we then analyzed H3K36me3, since it has been shown to associate strongly with actively transcribed gene bodies (
We set out to develop a model for analyzing ChIPseq data that could identify both punctate and diffuse data structures equally well. Until now, regions of enrichment, particularly diffuse regions, which reflect the presupposed expectation of large island size, have been difficult to identify. To address these problems, we utilized the most recent advances in HMM technology, which possess many advantages over existing heuristic models and less innovative HMMs.
Our model makes use of a Bayesian...
No conflicts of interest declared.
STARR foundation award (MQZ), NIH grant ES017166 (MQZ), NSF grant DMS0906593 (HX).
Name | Company | Catalog Number | Comments |
Linux-based workstation |
Copyright © 2025 MyJoVE Corporation. All rights reserved