Summary

Our Bayesian Change Point (BCP) algorithm builds on state-of-the-art advances in modeling change-points via Hidden Markov Models and applies them to chromatin immunoprecipitation sequencing (ChIPseq) data analysis. BCP performs well in both broad and punctate data types, but excels in accurately identifying robust, reproducible islands of diffuse histone enrichment.

Abstract

ChIPseq is a widely used technique for investigating protein-DNA interactions. Read density profiles are generated by next-generation sequencing of protein-bound DNA and alignment of the short reads to a reference genome. Enriched regions appear as peaks, which often differ dramatically in shape depending on the target protein [1]. For example, transcription factors often bind in a site- and sequence-specific manner and tend to produce punctate peaks, while histone modifications are more pervasive and are characterized by broad, diffuse islands of enrichment [2]. Reliably identifying these regions was the focus of our work.

Algorithms for analyzing ChIPseq data have employed various methodologies, from heuristics [3-5] to more rigorous statistical models, e.g. Hidden Markov Models (HMMs) [6-8]. We sought a solution that minimized the need for difficult-to-define, ad hoc parameters, which often compromise resolution and make a tool less intuitive to use. With respect to HMM-based methods, we aimed to curtail the parameter estimation procedures and simple, finite-state classifications that are often utilized.

Additionally, conventional ChIPseq data analysis involves categorization of the expected read density profiles as either punctate or diffuse followed by subsequent application of the appropriate tool. We further aimed to replace the need for these two distinct models with a single, more versatile model, which can capably address the entire spectrum of data types.

To meet these objectives, we first constructed a statistical framework that naturally models ChIPseq data structures using a cutting-edge advance in HMMs [9] that relies only on explicit formulas, an innovation crucial to its performance advantages. More sophisticated than heuristic models, our HMM accommodates infinite hidden states through a Bayesian model. We applied it to identifying reasonable change points in read density, which in turn define segments of enrichment. Our analysis showed that our Bayesian Change Point (BCP) algorithm has reduced computational complexity, as evidenced by a shorter run time and a smaller memory footprint. The BCP algorithm was successfully applied to both punctate peak and diffuse island identification with robust accuracy and few user-defined parameters, which illustrates both its versatility and its ease of use. Consequently, we believe it can be implemented readily across broad ranges of data types and end users in a manner that is easily compared and contrasted, making it a useful tool for ChIPseq data analysis that can aid collaboration and corroboration between research groups. Here, we demonstrate the application of BCP to existing transcription factor [10,11] and epigenetic data [12] to illustrate its usefulness.

Protocol

1. Preparing Input Files for BCP Analysis

  1. Align the short reads produced from sequencing runs (ChIP and input libraries) to the appropriate reference genome using the preferred short read alignment software. Convert the mapped locations to the 6-column browser extensible data (BED) format [13] (UCSC genome browser, http://genome.ucsc.edu/): one tab-delimited line per mapped read giving the mapped chromosome, start position (0-based), end position (half-open), read name, score (optional), and strand.
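As a quick check that an input file follows the 6-column layout described above, a line can be parsed as in the following minimal sketch. The names `BedRead` and `parse_bed6` are hypothetical helpers written for this illustration and are not part of BCP; the score is kept as text because the protocol treats it as optional.

```python
from typing import NamedTuple


class BedRead(NamedTuple):
    """One mapped read in 6-column BED layout (illustrative type, not from BCP)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # half-open (exclusive)
    name: str
    score: str   # optional field; kept as text
    strand: str  # "+" or "-"


def parse_bed6(line: str) -> BedRead:
    """Parse one tab-delimited BED line and validate the coordinates and strand."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
    chrom, start, end, name, score, strand = fields[:6]
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    read = BedRead(chrom, int(start), int(end), name, score, strand)
    if read.start < 0 or read.end <= read.start:
        raise ValueError("coordinates must satisfy 0 <= start < end")
    return read


read = parse_bed6("chr1\t100\t136\tread_1\t0\t+")
read_length = read.end - read.start  # half-open interval, so no +1 is needed
```

Because the end coordinate is half-open, the read length is simply `end - start`.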

2a. Diffuse Read Profiles: Preprocessing ChIP Read Densities for Detection of Enriched Islands in Diffuse Data

  1. Extend the ChIP and input mapped locations to a predetermined fragment length, i.e. the fragment size targeted during enzyme digestion or sonication of the DNA, usually around 200 bp. Fragment counts are then aggregated in adjacent bins. By default, bin size is set to the estimated fragment length of 200 bp.
  2. Any change points in a run of bins with identical read counts will most likely fall at its outermost boundaries; a change point is unlikely to occur at an internal boundary between two bins with the same read count. Therefore, group adjacent bins with identical reads per bin into a single block, i.e. bedGraph format [13].
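The two preprocessing steps above can be sketched as follows for a single chromosome. This is an illustration under stated assumptions, not BCP's own code: the function names are hypothetical, each extended fragment is assigned to the bin containing its midpoint (the protocol does not say how fragments spanning two bins are counted), and reads are given as (start, end, strand) tuples.

```python
from itertools import groupby


def bin_fragment_counts(reads, chrom_length, fragment_len=200, bin_size=200):
    """Extend each read to fragment_len and count fragments per fixed-width bin.

    Assumption: a fragment is counted once, in the bin holding its midpoint.
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start, end, strand in reads:
        # Plus-strand reads extend rightward from their start,
        # minus-strand reads extend leftward from their end.
        frag_start = start if strand == "+" else max(0, end - fragment_len)
        frag_end = min(chrom_length, frag_start + fragment_len)
        midpoint = (frag_start + frag_end) // 2
        counts[min(midpoint // bin_size, n_bins - 1)] += 1
    return counts


def collapse_blocks(counts, chrom_length, bin_size=200):
    """Merge runs of adjacent bins with identical counts into bedGraph-style blocks."""
    blocks = []
    pos = 0
    for value, run in groupby(counts):
        width = sum(1 for _ in run) * bin_size
        block_end = min(pos + width, chrom_length)
        blocks.append((pos, block_end, value))
        pos += width
    return blocks


reads = [(0, 36, "+"), (150, 186, "+"), (564, 600, "-")]
counts = bin_fragment_counts(reads, chrom_length=1000)
blocks = collapse_blocks(counts, chrom_length=1000)
```

Here the three fragments land in the first three bins, so five bins collapse into two blocks: one with count 1 and one with count 0.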

2b. Punctate Read Profiles: Preprocessing ChIP and Input BED Files for Detection of Peaks in Punctate Data

  1. Aggregate overlapping reads for plus and minus strand ChIP reads separately. The strand specific read densities should form a bimodal profile of plus and minus peaks. Choose plus/minus pairs of the most enriched peaks and use the distance between their summits as an estimate for the library fragment length.
  2. Shift the ChIP and input reads by half the fragment length toward the fragment center, then merge the shifted plus and minus strand reads and recalculate the read density. This methodology for estimating the fragment length was adopted from Zhang, et al. [3]. Positions with identical merged counts should be grouped into blocks, similar to step 2a.2.
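A minimal sketch of the fragment-length estimate and the shift, for one enriched plus/minus pair. All function names are hypothetical, and two simplifications are assumed: the summit is taken as the center of the maximal-coverage plateau, and a single peak pair is used, whereas in practice the estimate would be pooled over many of the most enriched pairs.

```python
from collections import Counter


def coverage(reads):
    """Per-base coverage from half-open (start, end) intervals."""
    cov = Counter()
    for start, end in reads:
        for pos in range(start, end):
            cov[pos] += 1
    return cov


def summit(cov):
    """Center of the positions that share the maximal coverage."""
    best = max(cov.values())
    top = [pos for pos, depth in cov.items() if depth == best]
    return (min(top) + max(top)) // 2


def estimate_fragment_length(plus_reads, minus_reads):
    """Distance between the minus-strand and plus-strand summits."""
    return summit(coverage(minus_reads)) - summit(coverage(plus_reads))


def shift_to_center(start, end, strand, fragment_len):
    """Move a read by half the fragment length toward the fragment center."""
    half = fragment_len // 2
    offset = half if strand == "+" else -half
    new_start = max(0, start + offset)
    return new_start, new_start + (end - start)


plus_reads = [(100, 136), (102, 138), (104, 140)]
minus_reads = [(300, 336), (302, 338), (304, 340)]
fragment_len = estimate_fragment_length(plus_reads, minus_reads)
shifted_plus = shift_to_center(100, 136, "+", fragment_len)
shifted_minus = shift_to_center(300, 336, "-", fragment_len)
```

After the shift, reads from the two strands that flank the same binding site pile up on the same coordinates, which is what turns the bimodal profile into a single peak.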

3. Estimate the Posterior Mean Read Density of Each Block using our BCMIX Approximation

  1. The read density of each block is modeled as a Poisson distribution, Pois(θt), whose mean parameter follows a mixture of Gamma distributions, Γ(α,β), with a prior probability p of a change point occurring at any block boundary. Conditioning Pois(θt) on Γ(α,β) effectively renders the model an infinite-state HMM. Estimate the hyper-parameters α, β, and p using maximum posterior likelihood.
  2. Explicitly calculate the Bayes estimate for each block, θt, as E(θt | Z). Replace the more traditional but time-consuming forward and backward filters often used in HMMs with the more computationally efficient Bounded Complexity Mixture (BCMIX) approximation to estimate the posterior means, θc. The resulting posterior means are "smoothed" into an approximately piecewise-constant profile, so adjacent blocks with identical θc should be merged into larger blocks with updated boundary coordinates.
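To make the role of the explicit formulas concrete, the sketch below runs a forward (filtering) pass for a Poisson-Gamma change-point model and caps the number of mixture components. It is written in the spirit of a bounded-complexity mixture but is not the authors' BCMIX implementation. Several choices are assumptions: β is treated as a rate parameter, the pruning rule keeps the most recent components plus the heaviest older ones, the hyper-parameters are fixed rather than estimated, and only filtered means E(θt | counts up to t) are produced, with no backward pass.

```python
import math


def log_predictive(y, shape, rate):
    """log P(y) under a Gamma(shape, rate) mixture of Poissons (negative binomial)."""
    return (math.lgamma(shape + y) - math.lgamma(shape) - math.lgamma(y + 1)
            + shape * math.log(rate / (rate + 1.0)) - y * math.log(rate + 1.0))


def _normalize(components):
    """Rescale log-weights so the mixture weights sum to one."""
    top = max(c[0] for c in components)
    log_total = top + math.log(sum(math.exp(c[0] - top) for c in components))
    return [(lw - log_total, a, b) for lw, a, b in components]


def filtered_means(counts, alpha=1.0, beta=1.0, p=0.05,
                   max_components=20, keep_recent=5):
    """Posterior mean of the Poisson rate at each block, given counts so far."""
    components = []  # (log_weight, shape, rate), ordered oldest segment start first
    means = []
    for y in counts:
        # Existing segments continue with probability 1 - p ...
        grown = [(lw + math.log(1.0 - p) + log_predictive(y, a, b), a + y, b + 1.0)
                 for lw, a, b in components]
        # ... or a new segment starts from the prior (always, at the first block).
        log_prior = math.log(p) if components else 0.0
        grown.append((log_prior + log_predictive(y, alpha, beta),
                      alpha + y, beta + 1.0))
        components = _normalize(grown)
        if len(components) > max_components:
            # Keep the newest components and the heaviest of the older ones.
            recent = components[-keep_recent:]
            older = sorted(components[:-keep_recent],
                           key=lambda c: c[0], reverse=True)
            components = _normalize(older[:max_components - keep_recent] + recent)
        # Conjugacy gives each component's mean in closed form: shape / rate.
        means.append(sum(math.exp(lw) * a / b for lw, a, b in components))
    return means


step_counts = [1] * 30 + [20] * 30
step_means = filtered_means(step_counts)
```

On this toy input the filtered mean stays at the low level over the first 30 blocks and moves to the high level shortly after the jump, which is the piecewise-constant behavior the protocol relies on when merging blocks.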

4a. Diffuse Read Profiles: Post-process Posterior Means into Segments of Diffuse Enrichment

  1. Use the number of input reads in each new θc block as the background rate, Pois(λa), and determine enrichment with a simple hypothesis test of whether the ChIP posterior mean, θc, exceeds a threshold δ. The default δ is the 90th percentile, which is appropriate in most cases.
  2. Merge adjacent θc blocks that exceed the enrichment threshold into a single region and report the merged coordinates in simple BED format. Alternatively, one can report the θc for each block in bedGraph format to preserve the high-resolution details of the read density estimates.
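The thresholding and merging steps can be sketched as below. This is an illustration only: the function names and the block tuple layout (start, end, θc, input rate) are hypothetical, δ is taken to be the chosen quantile of a Poisson distribution with the block's input rate, and θc and the input rate are assumed to be on the same scale.

```python
import math


def poisson_quantile(lam, q):
    """Smallest k with P(X <= k) >= q for X ~ Poisson(lam).

    Plain summation; intended for moderate rates only.
    """
    if lam < 0 or lam > 700:
        raise ValueError("rate outside the range this simple sketch supports")
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def enriched_regions(blocks, quantile=0.90):
    """Keep blocks whose posterior mean exceeds delta, merging adjacent ones."""
    regions = []
    for start, end, theta_c, input_rate in blocks:
        delta = poisson_quantile(input_rate, quantile)
        if theta_c <= delta:
            continue
        if regions and regions[-1][1] == start:
            # Contiguous with the previous enriched block: extend that region.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


blocks = [
    (0, 200, 1.0, 2.0),
    (200, 600, 9.5, 2.0),
    (600, 800, 7.0, 2.0),
    (800, 1000, 0.5, 2.0),
]
regions = enriched_regions(blocks)
```

The returned (start, end) pairs correspond to the merged coordinates that would be written out in BED format; keeping the per-block θc values instead would give the bedGraph alternative.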

4b. Punctate Read Profiles: Post-process Posterior Means into Peak Candidates

  1. Define the background rate, Pois(λa), as the average of all read counts (γ2) and identify all blocks that exceed the threshold, δ. Since punctate peaks are expected to be more substantially enriched, the default δ is set to the 99th percentile of Pois(λa).
  2. Set the block with the maximal θc as the candidate peak summit and adjoin flanking blocks that share a similar read density (±1 read count to allow for slight variation). This adjoined region is defined as a candidate binding site.
  3. Calculate λ2 as the average read count in the ChIP candidate binding site and test it against the input background rate, λ1, where the null hypothesis, H0, is that λ1 ≥ λ2. Reject H0 based on a p-value threshold. Output candidate peaks in BED format.
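A minimal sketch of the candidate-site and testing steps follows. Several details are assumptions, because the protocol does not fix them: flanking blocks are compared with the summit density (not with their immediate neighbor), λ2 is a length-weighted mean of the block densities, and the test is a one-sided Poisson tail probability of the rounded-up λ2 under the input background rate. All function names are hypothetical.

```python
import math


def candidate_site(blocks, tolerance=1.0):
    """Indices (left, right) of the summit block plus flanks within +/- tolerance."""
    summit = max(range(len(blocks)), key=lambda i: blocks[i][2])
    peak = blocks[summit][2]
    left = right = summit
    while left > 0 and abs(blocks[left - 1][2] - peak) <= tolerance:
        left -= 1
    while right < len(blocks) - 1 and abs(blocks[right + 1][2] - peak) <= tolerance:
        right += 1
    return left, right


def site_rate(blocks, left, right):
    """Length-weighted mean density over blocks left..right (inclusive)."""
    site = blocks[left:right + 1]
    total_len = sum(end - start for start, end, _ in site)
    return sum((end - start) * value for start, end, value in site) / total_len


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summing the upper tail directly."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while True:
        total += term
        i += 1
        term *= lam / i
        if i > lam and term <= total * 1e-17:
            break
    return min(1.0, total)


blocks = [(0, 50, 1.0), (50, 80, 11.2), (80, 100, 12.0),
          (100, 130, 11.5), (130, 200, 2.0)]
left, right = candidate_site(blocks)
lambda_2 = site_rate(blocks, left, right)
p_value = poisson_sf(math.ceil(lambda_2), lam=2.0)  # 2.0: illustrative input rate
```

A candidate would be reported when `p_value` falls below the chosen threshold; rounding λ2 up before the tail computation is a design choice made here for simplicity.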

Results

BCP excels at identifying regions of broad enrichment in histone modification data. As a point of reference, we previously compared our results to those of SICER [4], an existing tool that has demonstrated strong performance. To best illustrate BCP's advantages, we examined a histone modification that had been well studied, which establishes a foundation for assessing success rates. With this in mind, we analyzed H3K36me3, since it has been shown to associate strongly with actively transcribed gene bodies (

Discussion

We set out to develop a model for analyzing ChIPseq data that could identify both punctate and diffuse data structures equally well. Until now, regions of enrichment, particularly diffuse regions, which reflect the presupposed expectation of large island size, have been difficult to identify. To address these problems, we utilized the most recent advances in HMM technology, which possess many advantages over existing heuristic models and less innovative HMMs.

Our model makes use of a Bayesian...

Disclosures

No conflicts of interest declared.

Acknowledgements

STARR foundation award (MQZ), NIH grant ES017166 (MQZ), NSF grant DMS0906593 (HX).

Materials

Name | Company | Catalog Number | Comments
Linux-based workstation | | |

References

  1. Park, P. J. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669-680 (2009).
  2. Barski, A., et al. High-resolution profiling of histone methylations in the human genome. Cell. 129, 823-837 (2007).
  3. Zhang, Y., et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
  4. Zang, C., et al. A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 25, 1952-1958 (2009).
  5. Jothi, R., Cuddapah, S., Barski, A., Cui, K., Zhao, K. Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 36, 5221-5231 (2008).
  6. Qin, Z. S., et al. HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC Bioinformatics. 11, 369 (2010).
  7. Song, Q., Smith, A. D. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics. 27, 870-871 (2011).
  8. Spyrou, C., Stark, R., Lynch, A. G., Tavaré, S. BayesPeak: Bayesian analysis of ChIP-seq data. BMC Bioinformatics. 10, 299 (2009).
  9. Lai, T., Xing, H. A simple Bayesian approach to multiple change-points. Statistica Sinica. (2011).
  10. Robertson, G., et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods. 4, 651-657 (2007).
  11. Stitzel, M. L., et al. Global epigenomic analysis of primary human pancreatic islets provides insights into type 2 diabetes susceptibility loci. Cell Metab. 12, 443-455 (2010).
  12. Bernstein, B. E., et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. 28, 1045-1048 (2010).
  13. Karolchik, D., et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, 493-496 (2004).
  14. Matys, V., et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31, 374-378 (2003).
  15. Portales-Casamar, E., et al. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 38, D105-D110 (2010).

Keywords: Bayesian Change Point Algorithm, Genome-wide Analysis, ChIPseq Data Types, Protein-DNA Interactions, Read Density Profiles, Peaks, Transcription Factors, Histone Modifications, Statistical Models, Hidden Markov Models (HMMs), Ad Hoc Parameters, Resolution, Usability, Parameter Estimation Procedures, Finite State Classifications

This article has been published
