This protocol guides bioinformatics beginners through an introductory CUT&RUN analysis pipeline that enables users to complete an initial analysis and validation of CUT&RUN sequencing data. Completing the analysis steps described here, combined with downstream peak annotation, will allow users to draw mechanistic insights into chromatin regulation.
The CUT&RUN technique facilitates detection of protein-DNA interactions across the genome. Typical applications of CUT&RUN include profiling changes in histone tail modifications or mapping transcription factor chromatin occupancy. Widespread adoption of CUT&RUN is driven, in part, by technical advantages over conventional ChIP-seq that include lower cell input requirements, lower sequencing depth requirements, and increased sensitivity with reduced background signal due to a lack of cross-linking agents that otherwise mask antibody epitopes. Adoption has also been accelerated by the generous sharing of reagents by the Henikoff lab and by the development of commercial kits that make the technique accessible to beginners. As technical adoption of CUT&RUN increases, CUT&RUN sequencing analysis and validation become critical bottlenecks that must be surmounted to enable complete adoption by predominantly wet lab teams. CUT&RUN analysis typically begins with quality control checks on raw sequencing reads to assess sequencing depth, read quality, and potential biases. Reads are then aligned to a reference genome sequence assembly, and several bioinformatics tools are subsequently employed to annotate genomic regions of protein enrichment, confirm data interpretability, and draw biological conclusions. Although multiple in silico analysis pipelines have been developed to support CUT&RUN data analysis, their complex multi-module structure and use of multiple programming languages make these platforms difficult for bioinformatics beginners who lack familiarity with multiple programming languages but wish to understand the CUT&RUN analysis procedure and customize their analysis pipelines. Here, we provide a single-language, step-by-step CUT&RUN analysis pipeline protocol designed for users with any level of bioinformatics experience. This protocol includes critical quality checks to validate that the sequencing data are suitable for biological interpretation. We expect that following the introductory protocol provided in this article, combined with downstream peak annotation, will allow users to draw biological insights from their own CUT&RUN datasets.
The ability to measure interactions between proteins and genomic DNA is fundamental to understanding the biology of chromatin regulation. Effective assays that measure chromatin occupancy for a given protein provide at least two key pieces of information: i) genomic localization and ii) protein abundance at a given genomic region. Tracking the recruitment and localization changes of a protein of interest in chromatin can identify its direct target loci and reveal mechanistic roles of that protein in chromatin-based biological processes such as regulation of transcription, DNA repair, or DNA replication. The techniques available today to profile protein-DNA interactions are enabling researchers to explore chromatin regulation at unprecedented resolution. Such advances have been enabled through the introduction of new chromatin profiling techniques, including the development of Cleavage Under Targets and Release Using Nuclease (CUT&RUN) by the Henikoff laboratory. CUT&RUN offers several technical advantages over conventional chromatin immunoprecipitation (ChIP) that include lower cell input requirements, lower sequencing depth requirements, and increased sensitivity with reduced background signal due to a lack of cross-linking agents that otherwise mask antibody epitopes. Adopting this technique to study chromatin regulation requires a thorough understanding of the principles underlying the technique and of how to analyze, validate, and interpret CUT&RUN data.
The CUT&RUN procedure begins with binding of cells to Concanavalin A conjugated to magnetic beads to enable manipulation of low cell numbers throughout the procedure. Isolated cells are permeabilized using a mild detergent to facilitate introduction of an antibody that targets the protein of interest. Micrococcal nuclease (MNase) is then recruited to the bound antibody using a Protein A or Protein A/G tag tethered to the enzyme. Calcium is introduced to initiate enzymatic activity. MNase digestion results in mono-nucleosomal DNA-protein complexes. Calcium is subsequently chelated to end the digestion reaction, and short DNA fragments from the MNase digestion are released from nuclei, then subjected to DNA purification, library preparation, and high-throughput sequencing [1] (Figure 1).
In silico approaches to map and quantify protein occupancy across the genome have developed in parallel with the wet lab approaches used to enrich those DNA-protein interactions. Identification of regions of enriched signal (peaks) is one of the most critical steps in the bioinformatics analysis. Initial ChIP-seq analysis methods used algorithms such as MACS [2] and SICER [3], which employed statistical models to distinguish bona fide protein-DNA binding sites from background noise. However, the lower background noise and higher resolution of CUT&RUN data render some peak calling programs employed in ChIP-seq analysis unsuitable for CUT&RUN analysis [4]. This challenge highlights the need for new tools better suited to the analysis of CUT&RUN data. SEACR [4] represents one such tool recently developed to enable peak calling from CUT&RUN data while overcoming limitations associated with tools typically employed for ChIP-seq analysis.
Biological interpretations from CUT&RUN sequencing data are drawn from the outputs downstream of peak calling in the analysis pipeline. Several functional annotation programs can be implemented to predict the potential biological relevance of the called peaks from CUT&RUN data. For example, the Gene Ontology (GO) project provides well-established functional identification of genes of interest [5,6,7]. Various software tools and resources facilitate GO analysis to reveal genes and gene sets enriched amongst CUT&RUN peaks [8,9,10,11,12,13,14]. Furthermore, visualization software such as Deeptools [15], Integrative Genomics Viewer (IGV) [16], and the UCSC Genome Browser [17] enable visualization of signal distribution and patterns at regions of interest across the genome.
The ability to draw biological interpretations from CUT&RUN data depends critically upon validation of data quality. Critical components to validate include the assessment of: i) CUT&RUN library sequencing quality, ii) replicate similarity, and iii) signal distribution at peak centers. Completing the validation of all three components is crucial to ensure the reliability of CUT&RUN library samples and downstream analysis results. Therefore, it is essential to establish introductory CUT&RUN analysis guides to enable bioinformatics beginners and wet lab researchers to conduct such validation steps as part of their standard CUT&RUN analysis pipelines.
Alongside the development of the wet lab CUT&RUN technique, various in silico CUT&RUN analysis pipelines, such as CUT&RUNTools 2.0 [18,19], nf-core/cutandrun [20], and CnRAP [21], have been developed to support CUT&RUN data analysis. These tools provide powerful approaches to analyzing single-cell and bulk CUT&RUN and CUT&Tag datasets. However, their relatively complex modular program structure and the familiarity with multiple programming languages required to run them may hinder adoption by bioinformatics beginners who seek to thoroughly understand the CUT&RUN analysis steps and customize their own pipelines. Circumventing this barrier requires an introductory CUT&RUN analysis pipeline provided as simple step-by-step scripts written in a single programming language.
In this article, we describe a simple, single-language CUT&RUN analysis pipeline protocol that provides step-by-step scripts supported with detailed descriptions to enable new and novice users to conduct CUT&RUN sequencing analysis. Programs used in this pipeline have been made publicly available by the original developer groups. Major steps described in this protocol include read alignment, peak calling, functional analysis, and, most critically, validation steps that assess sample quality to determine whether the data are suitable and reliable for biological interpretation (Figure 2). Furthermore, this pipeline provides users the opportunity to cross-reference analysis results against publicly available CUT&RUN datasets. Ultimately, this CUT&RUN analysis pipeline protocol serves as an introductory guide and reference for bioinformatics beginners and wet lab researchers.
NOTE: Information for the CUT&RUN fastq files in GSE126612 is available in Table 1. Information related to the software applications used in this study is listed in the Table of Materials.
1. Downloading Easy-Shells_CUTnRUN pipeline from its Github page
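As an illustration of this step, the repository can be cloned with git. This is a minimal sketch only; the account name in the URL is a placeholder because the actual GitHub address is not reproduced in this excerpt.

```bash
# Hypothetical clone of the Easy-Shells_CUTnRUN repository; replace the placeholder
# account name with the actual GitHub address given in the protocol.
git clone https://github.com/<account>/Easy-Shells_CUTnRUN.git
cd Easy-Shells_CUTnRUN
chmod +x *.sh   # make the step-by-step shell scripts executable
```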
2. Installing the programs required for Easy Shells CUTnRUN
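A minimal sketch of one way to install the required programs, assuming a conda installation with the bioconda and conda-forge channels; the published scripts may install these tools differently.

```bash
# Create one environment containing the tools listed in the Table of Materials.
conda create -n cutnrun -c bioconda -c conda-forge \
    fastqc trim-galore cutadapt bowtie2 samtools bedtools picard \
    macs2 macs3 deeptools intervene sra-tools ucsc-bedgraphtobigwig
conda activate cutnrun
# SEACR is distributed as a shell script that calls R; clone it separately.
git clone https://github.com/FredHutch/SEACR.git
```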
3. Downloading the publicly available CUT&RUN dataset from Sequence Read Archive (SRA)
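A minimal sketch using the SRA Toolkit; the SRR accession shown is a placeholder for the GSE126612 run accessions listed in Table 1.

```bash
# Download one sequencing run and convert it to gzipped paired-end fastq files.
mkdir -p fastq
prefetch SRRXXXXXXX                                   # placeholder accession
fasterq-dump --split-files --threads 4 -O fastq SRRXXXXXXX
gzip fastq/SRRXXXXXXX_1.fastq fastq/SRRXXXXXXX_2.fastq
```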
4. Initial quality check for the raw sequencing files
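A minimal FastQC sketch for the raw reads (file paths assumed):

```bash
# Generate per-file quality reports for all raw fastq files.
mkdir -p qc/fastqc_raw
fastqc --threads 4 --outdir qc/fastqc_raw fastq/*.fastq.gz
```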
5. Quality and adapter trimming for raw sequencing files
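A minimal Trim Galore sketch for one paired-end sample (file names assumed); Trim Galore wraps Cutadapt and can re-run FastQC on the trimmed output.

```bash
# Quality and adapter trimming for paired-end reads, with post-trimming FastQC reports.
mkdir -p trimmed
trim_galore --paired --fastqc --cores 4 --output_dir trimmed \
    fastq/sample_R1.fastq.gz fastq/sample_R2.fastq.gz
```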
6. Downloading the bowtie2 index for the reference genomes for actual and spike-in control samples
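Prebuilt bowtie2 indexes can be downloaded from the Bowtie 2 website; as an equivalent local alternative (genome FASTA paths assumed, with E. coli assumed as the spike-in genome), the indexes can be built directly:

```bash
# Build bowtie2 indexes for the experimental genome and the spike-in control genome.
mkdir -p index
bowtie2-build --threads 4 hg38.fa  index/hg38
bowtie2-build --threads 4 ecoli.fa index/ecoli   # spike-in control
```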
7. Mapping trimmed CUT&RUN sequencing reads to the reference genomes
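A minimal mapping sketch using bowtie2 options commonly recommended for paired-end CUT&RUN data (the published scripts may use different settings); the input file names follow Trim Galore's default *_val_* naming.

```bash
mkdir -p bam logs
# Align trimmed read pairs to the experimental genome, keeping fragments of 10-700 bp.
bowtie2 -p 8 --local --very-sensitive-local --no-mixed --no-discordant --no-unal \
    -I 10 -X 700 -x index/hg38 \
    -1 trimmed/sample_R1_val_1.fq.gz -2 trimmed/sample_R2_val_2.fq.gz \
    2> logs/sample.bowtie2.log | samtools view -b -o bam/sample.bam -
# Repeat the alignment against index/ecoli to count spike-in fragments for normalization.
```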
8. Sorting and filtering the mapped read pairs files
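A minimal sorting and filtering sketch with samtools, keeping properly paired alignments above an assumed MAPQ cutoff of 30:

```bash
# Keep properly paired (-f 2), confidently mapped (-q 30) reads, then coordinate-sort and index.
samtools view -b -f 2 -q 30 bam/sample.bam |
    samtools sort -@ 4 -o bam/sample.filtered.sorted.bam -
samtools index bam/sample.filtered.sorted.bam
```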
9. Convert mapped read pairs to fragment BEDPE, BED and raw readcounts bedGraph files
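A minimal conversion sketch with bedtools; the chromosome sizes file and the 1,000 bp maximum fragment length are assumptions.

```bash
mkdir -p bed bedgraph
# bamtobed -bedpe requires name-sorted input.
samtools sort -n -@ 4 -o bam/sample.namesorted.bam bam/sample.filtered.sorted.bam
bedtools bamtobed -bedpe -i bam/sample.namesorted.bam > bed/sample.bedpe
# Collapse each read pair on the same chromosome into a single fragment interval (BED).
awk '$1==$4 && $6-$2 < 1000 {print $1"\t"$2"\t"$6}' bed/sample.bedpe |
    sort -k1,1 -k2,2n > bed/sample.fragments.bed
# Raw readcounts bedGraph of fragment coverage.
bedtools genomecov -bg -i bed/sample.fragments.bed -g hg38.chrom.sizes \
    > bedgraph/sample.raw.bedGraph
```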
10. Converting raw readcounts bedGraph files to normalized bedGraph and bigWig files
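A minimal normalization sketch, assuming a spike-in scaling scheme in which the scale factor is a constant divided by the spike-in fragment count; the constant and the file names are assumptions, not the published parameters.

```bash
mkdir -p bigwig
# Scale factor from the spike-in fragment count (constant of 10,000 chosen arbitrarily here).
SPIKE=$(wc -l < bed/sample.spikein.fragments.bed)
SCALE=$(echo "10000 / $SPIKE" | bc -l)
# Normalized bedGraph, then compression to bigWig (bedGraphToBigWig requires sorted input).
bedtools genomecov -bg -scale "$SCALE" -i bed/sample.fragments.bed -g hg38.chrom.sizes |
    sort -k1,1 -k2,2n > bedgraph/sample.norm.bedGraph
bedGraphToBigWig bedgraph/sample.norm.bedGraph hg38.chrom.sizes bigwig/sample.norm.bw
```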
11. Validating fragment size distribution
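A minimal Picard sketch of the fragment (insert) size check; CUT&RUN libraries typically show sub-nucleosomal and ~150 bp mono-nucleosomal fragments.

```bash
mkdir -p qc
# Tabulate and plot the insert size distribution of the filtered alignments.
picard CollectInsertSizeMetrics \
    I=bam/sample.filtered.sorted.bam \
    O=qc/sample.insert_size_metrics.txt \
    H=qc/sample.insert_size_histogram.pdf \
    M=0.05
```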
12. Calling peaks using MACS2, MACS3 and SEACR
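A minimal peak-calling sketch for one sample; the IgG control bedGraph name, the q-value cutoff, and the SEACR "non stringent" settings are assumptions rather than the published parameters.

```bash
mkdir -p peaks/macs2 peaks/macs3 peaks/seacr
# MACS2 and MACS3 in paired-end mode (-f BAMPE) with the human effective genome size.
macs2 callpeak -t bam/sample.filtered.sorted.bam -f BAMPE -g hs -q 0.05 \
    -n sample --outdir peaks/macs2
macs3 callpeak -t bam/sample.filtered.sorted.bam -f BAMPE -g hs -q 0.05 \
    -n sample --outdir peaks/macs3
# SEACR on fragment bedGraphs against an IgG control ("non" = no further normalization).
bash SEACR/SEACR_1.3.sh bedgraph/sample.raw.bedGraph bedgraph/IgG.raw.bedGraph \
    non stringent peaks/seacr/sample_vs_IgG
```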
13. Creating called peak bed files
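A minimal sketch of reducing caller-specific outputs to sorted three-column BED files for downstream comparisons (file names follow the commands sketched above).

```bash
mkdir -p peaks/bed
cut -f 1-3 peaks/macs2/sample_peaks.narrowPeak | sort -k1,1 -k2,2n > peaks/bed/sample.macs2.bed
cut -f 1-3 peaks/macs3/sample_peaks.narrowPeak | sort -k1,1 -k2,2n > peaks/bed/sample.macs3.bed
cut -f 1-3 peaks/seacr/sample_vs_IgG.stringent.bed | sort -k1,1 -k2,2n > peaks/bed/sample.seacr.bed
```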
14. Validating similarity between replicates using Pearson correlation and principal component analysis (PCA)
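A minimal deepTools sketch of the replicate similarity checks, assuming two replicate bigWig files:

```bash
# Score both replicates in genome-wide bins, then plot Pearson correlation and PCA.
multiBigwigSummary bins -b bigwig/rep1.norm.bw bigwig/rep2.norm.bw \
    --labels rep1 rep2 -o qc/replicate_scores.npz
plotCorrelation -in qc/replicate_scores.npz --corMethod pearson \
    --whatToPlot heatmap --plotNumbers -o qc/pearson_heatmap.pdf
plotPCA -in qc/replicate_scores.npz -o qc/pca.pdf
```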
15. Validating similarity between replicates, peak calling methods and options using Venn diagram
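A minimal Intervene sketch comparing peak sets between replicates and peak callers (BED file names assumed):

```bash
# Overlap of peak sets drawn as a Venn diagram.
intervene venn -i peaks/bed/rep1.macs2.bed peaks/bed/rep2.macs2.bed peaks/bed/rep1.seacr.bed \
    --names rep1_MACS2,rep2_MACS2,rep1_SEACR -o qc/intervene_venn
```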
16. Analyzing heatmaps and average plots to visualize called peaks
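A minimal deepTools sketch of heatmap and average profile plots centered on the called peaks (±2 kb window assumed):

```bash
# Matrix of normalized signal around peak centers, then heatmap and average profile.
computeMatrix reference-point --referencePoint center -b 2000 -a 2000 \
    -R peaks/bed/sample.seacr.bed -S bigwig/sample.norm.bw \
    --skipZeros -o qc/peak_matrix.gz
plotHeatmap -m qc/peak_matrix.gz -out qc/peak_heatmap.pdf --colorMap viridis
plotProfile -m qc/peak_matrix.gz -out qc/peak_profile.pdf
```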
Quality and adapter trimming retains reads with high sequencing quality
High-throughput sequencing techniques are prone to generating sequencing errors such as sequence 'mutations' in reads. Furthermore, sequencing adapter dimers can be enriched in sequencing datasets due to poor adapter removal during library preparation. Excessive sequencing errors, such as read mutations, generation of reads shorter than required for proper mapping, and enrichment of adapter dimers, can increase read map...
The ability to map protein occupancy on chromatin is fundamental to conducting mechanistic studies in the field of chromatin biology. As laboratories adopt new wet lab techniques to profile chromatin, the ability to analyze sequencing data from those wet lab experiments becomes a common bottleneck for wet lab scientists. Therefore, we describe an introductory step-by-step protocol to enable bioinformatics beginners to overcome the analysis bottleneck, and initiate analysis and quality control checks of their own CUT&...
The authors have nothing to disclose.
All illustrated figures were created with BioRender.com. CAI acknowledges support provided through an Ovarian Cancer Research Alliance Early Career Investigator Award, a Forbeck Foundation Accelerator Grant, and the Minnesota Ovarian Cancer Alliance National Early Detection Research Award.
Name | Company | Catalog Number | Comments |
bedGraphToBigWig | ENCODE | https://hgdownload.soe.ucsc.edu/admin/exe/ | Software to compress and convert readcounts bedGraph to bigWig |
bedtools-2.31.1 | The Quinlan Lab @ the U. of Utah | https://bedtools.readthedocs.io/en/latest/index.html | Software to process bam/bed/bedGraph files |
bowtie2 2.5.4 | Johns Hopkins University | https://bowtie-bio.sourceforge.net/bowtie2/index.shtml | Software to build bowtie index and perform alignment |
CollectInsertSizeMetrics (Picard) | Broad Institute | https://github.com/broadinstitute/picard | Software to perform insert size distribution analysis |
Cutadapt | NBIS | https://cutadapt.readthedocs.io/en/stable/index.html | Software to perform adapter trimming |
Deeptools v3.5.1 | Max Planck Institute | https://deeptools.readthedocs.io/en/develop/index.html | Software to perform Pearson correlation analysis, principal component analysis, and heatmap/average plot analysis |
FastQC Version 0.12.0 | Babraham Bioinformatics | https://github.com/s-andrews/FastQC | Software to check quality of fastq file |
Intervene v0.6.1 | Computational Biology & Gene regulation - Mathelier group | https://intervene.readthedocs.io/en/latest/index.html | Software to perform Venn diagram analysis using peak files |
MACS v2.2.9.1 | Chan Zuckerberg Initiative | https://github.com/macs3-project/MACS/tree/macs_v2 | Software to call peaks |
MACS v3.0.2 | Chan Zuckerberg Initiative | https://github.com/macs3-project/MACS/tree/master | Software to call peaks |
Samtools-1.21 | Wellcome Sanger Institute | https://github.com/samtools/samtools | Software to process sam/bam files |
SEACR v1.3 | Howard Hughes Medical Institute | https://github.com/FredHutch/SEACR | Software to call peaks |
SRA Toolkit Release 3.1.1 | NCBI | https://github.com/ncbi/sra-tools | Software to download SRR files from the SRA |
Trim_Galore v0.6.10 | Babraham Bioinformatics | https://github.com/FelixKrueger/TrimGalore | Software to perform quality and adapter trimming |