Method Article
* These authors contributed equally
A bioinformatics pipeline, miRDeep-P2 (miRDP2 for short), with updated plant miRNA criteria and an overhauled algorithm, accurately and efficiently analyzes microRNA transcriptomes in plants, especially in species with large and complex genomes.
MicroRNAs (miRNAs) are 20- to 24-nucleotide (nt) endogenous small RNAs (sRNAs) that are widespread in plants and animals and play potent roles in regulating gene expression at the post-transcriptional level. Sequencing sRNA libraries by Next Generation Sequencing (NGS) methods has been widely employed to identify and analyze miRNA transcriptomes in the last decade, resulting in a rapid increase in miRNA discovery. However, two major challenges arise in plant miRNA annotation because of the increasing depth of sequenced sRNA libraries and the size and complexity of plant genomes. First, many computational tools erroneously annotate other types of sRNAs from sRNA libraries, in particular short interfering RNAs (siRNAs), as miRNAs. Second, analyzing miRNA transcriptomes in plant species with large and complex genomes has become extremely time-consuming. To overcome these challenges, we recently upgraded miRDeep-P (a popular tool for miRNA transcriptome analyses) to miRDeep-P2 (miRDP2 for short) by employing a new filtering strategy, overhauling the scoring algorithm and incorporating newly updated plant miRNA annotation criteria. We tested miRDP2 against sequenced sRNA populations in five representative plants of increasing genomic complexity: Arabidopsis, rice, tomato, maize and wheat. The results indicate that miRDP2 processed these tasks with very high efficiency. In addition, miRDP2 outperformed other prediction tools in sensitivity and accuracy. Taken together, our results demonstrate that miRDP2 is a fast and accurate tool for analyzing plant miRNA transcriptomes and should help the community better annotate miRNAs in plants.
One of the most exciting discoveries in biology in the last two decades is the expanding role of sRNA species in regulating diverse functions of the genome1. In particular, miRNAs constitute an important class of 20- to 24-nt sRNAs in eukaryotes and mainly function at the post-transcriptional level as prominent gene regulators throughout the developmental stages of the life cycle as well as in stimulus and stress responses2,3. In plants, miRNAs arise from primary transcripts called pri-miRNAs, which are generally transcribed by RNA polymerase II as individual transcription units4,5. Processed by evolutionarily conserved cellular machinery (Drosha RNase III in animals, DICER-like in plants), pri-miRNAs are excised into the immediate miRNA precursors, pre-miRNAs, which contain sequences forming intra-molecular stem-loop structures6,7. Pre-miRNAs are then processed into double-stranded intermediates, namely miRNA duplexes, consisting of the functional strand, the mature miRNA, and the less frequently functional partner, miRNA*2,8. After being loaded into the RNA-induced silencing complex (RISC), mature miRNAs recognize their mRNA targets based on sequence complementarity, resulting in a negative regulatory function2,8. miRNAs can either destabilize their target transcripts or prevent target translation, but the former mode predominates in plants8,9.
Since the fortuitous discovery of the first miRNA in the nematode Caenorhabditis elegans10,11, much research has been devoted to miRNA identification and functional analysis, especially since NGS methods became available. The wide application of NGS has greatly promoted the use of computational tools designed to capture the unique features of miRNAs, such as the stem-loop structure of precursors and the preferential accumulation of sequence reads on the mature miRNA and miRNA*. As a result, researchers have achieved remarkable success in identifying miRNAs in diverse species. Based on a previously described probability model12, we developed miRDeep-P13, which was the first computational tool for discovering plant miRNAs from NGS data. miRDeep-P was specifically aimed at addressing the challenges of decoding plant miRNAs, which feature more variable precursor lengths and large paralogous families13,14,15. Since its release, this program has been downloaded thousands of times and used to annotate miRNA transcriptomes in more than 40 plant species16. Propelled by NGS-based tools like miRDeep-P, there has been a dramatic increase in the number of registered miRNAs in the public miRNA repository miRBase17, where over 38,000 miRNA items are currently hosted (release 22.1) in comparison to only ~500 miRNA items (release 2.0) in 200818.
However, two new challenges have arisen in plant miRNA annotation. First, high rates of false positives have heavily impacted the quality of plant miRNA annotations16,19 for the following reasons: 1) a deluge of endogenous short interfering RNAs (siRNAs) from NGS sRNA libraries were erroneously annotated as miRNAs owing to the lack of stringent miRNA annotation criteria; 2) for species without a priori miRNA information, false positives predicted from NGS data are difficult to eliminate. Using miRBase as an example, Taylor et al.20 found that one third of plant miRNA entries in the public repository21 (release 21) lacked convincing supporting evidence and that as many as three-fourths of plant miRNA families were questionable. Second, predicting miRNAs in plants with large and complex genomes has become extremely time-consuming16. To overcome these challenges, we updated miRDeep-P by adding a new filtering strategy, overhauling the scoring algorithm and integrating new criteria for plant miRNA annotation, and released the new version as miRDP2. In addition, we tested miRDP2 using NGS sRNA datasets from five plants of increasing genome size: Arabidopsis, rice, tomato, maize and wheat. Compared with five other widely used tools and its predecessor, miRDP2 parsed these sRNA data and analyzed miRNA transcriptomes faster and with improved accuracy and sensitivity.
Contents of the miRDP2 package
The miRDP2 package consists of six documented Perl scripts that are run sequentially by the prepared bash script. Of the six scripts, three (convert_bowtie_to_blast.pl, filter_alignments.pl, and excise_candidate.pl) are inherited from miRDeep-P. The other scripts are modified from the original version. The functions of the six scripts are described below:
preprocess_reads.pl filters the input reads, removing reads that are too short or too long (<19 nt or >25 nt), reads that map to Rfam ncRNA sequences, and reads with an RPM (Reads Per Million) value below 5. The script then retrieves reads that map to known mature miRNA sequences. The input files are the original reads in FASTA/FASTQ format and the bowtie2 output of reads mapped to miRNA and ncRNA sequences.
RPM is calculated as follows: RPM = (count of a read / total number of reads in the library) × 1,000,000.
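The length and abundance filters described for preprocess_reads.pl can be illustrated with a minimal Python sketch. This is not the package's Perl code: the function names are hypothetical, the RPM denominator is assumed to be the total number of reads in the library, and the Rfam and known-miRNA steps are left out.

```python
from collections import Counter

MIN_LEN, MAX_LEN = 19, 25   # length window stated in the text
MIN_RPM = 5.0               # RPM threshold stated in the text


def rpm(count, total_reads):
    """Reads Per Million: a read's count scaled to a library of one million reads.

    Assumption: total_reads is the size of the whole library before filtering.
    """
    return count * 1_000_000 / total_reads


def filter_reads(reads):
    """Return {sequence: rpm} for unique reads passing the length and RPM filters.

    `reads` is any iterable of read sequences, one entry per sequenced read.
    """
    counts = Counter(reads)          # collapse identical reads
    total = sum(counts.values())     # library size
    kept = {}
    for seq, n in counts.items():
        if not MIN_LEN <= len(seq) <= MAX_LEN:
            continue                 # too short or too long
        value = rpm(n, total)
        if value >= MIN_RPM:         # drop low-abundance reads
            kept[seq] = value
    return kept
```

For example, a read sequenced 5 times in a library of one million reads has an RPM of exactly 5 and would just pass the abundance threshold.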
convert_bowtie_to_blast.pl converts the bowtie format into BLAST-parsed format. BLAST-parsed format is a custom tab-delimited format derived from the standard NCBI BLAST output format.
filter_alignments.pl filters the alignments of deep sequencing reads to a genome. It removes partial alignments as well as multi-aligned reads, using a user-specified frequency cutoff. The basic input is a file in BLAST-parsed format.
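The two filters named above can be sketched in Python as follows. The `Alignment` record is a hypothetical simplification rather than the real BLAST-parsed layout, and counting hits only among full-length alignments is an assumption about the order of the two steps.

```python
from collections import Counter, namedtuple

# Hypothetical, simplified alignment record (not the real BLAST-parsed columns).
Alignment = namedtuple("Alignment", "read_id read_len aligned_len subject start strand")


def filter_alignments(alignments, max_hits):
    """Keep full-length alignments of reads that map to at most `max_hits` loci."""
    # 1) Discard partial alignments: the whole read must be aligned.
    full = [a for a in alignments if a.aligned_len == a.read_len]
    # 2) Count how many genomic hits each read has.
    hits = Counter(a.read_id for a in full)
    # 3) Discard reads whose hit count exceeds the user-specified cutoff.
    return [a for a in full if hits[a.read_id] <= max_hits]
```

A low cutoff removes reads from highly repetitive regions, at the cost of possibly losing members of large paralogous miRNA families, so the value is left to the user.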
excise_candidate.pl cuts out potential precursor sequences from a reference sequence using aligned reads as guidelines. The basic input is a file in BLAST-parsed format and a FASTA file. The output is all potential precursor sequences in FASTA format.
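The idea of using an aligned read as a guide for excision can be sketched in Python as below. The symmetric window and the `flank` value are illustrative assumptions, not the parameters used by excise_candidate.pl, and the helper names are hypothetical.

```python
def excise_candidate(reference, start, end, flank=100):
    """Cut a window around a read hit out of a reference sequence.

    `start` and `end` are 0-based, end-exclusive coordinates of the aligned read.
    The window is clamped to the reference boundaries.
    Returns (window_start, window_end, subsequence).
    """
    lo = max(0, start - flank)
    hi = min(len(reference), end + flank)
    return lo, hi, reference[lo:hi]


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines)
```

Each excised window would then be folded to obtain the secondary structure that the scoring step evaluates.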
mod-miRDP.pl is modified from the core miRDeep-P algorithm by changing the scoring system to use plant-specific parameters. It needs two input files: a precursor structure file in dot-bracket notation and a read distribution signature file.
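For readers unfamiliar with dot-bracket notation, the following Python sketch shows how such a structure string can be parsed into base pairs. It illustrates the input format only; it is not the miRDP2 scoring system, and `paired_fraction` is a hypothetical helper.

```python
def pair_table(structure):
    """Map each paired position to its partner for a dot-bracket string.

    '(' and ')' mark the two sides of a base pair; '.' marks an unpaired base.
    """
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def paired_fraction(structure, start, end):
    """Fraction of positions in the 0-based range [start, end) that are base-paired."""
    pairs = pair_table(structure)
    span = range(start, end)
    return sum(1 for i in span if i in pairs) / len(span)
```

In a simple hairpin such as `((((...))))`, position 0 pairs with position 10, and the three dots form the loop.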
mod-rm_redundant_meet_plant.pl needs three input files: chromosome_length, precursors and original_prediction generated by mod-miRDP.pl. It generates two output files: a non-redundant prediction file and a prediction file filtered by the newly updated plant miRNA criteria. Details of the output file format are described in section 1.4.
1. Installation and testing
2. Identifying novel miRNAs
3. Modifications and caution using miRDP2
The miRNA annotation pipeline described herein, miRDP2, was applied to 10 public sRNA-seq libraries from 5 plant species of increasing genome size: Arabidopsis thaliana, Oryza sativa (rice), Solanum lycopersicum (tomato), Zea mays (maize) and Triticum aestivum (wheat) (Figure 1A). Overall, for each species, 2 representative sRNA libraries from different tissues (collapsed into unique reads, details in the pro...
With the advent of NGS, a large number of miRNA loci have been identified from an ever-increasing amount of sRNA sequencing data in diverse species29,30. In the centralized community database miRBase21, the number of deposited miRNA items has increased almost 100-fold in the last decade. However, in comparison to miRNAs in animals, plant miRNAs have many unique features that make their identification and annotation more complicated13
The authors have nothing to disclose.
This work was supported by the Beijing Academy of Agriculture and Forestry Sciences (KJCX201917, KJCX20180425, and KJCX20180204) to XY and the National Natural Science Foundation of China (31621001) to LL.
Name | Company | Catalog Number | Comments |
Computer/computing node | N/A | N/A | Perl is required; at least 8 GB RAM and 100 GB storage are recommended |
Copyright © 2025 MyJoVE Corporation. All rights reserved