A Bioinformatics Pipeline to Accurately and Efficiently Analyze the MicroRNA Transcriptomes in Plants

Ying Wang; Zheng Kuang; Lei Li; Xiaozeng Yang

doi:10.3791/59864


Summary

A bioinformatics pipeline, miRDeep-P2 (miRDP2 for short), with updated plant miRNA criteria and an overhauled algorithm, accurately and efficiently analyzes microRNA transcriptomes in plants, especially in species with large and complex genomes.

Abstract

MicroRNAs (miRNAs) are 20- to 24-nucleotide (nt) endogenous small RNAs (sRNAs) that are widespread in plants and animals and play potent roles in regulating gene expression at the post-transcriptional level. Sequencing sRNA libraries by Next Generation Sequencing (NGS) methods has been widely employed to identify and analyze miRNA transcriptomes in the last decade, resulting in a rapid increase in miRNA discovery. However, two major challenges arise in plant miRNA annotation owing to the increasing depth of sequenced sRNA libraries and the size and complexity of plant genomes. First, many other types of sRNAs in sRNA libraries, in particular short interfering RNAs (siRNAs), are erroneously annotated as miRNAs by many computational tools. Second, analyzing miRNA transcriptomes in plant species with large and complex genomes becomes extremely time-consuming. To overcome these challenges, we recently upgraded miRDeep-P (a popular tool for miRNA transcriptome analyses) to miRDeep-P2 (miRDP2 for short) by employing a new filtering strategy, overhauling the scoring algorithm and incorporating newly updated plant miRNA annotation criteria. We tested miRDP2 against sequenced sRNA populations in five representative plants with increasing genomic complexity: Arabidopsis, rice, tomato, maize and wheat. The results indicate that miRDP2 processed these tasks with very high efficiency. In addition, miRDP2 outperformed other prediction tools in sensitivity and accuracy. Taken together, our results demonstrate that miRDP2 is a fast and accurate tool for analyzing plant miRNA transcriptomes and therefore a useful tool for helping the community better annotate miRNAs in plants.

Introduction

One of the most exciting discoveries in biology in the last two decades is the proliferating role of sRNA species in regulating diverse functions of the genome¹. In particular, miRNAs constitute an important class of 20- to 24-nt sRNAs in eukaryotes and mainly function at the post-transcriptional level as prominent gene regulators throughout the developmental stages of the life cycle as well as in stimulus and stress responses²,³. In plants, miRNAs arise from primary transcripts called pri-miRNAs, which are generally transcribed by RNA polymerase II as individual transcription units⁴,⁵. Processed by evolutionarily conserved cellular machinery (Drosha RNase III in animals, DICER-like in plants), pri-miRNAs are excised into the immediate miRNA precursors, pre-miRNAs, which contain sequences forming intra-molecular stem-loop structures⁶,⁷. Pre-miRNAs are then processed into double-stranded intermediates, namely miRNA duplexes, consisting of the functional strand, the mature miRNA, and its less frequently functional partner, miRNA*²,⁸. After being loaded into the RNA-induced silencing complex (RISC), mature miRNAs recognize their mRNA targets based on sequence complementarity, resulting in negative regulation²,⁸. miRNAs can either destabilize their target transcripts or prevent their translation, but the former mode predominates in plants⁸,⁹.

Since the fortuitous discovery of the first miRNA in the nematode Caenorhabditis elegans¹⁰,¹¹, much research has been devoted to miRNA identification and functional analysis, especially since NGS methods became available. The wide application of NGS has greatly promoted the use of computational tools designed to capture the unique features of miRNAs, such as the stem-loop structure of precursors and the preferential accumulation of sequence reads on the mature miRNA and miRNA*. As a result, researchers have achieved remarkable success in identifying miRNAs in diverse species. Based on a previously described probability model¹², we developed miRDeep-P¹³, which was the first computational tool for discovering plant miRNAs from NGS data. miRDeep-P was specifically aimed at the challenges of decoding plant miRNAs, which feature more variable precursor lengths and large paralogous families¹³,¹⁴,¹⁵. Since its release, the program has been downloaded thousands of times and used to annotate miRNA transcriptomes in more than 40 plant species¹⁶. Propelled by NGS-based tools like miRDeep-P, there has been a dramatic increase in the number of registered miRNAs in the public miRNA repository miRBase¹⁷, where over 38,000 miRNA items are currently hosted (release 22.1) in comparison to only ~500 miRNA items (release 2.0) in 2008¹⁸.

However, two new challenges have arisen in plant miRNA annotation. First, high false-positive rates have heavily impacted the quality of plant miRNA annotations¹⁶,¹⁹ for the following reasons: 1) a deluge of endogenous short interfering RNAs (siRNAs) from NGS sRNA libraries were erroneously annotated as miRNAs owing to the lack of stringent miRNA annotation criteria; 2) for species without a priori miRNA information, false-positives predicted from NGS data are difficult to eliminate. Using miRBase as an example, Taylor et al.²⁰ found that one third of plant miRNA entries in the public repository²¹ (release 21) lacked convincing supporting evidence and that as many as three-fourths of plant miRNA families were questionable. Second, predicting miRNAs in plant species with large and complex genomes becomes extremely time-consuming¹⁶. To overcome these challenges, we updated miRDeep-P by adding a new filtering strategy, overhauling the scoring algorithm and integrating new criteria for plant miRNA annotation, and released the new version as miRDP2. In addition, we tested miRDP2 using NGS sRNA datasets from species with gradually increasing genome sizes: Arabidopsis, rice, tomato, maize and wheat. Compared with five other widely used tools and with its previous version, miRDP2 parsed these sRNA data and analyzed miRNA transcriptomes faster and with improved accuracy and sensitivity.

Contents of the miRDP2 package
The miRDP2 package consists of six documented Perl scripts that should be run sequentially by the prepared bash script. Of the six scripts, three (convert_bowtie_to_blast.pl, filter_alignments.pl, and excise_candidate.pl) are inherited from miRDeep-P. The other scripts are modified from the original version. Functions of the six scripts are described in the following:

preprocess_reads.pl filters out input reads that are too short or too long (<19 nt or >25 nt), reads that map to Rfam ncRNA sequences, and reads with RPM (Reads Per Million) less than 5. The script then retrieves reads that map to known mature miRNA sequences. The input files are the original reads in FASTA/FASTQ format and the bowtie2 output of reads mapped to miRNA and ncRNA sequences.

The formula for calculating RPM is as follows:

[Formula image not reproduced: RPM calculation]
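
Because the formula itself is not reproduced here, the sketch below assumes the conventional definition of RPM: a read's copy number divided by the total number of reads in the library, scaled to one million. The function name `reads_per_million` and the choice of denominator are assumptions made for illustration; preprocess_reads.pl may compute the value differently.

```shell
# Sketch only: conventional RPM, assumed here to be
#   RPM = read_count / total_reads * 1,000,000
# The denominator used by preprocess_reads.pl may differ.
reads_per_million() {
  awk -v c="$1" -v t="$2" \
    'BEGIN { if (t <= 0) exit 1; printf "%.2f\n", c / t * 1000000 }'
}

# Example: a read seen 50 times in a library of 2,000,000 reads
reads_per_million 50 2000000   # prints 25.00
```

Under this assumption, a read needs 10 copies per million library reads to reach an RPM of 10.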

convert_bowtie_to_blast.pl converts the bowtie format into BLAST-parsed format. BLAST-parsed format is a custom tab-separated format derived from the standard NCBI BLAST output format.

filter_alignments.pl filters the alignments of deep sequencing reads to a genome. It filters partial alignments as well as multi-aligned reads (user-specified frequency cutoff). The basic input is a file in BLAST-parsed format.
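
The idea behind the multi-aligned read cutoff can be sketched as follows. This is a minimal illustration, not the logic of filter_alignments.pl: the input is a hypothetical, simplified two-column table (read ID and locus, one alignment per line) rather than the BLAST-parsed format, and the function name `filter_multimappers` is invented for this example.

```shell
# Sketch only: drop reads that align to more than max_loci locations.
# Input (hypothetical, simplified): one alignment per line, "read_id<TAB>locus".
# This is NOT the BLAST-parsed format that filter_alignments.pl reads.
filter_multimappers() {
  max_loci=$1
  alignments=$2
  # Pass 1 counts alignments per read; pass 2 prints reads at or below the cutoff.
  awk -F '\t' -v max="$max_loci" '
    NR == FNR { hits[$1]++; next }
    hits[$1] <= max
  ' "$alignments" "$alignments"
}

# Example with made-up alignments: read1 maps to three loci and is dropped.
tmp=$(mktemp)
printf 'read0\tchr1:100\nread1\tchr1:500\nread1\tchr2:40\nread1\tchr3:7\n' > "$tmp"
filter_multimappers 2 "$tmp"   # keeps only the read0 line
rm -f "$tmp"
```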

excise_candidate.pl cuts out potential precursor sequences from a reference sequence using aligned reads as guidelines. The basic input is a file in BLAST-parsed format and a FASTA file. The output is all potential precursor sequences in FASTA format.

mod-miRDP.pl is modified from the core miRDeep-P algorithm by changing the scoring system to use plant-specific parameters. It needs two input files: a precursor structure file in dot-bracket format and a read-distribution signature file.

mod-rm_redundant_meet_plant.pl needs three input files: chromosome_length, precursors and the original_prediction generated by mod-miRDP.pl. It generates two output files: a non-redundant prediction file and a prediction file filtered by the newly updated plant miRNA criteria. Details on the format of the output files are described in section 1.4.

Protocol

1. Installation and testing

Download required dependencies: Bowtie2²² and RNAfold²³. Compiled packages are recommended.
1. Download Bowtie2, a read mapping tool, from its home site (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml).
2. Download RNAfold, a tool in the ViennaRNA package used to predict RNA secondary structure, from http://www.tbi.univie.ac.at/~ivo/RNA/.
3. Before installing miRDP2, ensure that these two dependencies are correctly installed, and customize the bash environment file (e.g., .bashrc) to set a correct PATH for these two dependencies.
  NOTE: Other mapping tools such as Bowtie²⁴ are also suitable for miRDP2; from version 1.1.3 onward, either Bowtie or Bowtie2 can be used.
To download the miRDP2 package, go to https://sourceforge.net/projects/mirdp2/files/latest_version/ and fetch the tarball files.
Before installing miRDP2, make sure that Perl is in the PATH. To install miRDP2, extract all contents of the downloaded tarball file into one folder (command lines as in 1.4.2), and then add the folder path to the PATH.
NOTE: A computer or computing node with at least 8 GB RAM and 100 GB storage is recommended to run miRDP2.
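
Before moving on, it may help to confirm that the required programs are reachable through PATH. The sketch below is one way to do this; the helper name `check_tool` is hypothetical, and the executable names `perl`, `bowtie2` and `RNAfold` are assumed to be those installed by the standard packages.

```shell
# Sketch only: report whether each required program is reachable via PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

missing=0
for tool in perl bowtie2 RNAfold; do
  check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing required tool(s) missing"
```

If a tool is reported missing, add its installation directory to PATH in the bash environment file (e.g., .bashrc) and open a new shell before testing again.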
Test the miRDP2 pipeline.
1. To test whether miRDP2 has been correctly installed, use the test data and the expected output found in https://sourceforge.net/projects/mirdp2/files/TestData/. Test data contain one formatted GSM sequencing file and one Arabidopsis thaliana genome file.
2. Move all downloaded files to the current working directory:
  mv miRDP2-v*.tar.gz TestData.tar.gz ncRNA_rfam.tar.gz <user_selected_folder>
  cd <user_selected_folder>
3. Extract the compressed tarball files:
  tar -xvzf miRDP2-v*.tar.gz
  tar -xvzf TestData.tar.gz
  tar -xvzf ncRNA_rfam.tar.gz
4. Build the Arabidopsis genome reference index:
  bowtie2-build -f ./TestData/TAIR10_genome.fa ./TestData/TAIR10_genome
5. Build the ncRNA reference index:
  bowtie2-build -f ./ncRNA_rfam.fa ./1.1.3/script/index/rfam_index
6. Run the miRDP2 pipeline:
  bash ./1.1.3/miRDP2-v1.1.3_pipeline.bash -g ./TestData/TAIR10_genome.fa -i ./TestData/TAIR10_genome -f ./TestData/GSM2094927.fa -o .
  NOTE: Linux commands used are in bold and italic fonts, with command line options in italics. The asterisk (*) indicates the version of miRDP2 (the current version is 1.1.3). The bowtie2-build command should take roughly 10 minutes, and the miRDP2 pipeline should finish within several minutes.
Check testing outputs.
1. Note that a folder named 'GSM2094927-15-0-10' is automatically generated in <user_selected_folder>, containing all intermediate files and results.
2. Check that the tab-delimited output file GSM2094927-15-0-10_filter_P_prediction, the final output of predicted miRNAs, contains columns that indicate chromosome ID, strand direction, representative read ID, precursor ID, mature miRNA location, precursor location, mature sequence, and precursor sequence. Note the additional bed file derived from this file to facilitate further analysis.
3. Check the file "progress_log", which provides information about finished steps, and the files "script_log" and "script_err", which contain program output and warnings.
  NOTE: Currently, we have tested miRDP2 on two platforms: CentOS release 6.5 on a cluster server and Cygwin 2.6.0 on a Windows PC. miRDP2 should work on similar systems that support Perl.
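
As an example of downstream use, the mature sequences in the prediction file could be written out as FASTA. The sketch below assumes that the eight tab-separated columns appear in exactly the order listed above, which should be checked against a real output file; the function name and the sample row (including its placeholder location fields) are hypothetical.

```shell
# Sketch only: write mature miRNA sequences from a prediction file as FASTA.
# Assumes eight tab-separated columns in the order listed in the text:
# chromosome, strand, read ID, precursor ID, mature location,
# precursor location, mature sequence, precursor sequence.
prediction_to_mature_fasta() {
  awk -F '\t' 'NF >= 8 { printf ">%s\n%s\n", $4, $7 }' "$1"
}

# Example with one made-up row; location and precursor fields are placeholders.
tmp=$(mktemp)
printf 'Chr1\t+\tread0_x29909\tprecursor_1\tmature_loc\tprecursor_loc\tTTTGGATTGAAGGGAGCTCTA\tPRECURSOR_SEQ\n' > "$tmp"
prediction_to_mature_fasta "$tmp"
rm -f "$tmp"
```

The location columns are passed over untouched because their exact format is not described here.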

2. Identifying novel miRNAs

Before running the pipeline, ensure that the input reads are preprocessed into proper format.
NOTE: The new version 1.1.3 of miRDP2 can accept original FASTQ format files as inputs, although the process of formatting reads is carried out as in previous versions.
1. First, remove adapters from the 5' and 3' ends of the deep sequencing reads (if present).
2. Second, parse the deep sequencing reads into FASTA format.
3. Third, remove redundancy such that reads with identical sequence are represented with a single and unique FASTA entry.
4. Finally, ensure that all of the FASTA identifiers are unique. Each sequence identifier must end with '_x' followed by an integer indicating the copy number of that exact sequence in the deep sequencing dataset. One way to ensure unique FASTA identifiers is to include a running number in the ID. For reference, see the file GSM2094927.fa in the test data (https://sourceforge.net/projects/mirdp2/files/TestData/).
5. See the following for examples of correctly formatted reads:
  
  >read0_x29909
  TTTGGATTGAAGGGAGCTCTA
  >read1_x36974
  TTCCACAGCTTTCTTGAACTG
  >read2_x32635
  TTCCACAGCTTTCTTGAACTT
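
Steps 2 to 4 above can be sketched with awk as shown below. This is an illustrative helper, not part of the miRDP2 package: the name `collapse_reads` is invented, the input is assumed to be adapter-trimmed FASTA with each sequence on a single line, and the `read<N>` prefix simply mirrors the examples above.

```shell
# Sketch only: collapse identical reads into unique FASTA entries named
# read<N>_x<count>, numbered by first appearance in the input.
# Assumes adapter-trimmed FASTA with each sequence on a single line.
collapse_reads() {
  awk '
    /^>/    { next }                    # skip the original headers
    NF == 0 { next }                    # skip blank lines
    {
      seq = $1
      if (!(seq in count)) order[n++] = seq
      count[seq]++
    }
    END {
      for (i = 0; i < n; i++)
        printf ">read%d_x%d\n%s\n", i, count[order[i]], order[i]
    }
  ' "$1"
}

# Example with three made-up reads, two of them identical
tmp=$(mktemp)
printf '>a\nTTTGG\n>b\nTTCCA\n>c\nTTTGG\n' > "$tmp"
collapse_reads "$tmp"
rm -f "$tmp"
```

For this input the sketch emits read0_x2 (TTTGG) followed by read1_x1 (TTCCA); the running number keeps every identifier unique.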
Build reference indices.
1. For the genome reference, to save time, download Bowtie2 index files from the iGenomes website (https://support.illumina.com/sequencing/sequencing_software/igenome.html) if the genome sequences of the species of interest have been indexed. Otherwise, index the reference sequences and keep the index files until the project is finished, so that the genome does not need to be re-indexed. Details on how to index a genome reference are included in the bowtie2 manual (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml).
2. A second index, built from non-miRNA ncRNAs, is also needed to filter out noise from other non-coding RNA fragments. The file is a collection of the main ncRNA sequences from Rfam, including rRNA, tRNA, snRNA, and snoRNA. To build this index, please refer to part 1.4, as the index should be placed and named correctly, i.e., <miRDP2_version>/script/index/rfam_index.
Run miRDP2.
1. To use miRDP2 to detect new miRNAs from deep sequencing data, run the bash script in the package to start the analysis pipeline (An example can be found in step 1.4):
  <path_to_miRDP2_folder>/miRDP2-v*.*_pipeline.bash -g <genome_file> -i <path_to_index/index_prefix> -f <seq_file> -o <output_folder>
  where * indicates the version of the pipeline bash script. There are three parameters that can be modified: 1) the number of different locations a read can map to, 2) the number of mismatches allowed when running bowtie2, and 3) the RPM (Reads Per Million) threshold. Modify these using the -L, -M, and -R options, respectively. A detailed explanation is in section 3.1.
Check the miRDP2 outputs.
1. Note that the output folder will be automatically generated under <output_folder>, and named '<seq_file_name>-15-0-10'; the last 3 numbers indicate the values (default in this case) for parameters 1, 2, and 3, respectively. The file <seq_file_name>_filter_P_prediction contains information of the final predicted miRNAs satisfying the newly updated plant miRNA annotation criteria. Details on the format of output file are described in part 1.4.
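
When scripting around the pipeline, the name of the result folder can be predicted from the read file and the three parameter values. The sketch below assumes that <seq_file_name> is the read file name without its directory or extension, as in the test run (GSM2094927.fa gives GSM2094927-15-0-10); the function name `expected_output_dir` is hypothetical.

```shell
# Sketch only: predict the output folder name described in the text,
# <seq_file_name>-<L>-<M>-<R>, from the read file and the three parameters.
# Assumes <seq_file_name> is the file name without directory or extension.
expected_output_dir() {
  seq_file=$1
  max_loci=${2:-15}     # -L, default 15
  mismatches=${3:-0}    # -M, default 0
  rpm_cutoff=${4:-10}   # -R, default 10
  base=$(basename "$seq_file")
  base=${base%.*}
  printf '%s-%s-%s-%s\n' "$base" "$max_loci" "$mismatches" "$rpm_cutoff"
}

expected_output_dir ./TestData/GSM2094927.fa   # prints GSM2094927-15-0-10
expected_output_dir reads.fa 20 1 5            # prints reads-20-1-5
```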

3. Modifications and caution using miRDP2

Parameters that can be modified
1. Use the '-L' option to set the limit on how many locations a read can map to (parameter 1). Reads that map to too many sites are probably associated with repeat sequences and are unlikely to be miRNAs. The default setting is 15. For species with miRNA families that have many members, the first parameter may be increased manually to adapt to the genome landscape.
2. Use the '-M' option to set the allowed mismatches for bowtie (parameter 2). The default setting is 0.
3. Use the '-R' option to set the threshold for reads potentially corresponding to mature miRNAs (parameter 3). To reduce time consumption and false-positives, filter reads by RPM. Only reads exceeding a certain RPM threshold may represent mature sequences of miRNAs rather than background noise, and would be kept for further analysis. The default setting is 10 RPM.
4. Note that changing these parameters can affect performance and running time. In general, increasing parameters 1 and 2 or decreasing parameter 3 yields less stringent results and longer running times, and vice versa.
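
To see how the RPM threshold (parameter 3) changes the number of candidate reads, a collapsed read file can be filtered at different cutoffs before running the pipeline. The sketch below is an illustration only and is not how miRDP2 applies '-R' internally: it assumes RPM is a read's copy number divided by the sum of all copy numbers in the file, scaled to one million, and the function name `filter_by_rpm` is invented.

```shell
# Sketch only: keep collapsed reads (IDs ending in _x<count>) whose RPM is
# at or above a threshold, to show how a stricter cutoff shrinks the read set.
# Assumes RPM = count / total reads in the file * 1,000,000
# and that each sequence occupies a single line.
filter_by_rpm() {
  threshold=$1
  fasta=$2
  awk -v min="$threshold" '
    NR == FNR {                               # pass 1: sum all copy numbers
      if ($0 ~ /^>/) { c = $0; sub(/^.*_x/, "", c); total += c }
      next
    }
    /^>/ {                                    # pass 2: decide per record
      c = $0; sub(/^.*_x/, "", c)
      keep = (total > 0 && c / total * 1000000 >= min)
    }
    keep
  ' "$fasta" "$fasta"
}

# Example on a deliberately tiny, made-up library of 100 reads
tmp=$(mktemp)
printf '>r0_x90\nAAAA\n>r1_x9\nCCCC\n>r2_x1\nGGGG\n' > "$tmp"
filter_by_rpm 50000 "$tmp"    # keeps r0 and r1
filter_by_rpm 500000 "$tmp"   # keeps r0 only
rm -f "$tmp"
```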
Redundancy and miRNA*
1. Note that the output miRNAs from miRDP2 may differ from the known miRNAs. We found that this is mainly due to one of two reasons: heterogeneity of the mature miRNAs or the relative abundance of miRNA and miRNA*. We found that this does not impact the optimal length selection of precursors and the profiling of known miRNA genes.

Results

The miRNA annotation pipeline, miRDP2, described herein is applied to 10 public sRNA-seq libraries from 5 plant species with gradually increased genome length, including Arabidopsis thaliana, Oryza sativa (rice), Solanum lycopersicum (tomato), Zea mays (maize) and Triticum aestivum (wheat) (Figure 1A). Overall, for each species, 2 representative sRNA libraries from different tissues (collapsed into unique reads, details in the pro...

Discussion

With the advent of NGS, a large number of miRNA loci have been identified from an ever-increasing amount of sRNA sequencing data in diverse species²⁹,³⁰. In the centralized community database miRBase²¹, the deposited miRNA items have increased almost 100 times in the last decade. However, in comparison to miRNAs in animals, plant miRNAs have many unique features that make the identification/annotation more complicated¹³

Disclosures

The authors have nothing to disclose.

Acknowledgements

This work has been supported by Beijing Academy of Agriculture and Forestry Sciences (KJCX201917, KJCX20180425, and KJCX20180204) to XY and National Natural Science Foundation of China (31621001) to LL.

Materials

Name	Company	Catalog Number	Comments
Computer/computing node	N/A	N/A	Perl is required; at least 8 GB RAM and 100 GB storage are recommended

References

Ghildiyal, M., Zamore, P. D. Small silencing RNAs: an expanding universe. Nature Reviews Genetics. 10 (2), 94-108 (2009).
Bartel, D. P. MicroRNAs: target recognition and regulatory functions. Cell. 136 (2), 215-233 (2009).
Moran, Y., Agron, M., Praher, D., Technau, U. The evolutionary origin of plant and animal microRNAs. Nature Ecology & Evolution. 1 (3), 27 (2017).
Xie, Z., et al. Expression of Arabidopsis MIRNA genes. Plant Physiology. 138 (4), 2145-2154 (2005).
Zhao, X., Zhang, H., Li, L. Identification and analysis of the proximal promoters of microRNA genes in Arabidopsis. Genomics. 101 (3), 187-194 (2013).
Bologna, N. G., Mateos, J. L., Bresso, E. G., Palatnik, J. F. A loop-to-base processing mechanism underlies the biogenesis of plant microRNAs miR319 and miR159. EMBO Journal. 28 (23), 3646-3656 (2009).
Rogers, K., Chen, X. Biogenesis, turnover, and mode of action of plant microRNAs. Plant Cell. 25 (7), 2383-2399 (2013).
Voinnet, O. Origin, biogenesis, and activity of plant microRNAs. Cell. 136 (4), 669-687 (2009).
Iwakawa, H. O., Tomari, Y. The Functions of MicroRNAs: mRNA Decay and Translational Repression. Trends in Cell Biology. 25 (11), 651-665 (2015).
Lee, R. C., Feinbaum, R. L., Ambros, V. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 75 (5), 843-854 (1993).
Wightman, B., Ha, I., Ruvkun, G. Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell. 75 (5), 855-862 (1993).
Friedlander, M. R., et al. Discovering microRNAs from deep sequencing data using miRDeep. Nature Biotechnology. 26 (4), 407-415 (2008).
Yang, X., Li, L. miRDeep-P: a computational tool for analyzing the microRNA transcriptome in plants. Bioinformatics. 27 (18), 2614-2615 (2011).
Meyers, B. C., et al. Criteria for annotation of plant MicroRNAs. Plant Cell. 20 (12), 3186-3190 (2008).
Yang, X., Zhang, H., Li, L. Global analysis of gene-level microRNA expression in Arabidopsis using deep sequencing data. Genomics. 98 (1), 40-46 (2011).
Kuang, Z., Wang, Y., Li, L., Yang, X. miRDeep-P2: accurate and fast analysis of the microRNA transcriptome in plants. Bioinformatics. (2018).
Kozomara, A., Birgaoanu, M., Griffiths-Jones, S. miRBase: from microRNA sequences to function. Nucleic Acids Research. 47 (1), 155-162 (2019).
Griffiths-Jones, S., Saini, H. K., van Dongen, S., Enright, A. J. miRBase: tools for microRNA genomics. Nucleic Acids Research. 36, 154-158 (2008).
Axtell, M. J., Meyers, B. C. Revisiting Criteria for Plant MicroRNA Annotation in the Era of Big Data. Plant Cell. 30 (2), 272-284 (2018).
Taylor, R. S., Tarver, J. E., Hiscock, S. J., Donoghue, P. C. Evolutionary history of plant microRNAs. Trends in Plant Science. 19 (3), 175-182 (2014).
Kozomara, A., Griffiths-Jones, S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Research. 42, 68-73 (2014).
Langmead, B., Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods. 9 (4), 357-359 (2012).
Lorenz, R., et al. ViennaRNA Package 2.0. Algorithms for Molecular Biology. 6, 26 (2011).
Langmead, B., Trapnell, C., Pop, M., Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 10 (3), 25 (2009).
An, J., Lai, J., Sajjanhar, A., Lehman, M. L., Nelson, C. C. miRPlant: an integrated tool for identification of plant miRNA from RNA sequencing data. BMC Bioinformatics. 15, 275 (2014).
Lei, J., Sun, Y. miR-PREFeR: an accurate, fast and easy-to-use plant miRNA prediction tool using small RNA-Seq data. Bioinformatics. 30 (19), 2837-2839 (2014).
Evers, M., Huttner, M., Dueck, A., Meister, G., Engelmann, J. C. miRA: adaptable novel miRNA identification in plants using small RNA sequencing data. BMC Bioinformatics. 16, 370 (2015).
Mathelier, A., Carbone, A. MIReNA: finding microRNAs with high accuracy and no learning at genome scale and from deep sequencing data. Bioinformatics. 26 (18), 2226-2234 (2010).
Zhu, Q. H., et al. A diverse set of microRNAs and microRNA-like small RNAs in developing rice grains. Genome Research. 18 (9), 1456-1465 (2008).
Fahlgren, N., et al. MicroRNA gene evolution in Arabidopsis lyrata and Arabidopsis thaliana. Plant Cell. 22 (4), 1074-1089 (2010).
Fromm, B., et al. A Uniform System for the Annotation of Vertebrate microRNA Genes and the Evolution of the Human microRNAome. Annual Review of Genetics. 49, 213-242 (2015).
Blevins, T., et al. Identification of Pol IV and RDR2-dependent precursors of 24 nt siRNAs guiding de novo DNA methylation in Arabidopsis. eLife. 4, 09591 (2015).
Zhai, J., et al. A One Precursor One siRNA Model for Pol IV-Dependent siRNA Biogenesis. Cell. 163 (2), 445-455 (2015).
Werner, S., Wollmann, H., Schneeberger, K., Weigel, D. Structure determinants for accurate processing of miR172a in Arabidopsis thaliana. Current Biology. 20 (1), 42-48 (2010).
Mateos, J. L., Bologna, N. G., Chorostecki, U., Palatnik, J. F. Identification of microRNA processing determinants by random mutagenesis of Arabidopsis MIR172a precursor. Current Biology. 20 (1), 49-54 (2010).
Vitsios, D. M., et al. Mirnovo: genome-free prediction of microRNAs from small RNA sequencing data and single-cells using decision forests. Nucleic Acids Research. 45 (21), 177 (2017).
