mirMachine: A One-Stop Shop for Plant miRNA Annotation

H. Busra Cagirici; Taner Z. Sen; Hikmet Budak

doi:10.3791/62430

Summary

Herein, we present mirMachine, a new miRNA pipeline that 1) identifies known and novel miRNAs more accurately and 2) is fully automated and freely available. Users can now execute a short submission script to run the entire mirMachine pipeline.

Abstract

Of the different types of noncoding RNAs, microRNAs (miRNAs) have arguably been in the spotlight over the last decade. As post-transcriptional regulators of gene expression, miRNAs play key roles in various cellular pathways, including development and responses to biotic and abiotic stresses, such as diseases and drought. The availability of high-quality reference genome sequences has enabled the identification and annotation of miRNAs in several plant species, where miRNA sequences are highly conserved. As computational miRNA identification and annotation are mostly error-prone processes, homology-based predictions increase prediction accuracy. Over the last decade, we developed and improved the miRNA annotation pipeline SUmir, which has since been used for several plant genomes.

This study presents a fully automated, new miRNA pipeline, mirMachine (miRNA Machine), which builds on the previous pipeline by (i) adding an additional filtering step on the secondary structure predictions, (ii) making it fully automated, and (iii) introducing new options to predict either known miRNAs based on homology or novel miRNAs based on small RNA sequencing reads. The new miRNA pipeline, mirMachine, was tested using the Arabidopsis Information Resource (TAIR10) release of the Arabidopsis genome and the International Wheat Genome Sequencing Consortium (IWGSC) wheat reference genome v2.

Introduction

Advances in next-generation sequencing technologies have widened the understanding of RNA structures and regulatory elements, revealing functionally important non-coding RNAs (ncRNAs). Among the different types of ncRNAs, microRNAs (miRNAs) constitute a fundamental regulatory class of small RNAs with a length between 19 and 24 nucleotides in plants¹,². Since the discovery of the first miRNA in the nematode Caenorhabditis elegans³, the presence and the functions of miRNAs have been studied extensively in both animal and plant genomes⁴,⁵,⁶. miRNAs function by targeting mRNAs for cleavage or translational repression⁷. Accumulating evidence has also shown that miRNAs are involved in a wide range of biological processes in plants, including growth and development⁸, self-biogenesis⁹, and several biotic and abiotic stress responses¹⁰.

In plants, miRNAs are initially processed from long primary transcripts called pri-miRNAs¹¹. These pri-miRNAs, generated by RNA polymerase II inside the nucleus, are long transcripts forming an imperfect fold-back structure¹². The pri-miRNAs later undergo a cleavage process to produce endogenous single-stranded (ss) hairpin precursors of miRNAs called pre-miRNAs¹¹. The pre-miRNA forms a hairpin-like structure, wherein a single strand folds into a double-stranded structure, from which an miRNA duplex (miRNA/miRNA*) is excised¹³. Dicer-like protein cuts both strands of the miRNA/miRNA* duplex, leaving 2-nucleotide 3'-overhangs¹⁴,¹⁵. The miRNA duplex is methylated inside the nucleus, which protects the 3'-end of the miRNA from degradation and uridylation activity¹⁶,¹⁷. A helicase unwinds the methylated miRNA duplex after export and exposes the mature miRNA to the RNA-induced silencing complex (RISC) in the cytosol¹⁸. One strand of the duplex, the mature miRNA, is incorporated into RISC, whereas the other strand, miRNA*, is degraded. The miRNA-RISC complex binds to the target sequence, leading to either mRNA degradation in the case of full complementarity or translational repression in the case of partial complementarity¹³.

Based on the expression and biogenesis features, guidelines for miRNA annotation have been described¹⁵,¹⁹. Following these guidelines, Lucas and Budak developed the SUmir pipeline to perform homology-based in silico miRNA identification in plants⁹. The SUmir pipeline was composed of two scripts: SUmirFind and SUmirFold. SUmirFind performs similarity searches against known miRNA datasets through National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST) screening with modified parameters to include only hits with 2 or fewer mismatches and to avoid bias towards shorter hits (-task blastn-short -ungapped -penalty -1 -reward 1). SUmirFold evaluates the secondary structure of the putative miRNA sequences from BLAST²⁰ results using UNAfold²¹. SUmirFold differentiates miRNAs from small interfering RNAs by identifying the characteristics of the hairpin structure. Moreover, it differentiates miRNAs from other ssRNAs such as tRNA and rRNA using two parameters: a minimum fold energy index (MFEI) > 0.67 and a GC content of 24-71%. This pipeline has recently been updated by adding two additional steps to (i) increase sensitivity, (ii) increase annotation accuracy, and (iii) provide the genomic distribution of the predicted miRNA genes²². Given the high conservation of plant miRNA sequences²³, this pipeline was originally designed for homology-based miRNA prediction. Novel miRNAs, however, could not be accurately identified with this bioinformatics analysis, as it relied heavily on sequence conservation of miRNAs between closely related species.
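The two thresholds quoted above (minimum fold energy index > 0.67 and GC content of 24-71%) can be illustrated with a short shell sketch. It assumes the commonly used definition in which the adjusted minimum free energy is AMFE = |MFE| / length × 100 and the index is MFEI = AMFE / GC%; the exact formula and boundary handling inside SUmirFold or mirMachine may differ, and the function names and toy input below are hypothetical.

```shell
# gc_percent SEQ: print the G+C content of SEQ as a percentage.
gc_percent() {
  awk -v s="$1" 'BEGIN {
    n = length(s)
    gc = gsub(/[GCgc]/, "", s)        # gsub returns the number of G/C bases
    printf "%.2f\n", (n > 0 ? 100 * gc / n : 0)
  }'
}

# mfei MFE SEQ: print the index, assuming
#   AMFE = |MFE| / length * 100   and   MFEI = AMFE / GC%.
mfei() {
  awk -v mfe="$1" -v s="$2" 'BEGIN {
    n = length(s)
    gc = gsub(/[GCgc]/, "", s)
    if (n == 0 || gc == 0) { print "NA"; exit }
    if (mfe < 0) mfe = -mfe
    printf "%.2f\n", (mfe / n * 100) / (100 * gc / n)
  }'
}

# passes_structure_filter MFE SEQ: succeed when MFEI > 0.67 and
# 24 <= GC% <= 71 (inclusive bounds are an assumption).
passes_structure_filter() {
  m=$(mfei "$1" "$2")
  g=$(gc_percent "$2")
  [ "$m" != "NA" ] || return 1
  awk -v m="$m" -v g="$g" 'BEGIN { exit !(m > 0.67 && g >= 24 && g <= 71) }'
}

# Toy input: a 10-nt sequence with 5 G/C bases and an assumed MFE of -4 kcal/mol.
if passes_structure_filter -4 GGGCCAAATT; then echo "kept"; else echo "rejected"; fi
```

In practice, the MFE value would come from the secondary structure prediction tool rather than being supplied by hand.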

This paper presents a new and fully automated miRNA pipeline, mirMachine, that 1) can identify known and novel miRNAs more accurately (for example, the pipeline now uses sRNA-seq-based novel miRNA predictions as well as homology-based miRNA identification) and 2) is fully automated and freely available. The outputs also include the genomic distributions of the predicted miRNAs. mirMachine was tested for both homology-based and sRNA-seq-based predictions in the wheat and Arabidopsis genomes. Although initially released as free software, UNAfold became commercial software in the last decade. With this upgrade, the secondary structure prediction tool was switched from UNAfold to RNAfold so that mirMachine can be freely available. Users can now execute a short submission script to run the fully automated mirMachine pipeline (examples are provided at https://github.com/hbusra/mirMachine.git).

Protocol

1. Software dependencies and installation

Install software dependencies from their home sites or using conda.
1. Download and install Perl, if it is not already installed, from its home site (https://www.perl.org/get.html).
  NOTE: The representative results were generated using Perl v5.32.0.
2. Download BLAST+, an alignment program, from its home site (https://www.ncbi.nlm.nih.gov/books/NBK279671/) as an executable or as source code.
  NOTE: The representative results were generated using BLAST 2.6.0+.
3. Install the precompiled package of RNAfold from https://www.tbi.univie.ac.at/RNA/.
4. Alternatively, install these programs using conda: i) conda install -c bioconda blast; ii) conda install -c bioconda viennarna.
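After installation, a quick check that the dependencies are reachable from the shell can save a failed run later. The sketch below is not part of mirMachine; the helper name is hypothetical, and the list of executables (perl, blastn and makeblastdb from BLAST+, RNAfold from the ViennaRNA package) is an assumption about what the pipeline calls.

```shell
# missing_tools TOOL...: print each tool name that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Assumed executables: perl, the BLAST+ programs blastn and makeblastdb,
# and RNAfold from the ViennaRNA package.
missing=$(missing_tools perl blastn makeblastdb RNAfold)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
else
  printf 'All dependencies found.\n'
fi
```

If conda was used, the environment in which the packages were installed must be active for the executables to be found.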

2. The mirMachine setup and testing

Download the latest version of the mirMachine scripts and the mirMachine submission script from GitHub (https://github.com/hbusra/mirMachine.git), and then add the directory containing the scripts to the PATH environment variable.
Use the test data provided in the GitHub repository to make sure that mirMachine and all its dependencies have been installed correctly.
Run mirMachine on the test data as shown below.
bash mirMachine_submit.sh -f iwgsc_v2_chr5A.fasta -i mature_high_conf_v22_1.fa.filtered.fasta -n 10
NOTE: Set the -n option to 10 as the test data contains only one chromosome of the wheat genome. By default, the -n option is set to 20.
Check the hairpins.tbl.out.tbl output files for the predicted mature miRNAs, their predicted precursors, and their locations on the chromosomes.
Check the log files for the program outputs and warnings.
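For the PATH step at the start of this section, one possible approach is sketched below. It assumes that the repository was cloned into $HOME/mirMachine and that the scripts sit at the top level of that directory; both the location and the helper name are hypothetical and should be adjusted to the actual setup.

```shell
# prepend_to_path DIR: put DIR at the front of PATH unless it is already listed.
prepend_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, nothing to do
    *) PATH="$1:$PATH" ;;
  esac
  export PATH
}

# Assumed layout: the repository was cloned with
#   git clone https://github.com/hbusra/mirMachine.git "$HOME/mirMachine"
# and the scripts are at the top level of that directory.
prepend_to_path "$HOME/mirMachine"
```

Adding the same lines to a shell startup file such as ~/.bashrc would make the change persist across sessions.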

3. Homology-based miRNA identification

Run the mirMachine using the bash script shown below:
bash mirMachine_submit.sh -f $genome_file -i $input_file -m $mismatches -n $number_of_hits
Check the predicted miRNAs. Find the output file named $input_file.results.tbl.hairpins.tbl.out.tbl for the predicted miRNAs. Find the output file named $input_file.results.tbl.hairpins.fsa for the pre-miRNA FASTA sequences. Find the output file named $input_file.results.tbl.hairpins.log for the hairpin log file.
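As a quick sanity check on the outputs, the number of predicted pre-miRNA sequences can be counted from the hairpin FASTA file, since every FASTA record starts with a ">" header line. The sketch below is a convenience helper, not part of mirMachine; the function name and the fallback file name input.fa are hypothetical.

```shell
# count_fasta_records FILE: print the number of FASTA header lines in FILE.
count_fasta_records() {
  [ -f "$1" ] || { printf 'missing file: %s\n' "$1" >&2; return 1; }
  grep -c '^>' "$1" || true      # grep -c prints 0 but exits 1 when nothing matches
}

# input_file is expected to hold the same value passed to -i;
# "input.fa" is only a placeholder default.
hairpins="${input_file:-input.fa}.results.tbl.hairpins.fsa"
if [ -f "$hairpins" ]; then
  printf '%s pre-miRNA sequences in %s\n' "$(count_fasta_records "$hairpins")" "$hairpins"
fi
```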

4. Novel miRNA identification

Preprocess the sRNA-seq FASTQ files into proper FASTA format. Trim adaptors if needed. Do not trim low-quality reads; instead, remove them. Remove reads containing N. Convert the FASTQ file into a FASTA file ($input_file).
Run the mirMachine using the bash script shown below.
bash mirMachine_submit.sh -f $genome_file -i $input_file -n $number_of_hits -sRNAseq -lmax $lmax -lmin $lmin -rpm $rpm
NOTE: $mismatches is set to 0 for sRNA-seq-based predictions.
Check the predicted miRNAs. Find the output file named $input_file.results.tbl.hairpins.tbl.out.tbl for the predicted miRNAs. Find the output file named $input_file.results.tbl.hairpins.fsa for the pre-miRNA FASTA sequences. Find the output file named $input_file.results.tbl.hairpins.log for the hairpin log file.
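The last two preprocessing operations at the start of this section (removing reads containing N and converting FASTQ to FASTA) can be sketched with awk as follows. This is a minimal illustration that assumes standard four-line FASTQ records; adaptor trimming and quality-based read removal are left to dedicated tools, and the function name and toy reads are hypothetical.

```shell
# fastq_to_fasta: read 4-line FASTQ records on stdin and write FASTA on stdout,
# skipping any read whose sequence contains N.
fastq_to_fasta() {
  awk '
    NR % 4 == 1 { id = substr($0, 2) }          # header line, drop the leading "@"
    NR % 4 == 2 && $0 != "" && $0 !~ /[Nn]/ {   # sequence line without N
      printf ">%s\n%s\n", id, $0
    }'
}

# Toy input: two reads, the second of which contains an N and is dropped.
printf '@read1\nACGTACGTACGTACGTACGTA\n+\nIIIIIIIIIIIIIIIIIIIII\n@read2\nACGTNCGTACGTACGTACGTA\n+\nIIIIIIIIIIIIIIIIIIIII\n' |
  fastq_to_fasta
```

With real data, the function would be used as a filter, for example by redirecting a trimmed FASTQ file to its input and writing the output to the file passed to the -i option.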

5. Advanced parameters

NOTE: The defaults are defined for all the parameters except for the genome file and the input miRNA file.

Set the -db option to a BLAST database to skip building the reference database within the pipeline.
Set the -m option to the number of mismatches allowed.
NOTE: By default, the -m option is set to 1 for homology-based predictions and 0 for sRNA-seq-based predictions.
Set the -n option to the number of hits to eliminate after alignment (default: 20). Change this based on the species.
Use the -long option to assess the secondary structures for the suspect list.
Use the -s option to activate the novel miRNA prediction based on sRNA-seq data.
Set the -lmax option to the maximum length of the sRNA-seq reads to include in the screening.
Set the -lmin option to the minimum length of the sRNA-seq reads to include in the screening.
Use the -rpm option to set the reads per million (RPM) threshold.
NOTE: For advanced parameters such as the length of pri-miRNAs/pre-miRNAs, experienced users are encouraged to modify the scripts to suit their research. Additionally, users who intend to skip some steps or prefer to use modified outputs can modify the submission script by adding # at the beginning of the lines to be skipped.
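To make the length and RPM options more concrete, the sketch below shows one way such a filter could work on a FASTA file of reads: identical sequences are collapsed, RPM is computed as the read count divided by the total number of reads and multiplied by 1,000,000, and only sequences within the length range and at or above the RPM threshold are reported. This is an illustration only, not the mirMachine implementation; it assumes one sequence line per record, uses all reads in the file as the denominator, treats the bounds as inclusive, and uses a hypothetical function name.

```shell
# filter_reads_by_rpm LMIN LMAX RPM: read FASTA on stdin and print
# sequence, read count, and RPM (tab-separated) for sequences that pass.
filter_reads_by_rpm() {
  awk -v lmin="$1" -v lmax="$2" -v rpm="$3" '
    !/^>/ && length($0) > 0 { count[toupper($0)]++; total++ }
    END {
      for (s in count) {
        r = count[s] / total * 1000000          # reads per million
        if (length(s) >= lmin && length(s) <= lmax && r >= rpm)
          printf "%s\t%d\t%.1f\n", s, count[s], r
      }
    }' | sort -k2,2nr -k1,1                     # most abundant sequences first
}

# Toy input: three copies of a 21-nt read and one 4-nt read.
printf '>a\nACGTACGTACGTACGTACGTA\n>b\nACGTACGTACGTACGTACGTA\n>c\nACGTACGTACGTACGTACGTA\n>d\nACGT\n' |
  filter_reads_by_rpm 19 24 10
```

The toy thresholds (19, 24, and 10) are arbitrary example values, not the pipeline defaults.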

Results

The miRNA pipeline, mirMachine, described above was applied to the test data for a quick evaluation of its performance. Only the high-confidence plant miRNAs deposited at miRBase v22.1 were screened against chromosome 5A of the IWGSC wheat RefSeq genome v2²⁴. mirMachine_find returned 312 hits for the nonredundant list of 189 high-confidence miRNAs with a maximum of 1 mismatch allowed (Table 1). mirMachine_fold classified 49 of them as putative miRNAs depending on ...

Discussion

Our miRNA pipeline, SUmir, has been used to identify many plant miRNAs over the last decade. Here, we developed a new, fully automated, and freely available miRNA identification and annotation pipeline, mirMachine. A number of miRNA identification pipelines, including but not limited to the previous pipeline, depended on the UNAfold software²¹, which was once freely available but became commercial software over time. This new and fully automated mirMachine is ...

Materials

Name	Company	Catalog Number	Comments
https://www.ncbi.nlm.nih.gov/books/NBK279671/			Blast+
https://github.com/hbusra/mirMachine.git			mirMachine submission script
https://www.perl.org/get.html			Perl
https://www.tbi.univie.ac.at/RNA/			RNAfold
Arabidopsis TAIR10
Triticum aestivum (wheat, IWGSC RefSeq v2)

References

1. Voinnet, O. Origin, biogenesis, and activity of plant microRNAs. Cell. 136 (4), 669-687 (2009).
2. Budak, H., Akpinar, B. A. Plant miRNAs: biogenesis, organization and origins. Functional & Integrative Genomics. 15 (5), 523-531 (2015).
3. Lee, R. C., Feinbaum, R. L., Ambros, V. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 75 (5), 843-854 (1993).
4. Zhang, L., et al. Exogenous plant MIR168a specifically targets mammalian LDLRAP1: evidence of cross-kingdom regulation by microRNA. Cell Research. 22 (1), 107-126 (2012).
5. Pang, K. C., Frith, M. C., Mattick, J. S. Rapid evolution of noncoding RNAs: Lack of conservation does not mean lack of function. Trends in Genetics. 22 (1), 1-5 (2006).
6. Guleria, P., Mahajan, M., Bhardwaj, J., Yadav, S. K. Plant small RNAs: biogenesis, mode of action and their roles in abiotic stresses. Genomics, Proteomics and Bioinformatics. 9 (6), 183-199 (2011).
7. Jones-Rhoades, M. W., Bartel, D. P., Bartel, B. MicroRNAs and their regulatory roles in plants. Annual Review of Plant Biology. 57, 19-53 (2006).
8. Singh, A., et al. Plant small RNAs: advancement in the understanding of biogenesis and role in plant development. Planta. 248 (3), 545-558 (2018).
9. Lucas, S. J., Budak, H. Sorting the wheat from the chaff: identifying miRNAs in genomic survey sequences of Triticum aestivum chromosome 1AL. PLoS One. 7 (7), 40859 (2012).
10. Li, S., Castillo-González, C., Yu, B., Zhang, X. The functions of plant small RNAs in development and in stress responses. Plant Journal. 90 (4), 654-670 (2017).
11. Lee, Y., Jeon, K., Lee, J. T., Kim, S., Kim, V. N. MicroRNA maturation: Stepwise processing and subcellular localization. EMBO Journal. 21 (17), 4663-4670 (2002).
12. Lee, Y., et al. MicroRNA genes are transcribed by RNA polymerase II. EMBO Journal. 23 (20), 4051-4060 (2004).
13. Bartel, D. P. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell. 116 (2), 281-297 (2004).
14. Lee, Y., et al. The nuclear RNase III Drosha initiates microRNA processing. Nature. 425 (6956), 415-419 (2003).
15. Meyers, B. C., et al. Criteria for annotation of plant microRNAs. Plant Cell. 20 (12), 3186-3190 (2008).
16. Sanei, M., Chen, X. Mechanisms of microRNA turnover. Current Opinion in Plant Biology. 27, 199-206 (2015).
17. Li, J., Yang, Z., Yu, B., Liu, J., Chen, X. Methylation protects miRNAs and siRNAs from a 3′-end uridylation activity in Arabidopsis. Current Biology. 15 (16), 1501-1507 (2005).
18. Rogers, K., Chen, X. Biogenesis, turnover, and mode of action of plant microRNAs. Plant Cell. 25 (7), 2383-2399 (2013).
19. Axtell, M. J., Meyers, B. C. Revisiting criteria for plant microRNA annotation in the Era of big data. Plant Cell. 30 (2), 272-284 (2018).
20. Camacho, C., et al. BLAST+: architecture and applications. BMC Bioinformatics. 10 (1), 421 (2009).
21. Markham, N. R. N., Zuker, M. UNAFold: Software for nucleic acid folding and hybridization. Methods in Molecular Biology. 453, 3-31 (2008).
22. Alptekin, B., Akpinar, B. A., Budak, H. A comprehensive prescription for plant miRNA identification. Frontiers in Plant Science. 7, 2058 (2017).
23. Zhang, B., Pan, X., Cannon, C. H., Cobb, G. P., Anderson, T. A. Conservation and divergence of plant microRNA genes. Plant Journal. 46 (2), 243-259 (2006).
24. Appels, R., et al. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science. 361 (6403), 7191 (2018).
25. Wang, Y., Kuang, Z., Li, L., Yang, X. A bioinformatics pipeline to accurately and efficiently analyze the microRNA transcriptomes in plants. Journal of Visualized Experiments: JoVE. (155), e59864 (2020).
26. Kozomara, A., Griffiths-Jones, S. MiRBase: Annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Research. 42, 68-73 (2014).
27. Lorenz, R., et al. ViennaRNA Package 2.0. Algorithms for Molecular Biology. 6 (1), 26 (2011).
28. Wicker, T., et al. Impact of transposable elements on genome structure and evolution in bread wheat. Genome Biology. 19 (1), 103 (2018).
29. Flavell, R. B., Bennett, M. D., Smith, J. B., Smith, D. B. Genome size and the proportion of repeated nucleotide sequence DNA in plants. Biochemical Genetics. 12 (4), 257-269 (1974).
30. Wicker, T., et al. The repetitive landscape of the 5100 Mbp barley genome. Mobile DNA. 8, 22 (2017).
31. Yang, Q., Ye, Q. A., Liu, Y. Mechanism of siRNA production from repetitive DNA. Genes and Development. 29 (5), 526-537 (2015).
32. Lam, J. K. W., Chow, M. Y. T., Zhang, Y., Leung, S. W. S. siRNA versus miRNA as therapeutics for gene silencing. Molecular Therapy. Nucleic Acids. 4 (9), 252 (2015).
33. Bartel, B. MicroRNAs directing siRNA biogenesis. Nature Structural and Molecular Biology. 12 (7), 569-571 (2005).
34. Meng, Y., Shao, C., Wang, H., Chen, M. Are all the miRBase-registered microRNAs true? A structure- and expression-based re-examination in plants. RNA Biology. 9 (3), 249-253 (2012).
35. Berezikov, E., et al. Evolutionary flux of canonical microRNAs and mirtrons in Drosophila. Nature Genetics. 42 (1), 6-9; author reply 9-10 (2010).
