MiRDeep2 identification of novel and known miRNAs

Hi
After I ran the MiRDeep2 Mapper and Quantifier, I got an error in the MiRDeep2 identification step, and I think the genome was the problem. What should I put in that box? I tried the hg38 human transcriptome, but it did not solve the problem.

Related Q&A: Fasta format genome file in mirdeep2 (custom genome)

Hi @amir

If you used the built-in human genome hg38 for upstream steps, then you must also use that same genome in later steps. It is not built-in for the last tool. And human is very large, which can lead to errors due to exceeding resources. Both are explained in the prior post.

In short, there is either a mismatch between the genome/transcriptome used at each step, or the job is exceeding resources.

The human genome hg38 is sourced from UCSC. The version at Galaxy Main can be found here: http://datacache.galaxyproject.org/.

I think that Galaxy EU uses the same genome build/version/format, but I’ve pinged the EU admins below so they can confirm.

There isn’t a tutorial that uses these tools. The help is on the tool forms. The usual rules about mismatched inputs do apply for these tools and all others. See prior post for the link.

@hexylena @bjoern.gruening – Any more advice?

@amir It would be useful if you posted back the error message. You could also send that in as a bug report. You’ll get a copy of the bug report that includes the stdout & stderr output (also available on the Job Details page – “i” icon in an expanded dataset).

Jobs that exceed resources can sometimes be due to input mismatches, and it looks as if that is your root issue (genome used for one step, transcriptome used for another), and that must be solved first.
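
One way to test for a genome/transcriptome mismatch outside of Galaxy is to compare the reference names used in the mapping output against the identifiers in the genome fasta. The following is a minimal Python sketch, not part of MiRDeep2 or its Galaxy wrapper. It assumes the mapping file is tab-separated with the reference name in the sixth column; that column position is an assumption to verify against your own file, and all function names are hypothetical.

```python
def fasta_ids(lines):
    """Return the set of sequence identifiers (first word after '>')."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def reference_names(mapping_lines, column=5):
    """Return the set of reference names in a tab-separated mapping file.

    ASSUMPTION: the reference name is in the sixth column (index 5).
    Pass a different index if your file is laid out differently.
    """
    names = set()
    for line in mapping_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column:
            names.add(fields[column])
    return names


def missing_references(mapping_lines, genome_lines, column=5):
    """Return names used in the mappings that are absent from the genome."""
    found = reference_names(mapping_lines, column)
    return sorted(found - fasta_ids(genome_lines))
```

An empty result from `missing_references` suggests the two inputs share the same identifiers; any names it returns point to reads mapped against a different reference than the one supplied later.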

Jobs that truly exceed resources, and not due to input problems, mean that you should set up your own Galaxy server where more resources can be allocated. Pretty sure you have those links but here they are again for others reading, or if you need a reminder:

Thanks!

I used hg38, which I downloaded myself, in the mapping step, so both genomes are the same now. But this error still shows up. I got a copy of all of it:

Dataset Error

An error occurred while running the tool toolshed.g2.bx.psu.edu/repos/rnateam/mirdeep2/rbc_mirdeep2/2.0.0 .

Error Details

Execution resulted in the following messages:

Fatal error: Exit code 1 ()

Fatal error: Matched on Error:

Tool generated the following standard error:

#Starting miRDeep2
/usr/local/tools/_conda/envs/__mirdeep2@2.0.0.8/bin/miRDeep2.pl /data/dnb02/galaxy_db/files/010/968/dataset_10968523.dat /data/dnb02/galaxy_db/files/010/954/dataset_10954785.dat /data/dnb02/galaxy_db/files/010/968/dataset_10968524.dat /data/dnb02/galaxy_db/files/010/722/dataset_10722770.dat none /data/dnb02/galaxy_db/files/010/730/dataset_10730067.dat -t hsa -g 50000 -b 0

miRDeep2 started at 19:8:54

mkdir mirdeep_runs/run_06_09_2019_t_19_08_54

Error: miRNA reference this species file /data/dnb02/galaxy_db/files/010/722/dataset_10722770.dat has not allowed whitespaces in its first identifier

Did you run your genome through the NormalizeFasta tool to remove description line content? Any fasta dataset used as a custom genome/transcriptome/exome must be formatted correctly before any other step is done, including mapping. The exact same reference fasta must be used for all steps in an analysis project.

If not done, expect tool errors/content issues and the need to fix the formatting first then to start completely over. The errors will not always be easy to interpret and some jobs may be putatively successful but actually contain scientific content problems (which are even more difficult to detect/interpret).
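
For readers working outside of Galaxy, the clean-up described above (removing description content from fasta title lines) can be sketched in a few lines of Python. This is an illustration of the idea only, not the NormalizeFasta implementation, and the function name is hypothetical.

```python
def truncate_headers(lines):
    """Yield fasta lines with each '>' title cut at the first whitespace.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            yield ">" + name + "\n"
        else:
            yield line
```

For example, a miRBase-style title line such as `>hsa-let-7a-5p MIMAT0000062 Homo sapiens let-7a-5p` would be reduced to `>hsa-let-7a-5p`. Check afterwards that the truncated identifiers are still unique.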

The other potential issue is the presence of unmapped reads in your inputs. Fastq reads can contain spaces, and those are included in unmapped reads in BAM inputs. Remove unmapped reads, if that is the problem, with a tool like BAMtools >> Filter BAM. MACS2 also has an issue with space in unmapped sequence lines (when in SAM format, but not BAM), and I’m not 100% sure about MiRDeep2 requirements, but that is worth testing out. If/how that works out would be welcomed in a reply.
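
To show what "remove unmapped reads" means in practice: in the SAM format, the second column is a bitwise FLAG, and bit 0x4 marks a read as unmapped. The sketch below filters SAM text on that bit. It is a minimal illustration with a hypothetical function name; it works on SAM text rather than binary BAM, and inside Galaxy the BAM filtering tool mentioned above is the more direct route.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def drop_unmapped(sam_lines):
    """Yield SAM header lines and alignment lines for mapped reads only."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are kept as they are.
            yield line
            continue
        fields = line.split("\t")
        if len(fields) > 1 and fields[1].isdigit():
            if int(fields[1]) & UNMAPPED:
                continue  # skip unmapped reads
        yield line
```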

And finally, check any labels/headers in your inputs or entered on the tool form. Do these contain spaces? Try removing them and see if that works.

I did normalize the genome, but the same error still came up, so that isn't the solution. About the spaces in the fastq: can you explain exactly how to remove them?

@amir Guessing about what is going wrong isn’t working well.

Please create a share link to the history and send it to me in a direct message here (unless you are OK with sharing it publicly). Note which dataset number/s are from the error that presents with the space issue message and be sure to leave all inputs undeleted.

This should be the original history where the job was run so I can match up the inputs to the form options based on the “i” Job Details page information.

It is fine for fastq data to have spaces. These are passed on to unmapped hits in BAM datasets, which can cause a problem with some tools (usual solution: remove the unmapped data lines) but that seems unlikely to be the problem.

When you generate the share link, please also check the box to share the history’s “objects” so I can review the entire input datasets for format issues if needed.

That should speed this troubleshooting up. There is either a usage/input problem or a tool bug and I can’t tell yet, as I am not as familiar with these wrappers as I am with most others. But I should be able to figure it out and help once I take a look at the actual data/jobs.

OK, I'm sending it to you now in a direct message.

The custom genome has two problems (dataset 1004):

  1. The fasta still contains description line content. You need to set the NormalizeFasta option “Truncate sequence names at first whitespace” to “Yes” to remove these. This is what is causing the immediate error.

  2. The fasta is not a genome but a set of (exome?) fragments. Close to 200k reads. This will cause the tool to fail for memory reasons, or if you do manage to get hits they will be sparse due to the short, unassembled lengths. Hg38 is sourced from UCSC https://genome.ucsc.edu from their Downloads area. Don’t use the Table Browser, it is too much data to extract. Or, you can get a copy here: http://datacache.galaxyproject.org/
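
A quick way to tell an assembled genome from a set of fragments is to count the records and look at their lengths: an assembly has relatively few, very long sequences, while a fragment set has a great many short ones. The following is a minimal Python sketch with a hypothetical function name; it only summarizes the file and does not validate it.

```python
def fasta_summary(lines):
    """Return the record count, total bases, and longest record length."""
    lengths = []
    for line in lines:
        if line.startswith(">"):
            lengths.append(0)  # start a new record
        elif lengths:
            lengths[-1] += len(line.strip())
    return {
        "sequences": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
    }
```

Opening the fasta with `open(path)` and passing the file object to `fasta_summary` avoids loading the whole genome into memory, since the function reads one line at a time.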

Hope that helps get you past this step.

Just one more question. If I want to download from http://datacache.galaxyproject.org/, which one should I download? Should I go to the download files and get 20140313_seq_hg38/, or go to hg38Full and download those files? I'm tired of downloading the wrong genome.

The hg38 genome fasta is here: http://datacache.galaxyproject.org/indexes/hg38/seq/

Choose the file: hg38.fa

More about the directory structure is in this FAQ: https://galaxyproject.org/admin/reference-data-repo/

Unfortunately it came up with FATAL ERROR again :cold_sweat:
I really don't know what could possibly be wrong this time.

Is the new error in the same history again?

If in a new history, please share a link to it in our direct message thread.

Make sure that all inputs are in that history and that they are undeleted. Note the new error dataset number please so that I am looking at the correct, new problem.

Hg38 is very large and I suspect there is a memory issue but let’s confirm that. I didn’t go through all of your prior analysis/inputs first round - once the custom genome was discovered to be incorrect, I stopped since that needed to be addressed first.

We can bring in the EU team if necessary. I am not an admin at that server so have a somewhat limited view of all the details. But let’s see if we can figure it out first this way.

OK, I am sending it again.
I ran NormalizeFasta and got the genome from the location you pointed to.
Thank you so much for giving your time to my problem again.

I wrote back directly in our message and pointed out the problem with more details. Punch line: description line content in other fasta inputs needs to be removed.

Clean them all up so that the fasta “>” title lines only contain “one word” – the sequence identifier.

Tools are picky – in almost every case fasta data cannot contain description content, and NormalizeFasta can be used to fix up the formatting.
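
Before rerunning, it can help to confirm which fasta inputs still carry description content. The sketch below lists every title line that is not a single word. It is a hypothetical helper written for illustration, not a check taken from MiRDeep2 itself.

```python
def headers_with_description(lines):
    """Return (line_number, title_line) for each title that is not one word.

    A title is flagged if it is empty or contains any whitespace after
    the '>' character (trailing whitespace is ignored).
    """
    flagged = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue
        title = line[1:].rstrip()
        if not title or any(ch.isspace() for ch in title):
            flagged.append((number, line.rstrip("\n")))
    return flagged
```

Running this over the genome, mature, and precursor fasta files in turn should return an empty list for each one once the formatting is fixed.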

So you're saying I should run NormalizeFasta on all my fasta inputs, including the genome, mature, and precursor files? And my own data, converted to FASTA after mapping, doesn't need to be normalized, right?

I’m not sure, but it certainly couldn’t hurt to remove the description line content from all fasta inputs. That applies to any tool/workflow when working in Galaxy, and usually on the command line as well.

The error you are getting is from the underlying 3rd party tool, not some error trapping/format checks from the Galaxy wrapper around it. Meaning, you would hit these same format issues no matter where/how you use this tool.

So I finally solved the whitespace problem and converted my mature and hairpin miRNA to DNA sequence, but I still see this error. What is this about?

#Starting miRDeep2
/usr/local/tools/_conda/envs/__mirdeep2@2.0.0.8/bin/miRDeep2.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723284.dat /data/dnb02/galaxy_db/files/011/710/dataset_11710212.dat /data/dnb02/galaxy_db/files/011/723/dataset_11723285.dat /data/dnb02/galaxy_db/files/011/714/dataset_11714382.dat none /data/dnb02/galaxy_db/files/011/723/dataset_11723258.dat -t hsa -g 50000 -b 0

miRDeep2 started at 12:31:46

mkdir mirdeep_runs/run_28_10_2019_t_12_31_46

#testing input files
started: 12:32:23
sanity_check_mature_ref.pl /data/dnb02/galaxy_db/files/011/714/dataset_11714382.dat

ended: 12:32:23
total:0h:0m:0s

sanity_check_reads_ready_file.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723284.dat

started: 12:32:23

ended: 12:32:38
total:0h:0m:15s

started: 12:32:38
sanity_check_genome.pl /data/dnb02/galaxy_db/files/011/710/dataset_11710212.dat

ended: 12:33:24
total:0h:0m:46s

started: 12:33:24
sanity_check_mapping_file.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723285.dat

ended: 12:33:29
total:0h:0m:5s

started: 12:33:29
sanity_check_mature_ref.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723258.dat

ended: 12:33:29
total:0h:0m:0s

started: 12:33:29
Quantitation of expressed miRNAs in data

quantifier.pl -p /data/dnb02/galaxy_db/files/011/723/dataset_11723258.dat -m /data/dnb02/galaxy_db/files/011/714/dataset_11714382.dat -r /data/dnb02/galaxy_db/files/011/723/dataset_11723284.dat -t hsa -y 28_10_2019_t_12_31_46 -k
getting samples and corresponding read numbers

Converting input files
building bowtie index
mapping mature sequences against index
mapping read sequences against index
Mapping statistics

#desc total mapped unmapped %mapped %unmapped
total: 5366795 5366795 0.000 1.000
seq: 5366795 5366795 0.000 1.000
analyzing data
Expressed miRNAs are written to expression_analyses/expression_analyses_28_10_2019_t_12_31_46/miRNA_expressed.csv
not expressed miRNAs are written to expression_analyses/expression_analyses_28_10_2019_t_12_31_46/miRNA_not_expressed.csv

Creating miRBase.mrd file

make_html2.pl -q expression_analyses/expression_analyses_28_10_2019_t_12_31_46/miRBase.mrd -k dataset_11714382.dat -t human -y 28_10_2019_t_12_31_46 -o -i expression_analyses/expression_analyses_28_10_2019_t_12_31_46/dataset_11714382.dat_mapped.arf -m hsa -M miRNAs_expressed_all_samples_28_10_2019_t_12_31_46.csv
miRNAs_expressed_all_samples_28_10_2019_t_12_31_46.csv file with miRNA expression values
parsing miRBase.mrd file finished
creating PDF files

ended: 12:35:27
total:0h:1m:58s

started: 12:35:27
rna2dna.pl /data/dnb02/galaxy_db/files/011/714/dataset_11714382.dat > mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/dataset_11714382.dat

rna2dna.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723258.dat > mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/dataset_11723258.dat

ended: 12:35:27
total:0h:0m:0s

#parsing genome mappings
parse_mappings.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723285.dat -a 0 -b 18 -c 25 -i 5 > mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/dataset_11723285.dat_parsed.arf

started: 12:35:27

ended: 12:35:27
total:0h:0m:0s

#excising precursors
started: 12:35:27
excise_precursors_iterative_final.pl /data/dnb02/galaxy_db/files/011/710/dataset_11710212.dat mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/dataset_11723285.dat_parsed.arf mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/precursors.fa mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/precursors.coords 50000
1 0

ended: 12:35:44
total:0h:0m:17s

No precursors excised

https://usegalaxy.eu/u/amir_sabbaghian/h/mir

Hi @amir

What to try to resolve the error:

The “Precursor sequences” input fasta, dataset 244 in your shared history, has “RNA” encoding (U instead of T).

Try changing this input to have “DNA” encoding and rerun to test if that resolves the issue. Use the same method as before (tool: RNA/DNA converter (Galaxy Version 1.0.2) – the way you used it to produce dataset 245 earlier).

I see that you attempted that in dataset 247 (using input dataset 243). Try that again, but use the unwrapped version of the same data in dataset 244 (produced with the tool: FASTA Width formatter (Galaxy Version 1.0.1)).

The RNA/DNA converter tool does not accept “wrapped” fasta format. Tools from the FASTA-toolkit are great for data manipulation but like many tools, can be a bit picky – but almost always have very informative error messages. The error for dataset 247 is very specific and it looks like you acted on the advice, but didn’t follow through with the next step (actually converting to DNA encoding before running MiRDeep2).

fasta_nucleotide_changer: Invalid input: This looks like a multi-line FASTA file.
Line 3 contains a nucleotides string instead of a '>' prefix.
FASTX-Toolkit can't handle multi-line FASTA files.
Please use the FASTA-Formatter tool to convert this file into a single-line FASTA.
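
What the error is asking for is an "unwrapped" fasta, where each record has exactly one title line followed by one sequence line. The conversion can be sketched in Python as follows. This is an illustration of the format change with a hypothetical function name, not the FASTA-Formatter implementation.

```python
def unwrap_fasta(lines):
    """Yield single-line fasta: one title line, then one sequence line."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if chunks:
                # Flush the sequence collected for the previous record.
                yield "".join(chunks) + "\n"
                chunks = []
            yield line + "\n"
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks) + "\n"
```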

Technically, the source tool can “work” with RNA encoding, but evidently not the Galaxy wrapped version from what I can tell. Or maybe all inputs need to be either RNA or DNA encoding – I’m not really sure. Perhaps all inputs except the reference genome/fastq (those should have DNA encoding for most tools, regardless of other input requirements). Synching all the data input encoding, to be the same (DNA), seems like a logical way forward, and is how the test data for this tool is formatted.
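
The encoding change itself is simple: every U in the sequence lines becomes a T, while title lines are left alone. A minimal Python sketch of that idea is below, together with a check for whether a file still contains RNA bases. The function names are hypothetical and this is not the RNA/DNA converter's own code.

```python
RNA_TO_DNA = str.maketrans("Uu", "Tt")


def rna_to_dna(lines):
    """Yield fasta lines with U replaced by T in sequence lines only."""
    for line in lines:
        if line.startswith(">"):
            yield line  # never edit identifiers
        else:
            yield line.translate(RNA_TO_DNA)


def has_rna_bases(lines):
    """Return True if any sequence line contains a U or u."""
    return any(
        "U" in line or "u" in line
        for line in lines
        if not line.startswith(">")
    )
```

Because only U is translated, any other non-standard characters in the sequences stay as they are and would still need to be handled separately.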

It looks like you are getting close to having this work, which is great!! Galaxy helps to make tools easier to use but sometimes a bit of data reformatting/testing out troubleshooting solutions is needed. The same is true when running tools on the command line. Each tool author has a set of formatting rules in mind, and clearly these assumptions can vary widely. Navigating these little hurdles is a skill set you’ve been building up at a very fast pace – so feel good about it! Everyone goes through some process similar to yours when trying out new tools/protocols for the first time… and the next time… and the next time… :slight_smile:

Let us know if this works out – I didn’t notice anything else obvious that might be problematic but let’s see what happens with an actual rerun with fixed-up inputs.

Thanks!

cc @bjoern.gruening – Am I missing anything or do you have more advice?

@jennaj Thanks for your help on this. I got rid of the wrong nucleotides with the Text transformation tool, because this precursor miRNA file has many non-standard nucleotides. After that I tried the RNA/DNA converter, but the error message said I should convert the sequences to single-line format, and I did so (as you mentioned). But now I don't know how to get it back to multi-line FASTA format. I think this may be the problem, because MiRDeep2 may not recognize sequences in single-line format as precursor miRNAs. I thought that if I could somehow return the file to the original format of the precursor miRNA file, now that it is converted to DNA, it might help solve this problem. So how can I change the format back to what the precursor miRNA file originally had?

Hi @amir

I’m not sure if all the fasta inputs need to be wrapped or not for this tool. You’ll need to test both formats and see what works.

The reference genome should definitely be wrapped (any tool).
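
For testing both formats, going back from single-line to wrapped fasta is a matter of breaking each sequence line at a fixed width. The sketch below assumes its input is already single-line fasta; the default width of 60 is an arbitrary choice for illustration, and the function name is hypothetical.

```python
def wrap_fasta(lines, width=60):
    """Yield fasta lines with each sequence line broken at `width` bases.

    ASSUMPTION: the input is single-line fasta (one sequence line per
    record). The width of 60 is only an illustrative default.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            yield line + "\n"
        else:
            for start in range(0, len(line), width):
                yield line[start:start + width] + "\n"
```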