MiRDeep2 identification of novel and known miRNAs

Hi
After I ran the MiRDeep2 Mapper and Quantifier, the MiRDeep2 identifier step gave me an error, which I think was caused by the genome. What should I put in that box? I tried the hg38 human transcriptome, but it did not solve the problem.

Related Q&A: Fasta format genome file in mirdeep2 (custom genome)

Hi @amir

If you used the built-in human genome hg38 for the upstream steps, then you must also use that same genome in the later steps. It is not available as a built-in for the last tool. And the human genome is very large, which can lead to errors from exceeding resources. Both issues are explained in the prior post.

In short, there is either a mismatch between the genome/transcriptome used at each step, or the job is exceeding resources.

The human genome hg38 is sourced from UCSC. The version at Galaxy Main can be found here: http://datacache.galaxyproject.org/.

I think that Galaxy EU uses the same genome build/version/format, but I’ve pinged the EU admins below so they can confirm.

There isn’t a tutorial that covers these tools; the help is on the tool forms. The usual rules about mismatched inputs apply to these tools as to all others. See the prior post for the link.

@hxr @bjoern.gruening – Any more advice?

@amir It would be useful if you posted back the error message. You could also send that in as a bug report. You’ll get a copy of the bug report that includes the stdout & stderr output (also available on the Job Details page – “i” icon in an expanded dataset).

Jobs that exceed resources can sometimes be due to input mismatches. It looks as if that is your root issue (a genome used for one step, a transcriptome used for another), and that must be solved first.

Jobs that truly exceed resources, and not because of input problems, mean that you should set up your own Galaxy server where more resources can be allocated. Pretty sure you already have those links, but here they are again for others reading, or if you need a reminder:

Thanks!

I used hg38, which I downloaded myself, in the mapping step. So both genomes are the same now, but this error still shows up. I got a copy of all of it:

Dataset Error

An error occurred while running the tool toolshed.g2.bx.psu.edu/repos/rnateam/mirdeep2/rbc_mirdeep2/2.0.0 .

Error Details

Execution resulted in the following messages:

Fatal error: Exit code 1 ()

Fatal error: Matched on Error:

Tool generated the following standard error:

#Starting miRDeep2
/usr/local/tools/_conda/envs/__mirdeep2@2.0.0.8/bin/miRDeep2.pl /data/dnb02/galaxy_db/files/010/968/dataset_10968523.dat /data/dnb02/galaxy_db/files/010/954/dataset_10954785.dat /data/dnb02/galaxy_db/files/010/968/dataset_10968524.dat /data/dnb02/galaxy_db/files/010/722/dataset_10722770.dat none /data/dnb02/galaxy_db/files/010/730/dataset_10730067.dat -t hsa -g 50000 -b 0
miRDeep2 started at 19:8:54
mkdir mirdeep_runs/run_06_09_2019_t_19_08_54
Error: miRNA reference this species file /data/dnb02/galaxy_db/files/010/722/dataset_10722770.dat has not allowed whitespaces in its first identifier

Did you run your genome through the NormalizeFasta tool to remove description line content? Any fasta dataset used as a custom genome/transcriptome/exome must be formatted correctly before any other step is done, including mapping. The same exact reference fasta must be used for all steps in an analysis project.

If that was not done, expect tool errors/content issues and the need to fix the formatting first, then to start completely over. The errors will not always be easy to interpret, and some jobs may appear successful but actually contain scientific content problems (which are even more difficult to detect/interpret).
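For anyone who wants to see what that cleanup amounts to outside of Galaxy, here is a minimal Python sketch (this is not the NormalizeFasta tool itself, and the file paths are placeholders) that truncates every fasta title line at the first whitespace, so identifiers contain no spaces:

```python
# truncate_fasta_headers.py - minimal sketch of the "truncate at first whitespace" cleanup
import sys

def truncate_headers(in_path, out_path):
    """Keep only the first word of each '>' title line; copy sequence lines unchanged."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                # ">chr1 AC:CM000663.2 gi:568336023 ..." becomes ">chr1"
                fout.write(line.split()[0] + "\n")
            else:
                fout.write(line)

if __name__ == "__main__":
    # usage: python truncate_fasta_headers.py input.fa cleaned.fa
    truncate_headers(sys.argv[1], sys.argv[2])
```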

The other potential issue is the presence of unmapped reads in your inputs. Fastq reads can contain spaces, and those are carried over into the unmapped reads in BAM inputs. Remove unmapped reads, if that is the problem, with a tool like BAMtools >> Filter. MACS2 also has an issue with spaces in unmapped sequence lines (when in SAM format, but not BAM), and I’m not 100% sure about the MiRDeep2 requirements, but that is worth testing out. A reply about if/how that works out would be welcome.
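Within Galaxy, the BAMtools Filter tool is the simplest route. For reference, the same filtering can be sketched outside of Galaxy with pysam (assuming pysam is installed; the file names below are placeholders), and it only applies if your inputs are BAM:

```python
# filter_unmapped.py - rough sketch: drop unmapped reads from a BAM,
# similar in spirit to BAMtools >> Filter with a mapped-only condition
import pysam

with pysam.AlignmentFile("input.bam", "rb") as bam_in, \
     pysam.AlignmentFile("mapped_only.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in.fetch(until_eof=True):
        if not read.is_unmapped:
            bam_out.write(read)
```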

And finally, check any labels/headers in your inputs or entered on the tool form. Do these contain spaces? Try removing them and see if that works.

I did normalize the genome, and it still came up with that error, so that isn't the solution. About the spaces in fastq: can you explain exactly how to remove them?

@amir Guessing about what is going wrong isn’t working well.

Please create a share link to the history and send it to me in a direct message here (unless you are OK with sharing it publicly). Note which dataset number(s) are from the error with the whitespace message, and be sure to leave all inputs undeleted.

This should be the original history where the job was run so I can match up the inputs to the form options based on the “i” Job Details page information.

It is fine for fastq data to have spaces. These are passed on to unmapped hits in BAM datasets, which can cause a problem with some tools (usual solution: remove the unmapped data lines) but that seems unlikely to be the problem.

When you generate the share link, please also check the box to share the history’s “objects” so I can review the entire input datasets for format issues if needed.

That should speed this troubleshooting up. There is either a usage/input problem or a tool bug, and I can’t tell yet, as I am not as familiar with these wrappers as with most others. But I should be able to figure it out and help once I take a look at the actual data/jobs.

OK, I'm sending it now in a direct message.

The custom genome has two problems (dataset 1004):

  1. The fasta still contains description line content. You need to set the NormalizeFasta option “Truncate sequence names at first whitespace” to “Yes” to remove these. This is what is causing the immediate error.

  2. The fasta is not a genome but a set of (exome?) fragments, close to 200k sequences. This will cause the tool to fail for memory reasons, or if you do manage to get hits they will be sparse due to the short, unassembled lengths. Hg38 is sourced from UCSC (https://genome.ucsc.edu), from their Downloads area. Don’t use the Table Browser; it is too much data to extract. Or, you can get a copy here: http://datacache.galaxyproject.org/. A quick way to sanity-check a custom genome fasta for both problems is sketched below.
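For anyone who wants to check a fasta before using it as a custom genome, here is a small Python sketch (the file name is a placeholder) that counts the sequences and flags title lines that still carry description content. A whole hg38 genome should come out to a few hundred sequences (chromosomes plus scaffolds), not ~200k fragments:

```python
# check_custom_genome.py - quick sanity check of a fasta before using it as a custom genome (sketch)
import sys

def check_fasta(path):
    n_seqs = 0
    headers_with_descriptions = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                n_seqs += 1
                # anything after the identifier will trip tools like miRDeep2
                if len(line.split()) > 1:
                    headers_with_descriptions += 1
    print(f"sequences: {n_seqs}")
    print(f"title lines with extra description content: {headers_with_descriptions}")

if __name__ == "__main__":
    check_fasta(sys.argv[1])  # e.g. python check_custom_genome.py hg38.fa
```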

Hope that helps get you past this step.

Just one more question: if I want to download from here http://datacache.galaxyproject.org/, which one should I download? Should I go into the download files and get the 20140313_seq_hg38/ one, or the hg38Full, and download those files? I'm tired of downloading the wrong genome.

The hg38 genome fasta is here: http://datacache.galaxyproject.org/indexes/hg38/seq/

Choose the file: hg38.fa

More about the directory structure is in this FAQ: https://galaxyproject.org/admin/reference-data-repo/

Unfortunately it came up with a FATAL ERROR again :cold_sweat:
I really don't know what could possibly be wrong this time.

Is the new error in the same history again?

If in a new history, please share a link to it in our direct message thread.

Make sure that all inputs are in that history and that they are undeleted. Note the new error dataset number please so that I am looking at the correct, new problem.

Hg38 is very large and I suspect there is a memory issue, but let’s confirm that. I didn’t go through all of your prior analysis/inputs the first round. Once the custom genome was discovered to be incorrect, I stopped, since that needed to be addressed first.

We can bring in the EU team if necessary. I am not an admin at that server, so I have a somewhat limited view of all the details. But let’s see if we can figure it out first this way.

OK, I am sending it again.
I did NormalizeFasta and got the genome from where you directed.
Thank you so much for giving your time to my problem again.

I wrote back directly in our message and pointed out the problem in more detail. Punch line: the description line content in the other fasta inputs needs to be removed too.

Clean them all up so that the fasta “>” title lines only contain “one word” – the sequence identifier.

Tools are picky: in almost every case fasta data cannot contain description content, and NormalizeFasta can be used to fix up the formatting.

So you're saying I should run NormalizeFasta on all my fasta inputs, including the genome, mature, and precursor files? What about my own data, converted to FASTA after mapping? It doesn't need normalizing, right?

I’m not sure, but it certainly couldn’t hurt to remove the description line content from all fasta inputs. That applies to any tool/workflow when working in Galaxy, and usually on the command line as well.
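If it is easier to handle outside the tool forms, the same header truncation sketched earlier in this thread can be applied to every fasta input in one pass. This reuses the hypothetical truncate_headers() from that sketch, and the file names are placeholders:

```python
# clean every fasta input with the earlier (hypothetical) truncate_headers() sketch
from truncate_fasta_headers import truncate_headers

for name in ("hg38.fa", "mature_hsa.fa", "precursor_hsa.fa"):  # placeholder file names
    truncate_headers(name, name.replace(".fa", ".clean.fa"))
```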

The error you are getting is from the underlying 3rd party tool, not some error trapping/format checks from the Galaxy wrapper around it. Meaning, you would hit these same format issues no matter where/how you use this tool.