MiRDeep2 identification of novel and known miRNAs

Did you run your genome through the NormalizeFasta tool to remove description line content? Any fasta dataset used as a custom genome/transcriptome/exome must be formatted correctly before any other step is done, including mapping. The exact same reference fasta must be used for all steps in an analysis project.

If not done, expect tool errors/content issues and the need to fix the formatting first and then start completely over. The errors will not always be easy to interpret, and some jobs may appear successful but actually contain scientific content problems (which are even more difficult to detect/interpret).
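If you ever need to do that clean-up outside Galaxy, a one-liner along these lines mirrors what the NormalizeFasta “Truncate sequence names at first whitespace” option does (just a sketch – file names are placeholders):

    # Keep only the first word of each ">" title line; sequence lines pass through unchanged
    awk '/^>/ {print $1; next} {print}' my_genome.fa > my_genome.clean.fa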

The other potential issue is the presence of unmapped reads in your inputs. Fastq reads can contain spaces, and those spaces are carried over into the unmapped reads in BAM inputs. If that is the problem, remove unmapped reads with a tool like BAMtools >> Filter BAM. MACS2 also has an issue with spaces in unmapped sequence lines (when in SAM format, but not BAM), and I’m not 100% sure about MiRDeep2’s requirements, but that is worth testing out. A note in a reply about if/how that works out would be welcome.
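If you would rather do that filtering line-command, a samtools sketch that does the same thing as filtering on the “mapped” flag (file names are placeholders):

    # Drop unmapped reads (SAM flag 0x4) and keep only mapped reads in the output BAM
    samtools view -b -F 4 input.bam > mapped_only.bam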

And finally, check any labels/headers in your inputs or entered on the tool form. Do these contain spaces? Try removing them and see if that works.

FAQs:

I did normalize the genome and it still came back with that error, so that isn’t the solution. About the spaces in the fastq – can you explain exactly how to remove them?

@amir Guessing about what is going wrong isn’t working well.

Please create a share link to the history and send it to me in a direct message here (unless you are OK with sharing it publicly). Note which dataset number/s are from the error that presents with the space issue message, and be sure to leave all inputs undeleted.

This should be the original history where the job was run so I can match up the inputs to the form options based on the “i” Job Details page information.

It is fine for fastq data to have spaces. These are passed on to unmapped hits in BAM datasets, which can cause a problem with some tools (usual solution: remove the unmapped data lines) but that seems unlikely to be the problem.

When you generate the share link, please also check the box to share the history’s “objects” so I can review the entire input datasets for format issues if needed.

That should speed this troubleshooting up. There is either a usage/input problem or a tool bug, and I can’t tell yet, as I am not as familiar with these wrappers as with most others. But I should be able to figure it out and help once I take a look at the actual data/jobs.

OK, I’m sending it to you now in a direct message.


The custom genome has two problems (dataset 1004):

  1. The fasta still contains description line content. You need to set the NormalizeFasta option “Truncate sequence names at first whitespace” to “Yes” to remove these. This is what is causing the immediate error.

  2. The fasta is not a genome but a set of (exome?) fragments – close to 200k sequences. This will cause the tool to fail for memory reasons, or if you do manage to get hits they will be sparse due to the short, unassembled lengths. Hg38 is sourced from UCSC (https://genome.ucsc.edu), in their Downloads area. Don’t use the Table Browser – that is too much data to extract. Or, you can get a copy here: http://datacache.galaxyproject.org/

Hope that helps get you past this step.


Just one more question: if I want to download from here http://datacache.galaxyproject.org/, which one should I download? Should I go to the download files and get 20140313_seq_hg38/, or hg38Full, and download those files? I’m tired of downloading the wrong genome.


The hg38 genome fasta is here: http://datacache.galaxyproject.org/indexes/hg38/seq/

Choose the file: hg38.fa
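If you want to pull it down directly instead of pasting the URL into the Galaxy Upload tool (just an example):

    # Download the hg38 genome fasta from the Galaxy data cache
    wget http://datacache.galaxyproject.org/indexes/hg38/seq/hg38.fa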

More about the directory structure is in this FAQ: https://galaxyproject.org/admin/reference-data-repo/


Unfortunately it came up with a FATAL ERROR again :cold_sweat:
I really don’t know what could possibly be wrong this time.


Is the new error in the same history again?

If in a new history, please share a link to it in our direct message thread.

Make sure that all inputs are in that history and that they are undeleted. Note the new error dataset number please so that I am looking at the correct, new problem.

Hg38 is very large and I suspect there is a memory issue, but let’s confirm that. I didn’t go through all of your prior analysis/inputs the first round – once the custom genome was discovered to be incorrect, I stopped, since that needed to be addressed first.

We can bring in the EU team if necessary. I am not an admin at that server, so I have a somewhat limited view of all the details. But let’s see if we can figure it out first this way.


OK, I am sending it again.
I ran NormalizeFasta and got the genome from the location you pointed to.
Thank you so much for giving your time to my problem again.


I wrote back directly in our message and pointed out the problem in more detail. Punch line: description line content in other fasta inputs needs to be removed.

Clean them all up so that the fasta “>” title lines only contain “one word” – the sequence identifier.

Tools are picky – in almost every case fasta data cannot contain description content, and NormalizeFasta can be used to fix up the formatting.

So you’re saying I should run NormalizeFasta on all my fasta inputs… including the genome, mature, and precursor?? And my own data that was changed to FASTA after mapping – that doesn’t need normalizing… right?


I’m not sure, but it certainly couldn’t hurt to remove the description line content from all fasta inputs. That goes for any tool/workflow when working in Galaxy, and usually line-command as well.

The error you are getting is from the underlying 3rd party tool, not some error trapping/format checks from the Galaxy wrapper around it. Meaning, you would hit these same format issues no matter where/how you use this tool.

So I finally solved the whitespace problem and converted my mature and hairpin miRNA to DNA sequences, but I still got this error. What is this about?

#Starting miRDeep2
/usr/local/tools/_conda/envs/__mirdeep2@2.0.0.8/bin/miRDeep2.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723284.dat /data/dnb02/galaxy_db/files/011/710/dataset_11710212.dat /data/dnb02/galaxy_db/files/011/723/dataset_11723285.dat /data/dnb02/galaxy_db/files/011/714/dataset_11714382.dat none /data/dnb02/galaxy_db/files/011/723/dataset_11723258.dat -t hsa -g 50000 -b 0

miRDeep2 started at 12:31:46

mkdir mirdeep_runs/run_28_10_2019_t_12_31_46

#testing input files
started: 12:32:23
sanity_check_mature_ref.pl /data/dnb02/galaxy_db/files/011/714/dataset_11714382.dat

ended: 12:32:23
total:0h:0m:0s

sanity_check_reads_ready_file.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723284.dat

started: 12:32:23

ended: 12:32:38
total:0h:0m:15s

started: 12:32:38
sanity_check_genome.pl /data/dnb02/galaxy_db/files/011/710/dataset_11710212.dat

ended: 12:33:24
total:0h:0m:46s

started: 12:33:24
sanity_check_mapping_file.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723285.dat

ended: 12:33:29
total:0h:0m:5s

started: 12:33:29
sanity_check_mature_ref.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723258.dat

ended: 12:33:29
total:0h:0m:0s

started: 12:33:29
Quantitation of expressed miRNAs in data

quantifier.pl -p /data/dnb02/galaxy_db/files/011/723/dataset_11723258.dat -m /data/dnb02/galaxy_db/files/011/714/dataset_11714382.dat -r /data/dnb02/galaxy_db/files/011/723/dataset_11723284.dat -t hsa -y 28_10_2019_t_12_31_46 -k
getting samples and corresponding read numbers

Converting input files
building bowtie index
mapping mature sequences against index
mapping read sequences against index
Mapping statistics

#desc total mapped unmapped %mapped %unmapped
total: 5366795 5366795 0.000 1.000
seq: 5366795 5366795 0.000 1.000
analyzing data
Expressed miRNAs are written to expression_analyses/expression_analyses_28_10_2019_t_12_31_46/miRNA_expressed.csv
not expressed miRNAs are written to expression_analyses/expression_analyses_28_10_2019_t_12_31_46/miRNA_not_expressed.csv

Creating miRBase.mrd file

make_html2.pl -q expression_analyses/expression_analyses_28_10_2019_t_12_31_46/miRBase.mrd -k dataset_11714382.dat -t human -y 28_10_2019_t_12_31_46 -o -i expression_analyses/expression_analyses_28_10_2019_t_12_31_46/dataset_11714382.dat_mapped.arf -m hsa -M miRNAs_expressed_all_samples_28_10_2019_t_12_31_46.csv
miRNAs_expressed_all_samples_28_10_2019_t_12_31_46.csv file with miRNA expression values
parsing miRBase.mrd file finished
creating PDF files

ended: 12:35:27
total:0h:1m:58s

started: 12:35:27
rna2dna.pl /data/dnb02/galaxy_db/files/011/714/dataset_11714382.dat > mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/dataset_11714382.dat

rna2dna.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723258.dat > mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/dataset_11723258.dat

ended: 12:35:27
total:0h:0m:0s

#parsing genome mappings
parse_mappings.pl /data/dnb02/galaxy_db/files/011/723/dataset_11723285.dat -a 0 -b 18 -c 25 -i 5 > mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/dataset_11723285.dat_parsed.arf

started: 12:35:27

ended: 12:35:27
total:0h:0m:0s

#excising precursors
started: 12:35:27
excise_precursors_iterative_final.pl /data/dnb02/galaxy_db/files/011/710/dataset_11710212.dat mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/dataset_11723285.dat_parsed.arf mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/precursors.fa mirdeep_runs/run_28_10_2019_t_12_31_46/tmp/precursors.coords 50000
1 0

ended: 12:35:44
total:0h:0m:17s

No precursors excised

https://usegalaxy.eu/u/amir_sabbaghian/h/mir


Hi @amir

Some references for context:

What to try to resolve the error:

The “Precursor sequences” input fasta, dataset 244 in your shared history, has “RNA” encoding (U instead of T).

Try changing this input to have “DNA” encoding and rerun to test if that resolves the issue. Use the same method as before (tool: RNA/DNA converter (Galaxy Version 1.0.2) – the way you used it to produce dataset 245 earlier).

I see that you attempted that in dataset 247 (using input dataset 243). Try that again, but use the unwrapped version of the same data in dataset 244 (produced with the tool: FASTA Width formatter (Galaxy Version 1.0.1)).

The RNA/DNA converter tool does not accept “wrapped” fasta format. Tools from the FASTX-Toolkit are great for data manipulation but, like many tools, can be a bit picky – but they almost always have very informative error messages. The error for dataset 247 is very specific, and it looks like you acted on the advice but didn’t follow through with the next step (actually converting to DNA encoding before running MiRDeep2).

fasta_nucleotide_changer: Invalid input: This looks like a multi-line FASTA file.
Line 3 contains a nucleotides string instead of a '>' prefix.
FASTX-Toolkit can't handle multi-line FASTA files.
Please use the FASTA-Formatter tool to convert this file into a single-line FASTA.

Technically, the source tool can “work” with RNA encoding, but evidently not the Galaxy wrapped version from what I can tell. Or maybe all inputs need to be either RNA or DNA encoding – I’m not really sure. Perhaps all inputs except the reference genome/fastq (those should have DNA encoding for most tools, regardless of other input requirements). Syncing all of the input data encoding to be the same (DNA) seems like a logical way forward, and is how the test data for this tool is formatted.
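For reference, a rough line-command version of that unwrap-then-convert sequence – only a sketch, since the Galaxy tools above do the same thing (file names are placeholders):

    # 1. Unwrap to single-line fasta (one sequence per line) so the converter will accept it
    awk '/^>/ {if (seq) print seq; print; seq=""; next} {seq = seq $0} END {if (seq) print seq}' \
        precursors_rna.fa > precursors_rna.singleline.fa

    # 2. Convert RNA to DNA encoding (U -> T) on sequence lines only, leaving ">" title lines untouched
    sed '/^>/! y/Uu/Tt/' precursors_rna.singleline.fa > precursors_dna.fa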

It looks like you are getting close to having this work, which is great!! Galaxy helps to make tools easier to use but sometimes a bit of data reformatting/testing out troubleshooting solutions is needed. The same is true when running tools line-command. Each tool author has a set of formatting rules in mind, and clearly these assumptions can vary widely. Navigating these little hurdles is a skill set you’ve been building up at a very fast pace – so feel good about it! Everyone goes through some process similar to yours when trying out new tools/protocols for the first time… and the next time… and the next time… :slight_smile: Let us know if this works out – I didn’t notice anything else obvious that might be problematic but let’s see what happens with an actual rerun with fixed-up inputs.

Thanks!

cc @bjoern.gruening – Am I missing anything or do you have more advice?

@jennaj Thanks for your help on this. I got rid of the wrong nucleotides with the TEXT TRANSFORMATION tool, because this precursor miRNA has many non-standard nucleotides. After that I tried the RNA/DNA converter, but the error message said I should turn that sequence into single-line format, and I did so (as you mentioned). But after I did this, I don’t know how to get it back to the multi-line fasta format. I think this may be the problem, because maybe MiRDeep2 does not recognize this sequence in single-line format as a precursor miRNA. I thought that if I could somehow turn it back into the same format the precursor miRNA had originally (it is converted to DNA now), that might solve the problem. So how can I change it back to the original format the precursor miRNA was in?


Hi @amir

I’m not sure if all the fasta inputs need to be wrapped or not for this tool. You’ll need to test both formats and see what works.

The reference genome should definitely be wrapped (any tool).

I think I just need to return the precursor miRNA to its original format after converting RNA/DNA (the original format can’t be used with the RNA/DNA converter). I think the reference, the mature, and my own data are fine. Do you understand what I’m saying? The question is how to change the precursor miRNA back to the original format. I think you misunderstood me in the last post. Of course that is my mistake, my English needs more practice :sweat_smile:


Do you mean that you want to wrap this fasta sequence again?

Either the NormalizeFasta or FASTA Width formatter tool can be used. 80 bases is the most common (and original) wrapped length for fasta data.

Also – all bases need to be ATCGN. If you have IUPAC characters in your data, change those to “N”. The tool manual links I shared before have example data. Lines that have any characters other than ATCGNU are rejected. And Ts in some of your inputs and Us in others will create conflicts (last error you reported).
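In case it helps to see both clean-ups together, a line-command sketch (placeholder file names; in Galaxy the NormalizeFasta or FASTA Width formatter tool handles the wrapping step):

    # Replace any character that is not A/C/G/T/N (e.g. IUPAC ambiguity codes) with N, on sequence lines only
    sed '/^>/! s/[^ACGTNacgtn]/N/g' precursors_dna.fa > precursors_dna.clean.fa

    # Re-wrap at 80 bases; ">" title lines are shorter than 80 characters after normalizing, so they pass through unchanged
    fold -w 80 precursors_dna.clean.fa > precursors_dna.wrapped.fa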

You see this error:
fasta_nucleotide_changer: Invalid input: This looks like a multi-line FASTA file.
Line 3 contains a nucleotides string instead of a ‘>’ prefix.
FASTX-Toolkit can’t handle multi-line FASTA files.
Please use the FASTA-Formatter tool to convert this file into a single-line FASTA.

cat: write error: Broken pipe

So if I change this format, MiRDeep2 doesn’t recognize this data as precursor miRNA. So I have to change it with some tool like FASTA Width formatter, then convert it to DNA, and after that return it to the original format it had at the start. Do you think this idea is wrong?