Hi, this is Namarig Elmalih, a second-year graduate student at A&T University working with Dr. Perpetua Muganda. I’ve found errors in the MiRDeep2 Quantifier and Identifier tools in certain samples but not others. This is my history link: Galaxy
Hi @namarig
Thanks for posting all the details, very helpful!
Review your mapping input for this job. You probably need to run MiRDeep2 Mapper with the correct files again.
The error is reporting that the target “genome” contains identifiers that are not found in the mapping file (and the reverse!). The reads part of the mapping appears to be OK, so the fix belongs in the upstream tool: rerun the Mapper against the same reference genome that you are supplying in this step.
Where is this documented?
Under that mapping input area on the tool form:
Reads mapped against genome. Mappings should be in ARF format.
Then, down in the Help section of the tool form:
What it does
MiRDeep2 is a software package for identification of novel and known miRNAs in deep sequencing data. Furthermore, it can be used for miRNA expression profiling across samples.
Input
A FASTA file with deep sequencing reads, a FASTA file of the corresponding genome, a file of reads mapped to the genome in miRDeep2 arf format, an optional FASTA file with known miRNAs of the species being analyzed and an optional FASTA file of known miRNAs of related species.
Arf format: Is a proprietary file format generated and processed by miRDeep2. It contains information of reads mapped to a reference genome. Each line in such a file contains 13 columns:
- read identifier
- length of read sequence
- start position in read sequence that is mapped
- end position in read sequence that is mapped
- read sequence
- identifier of the genome-part to which a read is mapped to. This is either a scaffold id or a chromosome name
- length of the genome sequence a read is mapped to
- start position in the genome where a read is mapped to
- end position in the genome where a read is mapped to
- genome sequence to which a read is mapped
- genome strand information. Plus means the read is aligned to the sense-strand of the genome. Minus means it is aligned to the antisense-strand of the genome.
- Number of mismatches in the read mapping
- Edit string that indicates matches by lowercase ‘m’ and mismatches by uppercase ‘M’
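To check the identifier mismatch described above yourself, here is a minimal Python sketch that reads the 13 ARF columns and compares the genome identifiers (column 6) against the FASTA headers. It is not part of miRDeep2 or Galaxy. It assumes the ARF file is tab-separated and that the FASTA identifier is the header text up to the first whitespace; the names `ArfRecord`, `parse_arf`, `fasta_ids` and `compare_ids` are made up for this example.

```python
from typing import Iterable, Iterator, NamedTuple, Set, Tuple


class ArfRecord(NamedTuple):
    """One ARF line, fields in the order listed in the tool help."""
    read_id: str
    read_length: int
    read_start: int
    read_end: int
    read_seq: str
    genome_id: str
    genome_length: int
    genome_start: int
    genome_end: int
    genome_seq: str
    strand: str
    mismatches: int
    edit_string: str


def parse_arf(lines: Iterable[str]) -> Iterator[ArfRecord]:
    """Yield one record per non-empty line (assumes tab-separated columns)."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        f = line.split("\t")
        if len(f) != 13:
            raise ValueError(f"line {number}: expected 13 columns, got {len(f)}")
        yield ArfRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                        int(f[6]), int(f[7]), int(f[8]), f[9], f[10],
                        int(f[11]), f[12])


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect FASTA identifiers (assumption: text up to the first whitespace)."""
    ids = set()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def compare_ids(arf_lines: Iterable[str],
                fasta_lines: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (ids only in the ARF file, ids only in the genome FASTA)."""
    arf = {record.genome_id for record in parse_arf(arf_lines)}
    genome = fasta_ids(fasta_lines)
    return arf - genome, genome - arf


if __name__ == "__main__":
    # Hypothetical data: the mapping refers to contig_A, the genome holds contig_B.
    arf_example = ["read_1\t4\t1\t4\tacgt\tcontig_A\t4\t10\t13\tacgt\t+\t0\tmmmm"]
    fasta_example = [">contig_B some description", "ACGTACGT"]
    only_in_arf, only_in_genome = compare_ids(arf_example, fasta_example)
    print("Only in ARF:", sorted(only_in_arf))
    print("Only in genome:", sorted(only_in_genome))
```

If both sets come back empty, the mapping and the genome use the same identifiers. Anything listed means the two files were probably built from different references. To run it on real data, pass open file handles instead of the example lists.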
Hope this helps!
Hi jennaj
I repeated my work, but it still gives me errors in both the MiRDeep2 Quantifier and Identifier.
Hi @namarig
You are not getting meaningful results at the mapping step.
You can find that result in the job logs (using the i-icon).
I played around a bit with your data. I can get about 20k of all the reads (out of over a million!!) to map with BLASTN using very permissive settings: specifically, by allowing multi-mappings and very short weak hits. The mapped length in that independent test was as short as 14 bases. You could explore this, too, by using BLAST as a type of sanity check.
Note that your reads and your reference genome are not from the same species (from a traditional perspective). If you are hunting for something new, that’s science! But you’ll need to treat the output as an “exploratory result”. What you are trying to do is not so simple, and the very low mapping rate may be real (once you have verified the QA and corrected the read-collapsing step).
- Genome: Human gammaherpesvirus 4, complete genome - Nucleotide - NCBI (virus, from human source, no human DNA present, correct?)
- Reads: https://www.ncbi.nlm.nih.gov/sra/?term=SRR7547891 (tumor, from human source, with all the other human DNA still present!)
What to do
- Review how QA was performed on the original reads.
- Review how you are collapsing the reads. If you don’t remove the duplicates, you will have non-specific mapping results in the downstream step and, depending on the mapping settings, those alignments will be filtered out instead of being reported.
- You don’t need to attempt the downstream tools until all of the inputs to those tools are usable. So far, the open questions appear to be the quality of the reads and how they map to your reference after collapsing them.
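To get a quick sense of how much duplication is in a reads FASTA file before mapping, a small Python sketch like the following can help. It only counts identical sequences. It is not the collapsing performed by the MiRDeep2 Mapper and does not reproduce its output naming. The function names (`read_fasta`, `collapse`, `redundancy_summary`) are made up for this example, and treating upper- and lowercase sequences as identical is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; sequences may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse(records: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each sequence occurs (assumption: case-insensitive)."""
    return Counter(seq.upper() for _, seq in records)


def redundancy_summary(counts: Counter) -> Dict[str, float]:
    """Summarize total reads, unique sequences and the duplicated fraction."""
    total = sum(counts.values())
    unique = len(counts)
    duplicate_fraction = (total - unique) / total if total else 0.0
    return {
        "total_reads": total,
        "unique_sequences": unique,
        "duplicate_fraction": duplicate_fraction,
    }


if __name__ == "__main__":
    # Hypothetical reads: two copies of one sequence and one singleton.
    example = [">r1", "ACGTACGT", ">r2", "acgtacgt", ">r3", "TTTTGGGG"]
    counts = collapse(read_fasta(example))
    print(redundancy_summary(counts))
    for seq, n in counts.most_common():
        print(n, seq)
```

A high duplicated fraction in the file you send to the mapping step suggests the reads were not collapsed, which is worth fixing before looking at the Quantifier or Identifier output.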
We don’t want to provide too much scientific advice here since our focus is on using Galaxy. Maybe visit a scientific forum where other scientists spend time. Or, if you can find a publication, you can probably replicate that in Galaxy.
These tools also have a tutorial. I know it is complicated, but understanding what is happening at each step will provide you with an example of what to check with your own data. You could even just use parts of the tutorial data to run through your own custom workflow just to make sure that the basics are intact. Then you can focus on the data interpretation parts.
Hope this helps!