Hello, my raw data are microRNA reads, which I downloaded from NCBI through the Galaxy server. After that, I checked the data with the FastQC tool. When I then mapped the reads with the STAR tool, the mapping looked great, with more than 80% of reads mapped, but STAR could not give me any counts, and I do not know why.
Could someone please give me some advice?
Welcome, @Hossein_Poursheykhi
Thanks for explaining what is going wrong. There are two potential things going on when a count result is unexpected.
- This is an actual scientific result.
Counts could be zero for some or all features if the reads did not map to the same genomic places (species, then assembly’s chromosomes and coordinates) as the features you are counting up.
So, make sure the reference data and sequencing targets are synched up! By this I mean: do you expect those reads to map to those features? Have you looked at the BAM and GTF in a genome browser to see if that reveals anything? If you do this at a site like UCSC, you can also toggle on other annotation tracks to help clarify what you might be observing.
Or, reads can multi-map to so many places that the counting tool discards the hits for being non-specific. Ironically, this can be due to trying to save reads that should be rejected during QA. Meaning: fewer high quality reads can be more informative than a zillion low quality reads.
An experiment will usually have a combination of all these going on – and a “good” experiment has “most” reads that map uniquely to the genomic regions of interest.
Good sequencing library design, good reference data, good technical execution == good scientific results!
Maybe your annotation features really do not have any coverage from the sequenced library, and that can be explored for both scientific insights and potential ways to adjust techniques for next time.
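One quick way to check whether the annotation and the mapping reference are "synched up" is to compare the sequence names each one uses. Below is a minimal Python sketch, not a Galaxy tool: the function names are hypothetical, the example lines are made up for illustration (including the `LN` length), and it assumes you have the annotation as tab-separated text plus the `@SQ` header lines from your BAM.

```python
def gtf_seqnames(gtf_lines):
    """Collect the sequence names (column 1) used by annotation lines."""
    names = set()
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def sam_header_seqnames(sam_header_lines):
    """Collect reference names from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in sam_header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def compare_seqnames(gtf_lines, sam_header_lines):
    """Report which annotation sequence names exist in the mapping reference."""
    annotation = gtf_seqnames(gtf_lines)
    reference = sam_header_seqnames(sam_header_lines)
    return {
        "shared": sorted(annotation & reference),
        "annotation_only": sorted(annotation - reference),
    }


# Made-up example: the annotation says "chr1" but the reference says "1".
gtf = ['chr1\t.\ttranscript\t38238904\t38238991\t.\t+\t.\ttranscript_id "MI0000865";\n']
header = ["@HD\tVN:1.4\tSO:coordinate\n", "@SQ\tSN:1\tLN:1000\n"]
result = compare_seqnames(gtf, header)
print(result)
```

If `shared` comes back empty, no read can ever overlap a feature, and every count will be zero no matter how well the reads mapped.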
- This is a poor scientific result.
The problem could be with any of these parts:
Good sequencing library design, good reference data, good technical execution
If the job is failing, not just producing a poor result, that is also a bit informative since the tool is probably providing some details about what it guesses is going wrong. Not always correct but those are usually useful hints about where to start!
We can help with decoding those messages at this forum, plus explain the things to check for if the results are just “odd” and not actually failing.
We have tutorials with examples. Find these at the main site or review the bottom of tool forms for mini examples and links into those full length tutorials. You can also explore our FAQs, and this forum.
- All → https://training.galaxyproject.org/
- RNA-Star → Galaxy Training!
- FAQ → FAQ: Extended Help for Differential Expression Analysis Tools
My guess is that there is some mismatch with the reference data. If you would like help with this, you can share back your work for feedback. Generate the share link to your history with the problem results and the inputs, and post that back here. You can unshare after.
Let us know if you are able to solve this, and I’ll watch for your reply if you choose to share your history! Thanks!
Update
The tool form options were set to parse GTF-style attributes but a GFF3 file was provided.
Solution: convert to GTF format with gffread or source the GTF version of the annotation directly from the same place where the GFF3 was sourced (if available).
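For anyone hitting the same mismatch, here is a rough Python sketch of why the format matters. GFF3 writes column 9 as `key=value` pairs, while GTF writes `key "value";` pairs, so a tool told to look for a GTF-style `gene_id` finds nothing in a GFF3 line. The function name `parse_attributes` is hypothetical, and the example strings are written in the style of a miRBase entry rather than copied from a file. The sketch does not handle semicolons inside quoted values.

```python
import re

# GTF-style attribute: a key, whitespace, then a double-quoted value.
GTF_ATTR = re.compile(r'^\s*(\S+)\s+"([^"]*)"\s*$')


def parse_attributes(column9):
    """Return (style, attributes) for the 9th column of a GFF3 or GTF line."""
    attrs = {}
    style = None
    for chunk in column9.strip().strip(";").split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        match = GTF_ATTR.match(chunk)
        if match:
            attrs[match.group(1)] = match.group(2)
            style = style or "gtf"
        elif "=" in chunk:
            key, value = chunk.split("=", 1)
            attrs[key.strip()] = value.strip()
            style = style or "gff3"
    return style, attrs


# Illustrative column-9 strings for the same feature in each format.
gff3_style, gff3_attrs = parse_attributes(
    "ID=MIMAT0003154;Alias=MIMAT0003154;Name=rno-miR-29c-5p;Derives_from=MI0000865"
)
gtf_style, gtf_attrs = parse_attributes(
    'transcript_id "MIMAT0003154"; gene_id "MIMAT0003154";'
)

print(gff3_style, "gene_id" in gff3_attrs)
print(gtf_style, "gene_id" in gtf_attrs)
```

The GFF3 line has no `gene_id` key at all, which is why counting by `gene_id` cannot work until the file is converted.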
Hi, thank you for your useful tips, but I have already done everything you mentioned. I checked my FastQC reports, and the quality of all my microRNA (SRR) data was good. Additionally, I downloaded my annotation file, rno.gff3, from the miRBase database and then converted it to a GTF file with the gffread tool. I also checked that the version of my annotation matches my reference, which is rn6, and I paid close attention to choosing the right species in the STAR tool. I am sure my reference data has good quality and my annotation is relevant to my data.
PS: I have checked nearly all the tutorials on the Galaxy server and searched a lot, but unfortunately I have not found anything that helps. I have been working on this data for a month and could not find a solution.
After doing all these things, I do not know what I should do next.
PS: I attached my history link below and I hope you can assist me in solving this problem.
Thank you so much for your time and attention.
History link:
Great, thank you for sharing the history!
Your annotation is a bit special – this is based on transcripts, not genes. The gffread tool will add in a sort of “placeholder” gene_id attribute but that is really just the transcript_id value again. So, while the tool was failing for a technical reason, that was a good clue that something in the query didn’t make sense, and that can lead to poor scientific results (not always so easy to detect!).
What you could try is setting up the form like this.
Your counts will be based on transcripts instead of genes, but you can use this with downstream tools that do gene DE (probably). And since these all appear to be “single exon transcripts” and these are just slices of the original full transcripts (the header of your annotation describes this better!), and the relationship between these slices is not summarized in your reference data in a standard way, that might be as far as you can go using standard transcriptomics tools.
To see what I mean by these observation comments – using your exact data as a test – process the GFF3 through gffread again and toggle the option for full GFF attribute preservation (all attributes are shown).
Then use the Select search with the keyword MI0000865 (as an example).
The “Derives_from” value is a bit like a “gene”, yes? It represents a footprint on the genomic strand, and multiple “transcripts” can be associated with it. But this annotation doesn’t organize the data that way. Probably because the data provider couldn’t provide that summary level for every feature line or didn’t think the nesting would be useful (guesses!).
chrom | source | feature | start | end | score | strand | frame | attributes |
---|---|---|---|---|---|---|---|---|
chr1 | . | transcript | 38238904 | 38238991 | . | + | . | transcript_id MI0000865; gene_id MI0000865; Alias MI0000865; Name rno-mir-29c-1 |
chr1 | . | exon | 38238904 | 38238991 | . | + | . | transcript_id MI0000865; |
chr1 | . | transcript | 38238919 | 38238940 | . | + | . | transcript_id MIMAT0003154; gene_id MIMAT0003154; Alias MIMAT0003154; Name rno-miR-29c-5p; Derives_from MI0000865 |
chr1 | . | transcript | 38238957 | 38238978 | . | + | . | transcript_id MIMAT0000803; gene_id MIMAT0000803; Alias MIMAT0000803; Name rno-miR-29c-3p; Derives_from MI0000865 |
The point is that you can process this data in an exploratory way any way that you want to – just be aware that the “standard” expected usage might need a little bit of extra fiddling on your part to avoid logic errors.
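As one example of that extra fiddling, the Derives_from values could be used to build your own transcript-to-"gene" lookup and roll mature miRNA counts up to the hairpin they come from. This is only a sketch under assumptions: the record layout, the function names, and the counts are all hypothetical (the counts are invented purely to show the arithmetic), and only the identifiers come from the example rows above.

```python
from collections import defaultdict

# Identifiers taken from the example rows; the dict layout is an assumption.
records = [
    {"id": "MI0000865", "name": "rno-mir-29c-1", "derives_from": None},
    {"id": "MIMAT0003154", "name": "rno-miR-29c-5p", "derives_from": "MI0000865"},
    {"id": "MIMAT0000803", "name": "rno-miR-29c-3p", "derives_from": "MI0000865"},
]


def transcript_to_gene(records):
    """Map each feature ID to its Derives_from parent, or to itself if none."""
    mapping = {}
    for rec in records:
        mapping[rec["id"]] = rec["derives_from"] or rec["id"]
    return mapping


def sum_by_gene(counts, mapping):
    """Add up per-feature counts under the parent ID from the mapping."""
    totals = defaultdict(int)
    for feature_id, n in counts.items():
        # Features missing from the mapping keep their own ID.
        totals[mapping.get(feature_id, feature_id)] += n
    return dict(totals)


mapping = transcript_to_gene(records)

# Invented counts for the two mature miRNAs, for illustration only.
hypothetical_counts = {"MIMAT0003154": 12, "MIMAT0000803": 340}
gene_totals = sum_by_gene(hypothetical_counts, mapping)
print(gene_totals)
```

One caution with this approach: the mature intervals sit inside the hairpin interval, so if the hairpin feature is also counted directly, the same reads could be added twice. Decide which level you are counting at before summing.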
If interested in discovery within your own samples, or a different DE approach, you could next review tutorials like these two, and the tools included (there is some overlap!), to get a better idea of how others might approach analyzing similar data to yours. Then, of course, publications are the best source! You might find the same tools in Galaxy or similar open-source tools but the overall flow should be about the same.
- Hands-on: Differential abundance testing of small RNAs / Transcriptomics
- Hands-on: Whole transcriptome analysis of Arabidopsis thaliana / Transcriptomics
Hope this helps!