I tried aligning my RNA-seq data to the reference genome using RNA_STAR. Without trimming the adapter (Illumina universal adapter) and the poly-G sequence, which was reported as overrepresented, I got about 25% read alignment on average across all my samples. After using cutadapt to trim them and then repeating the alignment, the percentage went up to 50%. I am wondering whether this is normal or whether there is something wrong with my data. Either way, I am not getting good DE of genes.
Welcome, @Saksham_Jain
It sounds like you removed artifacts from your reads, which helps those reads align specifically to your reference genome. This is usually the desired result.
Now, if in later steps the data is not what you expect it to be, the possible causes break down into a few areas, each with different scientific, and possibly technical, reasons. These are the top two.
- Your alignments are not overlapping with your expected and known annotated regions.
  - Try using a tool like featureCounts. This tool reports statistics that explain why read alignments are not being counted. We have a lot of discussion about those reports at this forum; many are under featurecounts, or just search with that tool name.
  - Double-check your reference annotation (GTF, GFF3) to eliminate technical problems (mismatched identifiers, format problems).
  - Then maybe try visualizing the reference genome, reference annotation, and BAM in a genome viewer to see if you can notice what might be going on scientifically. IGV and UCSC are good choices.
  - If UCSC is available for your genome with other annotated tracks (created by them), the repeats and conservation tracks are my personal choices for this kind of custom, deeper-dive scientific review.
- Your alignments are overlapping with your expected and known annotated regions, but DE tools are not detecting meaningful expression differences between the sample groups (conditions).
  - The same sanity checks from above about how those counts were generated are still the first step.
  - One or more samples may have a quality problem or a content problem, or be mixed up, or be contaminated.
  - Or, this might be the actual scientific result!
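
For the featureCounts check mentioned above, it can help to turn the summary report into percentages per sample. Here is a minimal Python sketch, assuming the usual tab-separated summary layout (a `Status` header row, then one column per BAM). The function name `summarize_featurecounts` and the numbers in `example` are made up for illustration and are not from your data.

```python
import csv
import io


def summarize_featurecounts(summary_text):
    """Return {sample: {status: percent_of_reads}} from featureCounts summary text.

    Assumes a tab-separated table with a 'Status' header row and one
    column per BAM file. The function name is hypothetical.
    """
    rows = list(csv.reader(io.StringIO(summary_text.strip()), delimiter="\t"))
    header, body = rows[0], rows[1:]
    samples = header[1:]
    result = {}
    for col, sample in enumerate(samples, start=1):
        counts = {row[0]: int(row[col]) for row in body}
        total = sum(counts.values())
        # Keep only categories that actually contain reads.
        result[sample] = {
            status: (100.0 * n / total if total else 0.0)
            for status, n in counts.items()
            if n > 0
        }
    return result


# Made-up numbers, only to show the shape of the input.
example = (
    "Status\tsample1.bam\n"
    "Assigned\t600\n"
    "Unassigned_NoFeatures\t300\n"
    "Unassigned_MultiMapping\t100\n"
    "Unassigned_Ambiguity\t0\n"
)

for sample, pct in summarize_featurecounts(example).items():
    for status, value in sorted(pct.items(), key=lambda kv: -kv[1]):
        print(f"{sample}\t{status}\t{value:.1f}%")
```

As a rough guide, a large `Unassigned_NoFeatures` share usually points toward reads landing outside annotated features or an annotation that does not match the genome, while a large `Unassigned_MultiMapping` share often points toward repetitive content. Treat both as hints for where to look next, not as a diagnosis.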
We have tutorials that process transcriptomics data through a few workflows with different but similar tool choices. Those examples can be used for exploration, too. Maybe check whether your current DE result is specific to your current tool choices or whether it can be replicated across several different tool choices, then make decisions from there.
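
One simple way to compare results across tool choices is to look at how much the lists of significant genes overlap. This is only a sketch: the function name `compare_de_calls`, the workflow labels, and the gene IDs are all hypothetical, and it assumes you have already exported the significant gene IDs from each workflow.

```python
def compare_de_calls(calls_by_tool):
    """Pairwise Jaccard overlap between sets of significant gene IDs.

    calls_by_tool maps a label (hypothetical, e.g. a workflow name)
    to an iterable of gene IDs called significant by that workflow.
    """
    tools = sorted(calls_by_tool)
    overlaps = {}
    for i, a in enumerate(tools):
        for b in tools[i + 1:]:
            set_a, set_b = set(calls_by_tool[a]), set(calls_by_tool[b])
            union = set_a | set_b
            # Jaccard index: shared genes divided by all genes called by either.
            overlaps[(a, b)] = len(set_a & set_b) / len(union) if union else 0.0
    return overlaps


# Made-up gene lists, only to show the input shape.
example_calls = {
    "workflow_A": ["gene1", "gene2", "gene3"],
    "workflow_B": ["gene2", "gene3", "gene4"],
}

for (a, b), score in compare_de_calls(example_calls).items():
    print(f"{a} vs {b}: Jaccard = {score:.2f}")
```

A low overlap suggests the result depends on the tool choices, while a consistently small or empty list across workflows is more likely to reflect the data itself.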
Hope this helps!