This is my first time working on bulk RNA seq data and I’m having a few issues. Basically I’m doing a small analysis on just 2 samples to try and learn.This is briefly the pipeline:
1. Create a paired collection
2. Flatten collection
3. Quality control with Falco and then Cutadapt
4. RNA STAR for mapping. The samples are human, so as reference genome I used “Human Dec. 2013 (GRCh38/hg38) (hg38). As gene model file for splice junctions, I downloaded this file: “gencode.v38.annotation.gtf.gz” from GENEODE (first one of the list of annotation files for release 38). On this file, I used the Select tool to remove headers (#) and I used that selected file as gene model on Star. For length of genomic sequence around annotated junctions, I chose 150 (maximum length of reads -1). The per gene/transcript output is “Per gene read counts (GeneCounts).
Here, I obtain 69% and 64% of uniquely alligned reads, which from my understanding is not a very good result. A good amount of reads are Mapped to multiple loci .
5. Convert GTF to BED12 to convert the “select” version of the annotation file + InferExperiment to determine strandness of BAM fil e.
Here the fractions of reads failed to determine are 0.8 and 0.6, but from the few determined reads I can tell that they are reverse strand ed.
6. I tried featureCounts even though something was clearly wrong and I obtained 20% and 28% of assigned reads. I chose 10 as minimum mapping quality per read. From MultiQC, I see that the majority of reads fall under “Unassigned: mapping qual ity”
What could be wrong? I think I could maybe change the annotation file, but I’m not sure how to choose one. Genecode website reported my file as the main annotation file for most users, so this is simply why I chose it. I would be happy to recieve more suggestions, thank you!!
Consider running FastQC/Falco after the Cutadapt trimming step, then view all the reports together in MultiQC. This step makes sure that the trimming did what you expected. (actually remove adaptor/artifact and nothing else odd was detected). The reports are discussed in this tutorial with linkouts to the author’s site with more! → Hands-on: Quality Control / Quality Control / Sequence analysis
We have a public demonstration template workflow here that you can try! The input is just the original raw paired end collection of reads, the output are all the reports and trimmed reads. Be sure to allow it to process completely (data will show/hide throughout the runtime, and that’s expected!)
Finally, for the annotation choice, limiting to the basic genes on the primary chromosomes might help. This is the 4th option down on the Gencode website here → GENCODE - Human Release 49 and has in the description: This is the main annotation file for most users
We hope this feedback helps! There could be an issue with the reads of course! If this was from a public sample like SRA, you could review the metrics they capture and maybe there is discussion online at a site like Biostars.org. If these were your own samples, chemistry/sequencing issues are possible (potential items to modify for next time). That said, for both cases, I would try the items above first since these are where you would need to tune to use this data anyway. Poor quality happens, and people still use the data sometimes, so you are checking that there isn’t bias that could impact the scientific result interpretations!