I have three sets of RNA-Seq data that I have treated exactly the same, I have aligned to my genome (HISAT2), merged (Merge BAM files) and used to annotated my genome (BRAKER3). These gave good mapping rates (all >90%).
I am having trouble using featureCounts on my BAM files. I noticed that one of the BAM files worked perfectly with featureCounts, while the other two gave 0 assigned reads. I then noticed that the sequence name in the header of the other two BAM files did not match the sequence name in the GTF annotation file. Assuming this is the source of my problem, is there a way to fix this? There are many sequence names in each BAM file, e.g.:
ABCD01000000001.1
ABCD01000000002.1
ABCD01000000003.1
etc.
The only part of the sequence name that differs are the letters, i.e. the mismatched headers look as follows:
ABCE01000000001.1
ABCE01000000002.1
ABCE01000000003.1
etc.
While yes, you could edit a BAM file directly with text manipulation tools if you convert it to plain text first (SAM format), that is usually a really bad idea!
Those lines in your BAM file are the sequence identifiers from the reference genome you mapped against. Those map to the fasta sequences in your reference genome – and those labels for specific nucleotide strings matter!
Then the GTF reference annotation is where other features have been placed along those same nucleotide strings: genes, transcripts, other features.
Featurecounts is counting up hits against coordinates on those sequences where the features in the annotation are. Mixing this up means that all your counts will be wrong (if they even process, this tends to fail!).
In short, you have a Reference data mismatch problem. You may have mapped against the wrong genome for a few runs. You can double check what you used for parameters per job on the Job Details view (using the i-icon) and make changes when you run the job again (using the rerun-icon).
This FAQ has more details but I’m guessing a rerun with a different target database choice for your case is enough. Use the genome that is a match with the GTF for all steps.
If you need more help with confirming the problem, you are welcome to share back your history for more exact feedback. If you are working through a tutorial, please link that back too for context. See → How to get faster help with your question
Let’s start there, and let us know if you solve this!
Thanks so much for your help! I checked the input genome for the two mismatched RNA-Seq, and it is indeed a slightly different input to the correctly header-ed RNA-Seq (unfortunately, it’s one I’ve since deleted and purged so I can’t check exactly where I went wrong)! I’ve rerun the alignments and the sequence names now match.
Just wondering, do I now have to rerun BRAKER3? I assume some of the names in the annotation will be incorrect?
If you were using those mapping results with any tool, yes, you should rerun using the corrected data. The chromosome names and the sequences and the mapping coordinates will all be different with a different mapping target genome.