I have run into a problem when trying to count features from a genomic alignment.
I have used RNA-star to align single end FASTQ data from the rabbit. The mapping goes well and delivers a good unique mapping of 74%. To then count features (using featurecounts) I use the same genome imported via UCSC table browser as ncbi-refseq-genome-all in GTF format. So far so good.
At first inspection everything looks great, genes I expect highly represented are indeed so. However, one key gene I expected to be highly represented was missing from the counts and was given as being zero, something that seemed very odd.
I converted the aligned BAM from the RNA-star output to a bigwig for UCSC import and could clearly see this gene should certainly be present in terms of transcript counts. I also examined my GTF file and could also confirm that all the exon locations and gene annotations were present as expected for this gene and all matched the gene_id locations for the observed BigWig output under UCSC. I proceeded to inspect other genes that according to the featurecount table were zero. Many, were indeed zero, however, I came across many entries that were clearly not zero according to the aligned BigWig. Since I previously came across this issue (by chance) when performing a mouse RNA-seq experiment (but at the time just disregarded it a a ‘glitch’), I am now more concerned. I have inspected all the files and inputs I am using but as for myself, can find no good explanation for this output mismatch.
Retrieving GTFs from the UCSC Table Browser can introduce two problems. UCSC explains about why not to get GTFs using the TB on their site but sometimes that is missed.
The gene_id and transcript_id values will be the same value (scientific problem wherever it is used, and probably explains the odd results)
The output can be truncated due to output size limits (technical problem once in Galaxy)
You didn’t state which genome you are using, but check in their Downloads area for GTF annotation. If your genome doesn’t have this already available, write into UCSC’s help and they can assist you to get a proper file.
Yes, I had seen this cut off on some file formats (such as BED) but have never noticed it from on gtfs especially when directly sent to Galaxy.
Still seems strange that the genes I am looking for I can clearly see listed in the GTF I sent to Galaxy and all have the correct annotations and positional locations in the gtf file. If they had been missing I would have understood your explanation. Further, since the large majority of genes do have good counts that make sense compared to the observed BigWig of the BAM, I find your possible explanation in your point 1 somewhat hard to follow.
I am using the Broad April 2009 OryCun version which I believe is the same as the one in Star. I know there is an October 2009 update version, so I tried that, but that gives zero for all entries, hence my conclusion that April 2009 is good.
If you are saying that there is code I cannot read and check in the GTF which will be cut out, then this is of course something I cannot check.
I shall try cutting the GTF to just the chromosome I am interested in, to check you hypothesis regarding output size limits, but from what I am seeing I don’t really think this is the problem.
Thanks for all your help and your discussion in this and all my previous questions to you.
So I have now downloaded and used the GTF from the index path on ncbi. This indeed now solves the problem. So many many thanks for that tip! Still don’t fully grasp why it doesn’t work for all genes in the way I was previously performing this analysis, but I shall not break my head about it.