Troubleshooting FeatureCounts Error

I teach RNA-seq analysis using Galaxy, following the Reference-based RNA-Seq data analysis tutorial. Currently, my students encounter an error after using featureCounts:

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'exon ENST00000832824.1;'.

You can find an example of this error in one of my students’ histories: https://usegalaxy.org/u/sina_m04/h/gc

I noticed that the mapping rate is very low, which is an issue that needs to be addressed. However, this issue doesn’t seem to be causing the error mentioned above. The error appears to be related to the GTF file itself.
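One way to test that suspicion outside of Galaxy is to scan the annotation for feature lines whose 9th column lacks a `gene_id "..."` entry. The following is only a minimal sketch: the function name and the toy lines are made up for illustration, and it is not how featureCounts itself parses the file.

```python
import re


def find_lines_missing_attribute(gtf_lines, attribute="gene_id"):
    """Return (line_number, detail) pairs for feature lines lacking `attribute`.

    Hypothetical helper for illustration only; featureCounts uses its own parser.
    """
    # GTF attributes look like: gene_id "X"; transcript_id "Y";
    pattern = re.compile(r'(?:^|;)\s*' + re.escape(attribute) + r'\s+"[^"]*"')
    problems = []
    for number, line in enumerate(gtf_lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append((number, "expected 9 tab-separated columns, found %d" % len(fields)))
            continue
        if not pattern.search(fields[8]):
            problems.append((number, fields[8]))
    return problems


# Toy data: one header line, one well-formed line, and one line whose
# attribute column matches the example quoted in the error message.
sample = [
    "##description: toy annotation\n",
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";\n',
    "chr1\tsrc\texon\t300\t400\t.\t+\t.\texon ENST00000832824.1;\n",
]
print(find_lines_missing_attribute(sample))
```

If a check like this flags lines in one copy of the annotation and not in another, the two files are probably not the same GTF, even if they carry similar names in the history.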

Interestingly, my other students have used this GTF file for the analysis of other datasets (they are single-end, as opposed to the paired-end history that I shared), but they don’t encounter the same error.

I would be grateful if you could help me identify the cause of this error.

Hi @Maryam_Momeni

I think I looked at one of your students’ histories this morning.

This is the link to the topic → Unknown data type in infer experiment

The problem is the same as the other history – mapping against the wrong reference genome. I explained how to check that in the other topic.

On the reference data issues in both histories: I see a few different versions of this data, and while that worked, I don’t think all of those steps are really needed.

You can use the GTF file from Gencode directly with tools. However, some might complain about the # header lines, since those are technically out of specification. I’m guessing that is why you are pulling in the GFF3 for some steps? Instead, you can remove the # lines in a single step, and the resulting GTF can be used with RNA-STAR and featureCounts (or HISAT2, or HTSeq-count, or really any others I can think of). You can then convert that standardized GTF to a BED12 for Infer Experiment.
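For anyone who wants to see what that header-removal step amounts to outside of Galaxy, here is a minimal Python sketch. The function names and file paths are placeholders chosen for this example, and it assumes the only out-of-specification content is lines that start with #.

```python
def strip_header_lines(lines):
    """Yield only the lines that do not start with '#'."""
    for line in lines:
        if not line.startswith("#"):
            yield line


def write_clean_gtf(source_path, target_path):
    """Copy a GTF to target_path without its '#' lines; return the number of lines kept.

    Both paths are placeholders supplied by the caller.
    """
    kept = 0
    with open(source_path) as source, open(target_path, "w") as target:
        for line in strip_header_lines(source):
            target.write(line)
            kept += 1
    return kept


# Toy input: two header lines followed by one feature line.
toy = [
    "##description: toy annotation\n",
    "##format: gtf\n",
    'chr1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "GENE1";\n',
]
print(list(strip_header_lines(toy)))
```

Inside Galaxy the same result comes from a single line-filtering step on the GTF dataset, so no scripting is required for the class.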

You could also start with the GFF3 → GTF → BED12. I think getting both the GFF3 and GTF seems confusing… but you decide how you want to teach this. :slight_smile: Learning how to do all the manipulation is valuable.

And, you could use the Gencode GTF for all of the core analysis steps, and get a BED12 from UCSC just for Infer Experiment (tool: UCSC main). The strand trends can be seen with any primary gene prediction track – which specific annotation source shouldn’t matter since it is an isolated step, just for those statistics. But maybe better to not confuse students – some species won’t be available from UCSC, and a GTF should convert fine.

We have some help for how to get all of the reference data synched up. Those resource guides focus on these three main points:

  1. confirm data is all based on the same assembly version (the nucleotides)
  2. confirm that the chromosome naming is consistent, or adjust it to be consistent (Gencode will be the same as UCSC, but Ensembl will be different)
  3. then confirm that the formatting is in specification (RNA-STAR is picky; other tools may not be, but the errors can be odd, so it is better to avoid those from the start).
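The second point can be checked mechanically. Below is a rough sketch, with hypothetical function names and toy data, that compares the sequence names in a genome FASTA with the first column of a GTF. It only covers naming: it cannot tell whether two files with matching names come from the same assembly version.

```python
def fasta_sequence_names(fasta_lines):
    """Collect the first word of every '>' header line in a FASTA file."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gtf_sequence_names(gtf_lines):
    """Collect the values of column 1 from a GTF, ignoring '#' and blank lines."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_lines, gtf_lines):
    """Report which sequence names are shared and which appear in only one file."""
    genome = fasta_sequence_names(fasta_lines)
    annotation = gtf_sequence_names(gtf_lines)
    return {
        "shared": sorted(genome & annotation),
        "only_in_gtf": sorted(annotation - genome),
        "only_in_fasta": sorted(genome - annotation),
    }


# Toy data: the FASTA uses UCSC-style names, one GTF line uses an Ensembl-style name.
toy_fasta = [">chr1 toy sequence\n", "ACGT\n", ">chr2\n", "ACGT\n"]
toy_gtf = [
    "##toy header\n",
    '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "GENE1";\n',
    'chr2\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "GENE2";\n',
]
print(compare_names(toy_fasta, toy_gtf))
```

An empty or very short "shared" list is the usual sign of a naming mismatch between the genome and the annotation.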

That’s why I think it is a good idea to just standardize everything at the very start. It is sort of like making sure the reads are retrieved and organized/labeled in a collection at the start – it makes everything downstream predictable, so you can focus on parameters and scientific interpretation, not chasing down odd technical issues from opaque error messages.

Getting this kind of data organized is a common task in bioinformatics for really any pipeline, not just DE analysis. It looks like that is also part of what you are teaching. Great to see it! What has been done so far is very close to working.

Resources

We have a lot of Q&A about getting reference data organized in different topics, across tools. Find these by searching for reference-annotation, reference-genome, and reference-transcriptome, along with tool names like featurecounts, rna_star, or hisat2.

Please give that a try, and ask follow-up questions if I missed something. A clean history with the correct mapping and reference data that still presents with an error would be something we can help with. (I think the current errors are due to that mismatch: the tool is “guessing” about what to report back, and the guess was a bit off.) Please know that most people in the US will be away Thursday/Friday this week (holiday :turkey:) but our EU and AU friends will probably be available. :rocket: