I think I looked at one of your student’s histories this morning.
This is the link to the topic → Unknown data type in infer experiment
The problem is the same as the other history – mapping against the wrong reference genome. I explained how to check that in the other topic.
For the reference data issues in both histories… I see a few different versions of this data … and while that worked, I don’t think all those steps are really needed.
You can use the GTF file from Gencode directly with tools. However, some might complain about the # header lines since those are technically out of specification. I’m guessing that is why you are pulling in the GFF3 for some steps? Instead, you can remove the # lines in a single step, and the resulting GTF can be used with RNA-STAR and Featurecounts (or HISAT2, HTSeq-count, or really any other tool I can think of). You can also convert that standardized GTF to a BED12 for Infer Experiment.
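To make the "remove the # lines" step concrete, here is a minimal sketch of the same idea in Python, for anyone who wants to see what that single step does to the file. The function name and the example records are made up for illustration (they are not real Gencode content), and inside Galaxy you would use a text manipulation tool rather than a script.

```python
import io


def strip_gtf_comment_lines(lines):
    """Yield only the lines that do not start with '#'.

    Hypothetical helper for illustration: it drops header/comment
    lines and passes every other line through unchanged.
    """
    for line in lines:
        if not line.startswith("#"):
            yield line


# Tiny in-memory example (made-up records, not real annotation data).
example = io.StringIO(
    "##description: example header\n"
    "##provider: EXAMPLE\n"
    'chr1\tEXAMPLE\tgene\t100\t200\t.\t+\t.\tgene_id "g1";\n'
    'chr1\tEXAMPLE\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
)

cleaned = list(strip_gtf_comment_lines(example))
print(len(cleaned), "annotation lines kept")
```

Only lines that begin with # are dropped, so the nine-column annotation records themselves are untouched.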
You could also start with the GFF3 → GTF → BED12. I think getting both the GFF3 and GTF seems confusing… but you decide how you want to teach this. Learning how to do all the manipulation is valuable.
And, you could use the Gencode GTF for all of the core analysis steps, and get a BED12 from UCSC just for Infer Experiment (tool: UCSC main). The strand trends can be seen with any primary gene prediction track – the specific annotation source shouldn’t matter since it is an isolated step, just for those statistics. But it may be better not to confuse students – some species won’t be available from UCSC, and a GTF should convert fine.
We have some help for how to get all of the reference data synched up. Those resource guides focus on these three main points:
- confirm data is all based on the same assembly version (the nucleotides)
- confirm that the chromosome naming is consistent, or adjust it to be consistent (Gencode will be the same as UCSC, but Ensembl will be different)
- then confirm that the formatting is in specification (RNA-STAR is picky, other tools may not be but the errors can be odd – better to just avoid those to start with).
That’s why I think it is a good idea to just standardize everything at the very start. It is sort of like making sure the reads are retrieved and organized/labeled in a collection at the start – it makes everything downstream predictable, so you can focus on parameters and scientific interpretation, not chasing down odd technical issues from opaque error messages.
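As a rough illustration of the chromosome naming check from the list above, the sketch below compares the sequence names in a genome FASTA with the names in the first column of a GTF. The function names and the tiny inputs are hypothetical, and this only tests whether the names match. It does not confirm that both files come from the same assembly version.

```python
def fasta_sequence_names(lines):
    """Collect sequence names from FASTA title lines (text after '>' up to the first whitespace)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_sequence_names(lines):
    """Collect the first column of every GTF line that is not blank or a '#' line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def compare_names(genome_names, annotation_names):
    """Report which names are shared and which appear in only one of the two inputs."""
    return {
        "shared": sorted(genome_names & annotation_names),
        "only_in_genome": sorted(genome_names - annotation_names),
        "only_in_annotation": sorted(annotation_names - genome_names),
    }


# Made-up example: UCSC-style names in the genome, Ensembl-style names in the annotation.
genome = [">chr1 example\n", "ACGT\n", ">chrM\n", "ACGT\n"]
annotation = [
    "#!example header\n",
    '1\tEXAMPLE\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n',
    'MT\tEXAMPLE\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n',
]

report = compare_names(fasta_sequence_names(genome), gtf_sequence_names(annotation))
print(report)
```

An empty "shared" list, as in this made-up case, is the pattern to look for when the genome and the annotation use different naming styles.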
Getting this kind of data organized is a common task when doing bioinformatics – really, any pipeline, not just DE analysis. It looks like that is also part of what you are teaching! Great to see it! What has been done so far is very close to working.
Resources
- Datatypes - Galaxy Community Hub (gtf) << start here
- FAQ: Extended Help for Differential Expression Analysis Tools
- Reference genomes at public Galaxy servers: GRCh38/hg38 example
We have a lot of Q&A about getting reference data organized in different topics, across tools. Find these with the tags reference-annotation, reference-genome, and reference-transcriptome, and with tool names like featurecounts, rna_star, or hisat2.
Please give that a try, and ask follow-up questions if I missed something. A clean history with the correct mapping and reference data that still presents with an error would be something we can help with (I think the current errors are due to that mismatch, and the tool is “guessing” about what to report back, and it was a bit off). Please know that most people in the US will be away Thursday/Friday this week (holiday) but our EU and AU friends will probably be available.