Ok, the second reply helps me to understand.
- Stringtie – the inputs are mapped reads. Whatever was mapped against is where the coordinates are derived from. All other inputs that are coordinate-based need to be based on that exact same reference genome.
- You mapped against the hg38 reference genome at this step in the tutorial.
- You can incorporate reference annotation during mapping (HISAT2) and/or transcript assembly (Stringtie) and/or transcript merging (Stringtie merge), then also include the reference annotation for later steps (differential expression: Featurecounts and DeSeq2 if following the tutorial).
- The reference annotation that is first used represents the known genes/transcripts for your genome. You will still capture novel transcripts if you use one. The different tool forms will state the choices: create transcripts that are known (only – those included in the annotation) OR create known plus novel. You can run it both ways in different tools and compare.
- As you move through the analysis, the reference annotation will be updated and reformatted at certain steps. The updated version from an earlier step is what you want to use in the next downstream steps.
- The hg38 reference annotation GTF is now best sourced from UCSC. Not from the Table Browser but from the Downloads area here. There are a few Gene tracks to choose from. You can review what each of those represents then upload the selected GTF to Galaxy with the Upload tool. Just copy and paste in the URL and leave all other settings at the default. The correct datatype will be assigned.
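If you want a quick sanity check on the "same reference genome" point from the list above, one option is to compare the sequence names in the first column of the GTF against the names used by the genome. This is only a minimal sketch: the function names are made up for illustration, the two GTF lines are invented, and the genome name set is a hand-typed subset in the UCSC `chr` style rather than anything read from Galaxy.

```python
from typing import Iterable, Set


def gtf_seqnames(lines: Iterable[str]) -> Set[str]:
    """Collect first-column sequence names, skipping '#' lines and blanks."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_seqnames(gtf_names: Set[str], genome_names: Set[str]) -> Set[str]:
    """Return GTF sequence names that the genome does not contain."""
    return gtf_names - genome_names


# Invented example lines: one uses the 'chr1' style, one uses the bare '1' style.
example_gtf = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    '1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n',
]

# Assumption: a hand-typed subset of UCSC-style hg38 names, for illustration only.
genome_names = {"chr1", "chr2", "chrX", "chrM"}

missing = unmatched_seqnames(gtf_seqnames(example_gtf), genome_names)
print(sorted(missing))  # prints ['1']
```

On a real file, the lines could come from an open file handle instead of a list (for a `.gz` download, `gzip.open(path, "rt")` gives you text lines). Any name that shows up as unmatched is a hint that the annotation and the genome do not share identifiers.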
I’m not sure what that annotation represents or where you sourced it, but some data providers create GTF data that are not quite in the correct format. It is also possible you have annotation with chromosome identifiers that are a mismatch for hg38. I would stick with the UCSC GTF if this analysis is new to you – nothing special needs to be done to prepare it for use with tools. But if you really want to use another source, Gencode is probably the best alternative. You’ll need to remove the header lines and then update the datatype first. Instructions are in this prior Q&A.
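For the "remove the header lines" step, the linked Q&A has the Galaxy instructions. If you would rather clean the file before uploading, the idea can be sketched in a few lines of Python. I'm assuming here that the header lines are the ones starting with `#` at the top of the file. The function name and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator


def strip_gtf_header(lines: Iterable[str]) -> Iterator[str]:
    """Yield only feature lines, dropping '#' header/comment lines and blanks."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield line


# Invented sample: two header-style lines followed by one feature line.
sample = [
    "##provider: EXAMPLE\n",
    "##format: gtf\n",
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "g1";\n',
]

cleaned = list(strip_gtf_header(sample))
print(len(cleaned))  # prints 1
```

The datatype still needs to be set to `gtf` after upload, as described above. This sketch only covers the header removal.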
Note: you may find prior Q&A that specifically states to avoid the UCSC GTF. That was about using the version from the Table Browser before UCSC started generating properly formatted GTF data available in the Downloads area for all of their genomes that had a Gene annotation track.
In my opinion, the version of hg38 reference annotation in the UCSC Downloads area is now absolutely the best choice. Especially if you choose the RefSeq Gene annotation (hg38.ncbiRefSeq.gtf.gz) – as that one is updated regularly, about monthly. Other gene tracks are fixed at prior releases. Other data sources may need reformatting and some are missing important attributes. But you can review and decide. Or maybe run distinct analysis paths with the different GTF choices and compare results. However you do this – start with a specific annotation GTF, then keep using that throughout the same analysis. Meaning: don’t switch from RefSeq to Ensembl in the middle of an analysis, or expect problems.
For this, my guess is that you are mixing up where to input a reference annotation (GTF) versus a reference genome or reference transcriptome (fasta) on the tool forms. Your analysis will include all three.
- A reference annotation will be selected from the history for your use case. Featurecounts does have annotation available for a few genomes, including hg38, but that is for when you don’t need annotation for any other steps – and it won’t match external sources – so don’t mix that in for now.
- A reference transcriptome will be selected from the history (always).
- A reference genome for your case is the already indexed built-in version of hg38. Technically you can input a reference genome from the history, too, but that only works for very small genomes and there is no need if Galaxy has already indexed it.
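If you are ever unsure which kind of file a history dataset really is, the first informative line usually gives it away: fasta records start with `>`, and GTF feature lines have nine tab-separated columns. Here is a rough sketch of that check. The function name is made up, and the nine-column test cannot tell GTF apart from other nine-column formats such as GFF3, so treat the answer as a hint only.

```python
from typing import Iterable


def guess_kind(lines: Iterable[str]) -> str:
    """Guess 'fasta', 'gtf-like', or 'unknown' from the first informative line."""
    for line in lines:
        text = line.rstrip("\r\n")
        if not text.strip() or text.startswith("#"):
            continue  # skip blanks and comment/header lines
        if text.startswith(">"):
            return "fasta"
        if len(text.split("\t")) == 9:
            return "gtf-like"
        return "unknown"
    return "unknown"


# Invented snippets standing in for the top of each file.
print(guess_kind([">chr1\n", "ACGTACGT\n"]))  # prints fasta
print(guess_kind(['chr1\tsrc\texon\t1\t9\t.\t+\t.\tgene_id "g1";\n']))  # prints gtf-like
```

In Galaxy itself, the assigned datatype on the dataset is the thing the tool forms filter on, so this is only for eyeballing a file outside of Galaxy.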
That should cover all the questions/problems you were having. Please try this out