If you review the “fasta” you chose and the gff3 dataset, you’ll notice that the sequence identifiers are not the same. The “ncrna” fasta is a subset of non-coding RNA sequences, not genome chromosomes, while the annotation is based on genome chromosomes.
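A quick way to confirm this kind of mismatch outside Galaxy is to compare the identifiers directly. The sketch below is only an illustration: the function names and the two sample records are made up for the example, and it assumes the fasta identifier is the first word after `>` and the annotation identifier is the first tab-separated column.

```python
def fasta_ids(lines):
    """Collect sequence identifiers: the first word after '>' on header lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def annotation_seqids(lines):
    """Collect column-1 identifiers from GFF3/GTF lines, skipping '#' lines."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


# Hypothetical records, for illustration only (not taken from the real files).
fasta = [">TRANSCRIPT_0001.1 ncrna example description", "ACGTACGT"]
gff3 = ["##gff-version 3", "1\tsource\tgene\t100\t200\t.\t+\t.\tID=gene:EXAMPLE"]

shared = fasta_ids(fasta) & annotation_seqids(gff3)
print(shared)
```

If the shared set is empty, as it is for these sample records, the annotation cannot be matched to the sequences used for mapping.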
Using the human genome as a custom reference genome will probably run out of memory during the mapping step on any public Galaxy server.
GRCh38 is the Ensembl version of the human genome. It is also released by UCSC as hg38. These two releases use different chromosome identifiers (for example, 1 versus chr1).
Try mapping against the natively indexed hg38 and using the built-in hg38 genome annotation available in featureCounts. The Gene IDs will be in Entrez format, but the tool annotateMyIDs can be used to convert those to Ensembl format.
If you would really prefer to use the Ensembl-sourced annotation, choose the gtf version of the data instead of the gff3 version.
This will load the data with the datatype gff (autodetected by the Upload tool) because of the header lines, so remove those first. The format of gtf data is quite different from gff3, and it is not easy to convert one to the other. Most tools work better with annotation in gtf format. featureCounts does not work with gff3.
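As a rough illustration of the header removal mentioned above, the sketch below drops header lines from gtf text. It assumes the header lines are the ones starting with `#`; the function name and the sample lines are invented for the example. Inside Galaxy you would do this with a text manipulation tool rather than a script.

```python
def strip_header_lines(lines):
    """Return only the annotation rows, dropping lines that start with '#'."""
    return [line for line in lines if not line.startswith("#")]


# Hypothetical input, for illustration only.
gtf_lines = [
    "#!example-header value",
    '1\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "EXAMPLE";',
]

for line in strip_header_lines(gtf_lines):
    print(line)
```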
Next, convert the chromosome names from Ensembl format to UCSC’s format. Use the tool Replace column by values which are defined in a converted file (Galaxy Version 0.2). See the tool form help for where to source a “convert” mapping file for the IDs.
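To show what that replacement does to the data, here is a minimal sketch of the same idea in Python. It is not how the Galaxy tool is implemented. The function name and the three-entry mapping are assumptions for the example; a real mapping file covers every sequence in the assembly. Identifiers missing from the mapping are left unchanged here, which is one possible choice; you may prefer to drop those lines instead.

```python
def rename_seqids(lines, mapping):
    """Replace the first tab-separated column using mapping; keep other text as-is."""
    renamed = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            renamed.append(line)  # pass comments and blank lines through
            continue
        seqid, sep, rest = line.partition("\t")
        renamed.append(mapping.get(seqid, seqid) + sep + rest)
    return renamed


# Small illustrative mapping from Ensembl-style to UCSC-style names.
ensembl_to_ucsc = {"1": "chr1", "X": "chrX", "MT": "chrM"}

# Hypothetical annotation rows, for illustration only.
rows = [
    '1\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "EXAMPLE_A";',
    'MT\tsource\tgene\t10\t50\t.\t-\t.\tgene_id "EXAMPLE_B";',
]

for row in rename_seqids(rows, ensembl_to_ucsc):
    print(row)
```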
Once both are done, “redetect” the datatype (pencil icon > Edit attributes forms > “Datatypes” tab). The result should be gtf if everything was done correctly. Avoid directly assigning the “datatype” to datasets whenever possible. If Galaxy cannot detect the expected datatype, then there is almost always some formatting problem that needs to be addressed.
I added some tags to your post that cover very similar Q&A. Click on any to review. This FAQ is also a useful resource: