How do you create a GTF/GFF file in Galaxy?

Sonenshine · July 4, 2024, 8:36pm

igor · July 5, 2024, 2:41am

If you are after a genome annotation, check any appropriate GTN tutorial.
It depends on what you want and what data you have. For RNA-seq, StringTie followed by StringTie merge is a good option. TransDecoder produces GTF/GFF files for a transcriptome.

Kind regards,
Igor

Sonenshine · July 5, 2024, 12:11pm

Thanks. Will try them.

Dr. Sonenshine

Sonenshine · July 5, 2024, 12:20pm

Hi Igor:

I’m using the genes to counts tutorial (under transcriptomes). I got as far as HISAT2 when I encountered the GFF/GTF issue. I uploaded a genome from history because it was not one of the small number of embedded choices. I made sure it was in fasta format. So, which tool do you recommend for converting it to a GFF/GTF file?

Thanks

Daniel

jennaj · July 5, 2024, 7:51pm

Hi @Sonenshine

Jumping in here to clarify the difference between a reference genome (fasta) file and a reference annotation (GTF/GFF/GFF3) file.

Start with this → FAQ: Extended Help for Differential Expression Analysis Tools

In short, these data represent different but related content, and cannot be created from each other.

The fasta are the nucleotide bases of your genome.
The annotation are the coordinates (on those nucleotide strings) of features.
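To make the distinction concrete, here is a minimal Python sketch using made-up toy data (the sequence name `chr1`, the source `example`, and the gene `geneA` are all hypothetical). It only illustrates that an annotation line stores coordinates, and that those coordinates are meaningful only when looked up in the matching fasta sequence. It assumes GTF coordinates are 1-based and inclusive, which is how the format is defined.

```python
# Toy data for illustration only: names and coordinates are hypothetical.
fasta_text = ">chr1\nATGCGTACGTTAGC\n"
gtf_line = 'chr1\texample\tgene\t4\t9\t.\t+\t.\tgene_id "geneA";'


def read_fasta(text):
    """Parse fasta text into a dict of {sequence name: bases}."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {key: "".join(parts) for key, parts in seqs.items()}


def parse_gtf_line(line):
    """Pull the coordinate fields out of one tab-separated GTF line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "seqname": fields[0],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
    }


genome = read_fasta(fasta_text)
feature = parse_gtf_line(gtf_line)

# Convert the 1-based inclusive GTF coordinates to a Python slice.
feature_seq = genome[feature["seqname"]][feature["start"] - 1:feature["end"]]
print(feature["feature"], feature_seq)
```

The annotation line contains no bases and the fasta contains no feature boundaries, which is why neither file can be converted into the other.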

Some tools can create annotation based on the inputs – those are what @igor was specifying. This is a de-novo method.

For known annotation, the usual source for a reference annotation file is the same place where the reference genome was sourced. It is a good idea to get both at the same time (along with a reference transcriptome, if you are using one), then make sure all the format/content is correct, and only then start the analysis project.
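One content check that is commonly worth doing (this is an assumption about what "correct" covers, not an exhaustive list) is confirming that the sequence names in the first column of the annotation also appear as identifiers in the fasta. A rough Python sketch on hypothetical toy data, where the function names are made up for illustration:

```python
def fasta_names(fasta_text):
    """Return the set of sequence identifiers (first word after '>')."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def annotation_seqnames(gtf_text):
    """Return the set of names in column 1 of a GTF/GFF text."""
    names = set()
    for line in gtf_text.splitlines():
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t")[0])
    return names


def names_missing_from_fasta(fasta_text, gtf_text):
    """List annotation sequence names that have no matching fasta record."""
    return sorted(annotation_seqnames(gtf_text) - fasta_names(fasta_text))


# Hypothetical toy inputs: the second annotation line uses "2" instead of "chr2".
toy_fasta = ">chr1 toy description\nACGTACGT\n>chr2\nGGCCGGCC\n"
toy_gtf = (
    "# toy annotation\n"
    'chr1\texample\tgene\t1\t4\t.\t+\t.\tgene_id "geneA";\n'
    '2\texample\tgene\t2\t6\t.\t-\t.\tgene_id "geneB";\n'
)

missing = names_missing_from_fasta(toy_fasta, toy_gtf)
print(missing)
```

Any name reported here points to features that cannot be placed on the genome as loaded, which usually means the two files came from different sources or use different naming schemes.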

Hope this helps but if you need more help, maybe start by explaining exactly where you sourced the reference genome. URL or similar is usually best. Then explain whether you are attempting de-novo, or known, or a hybrid annotation path.

Sonenshine · July 7, 2024, 4:31pm

Hi Jennifer:

I greatly appreciate all this help and will try to follow up with these suggestions in the Galaxy Tool (under transcriptomes) “Reads to counts”.

For your information, I am trying to compare two endosymbiotic bacteria, species of the genus Rickettsia, using the first tool, reads to counts, in order to progress to the second tool, differential expression with limma-voom.

To do this, I selected the first species, Rickettsia conorii, from the list of embedded genomes that can be found when you use upload to import something. However, after progressing through the series of steps (e.g., Cutadapt, FastQC, MultiQC, etc.) to HISAT2, the second genome, Rickettsia rickettsii RML, is not available. All that shows up from the embedded list is a small subset of the upload list, and HISAT2 will not run. The tool also asks for a GFF or GTF file. So, I will try to create one using your suggestions.

I like one of your suggestions which is to load both at the beginning, during the initial upload. I will try that.

Does all this make sense to you?

Thanks

Daniel

Dr. Daniel E. Sonenshine

Guest Researcher,

NIAID, NIH

jennaj · July 8, 2024, 8:20pm

Hi @Sonenshine

For this part

If you have two different species, then these tools will not work for you unless you are mapping the reads from both species to the exact same reference genome (along with the same reference annotation).

And, about this part

Correct. Sometimes there is a placeholder for a reference genome in the drop-down, but the full genome is not actually installed as a fasta index, nor as a tool-specific index, e.g. for mapping tools. There are some legacy reasons for this, and the placeholder can be safely ignored.

Instead, load up your custom genome fasta, create a novel database name, and let tools index it for you at runtime. How to → FAQ: How to use Custom Reference Genomes?

You will know more about your genomes than we will, but in general: if you are able to locate an appropriate common reference genome for both sets of reads, that is what could be loaded up to Galaxy and used. You could do transcript discovery to capture novel content from read samples. I’m not sure if that will produce what you want … but what I am describing are the same usage guidelines that would apply when using the same Bioconductor tools outside of Galaxy. Meaning, you need a common baseline genome and annotation, then where the sample reads map (and in what abundances) is the variable part of the experiment.

Hopefully this helps!

Sonenshine · July 9, 2024, 11:58am

Hi Jenna:

Yes, helps a lot. So, I loaded both the sample and the reference genome (from the drop down list) using fetch, start and close. I will work through the tools and see whether HISAT2 will accept it.
Thanks

Daniel
