How do you create a GTF/GFF file in Galaxy?

Hi @Sonenshine

If you are after a genome annotation, check any appropriate GTN tutorial.
It depends on what you want and what data you have. For RNA-seq, StringTie followed by StringTie merge is a good option. TransDecoder produces GTF/GFF files for a transcriptome.

Kind regards,
Igor

Thanks. Will try them.

Dr. Sonenshine

Hi Igor:

I’m using the genes to counts tutorial (under transcriptomes). I got as far as HISAT2 when I encountered the GFF/GTF issue. I uploaded a genome from history because it was not one of the small number of embedded choices. I made sure it was in fasta format. So, which tool do you recommend to create the conversion to a GFF/GTF file?

Thanks

Daniel

Hi @Sonenshine

Jumping in here to clarify the difference between a reference genome (fasta) file and a reference annotation (GTF/GFF/GFF3) file.

Start with this → FAQ: Extended Help for Differential Expression Analysis Tools

In short, these data represent different but related content, and cannot be created from each other.

  • The fasta contains the nucleotide bases of your genome.
  • The annotation contains the coordinates (on those nucleotide strings) of features.
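
To make the relationship concrete, here is a minimal Python sketch. The sequence, the gene, and the helper names (`parse_fasta`, `feature_sequence`) are all made up for illustration and are not part of any Galaxy tool. It only shows that a GTF line points at a span of the fasta sequence and carries no bases itself.

```python
# Toy data, invented for illustration only.
FASTA_TEXT = """>chr1
ACGTACGTAAGGCCTT
"""
# GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes
GTF_LINE = 'chr1\texample\tgene\t5\t10\t.\t+\t.\tgene_id "geneA";'


def parse_fasta(text):
    """Return {sequence_name: sequence} from FASTA-formatted text."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the identifier, i.e. the text before the first space.
            name = line[1:].split()[0]
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences


def feature_sequence(sequences, gtf_line):
    """Slice the bases that one GTF line points at (strand is ignored here)."""
    fields = gtf_line.split("\t")
    seqname, start, end = fields[0], int(fields[3]), int(fields[4])
    # GTF coordinates are 1-based and inclusive at both ends.
    return sequences[seqname][start - 1:end]


genome = parse_fasta(FASTA_TEXT)
gene_seq = feature_sequence(genome, GTF_LINE)
print(gene_seq)
```

The annotation is only meaningful together with the exact fasta it was built against, which is why neither file can be converted into the other.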

Some tools can create annotation based on the inputs – those are what @igor was specifying. This is a de-novo method.

For known annotation, the usual source for a reference annotation file is the same place where the reference genome was sourced. It is a good idea to get both at the same time (along with a reference transcriptome, if you are using one). Then make sure all the format/content is correct before finally starting the analysis project. :slight_smile:
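
One content check that is easy to sketch is whether the sequence names in the annotation also appear in the genome fasta. The Python below is a rough sketch under assumptions: the toy lines and the function names (`fasta_names`, `gtf_seqnames`, `missing_from_genome`) are hypothetical, and a real check would read your own files instead.

```python
def fasta_names(lines):
    """Collect sequence identifiers from FASTA header lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:  # skip a bare ">" with no identifier
                names.add(parts[0])
    return names


def gtf_seqnames(lines):
    """Collect the first column of every non-comment GTF/GFF line."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t")[0])
    return names


def missing_from_genome(fasta_lines, gtf_lines):
    """Return annotation sequence names that have no matching FASTA entry."""
    return sorted(gtf_seqnames(gtf_lines) - fasta_names(fasta_lines))


# Toy inputs, invented for illustration only.
toy_fasta = [">chr1 example description", "ACGTACGT", ">plasmid1", "GGCCTTAA"]
toy_gtf = [
    "# comment line",
    'chr1\texample\tgene\t1\t8\t.\t+\t.\tgene_id "geneA";',
    'chrP\texample\tgene\t1\t8\t.\t+\t.\tgene_id "geneB";',
]

missing = missing_from_genome(toy_fasta, toy_gtf)
print(missing)
```

Any name reported here means the annotation refers to a sequence the genome file does not contain under that name, so features on it cannot be matched to mapped reads.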

Hope this helps but if you need more help, maybe start by explaining exactly where you sourced the reference genome. URL or similar is usually best. Then explain whether you are attempting de-novo, or known, or a hybrid annotation path.

Hi Jennifer:

I greatly appreciate all this help and will try to follow up with these suggestions in the Galaxy Tool (under transcriptomes) “Reads to counts”.

For your information, I am trying to compare two endosymbiotic bacteria, species of the genus Rickettsia, using the first tool, reads to counts, in order to progress to the second tool, differential expression with limma-voom.

To do this, I selected the first species, Rickettsia conorii, from the list of embedded genomes that can be found when you use upload to import something. However, after progressing through the series of steps (e.g., cutadapt, fastqc, multiqc, etc.) to HISAT2, the second genome, Rickettsia rickettsii RML, is not available. All that shows up from the embedded list is a small subset of the upload list, and HISAT2 will not run. The tool also asks for a GFF or GTF file. So, I will try to create one using your suggestions.

I like one of your suggestions which is to load both at the beginning, during the initial upload. I will try that.

Does all this make sense to you?

Thanks

Daniel

Dr. Daniel E. Sonenshine

Guest Researcher,

NIAID, NIH

Hi @Sonenshine

For this part

if you have two different species, then these tools will not work for you unless you are mapping both sets of species reads to the exact same reference genome (along with the same reference annotation).

And, about this part

Correct. Sometimes there is a placeholder for a reference genome in the drop-down, but the full genome is not actually installed as a fasta index, nor as a tool-specific index, e.g. for mapping tools. There are some legacy reasons for this, and the placeholder can be safely ignored.

Instead, load up your custom genome fasta, create a novel database name, and let tools index it for you at runtime. How to → FAQ: How to use Custom Reference Genomes?

You will know more about your genomes than we will but in general: if you are able to locate an appropriate common reference genome for both sets of reads, that is what could be loaded up to Galaxy and used. You could do transcript discovery to capture novel content from read samples. I’m not sure if that will produce what you want … but what I am describing are the same usage guidelines that using the same Bioconductor tools outside of Galaxy would require. Meaning, you need a common baseline genome and annotation, then where the sample reads map (and in what abundances) is the variable part of the experiment.

Hopefully this helps!

Hi Jenna:

Yes, helps a lot. So, I loaded both the sample and the reference genome (from the drop down list) using fetch, start and close. I will work through the tools and see whether HISAT2 will accept it.
thanks

Daniel
