Differential Expression from Salmon Quant outputs without GTF/GFF3 annotation file

Hi!
I’m trying to find differentially expressed genes from RNA-seq data in a species for which I only have a transcriptome. I have been able to map my RNA-seq data to the transcriptome using both Salmon and Kallisto quant, but I am unable to use DESeq2 on the mapped data without a GTF/GFF3 annotation file.

I’d prefer to use DESeq2 to generate differential expression data because I’ve already used it within the same project for a different species, but any ideas for how to proceed with the mapping data would be appreciated.

Hi,

A tabular mapping file can be used instead of a GTF with DESeq2.

The place where you sourced the transcriptome fasta might have a transcript-to-gene mapping file even if there is no genome sequence available. Sometimes the identifiers in a transcriptome also encode predicted gene information, which could be parsed out into a tabular format.

If absolutely not available, you could start over and try assembling your own transcriptome. Both Trinity and rnaSPAdes output predicted gene information.
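If the transcript IDs follow the Trinity naming pattern, where an isoform suffix such as _i1 is appended to the gene ID, the gene ID can be recovered by trimming that suffix. A minimal sketch, assuming that pattern holds for your file (the example IDs and function names are made up for illustration):

```python
import re

# Assumption: Trinity-style IDs, where the isoform suffix "_i<N>"
# follows the gene ID, e.g. TRINITY_DN1000_c115_g5_i1.
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def trinity_gene_id(transcript_id: str) -> str:
    """Return the gene ID implied by a Trinity-style transcript ID."""
    return _ISOFORM_SUFFIX.sub("", transcript_id)


def mapping_rows(transcript_ids):
    """Yield 'transcript<TAB>gene' lines for a two-column tabular file."""
    for tx in transcript_ids:
        yield f"{tx}\t{trinity_gene_id(tx)}"


if __name__ == "__main__":
    example_ids = ["TRINITY_DN1000_c115_g5_i1", "TRINITY_DN1000_c115_g5_i2"]
    for row in mapping_rows(example_ids):
        print(row)
```

IDs without the suffix pass through unchanged, so it is worth spot-checking the output before using it.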

Thanks,

How would I parse the predicted gene info into a tabular format? The genes in my transcriptome have gene IDs, but when I try running DESeq2 it removes the actual IDs and displays the actual sequence instead.


Hi @Kat_Schmidt

I’m not sure where you are entering the transcriptome fasta. There isn’t a place for the entire file on the DESeq2 tool form.

The input the tool is expecting would be a two column file with a datatype of “tabular”.

  • The first column has the transcript IDs (these must exactly match the content before the first whitespace on the transcriptome fasta’s > title lines).
  • The second column has the associated gene ID. Can this be parsed out from those > lines, or is that data from a different source?
  • GeneIDs should be unique in the first column. If you have duplicated gene IDs, then you need to do one or both of these. Since you don’t already have that tabular file, I’m guessing both are needed in your case.
    • Rerun Salmon with the same two-column tabular file entered for File containing a mapping of transcripts to genes. This creates the additional output for DESeq2.
    • Make sure to input the quant.genes.sf file to DESeq2.
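As a rough illustration of building that two-column file from the fasta > lines, here is a small Python sketch. It assumes a hypothetical header layout like ">tx001 gene=G1 len=1500"; the gene= tag and the function names are assumptions, so the pattern would need adjusting to whatever your headers actually contain.

```python
import re

# Hypothetical header layout: ">tx001 gene=G1 len=1500".
# Adjust GENE_TAG to match how gene IDs appear in your own file.
GENE_TAG = re.compile(r"\bgene=(\S+)")


def header_to_row(line):
    """Turn one FASTA '>' line into (transcript_id, gene_id).

    Returns None for sequence lines or headers without a gene tag.
    """
    if not line.startswith(">"):
        return None
    parts = line[1:].strip().split(None, 1)
    if not parts:
        return None
    # Transcript ID is everything before the first whitespace.
    transcript_id = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    match = GENE_TAG.search(description)
    return (transcript_id, match.group(1)) if match else None


def fasta_to_tx2gene(lines):
    """Collect tab-separated 'transcript<TAB>gene' rows from FASTA lines."""
    rows = []
    for line in lines:
        row = header_to_row(line)
        if row is not None:
            rows.append("\t".join(row))
    return rows


if __name__ == "__main__":
    example = [">tx001 gene=G1 len=1500", "ACGTACGT", ">tx002 gene=G1 len=900", "TTGCA"]
    print("\n".join(fasta_to_tx2gene(example)))
```

The printed rows can be saved and uploaded with the datatype set to tabular.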

Tabular file with Transcript-ID to Gene-ID mapping

If this file is provided, Salmon will output both quant.sf and quant.genes.sf files, where the latter contains aggregated gene-level abundance estimates. The transcript-to-gene mapping should be provided as either a GTF file or in a simple tab-delimited format, where each line contains the name of a transcript and the gene to which it belongs, separated by a tab.
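Before rerunning Salmon, it may help to sanity-check the tab-delimited file against that description. The sketch below is only an illustrative check (the function name is made up, and it is not part of Salmon or Galaxy): it flags lines that do not have exactly two non-empty tab-separated fields, and transcript IDs that appear more than once.

```python
from collections import Counter


def check_tx2gene(lines):
    """Return a list of problems found in tab-delimited transcript-to-gene lines."""
    problems = []
    seen = Counter()
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Each line should hold exactly: transcript name, TAB, gene name.
        if len(fields) != 2 or not all(f.strip() for f in fields):
            problems.append(f"line {number}: expected 2 tab-separated fields")
            continue
        seen[fields[0]] += 1
    # A transcript listed twice would map ambiguously.
    problems += [f"duplicate transcript ID: {tx}" for tx, n in seen.items() if n > 1]
    return problems


if __name__ == "__main__":
    for problem in check_tx2gene(["tx1\tG1", "tx2 G1", "tx1\tG2"]):
        print(problem)
```

An empty result means the file at least has the expected shape; it does not confirm that the IDs match the fasta.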

This tutorial explains common data manipulations: data-manipulation-olympics.

If you need more help after trying that, please post back a few of the > lines from your transcript fasta file and whatever extra file you have with geneIDs (if not in the fasta > lines). Please quote the content so it renders correctly. You could also post back a share link to your history (#sharing-your-history) and note which dataset(s) this content is in. Please leave your attempts at manipulation undeleted.