Salmon quant did not work using my reference transcriptome

lida-soltanii · November 28, 2024, 1:44pm

Hi everyone

I have a problem with the Salmon quant tool.

I am analyzing RNA-seq data using Salmon quant to get TPM values. I used the cDNA file as the reference transcriptome for my bacterium (Sinorhizobium meliloti strain 1021), but Salmon quant does not work (378 and 400 in my shared history). I tried Salmon quant with another cDNA file and it worked (356 in my shared history), but it did not work with my bacterial cDNA. Could you please tell me what the problem is with this file?

I have attached the link to the bacterial cDNA in the Ensembl dataset:

https://ftp.ensemblgenomes.ebi.ac.uk/pub/release-60/bacteria//fasta/bacteria_0_collection/sinorhizobium_meliloti_1021_gca_000006965/cdna/

Can I share my history URL here? Is it safe enough to share publicly?

jennaj · December 2, 2024, 5:43pm

Welcome @lida-soltanii

Yes, we’ll need to see all of the data in place inside your history to offer specific advice. You can post the share link back here, then unshare once we are done.

This guide includes most of the technical details that we’ll be helping to review.

FAQ: Extended Help for Differential Expression Analysis Tools

Some guesses: You mention the reference transcriptome, but not the reference annotation. You will want to use both at the Salmon step if the goal is to run a tool like DESeq2 after. The features in the annotation will have common identifiers with the transcriptome fasta – so be sure to check that is true and simplify the fasta > title lines as needed.
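As a rough illustration of that identifier check, here is a minimal Python sketch. It is not what any Galaxy tool runs internally. It assumes the transcript identifier is the first whitespace-delimited token on each `>` title line, and that the transcript-to-gene mapping is tab-separated with the transcript ID in the first column. The function names and example records are hypothetical.

```python
def simplify_fasta_headers(lines):
    """Yield FASTA lines with each '>' title line cut down to its first token."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            yield ">" + (tokens[0] if tokens else "")
        else:
            yield line


def fasta_ids(lines):
    """Return the identifier (first token after '>') of every record, in order."""
    return [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].split()
    ]


def missing_from_mapping(fasta_lines, mapping_lines):
    """Return transcript IDs present in the FASTA but absent from the mapping.

    Assumption: the mapping is tab-separated, transcript ID in column 1.
    """
    mapped = {row.split("\t")[0].strip() for row in mapping_lines if row.strip()}
    return [ident for ident in fasta_ids(fasta_lines) if ident not in mapped]


# Hypothetical example data, not taken from the shared history.
fasta = [">tx1 cdna gene:geneA", "ATGC", ">tx2 cdna gene:geneB", "GGCC"]
mapping = ["tx1\tgeneA"]

print(list(simplify_fasta_headers(fasta)))
print(missing_from_mapping(fasta, mapping))
```

Any ID reported by `missing_from_mapping` is a transcript that the annotation does not know about under that exact name.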

Also, most people do not need to include the reference genome at this stage. But you can share what you have and explain a bit more about your goals as we walk through some suggestions.

jennaj · December 9, 2024, 6:27pm

Hi @lida-soltanii

Thanks for sharing your history, this made it so much easier to help with exactly what is going wrong!

This is your message from the tool in the job logs (find these logs using the “i” icon inside of a dataset).

The tool is stating that it found two or more transcripts with the same sequence identifier. You should extract all the identifiers and count them up to find the duplicates. Then make adjustments. Don’t forget to update your transcripts-to-genes mapping data as well, or you will run into more problems with downstream steps.
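If you prefer to do the "extract and count" step outside of Galaxy, a minimal Python sketch could look like the one below. It assumes the identifier is everything between `>` and the first whitespace on a title line. The function name and the example records are hypothetical.

```python
from collections import Counter


def duplicate_ids(fasta_lines):
    """Return {identifier: count} for identifiers that occur more than once."""
    counts = Counter(
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    )
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical records: tx1 appears twice with different descriptions.
example = [
    ">tx1 first copy", "ATGC",
    ">tx2", "GGCC",
    ">tx1 second copy", "ATGA",
]
print(duplicate_ids(example))
```

To run it on a real file, pass the open file handle instead of the list, for example `duplicate_ids(open("transcripts.fa"))`, where the file name is a placeholder.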

As a reference, this FAQ has a bit more about the format of fasta files. → Datatypes - Galaxy Community Hub

Then this recent post has more about Salmon in general.

Using the salmon tool I´ve got UCSC ids instead of gene_ids

The best advice I have is to get all of your reference data organized at the very start. The UseGalaxy servers will host the genome indexed, but you’ll need to supply the two other files, and UCSC hosts all of this data.

Start here → Correct reference transcriptome for Salmon quant on existing RNASTAR alignments - #2 by jennaj

Then this post has an example where I loaded all the file choices from UCSC, reformatted, and tagged the “matched files”. Correct reference transcriptome for Salmon quant on existing RNASTAR alignments - #4 by jennaj

Suggested data formatting for these tools. FAQ: Extended Help for Differential Expression Analysis Tools

More is under reference-transcriptome and reference-genome and reference-annotation

Human data has extra tips in this guide Reference genomes at public Galaxy servers: GRCh38/hg38 example

What to do from here

- Double check that you do not have any sequences in your transcriptome that have the same name: the tool thinks that you have at least one duplicate, so at a minimum that needs to be solved.
- Consider incorporating reference annotation at the Salmon step. You will need that “transcript-to-gene” mapping file when using DESeq2 later anyway. Both forms have details about what the data is and how it is formatted, and we have prior Q&A about it, but please ask more questions if you get stuck.
- You have already been manipulating your fasta file to create the hybrid transcriptome. If that wasn’t done in Galaxy, you can do it in Galaxy, too! Converting to a tabular format, making changes, then converting back to fasta format is a pretty common way to do this. Your GTF or tabular transcripts-to-gene data is already tabular.
- Tutorials → Hands-on: Data Manipulation Olympics / Data Manipulation Olympics / Introduction to Galaxy Analyses
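For anyone who wants to see the fasta → tabular → fasta round trip spelled out, here is a minimal Python sketch of the idea. It is only an illustration, not how the Galaxy tools are implemented. The "change" it makes in the middle (keeping the first record for each identifier and dropping later ones) is just one possible adjustment and is an assumption on my part. If the duplicated names carry different sequences, renaming them and updating the transcript-to-gene mapping may be the better fix. All function names are hypothetical.

```python
def fasta_to_rows(lines):
    """Convert FASTA lines into [title, sequence] rows, one row per record."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            rows.append([line[1:], ""])
        elif rows:
            # Sequence lines are joined so each record sits on one row.
            rows[-1][1] += line
    return rows


def keep_first_per_id(rows):
    """Keep the first row per identifier and shorten titles to the identifier."""
    seen = set()
    kept = []
    for title, seq in rows:
        parts = title.split()
        ident = parts[0] if parts else ""
        if ident in seen:
            continue
        seen.add(ident)
        kept.append([ident, seq])
    return kept


def rows_to_fasta(rows, width=60):
    """Convert [title, sequence] rows back into FASTA lines, wrapped at `width`."""
    out = []
    for title, seq in rows:
        out.append(">" + title)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out


# Hypothetical records: tx1 occurs twice.
fasta = [">tx1 desc", "ATG", "CCC", ">tx2", "GG", ">tx1 other", "TTT"]
rows = fasta_to_rows(fasta)
print(rows_to_fasta(keep_first_per_id(rows)))
```

Whatever adjustment you choose in the middle step, run the same identifiers through your transcript-to-gene file so both stay in sync.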

Hope this helps! Let us know if you get this working or have more questions.