Generating tx2gene map and filtering for rnaSPAdes (de novo)

I am performing de novo transcriptome analysis on a halophyte plant (no reference genome/GTF) using rnaSPAdes and Salmon.

I need help with two specific steps in Galaxy:

  1. Gene-to-Transcript Map: How do I generate a tx2gene file from rnaSPAdes headers for use in tximport/DESeq2?

  2. Filtering: What tool can filter low-expression transcripts (TPM/FPKM) for rnaSPAdes assemblies, since Trinity-specific scripts are incompatible with the header format?

If anyone has a workflow or a specific tool suggestion for non-Trinity de novo data, I’d appreciate the help!

Thanks!

Welcome @Chamara_Lakshitha

For this question:

In the absence of a reference genome or any other annotation, I can think of two choices.

Transcript-level differential expression analysis

rnaSPAdes doesn’t perform a gene clustering step, so you can get around that by using each transcript as its own “gene”.

The idea is to put the assembly’s transcript identifiers into a two-column file where both columns hold the same transcript identifier, one transcript per line.

Steps

  • Convert FASTA → tabular (extracts identifiers and sequence)
  • Isolate the transcript identifier (first column) with a tool like Cut
  • Make a copy of that single column dataset with Copy Datasets (gear icon in history panel)
  • Use Join two files to join the single-column dataset and its copy on the first column
  • The result is a tabular two-column file with transcript identifiers that match the identifiers in your assembly → this is your tx2gene
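
If you want to check the Galaxy result, or build the same file outside Galaxy, the steps above come down to reading each FASTA header and writing its identifier twice. Here is a minimal Python sketch of that idea. The two FASTA records are made-up placeholders, not real rnaSPAdes output, and the sketch assumes the transcript name is the header text up to the first whitespace. That should match how Salmon names transcripts by default, but please confirm against the Name column of your own quant.sf.

```python
import io


def identity_tx2gene(fasta_lines):
    """Yield (transcript_id, transcript_id) for every FASTA header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the identifier is the header up to the first whitespace.
            tx_id = line[1:].split()[0]
            yield tx_id, tx_id


# Made-up example records standing in for an assembly FASTA file.
example_fasta = io.StringIO(
    ">NODE_1_length_2500_cov_35.2\nACGTACGT\n"
    ">NODE_2_length_1800_cov_12.0 extra description\nGGCCGGCC\n"
)

rows = list(identity_tx2gene(example_fasta))

# Tab-separated, no header line: column 1 = transcript, column 2 = "gene".
tx2gene_text = "".join(f"{tx}\t{gene}\n" for tx, gene in rows)
print(tx2gene_text, end="")
```

To run it on a real assembly, replace `example_fasta` with an open file handle and write `tx2gene_text` to a `.tabular` or `.tsv` file before uploading it to Galaxy.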

Infer transcript groupings for “pseudo-genes” differential expression analysis

We don’t have a Galaxy-specific protocol for this tool, but any protocol applied outside of Galaxy could also be performed in Galaxy.

Steps

  • cd-hit then Format cd-hit outputs
  • Convert the clustering output to the simple two-column format (transcript identifier, then cluster identifier), in the same way as in the first example above
  • Use this for the tx2gene input
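
To make the conversion step concrete, here is a rough Python sketch that turns a raw CD-HIT `.clstr` listing into transcript/cluster pairs. It assumes the usual `.clstr` layout (a `>Cluster N` line followed by member lines that contain `>identifier...`). The example text and the `cluster_` prefix are invented for illustration, and this is not what the Galaxy Format cd-hit outputs tool does internally. One thing to check yourself: CD-HIT can truncate long names in the `.clstr` file depending on its description-length setting (`-d`), so make sure the identifiers you get back match the assembly exactly.

```python
import re

# Matches the identifier between ">" and the trailing "..." on a member line.
MEMBER_ID = re.compile(r">(.+?)\.\.\.")


def clstr_to_tx2gene(clstr_lines, prefix="cluster_"):
    """Return a list of (transcript_id, cluster_id) pairs from .clstr lines."""
    pairs = []
    cluster_id = None
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line such as ">Cluster 12" starts a new cluster.
            cluster_id = prefix + line.split()[1]
        else:
            match = MEMBER_ID.search(line)
            if match and cluster_id is not None:
                pairs.append((match.group(1), cluster_id))
    return pairs


# Invented example in the .clstr layout.
example_clstr = """\
>Cluster 0
0\t2500nt, >NODE_1... *
1\t2400nt, >NODE_7... at +/99.10%
>Cluster 1
0\t1800nt, >NODE_2... *
""".splitlines()

pairs = clstr_to_tx2gene(example_clstr)
for tx, gene in pairs:
    print(f"{tx}\t{gene}")
```

Every transcript in the Salmon index should appear exactly once in the first column, so it is worth comparing the line count with the number of sequences in the assembly.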

Then for this question

Do you mean how to parse the raw Salmon result?

If yes, you can use one of the text manipulation filter tools to refine which transcripts to retain for the clustering or other downstream steps. There are too many to list here, but try a tool search with the keyword “filter” and you’ll see the “filter a tabular dataset” choices – simple or with regular expressions.
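
As an illustration of the kind of filter meant here, this small Python sketch keeps the transcripts whose TPM in one Salmon `quant.sf` table is at or above a cutoff. It relies on the standard `quant.sf` column names (Name, Length, EffectiveLength, TPM, NumReads). The rows are made up and the cutoff of 1.0 is only a placeholder, not a recommendation. With several samples you would normally decide how many samples must pass the cutoff before keeping a transcript.

```python
import csv
import io


def filter_quant_by_tpm(quant_lines, min_tpm=1.0):
    """Return the transcript names whose TPM is >= min_tpm."""
    reader = csv.DictReader(quant_lines, delimiter="\t")
    return [row["Name"] for row in reader if float(row["TPM"]) >= min_tpm]


# Made-up quant.sf content for a single sample.
example_quant = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "NODE_1\t2500\t2300.0\t12.5\t410.0\n"
    "NODE_2\t1800\t1600.0\t0.2\t5.0\n"
    "NODE_7\t2400\t2200.0\t1.0\t30.0\n"
)

kept = filter_quant_by_tpm(example_quant, min_tpm=1.0)
print(kept)
```

In Galaxy, the equivalent with the Filter tool would be a condition on the TPM column (column 4 in this layout) with the header line skipped.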

More about Text Manipulation tools.

We hope this helps but please let us know if it actually does. Follow up questions are welcome! :slight_smile:

Dear @jennaj
Thank you very much for your invaluable guidance and for taking the time to explain the available options so clearly. It was very helpful.

Following your suggestion, I first ran DESeq2 using a transcript-to-gene map where both columns contained the same transcript IDs generated by Salmon (i.e., treating each transcript as an independent unit). This analysis was completed successfully, and the resulting volcano plot showed a regular and expected shape.

However, I wanted to further refine the analysis and reduce redundancy. Therefore, I clustered isoforms using CD-HIT and generated a new transcript-to-gene map, where the cluster IDs were used as the gene IDs and the corresponding transcript IDs were mapped to each cluster. Using this cluster-based tx2gene file, I re-ran DESeq2. While the workflow ran without errors, the resulting volcano plot did not show the expected pattern.

At this stage, I am exploring ways to further reduce redundancy and noise in order to improve the biological interpretability of the results from my de novo RNA-seq analysis. If you have any additional guidance or best practices on refining this approach (e.g., clustering strategies, filtering thresholds, or alternative summarization methods), I would be very grateful for your advice.

Thank you again for your support.

Hi @Chamara_Lakshitha

I’ll try to help a tiny bit more with the big disclaimer that we cannot help with data interpretation at this forum. To get feedback from more scientists, you should review the discussions at forums such as Biostars.org. Many have tried to replace Trinity’s functionality! Then, once you know what you want to do and are unclear about how to perform the step in Galaxy, we can jump in to help you again.

With that context:

The data in your second graphic appear to be over-clustered at a high level, but in the details they may even be under-clustered at the same time! We can’t tell from the graphic alone.

For this one, I may spot a protocol issue

Did you also use your new transcript-to-gene file with Salmon, to recalculate the expression values in the presence of the clustering? If not, I would try that next, mostly to see what happens, but not as a final result.
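
To picture what “recalculating in the presence of the clustering” means, here is a simplified Python sketch that sums per-transcript TPM and read counts within each cluster. This is only the core idea. Tools such as tximport also account for transcript lengths when they summarize, so treat this as a sanity check rather than a replacement for them. All identifiers and numbers below are made up.

```python
from collections import defaultdict


def summarize_to_clusters(quant_rows, tx2gene):
    """Sum (TPM, NumReads) of transcripts that share a cluster ID."""
    totals = defaultdict(lambda: {"TPM": 0.0, "NumReads": 0.0})
    for name, tpm, num_reads in quant_rows:
        # Transcripts missing from the map are kept as their own "gene".
        gene = tx2gene.get(name, name)
        totals[gene]["TPM"] += tpm
        totals[gene]["NumReads"] += num_reads
    return dict(totals)


# Made-up per-transcript values: (Name, TPM, NumReads).
quant_rows = [
    ("NODE_1", 10.0, 100.0),
    ("NODE_7", 5.0, 50.0),
    ("NODE_2", 2.0, 20.0),
    ("NODE_9", 1.0, 8.0),
]
tx2gene = {"NODE_1": "cluster_0", "NODE_7": "cluster_0", "NODE_2": "cluster_1"}

cluster_totals = summarize_to_clusters(quant_rows, tx2gene)
for gene, values in cluster_totals.items():
    print(gene, values["TPM"], values["NumReads"])
```

If a handful of clusters end up holding a large share of the total reads, that can be one sign that unrelated transcripts were merged.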

Big picture considering both

I think you’ll also need to review the clustering parameters more closely. If someone else has clustered your species or a similar species before (in a publication, or a topic at a forum somewhere), that can help to guide you and would be important to review. Clustering will likely take several runs until you think the results are meaningful!

Developing a few “truth sets” of transcripts that you know should all be merged into the same genes (to check for under-clustering), and others that should not be merged (to check for over-clustering), is probably one of your first goals. Cross-species genome mappings and other types of annotation can be useful additions to layer in. Since your species is a plant, polyploidy-related duplications seem important to know about.
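
A truth-set check can be as simple as the Python sketch below. Given a tx2gene mapping, it reports groups that should share a gene but were split (under-clustered) and pairs that should be separate but were merged (over-clustered). The function name, identifiers, and truth sets are all hypothetical. In practice the truth sets would come from your own annotation or homology evidence.

```python
def check_truth_sets(tx2gene, should_merge, should_split):
    """Return (under_clustered_groups, over_clustered_pairs)."""
    under = []
    for group in should_merge:
        # A group is under-clustered if its members map to more than one gene.
        genes = {tx2gene[tx] for tx in group}
        if len(genes) > 1:
            under.append(sorted(group))
    over = []
    for tx_a, tx_b in should_split:
        # A pair is over-clustered if both members map to the same gene.
        if tx2gene[tx_a] == tx2gene[tx_b]:
            over.append((tx_a, tx_b))
    return under, over


# Hypothetical clustering result and truth sets.
tx2gene = {
    "tx1": "cluster_0",
    "tx2": "cluster_0",
    "tx3": "cluster_1",
    "tx4": "cluster_1",
}
should_merge = [{"tx1", "tx2"}, {"tx2", "tx3"}]
should_split = [("tx3", "tx4"), ("tx1", "tx4")]

under, over = check_truth_sets(tx2gene, should_merge, should_split)
print("under-clustered groups:", under)
print("over-clustered pairs:", over)
```

Running the same check after each clustering run gives a quick way to compare parameter settings against each other.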

So, there isn’t an easy answer! What I can state is that if the transcript clustering changes, you’ll probably want to recalculate the expression values using the updated clustering.

I hope this helps! :slight_smile:


:light_bulb: Bonus: You could put the core tools into a workflow, so you can cluster your data interactively, then push the results into a simple workflow to process the rest without too much clicking. This doesn’t mean starting completely over; instead, you can extract what you already did into a workflow (once you have the tools in the order that you want to use them for future rounds).

Workflows - Extract, then edit, then run the reusable pipeline