Generate gene to transcript map: WARNING output

Hi,
I am trying to use the Generate gene to transcript map for Trinity assembly tool (Galaxy Version 2.8.4), and I end up with this warning.
image
Can anyone tell me what the problem is and how to fix it?

Thank you

The tool is specifically having trouble parsing the “>” fasta assembly title lines.

This is an example of the type of “>” title lines this particular tool is designed to parse: https://github.com/galaxyproject/tools-iuc/blob/master/tools/trinity/test-data/raw/Trinity.fasta

If you ran Trinity successfully in Galaxy the format will be Ok by default. If you ran Trinity someplace else and uploaded the result, compare your title lines (transcript IDs) with the example.

Assemblies generated by other methods would not be appropriate inputs for this tool. That said, it could still be possible to parse out transcript identifiers and genes using other tools (general Text Manipulation tools) – it depends on the content of your data.

Note: This tool wasn’t always available in Galaxy. This is prior Q&A from about a year ago when this specific tool’s functionality was recreated with a simple workflow. It could be used now as an example for custom parsing (IF your title lines have transcript nomenclature that encodes some type of transcript-to-gene relationship/grouping):

If you still need help after reviewing:

Post back 10-20 of your assembly “>” title lines for some help in figuring out how to parse the data or some help in determining if this type of parsing is even possible with your given data’s content. Please don’t include the sequences, just the “>” title lines – and enough lines that the data is representative of the whole.

The Select tool can be used to isolate title lines from large fasta datasets. Use the option “Matching” with the regular expression: ^>

The regular expression means:

  • ^ the start of a line
  • > a “greater than” symbol (how fasta title lines are designed)
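
If you want to try the same filter outside of Galaxy, here is a minimal Python sketch of what Select with “Matching” and ^> does. The function name and the two-record example fasta are made up for illustration; they are not from your data.

```python
import re

# Same pattern as in the Select tool: "^" anchors at line start, ">" is literal.
TITLE_LINE = re.compile(r"^>")

def select_title_lines(lines):
    """Keep only the lines that start with '>' (fasta title lines)."""
    return [line.rstrip("\n") for line in lines if TITLE_LINE.search(line)]

# Hypothetical two-record fasta, used only for illustration.
example = [
    ">TRINITY_DN0_c0_g1_i1 len=300\n",
    "ACGTACGT\n",
    ">TRINITY_DN0_c0_g1_i2 len=250\n",
    "GGCCTTAA\n",
]
titles = select_title_lines(example)
for title in titles:
    print(title)
```

A “>” in the middle of a line is not selected, because the pattern is anchored to the start of the line.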

Fasta format FAQ: Datatypes - Galaxy Community Hub

Small disclaimer: How Trinity formats title lines changed a bit between releases, so that may be where your problem is. But the help here should help even in that case – the workflow could definitely be tuned for any Trinity-based format.
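
To give an idea of what such custom parsing could look like, here is a rough Python sketch. It assumes two naming styles that I am writing from memory, not from your data: a newer style like TRINITY_DN1000_c115_g5_i1 (gene = everything before the _i suffix) and an older style like comp100_c0_seq1 (gene = everything before the _seq suffix). The function name and the fallback behavior are my own choices, so adjust the patterns to whatever your title lines actually contain.

```python
import re

# Assumed naming patterns (not taken from this thread's data):
#   newer style: TRINITY_DN1000_c115_g5_i1 -> gene TRINITY_DN1000_c115_g5
#   older style: comp100_c0_seq1           -> gene comp100_c0
PATTERNS = [
    re.compile(r"^(.*c\d+_g\d+)_i\d+$"),
    re.compile(r"^(.*comp\d+_c\d+)_seq\d+$"),
]

def gene_to_transcript(title_lines):
    """Return (gene, transcript) pairs parsed from fasta title lines.

    IDs that match neither pattern are mapped to themselves, which is
    a choice made for this sketch only.
    """
    pairs = []
    for line in title_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        transcript = fields[0]  # the ID is the first whitespace-separated token
        gene = transcript
        for pattern in PATTERNS:
            match = pattern.match(transcript)
            if match:
                gene = match.group(1)
                break
        pairs.append((gene, transcript))
    return pairs

# Hypothetical title lines, for illustration only.
pairs = gene_to_transcript([
    ">TRINITY_DN1000_c115_g5_i1 len=247",
    ">comp100_c0_seq2 len=10",
    ">contig_7",
])
for gene, transcript in pairs:
    print(f"{gene}\t{transcript}")
```

The printed lines have the ‘gene(tab)transcript’ layout. If most of your IDs end up mapped to themselves, the title lines probably do not encode a gene/transcript grouping in a form these patterns recognize.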

Hi @jennaj
The Generate gene to transcript map tool description says it is an alternative to Trinity. I tried Trinity directly after changing the format of my input from fastq to fasta.
Trinity and Align reads and estimate abundance both ran without any error. I then used the output from Align reads and estimate abundance for Build expression matrix. In the Build expression matrix tool, for Abundance estimates I gave the two output files from Align reads, and for Gene to transcript correspondence (‘gene(tab)transcript’ lines) I gave the output from Generate gene to transcript map. I ended up with the following error. I got the same error before I reloaded the job.
image
At first I ran the job without an input for Gene to transcript correspondence (‘gene(tab)transcript’ lines) and ended up with the error. With the input I get the same error.

What is TPM?
Is there any problem with my input files? Where am I going wrong?
Can you suggest a way out of this error, please?

Thank you

Hi @YKV

The tool accepts a Trinity assembly as an input. It does not replace first assembling with Trinity itself – rather some versions of Trinity as wrapped for Galaxy will create this output when Trinity is run and some do not. And if you ran Trinity somewhere else and uploaded that assembly to Galaxy, this particular mapping data may not have been created/uploaded.

Run Trinity using fastq reads if you have them – the quality scores will help to produce a higher quality assembly. The assembly result from Trinity is always in fasta format.

The tool Align reads and estimate abundance on a de novo assembly of RNA-Seq data is for mapping RNA-seq datasets against a previously created assembly. If that assembly was from Trinity, then there is no need to supply a gene-to-transcript map – the tool will parse the assembly’s sequence identifiers during runtime.

In short, it is summarizing abundance when a de-novo assembled transcriptome is available. This is similar to how featureCounts or HTSeq-count summarize abundance counts. It can also summarize abundance as TPM values. See below for more.

The next tool Build expression matrix for a de novo assembly of RNA-Seq data by Trinity accepts multiple abundance outputs from Align reads and estimate abundance. One input per sample. At least two distinct samples. All sample inputs should either be summarized by transcript OR all should be by gene – don’t mix the two in any particular job.

It sounds like you had just one sample – and input the transcript and gene counts together to this tool. That won’t work. You need more than one sample – the idea is to build a matrix of counts representing multiple samples. Then in later steps, those samples in the matrix are compared to each other (other samples for other conditions/factors) for differential expression purposes.
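
As a rough picture of what “building a matrix” means, here is a small Python sketch. It is not the tool’s code. The column names “id” and “count”, the function name, and filling a missing feature with 0 are all assumptions for illustration; the real abundance outputs have their own column layout.

```python
import csv
import io

def build_matrix(sample_tables):
    """Combine per-sample tables into one feature-by-sample matrix.

    sample_tables: dict of sample name -> tab-separated text with the
    hypothetical columns 'id' and 'count'.
    """
    if len(sample_tables) < 2:
        raise ValueError("need at least two samples to build a matrix")
    counts = {}
    for sample, text in sample_tables.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for row in reader:
            counts.setdefault(row["id"], {})[sample] = float(row["count"])
    samples = list(sample_tables)
    header = ["id"] + samples
    # Assumption for this sketch: a feature missing from a sample counts as 0.
    rows = [
        [feature] + [counts[feature].get(sample, 0.0) for sample in samples]
        for feature in sorted(counts)
    ]
    return header, rows

# Two hypothetical samples, both summarized by gene (never mix gene and transcript).
sample_a = "id\tcount\ng1\t10\ng2\t0\n"
sample_b = "id\tcount\ng1\t4\ng3\t7\n"
header, rows = build_matrix({"sample_a": sample_a, "sample_b": sample_b})
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Each column is one sample and each row is one gene (or one transcript), which is why a gene table and a transcript table from the same sample cannot be combined this way.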

Any differential expression tool from the Bioconductor tool authors (DESeq2, DEXSeq, edgeR, limma) will require at least two conditions with at least two samples per condition. FAQ: Extended Help for Differential Expression Analysis Tools

TPM (“transcripts per million”) values are a distinct type of normalized abundance metric. Tools like Salmon, Sailfish and Kallisto produce these. You can also choose to generate this kind of metric when using the Align reads and estimate abundance on a de novo assembly of RNA-Seq data tool, instead of counts.
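
For intuition only, here is the usual textbook calculation as a small Python sketch: divide each transcript’s count by its length, then rescale so that all values sum to one million. The function name and the example numbers are made up, and real tools typically use an “effective length” and estimated counts rather than the raw values used here.

```python
def tpm(counts, lengths):
    """Textbook TPM: length-normalize each count, then scale to sum to 1e6."""
    rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: rate / total * 1_000_000 for name, rate in rates.items()}

# Hypothetical example: equal counts, but t2 is half as long as t1.
values = tpm({"t1": 100, "t2": 100}, {"t1": 1000, "t2": 500})
for name, value in values.items():
    print(name, round(value, 2))
```

With equal counts, the shorter transcript gets the higher TPM, and the values always sum to one million within a sample. That is why TPM and raw counts are not interchangeable as inputs to downstream tools.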

I’m not sure what settings/inputs will lead to this particular error but suspect some of the input steps were not consistent. If you choose one way of generating abundance metrics with an upstream tool, the same should be set for downstream tools or your inputs won’t match up with what the tool is expecting as an input.

Review the tool forms again – each usually explains what the expected inputs are (and what tool produces it) and the outputs (and which tool or tools can use that output).

Hi @jennaj
My apologies for the wrong statement that Generate gene to transcript map is an alternative to Trinity. I ran Trinity in Galaxy and got two outputs: one is the assembled transcripts and the other is the gene to transcript map. I gave the assembled transcripts output from Trinity to the Generate gene to transcript map tool, and it produced the same output as Trinity’s second output (obviously, as the name says, a gene to transcript map). Before, I did not know whether there was simply no need to run Generate gene to transcript map or whether the problem was with my input (your reply answered that).

I downloaded the files in fastqsanger format with the fast download NCBI SRA and faster download NCBI SRA tools (not all files could be downloaded; the output says one list is available, but the list is empty). I changed the data format from fastqsanger.gz to fastq, but the file was still not available to select as an input in the Trinity tool. So I had to change it to fasta format.

Thank you so much for the explanation of Build expression matrix. I will try with two samples in Trinity (if I am able to get a suitable format through the SRA tool).

Because I lack knowledge in this area, either the inputs are not available, or if they are, the format is the problem. And if the format finally matches, then after running for one or two days or more (in some cases) there is an error in the output one way or another. I am trying to learn more about it, so I have more questions and need help. Sorry for the simple questions, but that’s where I stand now.

Thank you so much for your help.

Many tools work with compressed fastq data but Trinity expects uncompressed fastq. It also expects fastq data that has quality scores scaled as “Sanger Phred+33” (as do most other tools). In Galaxy, that datatype is labeled as “fastqsanger” (uncompressed fastq) or “fastqsanger.gz” (compressed fastq).
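
If you ever want to check a file yourself outside of Galaxy, here is a small Python sketch of two quick checks. It is not how Galaxy assigns datatypes. The gzip check relies on the standard two-byte gzip signature. The quality check is a common rule of thumb that I am assuming here: Phred+33 data usually contains quality characters below “;” (ASCII 59), while older +64 encodings do not. Function names and the one-read example are made up.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream

def is_gzipped(data: bytes) -> bool:
    """True if the bytes start with the gzip signature."""
    return data[:2] == GZIP_MAGIC

def looks_like_phred33(quality_lines):
    """Heuristic only: any quality character below ASCII 59 suggests Phred+33."""
    lowest = min((ord(ch) for line in quality_lines for ch in line), default=None)
    return lowest is not None and lowest < 59

# Hypothetical one-read fastq, compressed in memory for illustration.
raw = b"@read1\nACGT\n+\nII5!\n"
compressed = gzip.compress(raw)
text = gzip.decompress(compressed).decode("ascii")
qualities = text.splitlines()[3::4]  # every 4th line holds the quality string

print(is_gzipped(compressed), is_gzipped(raw))
print(looks_like_phred33(qualities))
```

A high-quality Phred+33 file might contain no low characters at all, so a negative result from the heuristic does not prove the data is in another encoding.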

Galaxy will “implicitly convert” some inputs with a given datatype to a different datatype during runtime (and create a hidden dataset of that type in your history to use). The input datasets will appear in the tool form’s select menu with a “(as NNN)” added on to the name where NNN is the datatype the data will be converted to at runtime. Be aware that this can unexpectedly increase your quota usage!

Trinity will convert fastqsanger.gz to fastqsanger this way.

The tool was probably not finding your fastqsanger.gz inputs because the tool you used to extract the data from NCBI SRA organized the output into a Dataset Collection. If you do not specifically set the input type as being in collection, the dataset(s) will not be discovered by the tool. So, there are a few choices (instead of converting to fasta, which results in information loss):

Use this tool instead: Download and Extract Reads in FASTA/Q format from NCBI SRA, and set the option to extract the data in uncompressed format. This changes two things. A) The sequences will not be in a collection, which may be more useful to you, although learning how to work with collections at some point is a very good idea. And B) The sequences will already be in an uncompressed state, meaning that you will avoid data duplication/quota increases from the pre-step of converting compressed to uncompressed – this is more important if you intend to use Trinity directly.

That said, you really should do some QA on data as the first step. The usual cycle will be FastQC (“before” data quality) > Trimmomatic > FastQC (“after” data quality). Then at the end decide if you want to uncompress the data yourself (pencil icon > Edit attributes > Convert > uncompress) or allow the next tool to do that (if that tool requires uncompressed fastq – most don’t).

These are decisions you’ll need to make yourself. And managing data (removing intermediate data no longer needed as an input) is also something that you will want to learn how to do.

If you are ever not sure what datatype a tool’s input is filtering on to determine which datasets in the current history are appropriate inputs (could be more than one datatype!), the tool form may state what is expected, but to know exactly, there is a very simple tip: Create a new empty history and then bring up the tool form. The expected datatypes will be listed in the select field. Then go back to the working history where your data is, and double-check the assigned datatypes if the tool form isn’t finding the data. And if the data is in a collection, make sure to set the tool form up to look for data in a collection.

I know this seems complicated, and it may be at first – after a while it will be automatic. Everyone doing informatics has to learn how to get their inputs set up correctly, in Galaxy or anywhere else. The filters in Galaxy actually are helping to avoid problems (example: running jobs with inappropriate inputs that will just cause the tool to fail, and not always with error messages that explain what is going wrong at a detailed level). It just isn’t possible to trap and report every possible usage problem, and even when that is done, what you need to do to fix the inputs can vary. Jobs running out of resources (memory/runtime), odd errors, unexpected results – all of these are almost always problems related to inputs (format & content).

There is much prior Q&A around various problems with inputs at this forum. I added more tags to your post. Click on those to review or just search with keywords. So many have detailed help, FAQ links, tutorial links, etc. The input troubles you are running into are commonly reported by newer Galaxy users – if I tried to individually point you to all that might help, it would be a long list of topics! Better for you to just spend some time reviewing – it will be worth the effort.

That said, here are the FAQ links that will apply the most. But I really suggest that you review the prior Q&A too. FAQs are abstract – prior Q&A breaks that down and gets into specifics, with context.

You already know where the GTN tutorials are and while those can help with usage, once you are not using the tutorial data anymore, or using tools not covered by a tutorial, usually tool form help, FAQs, and prior Q&A is more useful when addressing specific issues with inputs. I don’t think you are running into actual tool bugs right now so skip that part – but the rest of the advice in this particular post should really help.

I duplicated/reworded some of that help in our recent Q&A (in this reply and across other topics) – and in other prior Q&A you’ll find – because sometimes that helps a bit too. But in the end, you are going to need to learn how to get your own inputs correct. Otherwise, the input problems are going to just lead to even more job delays, odd errors, weird results, and overall frustration.

I think you are getting close to solving this!

Hi @jennaj,
I have a couple of reads from each sample, and I ran Trinity on each read followed by Align reads and estimate abundance. I got two files (for each pair of reads): 1) gene counts and 2) isoform counts. To start, I used two files from each sample, i.e., two gene count files, as input to the “Build expression matrix” tool, and I still get the error below. I left the Gene to transcript correspondence (‘gene(tab)transcript’ lines) option empty. (I have tried multiple ways: isoforms alone (one set as well as all sets); all sets of gene counts; the Trinity assembly and the gene to transcript file for ‘gene(tab)transcript’ lines.)
Because of the TPM error last time, I tried both the Kallisto and RSEM methods. (I don’t understand: the TPM option is available with the RSEM method, so why did I get an error about TPM?) But neither worked.
image
Can you suggest a way out, please?

Thank you

I’m going to bring in the Galaxy EU team since it may be that you are running into technical issues (I can’t check for those since I am not an admin at that server).

Ping @wm75 @bjoern.gruening

Hi @jennaj,
Thank you
Hi @wm75, @bjoern.gruening,
Can you please suggest a way out for my above problem?

Thank you