Generate gene to transcript map: WARNING output

Many tools work with compressed fastq data, but Trinity expects uncompressed fastq. It also expects fastq data with quality scores encoded as Sanger/Phred+33 (as do most other tools). In Galaxy, that datatype is labeled “fastqsanger” (uncompressed fastq) or “fastqsanger.gz” (compressed fastq).
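As an aside, if you ever want to sanity-check the quality encoding of a fastq file outside Galaxy, the difference between Phred+33 and Phred+64 shows up in the ASCII range of the quality characters. This is a rough heuristic sketch of my own (not a Galaxy tool), and the thresholds follow the standard FASTQ encoding ranges:

```python
def guess_fastq_offset(lines):
    """Guess the quality-score offset for an iterable of FASTQ lines.

    Returns 33 (Sanger/fastqsanger), 64 (old Illumina), or None if
    the file is ambiguous or has no quality lines.
    """
    min_q = 255
    for i, line in enumerate(lines):
        if i % 4 == 3:                  # every 4th line is the quality string
            for ch in line.strip():
                min_q = min(min_q, ord(ch))
    if min_q < 59:                      # chars below ';' only occur in Phred+33
        return 33
    if 64 <= min_q < 255:               # no low chars seen: likely Phred+64
        return 64
    return None                         # ambiguous (or empty input)

# Hypothetical example record; 'I' is Q40 in Phred+33:
print(guess_fastq_offset(["@read1", "ACGT", "+", "!!II"]))  # -> 33
```

Modern data from NCBI SRA is essentially always Phred+33 already, so this is only worth doing if a tool complains about quality scores.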

Galaxy will “implicitly convert” some inputs from one datatype to another at runtime (and create a hidden dataset of the converted type in your history). These input datasets appear in the tool form’s select menu with “(as NNN)” appended to the name, where NNN is the datatype the data will be converted to at runtime. Be aware that this can unexpectedly increase your quota usage!

Trinity will convert fastqsanger.gz to fastqsanger this way.

The tool was probably not finding your fastqsanger.gz inputs because the tool you used to extract the data from NCBI SRA organized the output into a Dataset Collection. If you do not specifically set the input type as being in a collection, the dataset(s) will not be discovered by the tool. So, there are a few choices (instead of converting to fasta, which discards the quality scores):

Use the tool Download and Extract Reads in FASTA/Q format from NCBI SRA instead, and set the option to extract the data in uncompressed format. This changes two things. A) The sequences will not be in a collection, which may be more convenient for you, although learning how to work with collections at some point is a very good idea. And B) The sequences will already be uncompressed, so you avoid the data duplication/quota increase from a pre-step that converts compressed to uncompressed – this matters more if you intend to use Trinity directly.

That said, you really should do some QA on the data as a first step. The usual cycle is FastQC (“before” data quality) > Trimmomatic > FastQC (“after” data quality). Then, at the end, decide whether to uncompress the data yourself (pencil icon > Edit attributes > Convert > uncompress) or let the next tool do that (if that tool requires uncompressed fastq – most don’t).
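If you ever want to do that uncompress step outside Galaxy, it is conceptually just a gzip decompression. A minimal sketch, with hypothetical file names (this mirrors what the Convert > uncompress step does, not Galaxy’s actual implementation):

```python
import gzip
import shutil

def uncompress_fastq(gz_path, out_path):
    """Decompress a fastqsanger.gz file into a plain fastqsanger file.

    copyfileobj streams the data in chunks, so large fastq files do not
    need to fit in memory.
    """
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Hypothetical usage:
# uncompress_fastq("reads.fastq.gz", "reads.fastq")
```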

These are decisions you’ll need to make yourself. And managing data (removing intermediate data no longer needed as an input) is also something that you will want to learn how to do.

If you are ever not sure what datatype a tool’s input is filtering on to determine which datasets in the current history are appropriate inputs (it could be more than one datatype!), the tool form may state what is expected, but to know exactly, there is a very simple tip: create a new empty history, then bring up the tool form. The expected datatypes will be listed in the select field. Then go back to the working history where your data is and double-check the assigned datatypes if the tool form isn’t finding the data. And if the data is in a collection, make sure to set the tool form to look for data in a collection.

I know this seems complicated, and it may be at first, but after a while it becomes automatic. Everyone doing informatics has to learn how to get their inputs set up correctly, in Galaxy or anywhere else. The filters in Galaxy are actually helping to avoid problems (example: running jobs with inappropriate inputs that will just cause the tool to fail, and not always with error messages that explain what went wrong at a detailed level). It just isn’t possible to trap and report every possible usage problem, and even when that is done, what you need to do to fix the inputs can vary. Jobs running out of resources (memory/runtime), odd errors, unexpected results – all of these are almost always input problems (format & content).

There is much prior Q&A around input problems at this forum. I added more tags to your post; click on those to review, or just search with keywords. Many of those topics have detailed help, FAQ links, tutorial links, etc. The input troubles you are running into are commonly reported by newer Galaxy users – if I tried to individually point you to everything that might help, it would be a long list of topics! Better to just spend some time reviewing – it will be worth the effort.

That said, here are the FAQ links that will apply the most. But I really suggest that you review the prior Q&A too. FAQs are abstract – prior Q&A breaks that down and gets into specifics, with context.

You already know where the GTN tutorials are, and while those can help with usage, once you are no longer using the tutorial data, or are using tools not covered by a tutorial, tool form help, FAQs, and prior Q&A are usually more useful for addressing specific input issues. I don’t think you are running into actual tool bugs right now, so skip that part – but the rest of the advice in this particular post should really help.

I have duplicated/reworded some of that help across recent Q&A (in this reply and other topics) – and you’ll find it in older Q&A too – because repetition sometimes helps a bit. But in the end, you are going to need to learn how to get your own inputs correct. Otherwise, input problems will just lead to more job delays, odd errors, weird results, and overall frustration.

I think you are getting close to solving this!
