Troubleshooting Trinity inputs + QA/QC steps for RNA-seq assembly

jrenart · December 2, 2020, 4:26pm

I sent two jobs to Trinity, on November 28 and 29 respectively. I have read in the Galaxy Help that the machines are rather busy, but I wonder how long it will take for the jobs to start. The data is green, so I suppose it is OK, and the jobs are gray, with the “This job is waiting to run” tag. There are several posts on this issue, but it seems to me that 4 days is a rather long time in the queue. Is it possible to get any hint about when the jobs will start?
Regards,
Jaime

jennaj · December 2, 2020, 6:19pm

Hi @jrenart

A few questions:

- Did you try refreshing the history panel? Sometimes the most updated view is not automatically displayed (any Galaxy server).
- Are the inputs green, and do they contain content? What is the size of each dataset in the paired set input to the tool? Expand the datasets to check.
- Are you working at usegalaxy.org? If somewhere else, please note the public server URL or describe if using a local/cloud Galaxy (the version of Galaxy, where/when sourced).
- How many total jobs do you have queued (all histories)? There is a limit on how many resource-intensive jobs will run concurrently per person, to ensure fair usage of resources. That means if you start up a batch of jobs, a few of your queued jobs will run, then a few of other people’s queued jobs will run, then yours again, and so on. You can set up an email notification on a tool form and/or within a workflow so you know when results are ready.

Let’s follow up from there. The queue is busy at usegalaxy.org but 4 days seems a bit too long if you really just have these two jobs queued. We can take a look at your account if needed.

Note: Trinity has some known issues right now. In particular, any jobs that use collection inputs will fail immediately. That does not seem to be your situation since the jobs are queued. We are working on that fix with priority, details here: https://github.com/galaxyproject/usegalaxy-playbook/issues/308

jrenart · December 2, 2020, 7:33pm

Hi Jennaj, thank you for your fast answer.
Regarding the points you raise:

Yes, I refreshed the history (eshark), and it stays the same.
I am using galaxy.org. Dataset #7 (for jobs 8 and 9) is 6.6 GB. Datasets 16 and 17 are 13.0 GB each (for jobs 22 and 23). These data represent 15% (upper right information).
As mentioned above, there are 4 jobs, two for each dataset (Gene to Transcripts and Assembled Transcripts), created automatically by Trinity.
After seeing several posts, I understand the difficulty of estimating when jobs will start, but is it possible to know their order in the queue?
Regards,
Jaime

jennaj · December 2, 2020, 8:38pm

Ok, thanks for clarifying @jrenart

I found your account and will take a look to see what the status is. I won’t be able to tell you what your exact placement is in the queue (much is dynamic) but can double-check this particular queue itself and that something else isn’t going on that is preventing the jobs from executing by now. Don’t rerun yet – that is usually a poor choice for any situation unless there is a confirmed input issue or server issue (places the new jobs back at the end of the queue – further extending wait time).

More feedback in a bit

Update:

I see a few problems with your inputs and prior failed jobs. All were good ways to attempt to troubleshoot but won’t actually resolve the problems. Given the other problems with Trinity right now and that these jobs would fail anyway, you should fix up your inputs and rerun. Each job has a specific input problem, but the same solution will work for both. Permanently delete (purge) the prior work to recover disk space, in particular the queued jobs. You don’t want to waste time trying to get those to run/fail, or have them stall your new jobs.

First, some advice about how Trinity works:

Trinity as run at usegalaxy.org requires distinct forward and reverse read inputs that are paired (no interleaved/interlaced inputs, and no extra single reads in either input whose complementary mate is missing from the other input). Also, Trinity processes uncompressed fastq. If given compressed fastq as an input, the tool will uncompress the reads at runtime (as one or more hidden datasets) and use those as inputs. That can increase your quota usage with duplicated data in different formats, and can in some cases (very large datasets) trigger a job to fail for resource reasons since the job has two steps: 1) uncompress, then 2) actually run the assembly on the uncompressed version. All of this can be addressed.

The input dataset 7 is an interleaved/interlaced fastq dataset.
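If you ever want to check for this outside of Galaxy, here is a rough Python sketch of one heuristic. It is only an illustration, not something Galaxy or Trinity runs: it assumes plain 4-line fastq records and that both mates share the same read name (optionally ending in /1 and /2). The function names are made up for this example, and other naming schemes would need a different comparison.

```python
import io
from itertools import islice

def read_names(handle):
    """Yield read names from 4-line fastq text, dropping a trailing /1 or /2."""
    for i, line in enumerate(handle):
        # Header lines are every 4th line; skip a stray blank line at the end.
        if i % 4 == 0 and line.strip():
            name = line[1:].split()[0]
            yield name[:-2] if name.endswith(("/1", "/2")) else name

def looks_interleaved(handle, max_pairs=1000):
    """Heuristic: True if consecutive records share a read name, pair after pair."""
    names = list(islice(read_names(handle), 2 * max_pairs))
    if len(names) < 2 or len(names) % 2:
        return False
    return all(names[i] == names[i + 1] for i in range(0, len(names), 2))

# Tiny made-up example: two pairs stored in one interleaved stream.
example = (
    "@r1/1\nACGT\n+\nIIII\n@r1/2\nTGCA\n+\nIIII\n"
    "@r2/1\nAAAA\n+\nIIII\n@r2/2\nTTTT\n+\nIIII\n"
)
print(looks_interleaved(io.StringIO(example)))
```

A forward-only file has a different read name in every record, so the same check comes back False for it.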

Two choices, with the first strongly recommended.

Run some QA/QC on your reads. FastQC > Trimmomatic > FastQC > MultiQC (to compare the before and after) > input reads that are still paired after QA to Trinity.

Advantages of this method include:
- Trimmomatic will create 4 uncompressed outputs (2 for surviving paired, 2 for leftover singletons).
- Data prep is done without extra manipulation methods.
- Better assembly quality due to better quality read inputs, and the jobs will be much less likely to fail for exceeding resources.
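Since Trinity needs the forward and reverse inputs to stay paired, it can be worth confirming that the two "surviving paired" files are still in sync before assembling. The sketch below is one way to do that locally in Python. It is an assumption-laden illustration (4-line fastq, mates sharing a read name apart from an optional /1 or /2 suffix), and the helper names are hypothetical.

```python
import io
from itertools import zip_longest

def read_ids(handle):
    """Yield read names from 4-line fastq text, dropping a trailing /1 or /2."""
    for i, line in enumerate(handle):
        if i % 4 == 0 and line.strip():
            name = line[1:].split()[0]
            yield name[:-2] if name.endswith(("/1", "/2")) else name

def first_mismatch(fwd, rev):
    """Return the 1-based record number where the mates disagree, or None if in sync."""
    pairs = zip_longest(read_ids(fwd), read_ids(rev))
    for n, (f_id, r_id) in enumerate(pairs, start=1):
        # zip_longest pads the shorter file with None, so extra reads are caught too.
        if f_id != r_id:
            return n
    return None

# Made-up example: the reverse file is missing the mate of r2.
fwd = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nAAAA\n+\nIIII\n")
rev = io.StringIO("@r1/2\nTGCA\n+\nIIII\n")
print(first_mismatch(fwd, rev))
```

A result of None means every record lined up; any number points at the first record to inspect.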

Alternatively, you can uncompress yourself (pencil > Convert) then split the reads.
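For anyone who prefers to prepare the files locally before uploading, the same "uncompress, then split" idea can be sketched in Python as below. This is not what the Galaxy tools do internally. It assumes a gzip-compressed, strictly interleaved, 4-line fastq file where every forward read is immediately followed by its mate, and the function and file names are placeholders.

```python
import gzip

def split_interleaved(src_gz, fwd_path, rev_path):
    """Write alternating 4-line records to forward/reverse files; return the pair count."""
    pairs = 0
    with gzip.open(src_gz, "rt") as src, \
            open(fwd_path, "w") as fwd, open(rev_path, "w") as rev:
        while True:
            # One pair = 8 lines: 4 for the forward read, 4 for the reverse read.
            record = [src.readline() for _ in range(8)]
            if not record[0]:
                break  # clean end of file
            if not record[7]:
                raise ValueError("input ends with an incomplete pair")
            fwd.writelines(record[:4])
            rev.writelines(record[4:])
            pairs += 1
    return pairs

# Hypothetical usage (file names are placeholders):
# n = split_interleaved("reads.fastq.gz", "reads_1.fastq", "reads_2.fastq")
```

The outputs are written uncompressed on purpose, since that is the form the assembly step ends up working on anyway.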

FAQs: Galaxy Support - Galaxy Community Hub
- How to format fastq data for tools that require .fastqsanger format?
- Understanding compressed fastq data (fastq.gz)
- Reformatting fastq data loaded with NCBI SRA >> NCBI SRA Fastq - Galaxy Community Hub (includes a suggested tool to split up reads (method 1 will work for you), but seqtk_seq (run twice) is another tool choice)
You can permanently delete (purge) the original compressed fastq to recover disk space/avoid duplicated data in slightly different formats from consuming quota space.
Note: The tool Faster Download and Extract Reads in FASTQ format from NCBI SRA instead of Download and Extract Reads in FASTA/Q will output split paired + unpaired reads sorted into dataset collections, but since collection inputs are part of the current issues with this tool – using the Faster version of the tool is not a good choice until the problems are resolved. You could drag-n-drop individual datasets from inside those collections into the tool – but that may be more work than is needed, and doesn’t take care of the QA steps (would still be recommended). It looks like you tried this already (in part) with the middle runs (failed and deleted) and the second queued run. The failed Trinity jobs are related to the current collection input issues covered in the ticket I sent in the first reply.

This job is staged in a strange way. It appears that you manipulated the output from Faster Download and Extract Reads in FASTQ format from NCBI SRA. This resulted in odd datasets, all unhidden, with repeated dataset numbers (the repeated dataset numbers are expected, but if there is more than one dataset with the same dataset number, don’t input both or the job will fail). Click on the rerun icon (double-circle) for job 22 (or 23), and you’ll see that the tool form had two datasets selected for the forward input and two datasets selected for the reverse input. The job details (“i” icon) report does list just one dataset per input, but that is also related to the way the collection datasets were manipulated.

Start over and follow the method for the first job (preferably the first option). That should solve all of the problems you have been having.

There is an unpublished GTN tutorial that covers assembly with Trinity – most of that should work fine for a larger overview. Just avoid any portions where collections are involved and use the alternatives above.

QA/QC is covered in several tutorials. Choose those with an RNA-seq focus (different analysis goals utilize different QA processes). This is the most basic and is appropriate for your purposes:

I also added a few tags to your post that link to prior Q&A around topics like this one. Review the “qa-qc” tag first if you want more advice, in context with those particular topics’ troubleshooting/analysis goals. Several are also about Trinity.

Thanks!

jrenart · December 5, 2020, 6:56pm

Thanks, Jennaj!
I have deleted everything and will start all over again with your advice.
Jaime

jrenart · June 7, 2021, 10:02am

Hi Jennaj,
Last time I wrote, it was about a Trinity job taking too long to start.
Today I wonder if two and a half months is not too long for a Trinity job to be running. I launched two Trinity jobs on April 16th (jobs 219, 219, 220 and 221). At the beginning I was curious to see how long it would take, but perhaps something is wrong.
Kind regards,
Jaime
