Too much time to start a job? Troubleshooting Trinity inputs + QA/QC steps for RNA-seq assembly

I sent two jobs to Trinity, on November 28 and 29 respectively. I have read in the Galaxy Help that the machines are rather busy, but I wonder how long it will take for the jobs to start. The data are green, so I suppose they are OK, and the jobs are gray, with the “This job is waiting to run” tag. There are several posts on this issue, but it seems to me that 4 days is a rather long time in the queue. Is it possible to get any hint about when the jobs will start?
Regards,
Jaime


Hi @jrenart

A few questions:

  1. Did you try refreshing the history panel? Sometimes the most updated view is not automatically displayed (any Galaxy server)
  2. Are the inputs green, and do they contain content? What is the size of each dataset in the paired set input to the tool? Expand the datasets to check.
  3. Are you working at usegalaxy.org? If somewhere else, please note the public server URL or describe if using a local/cloud Galaxy (the version of Galaxy, where/when sourced).
  4. How many total jobs do you have queued (all histories)? There is a limit on how many resource-intensive current jobs will run per person, to ensure fair usage of resources. That means if you start up a batch of jobs, a few of your queued jobs will run, then a few of other people’s queued jobs will run, then yours again, repeat. You can set up an email notification on a tool form and/or within a workflow so you know when results are ready.
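
To picture the fair-usage rotation described in point 4, here is a toy sketch. It is not how the usegalaxy.org scheduler is actually implemented; the function name `fair_share_order`, the `per_turn` limit, and the job names are all made up for illustration.

```python
from collections import deque

def fair_share_order(queues, per_turn=2):
    """Return the run order for a toy round-robin over per-user job queues.

    queues: dict mapping a user name to a list of job names (hypothetical data).
    per_turn: how many jobs one user may start before the next user gets a turn.
    """
    pending = deque((user, deque(jobs)) for user, jobs in queues.items())
    order = []
    while pending:
        user, jobs = pending.popleft()
        # Start at most `per_turn` of this user's jobs, then move on.
        for _ in range(min(per_turn, len(jobs))):
            order.append((user, jobs.popleft()))
        # Users with jobs left go to the back of the rotation.
        if jobs:
            pending.append((user, jobs))
    return order
```

With a batch of four jobs from one user and one job from another, the second user's job runs before the first user's third and fourth, which is why a large batch can appear to stall partway through.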

Let’s follow up from there. The queue is busy at usegalaxy.org but 4 days seems a bit too long if you really just have these two jobs queued. We can take a look at your account if needed.

Note: Trinity has some known issues right now. In particular, any jobs that use collection inputs will fail immediately. That does not seem to be your situation if the jobs are queued. We are working on that fix with priority, details here: https://github.com/galaxyproject/usegalaxy-playbook/issues/308

Hi Jennaj, thank you for your fast answer.
Regarding the points you raise:

  1. Yes, I refreshed the history (eshark), and it stays the same.
  2. I am using galaxy.org. Dataset #7 (for jobs 8 and 9) is 6.6 GB. Datasets 16 and 17 are 13.0 GB each (for jobs 22 and 23). These data represent 15% (upper right information).
  3. As mentioned above, there are 4 jobs, two for each dataset (Gene to Transcripts and Assembled Transcripts), created automatically by Trinity.
    After seeing several posts, I understand the difficulty of estimating when the jobs will start; but is it possible to know their order in the queue?
    Regards,
    Jaime

Ok, thanks for clarifying @jrenart :slight_smile:

I found your account and will take a look to see what the status is. I won’t be able to tell you what your exact placement is in the queue (much is dynamic) but can double-check this particular queue itself and that something else isn’t going on that is preventing the jobs from executing by now. Don’t rerun yet – that is usually a poor choice for any situation unless there is a confirmed input issue or server issue (places the new jobs back at the end of the queue – further extending wait time).

More feedback in a bit

Update:

I see a few problems with your inputs and prior failed jobs. All were good ways to attempt to troubleshoot but won’t actually resolve the problems. Given the other problems with Trinity right now and that these jobs would fail anyway, you should fix up your inputs and rerun. Each job has a specific input problem, but the same solution will work for both. Permanently delete (purge) the prior work to recover disk space – in particular the queued jobs. You don’t want to waste time waiting for those to run and fail, or have them stall your new jobs.

First, some advice about how Trinity works:

Trinity as run at usegalaxy.org requires distinct forward and reverse read inputs that are paired (no interleaved/interlaced inputs, and no extra single reads in either input that are missing the complementary end in the other input). Also, Trinity processes uncompressed fastq. If given compressed fastq as an input, the tool will uncompress the reads at runtime (as hidden datasets) and use those as inputs. That can increase your quota usage with duplicated data in different formats, and can in some cases (very large datasets) trigger a job to fail for resource reasons since the job has two steps: 1) uncompress, then 2) actually run the assembly on the uncompressed version. All of this can be addressed.
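
If you want to check the pairing requirement outside of Galaxy, a rough sketch like the one below can help. It assumes plain 4-line FASTQ records and read names that match between files once an optional `/1` or `/2` suffix is removed; the helper names (`open_fastq`, `read_ids`, `first_mismatch`) are illustrative only and are not part of Trinity or Galaxy.

```python
import gzip
from itertools import zip_longest

def open_fastq(path):
    """Open a FASTQ file as text, detecting gzip by its two magic bytes."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")

def read_ids(path):
    """Yield the read ID of each 4-line record, without a /1 or /2 suffix."""
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            # Header lines sit at positions 0, 4, 8, ... in strict 4-line FASTQ.
            if line_number % 4 != 0 or not line.strip():
                continue
            read_id = line[1:].split()[0]
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id

def first_mismatch(forward_path, reverse_path):
    """Return (record_index, forward_id, reverse_id) for the first record
    that is not properly paired, or None if both files pair up throughout."""
    pairs = zip_longest(read_ids(forward_path), read_ids(reverse_path))
    for index, (fwd, rev) in enumerate(pairs):
        if fwd != rev:
            return index, fwd, rev
    return None
```

A result of `None` means every forward read has its mate at the same position in the reverse file. Anything else points at the first record where an extra single read or a reordering breaks the pairing.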

The input dataset 7 is an interleaved/interlaced fastq dataset.

Two choices, with the first strongly recommended.

  1. Run some QA/QC on your reads: FastQC > Trimmomatic > FastQC > MultiQC (to compare the before and after), then input the reads that are still paired after QA to Trinity.
  • Advantages of this method include:
    • Trimmomatic will create 4 uncompressed outputs (2 for surviving paired, 2 for leftover singletons).
    • Data prep is done without extra manipulation methods.
    • Better assembly quality due to better quality read inputs, and the jobs will be much less likely to fail for exceeding resources.
  2. Alternatively, you can uncompress yourself (pencil > Convert) then split the reads.
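
For reference, "splitting the reads" in the second option just means separating the alternating records of the interleaved file into a forward file and a reverse file. A minimal sketch of the idea, assuming an uncompressed file with strict 4-line records ordered R1, R2, R1, R2 and so on (the function name `split_interleaved` is made up for this example, and it is not the Galaxy tool itself):

```python
from itertools import islice

def split_interleaved(interleaved_path, forward_path, reverse_path):
    """Write alternating 4-line FASTQ records to forward and reverse files.

    Assumes uncompressed input with records strictly ordered R1, R2, R1, R2, ...
    Returns the number of read pairs written.
    """
    pairs = 0
    with open(interleaved_path) as src, \
         open(forward_path, "w") as fwd, \
         open(reverse_path, "w") as rev:
        while True:
            # One pair is 8 lines: 4 for the forward read, 4 for its mate.
            record_pair = list(islice(src, 8))
            if not record_pair:
                break
            if len(record_pair) != 8:
                raise ValueError("incomplete pair at the end of the file")
            fwd.writelines(record_pair[:4])
            rev.writelines(record_pair[4:])
            pairs += 1
    return pairs
```

Note that this does no quality filtering at all, which is one reason the first option is the recommended one.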

This job is staged in a strange way. It appears that you manipulated the output from Faster Download and Extract Reads in FASTQ format from NCBI SRA. This resulted in odd datasets, all unhidden, with repeated dataset numbers (the repeated dataset numbers are expected, but if there is more than one dataset with the same dataset number, don’t input both or the job will fail). Click on the rerun (double-circle) icon for job 22 (or 23), and you’ll see that the tool form had two datasets selected for the forward input and two datasets selected for the reverse input. The job details (“i” icon) report does list just one dataset per input – but that is also related to the way the collection datasets were manipulated.

Start over and follow the method for the first job (preferably the first option). That should solve all of the problems you have been having.

There is an unpublished GTN tutorial that covers assembly with Trinity – most of that should work fine for a larger overview. Just avoid any portions where collections are involved and use the alternatives above.

QA/QC is covered in several tutorials. Choose those with an RNA-seq focus (different analysis goals utilize different QA processes). This is the most basic and is appropriate for your purposes:

I also added a few tags to your post that link to prior Q&A around topics like this one. Review the “qa-qc” tag first if you want more advice, in context with those particular topics’ troubleshooting/analysis goals. Several are also about Trinity.

Thanks!

Thanks, Jennaj!
I have deleted everything and will start all over again with your advice.
Jaime
