Troubleshooting Trinity inputs + QA/QC steps for RNA-seq assembly

Ok, thanks for clarifying @jrenart :slight_smile:

I found your account and will take a look to see what the status is. I won’t be able to tell you your exact placement in the queue (much of that is dynamic), but I can double-check this particular queue and make sure that nothing else is preventing the jobs from executing by now. Don’t rerun yet – that is usually a poor choice unless there is a confirmed input issue or server issue, since it places the new jobs back at the end of the queue and further extends the wait time.

More feedback in a bit

Update:

I see a few problems with your inputs and prior failed jobs. All were good ways to attempt to troubleshoot, but they won’t actually resolve the problems. Given the other problems with Trinity right now, and that these jobs would fail anyway, you should fix up your inputs and rerun. Each job has a specific input problem, but the same solution will work for both. Permanently delete (purge) the prior work to recover disk space – in particular the queued jobs. You don’t want to waste time trying to get those to run/fail, or have them stall your new jobs.

First, some advice about how Trinity works:

Trinity as run at usegalaxy.org requires distinct forward and reverse read inputs that are paired: no interleaved/interlaced inputs, and no extra single reads in either input whose complementary end is missing from the other input. Also, Trinity processes uncompressed fastq. If given compressed fastq as an input, the tool will uncompress the reads at runtime (as hidden datasets) and use those as inputs. That can increase your quota usage with duplicated data in different formats, and in some cases (very large datasets) can cause a job to fail for resource reasons, since the job then has two steps: 1) uncompress, then 2) actually run the assembly on the uncompressed version. All of this can be addressed.
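If you ever want to sanity-check pairing outside of Galaxy (on downloaded copies of the two datasets), the pairing requirement described above can be sketched in a few lines of Python. This is only a rough illustration, not what Trinity or Galaxy run internally. The function names (`open_fastq`, `normalize`, `read_ids`, `check_pairing`) are made up for this example, and it assumes plain 4-line fastq records whose mates share a read name, optionally ending in `/1` and `/2`.

```python
import gzip
import io
from itertools import zip_longest


def open_fastq(path):
    """Open a fastq file as text; .gz files are decompressed on the fly."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def normalize(header):
    """Drop the leading '@', any comment after whitespace, and a /1 or /2 suffix."""
    name = header.strip().split()[0].lstrip("@")
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def read_ids(handle):
    """Yield the read name from the first line of each 4-line fastq record."""
    for i, line in enumerate(handle):
        if i % 4 == 0 and line.strip():
            yield normalize(line)


def check_pairing(forward, reverse):
    """Return (pairs_ok, problem); problem is None when every read has its mate."""
    pairs_ok = 0
    for fwd, rev in zip_longest(read_ids(forward), read_ids(reverse)):
        if fwd is None or rev is None:
            return pairs_ok, f"extra unpaired read after pair {pairs_ok}: {fwd or rev}"
        if fwd != rev:
            return pairs_ok, f"name mismatch at pair {pairs_ok + 1}: {fwd} vs {rev}"
        pairs_ok += 1
    return pairs_ok, None


# Tiny in-memory demonstration with made-up reads (not from your history).
demo_forward = io.StringIO("@r1/1\nACGT\n+\nIIII\n")
demo_reverse = io.StringIO("@r1/2\nTGCA\n+\nIIII\n")
pairs, problem = check_pairing(demo_forward, demo_reverse)
```

With real files you would pass `open_fastq("forward.fastq")` and `open_fastq("reverse.fastq")` (hypothetical file names) instead of the in-memory demo. A `problem` other than `None` means the two inputs are not cleanly paired and Trinity would likely reject them.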

The input dataset 7 is an interleaved/interlaced fastq dataset.

Two choices, with the first strongly recommended.

  1. Run some QA/QC on your reads. FastQC > Trimmomatic > FastQC > MultiQC (to compare the before and after) > input reads that are still paired after QA to Trinity.
  • Advantages of this method include:
    • Trimmomatic will create 4 uncompressed outputs (2 for surviving paired, 2 for leftover singletons).
    • Data prep is done without extra manipulation methods.
    • Better assembly quality due to better quality read inputs, and the jobs will be much less likely to fail for exceeding resources.
  2. Alternatively, you can uncompress yourself (pencil > Convert) then split the reads.
  • FAQs: Galaxy Support - Galaxy Community Hub
  • You can permanently delete (purge) the original compressed fastq to recover disk space and keep duplicated data in slightly different formats from consuming quota space.
  • Note: The tool Faster Download and Extract Reads in FASTQ format from NCBI SRA (instead of Download and Extract Reads in FASTA/Q) will output split paired + unpaired reads sorted into dataset collections. But since collection inputs are part of the current issues with Trinity, using the Faster version of the tool is not a good choice until the problems are resolved. You could drag-n-drop individual datasets from inside those collections into the tool, but that may be more work than is needed and doesn’t take care of the QA steps (which would still be recommended). It looks like you tried this already (in part) with the middle runs (failed and deleted) and the second queued run. The failed Trinity jobs are related to the current collection input issues covered in the ticket I sent in the first reply.
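For reference, this is roughly what "splitting the reads" in the second option means for an interleaved fastq. In Galaxy you would use a tool for this, so treat the Python below as a sketch of the idea only. The helper names (`records`, `split_interleaved`) are invented for this example, and it assumes strictly alternating forward/reverse 4-line records with no stray singletons.

```python
import io


def records(handle):
    """Yield 4-line fastq records; stops at end of file or at a blank header line."""
    while True:
        rec = [handle.readline() for _ in range(4)]
        if not rec[0].strip():
            return
        if not rec[3]:
            raise ValueError(f"truncated fastq record: {rec[0].strip()}")
        # Normalize line endings so every output line ends with a single newline.
        yield [line.rstrip("\r\n") + "\n" for line in rec]


def split_interleaved(interleaved, forward_out, reverse_out):
    """Send odd-numbered records to forward_out and even-numbered to reverse_out.

    Returns the number of pairs written. Raises ValueError if the last
    forward read has no mate, since that input is not cleanly interleaved.
    """
    pairs = 0
    stream = records(interleaved)
    for fwd in stream:
        rev = next(stream, None)
        if rev is None:
            raise ValueError(f"read without a mate: {fwd[0].strip()}")
        forward_out.writelines(fwd)
        reverse_out.writelines(rev)
        pairs += 1
    return pairs


# Tiny in-memory demonstration with made-up reads (not from your history).
demo_in = io.StringIO(
    "@r1/1\nAAAA\n+\nIIII\n"
    "@r1/2\nCCCC\n+\nIIII\n"
)
demo_fwd, demo_rev = io.StringIO(), io.StringIO()
demo_pairs = split_interleaved(demo_in, demo_fwd, demo_rev)
```

The two outputs stay in the same read order, which is what keeps them paired. The sketch does not do any quality trimming, so the QA steps in the first option would still be worth running.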

This job is staged in a strange way. It appears that you manipulated the output from Faster Download and Extract Reads in FASTQ format from NCBI SRA. This resulted in odd datasets, all unhidden, with repeated dataset numbers. The repeated dataset numbers are expected, but if there is more than one dataset with the same dataset number, don’t input both or the job will fail. Click on the rerun icon (double-circle) for job 22 (or 23), and you’ll see that the tool form had two datasets selected for the forward input and two datasets selected for the reverse input. The job details (“i” icon) report does list just one dataset per input, but that is also related to the way the collection datasets were manipulated.

Start over and follow the method for the first job (preferably the first option). That should solve all of the problems you have been having.

There is an unpublished GTN tutorial that covers assembly with Trinity – most of that should work fine for a larger overview. Just avoid any portions where collections are involved and use the alternatives above.

QA/QC is covered in several tutorials. Choose those with an RNA-seq focus (different analysis goals utilize different QA processes). This is the most basic and is appropriate for your purposes:

I also added a few tags to your post that link to prior Q&A around topics like this one. Review the “qa-qc” tag first if you want more advice, in context with those particular topics’ troubleshooting/analysis goals. Several are also about Trinity.

Thanks!