1. The FastQC report shows that your reads need 5’ trimming.
2. The Trinity logs show that reads shorter than 25 bases still remain.
You haven’t used fastp yet as @igor recommended. Trimmomatic can do this too. Both can address items 1 and 2. I think we have already shared the QA tutorial, but here is the link again: Quality Control
rnaSPAdes was successful, and is a better tool choice overall, but remember: a job that does not fail only means that nothing went wrong technically; it does not guarantee a scientifically sound result.
Start with the QA steps first. Keep the FastQC results from both before and after fastp in that same history; don’t delete them.
I got the history. Thank you! You can delete the link or keep it for some time, it is up to you.
I could not spot any obvious issue. The nucleotide composition changes along the read length, and the Phred quality scores use only a few distinct values. I submitted a couple of test jobs; they will take some time. @jennaj I am not familiar with the setup of the main server. What does this error mean?
ocean/projects/mcb140028p/xcgalaxy/main/staging/50953944/.cvmfsexec/mountrepo: line 70: cd: /ocean/projects/mcb140028p/xcgalaxy/main/staging/50953944/.cvmfsexec/dist/cvmfs/cvmfs-config.cern.ch/etc/cvmfs: No such file or directory
That error can come from this specific tool when it does not produce any output at all.
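The observation above that the Phred scores use only a few values can be checked directly: recent Illumina instruments bin quality scores into a handful of levels, which is what the FastQC per-base quality plot reflects. A minimal sketch (the quality string below is made up for illustration):

```python
# Count how many distinct Phred values a FASTQ quality string uses.
# Binned Illumina data typically shows only ~4 distinct levels.
def phred_values(qual: str, offset: int = 33):
    """Decode a Phred+33 quality string into sorted distinct scores."""
    return sorted({ord(c) - offset for c in qual})

# Illustrative binned quality string: '#'=Q2, '-'=Q12, '8'=Q23, 'F'=Q37
q = "#" * 3 + "-" * 5 + "8" * 10 + "F" * 20
print(phred_values(q))  # [2, 12, 23, 37]
```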
I think the reads need to be 5’ trimmed, then length filtered. Trimmomatic with the right settings will do that, or fastp.
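For clarity, a minimal sketch of what that trim-then-filter step does to a single read pair; fastp’s `--trim_front1`/`--trim_front2` and `--length_required` options (or Trimmomatic’s HEADCROP and MINLEN steps) perform the same operation at scale. The cutoff values here are illustrative, taken from the numbers discussed in this thread:

```python
TRIM_5P = 10   # bases to cut from the 5' end (per the FastQC report)
MIN_LEN = 25   # drop reads shorter than this after trimming

def trim_and_filter(seq: str, qual: str,
                    trim_5p: int = TRIM_5P,
                    min_len: int = MIN_LEN):
    """Cut trim_5p bases off the 5' end; return None if the read is too short."""
    seq, qual = seq[trim_5p:], qual[trim_5p:]
    if len(seq) < min_len:
        return None           # read would be discarded
    return seq, qual

# A 40 bp read survives (40 - 10 = 30 >= 25);
# a 30 bp read is dropped (30 - 10 = 20 < 25).
kept = trim_and_filter("A" * 40, "I" * 40)
dropped = trim_and_filter("A" * 30, "I" * 30)
```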
If just getting any result is the goal, that has already been achieved with the other tool. No idea how downstream tools will work on that output, though. At ORG, rnaSPAdes is more robust and will “do something” with reads that fail Trinity, e.g. reads with insufficient QA. Specifically, a lot of very short reads clogs up the Trinity assembly process and the tool dies due to the limited runtime memory available. We won’t be allocating more, or at least not now.
Before I sent my previous history, I had already used fastp to filter the data. Even then, when the fastp output files were used as the input for the Trinity analysis, the results still displayed in red.
To simplify and fully display the entire process, I created a new history and uploaded the two original sequencing files (forward and reverse reads) of one sample, used FastQC to check the quality of the data before and after running fastp, and left these results in the history. Based on the FastQC report and your suggestion, 5’ trimming was performed with fastp, deleting the first 10 bases of each read. I believe the cleaned data should meet the input requirements of Trinity. As a control, rnaSPAdes was also used to assemble the transcript reads. Unfortunately, the Trinity result showed red, while the rnaSPAdes result was blank (0 bytes). I shared the history of this analysis, as shown below:
I also have the impression that rnaSPAdes is more forgiving of the reads than Trinity.
I only have access to the Trinity job submitted on the untrimmed reads. The job was submitted with digital normalization, which should reduce the amount of data used for assembly, but I cannot check the requested memory on ORG.
Biased nucleotide composition at the 5’ ends is common for Illumina RNA-Seq data. I don’t know how much of a problem it causes.
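That 5’ bias is what FastQC’s per-base sequence content plot shows: the base composition at each cycle, skewed over roughly the first 10 positions by random-hexamer priming. A minimal sketch of that per-position count, with made-up reads for illustration:

```python
from collections import Counter

def per_position_composition(reads):
    """Return one Counter of observed bases per read position."""
    counts = []
    for seq in reads:
        for i, base in enumerate(seq):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    return counts

# Toy reads; real input would be thousands of FASTQ sequences
reads = ["ACGTAC", "ACGTTT", "AGGTAC"]
comp = per_position_composition(reads)
print(comp[0])  # first cycle: Counter({'A': 3})
```

Plotting each base’s fraction per position against read index reproduces the FastQC curve; a flat region after the biased head is what you want to keep.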
Dear @igor @jennaj,
I suspect that the reason the Trinity analyses keep failing is that the Galaxy platform allocates insufficient memory for each job. Trinity needs a huge amount of memory compared with other tools. This understanding comes from the paper “De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers”, GigaScience, 2019, 8:1–16.
Table 3 of this article is screenshotted below:
Something very interesting: as shown in the history (16), the rnaSPAdes analysis produced no sequences when each read was 5’ trimmed by 10 bp with fastp (all other parameters unchanged), but produced 25071 assembled sequences (21) when fastp’s default parameters were used.
Dear @igor ,
Thank you very much. I saw datasets #14 and #15 in the history. How can I choose an older version of the Trinity tool on the Galaxy platform?
I am currently splitting both files of a sample into four smaller files to reduce the memory required for the computation, and I hope this works.
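One caveat when splitting paired files: each chunk must contain whole FASTQ records (4 lines per record), and the forward and reverse files must be split the same way so mates stay in sync. A minimal sketch under those assumptions (file names and chunking scheme are hypothetical, not a specific Galaxy tool):

```python
def split_fastq(lines, n_chunks: int):
    """Split FASTQ lines into n_chunks lists of whole 4-line records."""
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    chunks = [[] for _ in range(n_chunks)]
    for idx, rec in enumerate(records):
        # Record idx goes to the same chunk in the R1 and R2 files,
        # so pairing is preserved as long as both files are in order.
        chunks[idx % n_chunks].extend(rec)
    return chunks

# 8 toy records (32 lines) split 4 ways -> 2 records per chunk
lines = [f"line{i}" for i in range(32)]
parts = split_fastq(lines, 4)
```

Applying the same function with the same `n_chunks` to both the forward and reverse files keeps each read pair in matching chunks.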
During job setup, click the three-blocks (Versions) icon at the top right corner of the middle panel and select any available version from the drop-down menu.
You can use the assembled transcripts and the transcript-to-gene map from the history I shared.
You can copy datasets from one history to another using “See histories side by side” (in the history menu).