de novo assembly using Trinity in Galaxy

Welcome, @snt

In short, yes, you can assemble within Galaxy. Trinity is designed to assemble RNA-seq reads into a transcriptome assembly (not a genome assembly). Genome assembly from WGS reads works best with smaller (prokaryotic) genomes when working at public Galaxy servers, due to resource limits (Unicycler is one tool choice for that purpose).

The reads need to be prepared and input properly. The upper limit of what can be assembled is usually determined more by the read content and the genome’s size/characteristics than by the volume/size of the read data. You won’t know how your data will assemble at any particular server (your own or a public site) until you try. The help below covers QA, input format/structures, and advice about downsampling as needed.

Also, keep in mind that some plant genomes can be quite complicated to assemble for biological reasons and can require more resources (especially memory). Some will require advanced assembly methods. Much discussion about plant assembly strategies can be found online with a few searches. Issues around large genomes in general, and sub-features like high repetitive content (example: Lettuce), high ploidy (example: Strawberry), and large chromosomes (example: Wheat), can all be factors to consider. But try the simple approach described below first; it certainly can’t hurt.

Requirements/Best practices for Reference Transcriptome assembly using RNA-seq reads:

  1. Reads must be in fastqsanger format (Sanger Phred+33 quality scores, which is what results from Illumina 1.8+ sequencing) with that datatype assigned. If the data is in compressed fastqsanger.gz format, uncompressed versions of the inputs will be created as hidden datasets. This can unexpectedly increase quota usage. Sometimes uploading uncompressed fastq can be a better choice – it depends on the tools you plan to use – some work with compressed fastq and some do not (but Galaxy will “manage” that for you). Just something to be aware of.

  2. Load reads by URL or with FTP. If “autodetect” for datatype assignment is used, fastqsanger will be assigned automatically if the reads are actually in that format. If you get a different autodetected datatype, there is an input problem that needs to be addressed first (examples: fastqillumina == needs quality score rescaling, or just fastq == probably a format issue).

    • NOTE: If you decide to upload and directly assign fastqsanger.gz to preserve compression, you should definitely run the QA steps to verify that the datatype is assigned correctly, or expect problems. Assembly jobs that fail due to unexpected quality score scaling will look like any other memory failure, and you will need to go back to the start, run QA, and fix the reads if that is the actual issue. If the data is actually uncompressed but assigned the compressed datatype, any tool can also fail, and most but not all will report what the problem is. FastQC will guess the quality score scaling, which can be compared with the specific datatypes (a minimal sketch of that kind of encoding check is included after this list). See the Support FAQs to learn how to interpret the report and how to convert to fastqsanger if needed (tool: “Fastq Groomer”).

    • Punch line: It is more satisfying/less frustrating to get the inputs correct at the start, so you don’t need to hunt for basic issues later on, possibly after much other work is already done. No one likes to “start over”.

  3. Run some QA on your reads: FastQC followed by MultiQC (to summarize the raw FastQC reports). Running FastQC on the individual original datasets, in pairs (R1+R2), will be more informative. This can be run in batch or by using Dataset Collection(s), if wanted.

  4. Next, run Trimmomatic if adaptor contamination or low-quality ends are reported by FastQC. You may want to run this anyway to get your reads paired up and to catch content issues not found by FastQC (which only checks a subset of the data, roughly the first 200k reads, and can be biased). There will be four outputs per paired-end input. Two will be the reads that are still paired after QA. This is important – Trinity requires paired-end fastq inputs to be in intact pairs (see the pair-check sketch after this list) and tends to run better (not fail for resource reasons) with cleaned-up reads.

  5. Use Concatenate to merge all R1 (forward) reads into one dataset and all R2 (reverse) reads into another (two distinct runs of the tool).

  6. If the Trinity run fails for exceeding resources, then subsampling your reads can help. You may decide to do this per-pair or after concatenating (once QA is done) – your choice. See the Seqtk tools: seqtk_sample or seqtk_seq are good choices. The first is a specific function; the latter is more comprehensive, with that function included. In most cases, you’ll need to run the tool twice (once to output R1 forward reads, once to output R2 reverse reads; using the same seed in both runs keeps the pairs in sync, as shown in the subsampling sketch after this list). Be conservative at first, to preserve as much of the original data as possible, then subsample more aggressively as needed.

  7. If the data will not assemble at all, then one of these is going on: a) some input problem is present, b) not enough QA was done, or c) the data exceeds the processing resources that a public Galaxy server, or your own, has allocated. Public resources are fixed for most tools, including Trinity. A cloud Galaxy may be more appropriate.
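For anyone who wants to sanity-check the quality score scaling outside of Galaxy, here is a minimal Python sketch of the same basic idea FastQC uses when it guesses the encoding: inspect the ASCII range of the quality characters. The file name is a placeholder, and this is only an illustration of why fastqsanger (Phred+33) and fastqillumina (Phred+64) are distinguishable, not a replacement for the FastQC report.

```python
# Minimal sketch: guess whether a FASTQ file is Phred+33 (fastqsanger) or
# Phred+64 (fastqillumina) by looking at the ASCII range of the quality lines.
import gzip

def guess_quality_offset(path, max_reads=10000):
    """Scan quality lines and report the likely Phred offset."""
    opener = gzip.open if path.endswith(".gz") else open
    lo, hi = 255, 0
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:  # every 4th line of a FASTQ record is the quality string
                codes = [ord(c) for c in line.rstrip("\n")]
                if codes:
                    lo = min(lo, min(codes))
                    hi = max(hi, max(codes))
    if lo < 59:
        return "Phred+33 (fastqsanger)"  # ASCII below 59 only occurs with the +33 offset
    if hi > 74:
        return "likely Phred+64 (fastqillumina); rescale before assembly"
    return "ambiguous; check more reads or trust the FastQC report"

print(guess_quality_offset("reads_R1.fastq.gz"))  # placeholder file name
```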
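And here is a rough sketch of the “intact pairs” idea from step 4: compare read IDs between the R1 and R2 files, record by record. File names are placeholders (the two “paired” outputs from Trimmomatic should already satisfy this inside Galaxy), so treat it as a way to double-check data prepared elsewhere.

```python
# Minimal sketch: check that R1/R2 FASTQ files contain the same read IDs
# in the same order, which Trinity expects for paired-end input.
import gzip
from itertools import zip_longest

def read_ids(path):
    """Yield the read ID from each FASTQ header, without any /1 or /2 suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:  # header line of each 4-line record
                name = line[1:].split()[0]
                if name.endswith("/1") or name.endswith("/2"):
                    name = name[:-2]
                yield name

def pairs_intact(r1_path, r2_path):
    """True if both files have the same read IDs in the same order."""
    return all(a == b for a, b in zip_longest(read_ids(r1_path), read_ids(r2_path)))

# Placeholder file names for the two "paired" Trimmomatic outputs.
print(pairs_intact("trimmed_R1_paired.fastq", "trimmed_R2_paired.fastq"))
```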
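Finally, a sketch of the seed-matched downsampling idea behind running seqtk_sample twice with the same seed (step 6). The fraction, seed, and file names are placeholders and the real tool will be faster; the point is only that re-using the same seed on R1 and R2 keeps the surviving reads paired.

```python
# Minimal sketch of seed-matched downsampling for paired reads: the same
# random decisions are made for R1 and R2 because the seed is identical.
import gzip
import random

def subsample_fastq(in_path, out_path, fraction, seed):
    """Keep roughly `fraction` of the records, chosen by a seeded RNG."""
    rng = random.Random(seed)  # identical seed on R1 and R2 -> identical picks
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # one 4-line FASTQ record
            if not record[0]:
                break  # end of file
            if rng.random() < fraction:
                fout.writelines(record)

# Run twice with the SAME seed so the surviving R1/R2 records stay paired.
subsample_fastq("all_R1.fastq", "sub_R1.fastq", fraction=0.25, seed=42)
subsample_fastq("all_R2.fastq", "sub_R2.fastq", fraction=0.25, seed=42)
```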

Be aware that if you combined the data into a Dataset Collection (R1 forward collection + R2 reverse collection, or in an R1+R2 paired-end collection) at the start, you’ll need to use Collection Operations tools to manipulate the read data to fit the tools (example: Collapse Collection will perform the same manipulation as Concatenate). See the Dataset Collection “Collection Operations” tutorials linked below to understand how to work with collections. Help is also on the tool forms. With so many inputs, Collections will be very useful, but also a bit trickier to use if you haven’t tried them yet. That said, these are definitely worth learning how to use, now or later. Often a mix of paired-end and single-end collections, used at different steps, is needed to work through an analysis.

Tutorials & Support FAQs & Related:

  • GTN tutorials: https://training.galaxyproject.org/ (all)
    • Start with those in the groups “Assembly”, “Transcriptomics”, and “Data Manipulation” (collection manipulations are covered).
  • Support FAQs: https://galaxyproject.org/support/#troubleshooting
    • Start with the first few in “Unexpected Results” and “Getting Inputs Right” to understand format, datatypes, error messages (and what to do about each!), etc. See “Loading Data” if you are not sure how to use FTP (usually only needed with slower internet connections, but it is fast/simple for batch uploads too).
  • FastQC is covered in many tutorials, but reading the tool’s FAQs about specific statistics reported is informative, see: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (Docs & Example reports)

I also added some tags to your post. Clicking on any may provide more clues about what may be going wrong if you run into trouble.

Thanks!
