de novo assembly using Trinity in Galaxy

Hi, I’m studying a plant that has no reference genome, only one scaffold-level assembly and one GFF3 annotation file. If I want to do de novo assembly in Galaxy, can I import the links to 20 fastq files from the RNA-seq of my samples together and construct the reference genome with Trinity? If this is possible, please help me import the links to these files into Galaxy and build the reference genome. Is there a maximum number of samples, or amount of data, that can be used for de novo assembly in Galaxy?

Thanks a lot

Welcome, @snt

In short, yes, you can assemble within Galaxy. Trinity is designed to assemble RNA-seq reads into a transcriptome assembly (not a genome). Genome assembly from WGS reads works best with smaller (prokaryotic) genomes when working at public Galaxy servers, due to resource limits (Unicycler is one tool choice for that purpose).

The reads need to be prepared and input properly. The upper limit of what can be assembled is usually based more on the read content and genome size/characteristics, rather than the volume/size of the read data. You won’t know until you try, to see how your data will assemble at any particular server (your own or a public site). The help below covers QA, input format/structures, and advice about downsampling as needed.

Also, keep in mind that some plant genomes can be quite complicated to assemble for biological reasons, and can require more resources (especially memory). Some will require employing advanced assembly methods. Much discussion about plant assembly and strategies can be found online with a few searches. Issues around large genomes in general, and sub-features like high repetitive content (example: Lettuce), high ploidy content (example: Strawberry), large chromosomes (example: Wheat) can all be factors to consider. But try the simple approach described below first, certainly can’t hurt.

Requirements/Best practices for Reference Transcriptome assembly using RNA-seq reads:

  1. Reads must be in fastqsanger format (Sanger Phred+33 quality scores, what results from Illumina 1.8+ sequencing) with that datatype assigned. If the data is in compressed fastqsanger.gz format, uncompressed versions of the inputs will be created as hidden datasets. This can unexpectedly increase quota usage. Sometimes uploading uncompressed fastq can be a better choice – it depends on the tools you plan to use – some work with compressed fastq and some do not (but Galaxy will “manage” that for you). Just something to be aware of.

  2. Load reads by URL or with FTP. If “autodetect” for datatype assignment is used, fastqsanger will be assigned automatically if the reads are actually in that format. If you get a different autodetected datatype, there is an input problem that needs to be addressed first (examples: fastqillumina == needs quality score rescaling OR just fastq == probably a format issue).

    • NOTE: If you decide to upload and directly assign fastqsanger.gz to preserve compression, you should definitely run the QA steps to verify that the datatype is assigned correctly, or expect problems. Assembly jobs can fail due to unexpected quality score scaling, and that failure will look like any other memory failure, so you would have to back up to the start, run QA, and fix the reads if that is the actual issue. If the data is actually uncompressed and assigned that datatype, any tool can also fail, and most but not all will report what the problem is. FastQC will guess the quality score scaling, which can be compared with the specific datatypes. See the Support FAQs to learn how to interpret the report and how to convert to fastqsanger if needed (tool: "Fastq Groomer").

    • Punch line: It is more satisfying/less frustrating to get the inputs correct at the start, so you don’t need to hunt for basic issues later on, possibly after much other work is already done. No one likes to “start over”.

  3. Run some QA on your reads: FastQC, followed by MultiQC to summarize the raw FastQC reports. Running FastQC on the individual original datasets, in pairs (R1+R2), will be more informative. This can be run in batch or by using Dataset Collection(s), if wanted.

  4. Next, run Trimmomatic if adaptor or low-quality ends are reported by FastQC. You may want to run this anyway to get your reads paired up and to catch content issues not found by FastQC (it only checks a subset of the data, the first 200k reads or so, and can be biased). There will be four outputs per paired-end input. Two will be the reads that are still paired after QA; the other two hold reads whose mate was discarded. This is important – Trinity requires paired-end fastq inputs to be in intact pairs and tends to run better (not fail for resource reasons) with cleaned-up reads.

  5. Use Concatenate to merge all R1 (forward) reads into one dataset and all R2 (reverse) reads into another. This takes two distinct runs, with the input datasets in the same order both times so the pairs stay aligned.

  6. If the Trinity run fails for exceeding resources, then subsampling your reads can help. You may decide to do this per-pair or after concatenating (after QA is done) – your choice. See the Seqtk tools: seqtk_sample or seqtk_seq are good choices. The first is a specific function, the latter more comprehensive with that function included. In most cases, you’ll need to run the tool twice (once to output R1 forward reads, once to output R2 reverse reads). Be conservative first, to preserve as much of the original data as possible, then move to more restrictive as needed.

  7. If the data will not assemble at all, then one of these is going on: a) some input problem is present, b) not enough QA was done, c) the data exceeds the processing resources that a public Galaxy server, or your own, has allocated. Public resources are fixed for most tools, including Trinity. A cloud Galaxy may be more appropriate.
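
To make the datatype checks in steps 1 and 2 more concrete, here is a minimal Python sketch of the kind of guess FastQC makes about quality score scaling. It is not how FastQC or Galaxy actually implement it. The function names, the return labels, and the cutoffs (ASCII 59, 64 and 74) are my own assumptions for illustration, and the result is only a heuristic: very clean Phred+33 data can come back as "ambiguous".

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, handling optional gzip compression."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def guess_quality_offset(path, max_reads=200_000):
    """Guess the quality encoding from the range of quality characters.

    Hypothetical helper. Characters below ';' (ASCII 59) can only occur
    in Phred+33 (fastqsanger) data. If nothing is below '@' (ASCII 64)
    and something is above 'J' (ASCII 74), the data is probably Phred+64
    and needs rescaling before assembly.
    """
    lowest, highest = None, None
    with open_fastq(path) as handle:
        # Quality strings are every 4th line, starting at line index 3.
        for qual in islice(handle, 3, max_reads * 4, 4):
            qual = qual.rstrip("\r\n")
            if not qual:
                continue
            codes = [ord(ch) for ch in qual]
            lowest = min(codes) if lowest is None else min(lowest, min(codes))
            highest = max(codes) if highest is None else max(highest, max(codes))
    if lowest is None:
        return "empty"
    if lowest < 59:
        return "phred33"
    if lowest >= 64 and highest > 74:
        return "phred64"
    return "ambiguous"
```

If this reports anything other than "phred33" for a dataset you assigned fastqsanger to, that is a hint to go back to the FastQC report before launching Trinity.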

Be aware that if you combined the data into a Dataset Collection (R1 forward collection + R2 reverse collection, or in an R1+R2 paired-end collection) at the start, you’ll need to use Collection Operations tools to manipulate the read data to fit the tools (example: Collapse Collection will perform the same manipulation as Concatenate). See the Dataset Collection “Collection Operations” tutorials linked below to understand how to work with collections. Help is also on the tool forms. With so many inputs, Collections will be very useful, but also a bit trickier to use if you haven’t tried them yet. That said, these are definitely worth learning how to use, now or later. Often a mix of paired-end and single-end collections, used at different steps, is needed to work through an analysis.
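
For anyone who wants to picture what Concatenate (or Collapse Collection) needs to achieve, here is a rough Python sketch of the same idea outside Galaxy. It assumes uncompressed files named like `sampleA_R1.fastq` / `sampleA_R2.fastq`; that naming scheme, the `merge_pairs` function, and the output file names are all hypothetical. The point is only that R1 and R2 files must be appended in the same sample order, otherwise the merged files no longer hold intact pairs.

```python
import shutil
from pathlib import Path


def concatenate(parts, destination):
    """Append each file in `parts`, in the given order, to `destination`."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def merge_pairs(read_dir, out_dir, r1_tag="_R1", r2_tag="_R2"):
    """Merge all R1 files and all R2 files in matching sample order.

    Assumes names ending in <tag>.fastq; use an out_dir different from
    read_dir so the merged files are not picked up as inputs on a rerun.
    """
    read_dir, out_dir = Path(read_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    r1_suffix = r1_tag + ".fastq"
    r1_files = sorted(read_dir.glob("*" + r1_suffix))
    # Derive each R2 name from its R1 mate so both lists share one order.
    r2_files = [
        p.with_name(p.name[: -len(r1_suffix)] + r2_tag + ".fastq")
        for p in r1_files
    ]
    missing = [p.name for p in r2_files if not p.is_file()]
    if missing:
        raise FileNotFoundError("no R2 mate for: " + ", ".join(missing))
    concatenate(r1_files, out_dir / "all_R1.fastq")
    concatenate(r2_files, out_dir / "all_R2.fastq")
    return [p.name for p in r1_files]
```

Inside Galaxy the equivalent safeguard is simply to select the datasets (or build the collections) in the same order for the forward and the reverse run.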

Tutorials & Support FAQs & Related:

  • GTN tutorials: https://training.galaxyproject.org/ (all)
    • Start with those in the groups “Assembly”, “Transcriptomics”, and “Data Manipulation” (collection manipulations are covered).
  • Support FAQs: https://galaxyproject.org/support/#troubleshooting
    • Start with the first few in “Unexpected Results” and “Getting Inputs Right” to understand format, datatypes, error messages (and what to do about each!), etc. See “Loading Data” if you are not sure how to use FTP (usually only needed with slower internet connections, but is fast/simple for batch uploads too).
  • FastQC is covered in many tutorials, but reading the tool’s FAQs about specific statistics reported is informative, see: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (Docs & Example reports)

I also added some tags to your post. Clicking on any may provide more clues about what may be going wrong if you run into trouble.

Thanks!

Hi
Thank you for replying to my email
I have many questions, but most importantly: can I upload about 50 GB of data? Does Galaxy allow this? If not, what is the maximum size of data I can import?
Best regards

Hi @snt

The maximum size per uploaded dataset is 50 GB.

However, if those are the reads that you intend to assemble, then Trinity will probably fail for exceeding resources. It depends somewhat on the read content, but the maximum size for an input fastq file for the forward and reverse (each) is around 12-15 GB (uncompressed).

It sounds like you will need to downsample your reads as described above before assembling. You may need to do that in stages, to stay under the 250 GB default account quota at usegalaxy.org. If you are an academic (and can verify that with your primary academic email address), increasing your quota is possible.

Be aware that more quota space to store data is unrelated to the memory used during any job execution. Meaning, if your data is very large, the extra quota space can help with manipulating data, but it needs to be within the range the public server can process before attempting an assembly job (and other compute-intensive operations). Since it sounded as if your data was in smaller files, and combined it is 50 GB, this may not be a problem (yet). Once QA is done in individual smaller datasets, then review that combined size and make decisions about downsampling after.
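
As a rough illustration of that downsampling decision, here is a small Python sketch. It is not what the Seqtk tools do internally; `keep_fraction` and `subsample_pairs` are hypothetical helpers, and the sizes in the usage note are made up. It shows two things: how to turn "current size vs. target size" into a fraction of reads to keep, and why forward and reverse reads must be sampled with the same random choices (with seqtk, run twice, that generally means using the same seed for both runs).

```python
import random


def keep_fraction(current_bytes, target_bytes):
    """Fraction of reads to keep so one direction shrinks to the target size."""
    if current_bytes <= 0:
        raise ValueError("current_bytes must be positive")
    return min(1.0, target_bytes / current_bytes)


def read_records(handle):
    """Yield 4-line FASTQ records from an open, uncompressed file."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return
        yield record


def subsample_pairs(r1_path, r2_path, out1_path, out2_path, fraction, seed=11):
    """Keep the same randomly chosen pairs from both files.

    One random draw decides the fate of a whole pair, so R1 and R2
    stay in sync. Returns the number of pairs written.
    """
    rng = random.Random(seed)
    kept = 0
    with open(r1_path) as r1, open(r2_path) as r2, \
            open(out1_path, "w") as out1, open(out2_path, "w") as out2:
        for rec1, rec2 in zip(read_records(r1), read_records(r2)):
            if rng.random() < fraction:
                out1.writelines(rec1)
                out2.writelines(rec2)
                kept += 1
    return kept
```

For example, if the merged forward reads came to 25 GB uncompressed (an assumed figure) and you aimed for the lower end of the range above, `keep_fraction(25e9, 12e9)` gives 0.48, so roughly half of the pairs would be kept. Start with the largest fraction you think might run, and only go lower if the assembly still fails.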

FAQ at the Galaxy Hub: https://galaxyproject.org/support/account-quotas/

(You can ignore warnings about the “unsecure connection” – our cert for the Galaxy Hub server just expired about 3 hours ago! Or, you can wait … we’ll be updating it as quickly as possible. Your choice!)

hi
thank you so much

Best regards