de novo assembly using Trinity in Galaxy

Hi, I’m studying a plant that has no reference genome and has only one scaffold assembly and one GFF3 annotation file. If I want to do de novo assembly using Galaxy, can I import the links of 20 FASTQ files from the RNA-seq of my samples together to construct the reference genome with Trinity? If this is possible, please help me import the links of these files together into Galaxy and make the reference genome. Also, what is the maximum number of samples, or amount of data, that can be used for de novo assembly in Galaxy?

Thanks a lot


Welcome, @snt

In short, yes, you can assemble within Galaxy. Trinity is designed to assemble RNA-seq reads into a transcriptome assembly (not a genome assembly). Genome assembly from WGS reads works best with smaller (prokaryotic) genomes when working at public Galaxy servers, due to resource limits (Unicycler is one tool choice for that purpose).

The reads need to be prepared and input properly. The upper limit of what can be assembled is usually based more on the read content and genome size/characteristics than on the volume/size of the read data. You won’t know how your data will assemble at any particular server (your own or a public site) until you try. The help below covers QA, input format/structures, and advice about downsampling as needed.
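To the question about importing the 20 links together: besides pasting them into the Upload tool’s “Paste/Fetch data” box, batch imports can also be scripted against the Galaxy API. Below is a minimal, hedged sketch using the BioBlend library; the server URL, API key, URLs, and the pre-assigned file_type are placeholders/assumptions to adapt to your own account and data.

```python
# Hedged sketch (assumptions flagged): batch-import fastq URLs into a new
# Galaxy history via BioBlend. Requires `pip install bioblend` and an API
# key from your account preferences. URLs and names below are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")
history = gi.histories.create_history(name="trinity-input-reads")

read_urls = [
    "https://example.org/sample01_R1.fastq.gz",  # placeholder links --
    "https://example.org/sample01_R2.fastq.gz",  # substitute your 20 URLs
]

for url in read_urls:
    # put_url() asks Galaxy to fetch the URL server-side; file_type
    # pre-assigns the datatype (verify it afterwards, per the QA notes below)
    gi.tools.put_url(url, history_id=history["id"], file_type="fastqsanger.gz")
```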

Also, keep in mind that some plant genomes can be quite complicated to assemble for biological reasons and can require more resources (especially memory). Some will require employing advanced assembly methods. Much discussion about plant assembly strategies can be found online with a few searches. Issues around large genomes in general, and sub-features like high repetitive content (example: Lettuce), high ploidy (example: Strawberry), or large chromosomes (example: Wheat), can all be factors to consider. But try the simple approach described below first; it certainly can’t hurt.

Requirements/Best practices for Reference Transcriptome assembly using RNA-seq reads:

  1. Reads must be in fastqsanger format (Sanger Phred+33 quality scores, which is what results from Illumina 1.8+ sequencing) with that datatype assigned. If the data is in compressed fastqsanger.gz format, uncompressed versions of the inputs will be created as hidden datasets, which can unexpectedly increase quota usage. Sometimes uploading uncompressed fastq can be a better choice; it depends on the tools you plan to use, since some work with compressed fastq and some do not (but Galaxy will “manage” that for you). Just something to be aware of.

  2. Load reads by URL or with FTP. If “autodetect” is used for datatype assignment, fastqsanger will be assigned automatically if the reads are actually in that format. If you get a different autodetected datatype, there is an input problem that needs to be addressed first (examples: fastqillumina == needs quality score rescaling; plain fastq == probably a format issue).

    • NOTE: If you decide to upload and directly assign fastqsanger.gz to preserve compression, definitely run the QA steps to verify that the datatype is assigned correctly, or expect problems. Assembly jobs failing due to unexpected quality score scaling will look like any other memory failure and will require backing up to the start, running QA, and fixing the reads if that is the actual issue. If the data is actually uncompressed but assigned a compressed datatype, any tool can also fail, and most, but not all, will report what the problem is. FastQC will guess the quality score scaling, and that guess can be compared against the assigned datatype (see the encoding-check sketch after this list). See the Support FAQs to learn how to interpret the report and how to convert to fastqsanger if needed (tool: Fastq Groomer).

    • Punch line: It is more satisfying/less frustrating to get the inputs correct at the start, so you don’t need to hunt for basic issues later on, possibly after much other work is already done. No one likes to “start over”.

  3. Run some QA on your reads: FastQC followed by MultiQC (to summarize the raw FastQC reports). Running FastQC on the individual original datasets, in pairs (R1+R2), will be more informative. This can be run in batch or by using Dataset Collection(s), if wanted.

  4. Next, run Trimmomatic if adapter content or low-quality ends are reported by FastQC. You may want to run it anyway to get your reads paired up and to catch content issues not found by FastQC (which only checks a subset of the data, the first 200k reads or so, and can be biased). There will be four outputs per paired-end input; two will be the reads that are still paired after QA. This is important: Trinity requires paired-end fastq inputs to be in intact pairs (see the pairing spot-check sketch after this list) and tends to run better (i.e., not fail for resource reasons) with cleaned-up reads.

  5. Use Concatenate to merge all R1 (forward) reads into one dataset and all R2 (reverse) reads into another: two distinct runs of the tool.

  6. If the Trinity run fails for exceeding resources, then subsampling your reads can help. You may decide to do this per-pair or after concatenating (once QA is done); your choice. See the Seqtk tools: seqtk_sample or seqtk_seq are good choices. The first is a single specific function; the latter is more comprehensive, with that function included. In most cases, you’ll need to run the tool twice (once for the R1 forward reads, once for the R2 reverse reads), using the same seed so the pairs stay synchronized (see the downsampling sketch after this list). Subsample conservatively first, to preserve as much of the original data as possible, then move to more restrictive fractions as needed.

  7. If the data will not assemble at all, then one of these is going on: a) some input problem is present, b) not enough QA was done, or c) the data exceeds the resources that a public Galaxy server, or your own, has allocated. Resources are fixed for most tools at public servers, including Trinity. A cloud Galaxy may be more appropriate.
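For steps 1-2, here is a quick way to sanity-check the quality score scaling yourself. It is a minimal sketch of the usual heuristic (similar in spirit to what FastQC does when guessing the encoding), not a Galaxy tool; the filename is a placeholder and the thresholds are rules of thumb rather than a guarantee.

```python
# Minimal sketch of an encoding guess: scan quality lines from the first
# records and inspect the lowest ASCII code seen. Phred+33 ("fastqsanger")
# starts at '!' (ASCII 33); old Phred+64 ("fastqillumina") starts at '@'
# (ASCII 64).
import gzip

def guess_encoding(path, max_records=200_000):
    opener = gzip.open if path.endswith(".gz") else open
    lo = 255
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i >= 4 * max_records:
                break
            if i % 4 == 3:                       # every 4th line is quality
                codes = [ord(c) for c in line.rstrip("\n")]
                if codes:
                    lo = min(lo, min(codes))
    if lo < 59:
        return "fastqsanger (Phred+33)"          # chars below ';' rule out +64
    if lo >= 64:
        return "likely fastqillumina (Phred+64) -- rescale before Trinity"
    return "ambiguous -- run FastQC and compare"

print(guess_encoding("sample01_R1.fastq.gz"))    # placeholder filename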
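And since step 4 stresses intact pairs, below is a hedged spot-check that trimmed R1/R2 files are still synchronized. It assumes standard fastq with one record per four lines; filenames are placeholders.

```python
# Minimal sketch: compare the first N read IDs of R1 vs R2 record by record.
# Trinity expects intact pairs, so any mismatch means re-pairing is needed.
import gzip, re
from itertools import islice

def first_ids(path, n=1000):
    with gzip.open(path, "rt") as fh:
        headers = islice(fh, 0, 4 * n, 4)        # every 4th line is a header
        # drop a trailing /1 or /2 mate suffix so base IDs are comparable
        return [re.sub(r"/[12]$", "", h.split()[0]) for h in headers]

r1 = first_ids("trimmed_R1.fastq.gz")            # placeholder filenames
r2 = first_ids("trimmed_R2.fastq.gz")
print("pairs look synced" if r1 == r2 else "MISMATCH -- re-pair before Trinity")
```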
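For step 6, this is roughly what `seqtk sample -s100` does for paired-end data, sketched in plain Python so the seed logic is visible: the same seed must be used for R1 and R2 so the same records are kept from each file. The fraction and filenames are placeholders.

```python
# Minimal sketch of seeded downsampling for paired-end fastq: an identical
# RNG seed makes identical keep/drop decisions for R1 and R2, so mates
# stay paired (this is why seqtk's -s seed must match across both runs).
import gzip, random

def sample_fastq(in_path, out_path, fraction, seed=100):
    rng = random.Random(seed)                    # same seed for R1 and R2
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:                    # end of file
                break
            if rng.random() < fraction:          # keep ~fraction of records
                fout.writelines(record)

sample_fastq("all_R1.fastq.gz", "all_R1.sub.fastq.gz", fraction=0.25)
sample_fastq("all_R2.fastq.gz", "all_R2.sub.fastq.gz", fraction=0.25)
```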

Be aware that if you combined the data into a Dataset Collection (an R1 forward collection plus an R2 reverse collection, or an R1+R2 paired-end collection) at the start, you’ll need to use the Collection Operations tools to manipulate the read data to fit the tools (example: Collapse Collection performs the same manipulation as Concatenate). See the Dataset Collection “Collection Operations” tutorials linked below to understand how to work with collections; help is also on the tool forms. With so many inputs, collections will be very useful, but also a bit trickier to use if you haven’t tried them yet. That said, they are definitely worth learning, now or later. Often a mix of paired-end and single-end collections, used at different steps, is needed to work through an analysis.

Tutorials & Support FAQs & Related:

  • GTN tutorials: https://training.galaxyproject.org/ (all)
    • Start with those in the groups “Assembly”, “Transcriptomics”, and “Data Manipulation” (collection manipulations are covered).
  • Support FAQs: https://galaxyproject.org/support/#troubleshooting
    • Start with the first few in “Unexpected Results” and “Getting Inputs Right” to understand formats, datatypes, error messages (and what to do about each!), etc. See “Loading Data” if you are not sure how to use FTP (usually only needed with slower internet connections, but it is fast/simple for batch uploads too).
  • FastQC is covered in many tutorials, but reading the tool’s FAQs about specific statistics reported is informative, see: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (Docs & Example reports)

I also added some tags to your post. Clicking on any may provide more clues about what may be going wrong if you run into trouble.

Thanks!


Hi,
Thank you for replying to my question.
I have many questions, but the most important one: can I upload a volume of about 50 GB? Does Galaxy allow this? If not, what is the maximum size of data I can import?
Best regards


Hi @snt

The maximum size per uploaded dataset is 50 GB.

However, if those are the reads that you intend to assemble, then Trinity will probably fail for exceeding resources. It depends somewhat on the read content, but the maximum size for an input fastq file for the forward and reverse reads (each) is around 12-15 GB (uncompressed).

It sounds like you will need to downsample your reads as described above before assembling. You may need to do that in stages, to stay under the 250 GB default account quota at usegalaxy.org. If you are an academic (and can verify that with your primary academic email address), increasing your quota is possible.

Be aware that extra quota space to store data is unrelated to the memory available during job execution. Meaning: if your data is very large, the extra quota space can help with manipulating the data, but it still needs to be within the range the public server can process before you attempt an assembly job (or other compute-intensive operations). Since it sounds as if your data is in smaller files that combined total 50 GB, this may not be a problem (yet). Once QA is done on the individual smaller datasets, review the combined size and make downsampling decisions then (see the size-estimate sketch below).
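If you want a rough number before committing to an assembly attempt, the sketch below estimates the uncompressed size of a fastq.gz file. It is an approximation under stated assumptions: gzip’s footer only stores the size modulo 4 GB, so instead it decompresses a ~64 MB sample and extrapolates from the observed compression ratio. Filenames are placeholders.

```python
# Minimal sketch: estimate uncompressed size of a fastq.gz by measuring the
# compression ratio on the first ~64 MB and extrapolating. The gzip footer
# (ISIZE) is not used because it wraps at 4 GB, far below these file sizes.
import gzip, os

def estimated_uncompressed_gb(path, probe=64 * 1024 * 1024):
    total_compressed = os.path.getsize(path)
    with open(path, "rb") as raw:
        with gzip.GzipFile(fileobj=raw) as gz:
            sample = gz.read(probe)              # decompress a leading sample
        consumed = raw.tell()                    # compressed bytes consumed
    ratio = len(sample) / max(consumed, 1)       # decompressed : compressed
    return total_compressed * ratio / 1e9

for name in ("all_R1.fastq.gz", "all_R2.fastq.gz"):   # placeholder names
    print(name, round(estimated_uncompressed_gb(name), 1), "GB (estimate)")
```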

FAQ at the Galaxy Hub: https://galaxyproject.org/support/account-quotas/

(You can ignore warnings about the “unsecure connection” – our cert for the Galaxy Hub server just expired about 3 hours ago! Or, you can wait … we’ll be updating it as quickly as possible. Your choice!)

Hi,
Thank you so much.

Best regards

Hi @jennaj
I have gone through the FastQC tutorials and the links you mentioned for input formats, the NCBI SRA download, and Trinity, but I still have the following doubts.
For access through SRA I have two sets of data: the archive data (where a link is available to download) and the original data, where I have three pairs of files (as Read1 and Read2).
Can I go directly to FastQC, Trimmomatic, and MultiQC followed by Trinity with the data I get through the SRA accession, or do I have to work with the read pairs? Can you explain what the difference would be between going with the archive data versus the original data?
When I performed FastQC before and after Trimmomatic, as you suggested for one of my earlier queries, I was able to see the difference. But when I feed the FastQC/Trimmomatic output into Trinity, I end up with an error. If I can’t use the output from those tools, what is the point of running them (just to know the quality of the data)?

Thank you so much for any help.


Trimmomatic makes changes to your data. Cleaner reads assemble better.

It is possible that the job is running out of memory. This isn’t always clear in the error message (so I won’t bother asking for it). Clicking on the “i” icon for the error dataset, then on the stderr and stdout links on that Job Details form, will often show where the job failed, but not always.

If you are attempting to assemble 50 GB of data in one assembly, that is almost certainly triggering a resource problem. The largest successful inputs (combined) that I have seen are around 15-22 GB in size, and those had been run through QA steps.

Downsampling reads is one choice. Moving to your own Galaxy that can work with larger data is another. The prior Q&A above has the details and options that will (probably) also apply to your case: de novo assembly using Trinity in Galaxy - #2 by jennaj

Hi @jennaj,
I have tried all possible ways to import my data (the archive data, the original data, and via the SRA number). The archive data link only downloads in SRA format, which I can’t open in Galaxy, and through the SRA number the files I uploaded (each, compressed and uncompressed) were double the size of all three pairs of data in the original format (when I downloaded them through the links).
As for my earlier question about the difference between the archive data and the original data: what I noticed was the size of the data.
For some reason I couldn’t get the fastqsanger.gz format when uploading the input for Trinity from the file I got through “Download and Extract Reads in FASTA/Q format” from NCBI SRA.
Somehow that is not the case with the files from the “original data” (Read1 and Read2), even though the format is the same (fastqsanger.gz). This means the Trimmomatic outputs are available as input for Trinity (in fastqsanger.gz format).
I am happy to say that my problem is cleared up for now. Thank you so much for your support.

Thank you