16S V3/V4 database needed

Hi there,
I need the reference SILVA database for the V3/V4 region, already trimmed.
Can anyone help?
Thank you.

Hi @biologisthurkan

Let’s combine your two questions. Xref → File size limit exceeded during align.seqs

You are working at UseGalaxy.org, correct? And did the reference data work for you before at this same server, or at a different one?

  • If this exact same database worked with different data, and you are working at UseGalaxy.org, then the job is likely exceeding computational resources at the server due to some difference in the query. That could be size, but also read-composition factors. You could try a server that can scale a bit larger – UseGalaxy.eu is a good choice.

    Transferring data between servers can be done by URL, as individual files, or as an entire history (consider simplifying that history first for faster compression and transfer: copy the important files into a new, simple history, then transfer that). See the scripted sketch after this list.

  • If you are working somewhere else, I would still try at UseGalaxy.eu to see what happens. You would at least learn whether the problem is the target database or the query.

  • If you think that the reference has a problem, do you want to share where you sourced it for feedback? You could also share what that data looks like in a history, with a small test query, so the bug report captures the error details. How to share your work is in the banner at this forum, and also here → How to get faster help with your question. I’ll try to help, but we can also bring in other scientists in our community who work with this data more often – they’ll appreciate the context (chat link).
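
If you prefer to script the transfer, a minimal sketch with BioBlend might look like this (assuming you have an API key on each server; the history name, dataset ID, and URL form below are placeholders to adapt):

```python
# Minimal sketch: copy one dataset from UseGalaxy.org to UseGalaxy.eu
# by pasting its download URL into the destination server.
# Assumes bioblend is installed and both API keys are yours.
from bioblend.galaxy import GalaxyInstance

eu = GalaxyInstance(url="https://usegalaxy.eu", key="EU_API_KEY")

# Create a fresh, simple history on the destination server.
hist = eu.histories.create_history(name="transferred-from-org")

# The URL is the dataset's download/share link from the source server
# (it must be reachable without a login, e.g. from a shared history).
dataset_url = "https://usegalaxy.org/api/datasets/DATASET_ID/display"
eu.tools.put_url(dataset_url, hist["id"])
```

The same paste-by-URL step is available interactively via the Upload tool’s “Paste/Fetch data” option, if you’d rather click through it.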

Meanwhile, I am going to look up the current resource allocation for Align.seqs at these two servers to confirm/compare that everything is as expected.

Hope this helps and please let us know how this works out or if you need more help! :slight_smile:

Hi Jenna,
Thanks for your interest.

  1. I am using https://metagenomics.usegalaxy.eu/ since it is specially designed for metagenomics; it is part of UseGalaxy.eu. Should I still transfer my history to the general EU server, without the metagenomics focus?

  2. Frankly, I am not sure what kind of file I need as a reference, so I used exactly the same file that I’ve used before.
    Here is the link to my history: Galaxy
    I tried datasets 80, 83, and 112 as references: 80 is the one I’ve worked with before, 83 covers only the V4 region, and 112 is the one recommended by Mothur.

Hi @biologisthurkan

Thanks for sharing the history and explaining what is in it. I played around a bit and confirmed your results. This is all technical feedback – you can reach out at the Micro-Galaxy chat I linked above to reach the scientific domain experts!

  1. Your samples have high coverage relative to the targets. If downsampled, all three target databases process OK.

  2. The full-size sample will process with your V4 region in dataset 83 as the target, but will fail with the other two targets (datasets 80 and 112). You could explore this more. I can see that you started doing that by comparing targets 83 and 112; you could also check which contigs do or do not capture a hit against target 83, then run the “not” subset against the other targets to see what happens.

  3. Downsampling the query by a factor of 10 will allow all three targets to process. Maybe compare the full query and a subsampled query against target 83 to estimate how much you can downsample without losing valuable information? I just used 10 as a test.

If you want to downsample, do this at the start with the tool Sub-sample sequences files.
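
For intuition, the logic is roughly this (a minimal sketch only, not the Galaxy tool’s actual code; the file names, fraction, and seed are placeholders):

```python
# Keep ~10% of read pairs, with a fixed seed so R1 and R2 stay in sync.
import gzip
import random

def subsample_pair(r1_in, r2_in, r1_out, r2_out, fraction=0.1, seed=42):
    rng = random.Random(seed)
    with gzip.open(r1_in, "rt") as f1, gzip.open(r2_in, "rt") as f2, \
         gzip.open(r1_out, "wt") as o1, gzip.open(r2_out, "wt") as o2:
        while True:
            rec1 = [f1.readline() for _ in range(4)]  # one FASTQ record
            rec2 = [f2.readline() for _ in range(4)]
            if not rec1[0]:          # end of file
                break
            if rng.random() < fraction:
                o1.writelines(rec1)  # keep the pair together
                o2.writelines(rec2)

subsample_pair("sample_R1.fastq.gz", "sample_R2.fastq.gz",
               "sub_R1.fastq.gz", "sub_R2.fastq.gz")
```

The important part is making the same keep/drop decision for both files of a pair, e.g. via a fixed seed, so the mates stay synchronized.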

The resources at the EU server are considerable and the largest offered publicly at a Galaxy server. Meaning, this is probably the actual limit with Mothur, but you could explore this more in their publications – or maybe someone has done a review?

I also ran a small test with FastQC on all of your samples, the factor-10 downsample, and the downsampled publication data we used in the Mothur tutorials. The duplication rates seem consistent, and nothing popped out to me here.

Edit: the processing resources at both of these servers are the same, and your account is the same. The meta server just has a filtered tool panel to simplify the usage.

TL;DR – Target databases are intact. Read content/quality seems fine. There is “too much” coverage somewhere, likely in the V3 region, and that is crashing the job when the full samples are used. Sub-sampling is recommended.

Hope this helps! :slight_smile:

Thanks for your detailed explanation. I have two samples, so I will analyse them one by one. But after calculating alpha diversity for each sample, can I compare them for beta diversity later?
My other question: I downloaded the latest SILVA reference, 138.2, as FASTA from its website. It has uracil nucleotides. Why? Can I use it as a reference? Here it is: https://www.arb-silva.de/fileadmin/silva_databases/release_138_2/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
Or which reference file can I use from here for the SSU? Archive
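
(My guess is that the export uses the RNA alphabet because these are rRNA sequences – would converting U→T make it usable as a DNA reference? Something like this rough sketch, with my own file names, is what I had in mind:)

```python
# Rough idea: convert the SILVA RNA-alphabet FASTA to DNA (U -> T),
# leaving the header/taxonomy lines untouched. File names are mine.
import gzip

with gzip.open("SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz", "rt") as fin, \
     open("silva_138.2_dna.fasta", "w") as fout:
    for line in fin:
        if line.startswith(">"):
            fout.write(line)  # keep headers as-is
        else:
            fout.write(line.replace("U", "T").replace("u", "t"))
```
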
Thanks

Hi,
Today I subsampled my data. Previously I had two FASTQ pair sets, but now I have only one. However, align.seqs gives the same error: file size limit exceeded…
Here I share my history: Galaxy

Dear Jenna, we need to solve this – otherwise this issue makes Galaxy unusable for most metagenomics studies.
Thanks

Hi @biologisthurkan,

I can only provide some technical background. We do have some special limits for the Mothur tools; you can see this here: infrastructure-playbook/files/galaxy/tpv/tools.yml at master · usegalaxy-eu/infrastructure-playbook · GitHub

This means we kill the job if, and only if, the tool creates an intermediate file of 1 TB. It’s some bug in Mothur; this should not happen and only happens for special files. I never figured out under which circumstances Mothur encounters this bug.
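
To illustrate the mechanism (a rough sketch of the idea only, not our actual scheduler/TPV code – `kill_job` and the polling interval are placeholders):

```python
# Watch a job's working directory and cancel the job as soon as any
# single file it has written exceeds the configured limit (here ~1 TB).
import os
import time

LIMIT_BYTES = 1024**4  # ~1 TB (1 TiB)

def oversized_file(path, limit=LIMIT_BYTES):
    """Return True if any file under `path` exceeds `limit` bytes."""
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                if os.path.getsize(os.path.join(root, name)) > limit:
                    return True
            except OSError:
                pass  # files can vanish while the job is running
    return False

def watch(job_dir, kill_job, interval=60):
    while True:
        if oversized_file(job_dir):
            kill_job()  # placeholder: signal the runner to cancel the job
            return
        time.sleep(interval)
```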

I have pinged the microGalaxy community; hopefully they have some better insights for you.

Cheers,
Bjoern