Memory allocation failure during Taxonomy assignment

Thais_Guillen · January 9, 2022, 1:16am

I have seen this issue reported several times here, but I couldn’t find a clear solution. I hope someone can help me.

I’m following the ITS DADA2 Pipeline Workflow to analyze my data: 36 paired FASTQ files obtained with Illumina MiSeq. The input samples were fungal ITS amplicons. I followed the tutorial first in Galaxy and then in RStudio, and I ran into the same problem in both. This happens when assigning taxonomy:

# Assign taxonomy

path <- file.path("UNITE_public_10.05.2021.fasta.gz", "UNITE_public_10.05.2021.fasta")
taxa <- assignTaxonomy(seqtab.nochim, path, multithread = TRUE, tryRC = TRUE,
                       taxLevels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"),
                       verbose = FALSE)

Error in C_assign_taxonomy2(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, :
  Memory allocation failed.
In addition: Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
  reading FASTA file UNITE_public_10.05.2021.fasta.gz/UNITE_public_10.05.2021.fasta: ignored 8 invalid one-letter sequence codes

I tried to increase the memory limit, and this is what I have now:

memory.size()
[1] 11921.24
memory.limit()
[1] 31975

These are the characteristics of my laptop:

Processor AMD Ryzen 7 PRO 4750U with Radeon Graphics 1.70 GHz
RAM 32.0 GB (31.2 GB usable)
System 64-bit operating system, x64-based processor

In case it is useful, this is the step before the taxonomy assignment:

# Track reads through the pipeline

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names

head(track)
                            input filtered denoisedF denoisedR merged nonchim
1-01A_S63_L001_R1_001.fastq 39949    39663     37575     34082   3900    3831
1-01B_S64_L001_R1_001.fastq 51059    50727     47766     41556   4241    4139
1-01C_S65_L001_R1_001.fastq 71229    70897     68033     54961  16499   16157
1-04A_S78_L001_R1_001.fastq 48126    47701     44850     41881  13130   12954
1-04B_S79_L001_R1_001.fastq 78663    78350     75736     69691  11161   11070
1-04C_S80_L001_R1_001.fastq 67420    67102     64241     59966  25060   24773

Thank you very much in advance!

jennaj · January 10, 2022, 8:48pm

Hi @Thais_Guillen

Memory errors can happen with any tool, in Galaxy or not, for three primary reasons. These are all broad cases, which is why there isn’t a single clear answer.

1. There is a problem with the inputs.
2. The work actually exceeds the available computational resources.
3. There is a mismatch between the command string and what the tool expects. Not all tools report a specific error for every unexpected input or usage; some just try to execute until they run out of resources.

I usually start with troubleshooting the inputs when a problem comes up in the primary Galaxy applications (since the command string is automatically generated by tools). For your case, it looks as if the reference fasta is being input twice. Maybe start checking there yourself?

These are large data, so it is also possible that the tool is actually running out of resources. Or maybe the fasta wasn’t downloaded completely or has some formatting problem. (Was it obtained from here? If not, maybe try this version instead: Taxonomic reference data.)
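One way to look for an incomplete download or a formatting problem is to inspect the first few title lines of the reference. The following is a minimal base R sketch, assuming the fasta has already been decompressed to a plain text file; `peek_fasta_headers` and the example path are hypothetical names used only for illustration.

```r
# Return the first few title lines (lines starting with ">") of a FASTA file.
# peek_fasta_headers is a hypothetical helper, not part of dada2 or Galaxy.
peek_fasta_headers <- function(path, n_lines = 1000, n_show = 5) {
  # Read only the top of the file so that large references stay cheap to check.
  lines <- readLines(path, n = n_lines)
  head(lines[startsWith(lines, ">")], n_show)
}

# Hypothetical usage; adjust the path to wherever the reference was extracted:
# peek_fasta_headers("UNITE_public_10.05.2021.fasta")
```

Comparing the printed title lines with the format described in the tool help shows quickly whether the file looks like what the tool expects.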

You might also try comparing how this tool and your data process at a large public server versus your laptop. Wrapped versions are available at the usegalaxy.* servers (see Galaxy Platform Directory: Servers, Clouds, and Deployable Resources - Galaxy Community Hub) if you want to try that way.

Hope that helps a little bit!

Thais_Guillen · January 11, 2022, 11:11am

Hi jennaj

Thank you very much for your help! I actually used the version of UNITE indicated in Taxonomic reference data. The reference data is extracted into a folder with the same name inside the R working directory; that’s why it looks like a double input. I downloaded it three times in case of download errors, but I obtained the same result.

Your suggestions make total sense, thank you very much, I will try to process my data on a large server and see how it goes. Have a nice day!

Thais_Guillen · January 11, 2022, 11:17am

UseGalaxy.org (Main) gives the same error.

jennaj · January 11, 2022, 6:28pm

Please send in the error as a bug report. We can try to see if there is some input/format problem or if the tool is actually running out of resources. Please include the URL for this topic in the comments to help associate the two. How-to: Galaxy Training!

After doing that, try running at UseGalaxy.eu – they have some very large clusters available, so jobs that fail for true resource/memory reasons at one server may work there.

jennaj · January 12, 2022, 7:10pm

Hi @Thais_Guillen

Thanks for sending in the report. The reference fasta has title lines (“>” lines) that DADA2 doesn’t know how to interpret. Help for the expected format is near the bottom of the tool form in Galaxy.

Try this:

Reformat the fasta title lines with the tool Text transformation with sed using the following two lines of script for the “Sed Program”. Note that using Sed is just an example – you can use any tool/method you want for this, as long as the end result is the same.

s/(^>)(.+\|)(.+)(\|.+)/\1\3/
s/[a-z]{1}__//g

That will convert fasta title lines that originally are formatted like this:

>UDB016649|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Thelephorales;f__Thelephoraceae;g__Thelephora;s__Thelephora_albomarginata|SH1502188.08FU

To this format:

>Fungi;Basidiomycota;Agaricomycetes;Thelephorales;Thelephoraceae;Thelephora;Thelephora_albomarginata

Rerun using that reformatted reference fasta input. Be sure to enter the taxonomy levels – it looks like you included that content in your Rscript but it wasn’t included with the job sent in for the bug report.

The option is right under where the reference fasta is input on the tool form, labeled as “Names of the taxonomic levels in the data set” [comma separated list (taxLevels)].

Kingdom,Phylum,Class,Order,Family,Genus,Species
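Before rerunning, the reformatted title lines can be sanity-checked against the seven levels listed above. This is a minimal base R sketch, not a DADA2 feature: `check_tax_headers` is a hypothetical helper, and it assumes every entry in the reference carries all seven levels, which may not hold for every reference release.

```r
# Check that every FASTA title line is a plain semicolon-separated taxonomy
# string with the expected number of levels and no leftover "|" separators.
# check_tax_headers is a hypothetical helper written for this illustration.
check_tax_headers <- function(lines, n_levels = 7) {
  headers <- lines[startsWith(lines, ">")]
  # Strip the leading ">" and count the ";"-separated fields per header.
  fields <- strsplit(sub("^>", "", headers), ";", fixed = TRUE)
  length(headers) > 0 &&
    all(lengths(fields) == n_levels) &&
    !any(grepl("|", headers, fixed = TRUE))
}

# Hypothetical usage on a reformatted, decompressed reference:
# check_tax_headers(readLines("UNITE_reformatted.fasta"))
```

A result of FALSE means at least one title line still deviates from the target format and is worth inspecting before the job is resubmitted.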

Thais_Guillen · January 12, 2022, 9:19pm

Hi @jennaj
Thank you very much! It worked! I have another question: if I want to do the same in R, how could I reformat the reference fasta?

Best wishes
Thais

jennaj · January 12, 2022, 11:06pm

Hi @Thais_Guillen

I’m not as clever with R text/string manipulations but I’m sure there is a way. It would likely involve a few steps: fasta > tabular > do stuff with the strings representing the title lines > fasta

Another option is ready to go already: capture the URL for the modified dataset in Galaxy and import that into your environment, instead of importing from the original website. If you are using RStudio within Galaxy, you can also move datasets directly from a history into the IE environment with the gx_get command. Tutorials: Search Tutorials.
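For anyone who prefers to stay in R, the same title-line rewrite can be sketched with base R string functions, skipping the tabular step. This is a minimal sketch under stated assumptions rather than a tested DADA2 recipe: it assumes the reference has been decompressed, that every title line has the three “|”-separated fields shown earlier in this thread, and the function name `reformat_unite_headers` and the file names are hypothetical.

```r
# Rewrite title lines of the form ">ID|k__...;s__...|SH..." so that only the
# taxonomy string remains, with the rank prefixes (k__, p__, ...) removed.
# Assumption: headers follow the three-field layout quoted in this thread.
reformat_unite_headers <- function(lines) {
  is_header <- startsWith(lines, ">")
  # Keep only the middle "|"-delimited field of each title line.
  lines[is_header] <- sub("^>[^|]*\\|([^|]*)\\|.*$", ">\\1", lines[is_header])
  # Drop the single-letter rank prefixes such as "k__" or "s__".
  lines[is_header] <- gsub("[a-z]__", "", lines[is_header])
  lines
}

# Hypothetical file names; adjust to wherever the reference was extracted.
in_file  <- "UNITE_public_10.05.2021.fasta"
out_file <- "UNITE_public_10.05.2021.reformatted.fasta"
if (file.exists(in_file)) {
  writeLines(reformat_unite_headers(readLines(in_file)), out_file)
}
```

Sequence lines are left untouched because only lines starting with “>” are rewritten. The reformatted file would then be passed to assignTaxonomy in place of the original reference, together with the same taxLevels vector.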
