My colleague and I have been working with RNA STARsolo to process several single-cell RNA-Seq (scRNA-Seq) datasets. We’ve encountered an intermittent issue: one of us runs into a memory-related error while processing a dataset that the other has been able to run successfully.
The specific error message we’re seeing is this:
EXITING because of fatal ERROR: not enough memory for BAM sorting:
SOLUTION: re-run STAR with at least --limitBAMsortRAM 10436112811
Jun 10 17:20:41 ...... FATAL ERROR, exiting
We’re puzzled as to why the tool fails for one of us but runs without issue for the other, even when using the same dataset and STAR parameters. Is there a recommended best practice for setting --limitBAMsortRAM or otherwise ensuring more consistent performance across runs?
If everything else is exactly the same (a shared workflow?), then differences between particular runs come down to chance on the individual cluster node(s) where the jobs are running. This is most noticeable for very large jobs that are near the maximum memory capacity of that public Galaxy server’s cluster. What is going on is that a node might be executing two (or more) different jobs at once, and if both are “maximally large” the node can be overwhelmed. This is hard to avoid on multi-node, multi-user clusters.
If there is no workflow involved, then different versions of the tool could add more variability. If the runs are happening at different public servers, that also adds some runtime variability (same tool, but different cluster resources). Finally, different samples have different data characteristics that can definitely lead to different runtime behavior.
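If you want to rule out version differences, it is worth confirming that both of you are running the same STAR build. In Galaxy the tool version is shown on the tool form and in each job’s details; on a command-line install, a quick check might look like the sketch below (assuming STAR is on your PATH and, optionally, installed via conda):

```
# Print the STAR build in use (versions can differ between setups)
STAR --version

# If STAR was installed through conda, list the exact package build as well
conda list star
```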
It sounds like this is happening rarely, which might be expected when batch processing data. It can’t be mitigated at a public Galaxy server, since the clusters are heterogeneous and used on demand (across pools of tools and user jobs). Running a Docker Galaxy backed by a homogeneous cluster type from a cloud provider might give more consistent performance, though even those resources can have some variability, even if you restrict each node to a single job at a time (each job has dedicated compute). On your own server you would also have control over administrative tool resource options like --limitBAMsortRAM (this is fixed at public servers, or at least those I know about!).
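For reference, on your own deployment that setting maps directly onto STAR’s own flag. Below is a minimal sketch of a STARsolo command with the BAM-sorting buffer raised explicitly; every path, the whitelist file, the thread count, and the 16 GB value are placeholders (the value, given in bytes, just needs to be at least the ~10.4 GB the error message asked for):

```
# Hypothetical STARsolo invocation with the sort buffer set explicitly.
# All paths and values below are placeholders for illustration.
STAR --runMode alignReads \
     --genomeDir /path/to/star_index \
     --readFilesIn sample_R2.fastq.gz sample_R1.fastq.gz \
     --readFilesCommand zcat \
     --soloType CB_UMI_Simple \
     --soloCBwhitelist /path/to/barcode_whitelist.txt \
     --outSAMtype BAM SortedByCoordinate \
     --limitBAMsortRAM 16000000000 \
     --runThreadN 8
```

If I remember the STAR manual correctly, when --limitBAMsortRAM is left at its default of 0 the sorting buffer is sized from the genome index, which is why the error message asks for an explicit minimum when that turns out not to be enough.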
So: lots of moving parts!
What can help: you aren’t using a workflow yet, and workflows really help when running resource-intensive jobs in series and across many samples. If an upstream tool fails, you can restart that single failed job (anywhere in the pipeline) and resume all of the downstream jobs that depend on it (find the resume toggle on the rerun tool form). This saves having to click to restart (and maybe re-sort) the same data repeatedly. A workflow can include as few as a single tool, but most would have some data preparation steps, then this tool, then some post-processing. You can also find examples in the Published Workflows section at any server, to use as-is or as templates.
Please let us know if that answers your question or if you want to know more about anything I mentioned!