RNA-STAR, hg38 GTF reference annotation, Cloudman/AWS options plus local Galaxy "Cloud Bursting" for memory intensive mapping

Hello, I am having an issue running STAR on some paired-end data. I am running Galaxy through AWS and Cloudman. My data is one set of paired-end Illumina fastq files, around 2.3 MB each. I have also uploaded the GRCh38 .gtf file from Ensembl. When I try to run STAR with the options
paired-end (as individual datasets)
use a built-in index
use genome reference without builtin gene-model
hg38 as the reference genome
GR38.gtf (that I uploaded) as the gene model
49 as the length of genomic sequence around annotated junctions (read length was 50 from my QC)

The error I get is
“EXITING: fatal error trying to allocate genome arrays, exception thrown: std::bad_alloc
Possible cause 1: not enough RAM. Check if you have enough RAM 32002674832 bytes
Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 32002674832”

I was wondering how I could go about fixing this issue. Since I am on AWS using a large cluster, I don’t see how I could not have enough RAM.

Thanks


Hi @jgoldst7

Jobs can error for all sorts of odd reasons (including memory errors) when there are problems with the inputs.

In your case, I suspect the reference annotation is a mismatch for the reference genome. The built-in “hg38” human genome is sourced from UCSC. Please see this prior Q&A for more help and details: RNA-STAR and hg38 GTF reference annotation
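
A quick way to see this kind of mismatch is to compare the sequence names in the first column of the GTF with the names the built-in genome uses. A sketch only, run on a local copy of the annotation (the file name stands in for whatever your uploaded GTF is called):

    # Ensembl GRCh38 GTFs name chromosomes "1", "2", ... "MT",
    # while the UCSC-sourced hg38 genome uses "chr1", "chr2", ... "chrM".
    # If the two naming schemes do not match, the annotation cannot be
    # applied to the built-in genome.
    grep -v '^#' GRCh38.gtf | cut -f1 | sort -u | head -30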

FAQ: https://galaxyproject.org/support/

Thanks!

Thank you for the reply. My reference annotation is from Ensembl, so that is most likely the reason.


Hi again,

I downloaded the UCSC version of the genome annotation and got the same error. Are there other common problems that could be causing this error?


If you are certain the inputs are a match, then the job probably really is running out of memory during execution.
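
If you want to double-check on the instance itself, the two causes from the error message can be inspected directly. A sketch, assuming shell access to the VM:

    # Total and available RAM on the instance
    free -h

    # Current per-process virtual-memory limit for the shell
    # ("unlimited", or a size in kilobytes)
    ulimit -v

    # The error message's own suggested command, which raises that
    # limit for the current shell session
    ulimit -v 32002674832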

BTW – Using UCSC’s annotation is often not a good choice. Why?

  • The GTFs extracted from the Table Browser can be truncated (only about 100k lines are downloaded, whether to Galaxy or any other destination, including local file download).
  • Many GTFs are large, human and mouse in particular. If the download was truncated, that alone can cause a tool error. Check the end of the file to see whether this happened – a truncated download ends with a message instead of GTF records. That message puts the data out of specification (poor formatting, leading to tool errors), and the dataset is incomplete anyway (not useful). Tool: Select last lines from a dataset (tail) (Galaxy Version 1.1.0)
  • GTF data from this source, even when completely transferred, has the same value populated for both gene_id and transcript_id in the 9th-column attributes field – both are set to the transcript name. This effectively means that all counts or other summaries that are putatively “by gene” are actually “by transcript” (a scientific content problem). Use one of the other sources linked above instead for best results. Gencode and iGenomes are both good choices for human/mouse, with iGenomes supporting even more genomes. (Both of the checks above are sketched on the command line just after this list.)
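
Both checks can also be done quickly outside of Galaxy. A sketch only; the file name is a placeholder for a local copy of the GTF:

    # 1. Truncation check: a cut-off Table Browser download ends with a
    #    message rather than with GTF records.
    tail -n 5 annotation.gtf

    # 2. gene_id vs transcript_id check: count exon records where the two
    #    attribute values are identical. For a proper GTF this should be
    #    near zero; for Table Browser GTFs it is essentially every record.
    awk -F'\t' '$3 == "exon" {
        match($9, /gene_id "[^"]+"/);       g = substr($9, RSTART + 9,  RLENGTH - 10);
        match($9, /transcript_id "[^"]+"/); t = substr($9, RSTART + 15, RLENGTH - 16);
        if (g == t) same++;
        total++
    } END { print same + 0, "of", total, "exon records have gene_id == transcript_id" }' annotation.gtf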

I’m sorry, I misspoke about the gene annotation. I meant I got one based on UCSC, to match the reference genome for STAR. I downloaded the comprehensive gene annotation from GENCODE. After the 5 header lines starting with #, the file is 2.9 million lines, and the first line looks like this

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

So if that is my gene annotation file and I am selecting hg38 as the STAR reference genome, then the only thing causing my error would be actually running out of memory? I am using an m5.xlarge VM from AWS (16 GB of RAM) with an added 100 GB persistent storage volume, and I am trying to align a single set of paired-end reads that are ~2.5 GB each in fastq.gz format. It seems like that should be enough to run STAR, as 16 GB is about what an average laptop would have.


Thanks for the extra info.

Yes, the GENCODE GTF does not look to be the problem.

RNA-STAR is a compute-intensive tool, and 16 GB is not sufficient for human mapping. The memory needed to run the tool on the command line is the same as when running it within Galaxy. See: Mapping RNA-seq Reads with STAR - PMC

Quote from that publication:

Necessary Resources

Hardware

  • A computer with Unix, Linux or Mac OS X operating systems.
  • RAM requirements: at least 10 x GenomeSize bytes. For instance, human genome of ~3 GigaBases will require ~30 GigaBytes of RAM. 32GB is recommended for human genome alignments.
  • Sufficient free disk space (>100 GigaBytes) for storing output files.
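
To make those numbers concrete, here is a rough command-line equivalent of the run Galaxy performs for you. A sketch only – the index directory, file names, and thread count are placeholders, and the exact wrapper invocation will differ:

    # Map against the pre-built hg38 index, inserting the GENCODE annotation
    # on the fly with the read-length-minus-one overhang set in the tool form.
    # Loading the human index alone needs roughly 30 GB of RAM, which is why
    # a 16 GB m5.xlarge instance fails with std::bad_alloc.
    STAR \
        --runThreadN 8 \
        --genomeDir /path/to/hg38_star_index \
        --readFilesIn reads_1.fastq.gz reads_2.fastq.gz \
        --readFilesCommand zcat \
        --sjdbGTFfile gencode_annotation.gtf \
        --sjdbOverhang 49 \
        --outSAMtype BAM SortedByCoordinate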

There are alternative ways to run your own Galaxy server, including Cloud choices, but since you are already using Cloudman you probably just need to bump up to a VM with more resources: Galaxy Platform Directory: Servers, Clouds, and Deployable Resources - Galaxy Community Hub.

Please also see this recent Galaxy Blog post, an alternative that uses a local Galaxy with on-demand cloud resources incorporated: The Galactic Blog - Galaxy Community Hub > Enabling cloud bursting for Galaxy

Thanks!

Thanks for the info! I bumped up my VM and it worked!

One final question. Since running STAR the first time requires that it build an index (which I believe is the most memory-intensive step), is there a way to save that index for future use? I would like to save this workflow and reuse it, always with the same index, but I don’t see any output from STAR that is the actual index it built.

Wonderful! Very happy that solved the memory/space issues :rocket:

This was the “built-in index” already available in your cloud server, not a fasta from the history (custom genome), correct?

When using an existing built-in, tool-specific genome index, any additional indexing a tool does during runtime is based on the parameters AND the input content. As far as I know, this cannot be saved back for reuse, since the input content will be different for each mapping run.

For mapping tools, the original output BAM is also indexed (by Samtools “sort”) when creating the final coordinate-sorted BAM result. This indexing definitely cannot be saved back, as it is based on the output content.

Data Manager (DM) created indexes are already available on Cloudman and on most public Galaxy servers for hg38, and they can be created with DMs on a local/cloud Galaxy if a genome you want to use is not already pre-indexed – and not just the baseline genome index, but the other important indexes too, including Samtools, Picard, 2bit, plus tool-specific indexes. If you ever need to run DMs, this prior Q&A has much information: the best order to run DMs, links to resources, troubleshooting various issues that can come up, et cetera: Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting
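
For completeness: outside of Galaxy, the analogous one-time step a DM automates is STAR’s genomeGenerate mode. A minimal sketch with hypothetical file names (this is the memory-intensive ~30 GB step for human) that bakes the splice-junction database from the GTF into a reusable index directory:

    # One-time index build; the resulting directory can be pointed to with
    # --genomeDir for every subsequent mapping run, so neither the genome
    # indexing nor the GTF junction database has to be rebuilt each time.
    mkdir -p hg38_star_index
    STAR \
        --runMode genomeGenerate \
        --runThreadN 8 \
        --genomeDir hg38_star_index \
        --genomeFastaFiles hg38.fa \
        --sjdbGTFfile gencode_annotation.gtf \
        --sjdbOverhang 49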