Using local galaxy for human paired end rna seq data in RNAStar

Grace_Kim · March 9, 2022, 7:21am

Hello,

Please bear with me, I’m very much a Galaxy newbie. As described in the title, I have human RNA-seq raw fastq files that have been quality checked and trimmed on our local Galaxy; reads are paired end and raw reads are 75 bp long. When we try aligning the fastq files to the hg19 primary assembly fasta and GTF files in RNAStar, the program runs for days without generating any files (it doesn’t quit due to an error, it just runs as if it’s spinning).

What do you suggest for running a differential expression pipeline for human RNA samples (which are confidential)? I’m worried that 60 GB of RAM still isn’t enough to run this analysis.

jennaj · March 9, 2022, 9:12pm

Hi @Grace_Kim

Some items to check:

Is the hg19 genome natively indexed on your server? If not, using it as a custom genome fasta from the history adds an indexing step before RNA-Star even starts aligning, every time the tool is run. Natively indexing a large genome is a better approach. The US public Galaxy server at UseGalaxy.org allocates about 60 GB of memory for this tool, and using the human genome as a custom genome (fasta) will fail. If the built-in index is used instead, the jobs execute (note that the upper limits for success vary based on settings, input sizes, sequence quality, etc). This is a resource-intensive tool.
Are you sure that the reference annotation (gtf) is a match for the reference genome (fasta)? UCSC is usually the best place to get annotation for hg19. The format and content are what tools expect for that input. If there is a mismatch, the tool can get “stuck” and never complete.
If you are using a custom genome and/or have numerous/large fastq pairs, the job could take a longer time to execute successfully.
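
One quick way to test the fasta/gtf match described above is to compare the sequence names in both files. This is only a minimal sketch, not something Galaxy or RNA-Star provides: the function names and the example file names are made up for illustration, and it only checks chromosome naming (for example "chr1" versus "1"), not coordinates or assembly version.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_names(path):
    """Collect sequence names: the text after '>' up to the first whitespace."""
    names = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_names(path):
    """Collect sequence names from column 1, skipping comments and blank lines."""
    names = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_path, gtf_path):
    """Return (names found in both files, names found only in the GTF)."""
    fasta = fasta_names(fasta_path)
    gtf = gtf_names(gtf_path)
    return fasta & gtf, gtf - fasta
```

For example, `shared, gtf_only = compare_names("hg19.fa", "hg19.gtf")` (hypothetical file names). If `shared` comes back empty, or `gtf_only` lists most of the main chromosomes, the annotation and genome probably use different naming schemes.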

Once everything is confirmed to be technically correct, try running the tool with a single pair as a test on your server.

Of note: Running a tool in Galaxy uses the same resources as running that same tool on the command line. This is a good starting point for bioinformatics community discussions about the resources that RNA-Star and other aligners will need: Hardware requirements for bowtie2/STAR RNA-seq alignment
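
For a rough sense of scale, the rule of thumb in the STAR manual, as I understand it, is at least 10 bytes of RAM per genome base. The sketch below just does that arithmetic; the function names are made up, the 10 bytes per base figure is an assumed lower bound, and real usage depends on settings and inputs.

```python
def count_bases(fasta_path):
    """Count sequence characters in an uncompressed FASTA, ignoring header lines."""
    total = 0
    with open(fasta_path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def estimate_star_ram_gb(genome_bases, bytes_per_base=10):
    """Rough lower bound on RAM in GB (10^9 bytes); bytes_per_base is an assumed rule of thumb."""
    return genome_bases * bytes_per_base / 1e9
```

With a human genome of roughly 3.1 billion bases (an approximate figure), `estimate_star_ram_gb(3.1e9)` comes out at about 31 GB, so a 60 GB allocation is not obviously too small for a prebuilt index.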

I added a few tags to your post that point to related discussions at this forum.

Grace_Kim · March 11, 2022, 6:41pm

Hi Jenn,

Thank you so much for pointing out the indexing issue. I decided to try hg38, and I’ve fetched the relevant fasta and processed it with the data manager indexers for SAM, Picard, and twoBit. I have started the index build for RNAStar using hg38.ncbiRefSeq.gtf.gz. I couldn’t find a URL that worked in the dbmanager fetch gtf tool, so unfortunately I had to upload the GTF from the UCSC goldenPath link.

Is there an estimated run time or a way to distinguish if the program is just spinning instead of building an index?