Job still running for more than 15 hours

I’ve been running HISAT2 on the job with the API ID given below.
Kindly provide support.


11ac94870d0bb33a951ae66565734a8e

Hello!

The logs show that the job went over the cgroup memory limit. The job was automatically resubmitted with double the memory and, in total (including the resubmission), it ran for 8 hours. This means the dataset is bigger than usual and will take more time to run as well.

Could you please share the history? We can then provide you with better analysis :slight_smile:

Thank you,

Gabriel

Hello!
Thank you for the reply.

Could you please direct me to a source that teaches how to do that? Or could you show me how to do it?

Monish

Hello,

You can follow the instructions here: FAQ: Sharing your History

Gabriel

Hi Galaxy team,

I’m running HISAT2 on Galaxy for an RNA-seq splicing project comparing MSI vs MSS samples.

I’m seeing a consistent issue where MSI FASTQ files larger than ~2GB run indefinitely (15+ hours and still “running”), while:

  • MSS samples finish in ~45–60 minutes

  • Smaller MSI FASTQs also complete normally

All jobs use the same workflow, reference genome, and default HISAT2 parameters.

The affected jobs don’t fail — they just stay in the running state. FASTQs appear normal, and read counts aren’t dramatically different.
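Since read counts came up, here is a minimal shell sketch for sanity-checking a FASTQ read count locally. The file below is tiny demo data standing in for a real (decompressed) FASTQ, so the snippet is self-contained:

```shell
# Create a tiny two-read demo FASTQ so this sketch is self-contained.
# Replace "demo.fastq" with your real file (pipe through `zcat` if gzipped).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGC\n+\nIIII\n' > demo.fastq

# Each FASTQ record is exactly 4 lines, so reads = lines / 4.
reads=$(( $(wc -l < demo.fastq) / 4 ))
echo "$reads reads"   # prints "2 reads"
```

Comparing this number between the MSI and MSS inputs shows whether the size difference comes from read count or from read length.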

As advised by one of the admins earlier, I also tried resubmitting the job with higher requested memory, but unfortunately that didn’t change anything — the MSI jobs still keep running without completing.

I’m wondering if this could be related to:

  • Galaxy-side limits (walltime / I/O),

  • HISAT2 behavior with higher mismatch/indel rates in MSI,

  • or something specific to larger input files.

Has anyone encountered similar behavior with HISAT2 on Galaxy?

Are there recommended parameter tweaks, preprocessing steps, or alternative aligners (e.g., STAR) that might help in this situation?
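One preprocessing test (an assumption on my part, not an official recommendation) is to downsample the large MSI FASTQ and rerun the alignment: if a half-size subset completes normally, sheer input size is the likely trigger rather than MSI-specific read content. Because FASTQ records are exactly 4 lines, keeping the first N reads is a single `head` call. A self-contained sketch with demo data standing in for the real file:

```shell
# Demo FASTQ with three reads, standing in for the real large MSI file.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGC\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > big.fastq

# Keep only the first n_reads records (4 lines per FASTQ record).
n_reads=2
head -n $(( n_reads * 4 )) big.fastq > subset.fastq

echo "$(( $(wc -l < subset.fastq) / 4 )) reads kept"   # prints "2 reads kept"
```

Note that `head` takes the first reads rather than a random sample; for a random subsample, a tool such as `seqtk sample` (with a fixed seed so paired files stay in sync) is the usual choice.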

https://usegalaxy.eu/u/monish123/h/rna-analysis

I’ve also made the history accessible; any help would be appreciated.

Thanks!

Hi @Monish_V

Thanks for sharing the history! Very helpful.

I don’t see anything special about the mapping run here. It seems the job never started the mapping step at all; instead it was canceled before it finished queueing, so it was never even assigned to a cluster node. Is this the example you meant to share? Would you like to share the history with the errors from a job that failed during processing? Or did you cancel all of the prior jobs?

The best way to get information about a strange job is to allow it to finish processing. A mapping job can queue for a day or so and then finally start processing. After 15 hours it may still be queued (as the job above appears to have been), or it may be just starting the actual mapping step.

This doesn’t need to delay your other analysis. Meaning, you can keep going and start new work on other data at the same time. That can be in the same history or a different history – either is fine! The server knows how to keep track of tens of thousands of your jobs all at the same time.

I’m wondering if explaining how the clusters work would help?

The color of a dataset gives some clues about which processing stage it is in. The topic below has some short help about computational resources at the public servers.


You’ll find many more explanations in topics tagged queued-gray-datasets. Some explain how to investigate server performance, for example → How to see the UseGalaxy.eu job queue statistics

Most public servers work about the same way! The best advice is to get your jobs into the queue, then allow them to process completely. If an odd error comes up later, including a resource issue, you can share the example and we’ll be able to offer advice. That can include reaching out to cluster administrators to learn whether resources can be adjusted, and also helping you to organize your data or parameters a bit differently.

As a test, I started a history here that I’ll let run over the weekend. It uses the same accession you were using. I pulled in the SRR11296739 sample (GSM4408849: MSI-H, likely Lynch due MLH1 germline sample 7 tumor tiss... - SRA - NCBI) from NCBI the same way you did to start with. Next, I ran a simple generic QA workflow in Galaxy on the raw reads to see what happens.

Next, I’ll try to map the reads using HISAT2 (defaults) against the hg38 native index.
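For readers following along, a default paired-end HISAT2 run corresponds roughly to the command sketched below. Galaxy assembles the real command itself, and the index and file names here are placeholders, so the sketch only prints the command rather than executing it:

```shell
# Roughly what a default paired-end HISAT2 job boils down to.
# Index and file names are placeholders; this only prints the command.
cmd="hisat2 -p 4 -x hg38 -1 reads_1.fastq -2 reads_2.fastq -S aligned.sam"
echo "$cmd"
```

The output SAM is then typically sorted and converted to BAM within Galaxy before downstream tools see it.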

This sample isn’t overly large, but as @Gabriel explained, it will route to a larger and busier cluster node than you may have experienced before (during a training session, or simply when the server was less busy!). My job will probably trigger the same automatic rerun unless the additional QA helps the reads map more cleanly. Either way, that’s OK; let’s let it run to completion.

Hope this helps to keep things going, and after the weekend we’ll have some more data to look at. You are still welcome to share back any job that failed on its own (not canceled by you) for a closer look.

Thanks! :slight_smile: