Issue with rnaQUAST tool

Looks great! I’m glad the server could chomp through those RNA-Star jobs! I’ll keep checking back on yours and mine, and we can follow up if any errors come up (after one rerun to rule out stray cluster issues during the event). Thanks!!


Hey Jennifer,

So I noticed the rnaQUAST tool still isn’t working (the job keeps failing), but everything else seems to be running smoothly.


Hi @SehajR

The job and the rest of the history look really great! I’m really glad to see that the IsoformSwitcher results were successful!

The rnaQUAST tool is exceeding the runtime limit of the UseGalaxy.org server. That runtime limit cannot be extended (for any tool) and is not related to job memory (which we can sometimes adjust). FAQ: Understanding walltime error messages

Instead, you can try running that particular tool at a server with a longer runtime allocation. UseGalaxy.eu is probably the best choice, although UseGalaxy.org.au might work too.

You don’t need to start over – just copy the inputs for that tool into a new history (using the gear icon above the list of datasets). Then transfer that smaller history over to the EU server and try running the tool there.
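If you ever want to script that instead of clicking through the interface, here is a minimal sketch using BioBlend, the Python client for the Galaxy API. The API keys, dataset IDs, and file names are placeholders, and the same transfer works entirely in the web UI through the history export/import options:

```python
# Copy a tool's inputs into a small history at .org, export it, and import
# the archive at .eu. All keys, IDs, and names below are placeholders.
from bioblend.galaxy import GalaxyInstance

org = GalaxyInstance(url="https://usegalaxy.org", key="ORG_API_KEY")
eu = GalaxyInstance(url="https://usegalaxy.eu", key="EU_API_KEY")

# 1. Gather just the rnaQUAST inputs into a fresh history.
small = org.histories.create_history(name="rnaQUAST inputs")
for dataset_id in ["INPUT_ID_1", "INPUT_ID_2"]:  # placeholder dataset IDs
    org.histories.copy_dataset(small["id"], dataset_id)

# 2. Export the small history as an archive and download it locally.
jeha_id = org.histories.export_history(small["id"], gzip=True, wait=True)
with open("rnaquast_inputs.tar.gz", "wb") as outf:
    org.histories.download_history(small["id"], jeha_id, outf)

# 3. Import the archive at the EU server.
eu.histories.import_history(file_path="rnaquast_inputs.tar.gz")
```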

I started a test at EU to see what happens. We can follow up about it more. I’m curious whether the full-sized human run is even possible on the public servers.

This is certainly a stress test for rnaQUAST! :slight_smile:

Notes:

While looking at the data, I happened to notice an example that helps with understanding the complexity of this type of analysis: what seemed (just by eye, I don’t “recommend” this method!) to be low-complexity regions in the predicted transcripts, sometimes spanning the entire transcript!

I took the first transcript from the final sample in collection 2104, sample Wt26_Rep3.fastq, and ran a BLAT against the human genome at UCSC. The hit was a simple repeat found in multiple places in the genome, over 200 bases long. So I clicked through to the browser to examine the first hit in the listing. Can you see why this is interesting, but also maybe difficult for tools to process, even on really massive clusters? Toggle on the GRC Incident track for some interesting details. Clearly conserved, too.

The nature of sequences like these is exactly why full-scale transcriptomics on humans is so complex – it isn’t automatic, and can require a bit of curation, especially when casting a wider net for discovery purposes. Doing it all at once is challenging, anywhere, and can be limited by the tools/methods themselves. This is part of why scientists tend to focus on smaller genomic regions (horizontal slices of the genome) and/or feature types/clusters (vertical slices of content meaning).

>STRG.1.1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAA
CCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCT
AACCC

How to solve this is complex. Maybe the RNA-Star alignments could be tuned with some parameters to avoid this. Maybe low-complexity regions can be filtered at later steps (a rough sketch of that idea is below). Maybe some data is processed the way you have done already, and some is put off into a different slice for custom manipulations. But let’s see what happens with my test job – I’d like to see if rnaQUAST can report about situations like this in real data, and how well.
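For illustration only, here is a minimal Python sketch of that filtering idea, using nothing but the standard library. It scores a sequence by its distinct k-mer fraction; this is a crude stand-in for real maskers such as DUST or RepeatMasker, and the cutoff suggested in the comments is just a guess:

```python
# Crude low-complexity score: distinct k-mers as a fraction of all k-mer
# positions. Near 1.0 means a diverse (complex) sequence; near 0 means a
# highly repetitive one. A sketch only, not a substitute for real maskers.
def kmer_diversity(seq: str, k: int = 6) -> float:
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return 0.0
    return len({seq[i:i + k] for i in range(total)}) / total

# The STRG.1.1 transcript from above, an almost perfect TAACCC repeat:
strg_1_1 = (
    "TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC"
    "CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAA"
    "CCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCT"
    "AACCC"
)
print(round(kmer_diversity(strg_1_1), 2))  # roughly 0.1, i.e. very repetitive
# A random sequence of this length would score close to 1.0, so a cutoff
# (say, flag anything below 0.5 for manual review) could set transcripts
# like this one aside before the heavier analysis steps.
```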

Hey Jennifer!

Thank you so much for following up, I really appreciate the troubleshooting on your end! I’ll monitor the rerun on the EU server from the history link you shared; it still seems to be running.

As for the BLAT results, they are indeed really interesting! A lot of these concepts are new to me since I do not come from a bioinformatics background, but they are definitely insightful. I really wasn’t aware of the level of complexity we’re working with…

Sehaj

Update

Hi @SehajR

I have been working with your data at the EU server over the last few weeks (along with being away at a conference), since the clusters scale a bit differently there. I am not done yet, but I wanted to let you know that I haven’t forgotten about the rnaQUAST tool behavior.

This is my copy of your data. You’ll be able to see the newest test runs in the active tab, and some of my failed tests in the hidden tab. You don’t need to import it since it is huge and not completed yet. I’ll make a smaller copy for you once (and if!) I can get it to work correctly on this large human data sample set.

I am working with the full transcript set, a reduced transcript set (only 3 of the samples), the human genome reduced to the “female” chromosomes, and the GTF reduced to match (these references are based on your newest Gencode versions – I just removed any non-primary chromosomes from both). This is to determine the processing thresholds with the simple statistics runs, and to flush out any problems with the additional options. The processing takes some time to complete each run (as you know!) and the runs I started this morning are still processing.
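In case it is useful to reproduce that kind of reference reduction, here is a minimal Python sketch (standard library only) that keeps a chosen chromosome set in both the genome FASTA and the GTF so the two stay consistent. The file names are placeholders, not the exact Gencode file names:

```python
# Keep only a chosen chromosome set in a genome FASTA and its matching GTF.
KEEP = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrM"}  # "female" primary set

def filter_fasta(src: str, dst: str, keep: set) -> None:
    """Copy only FASTA records whose sequence name is in `keep`."""
    writing = False
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                name = line[1:].split()[0]  # ">chr1 extra text" -> "chr1"
                writing = name in keep
            if writing:
                fout.write(line)

def filter_gtf(src: str, dst: str, keep: set) -> None:
    """Copy header comments plus GTF rows whose seqname column is in `keep`."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith("#") or line.split("\t", 1)[0] in keep:
                fout.write(line)

filter_fasta("genome.fa", "genome.primary_female.fa", KEEP)
filter_gtf("annotation.gtf", "annotation.primary_female.gtf", KEEP)
```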

I’m very glad everything else is working for you now! Thank you for sharing the data for these stress tests – very helpful for us, and I think this can be wrapped up soon. I’ll post more updates when a meaningful result is produced. :scientist:

Hey Jennifer!

Thank you so much for all your help and troubleshooting! I hope the data helps with the stress testing and thank you for updating me and keeping me in the loop :slight_smile:

Sehaj

Hi @SehajR

Great!

Updates: It looks like we did uncover a problem with either the Complete Report or the FASTA output option; however, the remainder of the outputs, including the PDF and the other two short report outputs, were created.

So far, that was only generated for the smaller test case I created (only 3 transcript samples); the full-size 12-transcript run is still processing.

  1. I’m going to let the original 12-transcript run keep processing (dataset 109). The results will be red, but the content of the report you are interested in (the PDF) should be intact. I’d like to get this for you at a minimum if at all possible, plus show you how to create it yourself! :scientist: If you want to look at the 3-transcript sub-sample, you can do that now (dataset 103). It might be interesting to see what those statistics represent. It looks like about 50% of the isoforms are known, and the remainder are novel (when compared to this stricter version of “known” Gencode annotation). The files you can review are tagged with “GOOD”.

  2. I restarted both test sizes with an updated set of options, to see if we can get a full run to report as green (no technical errors). The “PDF only” output is the recommended usage, but I think including the logs is always helpful (for any tool) as a cross-check.

  3. Once done, I’ll report back to the developers with what was found. There is likely a minor path problem creating a small technical issue. I’d prefer that the successful outputs be created and fully published with their content (green), and that any sub-sections with an issue produce just a warning instead. Some of this was complicated to model in the Galaxy job environment, and your larger data is great for flushing those problems out! So, thank you again.

Some of this is technical – but maybe helpful. Later, dataset 125 will be green but will match the output of dataset 109 (both are forward-looking and assume the clusters can process data this large! Time won’t be an issue, only memory resources).
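As a side note on watching those states: the same green/red status can be polled from the Galaxy API. Here is a small sketch with BioBlend, where the API key and dataset IDs are placeholders (note that the numbers shown in the history panel, like 109 and 125, are history item numbers, not the API dataset IDs):

```python
# Poll dataset states instead of refreshing the history panel.
# "ok" corresponds to green in the interface, and "error" to red.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.eu", key="EU_API_KEY")  # placeholder key
for dataset_id in ["DATASET_109_API_ID", "DATASET_125_API_ID"]:  # placeholder IDs
    info = gi.datasets.show_dataset(dataset_id)
    print(info["name"], info["state"])  # e.g. queued / running / ok / error
```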

Overall, if I were to do this again, I would start with autosomes 1-22, plus M and X, as the baseline genome to call novel transcripts from (the same keep-set as in the sketch above). This removes the variability introduced by the haplotype and unmapped chromosome regions, and is a pretty common way that this type of data is processed. Then, later, the other sections can be analyzed in isolation. This helps to reduce false positives in the “novel” category when performing a full-genome prediction.

More as this processes! :slight_smile: