I am new to the tools described in this tutorial for genome-wide splicing analysis. I have followed the instructions and I am at the step of the RSeQC checks, but have some concerns:
Infer Experiments (Job ID bbd44e69cb8906b5b26fa353c4f97eb2), Read Distribution (bbd44e69cb8906b5460f20bc1bafb6f7), Junction Annotation (bbd44e69cb8906b57cb9ee739fdf63ea) worked (or at least they produced an output)
Gene Body Coverage (BAM) (bbd44e69cb8906b5270548fe8f18a46e) shows up as green but produced empty outputs, with tool standard error "The number specified to '-l' cannot be smaller than 100."
Junction Saturation (Job API ID bbd44e69cb8906b5d3867e3a0f5885fe) keeps failing ("This job was terminated because it used more memory than it was allocated.")
Inner Distance (bbd44e69cb8906b5e2fd101262ff2eba) resulted in an error for the entire list it was run on. Tool standard error:
Get exon regions from /corral4/main/objects/9/2/1/dataset_921f4ded-de40-4e5e-8542-1c7fdf12d992.dat ...
Traceback (most recent call last):
File "/usr/local/bin/inner_distance.py", line 95, in <module>
main()
File "/usr/local/bin/inner_distance.py", line 87, in main
obj.mRNA_inner_distance(outfile=options.output_prefix,low_bound=options.lower_bound_size,up_bound=options.upper_bound_size,step=options.step_size,refbed=options.ref_gene,sample_size=options.sampleSize, q_cut = options.map_qual)
File "/usr/local/lib/python3.10/site-packages/qcmodule/SAM.py", line 3580, in mRNA_inner_distance
for exn in bed_obj.getExon():
File "/usr/local/lib/python3.10/site-packages/qcmodule/BED.py", line 484, in getExon
blockSizes = [ int(i) for i in f[10].strip(',').split(',') ]
File "/usr/local/lib/python3.10/site-packages/qcmodule/BED.py", line 484, in <listcomp>
blockSizes = [ int(i) for i in f[10].strip(',').split(',') ]
ValueError: invalid literal for int() with base 10: ''
The only changes I made to the tutorial were including my own data, changing the minimum intron size, and using the GRCm39.primary_assembly.genome.fa.gz and gencode.vM34.primary_assembly.annotation.gtf.gz files, since I work with mouse data. I used the American server, and this is my history.
My questions are:
Since all these tools use the same input, and some of the tools produce an error and some do not, can I trust the jobs that did run?
Why are some tools able to and some unable to produce an output?
What do these errors mean and how can I solve them?
Are errors in one tool indicative of potential problems in another, since they come from the same suite and use the same inputs?
This is the parameter the message is referring to. Do you need to adjust Minimum mRNA length (default: 100) to better fit your data?
More help is on that tool's form: scroll down to the Help section to review.
"This job was terminated because it used more memory than it was allocated."
This message means that the work is too large for the server to process (rare but possible!) or that there is some input/parameter problem.
I'd start by confirming that the reference genome and reference annotation are a "match". The annotation source you are using includes header lines on the GTF files that some tools might not understand. You can adjust the data yourself to make the interpretation more direct. This FAQ can help with getting reference data into a standardized format that most tools can understand: FAQ: Extended Help for Differential Expression Analysis Tools
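If you want to sanity-check the "match" yourself outside Galaxy, here is a minimal Python sketch. The GENCODE file names in the commented usage are the ones from your post; the function names are my own, and this is just one quick way to compare chromosome naming, not an official check.

```python
import gzip

def fasta_chroms(path):
    """Collect chromosome names from the '>' title lines of a gzipped FASTA.

    Only the first whitespace-separated token is kept, since that is what
    most tools treat as the sequence name.
    """
    with gzip.open(path, "rt") as fh:
        return {line[1:].split()[0] for line in fh if line.startswith(">")}

def gtf_chroms(path):
    """Collect column-1 sequence names from a gzipped GTF, skipping '#' headers."""
    with gzip.open(path, "rt") as fh:
        return {line.split("\t")[0] for line in fh
                if line.strip() and not line.startswith("#")}

# Example usage (paths are placeholders for your own downloads):
# fa  = fasta_chroms("GRCm39.primary_assembly.genome.fa.gz")
# gtf = gtf_chroms("gencode.vM34.primary_assembly.annotation.gtf.gz")
# print("in GTF but not FASTA:", sorted(gtf - fa))
```

If the last set is non-empty, the two files disagree on chromosome naming and tools that combine them can fail in confusing ways.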
I personally would check all of my inputs and the different tools usage expectations if anything fails, then try to figure out why and make adjustments.
The second tool is reporting why it failed (a data-versus-parameter problem that you can investigate, and maybe mitigate), the third is spinning out with a memory failure (indicates a data problem), and the fourth is reporting that it couldn't interpret some of the values in the data (these last two indicate some reference data issue).
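To illustrate that last error: the traceback points at BED12 parsing, where exon sizes live in column 11 as a comma-separated list. This is a simplified reconstruction of that one line of the library's logic, not the tool's actual code:

```python
def parse_block_sizes(bed_line):
    """Mimic qcmodule/BED.py getExon(): parse the blockSizes column of BED12.

    Column 11 (f[10]) looks like "200,300," in a well-formed line; one
    trailing comma is stripped, then each piece is converted to int.
    """
    f = bed_line.rstrip("\n").split("\t")
    return [int(i) for i in f[10].strip(",").split(",")]

good = "chr1\t100\t900\ttx1\t0\t+\t100\t900\t0\t2\t200,300,\t0,500,"
print(parse_block_sizes(good))  # [200, 300]

# A line whose blockSizes field is empty (or contains ",,") produces int(""),
# which raises exactly the error in the traceback:
# ValueError: invalid literal for int() with base 10: ''
bad = "chr1\t100\t900\ttx1\t0\t+\t100\t900\t0\t2\t\t0,500,"
try:
    parse_block_sizes(bad)
except ValueError as e:
    print("ValueError:", e)
```

So that ValueError means at least one record handed to the tool had an empty or malformed blockSizes field, which is why it smells like a reference/input data issue rather than a parameter issue.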
See above
It depends. I always start back at the beginning when troubleshooting. Why? Trying to make scientific interpretations from data that is tossing out technical errors is … tedious. I'd rather just get everything set up at the start, so I can trust the messages the tools are reporting.
So, I'll suggest you do the same, and start with the reference data. Then review how your data is loaded and labeled. We can walk through that here in more detail as needed. But start with the FAQ I linked: it was just updated and covers the details that I think are "most important", meaning they solve the vast majority of tool problems.
Yes, I set the value to 30 because that was the size of the smallest read post-trimming, but I skipped looking at the documentation and didn't realise the minimum-value restriction. On getting the error I didn't notice that -l was referring to --minimum_length. I have left the values at default and the program is running.
I think they should be - I downloaded Release M34 (GRCm39) files from the same webpage: GENCODE - Mouse Release M34
I used the same primary (PRI) region files for the GTF and FASTA.
I trust that the GTF files are formatted correctly because I downloaded them from GENCODE, and the other tools that did produce an output gave expected results when using them (e.g. Infer Experiment was able to confirm that the samples come from unstranded data, and other job lists using these files do produce outputs and reports).
I went back again to check the Inner Distance run and realised I had accidentally used the wrong input.
The only remaining problem now is the memory failure. Of the 18 in the list, 6 had a memory error, 3 ran successfully, and the rest are "waiting".
I guess the best way to determine whether I am correct in thinking the files are read correctly is to wait and see if changing the input parameters of Gene Body Coverage (BAM) and Inner Distance works. If they do, how would I go about addressing the "work being too large for the server" issue?
Yes, the data content should be fine from a data provider. However, sometimes data needs to be "standardized" in small ways. Why? Not every tool author is making the same format assumptions. That is where the extra help in that FAQ comes in. With experience, you'll discover that tiny differences sometimes need to be made to "help" a tool better understand the scientific content. Not every tool, just some tools, and certainly not just in Galaxy. Working in Galaxy is actually a bit easier, since all the metadata details smooth this out for you when possible, versus using tools directly on the command line.
So, you could do two specific things:
Use NormalizeFasta on the genome FASTA file to standardize the format. Specifically, remove the description content from the FASTA ">" title lines.
Use Select lines matching expression on the GTF file to standardize the format. Specifically, remove the # header lines.
If you do both, nearly all tools will understand the scientific content inside those two files, and be able to combine information between them for whatever algorithm the tool is applying. Meaning, this standardization wonât hurt and can usually help.
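For anyone who wants to see what those two cleanups actually do to the files, here is a minimal Python sketch of the equivalent transformations. The function names and example lines are mine, not the Galaxy tools' internals:

```python
def normalize_fasta_titles(src_lines):
    """Keep only the sequence name on '>' title lines, dropping descriptions.

    This is the effect of NormalizeFasta's "remove description" option:
    '>chr1 1 dna:chromosome ...' becomes '>chr1'.
    """
    for line in src_lines:
        if line.startswith(">"):
            yield line.split()[0] + "\n"
        else:
            yield line

def drop_gtf_headers(src_lines):
    """Remove '#' header lines that some tools cannot parse.

    This is the effect of selecting lines NOT matching '^#' on the GTF.
    """
    return (line for line in src_lines if not line.startswith("#"))

fasta_in = [">chr1 1 dna:chromosome\n", "ACGT\n"]
print(list(normalize_fasta_titles(fasta_in)))  # ['>chr1\n', 'ACGT\n']

gtf_in = ["#!genome-build GRCm39\n", "chr1\tHAVANA\texon\t1\t4\t.\t+\t.\tgene_id \"g\";\n"]
print(list(drop_gtf_headers(gtf_in)))  # only the data line remains
```

Either way, the scientific content is untouched; only the labels and headers that confuse picky tools are simplified.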
Since both files were from the same data provider, you can probably skip the chromosome comparisons explained in the FAQ. But you still might need to compare the naming scheme used in those files to any other data files involved. It depends how paranoid you want to be. I check everything for at least the first time through a new workflow, then I'll start to trust it. But if I get an error, I'll backtrack.
So, if you are still getting an out-of-memory error: that could be a data problem, or the data really is too large to process. Try cleaning up the reference data as a first troubleshooting pass. If that is enough, then you'll have learned that the specific tool you used doesn't understand data in one format but does in another, and you can adjust your analysis going forward to account for the picky tool. Or you might decide, like me, to just clean everything up at the start into the format that all (most?) tools can understand.
If, after you've done the data cleanup, a tool is still failing for memory reasons, then you can explore the content of your data files next. Are they large? Do the tool's parameters make sense for that content? Does the server need more resources allocated to the tool? Can those resources be allocated, or do you need a private server? Or can the same work be done with a different tool, by arranging the data differently, or by using different parameters? You can post back the job details, and your thoughts, and we can follow up from there. But please do the reference data adjustments first. Anyone would do the same when troubleshooting.
Firstly, thanks for your detailed answers - I really appreciate them!
To the problem at hand:
I started a new history with the "cleaned" FASTA and GTF files and started from step one to produce the files needed for Junction Saturation. This is taking time to process and I will get back to you with an update.
While I waited, I also ran Junction Saturation locally in Python on a powerful computer in the lab (compared to a standard desktop). I downloaded the files that were producing errors and used the command retrieved from the job details. All the files ran successfully and produced the expected outputs without error.
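For reference, a minimal sketch of how that local command line can be assembled; the -i/-r/-o flags match RSeQC's junction_saturation.py, but the file names are placeholders and the exact command with all parameters should be copied from the Galaxy job details, as above:

```python
def junction_saturation_cmd(bam, refbed, prefix):
    """Build the argument list for RSeQC's junction_saturation.py.

    -i: input BAM, -r: reference gene model in BED format, -o: output prefix.
    Extra options (e.g. minimum intron size) can be appended to this list.
    """
    return ["junction_saturation.py", "-i", bam, "-r", refbed, "-o", prefix]

# File names below are hypothetical placeholders for your own data.
cmd = junction_saturation_cmd("sample1.bam", "mm39_refseq.bed", "sample1_js")
print(" ".join(cmd))
# junction_saturation.py -i sample1.bam -r mm39_refseq.bed -o sample1_js
```

Running it with, for example, subprocess.run(cmd, check=True) makes it easy to loop over the datasets that failed on the server.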
I think the problem was a memory issue. I will continue with my analysis, and am still curious to see what happens using the "clean" data, but I don't expect there to be a difference.
I'd be curious whether the cleanup was enough or not, too. You can always share your work back for another review. If you have a straightforward example use case that requires more runtime memory, sometimes we can make an adjustment for that exact tool/version on the target server. What you are doing now is how to get to that next step. Not always possible, but it can certainly be confirmed as a clear yes/no.
Hello @jennaj! Cleanup did not help: it seems to be a memory/time issue, because a different set of files errored compared to before.
In fact, on starting from scratch, I got one memory error even on a tool that previously ran seemingly without issue before the cleanup. Might this be because I ran too many jobs at one time?