I am new to the tools described in this tutorial for genome-wide splicing analysis. I have followed the instructions and I am at the step of the RSeQC checks, but have some concerns:
Infer Experiments (Job ID bbd44e69cb8906b5b26fa353c4f97eb2), Read Distribution (bbd44e69cb8906b5460f20bc1bafb6f7), Junction Annotation (bbd44e69cb8906b57cb9ee739fdf63ea) worked (or at least they produced an output)
Gene Body Coverage (BAM) (bbd44e69cb8906b5270548fe8f18a46e) shows up as green but produced empty outputs, with tool standard error "The number specified to '-l' cannot be smaller than 100."
Junction Saturation (Job API ID bbd44e69cb8906b5d3867e3a0f5885fe) keeps failing ("This job was terminated because it used more memory than it was allocated.")
Inner Distance (bbd44e69cb8906b5e2fd101262ff2eba) resulted in an error for the entire list it was run on. Tool standard error:
Get exon regions from /corral4/main/objects/9/2/1/dataset_921f4ded-de40-4e5e-8542-1c7fdf12d992.dat ...
Traceback (most recent call last):
File "/usr/local/bin/inner_distance.py", line 95, in <module>
main()
File "/usr/local/bin/inner_distance.py", line 87, in main
obj.mRNA_inner_distance(outfile=options.output_prefix,low_bound=options.lower_bound_size,up_bound=options.upper_bound_size,step=options.step_size,refbed=options.ref_gene,sample_size=options.sampleSize, q_cut = options.map_qual)
File "/usr/local/lib/python3.10/site-packages/qcmodule/SAM.py", line 3580, in mRNA_inner_distance
for exn in bed_obj.getExon():
File "/usr/local/lib/python3.10/site-packages/qcmodule/BED.py", line 484, in getExon
blockSizes = [ int(i) for i in f[10].strip(',').split(',') ]
File "/usr/local/lib/python3.10/site-packages/qcmodule/BED.py", line 484, in <listcomp>
blockSizes = [ int(i) for i in f[10].strip(',').split(',') ]
ValueError: invalid literal for int() with base 10: ''
The only changes I made to the tutorial were including my own data, changing the minimum intron size, and using the GRCm39.primary_assembly.genome.fa.gz and gencode.vM34.primary_assembly.annotation.gtf.gz files, since I work with mouse data. I used the American server, and this is my history.
My questions are:
Since all these tools use the same input, and some of the tools produce an error and some do not, can I trust the jobs that did run?
Why are some tools able to and some unable to produce an output?
What do these errors mean and how can I solve them?
Are errors in one tool indicative of potential problems in another, since they come from the same suite and use the same inputs?
This is the parameter the message is referring to. Do you need to adjust Minimum mRNA length (default: 100) to better fit your data?
More help is on that tool's form: scroll down to the Help section to review.
"This job was terminated because it used more memory than it was allocated."
This message means that the work is too large for the server to process (rare but possible!) or that there is some input/parameter problem.
I'd start by confirming that the reference genome and reference annotation are a "match". The annotation source you are using includes header lines on the GTF files that some tools might not understand. You can adjust the data yourself to make the interpretation more direct. This FAQ can help with getting reference data into a standardized format that most tools can understand: FAQ: Extended Help for Differential Expression Analysis Tools
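If you want to sanity-check the "match" yourself outside Galaxy, here is a minimal Python sketch. The GENCODE file names in the commented usage are the ones from your post; the function names are my own, and this is just one quick way to compare chromosome naming, not an official check.

```python
import gzip

def fasta_chroms(path):
    """Collect chromosome names from the '>' title lines of a gzipped FASTA.

    Only the first whitespace-separated token is kept, since that is what
    most tools treat as the sequence name.
    """
    with gzip.open(path, "rt") as fh:
        return {line[1:].split()[0] for line in fh if line.startswith(">")}

def gtf_chroms(path):
    """Collect column-1 sequence names from a gzipped GTF, skipping '#' headers."""
    with gzip.open(path, "rt") as fh:
        return {line.split("\t")[0] for line in fh
                if line.strip() and not line.startswith("#")}

# Example usage (paths are placeholders for your own downloads):
# fa  = fasta_chroms("GRCm39.primary_assembly.genome.fa.gz")
# gtf = gtf_chroms("gencode.vM34.primary_assembly.annotation.gtf.gz")
# print("in GTF but not FASTA:", sorted(gtf - fa))
```

If the last set is non-empty, the two files disagree on chromosome naming and tools that combine them can fail in confusing ways.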
I personally would check all of my inputs and the different tools usage expectations if anything fails, then try to figure out why and make adjustments.
The second tool is reporting why it failed (a data-versus-parameter problem that you can investigate, and maybe mitigate), the third is spinning out with a memory failure (indicates a data problem), and the fourth is reporting that it couldn't interpret some of the values in the data (these last two indicate some reference data issue).
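To illustrate that last error: the traceback points at BED12 parsing, where exon sizes live in column 11 as a comma-separated list. This is a simplified reconstruction of that one line of the library's logic, not the tool's actual code:

```python
def parse_block_sizes(bed_line):
    """Mimic qcmodule/BED.py getExon(): parse the blockSizes column of BED12.

    Column 11 (f[10]) looks like "200,300," in a well-formed line; one
    trailing comma is stripped, then each piece is converted to int.
    """
    f = bed_line.rstrip("\n").split("\t")
    return [int(i) for i in f[10].strip(",").split(",")]

good = "chr1\t100\t900\ttx1\t0\t+\t100\t900\t0\t2\t200,300,\t0,500,"
print(parse_block_sizes(good))  # [200, 300]

# A line whose blockSizes field is empty (or contains ",,") produces int(""),
# which raises exactly the error in the traceback:
# ValueError: invalid literal for int() with base 10: ''
bad = "chr1\t100\t900\ttx1\t0\t+\t100\t900\t0\t2\t\t0,500,"
try:
    parse_block_sizes(bad)
except ValueError as e:
    print("ValueError:", e)
```

So that ValueError means at least one record handed to the tool had an empty or malformed blockSizes field, which is why it smells like a reference/input data issue rather than a parameter issue.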
See above
It depends. I always start back at the beginning when troubleshooting. Why? Trying to make scientific interpretations from data that is tossing out technical errors is … tedious. I'd rather just get everything set up at the start, so I can trust the messages the tools are reporting.
So, I'll suggest you do the same, and start with the reference data. Then review how your data is loaded and labeled. We can walk through that here in more detail as needed. But start with the FAQ I linked: it was just updated and covers the details that I think are "most important", meaning they solve the vast majority of tool problems.
Yes, I set the value to 30 because that was the size of the smallest read post-trimming, but I skipped looking at the documentation and didn't realise the minimum-value restriction. On getting the error I didn't notice that -l was referring to --minimum_length. I have left the values at default and the program is running.
I think they should be - I downloaded Release M34 (GRCm39) files from the same webpage: GENCODE - Mouse Release M34
I used the same primary (PRI) region files for the GTF and FASTA.
I trust that the GTF files are formatted correctly because I downloaded them from GENCODE, and the other tools that did produce an output gave expected results when using them (e.g. Infer Experiment was able to confirm that the samples come from unstranded data, and other job lists using these files do produce outputs and reports).
I went back again to check the Inner Distance run and realised I had accidentally used the wrong input.
The only remaining problem now is the memory failure. Of the 18 in the list, 6 had a memory error, 3 ran successfully, and the rest are "waiting".
I guess the best way to determine whether I am correct in thinking the files are read correctly is to wait and see if changing the input parameters of Gene Body Coverage (BAM) and Inner Distance works. If they do, how would I go about addressing the "work being too large for the server" issue?
Yes, the data content should be fine from a data provider. However, sometimes data needs to be "standardized" in small ways. Why? Not every tool author is making the same format assumptions. That is where the extra help in that FAQ comes in. With experience, you'll discover that tiny differences sometimes need to be made to "help" a tool better understand the scientific content. Not every tool, just some tools, and certainly not just in Galaxy. Working in Galaxy is actually a bit easier, since all the metadata details smooth this out for you when possible, versus using tools directly on the command line.
So, you could do two specific things:
Use NormalizeFasta on the genome FASTA file to standardize the format. Specifically, remove the description content from the FASTA ">" title lines.
Use Select lines matching expression on the GTF file to standardize the format. Specifically, remove the # header lines.
If you do both, nearly all tools will understand the scientific content inside those two files, and be able to combine information between them for whatever algorithm the tool is applying. Meaning, this standardization wonât hurt and can usually help.
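For anyone who wants to see what those two cleanups actually do to the files, here is a minimal Python sketch of the equivalent transformations. The function names and example lines are mine, not the Galaxy tools' internals:

```python
def normalize_fasta_titles(src_lines):
    """Keep only the sequence name on '>' title lines, dropping descriptions.

    This is the effect of NormalizeFasta's "remove description" option:
    '>chr1 1 dna:chromosome ...' becomes '>chr1'.
    """
    for line in src_lines:
        if line.startswith(">"):
            yield line.split()[0] + "\n"
        else:
            yield line

def drop_gtf_headers(src_lines):
    """Remove '#' header lines that some tools cannot parse.

    This is the effect of selecting lines NOT matching '^#' on the GTF.
    """
    return (line for line in src_lines if not line.startswith("#"))

fasta_in = [">chr1 1 dna:chromosome\n", "ACGT\n"]
print(list(normalize_fasta_titles(fasta_in)))  # ['>chr1\n', 'ACGT\n']

gtf_in = ["#!genome-build GRCm39\n", "chr1\tHAVANA\texon\t1\t4\t.\t+\t.\tgene_id \"g\";\n"]
print(list(drop_gtf_headers(gtf_in)))  # only the data line remains
```

Either way, the scientific content is untouched; only the labels and headers that confuse picky tools are simplified.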
Since both files were from the same data provider, you can probably skip the chromosome comparisons explained in the FAQ. But you still might need to compare the naming scheme used in those files to any other data files involved. It depends how paranoid you want to be. I check everything for at least the first time through a new workflow, then I'll start to trust it. But if I get an error, I'll backtrack.
So, if you are still getting an out-of-memory error: that could be a data problem, or the data really is too large to process. Try cleaning up the reference data as a first troubleshooting pass. If that is enough, then you'll have learned that the specific tool you used doesn't understand data in one format but does in another, and you can adjust your analysis going forward to account for the picky tool. Or you might decide, like me, to just clean everything up at the start into the format that all (most?) tools can understand.
If, after you've done the data cleanup, a tool is still failing for memory reasons, then you can explore the content of your data files next. Are they large? Do the tool's parameters make sense for that content? Does the server need more resources allocated to the tool? Can those resources be allocated, or do you need a private server? Or can the same work be done with a different tool, by arranging the data differently, or by using different parameters? You can post back the job details, and your thoughts, and we can follow up from there. But please do the reference data adjustments first. Anyone would do the same when troubleshooting.
Firstly, thanks for your detailed answers - I really appreciate them!
To the problem at hand:
I started a new history with the "cleaned" FASTA and GTF files and started from step one to produce the files needed for Junction Saturation. This is taking time to process and I will get back to you with an update.
While I waited, I also ran Junction Saturation locally in Python on a powerful computer in the lab (compared to a standard desktop). I downloaded the files that were producing errors and used the command retrieved from the job details. All the files ran successfully and produced the expected outputs without error.
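For reference, a minimal sketch of how that local command line can be assembled; the -i/-r/-o flags match RSeQC's junction_saturation.py, but the file names are placeholders and the exact command with all parameters should be copied from the Galaxy job details, as above:

```python
def junction_saturation_cmd(bam, refbed, prefix):
    """Build the argument list for RSeQC's junction_saturation.py.

    -i: input BAM, -r: reference gene model in BED format, -o: output prefix.
    Extra options (e.g. minimum intron size) can be appended to this list.
    """
    return ["junction_saturation.py", "-i", bam, "-r", refbed, "-o", prefix]

# File names below are hypothetical placeholders for your own data.
cmd = junction_saturation_cmd("sample1.bam", "mm39_refseq.bed", "sample1_js")
print(" ".join(cmd))
# junction_saturation.py -i sample1.bam -r mm39_refseq.bed -o sample1_js
```

Running it with, for example, subprocess.run(cmd, check=True) makes it easy to loop over the datasets that failed on the server.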
I think the problem was a memory issue. I will continue with my analysis, and am still curious to see what happens using the "clean" data, but I don't expect there to be a difference.
I'd be curious whether the cleanup was enough or not, too. You can always share your work back for another review. If you have a straightforward example use case that requires more runtime memory, sometimes we can make an adjustment for that exact tool/version on the target server. What you are doing now is how to get to that next step. Not always possible, but it can certainly be confirmed as a clear yes/no.
Hello @jennaj! Cleanup did not help: it seems to be a memory/time issue, because a different set of files errored compared to before.
In fact, on starting from scratch, I got one memory error even on a tool that previously ran seemingly without issue before the cleanup. Might this be because I ran too many jobs at one time?