"error in importRdata" using IsoformSwitchAnalyzeR

Hello,

I’m trying to run IsoformSwitchAnalyzeR. My history is here: Galaxy | Europe

I seem to be having a problem with the reference transcriptome not matching with my quantified data.

My error message:

“Step 1 of 7: Checking data…
Step 2 of 7: Obtaining annotation…
importing GTF (this may take a while)…
Error in importRdata(isoformCountMatrix = quantificationData$counts, isoformRepExpression = quantificationData$abundance, :
The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925).
Either isforoms found in the annotation are not quantifed or vise versa.
Specifically:
63009 isoforms were quantified.
60127 isoforms are annotated.
Only 60049 overlap.
2960 isoforms quantifed had no corresponding annoation”

I’ve double-checked my filtering steps and they seem fine. I’m not sure whether I’ve missed something or whether the WormBase gene IDs are causing the problem.
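One way to see which isoforms are behind a mismatch like this is to compare the two ID sets directly. The following is a minimal sketch, assuming the transcript IDs have already been extracted from the quantification table and the annotation into two lists. The function name and the example IDs are made up for illustration, and the Jaccard value uses the standard set definition (intersection over union), which may not be exactly how the tool computes its own figure.

```python
def compare_isoform_ids(quantified_ids, annotated_ids):
    """Summarise the overlap between quantified and annotated isoform IDs."""
    quantified = set(quantified_ids)
    annotated = set(annotated_ids)
    shared = quantified & annotated
    union = quantified | annotated
    return {
        "quantified": len(quantified),
        "annotated": len(annotated),
        "overlap": len(shared),
        # IDs present in only one of the two inputs
        "quantified_only": sorted(quantified - annotated),
        "annotated_only": sorted(annotated - quantified),
        # Standard Jaccard similarity: |A ∩ B| / |A ∪ B|
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical IDs, for illustration only
report = compare_isoform_ids(
    ["tx1", "tx2", "MSTRG.1.1"],
    ["tx1", "tx2", "tx3"],
)
print(report["quantified_only"])
print(report["jaccard"])
```

Looking at the "quantified_only" list usually shows quickly whether the extra isoforms share a pattern, such as a common ID prefix.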

Thanks for any help

Tutorial:

Hi @Margaret_M

StringTie creates placeholder transcript and gene IDs for novel data. Novel means the transcript has no known annotation.
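To check whether placeholder IDs are present in a list of quantified transcripts, a quick sketch like the one below can help. It assumes the placeholders start with "STRG." or "MSTRG.", which I believe are the StringTie defaults for assembly and merge mode respectively; if a custom label was set on the tool form, the prefix will be different. The function name and example IDs are hypothetical.

```python
def split_novel_ids(transcript_ids, novel_prefixes=("STRG.", "MSTRG.")):
    """Separate transcript IDs into known and placeholder (novel) groups.

    novel_prefixes is an assumption about the placeholder naming; adjust it
    if a custom label was used when running the assembler.
    """
    known, novel = [], []
    for transcript_id in transcript_ids:
        # str.startswith accepts a tuple and matches any of its entries
        if transcript_id.startswith(novel_prefixes):
            novel.append(transcript_id)
        else:
            known.append(transcript_id)
    return known, novel


# Hypothetical IDs, for illustration only
known, novel = split_novel_ids(["txA.1", "MSTRG.12.1", "txB.2", "STRG.3.1"])
print(len(known), "known;", len(novel), "novel")
```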

Setting Use Reference transcripts only? to Yes restricts the data to known annotation, so that is one choice.

Or, an annotation GTF and transcript fasta can be created to include novel predictions. That is what is happening in this section of the tutorial (near the end, please scroll down). https://training.galaxyproject.org/topics/transcriptomics/tutorials/differential-isoform-expression/tutorial.html#transcriptome-assembly-quantification-and-evaluation

I don’t see those manipulations in your history – or did I miss it?

Thank you. I did indeed skip the latter manipulation, because it doesn’t seem to cut the transcripts fully contained in a reference intron out of the reference - just identifies them. Am I misunderstanding the tutorial?

Yes, this is what the analysis is doing: identifying novel transcripts.

I’m not sure what this means. Could you explain more?

Yes. As far as I understand, the problem that IsoformSwitchAnalyzeR is having is that there are unannotated isoforms in one of its inputs and not in the other. I noticed in the tutorial that “Use Reference transcripts only?” is toggled to No at the beginning of the StringTie assembly, but later toggled to Yes. I’m attempting to re-run the workflow with “Use Reference transcripts only?” always toggled to Yes.

Those settings were specified on purpose since they fit the data in the tutorial example. Setting both to Yes is what I would suggest as well. I had forgotten about that bit (it has come up before) but definitely recognize what is going on now, so thanks for clarifying. Hope that works out!

Well, that does seem to resolve that error, thank you! However, now I get the following error, and I’m not sure how to interpret it or what to do about it. Any suggestions would be gratefully appreciated.

My history: Galaxy | Europe

"Step 1 of 3: Identifying which algorithm was used…
The quantification algorithm used was: StringTie
Found 6 quantification file(s) of interest
Step 2 of 3: Reading data…
reading in files with read_tsv
1 Error in tximport::tximport(files = localFiles, type = tolower(dataAnalyed$orign), :
all(c(abundanceCol, countsCol, lengthCol) %in% names(raw)) is not TRUE
Calls: importIsoformExpression -> <Anonymous> -> stopifnot
Warning message:
One or more parsing issues, call problems() on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)"
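For anyone reading the assertion in that message: it fails when the column names the importer looks for are not all present in the header row of a quantification table. Here is a rough sketch of the same kind of check in Python. The column names in the example ("FPKM", "cov", "length") are my assumption about what is expected for StringTie output and are not confirmed by the log above; the function name is hypothetical.

```python
import csv
import io


def missing_columns(table_text, required, delimiter="\t"):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(io.StringIO(table_text), delimiter=delimiter)
    header = next(reader, [])  # an empty table yields an empty header
    return [name for name in required if name not in header]


# Assumed column names, for illustration only
required = ["FPKM", "cov", "length"]

table = "t_id\tt_name\tlength\tcov\tFPKM\n1\ttxA\t900\t3.2\t1.5\n"
print(missing_columns(table, required))

renamed = "t_id\tt_name\tlen\tcoverage\tFPKM\n1\ttxA\t900\t3.2\t1.5\n"
print(missing_columns(renamed, required))
```

This only illustrates what the check tests for. It does not establish which file or column is responsible in this history.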

Hum… this is about importing a file, and the data frame (“table”) doesn’t match what the tool is expecting. In other words, the data needs to be R-friendly. We can catch some of that for free-text custom values entered on the tool form, but not all cases. So, simpler is better.

What I usually check first:

  1. No empty files or header-only files (it happens…)
  2. Do any files have a header? Try removing those, leaving just data lines.
  3. Are all custom values entered on the tool form (labels) formatted OK? Alphanumeric characters and underscores, not starting with a number, tend to work best. Also, keep them short-ish to avoid another gotcha.
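The first and third checks can be sketched in a few lines of Python if you have the files or labels locally. This is only an illustration: the function names are hypothetical, the 20-character limit is an arbitrary stand-in for "short-ish", the pattern is slightly stricter than the advice above (it requires a leading letter), and the emptiness check only recognises header lines that start with "#".

```python
import re

# Letters, digits and underscores only, starting with a letter
LABEL_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def label_is_safe(label, max_length=20):
    """True if the label is short and uses only conservative characters.

    max_length is an arbitrary illustrative limit, not a tool requirement.
    """
    return len(label) <= max_length and bool(LABEL_PATTERN.match(label))


def has_data_lines(text, comment_prefix="#"):
    """True if the text contains at least one non-blank, non-comment line."""
    return any(
        line.strip() and not line.startswith(comment_prefix)
        for line in text.splitlines()
    )


print(label_is_safe("control_1"))
print(label_is_safe("1st sample"))
print(has_data_lines("# header only\n"))
```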

If that doesn’t work, please post back a link to the Dataset Details view (“i” in a circle icon in an expanded dataset). That has all the technical details, plus the inputs, parameters, and full job logs. The data itself will only have a peek view, which is sometimes enough.

I’ve checked and removed the one header I could find. I’ve made sure that the files are not empty and the values are all formatted fine.

Link to the details here.

Thanks for any suggestions

Hi @Margaret_M

This GTF still appears to have headers. Please try correcting this data and any others like it. You can run the manipulation on the collection.

  • tool: Select
  • option: Not Matching
  • regular expression: ^#
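For reference, the effect of those settings is to keep only the lines that do not match the regular expression, so every line starting with # is dropped. A small Python sketch of the same filtering, with a hypothetical function name and made-up GTF lines:

```python
import re


def drop_matching_lines(lines, pattern=r"^#"):
    """Keep only the lines that do NOT match the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if not regex.search(line)]


# Made-up GTF content, for illustration only
gtf_lines = [
    "# StringTie version x.y",
    "# another comment line",
    'I\tStringTie\ttranscript\t100\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]
for line in drop_matching_lines(gtf_lines):
    print(line)
```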

Note: Including dots (.) in the element identifiers of collections can cause different problems, especially when using a workflow. When creating a collection, always use the default option to remove file extensions. From here, you could recreate the collection or use the Collection Manipulation tools to adjust the identifiers.

Hi @Margaret_M,

I’m trying to re-run your analysis from the beginning; I’m not sure if the problem could be a result of a deficient annotation, because IsoformSwitchAnalyzeR seems to be very stringent about it. Could you re-upload the Caenorhabditis paired-end datasets and share the fresh history with me?

Regards

Certainly, here is a history with just the paired-end datasets and the WBcel235 assembly files. I had wondered whether the problem was that I’m not using Ensembl IDs.

Hi @Margaret_M,
did you remove the hidden files?

I’m very sorry, which files would you like to see?

The original FASTQ files.