Issue with rnaQUAST tool

Hello,

I’m having some trouble with the rnaQUAST tool.

I am following the Genome-wide alternative splicing analysis tutorial on galaxy and at the transcriptome evaluation step using rnaQUAST, I keep getting the error…

Additionally, in preparation for using the IsoformSwitchAnalyzeR, I am stuck on the step requiring to extract the reference transcriptome using the gffread tool. It failed and gave me this error even though I was using the cached reference data for hg38 FASTA sequence reference genome (in-built).

Im not sure how to proceed and any insights would be greatly appreciated!

Thank you!

Hi @SehajR

I have a copy of the completed protocol in this shared history. Maybe comparing to these steps helps? → https://usegalaxy.eu/u/jenj/h/gtn-differential-isoform-expression-1

XREF tutorial → Hands-on: Genome-wide alternative splicing analysis / Genome-wide alternative splicing analysis / Transcriptomics

Then, for your first error, I would be curious about what the Job Details view looks like for this job. Find this using the i-info icon. The tool is reporting that there is a duplication in one of the inputs. That is just a guess by the tool – so there could be a repeated input file but also inspect that view in general. Example: toggle open the inputs and check to make sure these all have the expected content, and are not empty.

And for the second error, yes, the issue has to do with a mismatch with the reference data. The input data for the tutorial includes a special downsampled set of reference files. You’ll want to use those. Or, you can replace these with full reference versions. But a hybrid approach is not likely to work well – and most of those “mismatched data” errors will look like yours. One of the annotation files (GFF files) includes a sequence that was not in the genome fasta files. Maybe one genome was used for upstream steps, and a different one for downstream steps? Instead, using the same assembly version throughout an analysis is usually required.

:rocket: Bonus guide about reference data → Reference genomes at public Galaxy servers: GRCh38/hg38 example

Please let us know if this helps, and if you would like more help, you are welcome to share back your history!

1 Like

Hi Jennifer,

Thank you so much for your help!
I had sent an email regarding this issue to galaxy-bugs@galaxyproject.org where I linked my history but I can share that here too if that helps.

Hi Jennifer,

I had a look at the job details for the first issue with the rnaQUAST tool, and there doesn’t seem to be any duplicate inputs…

As for the gffreads, I’m still confused because I’ve consistently used the cached hg38 reference genome provided by galaxy for all upstream analysis and the same GTF file consistently, I’m not really sure what I’m missing.

Hi @SehajR

Thanks for sharing the history! Very helpful.

The tutorial was using hg19. And you are using hg38 instead, which is fine, but you will need to gather together the reference data at the very start, then continue to use those same files for all steps.

Right now you are using the built-in hg38 genome (which was sourced from UCSC) for some steps, then a Gencode sourced genome fasta for other steps. This is technically fine except that the reference assembly didn’t have the format simplified. This lead to part of one of the problems.

Then, the annotation files also had a few different versions, some from UCSC and some from Gencode. Some of those versions also had minor format issues that led to the errors.

Try this

  1. Choose a reference genome source and use it for all steps. This can be a fasta in the history. How to format this fasta file is covered in a few of our FAQs, this is a good place to start:
  • FAQ: How to use Custom Reference Genomes?
  • Notice that the fasta file is processed through a tool that removes all of the description content out of the file – leaving just the sequence identifiers on the > title lines.
  1. Choose a reference annotation source and use it for all steps. It doesn’t matter which you choose – the version from UCSC (RefSeq) or the version from Gencode (Ensembl) with both work but you shouldn’t switch between them. If you prefer a format/source for the gene and transcript names, then choose that one, and use it with Stringtie and then the other steps, too.

With those two files, you will be able to derive all of the other files such as a BED12 to use with the extra QC steps.

If you click on the i-info icon for dataset 913 (rnaQuast failture), you’ll see that both of the reference files had a formatting issue, then if you click back through the upstream jobs, the Stringtie processes used a completely different GTF file. So, that job had a few different problems!

Hope this helps! :slight_smile:

Hi Jennifer,

So I went back to the start and looked at each of the jobs in my history to check which files I used for the reference genome and annotation. I know I had uploaded multiple types but for the gtf file, I stuck to using the hg38.ncbiRefSeq.gtf (#182 in the history link).

The first use is with RNA STAR for mapping the splice junctions I provided the hg38.ncbiRefSeq.gtf (#182) but used the built-in index reference genome. I selected the built-in as I recall when uploading one from my end to build an index, it was not working as it needed way more memory which exceeded the limit in galaxy.

For the second RNA STAR using the 2-pass mapping, I again used the same built-in hg38 index for the reference genome and the same hg38.ncbiRefSeq.gtf annotation(#182) for splice junctions .

Then for the first StringTie (#610), the same hg38.ncbiRefSeq.gtf (#182) annotation was used as the reference file.

For StringTie merge (#733), again the same hg38.ncbiRefSeq.gtf (#182) annotation was used.

And in the third StringTie, I followed the tutorial where this time the reference file for the GTF is the output of the previous StringTie merge.

For the GffCompare tool, again only hg38.ncbiRefSeq.gtf (#182) annotation was used for input as the source for the reference annotation.

For the first gffread job (#858) right before the rnaQUAST tool, I used the same hg38 reference genome provided by galaxy (selection: cached reference - hg38) that I’m assuming is the same one as RNA STAR’s built-in hg38 reference genome index option.

Then at the rnaQUAST tool (#959) is where I run into my first issue. Up until this point, I have used the same annotation and reference genome. The problem is for the reference genome selection in this tool, there is no option to use the cached reference from galaxy that I’ve been previously using. So I downloaded an external hg38 fasta file from NCBI (#891 GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta) and used the same hg38.ncbiRefSeq.gtf (#182) genome annotation, which still failed.

Is there a way for me to use the cached reference genome from galaxy, to upload that version to my history and select it that way for rnaQUAST?

The second issue occurs right after, when I use the gffread tool a second time (#961) following the tutorial. Again I used the same cached hg38 reference genome provided by galaxy as with the previous gffread and the previous RNA STAR runs, but this time it failed.

I am not clear where the inconsistency is.

Hi Jennifer,

I’ve been continuing to try to troubleshoot these issues and something weird occurred. The output of the second gffreads (#961) which failed looks the same as the output from the same step in the history you shared for the tutorial. The only difference is mine has refseq IDs and the tutorial had ensemble IDs….so the output seems fine even if it says “failed”.

I was unable to use the “failed” job as an input for IsoformSwitchAnalyzeR so I just downloaded the “failed” output and re-uploaded it which allowed me to use it as input (#963), that then worked with IsoformSwitchAnalyzeR at step 964.

But then the second IsoformSwitchAnalyzeR steps all failed (#965 to #968).

Hi @SehajR

Thanks for explaining all the details!

The rnaQuast tool should be able to access the RNA-Star indexes, but currently cannot. I’ve made a request to the developers for this change.

Another wrinkle is that gffread is accessing a slightly different patch release of the hg38 assembly, so whenever that is run with the server index, minor differences are detected (these are revealed if the extra logs are toggled to yes/warn on the form). I’ll be following up on this, but for now, using the same genome fasta with this tool also solves the errors (reloading was a good workaround!). But since gffread was used several steps back, too, the input to rnaQuast would need to also be regenerated. Tedious!

As a test, I would suggest running the workflow associated with the tutorial using your own read data and reference data. This would involve providing both the reference genome and reference annotation from the history.

You can use the hg38 genome as it is linked in the ticket above along with the annotation you already have in dataset 182. Both are from UCSC and are “paired.” It also looks like you have already loaded the Pfam reference data from the tutorial. This means you have all of the inputs needed. This will give the protocol a full test with just a few clicks, then you can examine and make more custom changes as wanted. If you run into any problems, we can follow up more!

Hi Jennifer,

I’m trying to select files that pair well, and I think I can only use files from gencode because reference annotations sourced from UCSC (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/) don’t have the gene_type feature in the Attributes’ column which is needed for generating files with the ‘search in textiles tool’ to use as input with the CPAT tool.

I loaded a new gtf reference annotation, this time from gencode that I got from (Index of /refgenomes-databio/alias/hg38/gencode_gtf/default/), this is job #994 in my history.

I also am struggling to figure out which fasta file to use from gencode that would be compatible with this gencode reference annotation…

Can I just use the RNA STAR built-in index because I’m worried I’ll run into the previous problem of galaxy not having enough memory to build an index from the custom reference genome I provide.

Yes, you can probably use the RNA-Star index, as long as you get the fasta into your history and use it with both gffread and rnaQuast.

And you are right – the way this tutorial does the filtering on the annotation is expecting the Gencode formatted version of the GTF.

If you also got the hg38 assembly from Gencode, you could get the fasta version for just the reference chromosomes, and that would solve the problem with a few of the alignments mapping to the alt/haplotype regions.

But all of these are choices you can make, now that you understand what the root issues were: minor differences between patch assemblies.

1 Like

Hi Jennifer,

So I went ahead and redid the analysis but this time using a reference annotation from gencode which I formatted by removing the # headers (#995). I also used a gencode primary reference genome assembly fasta file that excludes assembly patches and alternate loci (haplotypes) from gencode’s hg38, I also used NormalizeFasta to clean it up according to FAQ: How to use Custom Reference Genomes?

For the RNA STAR runs I stuck with a built-in index, but once I got to the gffread step (#1444), the tool failed again even when using the uploaded gencode primary genome assembly fasta file.

It seems this method unfortunately still doesnt work.

Hi @SehajR

Would you like to share that new history back? I really would like to get this fully solved for you. We are already flushing out a few pain points with the new protocol! The smaller reference data in the tutorial was masking some of the complexities.

Then, as a proof-of-concept for the suggestions I had about the reference data, I loaded up the RNA-Star version of the reference genome to use with both of your original errors with gffread and rnaQuast, and both of those “worked”. I tagged the datasets and all are near the top. Maybe you can spot what I did as one solution? But using the Gencode fasta should be about the same, as long as it is used with all the proceeding steps. I might start up the workflow with the Gencode files next to see what happens to double confirm!

Running this analysis on the full human genome with real full sized sample data is important! Thank you for sharing your data so far to help with the stress testing! Once we are done I’ll get rid of all my copies. :slight_smile:

Hey Jennifer,

I really appreciate all your help on this! I’ve shared my history link here again

Just to clarify the gencode reference annotation I got from Index of /refgenomes-databio/alias/hg38/gencode_gtf/default/

And the gencode reference genome from GENCODE - Human Release 48 the primary reference genome fasta file without the assembly patches and alternate loci (haplotypes)

I’m a little confused is the RNA star version of the reference genome you used for UCSC? I don’t think I’d be able to use it in that case since reason for choosing gencode was for the gene_type feature in the attributes column of the gencode reference annotation which is needed for downstream steps in the tutorial.

1 Like

Hi @SehajR

Your job failed because the reference genome used for mapping the upstream BAM files was the built in hg38 index, not the custom genome in the history.

So, this loops back around to the initial suggestion: pull in the same exact reference genome used for mapping into the history, and use that instead. An example is in my copy of your history that I shared this morning.

The alternative is to use the Gencode fasta for all steps. Mixing up different reference genome assemblies is where the conflicts are introduced – different patch release alt/haplotype fragments.

And I decided to remove that complicated history with you data and my reruns combined. Let’s generate a very clean history using the workflow, your read data, and both Gencode reference files (prepared to the simple formats).

You can click through to see the inputs from pending steps/parameters/dataset choices right now, then it will populate with completed data over the next day or so.

You can do the same, or grab a copy later on.

1 Like

Hi Jennifer,

The reason I didn’t use the gencode reference genome fasta (custom) with RNA STAR and instead used the built-in index is because as I previously mentioned, I’ve run into the issue of galaxy not having enough memory to build an index from the custom reference genome I provide.

I can try it again with this new gencode reference genome fasta as I was using maybe a larger file before…

oh my gosh this is a life saver thank you!! :clap:

1 Like

Oh sure, and those memory concerns were very likely to be a problem up until about a year ago when we were able to access more/different clusters at ORG. Before then, EU was reserved for the very largest memory jobs. But this should work fine now, and I have had a few test jobs using all of your samples process correctly with the human custom genome I am suggesting. But I haven’t run all of it top to bottom yet – and I’m curious how this will turn out – so let’s keep my version of the history running along with what you are trying.

So: I am expecting all the work in that new shared history to process correctly. The wrinkle right now is that of the public servers are under massive load from the training events this week, so random errors might pop out. The cool thing about workflows is that when something goes wrong, it is easy to rerun then resume all the downstream steps really quickly.

This should flush out any small issues remaining but let’s confirm by letting this run and following up as it progresses. :slight_smile:

1 Like

Thanks so much Jennifer,

I’ll keep monitoring it, this has been a huge learning experience for me so thank you!

Hi Jennifer,

Just an update on my end, so far RNA STAR seems to work fine with the external custom gencode reference annotation I provided which is hopeful :+1:

I’ve shared a new link to my history and just started fresh.