Continuing the discussion from [RSEM prepare reference tool does not produce valid reference files]

Dear specialists,

I would like to ask for your help with the “RSEM prepare reference” tool, which fails to prepare an RSEM reference with either “transcript fasta” or “reference genome and GTF” as the reference transcript source.

With “transcript fasta”, the tool finishes the job with empty output and the warning below:

RSEM_Ref_Gencode43_CDS
WARNING:galaxy.model:Datatype class not found for extension ‘rsem_ref’

With “reference genome and GTF”, the tool fails with a standard error.
I created a bug report from this failing job.

Would you please suggest how to proceed to generate an RSEM reference on CDS with the “transcript fasta” option?

Please find the link to the history below;

Kind regards,
Serdar

Hi @qcsciphi

The message in one of the red error datasets:

RSEM_Ref_Gencode43_only
Cannot extract transcript ENST00000373020.9’s sequence from chromosome ENSG00000000003.15, whose information might not be provided! Please check if the chromosome directory is set correctly or the list of chromosome files is compl (ete)

ENST00000373020.9 is a transcript ID
ENSG00000000003.15 is a gene ID, but the tool appears to be reading it where a chromosome name is expected.

This was another error:

RSEM_Ref_Ensemble109_only
.gtf file might be corrupted!
Stop at line : 1 ensembl_havana gene 1471765 1497848 . + . gene_id “ENSG00000160072”; gene_version “20”; gene_name “ATAD3B”; gene_source “ensembl_havana”; gene_biotype “protein_coding”;

1 is a chromosome
ENSG00000160072 is a gene ID
and no transcript is specified on that line of the file.
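
To find lines like this one before running the tool, a small script can list the GTF records that carry no transcript_id attribute. This is only a minimal sketch: it assumes a tab-separated GTF with key "value" attribute pairs in column 9, and it assumes the tool wants a transcript_id on each record it reads. The function names and the TX_EXAMPLE identifier are made up for illustration.

```python
import re

# Matches key "value" pairs in the GTF attributes column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Return the GTF attributes column as a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def lines_missing_transcript_id(gtf_lines):
    """Yield (line_number, feature_type) for records without a transcript_id."""
    for number, line in enumerate(gtf_lines, start=1):
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        if "transcript_id" not in parse_attributes(columns[8]):
            yield number, columns[2]


# Hypothetical two-line sample; the second record's transcript ID is invented.
sample = [
    '1\tensembl_havana\tgene\t1471765\t1497848\t.\t+\t.\t'
    'gene_id "ENSG00000160072"; gene_version "20";\n',
    '1\tensembl_havana\ttranscript\t1471765\t1497848\t.\t+\t.\t'
    'gene_id "ENSG00000160072"; transcript_id "TX_EXAMPLE";\n',
]

for number, feature in lines_missing_transcript_id(sample):
    print(f"line {number}: feature '{feature}' has no transcript_id")
```

On a real file, pass an open file handle instead of the sample list. If only gene records are reported, that would be consistent with the error above.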

What to modify or check:

  1. Each input area on the form expects data in specific formats, and with the matching datatype assigned.
  2. If you are not sure what datatypes are expected, see the topic “fastq unavailable -- Tool does not recognize inputs? How to check why”.
  3. Double-check that the content of those data follows the datatype specification.
  4. Check across files for consistency. Tools are literal when making matches between important identifiers.
  • Meaning, if one file has an identifier, then other files should have that exact identifier, not a variation.
  • Example of a mismatch:
    • GTF has: gene_id "ENSG00000160072"
    • But the Fasta has: >ENSG00000160072.11
    • And what the tool is looking for is >ENSG00000160072
  • Another example of a mismatch:
    • Chromosome identifiers: chr1 or Chr1 or 1 do not mean the same thing to a tool

In short, make sure all inputs are based on the same exact reference genome build. This includes any indexes native to the server that are involved (or assigned database metadata).
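
The identifier check can be roughed out in a few lines of Python. This is a sketch under assumptions, not how the tool itself matches identifiers: it takes the first whitespace-delimited token of each FASTA header as the sequence identifier and treats a trailing ".N" as a version suffix. All function names are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Collect the first whitespace-delimited token of each FASTA header."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def strip_version(identifier):
    """Drop a trailing .N version suffix (ENSG00000160072.11 -> ENSG00000160072)."""
    base, dot, version = identifier.rpartition(".")
    return base if dot and version.isdigit() else identifier


def classify_ids(expected_ids, fasta_lines):
    """Label each expected identifier as an exact match, a version mismatch, or missing."""
    present = fasta_ids(fasta_lines)
    unversioned = {strip_version(i) for i in present}
    report = {}
    for identifier in expected_ids:
        if identifier in present:
            report[identifier] = "exact match"
        elif strip_version(identifier) in unversioned:
            report[identifier] = "version mismatch"
        else:
            report[identifier] = "missing"
    return report


# The mismatch from the example above: the GTF names the gene without a
# version, while the FASTA header carries one.
fasta = [">ENSG00000160072.11\n", "ACGT\n"]
print(classify_ids(["ENSG00000160072"], fasta))

# Chromosome names are compared literally, so chr1 and 1 do not match.
print(classify_ids(["chr1"], [">1\n", "ACGT\n"]))
```

Anything reported as a version mismatch or missing points to inputs that need to be brought onto the same identifiers before rerunning the tool.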

Some FAQs contain datatype format/content help, and the whole FAQ page can be searched by keyword in your browser.
https://training.galaxyproject.org/training-material/faqs/galaxy/

Please give these a review, then adjust your data.