Converting transcript IDs to Ensembl or Entrez IDs?

I need a tool that can convert transcript IDs to Ensembl or Entrez IDs. I need it for goseq analysis; I think the problem with that tool comes from the IDs.
Thanks

Hi @amir

This tool will convert IDs. You may need to make a “convert” file yourself from other public resources.

  • Replace column by values which are defined in a convert file (Galaxy Version 0.2)

Thanks!

@jennaj Hi
If I use this tool, are you sure the counts won't change? You're saying I need to build the ID list from another ID list, so can this tool work out which ID is which in my list?
By the way, I need this for goseq analysis. Why doesn't goseq recognize the transcript IDs? What's wrong with that tool? I have tried many times, but it never worked.

That Replace tool will transform IDs if you need to. It accepts any two column tabular mapping file. The ID it changes has to be in the first column of your input dataset. There are other ways to do that of course – many tools will match up data based on a common value, that one just happens to do the replace all in one step. Values in other fields are not modified.
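
To make the "only the first column changes" point concrete, here is a minimal Python sketch of the same idea. It is not the Galaxy tool's code: the function names and the IDs (`TX_1`, `GENE_A`, and so on) are hypothetical, and keeping unmatched IDs unchanged is an assumption of this sketch, not a statement about how the Galaxy tool handles them.

```python
def parse_tabular(text):
    """Split tab-separated text into rows of fields, skipping blank lines."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def replace_first_column(rows, mapping):
    """Swap the first field of each row using mapping; other fields are untouched."""
    replaced = []
    for row in rows:
        # Assumption of this sketch: IDs missing from the mapping are kept unchanged.
        new_id = mapping.get(row[0], row[0])
        replaced.append([new_id] + row[1:])
    return replaced


# Hypothetical two-column mapping file (old ID, new ID) and input dataset.
mapping_text = "TX_1\tGENE_A\nTX_2\tGENE_B\n"
data_text = "TX_1\t10\nTX_2\t3\nTX_9\t7\n"

mapping = dict(parse_tabular(mapping_text))
converted = replace_first_column(parse_tabular(data_text), mapping)
for row in converted:
    print("\t".join(row))
```

The count values in the second column pass through as they were; only the identifier in the first column is looked up and swapped.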

The goseq tool does not look up transcriptIDs, but geneIDs. Maybe that is where you are confused? A transcript-to-gene mapping input is required.

Other troubleshooting help that has resolved usage issues in the past for goseq:

  • There should be no versions on transcript or gene names. If present, remove those (the . dot and any number after).
  • A correctly formatted transcript-to-gene reference is needed for the goseq tool. The IDs from all inputs need to match. This input can be a gtf dataset (with no header lines) or a two-column tabular dataset. A gff3 annotation will not work. Correction: that was wrong advice. This particular tool does not need a mapping file of this type; all inputs are already summarized by geneID (by the tool featureCounts or however you decide to do the counting).
  • Then confirm that the gene identifier type matches the chosen “Select Gene ID format”. You could just do a web search with your ID to find out what type it is.
  • Only four genome builds are supported as built-in indexes. Or you can supply one yourself.
  • The format for all the inputs is described in the help section with examples. If any are malformed, the tool will fail.
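
The version-removal step in the first bullet can be sketched in Python as follows. This is an illustration only: the function name and the example IDs are made up for the sketch, and the pattern assumes the version is always a dot followed by digits at the very end of the ID.

```python
import re

# A trailing dot followed by one or more digits, e.g. the ".12" in "ENSG00000000001.12".
VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(identifier):
    """Return the identifier without a trailing .<number> version, if one is present."""
    return VERSION_SUFFIX.sub("", identifier)


# Hypothetical IDs, used only to show the behavior.
for example in ["ENSG00000000001.12", "ENST00000000002.3", "ENSG00000000003"]:
    print(example, "->", strip_version(example))
```

Apply it only to the ID column. Any name that legitimately ends in a dot and digits would be trimmed as well, so check a few rows of the result before running goseq.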

General FAQ for DE tools: Extended Help for Differential Expression Analysis Tools. The formatting rules it summarizes are general best practices, not specific to DE analysis, and it links to the other FAQs. It also covers some of the topics we discussed when you had trouble with other tools, so it may help now.

If you are still stuck after this, are you really working at usegalaxy.org? You could send in a bug report if so. Include a link to this Galaxy help post in the comments and send a reply here, so I know when to look for it. I might recognize what is going wrong. Make sure the error from goseq and all input datasets are left undeleted.

I see several posts from you about this and some are labeled as usegalaxy.eu and some are for usegalaxy.org. Including this one, which does contain a history link: Problem in Goseq tool. Is that still current?

Ok, I see the problems.

  1. The IDs are transcripts, not genes. Only geneIDs work with this tool. A search for one of those IDs points to the Ensembl web site.
  2. The IDs have a . plus version. Those need to be removed once you count up by gene, if still present.
  3. One of your inputs has a header line. That should be removed, again, once you have counts by gene, not transcript.
  4. The IDs are definitely Ensembl. And are human (GRCh38).
  5. The “Gene Categories” input was selected from the history, and it was a fasta dataset, not a tabular “gene-to-go” mapping file. For GRCh38 you don’t need to supply your own mapping, because the genome is supported. Instead, set the form to use a built-in index and choose “Human (hg38)”.

The proper inputs/formats are all listed on the tool form with examples.

I did these steps:

  1. Removed the “.” version from my transcript IDs (A).
  2. Annotated my transcript IDs to get gene symbols with “annotateMyIDs” (B).
  3. Joined the two datasets (A and B results) with “Paste”.
  4. Deleted the other column with “Cut”.
  5. Put the results into goseq.

But I still get this error (see screenshot).

What's the problem?

@amir

Your original counts were “by transcript” and you swapped those out for gene symbols. Multiple transcripts can be associated with any single gene.

Duplicated gene names won’t work with goseq. This tool expects the input counts to be “by gene”. You probably need to back up and redo the counting steps by gene, not transcript. Just adding up the counts for each transcript belonging to the same gene will create scientific content problems (overcounting).
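
A quick way to see whether a dataset has this problem is to count how often each ID appears in the first column. The Python sketch below assumes rows already split into fields; the function name and gene names are hypothetical. It only detects duplicates. It does not merge or sum them, for the overcounting reason given above.

```python
from collections import Counter


def duplicated_ids(rows):
    """Return the sorted first-column IDs that occur on more than one row."""
    counts = Counter(row[0] for row in rows if row)
    return sorted(gene for gene, n in counts.items() if n > 1)


# Hypothetical rows: two transcripts of the same gene were both renamed to GENE_A.
rows = [
    ["GENE_A", "10"],
    ["GENE_B", "3"],
    ["GENE_A", "7"],
]
print(duplicated_ids(rows))
```

An empty list means every ID is unique. Anything else points back to counts that are still per transcript.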

So how exactly do I redo the counting steps so that the IDs are genes instead of transcripts?

The counts are “by gene” when using featureCounts with the default settings. And since your data is human, you could use the indexed hg38 genome and built-in annotation.

“Transcriptomics” tutorials here go over the details, but I think you have seen those? Maybe review again and compare to your steps?