Converting transcript IDs to Ensembl or Entrez IDs?

I need a tool that can convert transcript IDs to Ensembl or Entrez IDs. I need it for goseq analysis; I think the problem with that tool comes from the IDs.
Thanks


Hi @amir

This tool will convert IDs. You may need to make a “convert” file yourself from other public resources.

  • Replace column by values which are defined in a convert file (Galaxy Version 0.2)

Thanks!

@jennaj Hi
If I use this tool, are you sure the counts don't change? You're saying I need to build the ID list from another ID list, so can this tool identify which ID is which in my list?
And by the way, I need this for goseq analysis. Why doesn't goseq recognize the transcript IDs? What's wrong with that tool? I tried many times but it never worked.


That Replace tool will transform IDs if you need to. It accepts any two column tabular mapping file. The ID it changes has to be in the first column of your input dataset. There are other ways to do that of course – many tools will match up data based on a common value, that one just happens to do the replace all in one step. Values in other fields are not modified.
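
As a rough illustration of what that kind of replacement does, here is a minimal Python sketch. The file contents and the `replace_first_column` helper are hypothetical, and passing unmapped IDs through unchanged is an assumption made for this example (the Galaxy tool's handling of unmatched values depends on its own settings).

```python
import csv
import io

def replace_first_column(rows, mapping):
    """Swap the ID in column 1 when the mapping has an entry; other columns are untouched."""
    out = []
    for row in rows:
        # Assumption for this sketch: IDs missing from the mapping pass through unchanged.
        new_id = mapping.get(row[0], row[0])
        out.append([new_id] + row[1:])
    return out

# Hypothetical two-column mapping file (old ID <tab> new ID) and a small dataset.
mapping_text = "tx1\tgeneA\ntx2\tgeneB\n"
data_text = "tx1\t10\ntx2\t5\ntx3\t7\n"

mapping = dict(csv.reader(io.StringIO(mapping_text), delimiter="\t"))
rows = list(csv.reader(io.StringIO(data_text), delimiter="\t"))

converted = replace_first_column(rows, mapping)
for row in converted:
    print("\t".join(row))
```

Only the first column changes, which is why the counts in the other fields stay as they were.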

The goseq tool does not look up transcriptIDs, but geneIDs. Maybe that is where you are confused? A transcript-to-gene mapping input is required.

Other troubleshooting help that has resolved usage issues in the past for goseq:

  • There should be no versions on transcript or gene names. If present, remove those (the . dot and any number after).
  • Correction: I originally advised here that a correctly formatted transcript-to-gene reference is needed for the goseq tool (a gtf dataset with no header lines or a two-column tabular dataset, not a gff3 annotation). That was wrong advice. This particular tool does not need a mapping file of this type, since all inputs are already summarized by geneID (by the tool Featurecounts or however you decide to do the counting). The IDs from all inputs still need to match.
  • Then confirm that the gene identifier type matches the chosen “Select Gene ID format”. You could just do a web search with your ID to find out what type it is.
  • Only four genome builds are supported as built-in indexes. Or you can supply one yourself.
  • The format for all the inputs is described in the help section with examples. If any are malformed, the tool will fail.
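
For the first point in the list above (removing versions), a minimal sketch in Python could look like this. The example identifiers are illustrative strings only, and `strip_version` is a hypothetical helper name; inside Galaxy the same thing can be done with a text-manipulation tool.

```python
import re

# Matches a trailing ".<digits>" such as ".5" at the end of an identifier.
_VERSION_SUFFIX = re.compile(r"\.\d+$")

def strip_version(identifier):
    """Remove a trailing dot-plus-number version; IDs without one are returned unchanged."""
    return _VERSION_SUFFIX.sub("", identifier)

# Illustrative identifiers (the version numbers are made up for this example).
ids = ["ENSG00000004866.1", "ENSG00000011454.12", "ENSG00000049449"]
cleaned = [strip_version(i) for i in ids]
print(cleaned)
```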

General FAQ for DE tools: Extended Help for Differential Expression Analysis Tools. The formatting rules there are general best practices, not specific to DE analysis. They are summarized with links to the other FAQs and cover some of the topics we discussed when you had trouble with other tools. It may help now.

If you are still stuck after this, are you really working at usegalaxy.org? You could send in a bug report if so. Include a link to this Galaxy help post in the comments and send a reply here, so I know when to look for it. I might recognize what is going wrong. Make sure the error from goseq and all input datasets are left undeleted.

I see several posts from you about this and some are labeled as usegalaxy.eu and some are for usegalaxy.org. Including this one, which does contain a history link: Problem in Goseq tool. Is that still current?

Ok, I see the problems.

  1. The IDs are transcripts, not genes. Only geneIDs work with this tool. A search for one of those IDs points to here at the Ensembl web site.
  2. The IDs have a . plus version. Those need to be removed once you count up by gene, if still present.
  3. One of your inputs has a header line. That should be removed, again, once you have counts by gene, not transcript.
  4. The IDs are definitely Ensembl. And are human (GRCh38).
  5. The “Gene Categories” input was selected from the history, and was a fasta dataset, not a tabular “gene-to-go” mapping file. For GRCh38 you don’t need to supply your own mapping, since the genome is supported. Instead set the form to use a built-in index and choose “Human (hg38)”.

The proper inputs/formats are all listed on the tool form with examples.

I did these:

  1. Removed the “.” and version from my transcript IDs (A).
  2. Annotated my transcript IDs to get gene symbols with “annotatemyID” (B).
  3. Joined the two datasets (A and B results) with “Paste”.
  4. Deleted the other columns with “Cut”.
  5. Put the results into goseq.

But I still get this error (see the screenshot).

What's the problem?

@amir

Your original counts were “by transcript” and you swapped those out for gene symbols. Multiple transcripts can be associated with any single gene.

Duplicated gene names won’t work with goseq. This tool expects the input counts to be “by gene”. You probably need to back up and redo the counting steps by gene, not transcript. Just adding up the counts for each transcript belonging to the same gene will create scientific content problems (overcounting).
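
To see why relabelling transcripts as genes produces duplicates, here is a small Python sketch. All IDs, counts, and the lookup table are hypothetical; the point is only that two transcripts of one gene end up as two rows with the same name.

```python
from collections import Counter

# Hypothetical transcript-level counts and a transcript-to-gene lookup.
transcript_counts = [("tx1", 12), ("tx2", 30), ("tx3", 4)]
transcript_to_gene = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}

# Swapping the label does not merge rows: tx1 and tx2 both become GENE_A.
relabelled = [(transcript_to_gene[tx], n) for tx, n in transcript_counts]

occurrences = Counter(gene for gene, _ in relabelled)
duplicated = sorted(gene for gene, k in occurrences.items() if k > 1)

print(relabelled)
print(duplicated)
```

The sketch only detects the duplicates. It deliberately does not sum them, for the overcounting reason given above.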

So how exactly do I redo the counting steps so that the IDs are genes instead of transcripts?

The counts are “by gene” when using Featurecounts with the default settings. And since your data is human, you could use the indexed hg38 genome and built-in annotation.

“Transcriptomics” tutorials here go over the details, but I think you have seen those? Maybe review again and compare to your steps?

I did it with the default settings in Featurecounts. The gene IDs look like this:

Geneid Length
100287102 1652
653635 1769
102466751 68
100302278 138
645520 1130
79501 918
729737 5474
102725121 1173

It's not what I was expecting.

Hi @jennaj

I used the default mode as you said and my gene IDs look like this:

| Geneid | Length |
| --- | --- |
| 100287102 | 1652 |
| 653635 | 1769 |
| 102466751 | 68 |
| 100302278 | 138 |
| 645520 | 1130 |
| 79501 | 918 |
| 729737 | 5474 |
| 102725121 | 1173 |

Are you sure this is working? Goseq needs Ensembl IDs, and those are not Ensembl IDs.


Right, these are Entrez IDs (expected).

Map these to Ensembl with the tool:
annotateMyIDs annotate a generic set of identifiers (Galaxy Version 3.7.0+galaxy1)

After, rearrange the columns so that the Ensembl name replaces the Entrez name, and the file is back in the original format, with no extra columns of data.

There are two different but similar “cut” tools. One will rearrange and eliminate columns of data (what you need) – the other only eliminates columns (preserves the original order of columns, not what you need). The tool version may change over time, but you can tell the difference by the name, or by reviewing the tool forms. The second below clearly states that it does not rearrange columns with a pointer to the first tool.

  • Cut columns from a table (Galaxy Version 1.0.2) << Use this one
  • Cut columns from a table (cut) (Galaxy Version 1.1.0)

Important: Swap the Entrez gene identifiers to be Ensembl gene identifiers for all of your inputs to Goseq. All inputs must match up.
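
Outside Galaxy, the same idea (swap the ID, keep the original column layout, and check that all inputs agree) can be sketched in Python as follows. The two Entrez IDs and lengths come from the Featurecounts output posted earlier in this thread; the Ensembl-style names, the true/false flags, and the `swap_ids` helper are hypothetical placeholders. Dropping rows that have no mapping is an assumption of this sketch, not a rule of any tool.

```python
def swap_ids(rows, entrez_to_ensembl):
    """Return (ID, value) rows with the Ensembl ID in place of the Entrez ID.

    Assumption: rows whose Entrez ID has no mapping are dropped.
    """
    return [(entrez_to_ensembl[gid], value)
            for gid, value in rows
            if gid in entrez_to_ensembl]

# Hypothetical mapping; real Ensembl IDs would come from annotateMyIDs.
entrez_to_ensembl = {"100287102": "ENSG_X1", "653635": "ENSG_X2"}

lengths = [("100287102", 1652), ("653635", 1769)]     # gene length input
de_flags = [("100287102", True), ("653635", False)]   # hypothetical DE input

lengths_ens = swap_ids(lengths, entrez_to_ensembl)
flags_ens = swap_ids(de_flags, entrez_to_ensembl)

# All inputs must carry exactly the same set of gene IDs.
same_ids = {g for g, _ in lengths_ens} == {g for g, _ in flags_ens}
print(lengths_ens, flags_ens, same_ids)
```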

OK, it works. But about the Cut (table) tool you mentioned: what should I put in the field box to get the Ensembl IDs and the counts back in the table, in the original format? If I want to put the counts beside the Ensembl IDs, shouldn't I use a tool like Join two Datasets?

@jennaj Hello

I'm still waiting for your response. I know it's close to Christmas and you are very busy.

P.S. Wish you the best in 2020.

Yes, you might need to join datasets based on common keys, cut/rearrange columns, plus add back in prior headers or create new ones.
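
As a sketch of what "join on a common key, then cut/rearrange, then add a header" means, here is a small Python example. The data, the header names, and the `join_on_key` helper are hypothetical. It assumes each key appears once in the annotation table. A key-based join differs from pasting files side by side: it does not depend on both files having the same rows in the same order.

```python
def join_on_key(left, right):
    """Inner join of two (key, value) tables on the key, keeping the row order of `left`.

    Assumption: keys in `right` are unique (later duplicates would overwrite earlier ones).
    """
    lookup = dict(right)
    return [(key, value, lookup[key]) for key, value in left if key in lookup]

# Hypothetical inputs: the annotation is in a different order and lacks geneB.
counts = [("geneA", 10), ("geneB", 0), ("geneC", 7)]
annotation = [("geneC", "ENSG_C"), ("geneA", "ENSG_A")]

joined = join_on_key(counts, annotation)

# Rearrange: new ID first, then the count; drop the old ID; add a header back.
header = ("GeneID", "Count")
table = [header] + [(new_id, count) for _, count, new_id in joined]

for row in table:
    print("\t".join(str(field) for field in row))
```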

I haven’t been giving the full list of text manipulation tools, with exact steps, since you have been using Galaxy for a while now. Instead, more of an overview and pointing out the tool(s) that solve a particular problem you might not know about (ex: ID conversions) or that are more likely to be tricky to use (example: how/why to pick the version of the cut tools that will do what you need).

Use your best judgment for which intermediate tools are needed for your data manipulations. There are usually many ways to do any particular data manipulation to meet the end goal of proper format/content for inputs.

Example: use Text Manipulation and related tools (that do one thing each) in combination and/or use tools that allow for custom programming, if you know how to use them or are willing to learn (sed, awk, replace). Almost all of these data manipulation tools are based on basic line-command functions, or mimic them, on purpose, to give as much flexibility as possible. The tool name in Galaxy is often the same name as the line-command utility. Nearly anything that can be done line-command can be done in Galaxy.

One way isn’t “better” than others as long as you end up with the final correct result at the end. Then put those tools/steps into mini-workflows for re-use. Workflows have an option to display directly in your tool history so they can be used like a “custom tool” (can hide intermediate steps/datasets, or remove them when not needed, or rename/label the final output to be informative). Common manipulation steps, grouped for a specific goal and put into a workflow, can be used directly like any other tool, and/or added as a sub-workflow in a larger workflow – I and many others do both… it depends on the overall manipulation and how often I think I’ll want to do it again.

The primary advantages of using Galaxy are that all your work is recorded for reproducibility, can be put into workflows, and is easily shared with others (in context), plus all usage is GUI based. When using a public server, the compute resources on that server are used, with no need to provide them from your own system and no need to install/configure tools at the technical level. Even super-users who know how to do analysis line-command choose to use Galaxy for these reasons. Keeping track of analysis pipelines (what was done, exactly) is a big headache when working line-command. Line-command work is also challenging to share with others in a clear, reproducible way (for collaboration, publication, or however you decide to share/publish). In short, Galaxy makes it easy to track/share 1) input/result data (histories) and 2) methods/tools/parameters (workflows), for your own personal use or otherwise.


There’s also a one-step tool that can do this one task pretty well:
https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/annotatemyids/annotatemyids/3.7.0+galaxy1
or https://usegalaxy.org/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/annotatemyids/annotatemyids/3.7.0+galaxy1
depending on which server you’re using.


Right, that is what he is using after Featurecounts. But he needs to reformat the results so they can be used as inputs for goseq (counts and length data). He wants the data to use Ensembl gene IDs instead of Entrez.

I don’t think annotatemyIds will reformat/create outputs directly for goseq – or am I missing something? Totally possible :slight_smile:

Goseq works with a list of differentially expressed genes, so that wouldn’t be right. There’s a full tutorial here that uses goseq: https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html


Agree :slight_smile: Featurecounts is one step (counts) – the DE steps still need to be done for the true/false expression determination.

I think @amir is already following that tutorial’s workflow, just had some problems with getting the inputs right – matching annotation + ref genome. Was counting by transcript originally and had other input issues (fasta was input instead of tabular length). And wanted to use Ensembl IDs. But I’ll let him follow up with actual goals, may have misunderstood.
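
For the true/false determination mentioned above, a minimal Python sketch might look like this. The gene names and adjusted p-values are hypothetical, `to_de_flags` is a made-up helper, and the 0.05 cutoff is an assumption for illustration; the right threshold (and whether to also filter on fold change) is an analysis decision.

```python
def to_de_flags(results, alpha=0.05):
    """Turn (gene_id, adjusted_p_value) rows into (gene_id, True/False) rows.

    Assumption: a missing adjusted p-value (None) is treated as not differentially expressed.
    """
    flags = []
    for gene_id, padj in results:
        is_de = padj is not None and padj < alpha
        flags.append((gene_id, is_de))
    return flags

# Hypothetical DE results: gene ID and adjusted p-value.
results = [("geneA", 0.001), ("geneB", 0.40), ("geneC", None)]
flags = to_de_flags(results)

for gene_id, is_de in flags:
    print(f"{gene_id}\t{is_de}")
```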


So I did these steps :slightly_smiling_face:

  1. Got the annotation with annotateMyIDs. By the way, I ticked the box for removing duplicates, for my next step.
  2. Joined both datasets (Change Case + annotateMyIDs) with the Paste tool.
  3. Cut the columns needed for goseq.

… Unfortunately:

Error in .rowNamesDF<-(x, value = value) :
duplicate ‘row.names’ are not allowed
Calls: row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Warning message:
non-unique values when setting ‘row.names’: ‘ENSG00000004866’, ‘ENSG00000011454’, ‘ENSG00000049449’, ‘ENSG00000076928’, ‘ENSG00000086205’, ‘ENSG00000104064’, ‘ENSG00000111850’, ‘ENSG00000112096’, ‘ENSG00000115221’, ‘ENSG00000124343’, ‘ENSG00000127603’, ‘ENSG00000129965’, ‘ENSG00000130035’, ‘ENSG00000137871’, ‘ENSG00000143226’, ‘ENSG00000156273’, ‘ENSG00000157326’, ‘ENSG00000159216’, ‘ENSG00000163444’, ‘ENSG00000163633’, ‘ENSG00000163945’, ‘ENSG00000169621’, ‘ENSG00000172366’, ‘ENSG00000178104’, ‘ENSG00000182230’, ‘ENSG00000182648’, ‘ENSG00000183292’, ‘ENSG00000187514’, ‘ENSG00000187951’, ‘ENSG00000188681’, ‘ENSG00000189064’, ‘ENSG00000204314’, ‘ENSG00000213077’, ‘ENSG00000215269’, ‘ENSG00000215483’, ‘ENSG00000223802’, ‘ENSG00000225830’, ‘ENSG00000230000’, ‘ENSG00000236362’, ‘ENSG00000239533’, ‘ENSG00000254911’, ‘ENSG000002 [… truncated]

This came up, and this time I really have no idea.
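
The error says the gene ID column contains the same Ensembl ID more than once. A minimal Python sketch for locating such rows before running goseq is below. The first two IDs are taken from the warning message; the counts and the `find_duplicate_ids` helper are hypothetical. What to do with the duplicated rows is a separate decision that this sketch does not make.

```python
from collections import defaultdict

def find_duplicate_ids(rows):
    """Map each repeated gene ID to the 1-based line numbers where it occurs."""
    positions = defaultdict(list)
    for line_no, (gene_id, _value) in enumerate(rows, start=1):
        positions[gene_id].append(line_no)
    return {gene_id: lines for gene_id, lines in positions.items() if len(lines) > 1}

# IDs from the warning message above; the counts are made up for this example.
rows = [
    ("ENSG00000004866", 3),
    ("ENSG00000011454", 8),
    ("ENSG00000004866", 5),
]

dups = find_duplicate_ids(rows)
print(dups)
```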