Converting transcript IDs to Ensembl or Entrez IDs?

So how exactly do I redo the counting step so that the IDs are gene IDs instead of transcript IDs?

The counts are “by gene” when using featureCounts with the default settings. And since your data is human, you could use the indexed hg38 genome and built-in annotation.

“Transcriptomics” tutorials here go over the details, but I think you have seen those? Maybe review again and compare to your steps?

I did it with the default settings in featureCounts. The gene IDs look like this:

Geneid Length
100287102 1652
653635 1769
102466751 68
100302278 138
645520 1130
79501 918
729737 5474
102725121 1173

It's not what I had in mind.

Hi @jennaj

I used the default mode as you said, and my gene IDs look like this:

| Geneid | Length |
| --- | --- |
|100287102|1652|
|653635|1769|
|102466751|68|
|100302278|138|
|645520|1130|
|79501|918|
|729737|5474|
|102725121|1173|

Are you sure this is working? Goseq needs Ensembl IDs, and these are not Ensembl IDs.

Right, these are Entrez IDs (expected).

Map these to Ensembl with the tool:
annotateMyIDs annotate a generic set of identifiers (Galaxy Version 3.7.0+galaxy1)

After, rearrange the columns so that the Ensembl name replaces the Entrez name, and the file is back in the original format, with no extra columns of data.

There are two different but similar “cut” tools. One will rearrange and eliminate columns of data (what you need) – the other only eliminates columns (preserves the original order of columns, not what you need). The tool version may change over time, but you can tell the difference by the name, or by reviewing the tool forms. The second below clearly states that it does not rearrange columns with a pointer to the first tool.

  • Cut columns from a table (Galaxy Version 1.0.2) << Use this one
  • Cut columns from a table (cut) (Galaxy Version 1.1.0)
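For anyone unsure how the two tools differ, here is a rough Python sketch of the behavior described above. It is only an illustration, not the tools' actual code: the function names are made up, and the `ENSG_A`/`ENSG_B` values are placeholders rather than real Ensembl IDs.

```python
def cut_rearrange(rows, columns):
    """Return the requested 1-based columns in the order given (can reorder)."""
    return [[row[c - 1] for c in columns] for row in rows]


def cut_keep_order(rows, columns):
    """Return the requested 1-based columns, always in their original file order."""
    return [[row[c - 1] for c in sorted(set(columns))] for row in rows]


# Columns: Entrez ID, length, Ensembl ID (placeholder values, not real annotations)
rows = [
    ["100287102", "1652", "ENSG_A"],
    ["653635", "1769", "ENSG_B"],
]

# Asking for column 3 first, then column 2:
print(cut_rearrange(rows, [3, 2]))   # Ensembl ID comes first, then length
print(cut_keep_order(rows, [3, 2]))  # length still comes first, then Ensembl ID
```

Only the first behavior can move the Ensembl column into the first position, which is why that version of the tool is the one to pick here.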

Important: Swap the Entrez gene identifiers to be Ensembl gene identifiers for all of your inputs to Goseq. All inputs must match up.

OK, it works. But about the Cut (table) tool you mentioned: what should I put in the field box to put the Ensembl IDs and the counts back in the table, in the original format? If I want to put the counts beside the Ensembl IDs, shouldn't I be using a tool like Join two Datasets?

@jennaj Hello

I'm still waiting for your response. I know it's close to Christmas and you are very busy.

P.S. Wish you the best in 2020.

Yes, you might need to join datasets based on common keys, cut/rearrange columns, plus add back in prior headers or create new ones.
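As a rough illustration of that join / cut / re-header sequence, here is a minimal Python sketch. It is not what the Galaxy tools run internally. The Entrez IDs are taken from the table earlier in the thread, but the counts, the Ensembl placeholders, and the `swap_ids` helper are all made up for the example.

```python
# featureCounts-style rows: (Entrez ID, count). Counts are invented for illustration.
counts = [
    ("100287102", 12),
    ("653635", 40),
    ("79501", 0),
]

# Hypothetical Entrez -> Ensembl mapping (placeholders, not real annotations).
# "79501" is deliberately left out to show what happens to unmapped IDs.
mapping = {
    "100287102": "ENSG_PLACEHOLDER_1",
    "653635": "ENSG_PLACEHOLDER_2",
}


def swap_ids(counts, mapping, header=("Geneid", "Count")):
    """Join on the Entrez key, keep only (Ensembl ID, count), and add a header back."""
    out = [header]
    for entrez, count in counts:
        ensembl = mapping.get(entrez)
        if ensembl is None:
            continue  # drop rows with no Ensembl match instead of writing a blank ID
        out.append((ensembl, count))
    return out


for row in swap_ids(counts, mapping):
    print("\t".join(str(field) for field in row))
```

Whether unmapped rows should be dropped or kept under their original ID is a judgment call; the sketch drops them so that no empty ID ends up in the first column.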

I haven’t been giving the full list of text manipulation tools, with exact steps, since you have been using Galaxy for a while now. Instead, more of an overview and pointing out the tool(s) that solve a particular problem you might not know about (ex: ID conversions) or that are more likely to be tricky to use (example: how/why to pick the version of the cut tools that will do what you need).

Use your best judgment for which intermediate tools are needed for your data manipulations. There are usually many ways to do any particular data manipulation to meet the end goal of proper format/content for inputs.

Example: use Text Manipulation and related tools (that do one thing each) in combination and/or use tools that allow for custom programming, if you know how to use them or are willing to learn (sed, awk, replace). Almost all of these data manipulation tools are based on basic command-line functions, or mimic them, on purpose, to give as much flexibility as possible. The tool name in Galaxy is often the same as the name of the command-line utility. Nearly anything that can be done on the command line can be done in Galaxy.

One way isn’t “better” than others as long as you end up with the final correct result at the end. Then put those tools/steps into mini-workflows for re-use. Workflows have an option to display directly in your tool history so they can be used like a “custom tool” (can hide intermediate steps/datasets, or remove them when not needed, or rename/label the final output to be informative). Common manipulation steps, grouped for a specific goal and put into a workflow, can be used directly like any other tool, and/or added as a sub-workflow in a larger workflow – I and many others do both… it depends on the overall manipulation and how often I think I’ll want to do it again.

The primary advantages of using Galaxy are that all your work is recorded for reproducibility, it can be put into workflows, it is easily shared with others (in context), and all usage is GUI based. When using a public server, that server's compute resources are used, with no need to provide your own compute resources/system and no need to install/configure tools at the technical level. Even super-users who know how to do analysis on the command line choose to use Galaxy for these reasons. Keeping track of analysis pipelines (what was done, exactly) is a big headache when working on the command line. Command-line work is also challenging to share with others in a clear, reproducible way (for collaboration, publication, or however you decide to share/publish). In short, Galaxy makes it easy to track/share 1) input/result data (histories) and 2) methods/tools/parameters (workflows), for your own personal use or otherwise.

There’s also a one-step tool that can do this one task pretty well:
https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/annotatemyids/annotatemyids/3.7.0+galaxy1
or https://usegalaxy.org/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/annotatemyids/annotatemyids/3.7.0+galaxy1
depending on which server you’re using.

Right, that is what he is using after featureCounts. But he needs to reformat the results so they can be used as inputs for goseq (counts and length data). He wants the data to use Ensembl gene IDs instead of Entrez.

I don’t think annotateMyIDs will reformat/create outputs directly for goseq – or am I missing something? Totally possible :slight_smile:

Goseq works with a list of differentially expressed genes, so that wouldn’t be right. There’s a full tutorial here that uses goseq: https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html

Agree :slight_smile: featureCounts is one step (counts) – the DE steps still need to be done for the true/false expression determination.

I think @amir is already following that tutorial’s workflow, just had some problems with getting the inputs right – matching annotation + ref genome. Was counting by transcript originally and had other input issues (fasta was input instead of tabular length). And wanted to use Ensembl IDs. But I’ll let him follow up with actual goals, may have misunderstood.
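For readers following along, the true/false determination mentioned above could be sketched roughly like this in Python. This is an assumption-heavy illustration, not the tutorial's exact steps: the 0.05 cutoff, the `de_flags` helper, and the adjusted p-values are all made up, and only the Entrez IDs come from this thread.

```python
def de_flags(results, alpha=0.05):
    """Turn (gene ID, adjusted p-value) rows into (gene ID, True/False) rows.

    A gene is flagged True only when it has an adjusted p-value below alpha.
    Genes with no adjusted p-value (None) are flagged False.
    """
    flags = []
    for gene_id, padj in results:
        is_de = padj is not None and padj < alpha
        flags.append((gene_id, is_de))
    return flags


# Invented adjusted p-values for illustration only.
results = [
    ("100287102", 0.001),
    ("653635", 0.2),
    ("79501", None),
]

for gene_id, is_de in de_flags(results):
    print(f"{gene_id}\t{is_de}")
```

The gene IDs in this table need to be the same kind of ID as in the length table, which is the matching issue discussed earlier in the thread.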

So I did these steps :slightly_smiling_face:
1: Got the annotation with annotateMyIDs. By the way, I checked the box for removing duplicates, for my next step.
2: Joined both datasets (Change Case + annotateMyIDs) with the Paste tool.
3: Cut the columns needed for goseq.
… Unfortunately:

Error in .rowNamesDF<-(x, value = value) :
duplicate ‘row.names’ are not allowed
Calls: row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Warning message:
non-unique values when setting ‘row.names’: ‘ENSG00000004866’, ‘ENSG00000011454’, ‘ENSG00000049449’, ‘ENSG00000076928’, ‘ENSG00000086205’, ‘ENSG00000104064’, ‘ENSG00000111850’, ‘ENSG00000112096’, ‘ENSG00000115221’, ‘ENSG00000124343’, ‘ENSG00000127603’, ‘ENSG00000129965’, ‘ENSG00000130035’, ‘ENSG00000137871’, ‘ENSG00000143226’, ‘ENSG00000156273’, ‘ENSG00000157326’, ‘ENSG00000159216’, ‘ENSG00000163444’, ‘ENSG00000163633’, ‘ENSG00000163945’, ‘ENSG00000169621’, ‘ENSG00000172366’, ‘ENSG00000178104’, ‘ENSG00000182230’, ‘ENSG00000182648’, ‘ENSG00000183292’, ‘ENSG00000187514’, ‘ENSG00000187951’, ‘ENSG00000188681’, ‘ENSG00000189064’, ‘ENSG00000204314’, ‘ENSG00000213077’, ‘ENSG00000215269’, ‘ENSG00000215483’, ‘ENSG00000223802’, ‘ENSG00000225830’, ‘ENSG00000230000’, ‘ENSG00000236362’, ‘ENSG00000239533’, ‘ENSG00000254911’, ‘ENSG000002 [… truncated]

This came up, and this time I really have no idea.

@jennaj I solved the problem I mentioned in my last reply, but I still have this one :slightly_smiling_face: (shared in this link)

https://usegalaxy.eu/u/amir/h/memar

Error in .rowNamesDF<-(x, value = value) :
missing values in ‘row.names’ are not allowed
Calls: row.names<- -> row.names<-.data.frame -> .rowNamesDF<-

The mapping may not be 1-1.

I would suggest using the Entrez IDs directly from Featurecounts instead to ensure distinct geneIDs. Then, at the end, you can convert those IDs to Ensembl for other purposes you may have.
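To see how a mapping that is not 1-1 leads to both errors above (duplicate and missing row names), a quick check along these lines can help. It is a minimal Python sketch, not part of any Galaxy tool: the `check_ids` helper is hypothetical, the two Ensembl IDs are taken from the warning message earlier in the thread, and treating `NA` or an empty string as "missing" is an assumption about how unmapped IDs appear in your file.

```python
from collections import Counter

# Values assumed to mean "no Ensembl ID was found" for a row.
MISSING = ("", "NA", None)


def check_ids(ids):
    """Return (positions of missing IDs, sorted list of IDs that occur more than once)."""
    missing = [i for i, value in enumerate(ids) if value in MISSING]
    seen = Counter(value for value in ids if value not in MISSING)
    duplicated = sorted(value for value, n in seen.items() if n > 1)
    return missing, duplicated


# One ID repeated (two Entrez IDs mapping to the same Ensembl ID) and one unmapped row.
ids = ["ENSG00000004866", "ENSG00000011454", "NA", "ENSG00000004866"]

missing, duplicated = check_ids(ids)
print("rows with no Ensembl ID:", missing)
print("Ensembl IDs used more than once:", duplicated)
```

If either list is non-empty, the ID column cannot be used as unique row names as-is.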

Follow the tutorial, it will help to avoid problems like this.

Finally it's done, @jennaj. Thanks for your supportive and kind responses.

How do I remove the .plus version from gene ids

Missed the question, but for others reading, please see:

2 posts were split to a new topic: Using GOSEQ with a custom category input