I need a tool that can change transcript IDs to Ensembl or Entrez IDs. I need it for goseq analysis; I think the problem in that tool comes from the IDs.
thanks
Hi @amir
This tool will convert IDs. You may need to make a "convert" file yourself from other public resources.
- Replace column by values which are defined in a convert file (Galaxy Version 0.2)
Thanks!
@jennaj Hi
If I use this tool, are you sure the counts don't change? You're saying I need to build the ID list from another ID list, so can this tool identify which is which in my ID list?
And BTW… I need this for goseq analysis. Why does goseq not recognize the transcript IDs? What's wrong with that tool? I tried many times but it never worked.
That Replace tool will transform IDs if you need to. It accepts any two-column tabular mapping file. The ID it changes has to be in the first column of your input dataset. There are other ways to do that, of course: many tools will match up data based on a common value; that one just happens to do the replace all in one step. Values in other fields are not modified.
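To make the replace-by-mapping idea concrete, here is a minimal Python sketch of the same kind of operation. It is not the Galaxy tool's code. The sample IDs and values are made up, and keeping unmapped IDs unchanged is an assumption of this sketch, not a statement about how the tool behaves.

```python
import csv
import io


def replace_first_column(rows, mapping):
    """Swap the first field of each row using mapping; other fields are untouched."""
    out = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        # Assumption for this sketch: IDs missing from the mapping are kept as-is.
        new_id = mapping.get(row[0], row[0])
        out.append([new_id] + row[1:])
    return out


# Hypothetical two-column mapping (old ID <tab> new ID) and a small input dataset.
mapping_text = "idA\tgene1\nidB\tgene2\n"
data_text = "idA\t10\t200\nidB\t0\t350\nidC\t7\t90\n"

mapping = dict(csv.reader(io.StringIO(mapping_text), delimiter="\t"))
rows = list(csv.reader(io.StringIO(data_text), delimiter="\t"))
converted = replace_first_column(rows, mapping)

for row in converted:
    print("\t".join(row))
```

Only the first column changes; the count and length fields pass through exactly as they came in.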
The goseq tool does not look up transcriptIDs, but geneIDs. Maybe that is where you are confused? A transcript-to-gene mapping input is required.
Other troubleshooting help that has resolved usage issues in the past for goseq:
- There should be no versions on transcript or gene names. If present, remove those (the "." dot and any number after).
- Sorry, wrong advice in the earlier version of this item, which said: "A correctly formatted transcript-to-gene reference is needed for the goseq tool. The IDs from all inputs need to match. This input can be a gtf dataset (with no header lines) or a two column tabular dataset. A gff3 annotation will not work." This particular tool does not need a mapping file of this type; all inputs are already summarized by geneID (by the tool Featurecounts or however you decide to do the counting).
- Then confirm that the gene identifier type is a match for the chosen "Select Gene ID format". You could just do a web search with your ID to find out what type it is.
- Only four genome builds are supported as built-in indexes. Or you can supply one yourself.
- The format for all the inputs is described in the help section with examples. If any are malformed, the tool will fail.
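For the first item in the list above (removing version suffixes), a minimal Python sketch of the idea follows. The ".1" version shown is a hypothetical example, and the pattern assumes a version is always a dot followed by digits at the very end of the ID.

```python
import re

# Assumption: a version suffix is a dot plus one or more digits at the end of the ID.
VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(identifier):
    """Return the identifier without a trailing '.<number>' version, if one is present."""
    return VERSION_SUFFIX.sub("", identifier)


# The ".1" suffix here is made up for illustration.
for raw in ["ENSG00000004866.1", "ENSG00000011454"]:
    print(raw, "->", strip_version(raw))
```

IDs without a version pass through unchanged, so it is safe to run this over a whole column.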
General FAQ for DE tools: Extended Help for Differential Expression Analysis Tools. It summarizes general best practices (the formatting rules apply beyond DE analysis), links to the other FAQs, and covers some of the topics we discussed when you had trouble with other tools. It may help now.
If you are still stuck after this, are you really working at usegalaxy.org? You could send in a bug report if so. Include a link to this Galaxy help post in the comments and send a reply here, so I know when to look for it. I might recognize what is going wrong. Make sure the error from goseq and all input datasets are left undeleted.
I see several posts from you about this and some are labeled as usegalaxy.eu and some are for usegalaxy.org. Including this one, which does contain a history link: Problem in Goseq tool. Is that still current?
Ok, I see the problems.
- The IDs are transcripts, not genes. Only geneIDs work with this tool. A search for one of those IDs points to here at the Ensembl web site.
- The IDs have a "." plus version. Those need to be removed once you count up by gene, if still present.
- One of your inputs has a header line. That should be removed, again, once you have counts by gene, not transcript.
- The IDs are definitely Ensembl. And are human (GRCh38).
- The "Gene Categories" input was selected from the history, and is a fasta dataset, not a tabular "gene-to-go" mapping file. For GRCh38 you don't need to supply your own mapping: the genome is supported. Instead set the form to use a built-in index and choose "Human (hg38)".
The proper inputs/formats are all listed out on the tool form with examples.
I did these:
1. Excluded the "." from my transcript IDs (A)
2. Annotated my transcript IDs to get gene symbols with "annotatemyID" (B)
3. Joined the two datasets (A and B results) with "Paste"
4. Deleted the other column with "Cut"
5. Put the results into goseq.
But I still get this error (look at the screenshot).
What's the problem?
Your original counts were "by transcript" and you swapped those out for gene symbols. Multiple transcripts can be associated with any single gene.
Duplicated gene names won't work with goseq. This tool expects the input counts to be "by gene". You probably need to back up and redo the counting steps by gene, not transcript. Just adding up the counts for each transcript belonging to the same gene will create scientific content problems (overcounting).
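A small Python sketch of why relabelling transcript rows with gene names produces duplicates. All names and counts here are made up for illustration. The sketch only detects the duplicated names; it deliberately does not sum them, for the overcounting reason given above.

```python
from collections import Counter

# Hypothetical transcript-to-gene pairs: two transcripts belong to the same gene.
transcript_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
transcript_counts = [("tx1", 12), ("tx2", 30), ("tx3", 5)]

# Swapping the transcript label for the gene label keeps one row per transcript.
relabelled = [(transcript_to_gene[tx], n) for tx, n in transcript_counts]

# Any gene with more than one transcript now appears more than once.
occurrences = Counter(gene for gene, _ in relabelled)
duplicated = sorted(gene for gene, k in occurrences.items() if k > 1)

print(relabelled)
print("duplicated gene names:", duplicated)
```

A non-empty list of duplicated names is the signal that the table is still "by transcript" underneath, whatever the labels say.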
So how exactly do I redo the counting steps so that the ID is the gene instead of the transcript?
The counts are "by gene" when using Featurecounts with the default settings. And since your data is human, you could use the indexed hg38 genome and built-in annotation.
"Transcriptomics" tutorials here go over the details, but I think you have seen those? Maybe review again and compare to your steps?
I did it with the default settings in featureCounts. The gene IDs look like this:
Geneid | Length |
---|---|
100287102 | 1652 |
653635 | 1769 |
102466751 | 68 |
100302278 | 138 |
645520 | 1130 |
79501 | 918 |
729737 | 5474 |
102725121 | 1173 |
It's not what I've been thinking.
Hi @jennaj
I used the default mode as you said, and my gene IDs look like this:
Geneid | Length |
---|---|
100287102 | 1652 |
653635 | 1769 |
102466751 | 68 |
100302278 | 138 |
645520 | 1130 |
79501 | 918 |
729737 | 5474 |
102725121 | 1173 |
Are you sure this is working? Goseq needs Ensembl IDs and those are not Ensembl IDs.
Right, these are Entrez IDs (expected).
Map these to Ensembl with the tool:
annotateMyIDs annotate a generic set of identifiers (Galaxy Version 3.7.0+galaxy1)
After, rearrange the columns so that the Ensembl name replaces the Entrez name, and the file is back in the original format, with no extra columns of data.
There are two different but similar "cut" tools. One will rearrange and eliminate columns of data (what you need); the other only eliminates columns (it preserves the original order of columns, which is not what you need). The tool version may change over time, but you can tell the difference by the name, or by reviewing the tool forms. The second below clearly states that it does not rearrange columns, with a pointer to the first tool.
- Cut columns from a table (Galaxy Version 1.0.2) << Use this one
- Cut columns from a table (cut) (Galaxy Version 1.1.0)
Important: Swap the Entrez gene identifiers to be Ensembl gene identifiers for all of your inputs to Goseq. All inputs must match up.
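As a rough Python sketch of the swap described above: look up each Entrez ID, put the Ensembl ID in its place, and keep the original column layout. The Entrez IDs and lengths come from the table earlier in this thread; the Ensembl values are placeholders, not real mappings, and dropping unmapped rows is an assumption of this sketch.

```python
def swap_ids(table, entrez_to_ensembl):
    """Return (converted_rows, unmapped_ids); converted rows keep the original layout."""
    converted, unmapped = [], []
    for entrez_id, *rest in table:
        ensembl_id = entrez_to_ensembl.get(entrez_id)
        if ensembl_id is None:
            # Assumption: rows without an Ensembl match are set aside, not guessed.
            unmapped.append(entrez_id)
            continue
        converted.append([ensembl_id, *rest])
    return converted, unmapped


# Entrez IDs and lengths as posted above (header line already removed).
lengths = [["100287102", "1652"], ["653635", "1769"], ["102466751", "68"]]

# Placeholder Ensembl IDs: hypothetical values standing in for an annotation result.
entrez_to_ensembl = {
    "100287102": "ENSG_PLACEHOLDER_1",
    "653635": "ENSG_PLACEHOLDER_2",
}

converted, unmapped = swap_ids(lengths, entrez_to_ensembl)
print(converted)
print("no Ensembl match:", unmapped)
```

The same lookup would have to be applied to every goseq input so that the IDs agree across all of them.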
OK, it works. But about that Cut (table) tool you mentioned: what should I put in the field box to get the Ensembl IDs and the counts back in the table, in the original format? If I want to put the counts beside the Ensembl IDs, shouldn't I be using a tool like Join two Datasets?
@jennaj Hello
I'm still waiting for your response. I know it's close to Christmas and you are very busy.
P.S. Wish you the best in 2020.
Yes, you might need to join datasets based on common keys, cut/rearrange columns, plus add back in prior headers or create new ones.
I haven't been giving the full list of text manipulation tools, with exact steps, since you have been using Galaxy for a while now. Instead, I've given more of an overview, pointing out the tool(s) that solve a particular problem you might not know about (example: ID conversions) or that are more likely to be tricky to use (example: how/why to pick the version of the cut tools that will do what you need).
Use your best judgment for what intermediate tools are needed for your data manipulations. There are usually many ways to do any particular data manipulation to meet the end goal of proper format/content for inputs.
Example: use Text Manipulation and related tools (that do one thing each) in combination and/or use tools that allow for custom programming, if you know how to use them or are willing to learn (sed, awk, replace). Almost all of these data manipulation tools are based on basic command-line functions, or mimic them, on purpose, to give as much flexibility as possible. The tool name in Galaxy is often the same as the name of the command-line utility. Nearly anything that can be done on the command line can be done in Galaxy.
One way isn't "better" than others as long as you end up with the correct final result. Then put those tools/steps into mini-workflows for re-use. Workflows have an option to display directly in your tool history so they can be used like a "custom tool" (you can hide intermediate steps/datasets, remove them when not needed, or rename/label the final output to be informative). Common manipulation steps, grouped for a specific goal and put into a workflow, can be used directly like any other tool, and/or added as a sub-workflow in a larger workflow. I and many others do both… it depends on the overall manipulation and how often I think I'll want to do it again.
The primary advantages of using Galaxy are that all your work is recorded for reproducibility, it can be put into workflows, everything is easily shared with others (in context), and all usage is GUI based. When using a public server, the compute resources on that server are used, with no need to provide them from your own system and no need to install/configure tools at the technical level. Even super-users who know how to do analysis on the command line choose to use Galaxy for these reasons. Keeping track of analysis pipelines (what was done, exactly) is a big headache when working on the command line. Command-line work is also challenging to share with others in a clear, reproducible way (for collaboration, publication, or however you decide to share/publish). In short, Galaxy makes it easy to track/share 1) input/result data (histories) and 2) methods/tools/parameters (workflows), for your own personal use or otherwise.
There's also a one-step tool that can do this one task pretty well:
https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/annotatemyids/annotatemyids/3.7.0+galaxy1
or https://usegalaxy.org/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/annotatemyids/annotatemyids/3.7.0+galaxy1
depending on which server you're using.
Right, that is what he is using after Featurecounts. But he needs to reformat the results so they can be used as inputs for goseq (counts and length data). He wants the data to use Ensembl gene IDs instead of Entrez.
I don't think annotatemyIds will reformat/create outputs directly for goseq, or am I missing something? Totally possible.
Goseq works with a list of differentially expressed genes, so that wouldn't be right. There's a full tutorial here that uses goseq: https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html
Agree.
Featurecounts is one step (counts). The DE steps still need to be done for the true/false expression determination.
I think @amir is already following that tutorial's workflow, and just had some problems with getting the inputs right (matching annotation + ref genome). He was counting by transcript originally and had other input issues (fasta was input instead of tabular length). And he wanted to use Ensembl IDs. But I'll let him follow up with his actual goals; I may have misunderstood.
So I did these steps:
1: Got the annotation with annotateMyIDs. By the way, I ticked the box for removing duplicates, for my next step.
2: Joined both datasets (Change case + annotateMyIDs) with the Paste tool.
3: Cut the columns needed for goseq.
… Unfortunately:
Error in `.rowNamesDF<-`(x, value = value) :
duplicate 'row.names' are not allowed
Calls: row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Warning message:
non-unique values when setting 'row.names': ‘ENSG00000004866’, ‘ENSG00000011454’, ‘ENSG00000049449’, ‘ENSG00000076928’, ‘ENSG00000086205’, ‘ENSG00000104064’, ‘ENSG00000111850’, ‘ENSG00000112096’, ‘ENSG00000115221’, ‘ENSG00000124343’, ‘ENSG00000127603’, ‘ENSG00000129965’, ‘ENSG00000130035’, ‘ENSG00000137871’, ‘ENSG00000143226’, ‘ENSG00000156273’, ‘ENSG00000157326’, ‘ENSG00000159216’, ‘ENSG00000163444’, ‘ENSG00000163633’, ‘ENSG00000163945’, ‘ENSG00000169621’, ‘ENSG00000172366’, ‘ENSG00000178104’, ‘ENSG00000182230’, ‘ENSG00000182648’, ‘ENSG00000183292’, ‘ENSG00000187514’, ‘ENSG00000187951’, ‘ENSG00000188681’, ‘ENSG00000189064’, ‘ENSG00000204314’, ‘ENSG00000213077’, ‘ENSG00000215269’, ‘ENSG00000215483’, ‘ENSG00000223802’, ‘ENSG00000225830’, ‘ENSG00000230000’, ‘ENSG00000236362’, ‘ENSG00000239533’, ‘ENSG00000254911’, ‘ENSG000002 [… truncated]
This came up, and this time I really have no idea.