I have successfully followed the tutorial "From peaks to genes" with my own ChIP-seq data.
However, I did not get what I was expecting at the end.
Let me explain:
After associating my ChIP-seq peaks with genes, I would ideally like to have a table that contains, for each peak, at least the following information in separate columns:
1. Chromosome on which the peak is located
2. Start of the peak
3. End of the peak
4. Position of the peak summit
5. The unique name of the peak (I used MACS for peak calling, and the interval file that MACS produces assigns each discovered peak a unique name in the 10th column, e.g. MACS2.1.0_in_Galaxy_peak_1)
6. ID/locus of the gene that is nearby (or associated with) the peak
7. Position of the peak relative to the TSS of the nearby gene. For example, is the peak 1.5 kb upstream of the TSS, or perhaps 3 kb downstream of it? In short, where exactly is the peak located relative to the nearby gene?
My question: is there any way to generate such a table in Galaxy?
Many thanks in advance,
Any help will be appreciated,
Greetings,
Manolis
P.S.: I do understand that if I apply the pipeline described in the tutorial "From peaks to genes", I end up with the overlap between peaks and their related genes, but nothing more. I cannot see how, or whether it is possible at all, to get a table with all the columns mentioned above as the final output.
Yes. There isn’t a single tool that will create this type of table, but a combination of tools could be used, and a custom workflow created once you work out the processing.
The tutorial you are following already covers several data manipulation tools as examples. The Galaxy 101 tutorial also covers many data manipulation tools, including the different methods of joining datasets together based on common content. More tools that will be useful are in the tool groups “Bed” (bedtools), plus “Operate on Genomic Intervals” and “General Text Manipulation” (where most of the tools in the tutorial are grouped).
Some experimentation on your part will be needed, not only with the choice of tools but also with the reference annotation that the peak/summit coordinates will be compared to.
For example, to get the data for column 7 (the position of the peak relative to the TSS), you will need the TSS coordinates of the gene to perform a distance calculation, and you will need to decide whether to measure that distance from the peak region (start or end) or from the summit.
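To make the distance calculation concrete, here is a minimal Python sketch of the logic, written outside Galaxy. It assumes 0-based, half-open BED-style coordinates, takes the TSS as the strand-aware 5' end of the transcript record, and measures from the summit rather than the peak start or end. The function names are hypothetical and do not correspond to any Galaxy tool.

```python
def tss_from_bed(chrom_start, chrom_end, strand):
    """Return the 0-based TSS position of a BED transcript record."""
    if strand == "+":
        return chrom_start
    if strand == "-":
        # BED ends are exclusive, so the 5' base on '-' is chrom_end - 1.
        return chrom_end - 1
    raise ValueError(f"unknown strand: {strand!r}")


def relative_to_tss(summit, tss, strand):
    """Signed distance in bp: negative = upstream, positive = downstream."""
    if strand not in ("+", "-"):
        raise ValueError(f"unknown strand: {strand!r}")
    offset = summit - tss
    # On the minus strand, lower coordinates are downstream of the TSS.
    return offset if strand == "+" else -offset


def describe(distance):
    """Turn a signed distance into a short label for the report column."""
    if distance == 0:
        return "at TSS"
    side = "upstream" if distance < 0 else "downstream"
    return f"{abs(distance) / 1000:g} kb {side}"


# Made-up example: a '+' strand gene spanning 2000-9000, summit at 500.
tss = tss_from_bed(2000, 9000, "+")
print(describe(relative_to_tss(500, tss, "+")))
```

Within Galaxy, the same arithmetic could be done on joined peak and gene columns with a column-computation tool; which tool to use is left open here.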
In the final step, remove/rearrange columns of data to produce the final custom summary report (tool: Cut). The report will no longer be in interval format but tabular, since it contains multiple coordinate regions, but you can always create properly formatted (and attribute-labeled) bed or interval datasets from it as needed. Be aware that bed format is stricter than interval format. A derived bed dataset can carry either the gene ID or the unique peak name in its 4th “name” column, but not both, unless you merge the two values together (without introducing whitespace). You may need to merge those two values anyway, to use the result as a common key for joining intermediate datasets.
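As an illustration of this last step, the sketch below merges the peak name and gene ID into one whitespace-free key and writes one tab-separated row per peak-gene pair. The column names, the `|` separator, and the example values are all assumptions made for illustration; they are not produced by MACS or by any Galaxy tool.

```python
import csv
import io

# Column order follows the table requested in the question (hypothetical names).
COLUMNS = ["chrom", "peak_start", "peak_end", "summit",
           "peak_name", "gene_id", "distance_to_tss"]


def merged_key(peak_name, gene_id, sep="|"):
    """Combine peak name and gene ID into a single join key."""
    key = f"{peak_name}{sep}{gene_id}"
    if any(ch.isspace() for ch in key):
        raise ValueError("key must not contain whitespace")
    return key


def write_report(rows, handle):
    """Write a header line followed by one tab-separated line per row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([row[name] for name in COLUMNS])


# Made-up example row; real values would come from the joined datasets.
rows = [{
    "chrom": "chr1", "peak_start": 1000, "peak_end": 1400, "summit": 1200,
    "peak_name": "MACS2.1.0_in_Galaxy_peak_1", "gene_id": "geneA",
    "distance_to_tss": -1500,
}]
buffer = io.StringIO()
write_report(rows, buffer)
print(buffer.getvalue(), end="")
print(merged_key("MACS2.1.0_in_Galaxy_peak_1", "geneA"))
```

Rejecting whitespace in the key keeps it usable as a single column value in tabular and bed datasets.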
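For the reference annotation, the UCSC Table Browser is a good choice; this is where the tutorial extracts the other bed data it uses. Note that the TSS is the 5' end of the transcript (the transcript start on the + strand, the transcript end on the - strand), not the start of its coding region.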
Please feel free to publish your updated workflow (or workflows!) and post back a link; it may help others.