Question about the tutorial ``From peaks to genes``

Hello,

I am following the tutorial From peaks to genes in the GALAXY training.
It suggests that I must have two different kinds of files to start:

A) The file of my ChIP-seq peaks (Chr#, Start, End).

B) A refGene output in BED format, created with UCSC Main (Table Browser).

I can easily create A), but I have problems with B). Let me explain:

I am working with ChIP-seq data from two different organisms, Aspergillus nidulans (fungus) and Lotus japonicus (plant). For neither of them is there a relevant option to choose in the UCSC Main tool of GALAXY.
What should I do in order to get a file in a format like B) and continue with the tutorial?

I would appreciate any help,

Thanks,

Manolis

Hi @Manolis1

In order to follow that tutorial, you’ll need to have a BED12 dataset that represents the known transcripts in each of your specific organisms.
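For orientation, here is a minimal Python sketch of how one BED12 line is laid out for a single transcript. The function name, the example transcript, and the choice to set thickStart/thickEnd to the full transcript span are assumptions made for illustration; this is not what the Galaxy tools run internally.

```python
def transcript_to_bed12(chrom, name, strand, exons):
    """Build one BED12 line from 0-based, half-open exon intervals."""
    exons = sorted(exons)
    tx_start = exons[0][0]   # chromStart: start of the first exon
    tx_end = exons[-1][1]    # chromEnd: end of the last exon
    # blockSizes: exon lengths; blockStarts: exon starts relative to chromStart
    block_sizes = ",".join(str(end - start) for start, end in exons)
    block_starts = ",".join(str(start - tx_start) for start, _ in exons)
    fields = [
        chrom, tx_start, tx_end, name,
        0,                 # score (placeholder)
        strand,
        tx_start, tx_end,  # thickStart/thickEnd: full span, a simplification
        0,                 # itemRgb (placeholder)
        len(exons),        # blockCount
        block_sizes, block_starts,
    ]
    return "\t".join(str(f) for f in fields)


# Hypothetical two-exon transcript on a chromosome named "ChrI"
bed12_line = transcript_to_bed12("ChrI", "tx1", "+", [(100, 200), (300, 450)])
print(bed12_line)
```

The 12 tab-separated columns are what distinguish BED12 from BED6: the last three describe the exon structure of each transcript.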

A GTF reference annotation dataset could be converted to BED12 format with the tool: Convert GTF to BED12 (Galaxy Version 357)

Some genomes only have GFF3 annotation available, but that can be manipulated into GTF format with the tool: gffread Filters and/or converts GFF3/GTF2 records (Galaxy Version 2.2.1.2)

FAQ if you are not sure what these data formats are: Common datatypes explained

Also, those supplemental tools are not installed on every public Galaxy server. Your post is tagged as “usegalaxy.org” but that may not actually be where you are working.

  • Galaxy Main https://usegalaxy.org won’t have the first tool, and the second tool is considered deprecated at this server. Avoid using them there.
  • Galaxy EU https://usegalaxy.eu has both.
  • Data can be easily moved between servers. Copy the link from a dataset’s “disk” icon and paste that URL into the other server’s Upload tool.

Thanks!

Hi Jennifer,

Many thanks for your answer and your prompt response.

I tried to get the ‘gffread’ tool working, and it did work, but just once! Otherwise, I get the normal message that my job is in the queue… but no file ever appears in my history.

Nevertheless, it did work once and I got the GTF file that I have to use in the ‘Convert GTF to BED12’ tool. The problem is that when I run that tool, it fails to create any output. What I get is the message:

gtfToGenePred: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory

I have checked the GTF file produced by ‘gffread’ and it looks OK. All the necessary columns are there in the right order.

Any idea where the problem is?

Greetings,

Manolis

P.S.: Is there any other pipeline in GALAXY that will get me from Peaks to genes? Just to work on it in parallel.

Which server did you run Convert GTF to BED12 on? That error is a server-side dependency problem.

Update: I just ran the Convert GTF to BED12 tool successfully at Galaxy EU https://usegalaxy.eu. It isn’t available at Galaxy Main https://usegalaxy.org (but probably should be; I’ll make a request to add it). Converting to BED6 is possible at both, but you need a full BED12 for this particular operation.

If your error was at usegalaxy.eu, try a rerun. There may have been a transient cluster issue that led to the dependency not being found at runtime.

Update2: There are many ways to compare coordinates between files, find overlaps, then report, reformat, summarize, etc.

For example, the first genome you mentioned has annotation at NCBI. One of the formats available is “Tabular” (here: Proteins - Genome - NCBI). That data could be loaded into Galaxy and converted into an interval format (or the more stringent bed6 format).

This would involve a few steps:

  • Reformatting the chromosome names (probably; it depends on what these are in your peaks file).
  • Subtracting “1” from the start coordinate (start coordinates are 0-based in interval/bed files but are 1-based in the NCBI file).
  • Rearranging/restricting the columns of data (needed for bed format; for interval it wouldn’t matter).
  • Assigning the proper datatype at the end.
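The text steps above (everything except assigning the datatype, which is done in Galaxy) can be sketched in Python as follows. This is only an illustration under assumptions: the column headers (accession, start, stop, strand, locus_tag) and the sequence-name mapping are hypothetical, so check them against the actual NCBI table and your own peak file before adapting it.

```python
import csv

# Hypothetical mapping: annotation sequence names -> names used in the peak file
CHROM_NAMES = {"ACC_0001.1": "ChrI", "ACC_0002.1": "ChrII"}


def tabular_to_bed6(lines, chrom_names):
    """Convert tab-separated, 1-based annotation rows into BED6 lines."""
    bed_lines = []
    for row in csv.DictReader(lines, delimiter="\t"):
        # Step 1: rename the sequence so it matches the peak file
        chrom = chrom_names.get(row["accession"], row["accession"])
        # Step 2: 1-based start -> 0-based start; the end stays as-is
        start = int(row["start"]) - 1
        end = int(row["stop"])
        # Step 3: keep only the BED6 columns, in BED6 order
        fields = [chrom, str(start), str(end), row["locus_tag"], "0", row["strand"]]
        bed_lines.append("\t".join(fields))
    return bed_lines


# Made-up example rows, not real annotation
example = (
    "accession\tstart\tstop\tstrand\tlocus_tag\n"
    "ACC_0001.1\t101\t500\t+\tgeneA\n"
    "ACC_0002.1\t2000\t2600\t-\tgeneB\n"
)
bed_lines = tabular_to_bed6(example.splitlines(), CHROM_NAMES)
print("\n".join(bed_lines))
```

Inside Galaxy, the same manipulations can be done with the text tools the tutorial introduces; the sketch is only meant to make the coordinate arithmetic and column order explicit.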

Much of what this tutorial is describing is how to format data into compatible formats so that their genomic coordinates can be compared accurately. It uses functions/tools under the top-level tool grouping “GENERAL TEXT TOOLS”. The manipulations in the tutorial are specific to those particular files/datatypes but the reason why it is in the “Introduction” topic section, and contains so many manipulations, is to help people get familiar with some of those tools and manipulating data in general. Many of these tools are command-line utility analogs.
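To show why compatible formats matter, here is a deliberately naive Python sketch of the comparison itself: reporting which genes each peak overlaps. It assumes both inputs already use the same chromosome names and 0-based, half-open coordinates; the function name and example data are made up.

```python
def overlapping_genes(peaks, genes):
    """Return (peak, gene_name) pairs where a peak overlaps a gene."""
    hits = []
    for p_chrom, p_start, p_end in peaks:
        for g_chrom, g_start, g_end, g_name in genes:
            # Half-open intervals overlap when each starts before the other ends,
            # and only when the chromosome names are identical strings
            if p_chrom == g_chrom and p_start < g_end and g_start < p_end:
                hits.append(((p_chrom, p_start, p_end), g_name))
    return hits


# Made-up example data
peaks = [("ChrI", 150, 250), ("ChrI", 900, 950), ("ChrII", 150, 250)]
genes = [("ChrI", 100, 500, "geneA"), ("ChrII", 1999, 2600, "geneB")]
hits = overlapping_genes(peaks, genes)
print(hits)
```

Only the first peak is reported. The third peak has the same coordinates but sits on a different chromosome, which is why mismatched chromosome names (for example "ChrI" versus an accession) silently produce no overlaps.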

Some of this is explained in “Part2” of the tutorial. Biomart doesn’t have annotation for your two particular genomes, but NCBI does. If you are confused about what the dataset (file) formats should be like, or how to change metadata, or why primary keys like “chromosome” names need to match up, these FAQs should help:

Please try to reformat the annotation yourself. It is important to learn how to do this, and that will take some trial and error. But if you get completely stuck, write back and we can help more. I might ask for a history share link (can be sent privately). Keep the history as small as possible (just this analysis) and make sure it contains your peak file and the gene annotation files you have been working with (should include the original GFF3 and the NCBI tabular annotation plus your attempts to manipulate those). It can just be for one of the genomes. I’m assuming that both have the same format for the peak data, so whatever solution works for one will work for the other.

If your peak data is from a public source, or if you don’t mind making it public (at least in part/some subset), we could work out a clean solution then post all of that back here, so others can learn from the example. Or, we could just post back the steps to manipulate the NCBI tabular annotation into an interval dataset (simple history + workflow).