Gene IDs being filtered out in ChIP-seq analysis

I’ve performed ChIP-seq on a plant TF, and I’m seeing lots of peaks in promoter regions, so it looks like I’ve got a good quality dataset. The problem is that I lose the gene IDs: my final list has the relevant peaks in promoter regions, but excludes the gene names. I worked back and saw that the gene IDs are being filtered out when I run the ‘Get flanks’ tool to identify promoter regions. I’ve tried a few workarounds but lose the gene IDs each time; I even tried to use the ‘Join’ tool after combining my peak list and the genome annotation file again, but still don’t get a gene ID list. Any suggestion on how to amend this? Is this an issue with the way the genome annotation (gff3) file is set up?
Thanks,
Tim

Welcome, @TCameron1

Check your inputs to the tool – does your interval dataset contain the gene IDs to start with? This tool doesn’t add in annotation.

Note that the “gene ID” here is really a transcript ID. Coordinates are never based on a gene, since each gene can have many transcripts, and this is coordinate data. You can filter for a “primary” transcript to represent your gene if needed. Then you can link back the gene ID using transcript IDs in the data – just know that you’ll potentially get multiple rows of data with the same gene ID, so it can’t be used as a primary key. I hope I am not being too confusing here!
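To make the transcript-to-gene idea concrete, here is a minimal Python sketch. The IDs and coordinates are made up for illustration, and treating the longest transcript as the "primary" one is only an assumption; your annotation source may define a primary transcript differently.

```python
from collections import defaultdict

# Hypothetical (transcript_id, gene_id, start, end) rows; real IDs come from your annotation.
transcripts = [
    ("GeneA.1", "GeneA", 1000, 7000),
    ("GeneA.2", "GeneA", 1500, 6000),
    ("GeneB.1", "GeneB", 9000, 9800),
]

# Group transcripts by gene, keeping each transcript's genomic span.
by_gene = defaultdict(list)
for transcript_id, gene_id, start, end in transcripts:
    by_gene[gene_id].append((transcript_id, end - start))

# Assumption: "primary" means the transcript with the longest span for each gene.
primary = {
    gene_id: max(items, key=lambda item: item[1])[0]
    for gene_id, items in by_gene.items()
}

# Lookup used to link a gene ID back onto any row that carries a transcript ID.
transcript_to_gene = {transcript_id: gene_id for transcript_id, gene_id, _, _ in transcripts}

print(primary)
print(transcript_to_gene["GeneA.2"])
```

Note that `transcript_to_gene` maps two different transcripts to `GeneA`, which is why the gene ID alone cannot serve as a unique key.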

The interval input file, a type of tabular file, has a specific format. There is an example on the tool form and a link to a tutorial with a walk-through.

Quote from the form for anyone else reading, too!

Help

This tool finds the upstream and/or downstream flanking region(s) of all the selected regions in the input file.

Note: Every line should contain at least 3 columns: Chromosome number, Start and Stop co-ordinates. If any of these columns is missing or if start and stop co-ordinates are not numerical, the tool may encounter exceptions and such lines are skipped as invalid. The number of invalid skipped lines is documented in the resulting history item as a “Data issue”.


Example 1

  • For the following dataset:

chr22 1000 7000 NM_174568 0 +

  • running get flanks with Region: Around start, Offset: -200, Flank-length: 300 and Location: Upstream will return (Red: Dataset positive strand; Blue: Flanks output):

chr22 500 800 NM_174568 0 +

[Image: flanks_ex1.gif, diagram of Example 1]
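(Not part of the tool help.) The arithmetic behind Example 1 can be sketched in a few lines of Python. This only reproduces the numbers in the example for a plus-strand region; it is not the tool's own code, the function name is made up, and minus-strand handling is deliberately left out.

```python
def upstream_flank_around_start(start, offset, flank_length):
    """Upstream flank around the start of a plus-strand region.

    Reproduces the arithmetic of Example 1 only (hypothetical helper).
    """
    flank_end = start + offset              # a negative offset moves the flank further upstream
    flank_start = flank_end - flank_length  # the flank extends flank_length bases upstream of its end
    return flank_start, flank_end


row = ["chr22", "1000", "7000", "NM_174568", "0", "+"]
flank_start, flank_end = upstream_flank_around_start(int(row[1]), offset=-200, flank_length=300)

# Columns after the coordinates (name, score, strand) are carried over unchanged.
flank_row = [row[0], str(flank_start), str(flank_end)] + row[3:]
print("\t".join(flank_row))
```

The name column (`NM_174568`) rides along untouched, which is why the ID has to be in the input before the tool runs.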

Notice how that is a “BED” format for at least the first 5 or 6 columns. With interval, you can have other columns after that; the tool will ignore them, but you can link them back in afterwards. You can find a more complete description of BED files at UCSC (they developed the format originally). Interval format was developed by Galaxy, and is any tabular file with BED style data in the first columns. It was designed to be a looser format on purpose: Galaxy incorporates tools from many different original tool authors with different data requirements, so interval sort of fills that gap to let data “play well” with them. You can ask us if something isn’t clear for a particular tool (same as you are doing now!).

Back to your question: The name field is where you want to place your transcript ID. Make sure it is a single word with no spaces. Then include the coordinates, strand, and score (if you don’t have a score, just pad it with a number like 0).

If you need to manipulate your data, you can convert your GFF3 to interval (extract and rearrange the fields this tool is expecting to work with). Please let us know if you need more specific instructions. It would be helpful for us to see the exact GFF3 file for that – how to share work is in the banner at this forum. You could just make a new history, copy the file into it, and share it, or share the full failed run. Either should be fine for this.

We have some great guides about using data manipulation utilities. Most are the same as those used on the command line, so if you are familiar with those, searching the tool panel will find them. If this is new to you, the tutorial guides can help you to get started.

Hope this helps and let us know how this works out! :slight_smile:

Thanks Jenna!
Yes, it appears to be a problem with the format of the Lotus japonicus gff3 file, obtained from Phytozome. I’ve attached a screenshot of it here; the gene ID appears under the ‘attributes’ header, which I don’t think is recognised by ‘Get flanks’ and is excluded. The column header ‘type’ in the gff3 file is converted into ‘name’ in the output of the ‘Get flanks’ tool, so if I can change the headers of the gff3 file I might be able to switch those around. Any suggestions on how to do that?
Thanks,
Tim

Hi @TCameron1

Convert the file to interval format, with BED columns at the start, as I described.

GFF3 is not a true “tabular” datatype format. Why?

  • It has headers, and those confuse some tools.

  • It has the attributes you want to use nested inside a single field (the target value, other values, labels), the 9th column. So even if a tool can ignore the headers, it cannot find the value you want to use as the “name” isolated by itself in a column. This is what your error messages were reporting.

Once converted, you’ll just have the actual transcript/gene value in the column of data you’ll be selecting for the “name” column.
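For anyone who prefers to script it outside of Galaxy, here is a rough Python sketch of that conversion. It makes several assumptions: the transcript rows have type `mRNA`, the transcript ID is stored under the `ID` key in the 9th column, and the sample lines and IDs are invented. Check these against your own GFF3 before relying on it.

```python
def parse_attributes(field):
    """Split a GFF3 attributes field (column 9) into a dict of key -> value."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def gff3_to_bed6(lines, feature_type="mRNA", id_key="ID"):
    """Yield BED6-style rows (chrom, start, end, name, score, strand) for one feature type."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue  # keep only well-formed rows of the requested type
        name = parse_attributes(cols[8]).get(id_key)
        if not name:
            continue  # no usable ID, so nothing to put in the name column
        start = int(cols[3]) - 1  # GFF3 is 1-based inclusive; BED is 0-based, end-exclusive
        end = int(cols[4])
        strand = cols[6] if cols[6] in ("+", "-") else "."
        # Name must be a single word; score is padded with 0.
        yield [cols[0], str(start), str(end), name.replace(" ", "_"), "0", strand]


# Invented sample lines, for illustration only.
sample = [
    "##gff-version 3",
    "chr1\tsource\tgene\t1001\t7000\t.\t+\t.\tID=Gene0001",
    "chr1\tsource\tmRNA\t1001\t7000\t.\t+\t.\tID=Gene0001.1;Parent=Gene0001",
]

rows = list(gff3_to_bed6(sample))
for row in rows:
    print("\t".join(row))
```

Swapping `id_key` to `Parent` would put the gene ID in the name column instead, at the cost of repeated names when a gene has several transcripts.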

So try that first with the data manipulation tools, and if you get stuck, we can follow up here more if you share back what you have so far.

Converting formats is super common, so learning how to do this isn’t just a one-off, it is something you’ll be doing a lot, whether working in Galaxy or not.

Hope this helps! :slight_smile:

Hi Jenna,
Thanks so much for your help, have managed to get that working by manipulating the gff3 file! :slight_smile:

Great! So glad that helped :rocket: