Filter GTF attributes by a list of attributes

I have a GTF that looks like this with 2 000 000 rows:

I now want to filter it with a list of transcript_ids (the second item in the “attributes” column). I tried the tool Filter GTF data by attribute values_list, but it outputs an empty file.
What am I missing?

Hi @notetienne

An empty result suggests that the values in the “And attribute values” input are not an exact match for the transcript_id values in the GTF dataset, which are Ensembl IDs with a .N suffix, where N is the version. If your list contains Ensembl IDs without a .N version, or with a different version than the one in the GTF, the data won’t match up. The list dataset must also be in tabular format, with the IDs in the first column.

For your stated goal, the “Using attribute name” input on the tool form also needs to be exactly this, with no trailing spaces: transcript_id
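
To see why a missing or different version leads to an empty output, here is a minimal Python sketch of exact matching on transcript_id. It is not how the Galaxy tool is implemented; the function names (`strip_version`, `filter_gtf`), the `ignore_version` switch, and the example records are all made up for illustration.

```python
import re

# Matches the transcript_id attribute, e.g. transcript_id "ENST00000000001.2";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def strip_version(identifier):
    """Drop a trailing .N version suffix, if present."""
    return re.sub(r"\.\d+$", "", identifier)


def filter_gtf(gtf_lines, wanted_ids, ignore_version=False):
    """Keep GTF lines whose transcript_id is in wanted_ids (hypothetical helper)."""
    normalize = strip_version if ignore_version else (lambda s: s)
    # Stripping whitespace here is a choice made in this sketch only.
    wanted = {normalize(i.strip()) for i in wanted_ids if i.strip()}
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        match = TRANSCRIPT_ID.search(line)
        if match and normalize(match.group(1)) in wanted:
            kept.append(line)
    return kept


# Illustrative records with invented IDs and coordinates.
gtf = [
    '1\thavana\texon\t100\t200\t.\t+\t.\t'
    'gene_id "ENSG00000000001.1"; transcript_id "ENST00000000001.2";',
    '1\thavana\texon\t300\t400\t.\t+\t.\t'
    'gene_id "ENSG00000000002.1"; transcript_id "ENST00000000002.1";',
]

# The list has no version, the GTF does: exact matching keeps nothing.
exact = filter_gtf(gtf, ["ENST00000000001"])
# Comparing without versions keeps the first record.
loose = filter_gtf(gtf, ["ENST00000000001"], ignore_version=True)
print(len(exact), len(loose))
```

In Galaxy itself the equivalent fix is to make the list match the GTF (or the other way around) before running the filter tool.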

If you cannot solve the problem with that information:

  1. What values does the tabular filter dataset contain? Click on the “eye” icon to display it in the center panel, then screenshot it and post it back. The first few lines are enough.

  2. A screenshot of the tool form will also help. Click on the “rerun” icon (the double-circle icon in an expanded dataset) to bring up the original tool form from the run that produced no results or unexpected results. That view may also help you to double-check what was originally entered.

Let’s start troubleshooting from there :slight_smile:

Hi @jennaj

Thank you for your help, it worked! :slightly_smiling_face:

I now need to extract “gene_id” from the resulting GTF. Is there a tool in Galaxy that can read and handle the “attributes” column of a GTF or GFF? I think I can do it with Excel, but I suppose there is a better option?

Hi @notetienne

These attributes can be parsed out in Galaxy.

  • Try the tool: Convert GTF to BED12
  • Open the Advanced options. There are two that are likely of interest, and one of those is required to produce a tabular output that contains gene_id and transcript_id values.
    • Output transcript information file: Set to yes to produce the tabular output (required)
    • Include gene and transcript version: Set to yes if the versions are still important to you. (optional)
  • From there you can use other text manipulation tools to extract just the data columns you need, count distinct identifiers, join that file with your other tabular subsetting list on common values (transcript_id), and the like. These tools are in the tool group “GENERAL TEXT TOOLS”.

Example of the Convert GTF to BED12 form with both of those options set to “Yes”.
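
If you ever need to do the same parsing outside Galaxy, the idea can be sketched in a few lines of Python. This is only an illustration of reading the “attributes” column, not a reproduction of what Convert GTF to BED12 writes out; the helper names (`parse_attributes`, `gene_transcript_pairs`) and the sample records are invented, and it assumes quoted values such as gene_id "…".

```python
import re

# key "value" pairs as they appear in the 9th (attributes) column of a GTF
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn the attributes column into a dict; the first occurrence of a key wins."""
    attrs = {}
    for key, value in ATTRIBUTE.findall(field):
        attrs.setdefault(key, value)
    return attrs


def gene_transcript_pairs(gtf_lines):
    """Return distinct (gene_id, transcript_id) pairs in order of first appearance."""
    pairs = []
    seen = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(columns[8])
        pair = (attrs.get("gene_id"), attrs.get("transcript_id"))
        if pair[0] is not None and pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


# Illustrative records with invented IDs and coordinates.
records = [
    '1\thavana\texon\t100\t200\t.\t+\t.\t'
    'gene_id "ENSG00000000001.1"; transcript_id "ENST00000000001.2";',
    '1\thavana\texon\t250\t280\t.\t+\t.\t'
    'gene_id "ENSG00000000001.1"; transcript_id "ENST00000000001.2";',
    '1\thavana\texon\t300\t400\t.\t+\t.\t'
    'gene_id "ENSG00000000002.1"; transcript_id "ENST00000000002.1";',
]

for gene_id, transcript_id in gene_transcript_pairs(records):
    print(gene_id, transcript_id, sep="\t")
```

The two exons of the first transcript collapse into one row, which is the same effect as counting distinct identifiers with the text tools.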

Hope that helps!