Extracting feature sequences with gene ID information

To extract 5’ UTR sequences from my genome, I used the Extract Features tool to pull the 5’ UTR locations out of the published genomic gff3 file, then used those locations to extract fasta sequences from the genome. For the extraction step I tried both bedtools getfasta and Extract Genomic DNA; each produces a fasta file with all the sequence information.

The trouble is that all the gene information is lost: the fasta headers look like this

chromosome_01:27911-28050

or this

five_prime_UTR::chromosome_01:27911-28050

The information is in the gff3 file, but only in column 9, the attributes column, as part of the ID (and as the Parent value). In this case the gene information I want is Cre01.g000050_4532.1.v6.1, and the gff3 entry is

[chromosome_01 phytozomev13 five_prime_UTR 27912 28050 . + . ID=Cre01.g000050_4532.1.v6.1.five_prime_UTR.1;Parent=Cre01.g000050_4532.1.v6.1;pacid=52525454]
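For readers following along outside Galaxy, here is a minimal Python sketch of how the identifier can be read out of column 9 of that entry. It assumes the line is tab-separated (as in a real GFF3 file, even though it is shown with spaces above) and that the wanted identifier is the Parent value; the function name `parse_gff3_attributes` is just an illustrative helper, not part of any tool mentioned in this thread.

```python
def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attributes column ('key=value;key=value') into a dict."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


# The entry quoted above, written with tabs (assumed, as in a real GFF3 file).
gff_line = (
    "chromosome_01\tphytozomev13\tfive_prime_UTR\t27912\t28050\t.\t+\t.\t"
    "ID=Cre01.g000050_4532.1.v6.1.five_prime_UTR.1;"
    "Parent=Cre01.g000050_4532.1.v6.1;pacid=52525454"
)

fields = gff_line.split("\t")
attrs = parse_gff3_attributes(fields[8])  # column 9 is index 8
transcript_id = attrs["Parent"]           # assumption: Parent holds the wanted ID
print(transcript_id)
```

Note that the ID attribute itself carries a `.five_prime_UTR.1` suffix, so in this sketch the Parent value is the one that matches the identifier asked for.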

So I am trying to find some way of adding the ID from the attributes column to the fasta header. I have tried everything I can think of, and it seems like it should be simple. Can you help, please?

The history link is here Galaxy

Appreciate your help

Hi @cgreig

Thanks for sharing the history, very helpful!

When using bedtools getfasta, you can input a BED file that has the label you want in the “name” field (column 4). That label will then carry through to the fasta output when you select this option:

Use the ‘name’ column in the BED file and the coordinates for the FASTA headers in the output FASTA file

From where you are now, running gffread on the output from dataset 6 (Extract features where the filter was the 5’ region) can produce a BED file that I think is what you’ll want. The toggle for the output type is near the bottom of the form.
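To make the GFF3-to-BED step less of a black box, here is a rough Python sketch of what such a conversion involves for a single feature line. It is not gffread's implementation and may not match its output exactly; it only illustrates the coordinate shift (GFF3 is 1-based, BED is 0-based and half-open) and where the label ends up. The function name `gff3_feature_to_bed6`, the choice of the Parent attribute as the name, and writing a score of 0 for a missing score are all assumptions made for illustration.

```python
def gff3_feature_to_bed6(gff_line: str, name_key: str = "Parent") -> str:
    """Convert one tab-separated GFF3 feature line into a BED6 line."""
    fields = gff_line.rstrip("\n").split("\t")
    seqid, _source, _type, start, end, score, strand, _phase, column9 = fields[:9]

    # Turn 'key=value;key=value' into a dict.
    attributes = dict(
        pair.split("=", 1) for pair in column9.split(";") if "=" in pair
    )

    bed_start = int(start) - 1                 # 1-based inclusive -> 0-based start
    bed_score = "0" if score == "." else score  # assumption: 0 when no score given
    return "\t".join(
        [seqid, str(bed_start), end, attributes[name_key], bed_score, strand]
    )


example = (
    "chromosome_01\tphytozomev13\tfive_prime_UTR\t27912\t28050\t.\t+\t.\t"
    "ID=Cre01.g000050_4532.1.v6.1.five_prime_UTR.1;"
    "Parent=Cre01.g000050_4532.1.v6.1;pacid=52525454"
)
bed_line = gff3_feature_to_bed6(example)
print(bed_line)
```

The start shifts from 27912 to 27911, which is why the fasta headers earlier in the thread read 27911-28050.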

Hope this helps! :slight_smile:

Thanks Jenna, it is a delight to have something work so easily.
So for the record, to extract a fasta file of feature sequences from a genome, the process is:
1. Use the Extract Features tool to pull the feature information out into a new gff file.
2. Run gffread to convert this into a BED file, which will list the ID in the name column.
3. Run bedtools getfasta with this BED file and the genome, and choose the option ‘Use the ‘name’ column in the BED file and the coordinates for the FASTA headers in the output FASTA file’.
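For anyone curious what the last step does with the name column, the idea can be sketched in a few lines of Python. This is a toy stand-in, not bedtools itself: the genome is a small made-up dictionary, the names `extract_named_fasta`, `chr_demo` and `gene_demo.1` are hypothetical, strand is ignored, and the `name::chrom:start-end` header layout is assumed from the header format shown earlier in this thread.

```python
def extract_named_fasta(genome: dict, bed_lines: list) -> str:
    """Build FASTA text from BED intervals, using the BED name in each header."""
    records = []
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        start, end = int(start), int(end)
        sequence = genome[chrom][start:end]  # BED is 0-based, end-exclusive
        # Assumed header layout: name, then '::', then the coordinates.
        records.append(f">{name}::{chrom}:{start}-{end}\n{sequence}\n")
    return "".join(records)


# Toy data (hypothetical), just to show the shape of the output.
toy_genome = {"chr_demo": "ACGTACGTACGT"}
toy_bed = ["chr_demo\t2\t6\tgene_demo.1\t0\t+"]

fasta_text = extract_named_fasta(toy_genome, toy_bed)
print(fasta_text, end="")
```

For features on the minus strand you would also want the reverse complement, which this sketch leaves out.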

I had tried to use other tools to create a BED file, but these didn’t use the ID. There are not many clues to suggest that gffread does, but it works like a dream. Thank you!

Great, I’m glad this worked out!!

We have a few tutorials that cover data manipulations to help. These can definitely be logic puzzles! :slight_smile:

GTN Materials Search (query=olympics)