Extracting feature sequences with gene ID information

cgreig · April 27, 2025, 1:24pm

To extract 5’ UTR sequences from my genome I have been able to use the Extract Features tool to pull out the locations of the 5’ UTRs from the published genomic gff3 file and use this to extract fasta sequences from the genome. For this I have used bedtools getfasta and also the Extract Genomic DNA tool, both of which pull out a fasta file with all the sequence information.

The trouble is that all the gene information is lost: the fasta headers look like this

chromosome_01:27911-28050

or this

five_prime_UTR::chromosome_01:27911-28050

The information is in the gff3 file, but only in column 9 (the attribute column), as the ID. In this case the gene information I want is Cre01.g000050_4532.1.v6.1, and the gff3 entry is

[chromosome_01 phytozomev13 five_prime_UTR 27912 28050 . + . ID=Cre01.g000050_4532.1.v6.1.five_prime_UTR.1;Parent=Cre01.g000050_4532.1.v6.1;pacid=52525454]
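For readers following along outside Galaxy, here is a minimal Python sketch of how the attribute column of the entry above splits into key/value pairs. The function name `parse_gff3_attributes` is made up for illustration, and the sketch assumes plain `key=value` pairs separated by semicolons (it does not decode URL-escaped characters, which GFF3 allows).

```python
def parse_gff3_attributes(column9: str) -> dict[str, str]:
    """Split a GFF3 attribute column (column 9) into a {key: value} dict."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


# The attribute column from the entry quoted above.
column9 = (
    "ID=Cre01.g000050_4532.1.v6.1.five_prime_UTR.1;"
    "Parent=Cre01.g000050_4532.1.v6.1;pacid=52525454"
)
attrs = parse_gff3_attributes(column9)
print(attrs["Parent"])  # Cre01.g000050_4532.1.v6.1
```

Note that in this entry the wanted identifier appears verbatim as the `Parent` value, while the `ID` value carries an extra `.five_prime_UTR.1` suffix.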

So I am trying to find some way of adding the ID in the attribute column to the fasta header. I have tried everything I can. It should be simple? Can you help please?

The history link is here Galaxy

Appreciate your help

jennaj · April 29, 2025, 6:02pm

Hi @cgreig

Thanks for sharing the history, very helpful!

When using bedtools getfasta, you can input a BED file with the label you want to use in the “name” field, then that will flow out to the fasta output.

The option to choose on the tool form is: ‘Use the ‘name’ column in the BED file and the coordinates for the FASTA headers in the output FASTA file’

From where you are now, running gffread on the output from dataset 6 (Extract features where the filter was the 5’ region) can produce a BED file that I think is what you’ll want. The toggle for the output type is near the bottom of the form.
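To illustrate what "a BED file with the label in the name field" means, here is a rough Python sketch that turns one GFF3 feature line into a BED6 line. This is not how gffread is implemented. The function `gff3_line_to_bed6` and its `name_key` parameter are hypothetical, and using the `Parent` attribute as the label is an assumption made for this example only. The coordinate shift is the standard one: GFF3 starts are 1-based, BED starts are 0-based.

```python
def gff3_line_to_bed6(line: str, name_key: str = "Parent") -> str:
    """Turn one tab-separated GFF3 feature line into a BED6 line.

    name_key is a hypothetical parameter: the attribute copied into the
    BED 'name' column (column 4).
    """
    fields = line.rstrip("\n").split("\t")
    chrom, _source, _type, start, end, _score, strand, _phase, attrs = fields
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )
    # GFF3 start is 1-based inclusive; BED start is 0-based, end is unchanged.
    bed_start = int(start) - 1
    return "\t".join(
        [chrom, str(bed_start), end, attributes[name_key], "0", strand]
    )


# The 5' UTR entry from the first post, written with real tab separators.
gff3_line = "\t".join([
    "chromosome_01", "phytozomev13", "five_prime_UTR", "27912", "28050",
    ".", "+", ".",
    "ID=Cre01.g000050_4532.1.v6.1.five_prime_UTR.1;"
    "Parent=Cre01.g000050_4532.1.v6.1;pacid=52525454",
])
bed_line = gff3_line_to_bed6(gff3_line)
print(bed_line)
```

The start of 27912 becomes 27911 in BED, which matches the `27911-28050` range seen in the fasta headers quoted in the first post.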

Hope this helps!

cgreig · May 1, 2025, 7:11am

Thanks Jenna, it is a delight to have something work so easily.
So for the record, to extract a fasta file of feature sequences from a genome, the process is:
Use the tool Extract_features to pull out feature information into a new gff file. Then run gffread to convert this into a BED file, which will list the ID in the name column. Then use this with the genome to run bedtools_getfasta and choose the option ‘Use the ‘name’ column in the BED file and the coordinates for the FASTA headers in the output FASTA file’.
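The last step of the process above can be sketched in a few lines of Python, as a toy stand-in for what the named-header option produces. This is not bedtools itself. The function `named_fasta_record`, the genome `chr_toy` and the label `geneA.1` are all invented for the example, and the `name::chrom:start-end` header shape is modelled on the second header quoted in the first post. Strand is ignored here, so minus-strand features are not reverse-complemented.

```python
def named_fasta_record(genome: dict[str, str], bed_line: str) -> str:
    """Build one FASTA record from a BED line, labelled with the BED name."""
    chrom, start, end, name = bed_line.split("\t")[:4]
    start, end = int(start), int(end)
    # BED intervals are 0-based and half-open, like Python slices.
    sequence = genome[chrom][start:end]
    header = f">{name}::{chrom}:{start}-{end}"
    return f"{header}\n{sequence}"


# Made-up data, only to show the shape of the output.
toy_genome = {"chr_toy": "ACGTACGTAC"}
record = named_fasta_record(toy_genome, "chr_toy\t2\t6\tgeneA.1\t0\t+")
print(record)
# >geneA.1::chr_toy:2-6
# GTAC
```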

I had tried to use other tools to create a BED file but these didn’t use the ID. There are not many clues to suggest that gffread does, but it works like a dream. Thank you!

jennaj · May 1, 2025, 4:42pm

Great, I’m glad this worked out!!

We have a few tutorials that cover data manipulations to help. These can definitely be logic puzzles! GTN Materials Search (query=olympics)
