GenBank to GTF for featureCounts

AlexaDean · January 26, 2021, 10:16pm

Hey all,

I need to run featureCounts two separate times, each with a different annotated reference genome.
Q1. My first genome is not a common one: there is a GenBank file but no GTF file. I see mixed reviews on conversion tools…

Q2. For my second genome, I got a GTF file from Ensembl Bacteria. Do I need to convert it to UCSC format before running featureCounts, or do I just need to make sure it matches the BAM output files from my BWA-MEM mapping step? My GTF file has “chromosome” in the first column and “gene_id” in the attributes column, whereas my BAM files look different (attached here).

I would appreciate any advice on this.

Alexa

gallardoalba · January 26, 2021, 11:32pm

Hi @AlexaDean,

Regarding the first question, you can use the Genbank to GFF3 tool to generate an annotation file compatible with featureCounts. Concerning the second question, it is important that the annotation file corresponds to the same version of the reference genome that was used for the mapping.
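In practice, "corresponds" also covers the sequence names: BWA names each reference sequence after the first word of its FASTA header, and featureCounts compares those names with the first column of the annotation. A minimal sketch of that check follows, assuming plain-text FASTA and GTF/GFF input. The function names and the example records are made up for illustration and are not taken from the files in this thread.

```python
import io

def fasta_names(handle):
    """Collect sequence names: the first word of each '>' header line."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names

def annotation_names(handle):
    """Collect the names found in column 1 of a GTF/GFF file."""
    names = set()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section ends the annotation records
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names

# Hypothetical example data; replace with open("reference.fasta") etc.
fasta = io.StringIO(">NC_000001.1 Example organism chromosome\nACGT\n")
gtf = io.StringIO('Chromosome\tena\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n')

ref = fasta_names(fasta)
ann = annotation_names(gtf)

# Names used by the annotation that the mapper never saw
missing = sorted(ann - ref)
print("In annotation but not in reference:", missing)
```

If the list is empty, the names agree and no renaming should be needed. If it is not, either the annotation or the FASTA headers have to be renamed consistently before mapping and counting.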

Regards

AlexaDean · January 27, 2021, 2:36pm

Hello,

Sorry, I may be overthinking this, but I used the Genbank to GFF3 tool, and I keep reading here that GFF3 does not work with featureCounts, only GTF/GFF2. Is that right?

For the mapping step, I used BWA-MEM and selected “use a genome from history and build index”, so I just supplied the FASTA file of the reference genome and let BWA-MEM build the index. From all the tutorials, I thought this was fine, and that you then use a GTF file for the counting step?

Alexa

gallardoalba · January 27, 2021, 9:56pm

Hi @AlexaDean,
featureCounts does accept the GFF3 format; you can just try it. Regarding the second question, as I mentioned, the reference genome should be the same version as the annotation. As an example, have a look at the Human Genome Resources at NCBI: each version (GRCh38/GRCh37) has its own reference genome sequence and its own reference genome annotation.

Regards.

AlexaDean · January 29, 2021, 2:40pm

To follow up for anyone reading this thread: featureCounts produced an error when I tried to use the GFF3 file.

gallardoalba · January 29, 2021, 3:24pm

Hi @AlexaDean,
The problem is probably not the format itself but which features the file includes. Could you share your history with me? My email is gallardo@informatik.uni-freiburg.de.
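One way to see which features a file includes is to tally column 3 (feature type) and the attribute keys in column 9. On the command line, featureCounts counts features of type `exon` and groups them by the `gene_id` attribute unless told otherwise with `-t` and `-g`, and a GFF3 file converted from GenBank may contain neither. The sketch below is a rough illustration only: the `summarize` function and the two example records are invented, and the attribute parsing is deliberately simple.

```python
import io
import re
from collections import Counter

def summarize(handle):
    """Count feature types (column 3) and attribute keys (column 9)."""
    types = Counter()
    keys = Counter()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # stop at an embedded sequence section
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete annotation record
        types[cols[2]] += 1
        for field in cols[8].split(";"):
            field = field.strip()
            if not field:
                continue
            # GFF3 writes key=value, GTF writes key "value".
            # Simplification: values containing ';' are not handled.
            key = re.split(r"[=\s]", field, maxsplit=1)[0]
            keys[key] += 1
    return types, keys

# Hypothetical GFF3 records; replace with open("annotation.gff3")
gff3 = io.StringIO(
    "##gff-version 3\n"
    "seq1\tGenBank\tgene\t1\t90\t.\t+\t.\tID=gene0;locus_tag=ABC_0001\n"
    "seq1\tGenBank\tCDS\t1\t90\t.\t+\t0\tID=cds0;Parent=gene0;product=hypothetical protein\n"
)

types, keys = summarize(gff3)
print("Feature types:", dict(types))
print("Attribute keys:", dict(keys))
```

If the tally shows no `exon` rows or no `gene_id` key, pointing the feature type and gene identifier settings at values that do exist in the file (for example `CDS` and `ID` in the made-up records above) is worth trying before converting formats.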

Regards

AlexaDean · January 29, 2021, 4:34pm

Yes, I will share with you now, thank you very much!

I have two reference genomes which I re-labelled Reference 1 and 2. Both reference genomes have:

an associated normalized FASTA file (Reference #_Normalized) used for mapping
an associated GenBank file (Reference #.gb) downloaded from NCBI
an associated GFF3 file (Reference as gff3) generated by the Genbank to GFF3 tool in Galaxy

Reference 2 also has a GTF file downloaded from Ensembl Bacteria. Reference 1 is less common, and there is no GTF file I can download for it.

I am starting over at the mapping step, so that is currently running, but I have attached an example of the output I got from the earlier mapping run to give you an idea of the format.

Example BAM output from BWA-MEM