htseq count issue using UCSC-sourced indexed bacterial genomes and NCBI reference annotation (mismatched chrom identifiers)

Hi,
I’m trying to use htseq-count to tell me how many reads I have mapped to each gene in my BAM. The BAM was created with bowtie2 using the built in reference genome MG1655 and I am using a GFF downloaded from NCBI for this strain. I’ve read that it is probably mismatched chromosome identifiers but I am having trouble fixing this. Is there somewhere that contains the GFFs that match the built in reference genomes? Viewing my Bam file. Doing an IdxStats on my BAM returning only the header shows that my chromosome identifier is chr, but I’m unsure of how to match my GFF to this. I think the GFF identifier is the seqid from NCBI (NC_000913.3).

Any help would be much appreciated.

Thanks,
Tayla

1 Like

Hi,

You’ll probably need to use a custom reference genome instead of the built-in index, to get the chromosome identifiers to match up. Chromosome mismatches are common for bacterial genomes that were originally sourced/indexed from the UCSC Mircrobial source (have customized chromosome identifiers and the same source does not have updated matching reference annotation).

The genome and reference annotation are available from NCBI. Genome: https://www.ncbi.nlm.nih.gov/nuccore/556503834

FAQs:

The first FAQ explains how to standardize fasta data with the tool NormalizeFasta. Removing the description line content from custom genomes is important, and you won’t notice the problem during mapping, only after when using downstream tools. This usually means starting over from mapping again, using a standardized fasta.

The second FAQ includes help for modifying identifiers, and might be enough but involves a few steps and even that might not work. The genome could be from an earlier assembly version. You could check this by comparing the chromosome length in the BAM versus the GFF source.

That said, I would still recommend using a custom genome fasta and matching annotation GFF, both from the same source and confirmed to be from the same assembly build/version. Things will go smoother and avoid any potential hidden scientific problems (including putatively “successful” jobs that are not quite right content-wise). Plus your chromosome identifiers will be more meaningful (“NC_000913.3”), not just “chr”. These UCSC microbial genomes were indexed directly from the UCSC source many years ago (over 9). We keep them for backward-compatibility reasons but for most current use cases, should be avoided. A custom genome + custom build is a better choice.

Thanks!

Hi!
I fixed this by manually changing the first line of my FASTA file in notepad to match the chromosome name in my GFF file. Then using the FTP to upload both and performing bowtie using this new FASTA as my reference genome, and subsequent htseq-count with the GFF file and it worked perfectly.

Thanks.

1 Like

Glad that worked!