Very few genes in htseqCount-/FeatureCounts-Output

Hi everyone,

I am currently analyzing RNASeq data obtained from dog cell culture. I used HISAT2 for the alignment to the canis familiaris reference genome in Galaxy. Then, to count the reads, I tried both htseq-count (Galaxy Version 0.9.1) and featureCounts (Galaxy Version 1.6.4+galaxy1). As gene annotation file, I took this file from iGenomes:
It contains 32461 lines, among them 15178 exons (feature type). As I run htseq-count/featureCounts with ‘exon’ as feature type, I expected 15178 lines in the outout of these tools. In contrast, I get both in htseq-count and featureCounts only 2023 lines, even though exons with 0 counts are included in the output.
Do you have any idea what could have happened to the remaining exons?
Thank you.



1 Like

Hi @HanMue

The counts are summarized by the meta-feature “gene” at default settings. This is specified by the Advanced option “GFF gene identifier” set to “gene_id”.

There can be one or more exon per transcript, then one or more transcript per gene.

As a test to see how this is arranged in gtf annotation, try using the tool Select with a simple query that just contains the name of one of your genes against the gtf dataset. Can check a few if you want. For each, you’ll see how the formatting is organized for that particular gene and can compare it to the gtf datatype specification in the FAQ below.

To instead count up by feature (exons), toggle the Advanced option “On feature level” to be “Yes”. (No is the default). There are more options that can be adjusted – could try a few different ways and compare to better understand how each impact your particular data.

The above options are for FeatureCounts. Htseq-count is similar but with fewer options.



1 Like

Hi @jennaj

thank you for your kind reply.
I checked the gtf annotation file. As a standard gtf-file, it has 9 fields containing: grafik

To check the formatting on the gene level as you recommended, I randomly checked the gene ‘BCL2’ and this is what I got:

I also have checked a few other genes and they have something between 1 and aprox. 15 lines each.
That is why I think counting by the meta-feature “gene”, the 15178 exon lines in the gtf annotation shrink to 2023 genes.

Doing this, I get 15179 lines as I expected in the beginning, but know the counts look like this:
The counts of the different exons of one gene are know split into different lines (before, when counting by the meta-feature “gene” it was only one line per gene). But will this have any adavantage for DeSeq2 because the information is missing to which exon the counts belong? Then, what was confusing, I added the counts of e.g. all BCL2 lines in the picture above and compared it to the count number of BCL2 in the first featureCounts run counting by the meta-feature “gene” and it does not match.

My main problem is, that 2023 genes are a very small number that makes downstream analysis like differential gene expression with DeSeq2or a PCA difficult/sketchy…

Do you think I should try another annotation, e.g. the ensembl-version?
I have nerver worked with the organism canis familiaris (dog) before, so I do not know how well it is annotated, perhaps the small number of genes is due to the quality of annotation?

Many new questions in this post, but your answer solved my original question! Thank you!


1 Like

Hi @HanMue

The iGenomes version of the annotation is based on RefSeq from several years ago.

Annotation from Ensembl will be more current but will require some manipulations to have the chromosome identifiers match up with the built-in canFam3 indexes that are sourced from UCSC.

Or, you can also use the matching Ensembl fasta version of the genome and use that as a Custom Reference Genome/Build with tools. Be aware that the dog genome is very large and may exceed server processing resources when used this way at a public Galaxy server. But you can try.

Do not assign the “database” as canFam3 if you decide to use all Ensembl data. You’ll also need to remap. The Ensembl version of the data does not have the same chromosome identifiers as the UCSC version. Every step in an analysis needs to use data that is all based on the same exact genome/build.