I am trying to align my RNA-seq reads (Bacteria), using HISAT2 to a reference genome (assembly.fna). Then,to measure the expression in the resulted bam.file, I used (featurecounts) tool, using the (gtf) annotation file. However, although both fna/gtf files belong to the same strain tested, I got this empty report;
Your help much appreciated
please check if the reference genome and the gene annotation file use the same sequence name.
Thank you for your quick response, both files belong to the same reference genome.
Could you please explain what does reference name mean;
Do you mean both files should have same name like ( genome.fna and genome.gtf) ?
No, not the file names, the sequence/contig names.
Click on Preview (eye) icon of the reference genome (FASTA) file. Check the sequence name in the first line after ‘>’. Most tools consider a text string before the first space character as a name, and the rest is a comment.
Click on Preview icon of the gene annotation file (GTF). Check the name in the first column (seqname). Sometimes the annotation file has a header. Header lines start with #. Ignore the header lines.
Both files should have identical sequence name.
This is assuming the assembly contain a single sequence. If you click on name of FASTA file, Galaxy shows number of sequences present in the file. Some assemblies may have more than one. If it is the case, filter the genome assembly (FASTA) file using something like ^> (select lines staring with ‘>’) and check that the annotation file uses the same sequence names.
Alternatively, you can compare sequence names in the BAM file (SQ line(s)) and the gene annotation file. I assumed you use reference genome from history, hence mentioned FASTA file.
I checked the three files (reference genome from history .fna), bam, and GTF.
- fna.file has several sequence/names such as (> NZ_JAFJXZ010000001.1, >NZ_JAFJXZ010000010.1 …etc)
*gtf.file and bam.files has also same several sequence/names (please check the attached screen shots.
Do you think such multiple sequences may affect the alignment, if yes how can we solve it?
So, it is not the reference data.
featureCounts summary indicates majority of reads are mapped outside of the annotated genes. Sometimes RNA-Seq libraries are extremely enriched with rRNA sequences. Run Samtools idxstats on BAM. Do you see reads mapped to all contigs? Do you see a contig with unusually high number of aligned reads, something like >90% of reads?
Both gtf and fan files were downloaded form NCBI. I run samtools stats and idxstats (resultes in attached screen shots).
From the the samtools stats report you can see ( read mapped and paired = 18495100, unmapped =339882).
the reads are mapped to many contigs. This is a good sign.
If you share the history I can have a look. As you got no counts for genes, mismatch between the genome and the gene annotation is the prime suspect.
History sharing: History menu (cog wheel icon at the top right corner of the History panel) > Share or Publish > Create a link and paste in in reply.
If you have many samples, you can copy files for one sample (reference genome, gene annotation, BAM and featureCounts outputs) and share only one sample.
Thanks for the follow up,
I repeated the steps, and this time, I did not use any trimming tools.
Please use this link to access the workflow;
output of featureCounts #14 has counts for 157 genes, so it does count reads against some genes. The small number of genes is an unintended consequence of the gene annotation. Tools such as featureCounts and htseq-count count reads against a feature in 3d column of GTF file and aggregate the results using an attribute from the last column. For example, feature in the 3d column can be ‘exon’, and attribute is ‘gene_id’. The tool counts reads against axons and returns count for a value (gene ID) in ‘gene_id’. Your GTF file does not have ‘exon’ feature for some genes. Filter the GTF file on column 3 using Filter data on any column using simple expressions with c3==‘exon’. You’ll see 157 records. As bacterial genes are not spliced, the featureCounts estimated read counts for genes with available ‘exon’ feature. You can select a feature used by featurCounts for counting in Advanced options. If you set it to CDS, featureCounts returns 5092 lines. With CDS you’ll miss some ncRNAs. Try different features, examine the annotation, and select a feature most appropriate for purpose of your analysis.
You can unshare the history if you don’t have any other questions: History menu > Share or Publish > Unshare it.