I’m trying to count reads mapped to human tRNA in some 1x40bp illumina small-seq files previously published. Since normal genome reference files do not include tRNAs, this means I need to make my own for analysis. The reference tRNA fasta file I used to create my genome index I took from gtrnadb using high-confidence tRNAs identified in hg38 assembly (hg38-tRNAs.fa) and mapped to the genome annotation file from ENCODE genes (Predicted tRNA genes GTF/GFF3 files). I trimmed using fastp and got back decent enough reads to continue. I ran bowtie2 and got back roughly 6% uniquely matching, which I reasoned may be more related to biology than a technical issue.
The problem: When I feed this bam file and relevant reference files into htseq-count, however, I get back “__no_feature” and “__not_aligned” reads.
The things I’ve tried:
Try other aligners and mappers, like STAR and hisat and featureCounts
List the “feature” flag as “tRNA” to reflect the GTF file column 3
Altered strandedness for each possibility
Confirmed chromosomal tags are consistent (however, this doesn’t necessarily relate because according to my tRNA reference fa file, each tRNA is being read as its own chromosome. I confirmed this with Samtools: IdxStats)
Viewed the bam mapping in IGV and confirmed reads are associated with specific tRNAs
Removed the top 5 # lines in the GTF file and altered the attribute to “GTF” as recommended in other help requests
Using normalizeFasta to “standardize” my reference fasta genome by wrapping to 80 and removing anything after the first whitespace. This truncates every line to, for example, “>Homo_sapiens_tRNA-Ala-AGC-1-1” and not the more typical “chrX” in other reference genomes
Tried setting my custom genome as a custom build, and set my cleaned fasta file and GTF to that build. featureCounts no longer has any unmapped reads, but still shows only “unassigned_NoFeatures” reads
Downloaded alternative tRNA fasta/GTF files from genome.ucsc.edu tables which fixed the “feature” issue being tRNA instead to exon, as expected in default layouts
I still think this is a chr mapping issue or an issue due to substantial multimapping? But I’ve exhausted all of my google skills to getting this far. Would really appreciate any help.
A chromosome mismatch issue is certainly possible for some of these tests. The reference genome (and resulting BAM mappings) and the reference annotation (GTF) must be based on the same underlying assembly.
However, I agree that multi-mapping is more likely to be the problem given your stats, tests, results, and the underlying biological content.
If you want to test to see if multi-mapping is truly the issue, what happens if you map against the UCSC hg38 assembly and use the reference GTF from Gencode specific for predicted tRNAs? This annotation is a “match” for the assembly. The result may give some clues about technical problems using the other assemblies/annotation versions – if any were present using the other data.
Copy the URL for the GTF version of the annotation and paste it into the Upload tool. You can set the reference genome if you want to (optional, but useful), but allow the datatype to be autodetected. It will be gff to start with.
Why? Because GTF data from this source includes headers. You will need to remove the # headers as you did before.
Finally, either assign or detect the datatype (pencil icon > Edit attributes forms > tab "Datatypes).
If either is not done, some tools will not recognize the dataset as appropriate input AND some tools that do recognize the dataset as appropriate may still not interpret the file content’s correctly (error or odd results). These are technical issues to eliminate, so you can focus on scientific content issues/tool choice.
Note: “Detecting” the datatype is usually preferred for common datatypes and can be considered as a basic format “QA” step during data loading OR after manipulating files. In this case, for a common datatype like gtf, if the dataset is not automatically detected (Upload) or redetected (Edit Attributes) as gtf, that means the data content does not align with strict GTF format and you’ll need to find out why and make adjustments.
But if you want to attempt to interpret the results from this tool (could be very noisy) there is a way to count up non-specific mappings (for exploratory purposes?). Expand the “Advanced settings” section of the tool form and count “All” instead of “None”. Graphic below:
I have insured that both my reference annotations (GTF) and genome are built on the same assembly-- everything is on hg38. I have also used gencode’s tRNA GTF where I removed the headers and ensured the version is set to GTF.
I tried redoing my mapping with the highest sensitivity on bowtie2 and got much better mapping at roughly 17k reads uniquely mapping to my tRNAs. While I understand that htseq is not equipped for multimapping, I want to try this without multimapping, and I believe that 17k reads should be enough to see something counted. I am still seeing a lack of “feature” counts, despite trying everything here.
Is there a way you recommend to see if I’m having chromosomal mismatch? Could a file datatype elsewhere be causing this issue? For example, I’m feeding my fastp trimmed genome file into bowtie2, and fastp has converted my genome file into fastasanger.gz instead of just a fasta.gz file. When I read my resulting bam file with samtools, I get sequence identifiers as my tRNAs rather than chromosomes, eg “hg38_tRNAs_nm-tRNA-Tyr-GTA-chr1-142” whereas when I look at my GTF file, I see seqnames has “chr1”. Should I fix that, so seqnames in my GTF file matches the indexes on my bam file? When I try and use IGV to view my bam file on both the hg38 genome I see reads mapping to tRNAs.
Thanks again.
EDIT: Even though I am basing everything off of hg38, the genome coordinates in my tRNA genome file are truncated when I use normalizeFasta. So, maybe it could be that the features loading into the GTF are not being identified on the custom genome due to some name mismatch, etc, despite trying to download from the same source or everything built on hg38?