Hi, I am relatively new to RNA-seq data analysis following the " Finding and quantifying new transcripts" tutorial (https://galaxyproject.org/tutorials/nt_rnaseq/). The transcriptomics study that I done was on the effects of a chemical on a Mycobacterial species. I uploaded the fasta and gtf file and followed closely to what the tutorial suggested. My Hisat2 summary was as follows:
HISAT2 summary stats:
Total pairs: 20863625
Aligned concordantly or discordantly 0 time: 3642143 (17.46%)
Aligned concordantly 1 time: 8046746 (38.57%)
Aligned concordantly >1 times: 9064704 (43.45%)
Aligned discordantly 1 time: 110032 (0.53%)
Total unpaired reads: 7284286
Aligned 0 time: 5004928 (68.71%)
Aligned 1 time: 931102 (12.78%)
Aligned >1 times: 1348256 (18.51%)
Overall alignment rate: 88.01%.
I followed up filtering only for quality reads (during filtering did not check for paired reads and paired reads mapped in proper pair), stringtie and stringtie merge to create a merged transcriptomics file as per the tutorial and followed up with featureCounts to determine the counts of my data. However, the summary for featureCounts for one of my datasets looked like this:
Status
Filtered LJ5 Control 1 BAM
Status
Filtered LJ5 Control 1 BAM
Assigned
1844744
Unassigned_Unmapped
0
Unassigned_MappingQuality
0
Unassigned_Chimera
7171
Unassigned_FragmentLength
0
Unassigned_Duplicate
0
Unassigned_MultiMapping
0
Unassigned_Secondary
0
Unassigned_NonSplit
0
Unassigned_NoFeatures
19955
Unassigned_Overlapping_Length
0
Unassigned_Ambiguity
7025997
with a high amount of reads unassigned due to ambiguity.
I followed the tutorial pretty strictly with regards to parameters to set for featureCounts except for strand specificity (I have half for either so unstranded). Please help if possible. Thank you.
The fasta and gtf files were downloaded from the shogun sequenced genome of the type strain of the Mycobacterium species ob BLAST assembly database. Most of the annotated proteins are hypothetical. I am unsure if this might have affected the results for featureCounts. I also only filtered for quality because I am interested in both functional gene and tRNA expression levels.
I tried multi-mapping the reads and apparently featureCounts was able to align them. Parameters used are as follows:
|Alignment file|* 299: Filter SAM or BAM, output SAM or BAM on data 222: bam|
|Specify strand information|Unstranded|
|Gene annotation file|history|
|Gene annotation file|* 314: Merged Transcriptome (Mapped paired reads)|
|Output format|Gene-ID “\t” read-count (MultiQC/DESeq2/edgeR/limma-voom compatible)|
|Create gene-length file|False|
|pe_parameters||
|Count fragments instead of reads|-p|
|Check paired-end distance||
|Only allow fragments with both reads aligned|False|
|Exclude chimeric fragments|True|
|extended_parameters||
|GFF feature type filter|exon|
|GFF gene identifier|transcript_id|
|On feature level|False|
|Allow read to contribute to multiple features|True|
|Count multi-mapping reads/fragments|-M|
|Assign fractions to multimapping reads|False|
|Minimum mapping quality per read|0|
|Exon-exon junctions||
|Long reads|False|
|Count reads by read group|False|
|Largest overlap|False|
|Minimum bases of overlap|1|
|Minimum fraction (of read) overlapping a feature|0|
|Minimum fraction (of feature) overlapping a read|0|
|Read 5’ extension|0|
|Read 3’ extension|0|
|Reduce read to single position|Leave the read as it is|
|Only count primary alignments|False|
|Ignore reads marked as duplicate|False|
|Annotates the alignment file with ‘XS:Z:’-tags to described per read or read-pair the corresponding assigned feature(s).|False|
|Ignore unspliced alignments|False|
The summary output looks like this:
Status
Filtered LJ5 Control 1 BAM
Status
Filtered LJ5 Control 1 BAM
Assigned
7773515
Unassigned_Unmapped
0
Unassigned_MappingQuality
0
Unassigned_Chimera
0
Unassigned_FragmentLength
0
Unassigned_Duplicate
0
Unassigned_MultiMapping
0
Unassigned_Secondary
0
Unassigned_NonSplit
0
Unassigned_NoFeatures
24615
Unassigned_Overlapping_Length
0
Unassigned_Ambiguity
0
Please let me know if I input any parameters wrongly. While this would be an ideal result for me, the downstream DeSeq2 analysis would be gravely affected if I did anything wrongly with this analysis.