Sort status of BAM files after RNA STAR

Hi, I am new to Galaxy.
Are the BAM files generated after mapping using RNA STAR (version 2.7.8a+galaxy1 or newer) sorted (either by coordinate or name or tag)? After clicking on the ‘i’ logo for the ‘*.mapped.bam’ dataset details, the dataset information section says the format is BAM. The ‘Tool Standard Output’ is as follows:

Apr 28 20:52:35 … started STAR run
Apr 28 20:52:35 … starting to generate Genome files
Apr 28 20:53:23 … processing annotations GTF
Apr 28 20:53:54 … starting to sort Suffix Array. This may take a long time…
Apr 28 20:54:14 … sorting Suffix Array chunks and saving them to disk…
Apr 28 21:13:08 … loading chunks from disk, packing SA…
Apr 28 21:14:38 … finished generating suffix array
Apr 28 21:14:38 … generating Suffix Array index
Apr 28 21:19:00 … completed Suffix Array index
Apr 28 21:19:01 … inserting junctions into the genome indices
Apr 28 21:21:04 … writing Genome to disk …
Apr 28 21:21:07 … writing Suffix Array to disk …
Apr 28 21:21:30 … writing SAindex to disk
Apr 28 21:21:33 … finished successfully
Apr 28 21:21:33 … started STAR run
Apr 28 21:21:33 … loading genome
Apr 28 21:21:55 … started mapping
Apr 28 21:30:30 … finished mapping
Apr 28 21:30:54 … started sorting BAM
Apr 28 21:31:45 … finished successfully

However, although the output is essentially the same for my ‘*.transcriptome-mapped.bam’ file, its dataset information says the format is unsorted.bam. See attached screenshots.


The first line of the ‘*.mapped.bam’ file is ‘@HD VN:1.4 SO:coordinate’, so am I correct in assuming that it is sorted by coordinate? The first line of the ‘*.transcriptome-mapped.bam’ file is different: ‘@SQ SN:NR_046018.2 LN:1652’.

Do I still need to perform the Samtools sort operation for either one or both these files? I intend to use them as inputs for featureCounts.

Thanks a lot for the help in advance. :slight_smile:


Hi @agschindler

The datatype assigned to each dataset describes the sort state of the BAM data.

  1. The first dataset with the bam datatype assigned is coordinate sorted.
  2. The second dataset with the unsorted.bam datatype assigned is not.
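
As a rough cross-check outside of Galaxy's datatype metadata, the sort state can also be read from the BAM header text itself (for example, the output of `samtools view -H`). The sketch below is only an illustration: the function name `sort_order_from_header` is hypothetical, and it assumes the header is already available as plain text. In the SAM format, the `SO` tag on the `@HD` line records the sort order, and a header with no `@HD` line is treated as `unknown`.

```python
def sort_order_from_header(header_text: str) -> str:
    """Return the SO value from the @HD line, or "unknown" if it is absent.

    Hypothetical helper for illustration; expects SAM/BAM header text,
    e.g. as printed by `samtools view -H`.
    """
    lines = header_text.splitlines()
    # When present, @HD must be the first line of the header.
    if not lines or not lines[0].startswith("@HD"):
        return "unknown"
    # Fields are tab-separated in real files; split() on any whitespace
    # also tolerates headers pasted with spaces.
    for field in lines[0].split()[1:]:
        if field.startswith("SO:"):
            return field[len("SO:"):]
    return "unknown"


# First header lines quoted in the question above.
print(sort_order_from_header("@HD\tVN:1.4\tSO:coordinate"))   # coordinate
print(sort_order_from_header("@SQ\tSN:NR_046018.2\tLN:1652"))  # unknown
```

An `unknown` result only means the header makes no claim about ordering, which is consistent with the unsorted.bam datatype on the second dataset.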

The first dataset (coordinate sorted) is what most people would use with featureCounts. For transcript-level expression analysis, Salmon or Kallisto are probably “better” choices scientifically. But you can of course try out different data/parameters with any of these tools and see what happens :slight_smile: Maybe you can figure out how to get the data you need in novel ways.

Hi @jennaj

Thank you very much for your reply. This is quite helpful. :slight_smile:
