RNA STAR quantimode count difference between run on galaxy and on self linux server

Hi Sir/Madam,

I am quite new to bioinformatics analysis, and therefore tried to use the galaxy server which I found it easy to use, thank you for the work.

I run the analysis with cutadapt ver 4.8 with PE option, min length 20 and min quality 20, followed by QC with FastQC and subsequent alignment with STAR with quantimode.
I have chosen the option with (galaxy) build-in human genome hg38 without gtf annotation (I believe is an index built by the galaxy team without specifying any annotation) but provide a gtf file downloaded from ENCODE GRCh38.p14 All gtf (when I gone through the STAR manual, I recently realized that the PRI shall be used, but will it make any difference?), with the option “the length of the genomic sequence around annotated junctions” set to 19 (maybe I am wrong to set it as 19 as I thought that my min. read length after adapter trimming is now 20).

The analysis gave me a decent count table with 63k lines and (for simplicity, look at column 3) out of 53398187 reads, only 724269 (about 1.35%) were without feature.
galaxy run

After that, I realized that the server load is quite heavy so that I tried to run the packages in the linux server from my University.

Basically I tried to run the same set of fastq to test the work flow:
Cutadapt > fastQC > STAR (with quantimode)

The version of STAR is a little bit different, galaxy’s one is 2.7.11a, mine is 2.7.10a.

I cut adapter from my paired-end 150 bp reads (2 separate files) with min length 20 and min quality of 20.
Everything went as expected as it was run on galaxy after comparing the fastQC reports from galaxy run and my run.

Then, I construct my genome index with the USMC hg38 release 77 primary assembly, which it shall not have any patch as suggested by STAR manual. At this moment, I build the index such that “the length of the genomic sequence around annotated junctions” was set to 19 (same as the run in galaxy) with the option “–sdjbOverhang 19”, using the same gtf file I supplied to galaxy (ENCODE GRCh38.p14, All).

However, the count table went wrong and a considerable portion of the reads are without feature (column 3) 6633906 over 50611147 (about 13%). The lines count for the output also reduced dramatically to 58 lines.
self run

So, I tried to recreate the index with the following combinations but all yield similar results (i.e. 58 lines, considerable no feature reads):

  1. genome assembly hg38.p14 primary assembly with PRI gtf
  2. genome assembly hg38.p14 primary assembly with All gtf
  3. genome assembly hg38 (release 77) primary assembly with PRI gtf
  4. Provide gtf to an established genome (either option 1 - 3 above) on the fly with sjdbOverhang set to 19

I have also tried to copy the options from tool standard output found in the “Dataset details” within the galaxy to linux but in vain.

----Question below----

  1. For sjdbOverhang, the description stated that the ideal value is mate length minus 1, so should it be 150 bp (raw reads without adaptor trimming) minus 1 or length of single trimmed read minus 1 or the total length of a mate pair minus 1 (which I would not know exactly the mate length for each pair is?)?
  2. Even the sjdbOverhang was wrongly set, but output shall be similar to the run performed in galaxy?
  3. Shall the issue arise from index building? Will it solve the problem if a firstly build the index without providing a gtf and then provide the gtf on the fly?

Thank you.

Hi @TFYIP

The first thing to confirm is that the version of the reference build is the same for all inputs: the reference genome and the reference annotation. Some of your comments and results lead me to believe a mixup is likely part of the problem.

This guide explains all of the details. Please review, then we can followup here with any questions. → Reference genomes at public Galaxy servers: GRCh38/hg38 example

Dear Jan,

Thank you for the reply, trying.

Thank you, the problem is partially solved. Although using “allknowngene” gtf file yielded some mapped features with uniprot KB entry code.

If you want different identifiers, you could choose a different reference annotation. UCSC has four choices as of right now, including Ensembl and RefSeq Genes.

Review the track descriptions in the main browser to learn what each gene track represents. “Known Genes” is a merged-sources result; the others are from one source each from what I remember.

You could also convert identifiers after – just be aware that these mappings are not always 1-1 between sources: some may map to two or more alternatives and some may not map at all.

Hope this works out :slight_smile: