Hi Sir/Madam,
I am quite new to bioinformatics analysis, and therefore tried to use the galaxy server which I found it easy to use, thank you for the work.
I run the analysis with cutadapt ver 4.8 with PE option, min length 20 and min quality 20, followed by QC with FastQC and subsequent alignment with STAR with quantimode.
I have chosen the option with (galaxy) build-in human genome hg38 without gtf annotation (I believe is an index built by the galaxy team without specifying any annotation) but provide a gtf file downloaded from ENCODE GRCh38.p14 All gtf (when I gone through the STAR manual, I recently realized that the PRI shall be used, but will it make any difference?), with the option “the length of the genomic sequence around annotated junctions” set to 19 (maybe I am wrong to set it as 19 as I thought that my min. read length after adapter trimming is now 20).
The analysis gave me a decent count table with 63k lines and (for simplicity, look at column 3) out of 53398187 reads, only 724269 (about 1.35%) were without feature.
After that, I realized that the server load is quite heavy so that I tried to run the packages in the linux server from my University.
Basically I tried to run the same set of fastq to test the work flow:
Cutadapt > fastQC > STAR (with quantimode)
The version of STAR is a little bit different, galaxy’s one is 2.7.11a, mine is 2.7.10a.
I cut adapter from my paired-end 150 bp reads (2 separate files) with min length 20 and min quality of 20.
Everything went as expected as it was run on galaxy after comparing the fastQC reports from galaxy run and my run.
Then, I construct my genome index with the USMC hg38 release 77 primary assembly, which it shall not have any patch as suggested by STAR manual. At this moment, I build the index such that “the length of the genomic sequence around annotated junctions” was set to 19 (same as the run in galaxy) with the option “–sdjbOverhang 19”, using the same gtf file I supplied to galaxy (ENCODE GRCh38.p14, All).
However, the count table went wrong and a considerable portion of the reads are without feature (column 3) 6633906 over 50611147 (about 13%). The lines count for the output also reduced dramatically to 58 lines.
So, I tried to recreate the index with the following combinations but all yield similar results (i.e. 58 lines, considerable no feature reads):
- genome assembly hg38.p14 primary assembly with PRI gtf
- genome assembly hg38.p14 primary assembly with All gtf
- genome assembly hg38 (release 77) primary assembly with PRI gtf
- Provide gtf to an established genome (either option 1 - 3 above) on the fly with sjdbOverhang set to 19
I have also tried to copy the options from tool standard output found in the “Dataset details” within the galaxy to linux but in vain.
----Question below----
- For sjdbOverhang, the description stated that the ideal value is mate length minus 1, so should it be 150 bp (raw reads without adaptor trimming) minus 1 or length of single trimmed read minus 1 or the total length of a mate pair minus 1 (which I would not know exactly the mate length for each pair is?)?
- Even the sjdbOverhang was wrongly set, but output shall be similar to the run performed in galaxy?
- Shall the issue arise from index building? Will it solve the problem if a firstly build the index without providing a gtf and then provide the gtf on the fly?
Thank you.