I am getting systematic errors in the output generated by Cutadapt, and I think other users have experienced similar problems before.
I ran RNA STAR with paired-end reads that had been trimmed using Cutadapt.
Command Line:
```
gunzip -c '/jetstream2/scratch/main/jobs/56982005/inputs/dataset_26149e76-4ccc-4d0f-9e9f-55093c06d828.dat' > refgenome.fa && mkdir -p tempstargenomedir && STAR --runMode genomeGenerate --genomeDir 'tempstargenomedir' --genomeFastaFiles refgenome.fa --sjdbOverhang '100' --sjdbGTFfile '/jetstream2/scratch/main/jobs/56982005/inputs/dataset_65a579d0-a211-40f0-9f27-bdebf97092a2.dat' --sjdbGTFfeatureExon 'exon' --genomeSAindexNbases 14 --runThreadN ${GALAXY_SLOTS:-4} --limitGenomeGenerateRAM $((${GALAXY_MEMORY_MB:-31000} * 1000000)) && STAR --runThreadN ${GALAXY_SLOTS:-4} --genomeLoad NoSharedMemory --genomeDir tempstargenomedir --readFilesIn '/jetstream2/scratch/main/jobs/56982005/inputs/dataset_5a81956e-5dd0-4ecc-9df7-3d0cee99846f.dat' '/jetstream2/scratch/main/jobs/56982005/inputs/dataset_8d9e7ba1-f09e-483b-9537-c0419c85cb5b.dat' --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode None '' --quantMode GeneCounts --outSAMattrIHstart 1 --outSAMattributes NH HI AS nM ch --outSAMprimaryFlag OneBestScore --outSAMmapqUnique 60 --outSAMunmapped Within --outBAMsortingThreadN ${GALAXY_SLOTS:-4} --outBAMsortingBinsN 50 --winAnchorMultimapNmax 50 --limitBAMsortRAM $((${GALAXY_MEMORY_MB:-0} * 1000000)) && samtools view -b -o '/jetstream2/scratch/main/jobs/56982005/outputs/dataset_8919e0f8-d403-47f7-ae68-65588e5bb8c9.dat' Aligned.sortedByCoord.out.bam
```
Tool Standard Output:
```
/usr/local/bin/STAR-avx2 --runMode genomeGenerate --genomeDir tempstargenomedir --genomeFastaFiles refgenome.fa --sjdbOverhang 100 --sjdbGTFfile /jetstream2/scratch/main/jobs/56982005/inputs/dataset_65a579d0-a211-40f0-9f27-bdebf97092a2.dat --sjdbGTFfeatureExon exon --genomeSAindexNbases 14 --runThreadN 16 --limitGenomeGenerateRAM 59392000000
STAR version: 2.7.11a compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Apr 09 16:19:17 … started STAR run
Apr 09 16:19:17 … starting to generate Genome files
Apr 09 16:20:01 … processing annotations GTF
Apr 09 16:20:27 … starting to sort Suffix Array. This may take a long time…
Apr 09 16:20:42 … sorting Suffix Array chunks and saving them to disk…
Apr 09 16:35:49 … loading chunks from disk, packing SA…
Apr 09 16:37:23 … finished generating suffix array
Apr 09 16:37:23 … generating Suffix Array index
Apr 09 16:40:13 … completed Suffix Array index
Apr 09 16:40:14 … inserting junctions into the genome indices
Apr 09 16:42:55 … writing Genome to disk …
Apr 09 16:43:01 … writing Suffix Array to disk …
Apr 09 16:44:03 … writing SAindex to disk
Apr 09 16:44:11 … finished successfully

/usr/local/bin/STAR-avx2 --runThreadN 16 --genomeLoad NoSharedMemory --genomeDir tempstargenomedir --readFilesIn /jetstream2/scratch/main/jobs/56982005/inputs/dataset_5a81956e-5dd0-4ecc-9df7-3d0cee99846f.dat /jetstream2/scratch/main/jobs/56982005/inputs/dataset_8d9e7ba1-f09e-483b-9537-c0419c85cb5b.dat --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode None --quantMode GeneCounts --outSAMattrIHstart 1 --outSAMattributes NH HI AS nM ch --outSAMprimaryFlag OneBestScore --outSAMmapqUnique 60 --outSAMunmapped Within --outBAMsortingThreadN 16 --outBAMsortingBinsN 50 --winAnchorMultimapNmax 50 --limitBAMsortRAM 59392000000
STAR version: 2.7.11a compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Apr 09 16:44:11 … started STAR run
Apr 09 16:44:12 … loading genome
Apr 09 16:44:23 … started mapping
```
I then got errors in RNA STAR saying that the length of the quality string was different from the sequence length.
Here is one representative Tool Standard Error:
```
EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length
@V350218072L2C001R00100033293/1
+
350218072L2C001R00100033299/1 1381 N 0
SOLUTION: fix your fastq file
Apr 09 16:44:26 … FATAL ERROR, exiting
```
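I assume the broken records could also be located directly, assuming the 4-line FASTQ framing is still intact. A minimal sketch of what I have in mind (the filename is a placeholder for one of my trimmed datasets):

```bash
# print the record number and header of every FASTQ record whose
# quality-string length differs from its sequence length
zcat trimmed_R1.fastq.gz | awk '
  NR % 4 == 1 { hdr = $0 }    # header line
  NR % 4 == 2 { seq = $0 }    # sequence line
  NR % 4 == 0 && length($0) != length(seq) { print "record " NR/4 ": " hdr }
'
```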
I then ran FASTQ Info on all the Cutadapt output datasets. It turns out that ALL 24 sequence datasets I had trimmed with Cutadapt have problems with read length.
Here is one representative output:
```
fastq_utils 0.25.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from /corral4/main/objects/0/0/3/dataset_0036b7fd-cd09-4bf2-b956-c772c0f6ae98.dat
ERROR: Error in file /corral4/main/objects/0/0/3/dataset_0036b7fd-cd09-4bf2-b956-c772c0f6ae98.dat: line 6937: read length too small - 0
```
The reported line number differs among the 24 datasets; the highest line number I found is 141473. Yet the line numbers do not look random; several of them repeat. For instance, see this compilation of the reported line numbers:
6937; 6937; 39369; 32113; 90949; 90949; 39805; 36581; 80501; 19265; 16793; 8793; 141473; 4629; 67417; 5521; 4301; 2961; 17405; 3729; 3665; 1873; 585; 585; 6937; 39369; 90949; 39805; 80501; 16793; 141473
I am not sure whether FASTQ Info reports only the first error and then exits, or reports all errors; depending on that, each dataset may contain either many more such errors or only the one reported.
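Since I could not tell from FASTQ Info, I assume the bad records could be counted directly, e.g. with a one-liner like this (the filename is again a placeholder for one of my trimmed datasets):

```bash
# count FASTQ records whose sequence line is empty (length 0)
zcat trimmed_R1.fastq.gz | awk 'NR % 4 == 2 && length($0) == 0 { n++ } END { print n + 0 }'
```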
I then ran FASTQ Info on ALL the ORIGINAL sequence datasets used as input for trimming with Cutadapt. It turns out that NONE of them appear to have problems. Here is one representative output:
```
fastq_utils 0.25.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from /corral4/main/objects/9/5/7/dataset_9578c703-a94c-4c66-aa5e-46e137b6207e.dat
100000 200000 300000 [... progress ticks omitted ...] 25500000 25600000 Scanning complete.
Reads processed: 25691513
Memory used in indexing: ~1764 MB
------------------------------------
Number of reads: 25691513
Quality encoding range: 33 73
Quality encoding: 33
Read length: 100 100 100
OK
```
Do I interpret this correctly: Cutadapt systematically introduced errors into its output, so none of the Cutadapt output can be used with RNA STAR, since every STAR run will abort with the same error?
I also note that the same problems (with different sequence datasets) occur on both Galaxy.ORG and Galaxy.EU.
Is there an easy way to fix the Cutadapt output?
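For instance, would re-running Cutadapt on the ORIGINAL files with a minimum-length filter avoid the empty reads? My understanding is that `-m`/`--minimum-length` discards too-short reads and, for paired input, keeps the two files in sync. A sketch of what I have in mind (filenames and adapter sequences are placeholders for my files and MGI filter sequences):

```bash
cutadapt \
  -a FWD_ADAPTER -A REV_ADAPTER \
  -m 20 --pair-filter=any \
  -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
  original_R1.fastq.gz original_R2.fastq.gz
```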
Any other advice on alternative options for trimming my sequences that will actually work?
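For example, would fastp be a reasonable alternative? A sketch, assuming its length filter also drops too-short reads pair-aware (filenames and adapter sequences are again placeholders):

```bash
fastp \
  -i original_R1.fastq.gz -I original_R2.fastq.gz \
  -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz \
  --adapter_sequence FWD_ADAPTER --adapter_sequence_r2 REV_ADAPTER \
  --length_required 20
```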
Thanks a lot in advance!
PS:
The original sequences are MGI PE100. I use the MGI forward and reverse filter (adapter) sequences for trimming.