Error Using human hg38 in Reference-based RNA-Seq data analysis

Hi there!
Thanks for such a great tutorial. However, when I tried to follow it with human data, I could not get the RNA STAR step to work.

I got the error message below, but I cannot figure it out. Could you please help me with this?
Thanks

Galaxy Tool Error Report

from https://usegalaxy.eu/

Error Localization

Dataset 129983688 (4838ba20a6d8676541b1f0c338a6e6d7)
History 3214167 (5a989024b1c40faf)
Failed Job 150: RNA STAR on data 108, data 81, and data 80: log (4838ba20a6d86765a8d82224b9b0b96a)

User Provided Information

The user redacted (user: 87468) provided the following information:

Fatal error: Matched on FATAL ERROR
Fatal INPUT FILE error, no valid exon lines in the GTF file: /data/dnb09/galaxy_db/files/e/2/e/dataset_e2ef103f-6927-4189-844c-ee4c686de0af.dat
Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.
Apr 28 13:55:54 … FATAL ERROR, exiting
gzip: stdout: Broken pipe
gzip: stdout: Broken pipe

I am using the RNA STAR tool with this genome version (https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.chr.gtf.gz) and have that error message. Could you please help me out to fix that error? Thanks

Detailed Job Information

Job environment and execution information is available at the job info page.

Job ID 69115183 (11ac94870d0bb33adaddfbf32a1e9058)
Tool ID toolshed.g2.bx.psu.edu/repos/iuc/rgrnastar/rna_star/2.7.11a+galaxy0
Tool Version 2.7.11a+galaxy0
Job PID or DRM id 49433078
Job Tool Version None

Job Execution and Failure Information

Command Line

STAR --runThreadN ${GALAXY_SLOTS:-4} --genomeLoad NoSharedMemory --genomeDir '/data/db/data_managers/rnastar/2.7.4a/hg38/hg38/dataset_412f3413-7e68-407c-9652-ff4e935abf5a_files' --sjdbOverhang 100 --sjdbGTFfile '/data/dnb09/galaxy_db/files/e/2/e/dataset_e2ef103f-6927-4189-844c-ee4c686de0af.dat' --sjdbGTFfeatureExon 'exon' --readFilesIn '/data/dnb10/galaxy_db/files/c/1/0/dataset_c1037ada-c74b-4ee9-8f41-cf3c5f0202a2.dat' '/data/dnb10/galaxy_db/files/0/0/1/dataset_00174e26-d136-484e-a01b-a9ee477236f1.dat' --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode None '' --quantMode GeneCounts --outSAMattrIHstart 1 --outSAMattributes NH HI AS nM ch --outSAMprimaryFlag OneBestScore --outSAMmapqUnique 60 --outSAMunmapped Within --outBAMsortingThreadN ${GALAXY_SLOTS:-4} --outBAMsortingBinsN 50 --winAnchorMultimapNmax 50 --limitBAMsortRAM $((${GALAXY_MEMORY_MB:-0}*1000000)) --outWigType 'bedGraph' '' --outWigStrand 'Stranded' --outWigReferencesPrefix '-' --outWigNorm 'RPM' && samtools view -b -o '/data/jwd02f/main/069/115/69115183/outputs/dataset_19396f4f-26fd-4bb4-a85e-a548087f627d.dat' Aligned.sortedByCoord.out.bam && mv Signal.Unique.str1.out.bg Signal.Unique.str1.out && mv Signal.UniqueMultiple.str1.out.bg Signal.UniqueMultiple.str1.out && mv Signal.Unique.str2.out.bg Signal.Unique.str2.out && mv Signal.UniqueMultiple.str2.out.bg Signal.UniqueMultiple.str2.out

stderr

Fatal INPUT FILE error, no valid exon lines in the GTF file: /data/dnb09/galaxy_db/files/e/2/e/dataset_e2ef103f-6927-4189-844c-ee4c686de0af.dat
Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.
Apr 28 13:55:54 … FATAL ERROR, exiting
gzip: stdout: Broken pipe
gzip: stdout: Broken pipe

stdout

/usr/local/tools/_conda/envs/mulled-v1-40c069a58b8570974e4581195144b4016c8d8f4255f4cbb822c5896056b567f4/bin/STAR-avx2 --runThreadN 10 --genomeLoad NoSharedMemory --genomeDir /data/db/data_managers/rnastar/2.7.4a/hg38/hg38/dataset_412f3413-7e68-407c-9652-ff4e935abf5a_files --sjdbOverhang 100 --sjdbGTFfile /data/dnb09/galaxy_db/files/e/2/e/dataset_e2ef103f-6927-4189-844c-ee4c686de0af.dat --sjdbGTFfeatureExon exon --readFilesIn /data/dnb10/galaxy_db/files/c/1/0/dataset_c1037ada-c74b-4ee9-8f41-cf3c5f0202a2.dat /data/dnb10/galaxy_db/files/0/0/1/dataset_00174e26-d136-484e-a01b-a9ee477236f1.dat --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode None --quantMode GeneCounts --outSAMattrIHstart 1 --outSAMattributes NH HI AS nM ch --outSAMprimaryFlag OneBestScore --outSAMmapqUnique 60 --outSAMunmapped Within --outBAMsortingThreadN 10 --outBAMsortingBinsN 50 --winAnchorMultimapNmax 50 --limitBAMsortRAM 51200000000 --outWigType bedGraph --outWigStrand Stranded --outWigReferencesPrefix - --outWigNorm RPM
STAR version: 2.7.11a compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Apr 28 13:49:01 … started STAR run
Apr 28 13:49:01 … loading genome
Apr 28 13:55:11 … processing annotations GTF

Job Information

None

Job Traceback

None


Hi @jcorchero
Check the standard error log file:

There is an issue with the annotation file: either the chromosome names differ from those in the reference genome, or the file has no exon annotations. Galaxy uses chr1, chr2, etc. for the human genome. What do you see in the annotation file? By any chance, is it 1, 2, etc.? If so, get a compatible annotation file or modify the chromosome names. Some tools might treat chr1 and Chr1 as different text strings. If the annotation file does use chr1, chr2 for chromosome names, check the feature type in the third column. Do you see exon annotations? I assume you used the built-in hg38 for mapping.
You can get compatible gene annotations from the UCSC Genome Browser or GENCODE.
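If you want to run this check outside Galaxy, a minimal Python sketch along these lines may help. It counts exon lines per sequence name in a GTF file, so you can see both whether exons are present and how the chromosomes are named. The function names (`open_gtf`, `count_exons_per_seqname`, `has_chr_prefix`) are made up for this illustration and are not part of Galaxy or STAR.

```python
import gzip
from collections import Counter


def open_gtf(path):
    """Open a plain or gzipped GTF file in text mode (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def count_exons_per_seqname(lines):
    """Count lines whose feature type (column 3) is 'exon', keyed by column 1."""
    counts = Counter()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GTF record
        seqname, feature = fields[0], fields[2]
        if feature == "exon":
            counts[seqname] += 1
    return counts


def has_chr_prefix(counts):
    """True if at least one sequence name with exons starts with 'chr'."""
    return any(name.startswith("chr") for name in counts)


# Example usage (the file name is whatever you downloaded):
# with open_gtf("Homo_sapiens.GRCh38.111.chr.gtf.gz") as handle:
#     counts = count_exons_per_seqname(handle)
# print(sorted(counts)[:10], has_chr_prefix(counts))
```

If the printed names look like 1, 2, X, MT rather than chr1, chr2, chrX, chrM, the annotation does not match the built-in hg38 naming.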
Hope that helps.
Kind regards,
Igor


Hi Igor,

Thank you very much for your response. You were absolutely right. The file I was using had a different naming convention for the chromosomes. I downloaded the correct version from UCSC and it worked nicely. Later, I saw in the tutorial documentation that files downloaded from Ensembl need further modification before they can be used with RNA STAR, which is exactly what you suggested. Thanks again. Have a good one!
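For anyone who would rather adapt the Ensembl file than download a new one, here is a rough Python sketch of one way to rename the chromosomes. It assumes the simple mapping of 1-22, X, Y to chr1-chr22, chrX, chrY and MT to chrM, and it drops any record on a sequence outside that set. This is an illustration only, not necessarily the procedure described in the tutorial, and the names `PRIMARY`, `ensembl_to_ucsc` and `convert_gtf_lines` are hypothetical.

```python
# Primary human chromosomes as named by Ensembl (assumption: GRCh38, no scaffolds).
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def ensembl_to_ucsc(seqname):
    """Return the UCSC-style name for an Ensembl chromosome, or None if unknown."""
    if seqname == "MT":
        return "chrM"
    if seqname in PRIMARY:
        return "chr" + seqname
    return None


def convert_gtf_lines(lines):
    """Yield GTF lines with column 1 renamed; header lines pass through unchanged."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        seqname, tab, rest = line.partition("\t")
        new_name = ensembl_to_ucsc(seqname)
        # Records on sequences we cannot map (or malformed lines) are dropped.
        if new_name is not None and tab:
            yield new_name + tab + rest


# Example usage with plain-text files (paths are placeholders):
# with open("ensembl.gtf") as src, open("ucsc_style.gtf", "w") as dst:
#     dst.writelines(convert_gtf_lines(src))
```

Dropping unmapped sequences keeps the sketch short. Whether that is acceptable depends on the analysis, so it is worth checking how many records are lost before using the result.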

Javier
