Kept seeing "Illegal strand" when using SAM/BAM to count matrix

Xiaoqiang_Ma · February 9, 2021, 4:43am

Hello,

I am using the SAM/BAM to count matrix tool on Galaxy Australia to get read counts per gene, but I keep seeing the following error, and no results are produced.

File "/mnt/tools-indices/shed_tools/testtoolshed.g2.bx.psu.edu/repos/fubar/htseq_bams_to_count_matrix/d300bc688e95/htseq_bams_to_count_matrix/htseqsams2mx.py", line 82
    raise ValueError, "Illegal strand"
                    ^
SyntaxError: invalid synt

I have tried both Yes and No for “Reads are stranded”, but the same SyntaxError is generated either way.

May I please get your advice on this? Thank you!

gallardoalba · February 9, 2021, 1:38pm

Hi @Xiaoqiang_Ma,

According to the htseq source code, there seems to be an error in the strand field of your GTF file. Could you verify it?

Regards

Xiaoqiang_Ma · February 9, 2021, 1:54pm

@gallardoalba Thanks very much for your help! I took a look at the GTF file, and it has features on both the forward and reverse strands. The reads in the BAM file, however, always seem to be on the forward strand. I am not sure whether this would cause the error.

gallardoalba · February 9, 2021, 4:32pm

Hi @Xiaoqiang_Ma,
I would try the tool Filter data on any column using simple expressions with the following condition, in order to remove potential errors from the GTF file: c7=='+' or c7=='-'
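For anyone doing this outside Galaxy, the same condition can be sketched in Python. This is only a minimal sketch, assuming a tab-separated GTF with the strand in the seventh column. The function name `filter_gtf_by_strand` and the example records are made up for illustration. Note that lines with fewer than seven fields (header or blank lines, for example) are dropped as well.

```python
def filter_gtf_by_strand(lines):
    """Keep only records whose strand column (7th field) is '+' or '-'."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Lines with fewer than 7 tab-separated fields (headers, blanks) are dropped.
        if len(fields) >= 7 and fields[6] in ("+", "-"):
            kept.append(line)
    return kept


# Hypothetical records, not taken from the poster's GTF.
example = [
    "#!genome-build example\n",
    'Chromosome\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "geneA";\n',
    'Chromosome\tsrc\tgene\t300\t400\t.\t?\t.\tgene_id "geneB";\n',
    'Chromosome\tsrc\tgene\t500\t600\t.\t-\t.\tgene_id "geneC";\n',
]
filtered = filter_gtf_by_strand(example)
print(len(filtered))  # 2
```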

jennaj · February 9, 2021, 8:08pm

Hi @Xiaoqiang_Ma

As additional advice to that provided by @gallardoalba, it appears that your reference GTF contains header lines. Header lines are out of specification for strict GTF format (even though many data providers include them). When present, those lines can cause odd errors with many tools – and usually not mapping tools (if incorporated) – but instead with downstream tools. It is strongly recommended to remove any GTF headers to avoid technical errors (whether a tool is run in Galaxy, or not).

Please try removing the header lines, rerun, and see if that resolves the error as a first-pass solution. If it doesn’t, then investigate the GTF content more closely. The genome appears to have just one sequence/chromosome, named “Chromosome”. If that does not match the sequence/chromosome label in your BAM datasets, because the genome that was mapped against uses a different label, that can also cause conflicts, but it can be addressed.
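Both checks can be sketched in a few lines of Python if you prefer to verify things locally. This is a rough sketch, not what any Galaxy tool does internally. It assumes header lines start with "#" and that you have the BAM header as text (for example from `samtools view -H`). The function names and the example labels are invented for illustration.

```python
def strip_gtf_headers(lines):
    """Drop header/comment lines (starting with '#') and blank lines."""
    return [ln for ln in lines if ln.strip() and not ln.startswith("#")]


def gtf_seqnames(lines):
    """Collect the sequence names found in column 1 of the GTF records."""
    return {ln.split("\t", 1)[0] for ln in strip_gtf_headers(lines)}


def sam_header_seqnames(header_lines):
    """Collect the SN: values from the @SQ lines of a SAM/BAM text header."""
    names = set()
    for ln in header_lines:
        if ln.startswith("@SQ"):
            for field in ln.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


# Hypothetical inputs: the BAM label below is invented, not from the poster's data.
gtf_lines = [
    "#!genome-build example\n",
    'Chromosome\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "geneA";\n',
]
bam_header = ["@HD\tVN:1.6\tSO:coordinate\n", "@SQ\tSN:NC_000913.3\tLN:4641652\n"]

# Names used in the GTF that the BAM does not know about.
missing = gtf_seqnames(gtf_lines) - sam_header_seqnames(bam_header)
print(missing)  # {'Chromosome'}
```

An empty result would mean every sequence name in the GTF also appears in the BAM header.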

“How to” is covered here:

Overview: Extended Help for Differential Expression Analysis Tools
Specifics for GTF formatting: Common datatypes explained >> Datatypes - Galaxy Community Hub (review the GTF “tips”)
Other related help:
- Preparing and using a Custom Reference Genome or Build
- Mismatched Chromosome identifiers (and how to avoid them)

This is also covered in much prior Q&A:

Searches: Search results for 'gtf' - Galaxy Community Help && Search results for 'htseq' - Galaxy Community Help
The summary in this post is concise. Don’t worry about the specific tool/context of the original post – the format help for GTF data applies across analysis methods/tools:

Many things could be going wrong content-wise, but those are the top issues that tend to produce errors like yours. My guess is that you are hitting the “GTF formatting” issue first, then could possibly hit a “chromosome mismatch problem”. Verify/fix both as needed – full details are in the FAQs above.

Note: The GTF is also a hybrid format (GFF3 transformed into a GTF), but that will probably not be a problem with this particular tool.

Thanks!

Xiaoqiang_Ma · February 10, 2021, 3:08am

@gallardoalba I have tried as you suggested but got another error. Thanks!

Xiaoqiang_Ma · February 10, 2021, 3:18am

@jennaj Thanks so much for your information. I have removed the header lines from the GTF file, but I still get the same error.

Yes, I am using a simple E. coli K12 genome GTF, which has only one chromosome. Please take a look at the content of the BAM file I used. Do you think it has some wrongly labeled information that can’t be matched with the content of the GTF file?

Best,
Xiaoqiang

jennaj · February 10, 2021, 11:14pm

Ok, thanks for trying that. Your inputs are correct now.

The problem is likely with the tool itself. I didn’t notice this before but it was sourced from the test toolshed which is a sandbox/testing tool repository. And the tool wrapper plus the dependencies it uses haven’t been updated since 2015. My guess is that the tool is still hosted on the AU server due to legacy reasons – and it may work in some special cases/older workflows – but I’m not too surprised it isn’t working now.
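One detail in the traceback fits this: the statement `raise ValueError, "Illegal strand"` is Python 2 syntax, and Python 3 rejects it when the script is parsed, before any input is read. That would explain why changing the inputs made no difference, although nothing in this thread confirms which interpreter the server uses for this script. A small check, assuming you run it under Python 3 (the helper name is made up):

```python
def compiles_under_this_python(source):
    """Return True if `source` parses under the running interpreter."""
    try:
        # compile() only parses the code; nothing is executed.
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


py2_style = 'raise ValueError, "Illegal strand"'
py3_style = 'raise ValueError("Illegal strand")'

print(compiles_under_this_python(py2_style))  # False on Python 3
print(compiles_under_this_python(py3_style))  # True
```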

Try using htseq-count or featureCounts instead. Those tools are current, work, and produce individual count files, which is what DESeq2 requires. With edgeR or limma, you could either combine those counts into a matrix or use the individual count files. The expected matrix format is described in the help section at the bottom of those latter tools’ forms. Tools in the group Text Manipulation (example: Multi-Join) could be used to merge the individual count files into a matrix, but that isn’t required. If you are not sure how to do that or have problems, skip it and input the individual count files instead.
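If you ever need to merge count files outside Galaxy, here is a minimal sketch of the idea. It assumes two-column, tab-separated files (gene ID, count) such as htseq-count produces, and it skips the summary rows that start with "__" (for example `__no_feature`). The function names, sample names, and gene IDs are made up, and the matrix layout is generic, so check it against the format described on the tool form you plan to use.

```python
import csv
import io


def read_counts(handle):
    """Read a two-column, tab-separated count file into a dict."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("__"):
            # Skip blank lines and summary rows such as __no_feature.
            continue
        counts[row[0]] = int(row[1])
    return counts


def build_matrix(samples):
    """samples: dict of sample name -> {gene: count}. Returns rows, header first."""
    names = list(samples)
    genes = sorted(set().union(*(samples[n] for n in names)))
    rows = [["gene_id"] + names]
    for gene in genes:
        # A gene missing from one sample is filled with 0.
        rows.append([gene] + [samples[n].get(gene, 0) for n in names])
    return rows


# Hypothetical count files, held in memory for the example.
sample_a = io.StringIO("geneA\t10\ngeneB\t0\n__no_feature\t5\n")
sample_b = io.StringIO("geneA\t7\ngeneB\t3\n__no_feature\t2\n")

matrix = build_matrix({
    "sample_a": read_counts(sample_a),
    "sample_b": read_counts(sample_b),
})
for row in matrix:
    print("\t".join(map(str, row)))
```

Filling missing genes with 0 is a choice made for this sketch. If all files were counted against the same GTF, every file should list the same genes anyway.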

Examples of RNA-seq DE analysis are covered in the GTN tutorials under the topic “Transcriptomics”. The three tutorials in the group “End-to-End Analysis” are the best place to start for an overview of current methods.

So – the bad news was that this tool won’t work (again, sorry for not noticing where it was sourced originally!) – but the GOOD news is that the changes/checks you made with your inputs would have been needed anyway when using the updated tools/methods.

Please give that a try!

Xiaoqiang_Ma · February 12, 2021, 3:20am

@jennaj Thanks so much for your advice! I tried htseq-count, which worked, but featureCounts did not. I will use htseq-count for the data processing. Again, thanks for your kind help! Have a good weekend ahead.
