RNAseq mapping issues

I have been trying to follow up on the RNAseq tutorial I did last week as part of the Galaxy Training Academy 2025, and instead of practicing a new skill, I have run into a bit of hot water- so would like to ask for some help and advice
I would like to rerun some older publicly available datasets on the recent edition of the genome, to be able to compare them and look for gene expression correlations across multiple stress conditions. So I chose the most important one to start with, and it turns out the transcriptomes were done with SOLID. I have been able to download files which already contain the quality information, both colorspace fastq files, and translated files imported from the SRA archive, so that’s the first step. I was also able to run cutadapt on these 50bp single unstranded reads.
But RNAstar mapping of the translated files was very poor - very high multimapping, and it won’t map colorspace reads, so clearly I am going wrong.

I understand I should be mapping the colorspace files directly, and you have a version of bowtie, and a mapper called PerM for this: but I can’t get them to work - they may have problems and need updating?- see error logs in this history.

the original data analysis used Bowtie 1.01 followed by cuffdiff - but I am wondering if there is any way this data could be analysed using the GFF data to take the numerous introns in every gene into account…
It would be excellent if you could advise me on good parameters for this type of data - it would be so useful to be able to reanalyse old datasets, and I need some magic to make this possible!
Could I ask for your help?
Hopefully you can put me right and get me going again - there must be a way to do this…
Thank you so much…

Hi @cgreig,

Have you examined the read quality? Check dataset 1 in the shared history. You see than more than half or read length consist of polyN. FastQC and Falco bow show that read length varies from 20 to 50 bp, but, as you can see at the start of the file, some long reads contain long polyN tracks. Nucleotide composition is very unusual for 29M reads. The dataset has over-represented sequences, up to 6.8% for the top sequence. It might be OK for some data, for example, miRNA sequencing. I Blasted the sequence present in 6.8% reads, and it seems one of the hits is to rRNA. Once I came across data with more than half reads derived from rRNA.

Examine the log file from the STAR job (dataset 5). 10M reads (35%) are not mapped because they are too short. You can check alignments including examples of multi-mapped reads. High proportion of multi-mapped reads does not mean the mapping was poor, because reads may come from repeated regions - see my remark regarding the rRNA sequence. rRNA genes come in clusters, and the genome assembly may contain multiple copies of the rRNA genes. See, if you can exclude reads mapped to rRNA genes.

You can check alignment using IGV. Download the genome assembly and gene annotation, create a custom genome in IGV using the same dbkey, as in Galaxy. Start IGV and visualise reads from the BAM file. Galaxy will connect the BAM file with IGB on your computer, and only small proportion of reads will be transferred for visuallisation.

I have not tested PerM.

Not sure, if bowtie can handle spliced data. HiSAT2 and RNA_STAR are popular options for RNA-Seq data. Both can use gene annotation data for mapping, but will do a good job even without gene annotation. For read counting consider htseq-count and featureCounts. IMHO, the main concern is data quality: high proportion of short reads, presence of polyN in a small fraction of reads and (apparently) high proportion of reads derived from rRNA.

Kind regards,
Igor

Thank you Igor. I appreciate your advice. The data is published by another group and I don’t have the resources to repeat the experiment. This is a common feature of public domain data, and I was trying to find a good way to use it. What you say re the quality of the decoded and trimmed reads is true but could be a result of the standard parameters used and of the way the colorspace is decoded- [which I now understand shouldn’t be done before mapping because it loses too much information]. So this data needs an option which can use the colorspace reads. Can you look into why the tools that can do this on galaxy are giving me error messages please? (second history. also has genome and annotations if that’s any use..). I take your point about rRNA, this wasn’t mentioned in the publication, but may be a good way to improve the quality of the data. (though unfortunately I am a beginner and would need to ask you how exactly can one exclude genes from a mapping experiment..). It would be good to know if the cutadapt trimming step I used was appropriate for this kind of data too - it is clearly a specialised form of sequencing, but not that old- there may be a lot of data out there that could be used again with the right guidance and tools.
Is there a glimmer of hope in the fact that bowtie (or certain versions of it) can map colorspace data and HiSat2 is a development of bowtie that can work with spliced data…(?).. It would be good to be able to open up this kind of data and incorporate it into cross- study analyses- something that galaxy could really contribute towards. I hope you are able to help- and thank you for looking into it so promptly. I very much appreciate your knowledge and guidance.
Carolyn

Hi Carolyn,

Unfortunately, I am not familiar with SOLiD sequencing technology. It is not used anymore.

2nd history: click at any output from the failed PerM job. Click at Error icon, the one looking like the ladybird beetle. You’ll see the following:
PerM: command not found
It looks like an installation issue. Again, it is probably an old software, and, as far as I can see, it is not available in the tool shed (a repository of Galaxy tools). It is endemic to Galaxy Europe.
You can report the failed job to the server admins by clicking Report button at the Dataset Error Report page, but we can contact people from Europe here.

Hi @wm75, it seems PerM is not installed properly in Europe. The jobs fail with “PerM: command not found”. Can it be fixed?

You can check compatible data formats for tools in Galaxy by clicking at “accepted formats” link under Input boxes (places were we select input files). HiSAT2 accepts fastqsanger and fasta. I checked other mapping tools, and cannot find any supporting the SOLiD technology. Dedicated bowtie_color_wrappers are not maintained since 2015 and most likely incompatible with current Galaxy environment.

You can try SortMeRNA for filtering rRNA sequences:

Check if the gene annotation file has info on rRNA genes. If yes, look at these regions with IGV. Attach the BAM file with reads. I did a quick check, and I might be wrong, but presence of a sequence in 6.8% of reads is consistent with rRNA contamination. If it is the case, significant proportion of multimapped reads may come from rRNA. Maybe do the read counting with featureCounts and check the top genes.

Kind regards,
Igor

Thanks Igor. It could be really useful to be able to maintain a capability to use older datasets - this was published 8 years ago. The Solid data usies a colorspace encoding which is the only real difference I think. (?) Thanks for your advice about using SortMeRNA- this may also be necessary to replicate the published analysis with the new genome
There is a bowtie for solid tool in the history , so it looks like someone had done some work to make this possible in the past- this also isn’t working with similar issues. Will report as suggested. As you can see the colorspace data has the format fastqcsanger.gz and I tried this with HISAT2 - it does not accept it as is. Perhaps there is someone who has used SOLID in the past who can help with advice/ tips? Would be happy to help contribute to a tutorial if this is the case and we can get things working …
Carolyn

Hi Carolyn,

You said you have files in standard FASTQ format, so you can analyse the data.

By memory, SOLiD technology uses dinucleotides, and data can be converted into FASTQ, which uses nucleotides. The issue might be around indels, real or sequencing errors.

As I said, bowtie for solid wrapper (bowtie_color_wrappers) is not maintained since 2015.

Most likely, the species you work with has spliced genes, and bowtie is genome aligner. I believe it cannot map through splice sites. You can use transcriptome instead of reference genome.

During data upload, Galaxy checks files and assigns an appropriate datatype, for example, fastqcssanger.gz. This one has the following meaning: reads in colorspace format with Phred33 encoding compressed with GZip. I usually think about datatypes as a label in a shop, but metadata is a more appropriate term. Tools in Galaxy handle data based on datatypes. HiSAT2 search(?) for datasets with fastqsanger, fastqsanger.gz, fastqsanger.bz2 and fasta datatypes. It ignores all other datatypes, such as fastqillumina or fastqcssanger. Many tools in Galaxy currently handle only FASTQ files with Phred33 encoding. All modern sequencing platforms use +33 offset for quality encoding. It is expected that other types of data, for example, old illumina standard, will be converted into Phred33. You already have data in this format.

Native SOLiD data is very different to FASTQ/FASTA. It requires different indexing of genome, and probably use different rules. Mapping software is improving all the time for existing and emerging sequencing technology, like long reads.

I hope people with experience with SOLiD data will give useful suggestions.

Kind regards,
Igor

Thanks Igor. And apologies for the difficulty of this question - I appreciate your help and advice. I guess no- one has recent experience of this kind of analysis so its a big ask. I do have the converted files, but because the conversion process can easily go badly wrong, and result in a big reduction in the data quality, most advice is to use the colorspace reads directly. Bowtie can do this if the right options are enabled (which needs the wrapper to work in galaxy) , I was just hoping that Hisat might have retained this in the background somewhere and it might be possible to turn this capability on? I’m not sure who could make this possible…Otherwise, using the transcriptome is a good idea - I suppose the best thing for this would be to compile the first transcript information only (it might be the best compromise). Can you point me to any clear steps for doing this - it would be good to know I am doing this correctly. I have already compiled a custom genome from the latest info, so it may be straightforward but might be useful to be sure. And have any indexing steps sorted out prior to mapping too- in case they are not included in the mapping process. It would be good to find a way forward that enables use of these datasets…
Again, much appreciated, and if there is anyone on the forum who can share their experience, that would be very helpful…

Hi Carolyn,

I start with easy questions. In Galaxy, mapping tools, such as bowtie, index a reference genome prior to mapping, when jobs submitted with Custom reference genome from history. PerM can handle custom reference genomes from a history, and in this case it should index the genome prior mapping. Custom reference genome can be any file including transcriptome.

I don’t know if reads counting tools, such as htseq-count or featureCounts support BAM files produced with colorspace data.

I guess, an improvement in colorspace-to-sequence conversion from alignments might be significant if aligner tool supports indels. However, it seems PerM, just like bowtie, apparently does not support gaps (indels) - see Table 1 in https://onlinelibrary.wiley.com/doi/10.1155/2014/309650
I don’t know how much in quality you will gain from a non-gapped alignment.

If you want to try, upload the transcriptome file, map reads with PerM once it fixed in Europe and convert BAM to FASTQ. I hope, the conversion step will produce nucleotide sequences.

As I said previously, the sequence quality might be OK for SOLiD, but the dataset has a high proportion of rRNA-derrived reads, and high proportion of multi-mapped reads is consistent with rRNA-derrived reads.

HiSAT2 was (mainly) developed after demise of SOLiD technology. The question was asked before, but not answered: SOLiD RNA-seq and HISAT2 mapping?
The HiSAT2 manual has the following: Colorspace is always set to 0 for HISAT2.
Manual | HISAT2
It looks like it does not support this type of data.

Kind regards,
Igor