How to map a collection of individual samples against a custom reference genome with RNA STAR

Hi all,

I am a complete newbie to RNA-seq. So far, I have followed the “Reference-based RNA-Seq data analysis” tutorial with my own data.
In my project, I am trying to find differentially expressed genes in five different mutants (sampled as biological triplicates). The organism is an unconventional yeast, so I imported the genome fasta and the gtf file from NCBI. When I run the tool, I get the error message pictured.
My two questions: 1) Is it right that the tool “wants” to map only one query sequence because I only feed it one reference genome?
2) If so, how do I set up a workflow to map all paired-end reads of my different samples against the same reference genome?
Thanks for your help and cheers,
Kai

Hi @Kai_Buechner
Maybe try HISAT2. It is similar to RNA STAR but requires less memory. Test the approach on one sample first. I assume you have gzipped fastq files in your history. You also need a genome assembly of the species you work with.
During HISAT2 job setup, change “Source of the reference genome” to “From history” and select the fasta file containing the reference genome. Change “Is this single end or paired end” to “Paired-end” and select the files with the F and R reads from one sample. Activate both options in the “Summary options” section: the output is useful for visualisation with MultiQC. Click “Run Tool”. HISAT2 will index the genome and map the reads. Wait for the job to complete, then check the results, including the summary file. You can inspect the alignment in IGV.

If you are happy with the results, use the re-run option and replace the FASTQ files with the files from another sample. You don’t have many samples, so it should not take too long.
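If you keep track of the files outside Galaxy, a small script can help make sure each re-run gets the right F and R files. This is only a minimal sketch: the function name `pair_fastq_files` and the naming scheme `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` are assumptions for illustration, so adjust the pattern to however your files are actually named.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming scheme: <sample>_R1.fastq.gz and <sample>_R2.fastq.gz
READ_TAG = re.compile(r"^(?P<sample>.+)_(?P<read>R[12])\.fastq(\.gz)?$")


def pair_fastq_files(filenames):
    """Group FASTQ file names into {sample: (forward, reverse)} pairs."""
    by_sample = defaultdict(dict)
    for name in filenames:
        match = READ_TAG.match(name)
        if match is None:
            raise ValueError(f"Unrecognised FASTQ name: {name}")
        by_sample[match["sample"]][match["read"]] = name

    pairs = {}
    for sample, reads in sorted(by_sample.items()):
        # Every sample needs exactly one forward and one reverse file.
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"Sample {sample} is missing a mate file")
        pairs[sample] = (reads["R1"], reads["R2"])
    return pairs


if __name__ == "__main__":
    # Made-up example names, not files from this thread.
    files = [
        "mutA_rep1_R2.fastq.gz",
        "mutA_rep1_R1.fastq.gz",
        "wt_rep1_R1.fastq.gz",
        "wt_rep1_R2.fastq.gz",
    ]
    for sample, (forward, reverse) in pair_fastq_files(files).items():
        print(sample, forward, reverse)
```

The script only sorts names into pairs; it does not run any mapping. It simply gives you a checklist of which two files belong to each sample.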

Kind regards,
Igor

Hi @igor
Thank you for your help! I’ll get to work on it.

Have a great day,
Kai

Hi @Kai_Buechner

As a test to make sure everything is working as expected, I started up a simple paired-end collection mapping in this history.

The other option is to map without the multi-sample collection as @igor is describing.

Your original error message is odd! Mapping multiple samples against the same single reference genome should definitely be possible, all together, in the same run. If you would like to share back your history, we can troubleshoot any problems that my example and Igor’s help did not resolve.

Remember that RNA STAR is very picky about reference data formats! So if you are supplying your own reference data, content issues can lead to all sorts of odd error messages! We can usually sort those out here if we can see the example. How to share is in the banner of this forum. The entire context usually matters and the shared history link is the best way to communicate those details.

Please let us know if you solve this! :slight_smile:

Hi all,
first of all: thank you @igor and @jennaj for taking the time to tackle my problem. I found the solution and will try to write it up as comprehensively as I can:
I used the NCBI Datasets Genomes tool from the “Get Data” menu to import the dataset. I did not realise that the genome was imported as a folder nested within a folder; this is what threw RNA STAR off. When I downloaded the fasta.gz and uploaded it again, it worked very well.
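For anyone who downloads such a nested package to their own computer first, here is a minimal sketch for locating the actual FASTA file inside the folders before uploading it. The function name `find_fasta_files` and the list of file extensions are my own assumptions, not something the tool prescribes.

```python
from pathlib import Path

# Assumed set of common FASTA extensions; extend it if your file is named differently.
FASTA_SUFFIXES = (".fasta", ".fa", ".fna", ".fasta.gz", ".fa.gz", ".fna.gz")


def find_fasta_files(root):
    """Return all FASTA-like files below `root`, however deeply they are nested."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.name.lower().endswith(FASTA_SUFFIXES)
    )


if __name__ == "__main__":
    # "." is a placeholder; point this at the unpacked download folder.
    for fasta in find_fasta_files("."):
        print(fasta)
```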
The next issue I had was the .gff3 annotation file. It did not have the right annotation format, so I first asked an LLM to write a short Python program to unify the gene_id for all instances and then replaced the several different delimiters (“\t”, “,” and “;”) uniformly with tab delimiters. That solved the problems I had. Hopefully these solutions can help other newcomers in the same situation.
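This is not the script I used, but a minimal sketch of the kind of check that would have pointed me at the problem earlier. It assumes a GTF-style file, which has nine tab-separated columns with a gene_id in the last (attribute) column. The function name `check_gtf_lines` is made up for illustration.

```python
def check_gtf_lines(lines):
    """Return (line_number, problem) tuples for lines that look malformed.

    Assumes GTF-style annotation: 9 tab-separated columns, with a gene_id
    attribute in the last column. Comment and empty lines are skipped.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            problems.append(
                (number, f"expected 9 tab-separated columns, found {len(columns)}")
            )
            continue
        if "gene_id" not in columns[8]:
            problems.append((number, "no gene_id attribute"))
    return problems


if __name__ == "__main__":
    # "annotation.gtf" is a placeholder file name.
    with open("annotation.gtf") as handle:
        for number, problem in check_gtf_lines(handle):
            print(f"line {number}: {problem}")
```

The check only reports suspicious lines and changes nothing in the file, so it is safe to run before deciding how to fix the annotation.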
Have a great upcoming week, everyone!
Kai
