Creating a customized genome index (large) on a private Galaxy server -> Use Data Managers

Hello,
I was hoping to get some advice - I am trying to create a custom genome that combines the human and rat builds to use as a reference genome for alignment with bulk RNA sequencing data. I am trying to use the UCSC database but I am not sure which settings (group, track and output format mostly) would be required for this application. Any advice or a pointer toward a tutorial would be extremely helpful!

Thanks in advance,
Kiera

Hi @kdwyer

Human and rat are each too large to use as a custom reference genome on their own. The two combined will also be too large.

You could create your own indexed reference with this data if you are running your own Galaxy server. UCSC would be a good source. Find the fasta files in their Downloads area, as this data is too large to retrieve with the Table Browser query function.

General instructions for creating reference indexes in Galaxy are here:

hi @jennaj thanks for the reply!

I am a little confused by your first comment about human or rat individually being too large. I might have explained this poorly and want to give more context. I have bulk RNA-seq data from explanted tissue that contains mostly human cells with some contaminating rat cells. I was advised at the last Galaxy conference to make a “custom genome” containing both the human and rat reference sequences to use during the alignment phase (I had used HISAT2 previously), so that I could map my dataset against it and determine the percentage of reads mapping to human versus rat. I had previously aligned my data to human and rat separately using HISAT2 with the built-in human (hg38) and then rat (rn6) genomes, so I am confused when you say this is too large. Could you explain this more?

I am not running my own Galaxy server. Would you be able to explain the reference indexes more? I could not seem to find any mention of them in the link. Knowing my application, is there any more advice on whether this is possible in Galaxy, how to do it, or, if the UCSC database is a good source, what settings I should be using?

Thanks so much,
Kiera

Hi @kdwyer

The built-in indexes for human and rat are already prepared for mapping against on the public servers. This means that the original fasta for the genome has had several indexes pre-computed. Those indexes are what tools use for processing when you run a job.

If those exact same genomes are used as a custom reference genome (a fasta from the user’s history), the indexes are created at runtime before the analysis tool is run. That indexing step is usually where the job fails. Other times, the processing for the index creation plus the analysis job combined will exceed the maximum runtime at that server.

Any larger genome will have this same problem at most public servers (and definitely at UseGalaxy.org). FAQ: working-with-very-large-fasta-datasets

That said, UseGalaxy.eu can sometimes handle larger jobs, so you can try. The genome fasta files at UCSC are in their downloads area. Since you want to combine two genomes, you’ll need to adjust the chromosome identifiers. Any single fasta that is used as a “custom genome” needs to have distinct identifier names for all of the sequences included. The naming for human and rat will have some overlap: for example, human will have a “chr1” and rat will have a “chr1”. One or both will need a change, not just for “chr1” but for every other chromosome name shared between the two species/fasta files.
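
If you end up doing the renaming outside of Galaxy, here is a minimal Python sketch of the idea, not a tested pipeline. It prepends a species prefix to every sequence identifier and writes both genomes into one fasta, stopping if any identifier would still be duplicated. The function name `combine_fastas` and the prefixes `hs_` and `rn_` are my own choices for illustration; any prefixes that keep the names distinct would work.

```python
import io


def combine_fastas(inputs, out):
    """Write several FASTA inputs to `out`, prefixing each sequence ID.

    `inputs` is a list of (prefix, file-like) pairs; both names are
    illustrative assumptions, not anything Galaxy or UCSC require.
    Returns the list of new sequence IDs in the order written.
    """
    seen = set()
    new_ids = []
    for prefix, handle in inputs:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].rstrip("\r\n")
                fields = header.split()
                # The sequence ID is the first whitespace-delimited token.
                new_id = prefix + (fields[0] if fields else "")
                if new_id in seen:
                    raise ValueError("duplicate identifier: " + new_id)
                seen.add(new_id)
                new_ids.append(new_id)
                out.write(">" + prefix + header + "\n")
            else:
                # Sequence line: copy as-is, making sure it ends in a newline
                # so the next file's first header starts on its own line.
                out.write(line if line.endswith("\n") else line + "\n")
    return new_ids


# Tiny in-memory demo standing in for the real (very large) genome files.
human = io.StringIO(">chr1\nACGT\n>chrM\nGG\n")
rat = io.StringIO(">chr1\nTTTT")
combined = io.StringIO()
ids = combine_fastas([("hs_", human), ("rn_", rat)], combined)
print(ids)
print(combined.getvalue())
```

With real data you would pass opened, uncompressed fasta files in place of the `io.StringIO` objects and write to a file opened with `open(..., "w")`. Because the species is now part of each chromosome name, reads mapped to the combined reference can be attributed to human or rat by the prefix of the sequence they aligned to.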