BWA-MEM built-in genome(s)

Pettorato · July 13, 2020, 10:46am

Hello everyone,
I’m using Galaxy to trim and align data from targeted DNA-seq. With BWA-MEM, I’ve noticed that the built-in genomes do not include the hg38 Canonical version that is available for bowtie2.
How can I use that genome version to align my data?
Thank you in advance,
G.

wm75 · July 13, 2020, 12:51pm

usegalaxy.eu does have hg38 as a built-in reference genome for bwa-mem. Just start typing hg in the select box and you should find it.

Pettorato · July 13, 2020, 1:11pm

Yes, there is an hg38, but it is not the same as the “hg38 Canonical” I’ve already used with bowtie2.

Or am I wrong, and it is actually the same as the one I can find in BWA-MEM?

Thanks!

wm75 · July 13, 2020, 1:45pm

Ah, yes, you’re right (sorry for not reading the question carefully the first time).
hg38 should be identical to hg38 Full, but it’s different from hg38 Canonical.
So, no, you cannot use hg38 Canonical as the reference in a bwa-mem alignment.

That said, you should also not do that anyway. Aligning against just canonical chromosomes can cause misalignments of reads that originate from non-canonical sequences, simply because there is no better match for them than a stretch of canonical sequence.
The better approach is to align against the full genome, then eliminate non-canonical mappings by filtering or by using the canonical genome during variant calling. Agreed, this can be tricky depending on downstream tools. Some variant callers, for example, may refuse to work with sequences mentioned in the input BAM header that are not found in the reference genome, in which case you would have to reheader your BAM dataset first.
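
As a rough illustration of the filtering step, here is a minimal Python sketch that works on SAM text (for example, a BAM converted to SAM). It assumes “canonical” means chr1 to chr22, chrX, chrY and chrM in UCSC naming; the function name `filter_sam_to_canonical` is made up for this example. It drops the `@SQ` header lines and the alignment records of all other contigs, so the header is effectively reheadered at the same time.

```python
# Assumption: "canonical" = chr1..chr22, chrX, chrY, chrM (UCSC-style names).
CANONICAL = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def filter_sam_to_canonical(lines, keep=CANONICAL):
    """Yield only the SAM lines that refer to contigs listed in `keep`."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@SQ"):
            # @SQ header lines name a reference sequence in their SN: tag.
            tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
            if tags.get("SN") in keep:
                yield line
        elif line.startswith("@"):
            # Other header lines (@HD, @RG, @PG, @CO) pass through unchanged.
            yield line
        else:
            cols = line.split("\t")
            rname, rnext = cols[2], cols[6]
            # Keep a record only if the read maps to a kept contig and its
            # mate is not placed on a removed one ("=" means same contig,
            # "*" means no mate information).
            if rname in keep and (rnext in ("=", "*") or rnext in keep):
                yield line
```

Note that this sketch also discards unmapped reads (their contig field is `*`) and reads whose mate sits on a non-canonical contig. Whether that is what you want depends on the downstream tool.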

All in all, my recommendation would be to rerun your bowtie2 jobs using the full hg38 version, not to look for solutions for making bwa-mem work with the canonical version.

At the same time, it’s probably true that we should offer hg38 Canonical for bwa-mem if we do so for bowtie2 so thanks for bringing this up.

Pettorato · July 13, 2020, 1:58pm

Thank you for your answer and for your little class about alignments!

Best,
G.

Alysha · September 7, 2020, 9:51am

Hi, before reading this thread I aligned my paired-end RNA-seq data to the (b38) hg38 version of the human genome available through BWA-MEM. The tool states that this genome is pre-indexed with the bwa index utility and so is ready to be mapped against with BWA-MEM. I don’t know how to index a genome, so I would rather stick with the aligned BAM files I have already generated if possible. However, for downstream analysis with programs such as RSeQC I am asked to input the reference genome used. I have read that BWA-MEM uses the UCSC genome, but I can’t find which hg38 version was used. Would it be wrong to use a different hg38 version (without the BWA-MEM indexes) for downstream analysis of the aligned files? Any help would be really appreciated!

jennaj · September 8, 2020, 5:53pm

Hi @Alysha

If you mapped against GRCh38 using a public Galaxy server, then yes, it was (probably) the UCSC version of the assembly/build named hg38. But you can check. The details are below.

For the assembly version, UCSC based their build of hg38 on the original release (not later patches/updates): http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/

Full details about UCSC’s source for the assembly are covered in the documents in their Downloads area plus there is a summary in the UCSC Genome Browser itself (check the home page for the human/hg38 genome at https://genome.ucsc.edu/cgi-bin/hgGateway). That is the same version as indexed for tools at usegalaxy.* public Galaxy servers: usegalaxy.org, usegalaxy.eu, and usegalaxy.org.au.

http://hgdownload.soe.ucsc.edu/downloads.html#human >> Genome sequence files and select annotations (2bit, GTF, GC-content, etc)

For many tools, reference genome indexes for hg38 will be built-in at public Galaxy servers but reference annotation for hg38 will need to be supplied by you from the history as a dataset. The reference annotation datatype required by tools can differ. Most use a gtf datatype – but in the case of RSeQC, the tool requires a bed datatype with 12 columns.

UCSC is a good source for “bed12” annotation (will have the bed datatype assigned and should be the 12 column type, not 3, 4, or 6 columns). Importantly, the chromosome identifiers will match between the annotation (sourced from UCSC’s Table Browser) and the reference genome indexed in Galaxy (sourced from UCSC’s Downloads area).
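
To make the “12 columns and matching chromosome identifiers” check concrete, here is a small Python sketch. The function name `check_bed12` and the idea of passing the reference contig names as a set are assumptions made for this example, not part of any Galaxy tool.

```python
def check_bed12(bed_lines, reference_contigs):
    """Return a list of human-readable problems found in BED text lines."""
    problems = []
    for n, line in enumerate(bed_lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        cols = line.split("\t")
        if len(cols) != 12:
            problems.append(f"line {n}: {len(cols)} columns, expected 12")
        # Column 1 is the chromosome; it must use the reference's naming.
        if cols[0] not in reference_contigs:
            problems.append(f"line {n}: contig {cols[0]!r} not in reference")
    return problems
```

An empty list means every data line has 12 columns and a chromosome name known to the reference. A typical mismatch is annotation that uses `1` where the UCSC-based reference uses `chr1`.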

If you didn’t map with BWA-MEM in Galaxy, or you mapped at a public Galaxy server other than the usegalaxy.* servers, some other build/version of the GRCh38 genome assembly may have been used.
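
If you are unsure which build your BAM was mapped against, the `@SQ` lines in its header list every reference sequence and its length. The following Python sketch summarizes those lines from header text. The function names are hypothetical, and treating names that end in `_alt` as alternate contigs assumes UCSC-style naming.

```python
def contigs_from_sam_header(header_lines):
    """Map contig name -> length, read from the @SQ lines of a SAM header."""
    contigs = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs


def summarize(contigs):
    """Collect a few hints about the naming style and content of a build."""
    return {
        "n_contigs": len(contigs),
        # UCSC builds prefix every sequence name with "chr".
        "ucsc_style_names": bool(contigs) and all(n.startswith("chr") for n in contigs),
        # Assumption: alternate haplotype contigs end in "_alt".
        "n_alt": sum(n.endswith("_alt") for n in contigs),
        "chr1_length": contigs.get("chr1"),
    }
```

Comparing the contig names and lengths against the reference you plan to use downstream shows quickly whether the two are the same build.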

These FAQs cover datatypes, expected formats, and related help, including how to make sure all of your inputs are a “match”. Using the same exact genome build/version for all analysis steps is important, otherwise tools can error or produce unexpected/scientifically incorrect results.

Start here: https://galaxyproject.org/support/ >> https://galaxyproject.org/support/#getting-inputs-right

More help is also available at this forum. Searching with keywords like “reference” or “annotation” or even “hg38” will find prior Q&A similar to yours.

Hope that helps!
