Accurate mapping of 6p21 genes

I’ve run BWA-MEM on my FASTQ files and it shows almost no coverage in this region.

Is this due to the high polymorphism of the HLA region? How can I overcome this?

Hi @dna2026

Your question is about interpreting the results rather than about Galaxy itself. Consider asking on other bioinformatics forums, such as Biostars. You could also follow an established protocol for HLA or MHC analysis, or check published papers.

If you decide to post on another forum, provide more details: what kind of data you have (it looks like WGS, but that is just a guess), what tissue or cell type was used, which reference was used, the parameters used in the mapping step, and a basic description of the screenshot (for example, whether it shows multimapped reads or not).

I guess the assembly has many repeats in this locus, and what you see might be a consequence of multimapped reads or, more precisely, of the way mapping tools handle multimapped reads with default settings.

Kind regards,

Igor


Hi Igor

Thank you for the reply!

This is 30x WGS data from a commercial sequencing company, from a buccal swab.

I wanted to align it to hg38 because the original VCF I received was in hg19, particularly to see if I can get a better look at CYP21A2, TNXB and the surrounding genes from the raw FASTQ files.

The original pipeline did call some variants in the VCF (my BWA-MEM alignment showed none, and very poor coverage in the HLA area), and I am trying to see whether any tools can solve this, and to learn in the process.

I’ve been advised that this region is very difficult to map and might need specialist processing at the sequencing stage, but I am checking whether I can get any approximation using the files at hand.

I assume this would mean masking some pseudogenes, or possibly aligning to a specific HLA reference or gene?

Hi @dna2026,

So these are not cells with a genomic rearrangement in HLA, and you probably have enough reads in the sequencing data. I searched for TNXB: it is in a “complex cluster”.
I don’t have experience with HLA or MHC, but it seems people use (or used) a dedicated reference made of these genes for short-read mapping. Long reads, such as high-quality PacBio or very long Nanopore reads, might be a better option for these regions when using a standard genome assembly, assuming the assembly represents this region well. At least in the past, regions with repeats were poorly assembled (only a few copies present), for example heterochromatin regions in fruit flies. You probably need to check the literature.

Multimapped reads probably have MAPQ=0, and only a few of the possible matches might be reported in the alignment with default settings. Check whether reads with MAPQ=0 are shown in the coverage plots. Using multimapped reads for variant calling is also problematic.
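One rough way to check the MAPQ=0 question is to count, for the region of interest, how many aligned reads have MAPQ=0 and how many are flagged as secondary or supplementary. The sketch below is only an illustration, not an established HLA protocol: it assumes the alignments are available as SAM text lines (for example, the output of `samtools view` restricted to a region), and the function names `ref_span` and `mapq_summary` are made up for this example.

```python
import re
from collections import Counter

# CIGAR operations; only M, D, N, = and X consume reference bases.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def ref_span(pos, cigar):
    """Return the 1-based inclusive reference interval covered by an alignment."""
    length = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + max(length, 1) - 1


def mapq_summary(sam_lines, chrom, start, end):
    """Count alignments overlapping chrom:start-end, and how many have MAPQ=0.

    sam_lines is any iterable of SAM text lines (header lines are skipped).
    This is a hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, rname = int(fields[1]), fields[2]
        pos, mapq, cigar = int(fields[3]), int(fields[4]), fields[5]
        if flag & 0x4 or rname != chrom or cigar == "*":
            continue  # unmapped, or on another sequence
        a_start, a_end = ref_span(pos, cigar)
        if a_end < start or a_start > end:
            continue  # no overlap with the region
        counts["total"] += 1
        if mapq == 0:
            counts["mapq0"] += 1
        if flag & 0x100:
            counts["secondary"] += 1
        if flag & 0x800:
            counts["supplementary"] += 1
    return counts
```

To use it, pass the lines of a SAM file (or `sys.stdin` when piping from `samtools view`) together with the coordinates of the gene taken from a genome browser for the reference you aligned to. If most reads in the region have MAPQ=0, a viewer or variant caller that filters on mapping quality would show little or no coverage there even though reads were aligned.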

As I said previously, check published protocols and talk to people working in this field: standard variant calling protocols may not be suitable for highly repetitive regions, such as HLA.

Kind regards,

Igor
