FTDNA BAM to T2T pileup and haplogroup

Hi … I’m hoping for some guidance on the workflow for T2T mapping a FTDNA BigY BAM with the DF27-based CM034974.1 … and taking this through to pileup and metagenomic haplogroup … as a newbie I’ve been experimenting a lot without developing a way forward using usegalaxy, samtools (directly with the Ydna DF27-based CM034974.1) and Yleaf … in usegalaxy at the moment I have a FTDNA BAM converted to fastq and applied Bowtie2 and usegalaxy’s Chm13_T2T_v2.0 reference to map a new BAM and a convert this to BED… now I’m stuck trying to get a pileup because there appears to be no reference genome available (which might possibly be the Ydna J1-based CP086569.1 that goes with usegalaxy’s Chm13_T2T_v2.0) … ideally I’d like to combine calls against all three of CM034974.1, CP086569.1 and H38 reference genomes … so I’m hoping for some guidance on the next steps for a pileup and how to use it for haplogroup identification … thanks!

If you are using the tool Samtools mpileup, then the tool form interprets the assigned database of the inputs to offer up the correct built-in fasta index. You can also use a custom genome fasta from the history instead.

I’m not sure if the T2T assembly is indexed or not for this tool at the AU server but you could check by selecting a dataset that has that database assigned (or is that what you have done?).

→ If you mapped against that genome as a native index, the database key should already be assigned.
→ If you mapped against a custom genome fasta, you would need to create the custom build, then assign it to datasets with data points associated with that assembly.
→ Never assign a server indexed database key when using a custom genome unless you know that the two reference genomes are exactly the same: bases and identifiers. But if they are exactly the same, then the question of why to use a fasta from the history at all to start with comes up…

I just wrote up some help for another question that explains a bit about how the custom genome and custom build functions work in Galaxy, so I’ll repost that here since it probably also applies to your use case:

I also have another guide here that covers some of the complexities of working with human genomes from different sources. You may need it now, or maybe later as a reference.

Reference genomes at public Galaxy servers: GRCh38/hg38 example

For this part

This sounds like you want a custom genome that contains all of those genomes. You’ll have some trouble with this due to sequence duplications between the difference references, the exposed telemore and repetitive regions in the T2T assemblies themselves, then the really massive size of the fasta itself (jobs against this reference are likely to hit processing limitations – possibly with the variant calling tools themselves!). Also, I’m not sure if that will give you the scientific results you are after, so I am probably misunderstanding!

Instead, you’ll probably want to call variants against each reference separately, then compare those calls across assemblies. UCSC has some reference-to-reference mapping data available: most of the annotation tracks that you see in their browser are from that “liftover” process, and the files are in the Table browser or their downloads area. Maybe review that (or ask them if you can’t find it), then learn which specific assemblies they used, then also call your reference-specific variants against those same exact assemblies to facilitate the mapping over of your results.

So … I can think of a few ways to approach this but it would be better to find a publication that has attempted this already, learn the methods, then adopt those to your own novel data as a starting place. Then tune.

We don’t have a tutorial for anything human T2T related yet (or I missed it and someone can correct me!), but you can still review the other variant calling guides to get an idea of the usual pathway for that part. And, you might want to switch up your target genomes to match what UCSC is hosting for the reference-to-reference mapping part, and use the custom genome/build functions.

If you plan on incorporating any annotation (dbSNP, gene bounds), I would suggest getting that organized at the very start too. Why? Variant calling is super sensitive to the content of a particular assemblies base positions plus how the data is pre-filtered, the read quality, potential filtering for regions of interest when mapping, and more that you probably already know! You’ll want to eliminate simple data conflicts when exploring new methods to avoid weird technical errors or “wrong” scientific results (not so easy to detect sometimes, and usually at some downstream step after a lot of investment in pre-processing, that then needs to be redone).


Overall, I think the immediate problem is some reference genome mix-up or labeling problem but if the CHM13_T2T_v2.0 assembly is actually not available in the pileup tool that you are using, and you still want to use that specific assembly, let us know which exact tool and that can probably be adjusted.

We’ll need the full tool name and version is at the very top of the tool form. Then I’ll confirm and we can ask the AU admin’s here for the correction request (if they think the tool can actually handle the processing!). You could also check to see if the EU server already has it already has it or not - they have the largest clusters of the three UseGalaxy servers.

Let’s start there and thanks for sharing all of your details. Very helpful! :slight_smile:

jennaj … thanks for a great set of things to check … I’ll get onto it asap and let you know … cheers!

1 Like