How to turn .bam into VCF

I’m not in this industry at all and can’t seem to get AI to help me here. I’m trying to turn a .bam into a VCF with the FreeBayes tool, but I keep running into an error (it says unknown error) when I run the tool. The genome build is 19/GRCh37, but there are a ton of options for this. I’ve tried quite a few and none have worked. Would really appreciate any help here for a total novice!

Welcome, @rachaeldavid1

Glad you found your way over here!

There is a potential problem at UseGalaxy.org right now. I’m talking with our administrative team and can get clarity or a correction.

This is how the error is presenting. Is this also what you are getting?

History panel screenshot for a dataset

More soon, and you can confirm if this is also your error. Thanks! :slight_smile:

Ah yes, this is my exact error! Thanks for this clarification. Again, I don’t work in this field at all and really have no idea what I’m doing. I’m just a complex chronic illness patient whose geneticist is too busy, so I’m trying to do the work myself.

Idk if this context is helpful, but what I’m trying to do is upload the genetic data I received from Invitae testing into Promethease. I only have .bam and .bam.bai files and Promethease only takes VCF, so I was trying to use this tool to convert. If there are any easier ways to go about this, I’ll do whatever!

Hi @rachaeldavid1

I’m messaging you directly, so let’s continue in there.

The “conversion” is more like generating a meaningful summary than a direct translation between file formats.

Your medical data might not be a good fit for the public servers if you are concerned about privacy. But we can discuss.

Gotcha-- waiting for that DM. I can’t figure out how to send you one but waiting for yours!

Hi @rachaeldavid1

Your BAM file appears to be based on the human genome, but not on the version hosted as hg19 on the public Galaxy servers (hg19 is how the data is currently labeled, and it is the genome originally selected with Freebayes).

This is what the error is reporting: mismatched chromosome identifiers
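To make the mismatch concrete, here is a minimal sketch of the kind of check involved. It assumes you have the BAM header as plain text (for example, from a tool that prints the header); the function names and the example header are made up for illustration. UCSC-style builds such as hg19 name chromosomes `chr1`, `chr2`, `chrM`, while 1000 Genomes style builds such as hg_g1k_v37 use `1`, `2`, `MT`.

```python
def sequence_names(sam_header_text):
    """Return the chromosome names (SN: fields) listed on @SQ header lines."""
    names = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def guess_naming_style(names):
    """Classify names as 'chr-prefixed' (hg19 style) or 'plain' (hg_g1k_v37 style)."""
    if not names:
        return "unknown"
    prefixed = sum(1 for name in names if name.startswith("chr"))
    if prefixed == len(names):
        return "chr-prefixed"
    if prefixed == 0:
        return "plain"
    return "mixed"


# Hypothetical header text, shortened to three sequences for the example.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:2\tLN:243199373\n"
    "@SQ\tSN:MT\tLN:16569\n"
)

names = sequence_names(example_header)
style = guess_naming_style(names)
print(names, style)
```

If the style reported for the BAM differs from the naming used by the reference genome selected in the tool form, that would produce the kind of identifier mismatch described here. The names are only a first hint, since the sequence lengths and content also have to match.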

More about the different human genome assemblies is here.

This guide has more details about how those kinds of checks are done at a detailed level.

Your data might be based on hg_g1k_v37. If yes, you can use Freebayes against that human genome reference. I would try this first.

  • Update: this will work if you add default read groups to the BAM file with the tool AddOrReplaceReadGroups. This results in a VCF from Freebayes, but it is not annotated. If you want to create an annotated VCF, then you’ll need to try the next suggestion below instead, starting from the fastq reads.
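As a rough way to tell whether a BAM already carries read groups, the header can be scanned for `@RG` lines. This is only a sketch under the assumption that the header is available as plain text; `read_group_ids` and both example headers are hypothetical.

```python
def read_group_ids(sam_header_text):
    """Return the ID of every read group (@RG line) found in the header text."""
    ids = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


# Hypothetical headers: one without read groups, one after adding a default group.
header_before = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n"
header_after = header_before + "@RG\tID:A\tLB:lib1\tPL:ILLUMINA\tSM:sample1\n"

ids_before = read_group_ids(header_before)
ids_after = read_group_ids(header_after)
print("before:", ids_before)
print("after:", ids_after)
```

An empty list suggests the read groups still need to be added before variant calling.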

The other option is to extract the fastq reads from the BAM you have, then map them against a version of the human genome we host and proceed to downstream steps that way. This is probably the cleanest way to create the file you want (a VCF file) or to obtain rs identifiers, but it might mean that the data can’t be used in other external applications (because they are expecting data based on a different human genome assembly!).

  • bedtools Convert from BAM to FastQ
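For anyone doing this step outside of Galaxy, a command-line equivalent might look like the sketch below. It assumes paired-end reads and that `samtools` and `bedtools` are installed; the file names are placeholders. The script only builds and prints the commands, so nothing is run until you choose to.

```python
import shlex


def bam_to_fastq_commands(bam_path, name_sorted_bam, fastq_r1, fastq_r2):
    """Build the two commands for a paired-end BAM to FASTQ extraction.

    bedtools bamtofastq expects mates to be adjacent for paired output,
    so the BAM is sorted by read name first.
    """
    sort_cmd = ["samtools", "sort", "-n", "-o", name_sorted_bam, bam_path]
    fastq_cmd = [
        "bedtools", "bamtofastq",
        "-i", name_sorted_bam,
        "-fq", fastq_r1,
        "-fq2", fastq_r2,
    ]
    return sort_cmd, fastq_cmd


# Placeholder file names, for illustration only.
sort_cmd, fastq_cmd = bam_to_fastq_commands(
    "sample.bam", "sample.name_sorted.bam", "reads_1.fastq", "reads_2.fastq"
)
print(shlex.join(sort_cmd))
print(shlex.join(fastq_cmd))
# To actually run them: subprocess.run(sort_cmd, check=True), then the same for fastq_cmd.
```

For single-end data the `-fq2` argument would be dropped and the name sort is not needed.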

If the goal is just to learn whether the data includes any known SNP rs identifiers, you can do that. Following this protocol with a single sample, starting from the reads, is what to try.

Hope that gives you some options!

So I tried changing to hg_g1k_v37. I keep asking Invitae, but they just keep repeating that 19/GRCh37 is used. I ran AddOrReplaceReadGroups successfully based on hg_g1k_v37, but I’m still getting the same error code as I got originally in the screenshot above. Does that mean I should move on to the other options? Confirming it does not have to be annotated.

Hi @rachaeldavid1

I’ve created an example here.

You can examine the data from here, run these tools again with different parameters, and maybe follow the tutorials to see how to refine the calls.

Hope this helps! :slight_smile: