My project involves genotyping individuals at targeted sites in the human SNP databases. The original DNA has been enriched in the targeted SNPs following capture hybridization. I have BAM files of the aligned sequences resulting from the capture hybridization and a list of approximately 250 targeted SNP sites. I would like the output file(s) to list the two alleles from each site, the allele frequency and the coverage. I am hoping for suggestions on how to retrieve this information from the BAM files using Galaxy.
BAM files contain reads aligned to a reference genome. You need to call variants first; FreeBayes, for example, is a good caller. You may want to process the alignments before calling variants, for example by left-aligning indels. The VCF format contains information about variants, including depth of coverage. Once you have a list of variants, IDs from dbSNP can be added to them using tools like SnpSift Annotate SNPs from dbSnp. Have a look at GTN tutorials in
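To illustrate what can be pulled out of a VCF once variants are called, here is a minimal Python sketch. It assumes FreeBayes-style INFO tags (DP for total depth, RO for reference observations, AO for alternate observations); other callers may use different tags. The example record is made up, and the function names are only for illustration.

```python
# Minimal sketch: pull alleles, depth and allele frequency out of VCF records.
# Assumption: INFO carries FreeBayes-style tags DP, RO and AO
# (AO is comma-separated when there is more than one ALT allele).

def parse_info(info_field):
    """Turn 'DP=40;RO=18;AO=22' into a dict of strings."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        else:
            info[item] = True  # flag without a value
    return info


def summarize_record(line):
    """Return site, alleles, depth and observed frequency of each ALT allele."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, snp_id, ref, alt = fields[:5]
    info = parse_info(fields[7])
    depth = int(info["DP"])
    alt_counts = [int(x) for x in info["AO"].split(",")]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": snp_id,
        "ref": ref,
        "alt": alt.split(","),
        "depth": depth,
        "ref_count": int(info["RO"]),
        # Fraction of reads supporting each ALT allele at this site.
        "alt_freq": [count / depth if depth else 0.0 for count in alt_counts],
    }


# Made-up example record, not real data.
example = "chr1\t12345\t.\tA\tG\t500\t.\tDP=40;RO=18;AO=22"
summary = summarize_record(example)
print(summary)
```

Running something like this over every non-header line of the VCF would give one row per site, which is close to the table asked for in the question.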
Thank you Igor for responding to my request for help. I understand the steps I would take to obtain the allelic information. However, it seems like more steps than would be necessary, given that I already have a list of SNPs I am interested in, and am not trying to discover new ones. What has me stuck is that when I visualize the reads on IGV, searching by SNP location, all of the allelic information that I want is present. The problem is that I need to copy that information separately from each locus. Isn’t there a way to retrieve the information for multiple SNPs at once?
BAM files contain information about reads mapped to the genome, while IGV displays pileup-style data in the coverage track at the top of the alignment. It converts information about mapped reads into “coverage”. This is also what happens during variant calling.
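As a rough illustration of that conversion, the Python sketch below counts the bases that overlapping reads show at one position. It is a toy model only: reads are invented, alignments are assumed ungapped, and real pileup tools also handle CIGAR strings, base quality and mapping quality.

```python
from collections import Counter


def pileup_at(reads, position):
    """Count the bases that overlapping reads show at one reference position.

    reads: list of (start, sequence) tuples with a 1-based start and an
    ungapped alignment (a simplifying assumption for this sketch).
    """
    counts = Counter()
    for start, sequence in reads:
        offset = position - start
        if 0 <= offset < len(sequence):  # read overlaps the position
            counts[sequence[offset]] += 1
    return counts


# Toy reads, invented for illustration.
toy_reads = [(100, "ACGTA"), (102, "GTAAC"), (103, "CAACG"), (90, "TTTT")]
counts = pileup_at(toy_reads, 103)
coverage = sum(counts.values())
print(dict(counts), coverage)
```

The per-base counts are the allelic information visible in IGV, and their sum is the coverage at that position.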
I don’t know if information about variants and coverage can be extracted in IGV. It might be doable using intersection or other options, but I would not do it: IGV shows total coverage for a position, so it is easy to get confused by multi-mapped reads.
On the other hand, FreeBayes can call variants in all samples and produce a single table. To speed things up, you can call variants only around the target sites.
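One way to restrict calling to the target sites is to turn the SNP list into a BED file of small windows and supply it as the region limit (command-line FreeBayes accepts a BED file through its `--targets` option; I am assuming the Galaxy tool's region-limiting setting takes the same kind of file). A minimal Python sketch, with a hypothetical SNP list and an arbitrary padding of 50 bp:

```python
def snps_to_bed(snps, padding=50):
    """Convert (chrom, 1-based position, name) tuples to BED lines.

    BED intervals are 0-based and half-open, so a SNP at 1-based position p
    with no padding becomes start = p - 1, end = p.
    """
    lines = []
    for chrom, pos, name in snps:
        start = max(0, pos - 1 - padding)  # never go below the chromosome start
        end = pos + padding
        lines.append(f"{chrom}\t{start}\t{end}\t{name}")
    return lines


# Hypothetical targets; replace with the real list of ~250 SNPs.
targets = [("chr1", 12345, "snp_A"), ("chr2", 30, "snp_B")]
bed_lines = snps_to_bed(targets, padding=50)
print("\n".join(bed_lines))
```

The chromosome names have to match those used in the BAM files (for example `chr1` versus `1`), otherwise no reads will fall inside the regions.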