Project help (variant analysis)


(Note: I am a complete beginner at this.) I've recently been taking a Coursera course on using Galaxy, and I have a project in which I have to:

"Create a Galaxy workflow to identify polymorphic sites in all three individuals. Your workflow will need to map the three sets of paired reads to the appropriate reference genome. You will then need to use a variant caller to identify sites that appear to have strong support for the presence of a polymorphism, and call the genotype at that site for each sample.

You should report your results in VCF (variant call format). You should only include sites where the chance of a false positive call is 1 in 10,000 or better according to the VCF qual field.

Using your resulting VCF determine 1) the number of single nucleotide variants, 2) the number of insertion/deletion variants, 3) the number of multi-nucleotide variants, 4) the number of variants with multiple alternate alleles, and 5) the names of the 5 genes with the largest number of polymorphic sites."
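Questions 1 to 4 above come down to classifying each record in the VCF. Here is a minimal sketch in Python, assuming the VCF comes from FreeBayes, which writes a TYPE entry (snp, mnp, ins, del or complex) in the INFO column, one value per alternate allele. The function name, the bucket names and the example records are made up for illustration. Counting a site with several alternate alleles only in the multiple_alt bucket is just one possible reading of the assignment.

```python
from collections import Counter


def count_variant_types(vcf_lines):
    """Tally VCF records by variant type.

    Assumes a FreeBayes-style TYPE=... entry in the INFO column (column 8),
    with one comma-separated value per alternate allele.
    """
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        alts = fields[4].split(",")  # ALT column, comma-separated alleles
        info = dict(
            item.split("=", 1) for item in fields[7].split(";") if "=" in item
        )
        types = info.get("TYPE", "").split(",")
        if len(alts) > 1:
            counts["multiple_alt"] += 1  # more than one alternate allele
        elif types == ["snp"]:
            counts["snv"] += 1
        elif types[0] in ("ins", "del"):
            counts["indel"] += 1
        elif types == ["mnp"]:
            counts["mnv"] += 1
        else:
            counts["other"] += 1  # e.g. complex, or TYPE missing
    return counts


# Made-up records, only to show the expected input shape.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=10;TYPE=snp",
    "chr1\t200\t.\tAT\tA\t60\t.\tDP=12;TYPE=del",
    "chr1\t300\t.\tAC\tGT\t45\t.\tDP=9;TYPE=mnp",
    "chr1\t400\t.\tA\tG,T\t70\t.\tDP=20;TYPE=snp,snp",
]
print(count_variant_types(example))
```

In Galaxy the same counts can be obtained by filtering on TYPE and counting lines; the script is only a way to cross-check the numbers on a downloaded file.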

I am currently at the stage where I have used FreeBayes to call variants and used VCFfilter to keep only sites where the chance of a false positive call is 1 in 10,000 or better. As I have no prior experience in this topic, I was wondering if someone could help with the last steps of this project and explain how the data analysis should be done.

Below is something I found describing the kind of steps we should take in this project; however, I don’t understand the last ones.

step 1: load data - load the data from local files, setting the “fastqsanger” format and the “hg19” database on the starting page

step 2: check quality of all sequencing files - use FastQC tool (version: 0.63) to check the quality of the reads

step 3: mapping - use BWA-MEM tool (version: 0.1) to map the reads to the reference genome (choose hg19 as reference) in paired-end mode

step 4: add or replace read groups - label each mapping file with its own read group using AddOrReplaceReadGroups (version: 1.126.0)

step 5: merge the 3 individual mapping files - use MergeSamFiles (version: 1.126.0)

step 6: filter - use the filter tools: Filter (version: 1.126.0, removes low-quality mappings), MarkDuplicates (version: 1.126.0, filters out duplicated mappings), CleanSam (version: 1.126.0)

step 7: identify polymorphic sites - use FreeBayes tool (version: 0.4) to identify polymorphic sites based on the hg19 genome

step 8: filter out false positive sites - use VCFfilter (version: 0.0.3) to select sites where the chance of a false positive call is 1 in 10,000 or better.

step 9: extract the workflow and download the final vcf file for further analyses.
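For step 8, the VCF QUAL field is Phred-scaled, meaning QUAL = -10 * log10(probability that the call is wrong), so a false positive chance of 1 in 10,000 corresponds to a QUAL of about 40. Here is a minimal sketch in Python of that conversion and of the filter it implies. The function names and the example records are made up for illustration, and keeping records with QUAL equal to the threshold (rather than strictly above it) is an assumption about what "or better" means.

```python
import math


def phred_from_error_probability(p_error):
    """Convert an error probability into a Phred-scaled quality value."""
    return -10 * math.log10(p_error)


def filter_by_qual(vcf_lines, min_qual):
    """Keep header lines and records whose QUAL (column 6) is >= min_qual."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        qual = line.split("\t")[5]
        # "." means QUAL is missing; such records are dropped here.
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept


# Threshold for a 1 in 10,000 chance of a false positive call.
min_qual = phred_from_error_probability(1 / 10_000)

# Made-up records, only to show the expected input shape.
example_records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t55\t.\tTYPE=snp",
    "chr1\t200\t.\tC\tT\t12.5\t.\tTYPE=snp",
    "chr1\t300\t.\tG\tA\t.\t.\tTYPE=snp",
]
kept = filter_by_qual(example_records, min_qual)
print(round(min_qual, 3), len(kept))
```

In the Galaxy VCFfilter tool the equivalent would be a filter expression on QUAL using this threshold.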

Stage 2 - analyze data of polymorphic sites based on vcf file

step 10: load data - set the format as “vcf” and the genomic database as hg19

step 11: identify the number of SNPs, MNPs, deletions, insertions or complex variants - use VCFfilter tool (version: 0.0.3) to select the different types of polymorphism (for example, -f “TYPE = snp” selects SNPs only), then use Filter tool (version: 1.1.0) to find duplicated polymorphisms

step 12: identify genes with polymorphic sites - use ANNOVAR Annotate VCF tool (version: 0.1) to annotate the vcf file from step 10

step 13: count polymorphic sites for each gene - use Group tool (version: 2.1.0, grouping by gene name) to count the number of polymorphic sites for each gene

step 14: sort the results from step 13 using Sort tool (version: 1.0.3, in descending order).
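Steps 13 and 14 amount to a group-by-and-count followed by a descending sort. Here is a minimal sketch of the same idea in Python, assuming the annotated file has already been reduced to one gene name per polymorphic site. The function name and the gene names below are placeholders, not real results.

```python
from collections import Counter


def top_genes(gene_per_site, n=5):
    """Return the n genes with the most polymorphic sites, largest first.

    gene_per_site: one gene name per polymorphic site, e.g. taken from the
    gene column of the annotated table. Empty names (sites that fall outside
    any gene) are ignored.
    """
    counts = Counter(gene for gene in gene_per_site if gene)
    return counts.most_common(n)


# Placeholder gene names, only to show the expected input shape.
genes = ["GENE_A", "GENE_B", "GENE_A", "", "GENE_C", "GENE_A", "GENE_B"]
for gene, n_sites in top_genes(genes):
    print(gene, n_sites)
```

If a site is annotated with more than one gene, it would need to be split into one entry per gene before counting; how to treat such sites is a choice the assignment does not spell out.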