Base mismatch in Arabidopsis TAIR10 reference genome

I'm a new Galaxy user on the free version, with access to tools only and no command line. The positions of variants in my variant caller VCF output match known SNPs, but the bases themselves do not. Any suggestions why the reference bases at those positions don't match other reviewed sources when both use Arabidopsis thaliana TAIR10?

Welcome @libby

If there is a base mismatch between data sources for the same species, the usual reason is that the sources are based on different versions of the reference sequence assembly.

The version hosted at the public UseGalaxy servers is available here. The download directory contains all of the underlying data files that are indexed in tools for this genome.

Inside the download directory are two README files with some more details about the original source and pre-processing.

These are the primary details:

ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/TAIR10_chr1.fas through TAIR10_chr5.fas, TAIR10_chrC.fas, and TAIR10_chrM.fas
Change chromosome headers to be like chr1, chrC, etc.
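For anyone preparing their own copy of the fasta files, the header change described above could look roughly like the sketch below. It is only an illustration, not the actual pre-processing script used for the Galaxy indexes. The function names are hypothetical, and it assumes each header starts with a chromosome label such as `Chr1`, `1`, or `ChrC` followed by an optional description.

```python
import re


def rename_header(line: str) -> str:
    """Rewrite a FASTA header to the chr-prefixed style, e.g. '>Chr1 ...' -> '>chr1'.

    Assumption: the first whitespace-separated token after '>' is the
    chromosome label, with or without a 'Chr'/'chr' prefix.
    Non-header lines and empty headers are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    tokens = line[1:].split()
    if not tokens:
        return line
    # Drop any existing prefix (case-insensitive), then add a lowercase 'chr'.
    label = re.sub(r"^chr", "", tokens[0], flags=re.IGNORECASE)
    return f">chr{label}"


def rename_fasta_headers(lines):
    """Yield FASTA lines (without trailing newlines) with headers renamed."""
    for line in lines:
        yield rename_header(line.rstrip("\n"))
```

Note that the rest of the header description is dropped, so only the short name remains. Sequence lines pass through untouched, which means the bases and coordinates stay exactly as they were in the source file.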


The following files contain the fasta-formatted complete sequences of the 5 Arabidopsis chromosomes:

TAIR10_chr1.fas
TAIR10_chr2.fas
TAIR10_chr3.fas
TAIR10_chr4.fas
TAIR10_chr5.fas   

Chloroplast chromosome:
TAIR10_ChrC.fas

Mitochondria chromosome:
TAIR10_ChrM.fas



These files provide details of the genome assembly updates:
TAIR8_Assembly_updates.xls
TAIR9_Assembly_updates.xls

Please note that assembly changes in TAIR8 only consisted of substitutions while TAIR9 assembly changes also included insertions and deletions. Therefore, coordinates of most genes changed from TAIR8 to TAIR9. 
In TAIR10, no assembly updates were made.  
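To illustrate why that note matters when comparing positions across releases, here is a toy sketch with a made-up 10-base sequence and hypothetical helper names. A substitution leaves every downstream coordinate where it was, while an insertion shifts everything after it.

```python
def apply_substitution(seq: str, pos: int, base: str) -> str:
    """Replace the base at 1-based `pos`; the sequence length does not change."""
    return seq[:pos - 1] + base + seq[pos:]


def apply_insertion(seq: str, pos: int, inserted: str) -> str:
    """Insert `inserted` after 1-based `pos`; downstream bases shift right."""
    return seq[:pos] + inserted + seq[pos:]


def start_of(seq: str, feature: str) -> int:
    """Return the 1-based start of the first occurrence of `feature` in `seq`."""
    return seq.index(feature) + 1


old = "AACCGGTTAG"   # toy sequence, not real Arabidopsis data
feature = "TTAG"     # toy "gene" located downstream of the edits

substituted = apply_substitution(old, 2, "T")   # substitution-only update
inserted = apply_insertion(old, 2, "GG")        # update that adds two bases

print(start_of(old, feature))          # start before any change
print(start_of(substituted, feature))  # same start after a substitution
print(start_of(inserted, feature))     # start moved right by the insertion length
```

So two sources can agree on a position and disagree on the base (substitution-type difference), or agree on the bases but disagree on the position (insertion or deletion upstream).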

Does this help? If not, would you like to share some details about the files you are comparing and where they were sourced? We can look into this closer to get it fully resolved. :slight_smile:

Hi @jennaj. Thank you for the response. Yes, it seems to be an issue with the initial mapping. I am comparing VCF files that were produced in Galaxy, starting from the raw reads. Compared with platforms such as ePlant and data from the 1001 Genomes website, both of which use TAIR10 as their reference genome, the reference base seems to be different at each of these positions. For example, at Chr4:268403 the other sources say the TAIR10 reference base is T, but my data processed through Galaxy says the ref is C and the alt is T. Hope this makes sense, let me know if you would like any more info!

Hi @libby

Yes, the other sites are probably using a different release, possibly a patch release. Their release notes will likely list the changes made over time.
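If you or a collaborator can run Python outside of Galaxy, one way to confirm what a given fasta file carries at a position such as Chr4:268403 is a small lookup like this sketch. The function name `base_at` is hypothetical, and it assumes a plain-text (uncompressed) fasta file and a 1-based position, which is how VCF reports positions.

```python
def base_at(fasta_path: str, chrom: str, pos: int):
    """Return the base at 1-based `pos` on sequence `chrom`, or None if absent.

    Assumption: `chrom` must equal the first token of the header line exactly
    (for example 'chr4' in one file and 'Chr4' in another).
    """
    if pos < 1:
        raise ValueError("pos must be 1-based (>= 1)")
    offset = pos - 1          # bases still to skip within the target sequence
    in_target = False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if in_target:
                    return None   # reached the next sequence: pos is past the end
                tokens = line[1:].split()
                in_target = bool(tokens) and tokens[0] == chrom
            elif in_target:
                if offset < len(line):
                    return line[offset]
                offset -= len(line)
    return None
```

A call might look like `base_at("TAIR10_chr4.fas", "Chr4", 268403)`, where the header name `Chr4` is a guess and should be checked against the first line of the file. Running the same lookup on the fasta behind each data source shows directly whether the references differ at that position.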

The default indexes in Galaxy are just one choice. If you would like to use a different genomic release version for follow-up analysis in Galaxy, you could upload the genomic fasta file and use it as a custom genome. This method works with most tools, and you could also create a custom build (database/dbkey) for cross-links with display applications like IGV.

If this interests you, please see → Custom genome + custom build: How to use a genome that is not natively indexed at the server you are working at - #2 by jennaj

Any questions with this, please ask! :slight_smile: