Hi @zhenderson1
Thanks for sharing the screenshot details! Very helpful.
The issue has to do with a mismatch between the chromosome identifiers (names). Your file uses an identifier with the format 20 while the UCSC reference genome indexed on the server uses the format chr20.
Description
This is a common hurdle for people working with bioinformatics data, in particular when moving data between platforms. I’ll post some details about the difference if you are curious about the “why”.
- Reference genomes at public Galaxy servers: GRCh38/hg38 example
- FAQ: Mismatched Chromosome identifiers and how to avoid them
What to do
The last time I checked closely, 23andMe was using the GATK version of the hg19 reference genome.
We have this indexed in Galaxy! However, it is not available for the bcftools tools. And, you probably would not want to use this version of the genome labeling since this will somewhat restrict what you can do with the results later!
To be clear: the GRCh37/b37 genome assemblies all use the same base coordinate system as the other “hg19” genomes. The difference is the chromosome labeling (identifier names) and possibly which chromosomes are included (other versions may contain more sequences, e.g. additional haplotype and alt versions, which are usually not included for genotyping studies). All will include chromosome 20, aka chr20.
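If you want to see the mismatch for yourself, here is a minimal Python sketch that lists the chromosome labels in a tab-delimited file that are missing from a reference's label list. The function name, the sample rows, and the short reference list are all made up for illustration. It assumes the chromosome identifier is in the first column and that header lines start with `#`.

```python
def unmatched_chromosomes(lines, reference_names, chrom_col=0):
    """Return the sorted chromosome labels found in `lines` that are not in `reference_names`."""
    seen = set()
    for line in lines:
        # Skip blank lines and header/comment lines (assumed to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        seen.add(fields[chrom_col])
    return sorted(seen - set(reference_names))


# Hypothetical example rows and a hypothetical (shortened) list of UCSC-style labels.
sample = ["#CHROM\tPOS\tID\n", "20\t14370\texample_1\n", "20\t17330\texample_2\n"]
ucsc_names = ["chr1", "chr20", "chrX"]

print(unmatched_chromosomes(sample, ucsc_names))  # prints ['20']
```

An empty list would mean every label in the file has a match in the reference.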
Options
- Convert to using the UCSC version of the identifier labels.
- Since you only have four lines of data, modifying the file directly in a text editor would be straightforward, as long as you are very careful not to change the whitespace (tabs, spaces) in your file. Excel would NOT be recommended (ask me why!). Or, you can use a tool and even more tools!
- Great: TextEdit on a Mac or the equivalent on a PC. Add the chr to the current 20 identifiers to create chr20 for each line.
- Better: Replace Text in a specific column in Galaxy. Since all are the same, your search/find will be somewhat simple. This is what I would do with this file.
- For data with many more and different rows: Replace column by values which are defined in a convert file can be used. Common convert files are available! See the bottom of the tool form for the public repositories scientists often use, or create your own.
- Continue to use the hg_g1k_v37 GATK reference genome version.
- A version of the fasta of this genome can be found here.
- http://datacache.galaxyproject.org/indexes/hg_g1k_v37/seq/hg_g1k_v37.fa
- Copy and paste the link into the Upload tool. Use all default settings when loading the data. Allow the file to completely load (this may take a while!).
- Use the option on the bcftools tool form to select a genome from the history instead of a server index. Other tools may also require this choice. You may also need to assign the database key.
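For the first option (converting to the UCSC-style labels), the same edit can be sketched in a few lines of Python if you prefer a script over an editor or a Galaxy tool. This is only an illustration: the function name and sample lines are made up, and it assumes a tab-delimited file with the chromosome in the first column. A plain prefix is fine for numbered chromosomes like 20. It is not enough for the mitochondrial genome, which is labeled MT in the b37-style genomes but chrM at UCSC.

```python
def add_chr_prefix(line, chrom_col=0):
    """Add 'chr' to the chromosome field of one tab-delimited line, leaving tabs untouched."""
    # Leave header/comment lines (assumed to start with '#') and blank lines unchanged.
    if line.startswith("#") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    # Only prefix labels that are not already UCSC-style.
    # Note: this does not handle MT -> chrM, which needs an explicit mapping.
    if not fields[chrom_col].startswith("chr"):
        fields[chrom_col] = "chr" + fields[chrom_col]
    return "\t".join(fields) + newline


# Hypothetical example lines; real data would be read from your own file.
rows = ["#CHROM\tPOS\tID\n", "20\t14370\texample_1\n"]
converted = [add_chr_prefix(row) for row in rows]
print(converted)
```

Splitting and rejoining on tabs only keeps the whitespace exactly as it was, which is the part that is easy to break by hand.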
You have some choices! Please give these a try and let us know what worked for you!