Bcftools convert to vcf skipping all rows

Hi @zhenderson1

Thanks for sharing the screenshot details! Very helpful.

The issue has to do with a mismatch between the chromosome identifiers (names). Your file uses an identifier with the format 20 while the UCSC reference genome indexed on the server uses the format chr20.

Description

This is a common hurdle for people working with bioinformatics data, in particular when moving data between platforms. I’ll post some details about the difference if you are curious about the “why”.

What to do

The last time I checked closely, 23andMe was using the GATK version of the hg19 reference genome.

We have this indexed in Galaxy! However, it is not available for the bcftools tools. And, you probably would not want to use this version of the genome labeling since this will somewhat restrict what you can do with the results later!

To be clear: the GRCh37/b37 genome assemblies all use the same base coordinate system as the other “hg19” genomes. The difference is the chromosome labeling (identifier names) and possibly which sequences are included (some versions contain additional haplotype and alt sequences, which are usually not included for genotyping studies). All of them include chromosome 20, aka chr20.

Options

  1. Convert to the UCSC version of the identifier labels.
  • Since you only have four lines of data, modifying the file directly in a text editor would be straightforward, as long as you are very careful not to change the whitespace (tabs, spaces) in your file. Excel would NOT be recommended (ask me why!). Or, you can use a tool, or even several tools!
  • Great: TextEdit on a Mac or the equivalent on a PC. Add chr in front of each current 20 identifier to create chr20 on each line.
  • Better: Replace Text in a specific column in Galaxy. Since all of the identifiers are the same, your search/find pattern will be simple. This is what I would do with this file.
  • For data with many more and different rows: Replace column by values which are defined in a convert file can be used. Common convert files are available! See the bottom of the tool form for the public repositories scientists often use, or create your own.
  2. Continue to use the hg_g1k_v37 GATK reference genome version.
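
If you would rather script the first option outside of Galaxy, here is a minimal Python sketch of the same idea. It is only an illustration, not what the Galaxy tools do internally. It assumes a tab-delimited file with the chromosome in the second column and comment lines starting with #; the function name add_chr_prefix and the example row are made up for the demonstration, so adjust the column number to match your actual file.

```python
def add_chr_prefix(lines, column=1, prefix="chr"):
    """Yield lines with `prefix` added to the identifier in one tab-delimited column.

    `column` is zero-based; column=1 (the second column) is an assumption
    about the file layout and should be changed to match your data.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        # Pass blank lines and comment/header lines through unchanged.
        if not stripped or stripped.startswith("#"):
            yield line
            continue
        fields = stripped.split("\t")  # split on tabs only, so other whitespace is preserved
        # Only add the prefix if the column exists and is not already prefixed.
        if column < len(fields) and not fields[column].startswith(prefix):
            fields[column] = prefix + fields[column]
        newline = "\n" if line.endswith("\n") else ""
        yield "\t".join(fields) + newline


# Hypothetical example rows (not taken from the original file).
example = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rs0000001\t20\t12345\tAG\n",
]
converted = list(add_chr_prefix(example))
print("".join(converted), end="")
```

To run it on a real file, open the input and pass the file object as lines, then write each yielded line to a new output file. Rows that already carry the chr prefix are left alone, so running it twice is harmless.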


You have some choices! Please give these a try and let us know what worked for you! :slight_smile:
