Filtering by region using a list of contigs

Dear Galaxy
I have been trying to extract SnpEff annotation information from a multisample VCF file using SnpSift Extract Fields.
This results in a fatal error, as does any attempt to filter (e.g. to remove LOW effect variants) using SnpSift.
However, smaller subsets created using the MiModD VCF Filter tool work fine.
Perhaps the data is too big? Or perhaps there is a problem with a record outside the smaller sets?
To address the first option, I could split the data by region.
I can enter 17 chromosomes into a filter manually in batches, but there are more than 40 additional contigs.
I have been trying to get bcftools to use a list of contigs to get around this. I have uploaded a tab-separated list file and changed its datatype to tabular.
But I still get fatal errors. It would be good to know what I am doing wrong, and I’m sure it would be useful to others to have some more details on how to use lists to filter by region successfully.

The message is:

Could not parse 1-th line of file, failed to read the regions

The online help gives an internal server error.

the history is available here

url: Galaxy
I hope you can help
Carolyn Greig

Hi @cgreig

Thanks for sharing the history, very helpful! These little puzzles of exactly what to use with this tool are tricky. You were very close.

Try using a three-column tabular file. Your existing BED file, with the header line removed, worked when I tested it against the VCF in dataset 1. The Slice VCF tool wasn’t working with it because it rejects the 4.2 format (changing the version on that first line of the VCF to 4.1 might be enough to get around it).

Gotchas

  1. Extra “whitespace” in the file. A tool like Convert delimiters to TAB can clean that up for you.

  2. Regions not listed in exactly the same order as the IDs in the VCF. The tool parses the file line by line, so just be aware of that.
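
For anyone checking a regions file outside Galaxy, the formatting problems that came up here (stray whitespace, a leftover header line, the wrong number of columns) can be screened for with a few lines of Python. This is only a rough sketch: `check_regions` is a made-up helper name, it assumes the three-column, tab-separated, header-free layout discussed in this thread, and it does not compare the region order against the VCF.

```python
def check_regions(lines):
    """Return a list of (line_number, problem) tuples for a regions file.

    Assumed layout (from this thread, not from any tool's source code):
    contig name, start, end, separated by single TAB characters, no header.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            problems.append((number, "empty line"))
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append(
                (number, f"expected 3 tab-separated columns, found {len(fields)}")
            )
            continue
        chrom, beg, end = fields
        if any(field != field.strip() or " " in field for field in fields):
            problems.append((number, "stray space inside a column"))
        elif not (beg.isdigit() and end.isdigit()):
            problems.append((number, "start/end are not whole numbers (header line?)"))
        elif int(beg) > int(end):
            problems.append((number, "start is greater than end"))
    return problems
```

Calling it on the lines of the file (for example `check_regions(open("regions.tsv").readlines())`, where `regions.tsv` is a placeholder name) returns an empty list when nothing suspicious is found.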

Please give that a try and let us know if it works! :slight_smile:

Dear JennaJ
Thank you so much, I’ve taken your advice and it’s all working now. I was already using a 3-column tabular file and had the regions just as they are in the VCF file, so that wasn’t the problem. But I did have whitespace in the chromosome name column - once that was removed it worked like a charm… So here are some better instructions than I could find - hopefully they will help the next frustrated person - it is indeed a very tricky tool… Thanks again

How to split by location using a list file with bcftools filter
Take the chromosome list from the VCF file and paste it into a text file.
Use find and replace to turn it into a 3-column tab-separated list, with the chromosome names exactly as in the VCF file
…and in the same order.
Use the start and end locations for the 2nd and 3rd columns,
e.g. like this:
CHROM BEG END
contig_18 1 146728
contig_19 1 104829
contig_20 1 104799

BUT with no header line and NO SPACES (check with find and replace), listing exactly the regions you would like to extract. Once the text file is perfect, upload it to Galaxy, where it will become a dataset in your history, then click on the pencil icon and change the datatype to tabular.
It is then ready to use with bcftools filter as a list file of locations to extract.
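
If the VCF header carries `##contig=<ID=...,length=...>` lines (many do, but that is an assumption about your file), the manual copy-and-paste steps above could be scripted. The following is a minimal sketch, not a tested pipeline: `regions_from_header` is a made-up name, and the comma splitting is naive, so it would need adjusting if any header value contains a quoted comma.

```python
import re

# Matches a VCF header line such as: ##contig=<ID=contig_18,length=146728>
CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def regions_from_header(vcf_lines):
    """Build 'name<TAB>1<TAB>length' rows from ##contig header lines.

    Rows come out in header order, with no header row and no spaces.
    Contigs without a length= entry are skipped.
    """
    rows = []
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # metadata lines are over once #CHROM or a record appears
        match = CONTIG_RE.match(line)
        if not match:
            continue
        # Naive key=value parsing; assumes no commas inside quoted values.
        info = dict(
            part.split("=", 1) for part in match.group(1).split(",") if "=" in part
        )
        if "ID" in info and "length" in info:
            rows.append(f"{info['ID']}\t1\t{info['length']}")
    return rows
```

Writing the returned rows to a text file, one per line, gives a list in the same shape as the example above, ready to upload and set to tabular. Note that header order is not guaranteed to match the order of the records, so it is still worth checking against the data.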

Great, I’m glad that worked. Thanks for sharing your steps back! It shouldn’t be necessary to leave Galaxy for this so I am going to share a bit more for others reading later.

The Convert delimiters to TAB tool I listed above will strip out all “extra” whitespace (spaces, tabs) from any tabular file.

More manipulations:

For your file with the BED coordinates, I ran these two tools to reformat it.

  • Removed the header with → Remove beginning of a file
  • Cleared out any potential excess whitespace with → Convert delimiters to TAB
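
For readers who do end up preparing the file outside Galaxy, the combined effect of those two steps can be approximated in Python. This is a sketch of the same outcome, not the Galaxy tools' actual implementation; `clean_regions` and its `header_lines` parameter are names invented for this example.

```python
def clean_regions(lines, header_lines=1):
    """Drop the leading header line(s) and rewrite each row as TAB-separated.

    Any run of spaces or tabs is treated as one delimiter, and leading or
    trailing whitespace is discarded. Blank lines are skipped.
    """
    cleaned = []
    for line in lines[header_lines:]:
        fields = line.split()  # splits on any whitespace run
        if fields:
            cleaned.append("\t".join(fields))
    return cleaned
```

Pass `header_lines=0` if the file has no header row to remove.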

If instead one is starting with a VCF, I would use the Select tool to pull out the target lines, then use a combination of tools to reformat them. Running that many tools repeatedly could get tedious, so they could all be placed in a mini-workflow instead. If that is “favorited” it will show up in the tool panel like a custom tool (especially if you hide the intermediate files). Or, add those reformatting steps as a sub-workflow to your main workflow that is creating the VCF and running the downstream steps :slight_smile:

From where you are now, you could consider extracting a workflow for next time. This tutorial is a good place to start for anyone new to workflows or text manipulations in Galaxy.

Very glad this worked out!!