Understanding bcftools isec output files (0000.vcf, 0001.vcf, 0002.vcf, 0003.vcf, sites.txt, README.txt)

Hi everyone,

I’m using bcftools isec to compare two VCF files with the following command:

singularity exec bcftools_1.21--h3a4d415_1.sif bcftools isec Sample1.classified.somatic.vcf.gz Sample2.classified.somatic.vcf.gz -p vcf_compare_output

This command generates the following output files in the vcf_compare_output directory:

0000.vcf
0001.vcf
0002.vcf
0003.vcf
sites.txt
README.txt

I’m looking for a clear explanation of what exactly the data in each of these files represents. While the README.txt provides some information, I’d appreciate a more in-depth understanding, especially regarding the 0000.vcf through 0003.vcf files and why there are two VCFs for common variants.

Any insights or detailed explanations would be greatly appreciated!

Thanks in advance.

Hi @Vaishnavi_Gawde

It sounds like you are using the tool on the command line, not inside of Galaxy. Please see GalaxyProject.org for how this special deployment of tools and workflows works. :slight_smile:

For a short summary of what the tool is doing, you could review the Galaxy tool form as a reference. Example link to the tool at one of our public servers → https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/bcftools_isec/bcftools_isec/1.15.1+galaxy4

Or, you can review the tool documentation here → bcftools(1)

Or, you can ask at a Bioinformatics forum where the other users are also using tools on the command line. Biostars.org and Seqanswers.com are good places to start.

And finally, I’ve also found that all of the newer chatbots are pretty good at scooping up all the prior discussions and tool manuals and serving them up for Q&A. Probably because these are generally good at “fact types” of discussions.

This is what the 4o model from ChatGPT gave me just now. It looks pretty good! But double check things of course, since it likes to imagine what a tool *could be*, not always what it *actually is*.

Here’s a detailed breakdown of what each file produced by bcftools isec means, specifically when comparing two VCF files:

:open_file_folder: Output Files from bcftools isec

When you run:

bcftools isec A.vcf.gz B.vcf.gz -p output_dir

You’ll always get:

README.txt

sites.txt

Up to 4 VCF files: 0000.vcf, 0001.vcf, 0002.vcf, 0003.vcf

:magnifying_glass_tilted_left: What Each VCF File Represents

For two input files (let’s say A.vcf.gz and B.vcf.gz) and no -n or -C option, bcftools isec writes four VCFs. The numbers in the file names are sequential indices, not bitmasks:

| File | Contents |
|---|---|
| 0000.vcf | Records private to file A (Sample1) |
| 0001.vcf | Records private to file B (Sample2) |
| 0002.vcf | Records from file A that are shared with file B |
| 0003.vcf | Records from file B that are shared with file A |

The README.txt in the output directory lists this mapping for your exact run.

So for your command:

bcftools isec Sample1.classified.somatic.vcf.gz Sample2.classified.somatic.vcf.gz -p vcf_compare_output

Your output files mean:

0000.vcf: Variants found only in Sample1

0001.vcf: Variants found only in Sample2

0002.vcf: Variants found in both, written as the records from Sample1

0003.vcf: Variants found in both, written as the records from Sample2

:brain: Why are there two VCFs for common variants?

Both 0002.vcf and 0003.vcf describe the shared sites, but each keeps the records exactly as they appear in one input: 0002.vcf carries Sample1’s INFO, FORMAT, and sample columns, and 0003.vcf carries Sample2’s. If the files do not look the way you expect, check whether:

You’ve used the -n option to restrict what gets output.

Records at the same position have different alleles. Matching uses position and alleles, not annotations, so differing INFO fields alone do not make two records distinct.

Your VCFs have different contig naming or inconsistent normalization (left-aligning indels, etc.)
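One way to check this on your own output is to compare the site keys (CHROM, POS, REF, ALT) of the two shared-variant files. This is a minimal sketch, assuming the default uncompressed .vcf output and that 0002.vcf and 0003.vcf hold the shared records from the first and second input, as the generated README.txt describes. The helper names `site_keys` and `same_sites` are made up for illustration.

```shell
# Print CHROM, POS, REF, ALT for every record in an uncompressed VCF,
# skipping header lines (those starting with '#').
site_keys() {
    grep -v '^#' "$1" | cut -f1,2,4,5 | sort
}

# Report whether two VCFs cover exactly the same sites,
# ignoring QUAL, FILTER, INFO, FORMAT, and sample columns.
same_sites() {
    if [ "$(site_keys "$1")" = "$(site_keys "$2")" ]; then
        echo "same sites"
    else
        echo "different sites"
    fi
}

# Usage (paths are assumptions based on the command in the question):
# same_sites vcf_compare_output/0002.vcf vcf_compare_output/0003.vcf
```

If the two files report the same sites but a plain diff of the files shows differences, those differences come from the per-sample annotations rather than from the variants themselves.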

:page_facing_up: Other Files

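sites.txt: A tab-separated table with one row per site (CHROM, POS, REF, ALT) followed by a string of 0/1 flags, one flag per input file in command-line order. For two inputs, 10 means the site is only in the first file, 01 only in the second, and 11 in both.

README.txt: A brief summary of the command that was run and which output file corresponds to which input and category.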
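To get a quick count of private versus shared sites, the flag column of sites.txt can be tallied directly. The sketch below assumes the flag string is the fifth tab-separated column (after CHROM, POS, REF, ALT); `tally_sites` is a hypothetical helper name.

```shell
# Count how many sites carry each presence flag in sites.txt.
# Column 5 is assumed to be the flag string: one 0/1 character per
# input file, in command-line order (10 = first file only,
# 01 = second file only, 11 = both).
tally_sites() {
    awk -F'\t' '{ count[$5]++ } END { for (flag in count) print flag "\t" count[flag] }' "$1" | sort
}

# Usage (path is an assumption based on the command in the question):
# tally_sites vcf_compare_output/sites.txt
```

The output is one line per flag with its count, which should line up with the record counts in the corresponding VCF files.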

:white_check_mark: Tips

Use bcftools isec -n +2 if you only want variants shared across both files.

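Use -c (--collapse) to control how records at the same position are matched. For example, the default -c none requires identical REF and ALT alleles, while -c all treats records at the same position as matching regardless of their ALT alleles.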

Normalize your VCFs before comparison with bcftools norm or vt normalize.
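The normalization tip could look roughly like this in practice. It is a sketch under stated assumptions, not a tested pipeline: the reference FASTA path, the output file naming, and the function names `normalize_vcf` and `isec_normalized` are all hypothetical, and splitting multiallelic records with -m -any is one possible choice rather than a requirement.

```shell
# Normalize one VCF: left-align indels against the reference and split
# multiallelic records, then write a tabix index so isec can read it.
normalize_vcf() {
    ref=$1; src=$2; dst=$3
    bcftools norm -f "$ref" -m -any -Oz -o "$dst" "$src"
    bcftools index -t "$dst"
}

# Normalize both inputs, then intersect them into the given directory.
# The *.norm.vcf.gz names are illustrative only.
isec_normalized() {
    ref=$1; a=$2; b=$3; outdir=$4
    normalize_vcf "$ref" "$a" "$outdir.A.norm.vcf.gz"
    normalize_vcf "$ref" "$b" "$outdir.B.norm.vcf.gz"
    bcftools isec -p "$outdir" "$outdir.A.norm.vcf.gz" "$outdir.B.norm.vcf.gz"
}

# Usage (ref.fa is a placeholder for the reference used to call the variants):
# isec_normalized ref.fa Sample1.classified.somatic.vcf.gz Sample2.classified.somatic.vcf.gz vcf_compare_output
```

Normalizing both files against the same reference before the comparison reduces the chance that one indel, represented two different ways, ends up counted as private to each sample.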

Let me know if you’d like a command-line one-liner to extract shared variant positions or compare INFO fields!

Hope this helps! And if you decide to try working in Galaxy, this is the introduction session from a recent training event we held. The Slack channel is now closed, but the materials will be online “forever”. :slight_smile:

  • :graduation_cap: Galaxy Training Academy 2025
  • You can use Galaxy through the web or API or some combination. Maybe API for HTP batch processing and web for sharing data, developing workflows, or learning bioinformatics.
  • More details at → GalaxyProject.org