I need a tool that can extract bases before a certain position

Hi everyone,

Does anyone know of a command/package/program (either within or outside of Galaxy) that can spit out a file telling me the ~15 or so bases leading up to a certain position in a mouse genome?

More information on what I’m trying to do:
I have 9 whole exome sequencing files. Each of those 9 files has about 50 C to T mutations. I’m required to show which bases lead up to each C to T mutation (for reasons not relevant to explain here). I could do this by hand because I have the exact position of these mutations, but with 9 samples and 50 mutations each, that would drive me insane.

Thanks!


bedtools GetFastaBed is exactly what you want


Thanks for your reply. All my files are in .txt format. Does GetFastaBed work with this file format? If not, is there a way to convert to an acceptable format?

Sorry I just figured out I can use VCF files and I have those too. I was able to generate a fasta file with position # of every base pair. This is awesome, but is there a way I could input a bunch of positions (containing my mutations) and have the program give me a file with all the base pairs leading up to it all at once? I’d like to avoid having to find each position individually and copying and pasting the base pairs leading up to it. Thanks :slight_smile:

I recommend awk, or more specifically gawk. It is specifically designed to parse these kinds of files and allows you to do math operations on the content.

Thank you! I’m very new to command line, which options would you recommend to use with gawk to extract the previous 15 bases leading up to a certain position?

Can you provide me with a sample of the data?

FYI: Galaxy doesn’t have the wrapped version of awk hosted on public servers anymore (there was a problem with it). You could install it in your own Galaxy, just don’t host it publicly.

So if you want to do this all in Galaxy, avoid command-line scripts, keep a complete record of your full work, and (optionally) put the analysis steps into a workflow for reuse…

Convert your data to BED format and work with it from there. Tools to rearrange data are in the group “GENERAL TEXT TOOLS”. Most do not require any coding skills. You could also try VCFtoTab-delimited, then use Cut to pull out just the coordinates you want. VCF start coordinates are 1-based; BED start coordinates are 0-based. Direct conversion tools will interpret the data that way, but other tools won’t, so you’ll need to adjust the coordinates yourself. See the format FAQs below to understand this better, plus this very informative Biostars post.
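
To make the coordinate difference concrete, here is a minimal Python sketch of the arithmetic. The function name `vcf_pos_to_bed` and the example position are made up for illustration; the only thing taken from the explanation above is that VCF positions count from 1 while BED starts count from 0 (with an exclusive end).

```python
def vcf_pos_to_bed(chrom, pos):
    """Return the BED interval (chrom, start, end) for one 1-based position.

    BED starts are 0-based and BED ends are exclusive, so base number `pos`
    (counting from 1) becomes the interval [pos - 1, pos).
    """
    if pos < 1:
        raise ValueError("VCF positions start at 1")
    return (chrom, pos - 1, pos)


# Hypothetical example position, not taken from the thread's data.
print(vcf_pos_to_bed("chr1", 100))  # ('chr1', 99, 100)
```

A tool that copies the VCF position straight into the BED start column without this subtraction will point one base to the right of the intended site.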

The tool “Get flanks returns flanking region/s for every gene” takes a set of coordinates in BED format as input and outputs various ranges, also in BED format. That output can then be given to the FASTA-fetching tool (bedtools GetFastaBed). Do it all in a collection (batch runs) and consider extracting that processing into a workflow afterwards, so you can reuse whatever tools/steps work for you now, should you need to do this again.

The process will go something like this:

  1. Put all of your current text or vcf data into a dataset collection
  2. Manipulate that collection of data to create BED datasets from your current coordinates
  3. Run that collection through the Get Flanks tool to get the upstream coordinates
  4. Run the result collection through bedtools GetFastaBed
  5. Merge the final collection results with the tool Collapse Collection into single dataset in order of the collection (Galaxy Version 4.0)
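
For anyone who wants to see what steps 3 and 4 compute, here is a rough Python sketch on a toy sequence. It is not how Get Flanks or bedtools GetFastaBed are implemented; the helper names `upstream_flank` and `fetch` and the toy sequence are invented for illustration, and the sketch assumes the site is on the forward strand.

```python
def upstream_flank(start, flank=15):
    """BED interval of the `flank` bases immediately before a 0-based start."""
    flank_start = max(0, start - flank)  # do not run past the chromosome start
    return (flank_start, start)


def fetch(sequence, start, end):
    """Slice a sequence using BED-style 0-based, half-open coordinates."""
    return sequence[start:end]


# Toy reference: 20 bases of repeated ACGT, then the mutated C at index 20.
toy_chrom = "ACGT" * 5 + "CAAA"

flank_start, flank_end = upstream_flank(20)
print(flank_start, flank_end)                    # 5 20
print(fetch(toy_chrom, flank_start, flank_end))  # CGTACGTACGTACGT
```

The flank ends where the mutated base begins, so the mutated C itself is not part of the returned 15 bases.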

You might want to try this on one dataset first, to test out what the best steps/tools are to convert text/vcf-to-bed, then run those same steps on the collection (all of your data, at once).

Collection operations are explained here:

  • Dataset collections - modern studies usually include many samples. Collections are designed to simplify complex, multi-sample analyses as shown in this tutorial.
  • Galaxy Training Network, many tutorials include collections but review those that focus just on collections to start with: Galaxy Tips & Tricks > Data Manipulation

FAQs:

Sure! This is what one of the nine files looks like: https://drive.google.com/open?id=1cZfsbqoq99pSrjUKWkIrjuHJMasO6XNa.

Each row is a specific C to T mutation. The columns contain information on chromosome and position and also gene name.

Thanks

Let’s see the gawk command @innovate-invent wants to share.

Otherwise, this file can easily be converted to BED format using other tools. Let me know if you want a how-to. Or wait for Nolen to write back. There are many ways to do what you want to do. Example 1: Extract Genomic DNA is an alternative to bedtools GetFastaBed. Example 2: bcftools query is an alternative to VCFtoTab-delimited. Among similar tools, at least one will usually have a single function, while others have that same base function plus more options.

Important for either method: Do you know if the SNP position is 0-based or 1-based? Check a few positions in a browser like UCSC to confirm (just inspect, you should be able to tell). What that position represents changes the transformation details slightly, yet importantly: you don’t want the data to be 1 base off :nerd_face:
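
If you would rather check against a reference sequence than by eye, the idea can be sketched in Python as below. This is only an illustration: the function name `guess_base_convention` and the toy sequences are made up, and it assumes the reference base at each mutated site really is a C on the forward strand (a site reported from the reverse strand would show a G instead).

```python
def guess_base_convention(sequence, pos, expected_ref="C"):
    """Return the coordinate conventions under which `pos` lands on `expected_ref`."""
    matches = []
    # 0-based: position 0 is the first base of the sequence.
    if 0 <= pos < len(sequence) and sequence[pos].upper() == expected_ref:
        matches.append("0-based")
    # 1-based: position 1 is the first base of the sequence.
    if 1 <= pos <= len(sequence) and sequence[pos - 1].upper() == expected_ref:
        matches.append("1-based")
    return matches


print(guess_base_convention("AAACGGG", 4))  # ['1-based']
print(guess_base_convention("AAACGGG", 3))  # ['0-based']
print(guess_base_convention("ACCG", 2))     # ['0-based', '1-based']
```

The last call shows why a single position is not enough: when two C bases sit next to each other, both conventions match, so it is worth checking several sites.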

I would love a how-to :slight_smile: To check if the SNP position is 0-based or 1-based, do I view the original BAM file (where I got my data set from) in UCSC? If so, I’ll get back to you on that since those files are on the lab computer. Thanks!

This method worked for me :slight_smile: Thank you all for helping me out!

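This command converts your file to BED format with the adjusted coordinates. It skips the first (header) line and any line starting with #, and expects the chromosome in column 1 and the position in column 2: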

gawk 'BEGIN {FS=OFS="\t"} NR>1 && /^[^#]/ { print $1, $2-15, $2 }' sample_data.txt
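
One detail worth checking against the 0-based/1-based question raised earlier: the interval `$2-15` to `$2` covers exactly the 15 preceding bases if the position is 0-based, but if the position is 1-based it covers the mutated base plus the 14 bases before it. Below is a hedged Python sketch of the same conversion with that choice made explicit and with the start clamped at 0. The function name `table_to_upstream_bed`, the `one_based` switch, and the sample rows are assumptions for illustration, not part of the original command or data.

```python
import csv
import io


def table_to_upstream_bed(handle, flank=15, one_based=True):
    """Read a tab-delimited table (header, chrom in col 1, position in col 2)
    and return BED tuples for the `flank` bases before each position."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line, like NR>1 in the gawk command
    rows = []
    for fields in reader:
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        chrom, pos = fields[0], int(fields[1])
        site_start = pos - 1 if one_based else pos  # 0-based index of the site
        rows.append((chrom, max(0, site_start - flank), site_start))
    return rows


# Hypothetical rows; the real file layout may differ.
sample = io.StringIO("chrom\tpos\tgene\nchr1\t100\tGeneA\nchr2\t10\tGeneB\n")
for chrom, start, end in table_to_upstream_bed(sample):
    print(chrom, start, end, sep="\t")
```

With `one_based=False` the intervals match what the gawk command prints, apart from the clamping of negative starts to 0.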