Does anyone know of a command/package/program (either within or outside of Galaxy) that can spit out a file telling me the ~15 or so bases leading up to a certain position in a mouse genome?
More information on what I’m trying to do:
I have 9 whole exome sequencing files. Each of those 9 files has about 50 C to T mutations. I’m required to show which bases lead up to each C to T mutation (for reasons not relevant to explain here). I could do this by hand because I have the exact position of these mutations, but seeing that I have 9 samples and 50 mutations each, that would drive me insane.
Sorry I just figured out I can use VCF files and I have those too. I was able to generate a fasta file with position # of every base pair. This is awesome, but is there a way I could input a bunch of positions (containing my mutations) and have the program give me a file with all the base pairs leading up to it all at once? I’d like to avoid having to find each position individually and copying and pasting the base pairs leading up to it. Thanks
FYI: Galaxy doesn’t have the wrapped version of awk hosted on public servers anymore (there was a problem with it). You could install it in your own Galaxy, just don’t host it publicly.
So if you want to do this all in Galaxy, avoid command-line scripts, keep a complete record of your full work, and (optionally) put the analysis steps into a workflow for reuse…
Convert your data to BED format and work with it from there. Tools to rearrange data are in the groups under “GENERAL TEXT TOOLS”. Most do not require any coding skills. You could also try VCFtoTab-delimited, then use Cut to pull out just the coordinates you want. VCF start coordinates are 1-based; BED start coordinates are 0-based. Direct conversion tools will interpret the data that way. Other tools won’t, so you’ll need to adjust the coordinates yourself. See the format FAQs below to understand this better, plus this very informative Biostars post.
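To make the 1-based vs 0-based adjustment concrete, here is a minimal Python sketch of the coordinate shift. It is only an illustration of the arithmetic, not what any Galaxy tool does internally. The function name `vcf_to_bed_lines` and the toy record are made up for this example, and it assumes plain tab-separated VCF lines with simple SNP records.

```python
def vcf_to_bed_lines(vcf_text):
    """Turn minimal VCF records into 4-column BED lines (chrom, start, end, name)."""
    bed = []
    for line in vcf_text.splitlines():
        # Skip blank lines and VCF header lines (both '##' and '#CHROM').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        start = pos - 1         # VCF POS is 1-based; BED start is 0-based
        end = start + len(ref)  # BED end is exclusive
        bed.append(f"{chrom}\t{start}\t{end}\t{ref}>{alt}")
    return bed


# Hypothetical one-record VCF, for illustration only.
example = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t101\t.\tC\tT\n"
bed_lines = vcf_to_bed_lines(example)
print(bed_lines)
```

So a SNP at VCF position 101 becomes the BED interval 100 to 101. A tool that copies the VCF position straight into the BED start column without subtracting 1 would be off by one base.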
The tool “Get flanks returns flanking region/s for every gene” takes an input set of coordinates in BED format and outputs various ranges, also in BED format. That output can then be given to the get fasta fetching tool. Do it all in a collection (batch runs) and consider extracting that processing into a workflow afterwards, so you can reuse whatever tools/steps work for you now, should you need to do this again.
The process will go something like this:
Put all of your current text or vcf data into a dataset collection
Manipulate that collection of data to create BED datasets from your current coordinates
Run that collection through the Get Flanks tool to get the upstream coordinates
Run the result collection through bedtools GetFastaBed
Merge the final collection results with the tool Collapse Collection into single dataset in order of the collection (Galaxy Version 4.0)
You might want to try this on one dataset first, to test out what the best steps/tools are to convert text/vcf-to-bed, then run those same steps on the collection (all of your data, at once).
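If it helps to see what the flank and fetch steps above compute, here is a small Python sketch of the same logic on a toy sequence. It is not the Get Flanks or bedtools implementation. The names `upstream_interval` and `fetch`, the toy chromosome `chrT`, and the in-memory `genome` dict are all made up for illustration, and it only handles the forward (reference) strand.

```python
def upstream_interval(chrom, pos_1based, n=15):
    """BED-style interval covering the n bases just before a 1-based SNP position."""
    snp_start = pos_1based - 1           # 0-based index of the SNP itself
    start = max(0, snp_start - n)        # clamp at the start of the chromosome
    return (chrom, start, snp_start)     # end is exclusive, so the SNP is left out


def fetch(genome, interval):
    """Return the sequence for a 0-based, end-exclusive interval."""
    chrom, start, end = interval
    return genome[chrom][start:end]


# Toy 25-base "chromosome"; the last base (1-based position 25) is a C.
genome = {"chrT": "AAAAACCCCCGGGGGTTTTTACGTC"}

flank = upstream_interval("chrT", 25)
print(flank, fetch(genome, flank))
```

For the C at position 25 this gives the interval 9 to 24 and the 15 bases `CGGGGGTTTTTACGT`. Positions closer than 15 bases to the start of a chromosome return a shorter flank.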
Collection operations are explained here:
Dataset collections - modern studies usually include many samples. Collections are designed to simplify complex, multi-sample analyses as shown in this tutorial.
Galaxy Training Network, many tutorials include collections but review those that focus just on collections to start with: Galaxy Tips & Tricks > Data Manipulation
Otherwise, this file can easily be converted to BED format using other tools. Let me know if you want a how-to. Or wait for Nolen to write back. There are many ways to do what you want to do. Example: Extract Genomic DNA is an alternative to bedtools GetFastaBed. Example 2: bcftools query is an alternative to VCFtoTab-delimited. Among similar tools, at least one will usually have a single function, and the others have that same base function plus more options.
Important for either method: Do you know if the SNP position is 0-based or 1-based? Check a few positions in a browser like UCSC to confirm (just inspect, you should be able to tell). What that position represents changes the transformation details slightly (yet importantly: you don’t want the data to be off by one base).
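If you have the reference sequence on hand, the same check can be sketched in a few lines of Python: since every site should be a C in the reference, count how many of your positions land on the expected base under each convention. This is only a rough illustration. The function name `guess_coordinate_base` and the toy sequence are made up, and looking at a few sites in a browser is still a good idea because runs like "CC" can make a single site ambiguous.

```python
def guess_coordinate_base(seq, sites):
    """Guess whether positions are 0- or 1-based from the expected reference base.

    seq   : reference sequence for one chromosome (string)
    sites : list of (position, expected_ref_base) tuples
    """
    one_based_hits = sum(
        1 for pos, ref in sites
        if 1 <= pos <= len(seq) and seq[pos - 1].upper() == ref.upper()
    )
    zero_based_hits = sum(
        1 for pos, ref in sites
        if 0 <= pos < len(seq) and seq[pos].upper() == ref.upper()
    )
    if one_based_hits > zero_based_hits:
        return "1-based"
    if zero_based_hits > one_based_hits:
        return "0-based"
    return "ambiguous"


# Toy sequence: the Cs are the 4th and 8th bases.
seq = "AAGCTTACGA"
print(guess_coordinate_base(seq, [(4, "C"), (8, "C")]))
print(guess_coordinate_base(seq, [(3, "C"), (7, "C")]))
```

The first call reports 1-based and the second 0-based. With real data, use several sites so that a chance match at one position does not decide the answer.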
I would love a how-to! To check if the SNP position is 0-based or 1-based, do I view the original BAM file (where I got my data set from) in UCSC? If so, I’ll get back to you on that since those files are on the lab computer. Thanks!