I need a tool that can extract bases before a certain position

mlim · July 19, 2019, 10:50pm

Hi everyone,

Does anyone know of a command/package/program (either within or outside of Galaxy) that can spit out a file telling me the ~15 or so bases leading up to a certain position in a mouse genome?

More information on what I’m trying to do:
I have 9 whole exome sequencing files. Each of those 9 files has about 50 C to T mutations. I’m required to show which bases lead up to each C to T mutation (for reasons not relevant to explain here). I could do this by hand because I have the exact position of these mutations, but seeing that I have 9 samples and 50 mutations each, that would drive me insane.

Thanks!

innovate-invent · July 20, 2019, 1:55am

bedtools GetFastaBed is exactly what you want

mlim · July 20, 2019, 2:01am

Thanks for your reply. All my files are in .txt format. Does GetFastaBed work with this file format? If not, is there a way to convert to an acceptable format?

mlim · July 20, 2019, 2:11am

Sorry I just figured out I can use VCF files and I have those too. I was able to generate a fasta file with position # of every base pair. This is awesome, but is there a way I could input a bunch of positions (containing my mutations) and have the program give me a file with all the base pairs leading up to it all at once? I’d like to avoid having to find each position individually and copying and pasting the base pairs leading up to it. Thanks

innovate-invent · July 22, 2019, 2:09am

I recommend awk, or more specifically gawk. It is specifically designed to parse these kinds of files and allows you to do math operations on the content.

mlim · July 22, 2019, 4:40pm

Thank you! I’m very new to the command line. Which options would you recommend using with gawk to extract the 15 bases leading up to a certain position?

innovate-invent · July 22, 2019, 5:56pm

Can you provide me with a sample of the data?

jennaj · July 22, 2019, 7:37pm

FYI: Galaxy doesn’t have the wrapped version of awk hosted on public servers anymore (there was a problem with it). You could install it in your own Galaxy, just don’t host it publicly.

So if you want to do this all in Galaxy, avoid command-line scripts, keep a complete record of your full work, and (optionally) put the analysis steps into a workflow for reuse…

Convert your data to BED format and work with it from there. Tools to rearrange data are in the tool group “GENERAL TEXT TOOLS”. Most do not require any coding skills. You could also try VCFtoTab-delimited, then use Cut to pull out just the coordinates you want. Note that VCF start coordinates are 1-based, while BED start coordinates are 0-based. Direct conversion tools will interpret the data that way; other tools won’t, so you’ll need to adjust the coordinates yourself. See the format FAQs below to understand this better, plus this very informative Biostars post.
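To make the off-by-one concrete, here is a minimal sketch of the adjustment for a single base. It assumes tab-separated input with the chromosome in column 1 and a 1-based position (as in VCF) in column 2; the helper name `vcf_pos_to_bed` is made up for illustration, and plain awk is used rather than any Galaxy tool.

```shell
# Hypothetical helper: turn "chrom<TAB>pos" lines (pos is 1-based, as in VCF)
# into single-base BED intervals. BED starts are 0-based and BED ends are
# exclusive, so the 1-based position p becomes the interval [p-1, p).
vcf_pos_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $2 }'
}

# Usage: a variant at 1-based position 100 on chr1
printf 'chr1\t100\n' | vcf_pos_to_bed
# prints: chr1<TAB>99<TAB>100
```

If the positions in your file turn out to be 0-based already, the start would be the position itself and the end would be the position plus one.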

The tool “Get flanks returns flanking region/s for every gene” takes an input set of coordinates in BED format and outputs various ranges in BED format. That output can then be the input to the get fasta fetching tool. Do it all in a collection (batch runs) and consider extracting that processing into a workflow afterwards, so you can reuse whatever tools/steps work for you now, should you need to do this again.

The process will go something like this:

1. Put all of your current text or VCF data into a dataset collection
2. Manipulate that collection of data to create BED datasets from your current coordinates
3. Run that collection through the Get Flanks tool to get the upstream coordinates
4. Run the result collection through bedtools GetFastaBed
5. Merge the final collection results with the tool Collapse Collection into single dataset in order of the collection (Galaxy Version 4.0)

You might want to try this on one dataset first, to test out what the best steps/tools are to convert text/vcf-to-bed, then run those same steps on the collection (all of your data, at once).
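As a sanity check on what the Get Flanks and fetch steps should return, here is a toy sketch that slices a made-up 20-base sequence with a BED interval. It is not how bedtools is implemented; the function name `bed_slice` and the sequence are invented for illustration.

```shell
# Toy stand-in for the fetch step: print the bases covered by a BED interval.
# Arguments: sequence string, 0-based start, exclusive end.
bed_slice() {
    # awk's substr() counts from 1, so shift the 0-based BED start by one.
    awk -v s="$1" -v start="$2" -v end="$3" \
        'BEGIN { print substr(s, start + 1, end - start) }'
}

# Made-up sequence; pretend the mutated C is the 20th base (1-based position 20).
toy_seq="ACGTACGTTTGGAACCGTAC"

# The 15 bases before it are 1-based positions 5..19, i.e. BED interval [4, 19).
bed_slice "$toy_seq" 4 19
# prints: ACGTTTGGAACCGTA
```

Comparing a hand-counted case like this against the real tool output for one or two mutations is a quick way to confirm the coordinates are not shifted by one base.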

Collection operations are explained here:

Dataset collections - modern studies usually include many samples. Collections are designed to simplify complex, multi-sample analyses as shown in this tutorial.
Galaxy Training Network, many tutorials include collections but review those that focus just on collections to start with: Galaxy Tips & Tricks > Data Manipulation

FAQs:

mlim · July 22, 2019, 9:30pm

Sure! This is what one of the nine files looks like: https://drive.google.com/open?id=1cZfsbqoq99pSrjUKWkIrjuHJMasO6XNa.

Each row is a specific C to T mutation. The columns contain information on chromosome and position and also gene name.

Thanks

jennaj · July 23, 2019, 12:10am

Let’s see the gawk command @innovate-invent wants to share.

Otherwise, this file can easily be converted to BED format using other tools. Let me know if you want a how-to. Or wait for Nolen to write back. There are many ways to do what you want to do. Example 1: Extract Genomic DNA is an alternative to bedtools GetFastaBed. Example 2: bcftools query is an alternative to VCFtoTab-delimited. For similar tools, at least one will usually have a single function and others have that same base function plus more options.

Important for either method: Do you know if the SNP position is 0-based or 1-based? Check a few positions in a browser like UCSC to confirm (just inspect, you should be able to tell). What that position represents changes the transformation details slightly (yet importantly – you don’t want the data to be one base off).

mlim · July 23, 2019, 1:25am

I would love a how-to. To check if the SNP position is 0-based or 1-based, do I view the original BAM file (the one my data set came from) in UCSC? If so, I’ll get back to you on that since those files are on the lab computer. Thanks!

mlim · July 24, 2019, 6:20pm

This method worked for me. Thank you all for helping me out!

innovate-invent · July 24, 2019, 7:23pm

This command converts your file to bed format with the adjusted coordinates:

gawk 'BEGIN {FS=OFS="\t"} NR>1 && /^[^#]/ { print $1, $2-15, $2 }' sample_data.txt
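Whether `$2-15, $2` covers exactly the 15 bases before the mutation depends on how the positions in column 2 are numbered. The sketch below is one possible variant, not the command from the post: it assumes the same layout (header on line 1, chromosome in column 1, position in column 2), takes the numbering as a parameter, and clamps the start at 0 for positions near the beginning of a chromosome. The helper name `upstream_bed` and the sample rows are made up, and plain awk is used so it also runs where gawk is not installed.

```shell
# Hypothetical helper: print the BED interval covering the n bases immediately
# before each position, excluding the mutated base itself.
# $1 = n (default 15); $2 = numbering of the input positions: 1 (VCF-style) or 0.
upstream_bed() {
    awk -v n="${1:-15}" -v base="${2:-1}" '
        BEGIN { FS = OFS = "\t" }
        NR > 1 && /^[^#]/ {
            end = $2 - base            # 0-based index of the mutated base = exclusive window end
            start = end - n
            if (start < 0) start = 0   # BED starts cannot be negative
            if (end > start) print $1, start, end
        }'
}

# Usage with two made-up rows and 1-based positions
printf 'chrom\tpos\tgene\nchr1\t100\tGeneA\nchr2\t10\tGeneB\n' | upstream_bed 15 1
# prints:
# chr1<TAB>84<TAB>99
# chr2<TAB>0<TAB>9
```

With `upstream_bed 15 0` the first row gives `chr1 85 100`, which is the same interval the command above prints. So that command yields the 15 preceding bases if the positions are 0-based; if they are 1-based, its window would instead end on the mutated base.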
