Discard parts of sequence not mapped to reference

Hello,
We have nanopore sequence data for an amplicon that spans a couple of exons and an intron. We want to look for specific variants that are at consistent positions in each exon, but the intron is of variable length and has a lot of variation. I’d like to map my reads to exon 1, then trim away whatever doesn’t map to exon 1. Then, I’d like to save/export those trimmed aligned reads.

I’d then repeat for exon 2. I can map my reads to either exon but can’t get rid of the remaining sequence because it doesn’t start with a consistent series of nucleotides and the lengths of the reads are variable. I’ve tried several mapping tools and looked at the trimming tools and don’t see anything that will do this.

Can anyone advise a tool or tools? Thanks, Linda

Hi @lkothera
Is it a metagenomic data? If not, how many nanopore sequences do you have? Do you have just one amplicon in you experiment? With one amplicon and modest number of sequnces you probably can generate an alignment using ClustalW or MAFFT.

For alignment trimming in FASTA format maybe consider Chop.seqs: it can handle indels (dash characters), but check the output. I have impression that it has a minor bug: the output sequences one nucleotide shorter for “back” characters. Some tools, such as Trim sequences expect nucleotides only. You may also consider text manipulations tools such as awk. Assuming each sequence in alignment is recorded in a single line, you can use awk with something like this:
{ if ($1~/^>/) {print $0} else {print substr($1,4,10)} }
It prints the sequence names without changes and for every sequence it prints ten nucleotides starting from position four.
Hope that helps.
You can also check other specialised suites, such as MEGA designed for multi-sequence alignment.
Kind regards,
Igor

1 Like

Hi Igor,
I’m checking Galaxy help for something else and see I never replied here. Sorry about that. Thanks for this info. We decided to go another route, the Variant Effects Predictor on the Ensembl website. It is amplicon data, with variable sized introns.