Discard parts of sequence not mapped to reference

lkothera · July 10, 2024, 10:31pm

Hello,
We have nanopore sequence data for an amplicon that spans a couple of exons and an intron. We want to look for specific variants that are at consistent positions in each exon, but the intron is of variable length and has a lot of variation. I’d like to map my reads to exon 1, then trim away whatever doesn’t map to exon 1. Then, I’d like to save/export those trimmed aligned reads.

I’d then repeat for exon 2. I can map my reads to either exon but can’t get rid of the remaining sequence because it doesn’t start with a consistent series of nucleotides and the lengths of the reads are variable. I’ve tried several mapping tools and looked at the trimming tools and don’t see anything that will do this.

Can anyone advise a tool or tools? Thanks, Linda

igor · July 11, 2024, 1:41am

Hi @lkothera
Is this metagenomic data? If not, how many nanopore sequences do you have? Do you have just one amplicon in your experiment? With one amplicon and a modest number of sequences, you can probably generate an alignment using ClustalW or MAFFT.

For trimming an alignment in FASTA format, consider Chop.seqs: it can handle indels (dash characters), but check the output. I have the impression that it has a minor bug: the output sequences are one nucleotide shorter for “back” characters. Some tools, such as Trim sequences, expect nucleotides only. You may also consider text manipulation tools such as awk. Assuming each sequence in the alignment is recorded on a single line, you can use awk with something like this:
{ if ($1~/^>/) {print $0} else {print substr($1,4,10)} }
It prints the sequence names unchanged and, for every sequence, prints ten characters (nucleotides or gap dashes) starting from position four.
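If the sequences in the alignment are wrapped over several lines, the awk one-liner above would cut each line separately. A minimal Python sketch of the same idea, which joins wrapped lines first, could look like this. The function name `trim_alignment` and the example reads are made up for illustration, and it assumes plain FASTA input where header lines start with `>`:

```python
def trim_alignment(fasta_text, start, length):
    """Keep `length` alignment columns starting at 1-based column `start`.

    Hypothetical helper: headers are passed through unchanged, wrapped
    sequence lines are joined before slicing, gap dashes count as columns.
    """
    records = []  # list of [header, sequence] pairs
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line  # join wrapped sequence lines

    out = []
    for header, seq in records:
        out.append(header)
        # substr($1, 4, 10) in awk is 1-based; Python slices are 0-based
        out.append(seq[start - 1:start - 1 + length])
    return "\n".join(out) + "\n"


# Made-up aligned reads; read1 is wrapped over two lines.
example = """>read1
ACGTAC-GTA
CGTTT
>read2
ACG-ACGGTACGTAA
"""

print(trim_alignment(example, 4, 10), end="")
```

As with the awk version, the column positions only make sense if all reads are already in the same alignment, so the start and length have to be read off the aligned reference exon.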
Hope that helps.
You can also check other specialised suites designed for multiple sequence alignment, such as MEGA.
Kind regards,
Igor

lkothera · October 18, 2024, 6:12pm

Hi Igor,
I was checking Galaxy Help for something else and saw that I never replied here. Sorry about that. Thanks for this info. We decided to go another route: the Variant Effect Predictor on the Ensembl website. It is amplicon data, with variable-sized introns.
