How to merge or generate FASTA file with the same chromosome

Nattawat_Chaiyawong · December 12, 2018, 5:21pm

Hi

I am trying to generate a FASTA file from a BAM file. I followed the protocol from https://usegalaxy.org/u/antunderwood/w/bam-to-fasta. However, the resulting FASTA file does not merge sequences from the same chromosome: each chromosome appears as several separate fragments.

For example: Chr1 0-100, Chr1 101-200, Chr1 201-300…Chr 1001-2000… I need to merge all records that belong to the same chromosome, which will make the data easier to analyze, e.g. Chr1 0-10XXXX, Chr2 0-10XXXX, Chr3 0-10XXXX… Could you give me some suggestions about this problem? Thank you so much.

jennaj · December 12, 2018, 9:08pm

Hello,

Examine the fasta identifiers in your output and note that they have regions included. Something like:

>chr1:0-100
ATCGATCGATCGATCGATCG

When what you want seems to be:

>chr1 0-100
ATCGATCGATCGATCGATCG

Use one of the Replace Text tools to modify the “>” lines. Just be aware that renaming the sequences without the coordinates included in the fasta identifiers will create duplicates.
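
For readers working outside Galaxy, the header change described above can be sketched in a few lines of Python. This is only an illustration, not what the Replace Text tools do internally; the function name `strip_region_from_headers` is hypothetical, and the sketch assumes every header has the form `>name:start-end`.

```python
def strip_region_from_headers(fasta_text: str) -> str:
    """Rewrite '>name:start-end' header lines as '>name start-end'.

    Assumption: the region is the part after the LAST colon in the header.
    Sequence lines are passed through unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and ":" in line:
            name, region = line[1:].rsplit(":", 1)
            line = f">{name} {region}"
        out.append(line)
    return "\n".join(out) + "\n"


example = ">chr1:0-100\nATCG\n>chr1:101-200\nGGCC\n"
result = strip_region_from_headers(example)

# The first word of each header is what most tools treat as the identifier.
identifiers = [l[1:].split()[0] for l in result.splitlines() if l.startswith(">")]
```

Here `identifiers` ends up as `["chr1", "chr1"]`, which is the duplicate-identifier situation the warning above refers to: the records are renamed, not merged.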

FAQ for Fasta format: https://galaxyproject.org/learn/datatypes/#fasta

Nattawat_Chaiyawong · December 12, 2018, 9:44pm

Hello,

Thank you for your comment, but I still don’t understand the answer you provided. My FASTA file shows chromosome records like Py17x_01_v3:0-1073, Py17x_01_v3:1102-1255, Py17x_01_v3:1259-1587 … Py17x_01_v3:813740-815147 (figure above). What I need is a FASTA file that shows Py17x_01_v3:0-815147 for Chr1, Py17x_02_v3:0-xxxxxxxx for Chr2, and so on up to Chr14, because I want to see the whole sequence in the same window when I analyze the data in IGV. I don’t want the data for one chromosome split into separate records. I am sorry if my question was not clear. Could you explain in detail how I can modify it, please? Thank you so much.
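
To make the request concrete, the kind of merge being asked for could look roughly like the Python sketch below. It is a sketch under stated assumptions, not a tool from the linked workflow: it assumes headers of the form `>name:start-end`, treats `start` as a 0-based offset, and fills positions not covered by any fragment with `N`. The function names `parse_fragments` and `stitch` are hypothetical.

```python
def parse_fragments(fasta_text):
    """Return [chrom, start, sequence] records from '>chrom:start-end' FASTA text."""
    records = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            chrom, region = line[1:].rsplit(":", 1)
            start = int(region.split("-")[0])
            records.append([chrom, start, ""])
        else:
            records[-1][2] += line  # append sequence to the latest record
    return records


def stitch(fasta_text, gap_char="N"):
    """Merge fragments into one sequence per chromosome.

    Assumptions: 'start' is a 0-based offset; uncovered positions are
    unknown and are padded with gap_char; where fragments overlap, the
    fragment with the later start overwrites the earlier bases.
    """
    merged = {}
    for chrom, start, seq in sorted(parse_fragments(fasta_text),
                                    key=lambda r: (r[0], r[1])):
        current = merged.get(chrom, "")
        if start > len(current):
            current += gap_char * (start - len(current))  # pad the gap
        merged[chrom] = current[:start] + seq + current[start + len(seq):]
    return merged


example = (
    ">Py17x_01_v3:0-4\nACGT\n"
    ">Py17x_01_v3:6-9\nTTT\n"
    ">Py17x_02_v3:2-4\nGG\n"
)
chromosomes = stitch(example)
```

With this toy input, `chromosomes["Py17x_01_v3"]` is `"ACGTNNTTT"` and `chromosomes["Py17x_02_v3"]` is `"NNGG"`. The padded positions carry no sequence information, so the result is a stitched view of covered regions rather than a complete chromosome.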

jennaj · December 12, 2018, 10:02pm

Look at your custom genome/BAM hit data. My guess is that it has chromosomes named like “Py17x_01_v3”, not “chr1”.

You’ll need to start off by mapping against a genome whose chromosomes are named the way you want to use or group by in later steps; otherwise, the identifiers/coordinates won’t match.

Nattawat_Chaiyawong · December 12, 2018, 10:24pm

Yes, the chromosome names are Py17x_01_v3, Py17x_02_v3…, Py17x_14_v3 (14 chromosomes). As I understand it, I need to start by mapping against a genome (with the same chromosome names) to generate the BAM file and then convert it to a FASTA file, right? The reason I want to generate the FASTA file is that I will use it as the template (reference genome) for my further analysis (mapping my unknown sample against this reference genome). Thank you.

jennaj · December 12, 2018, 11:34pm

I’m not quite sure I understand but it sounds like you are trying to build a new reference transcriptome (or other “-ome”). If so, it may help to review the Galaxy tutorials.

Note: It is technically possible to rename identifiers in any dataset, but some are more difficult to transform than others. All datasets used together in visualization or other analysis steps need to have the identifiers (that represent the exact same underlying data) modified with a precise method that fits the different datatypes involved. I wouldn’t recommend that anyone attempt modifications like this unless they already know how to do it and are able to detect and troubleshoot any problems that might come up. The steps are too complicated (especially with BAM data) and much can go wrong. This is why I think the best approach is to start over with the identifiers you want to use from the beginning, then work through your analysis with consistent data for all steps. All will go much smoother!

Since you are starting over, this might be a good time to consider updating the workflow you shared. It uses older versions of a few tools. Using the latest versions of tools is always better. Tools can be updated within the workflow editor.
