How to merge or generate a FASTA file with one sequence per chromosome

admin
fastqc

#1

Hi

I am trying to generate a FASTA file from a BAM file. I followed the workflow at https://usegalaxy.org/u/antunderwood/w/bam-to-fasta. However, the resulting FASTA file does not merge the sequences from the same chromosome. Instead, each chromosome appears as several fragments.

For example: Chr1 0-100, Chr1 101-200, Chr1 201-300…Chr 1001-2000… I need to merge all the records from the same chromosome, which would make the data easier to analyze. For example: Chr1 0-10XXXX, Chr2 0-10XXXX, Chr3 0-10XXXX… Could you give me some suggestions about this problem? Thank you so much.


#2

Hello,

Examine the fasta identifiers in your output and note that they have regions included. Something like:

>chr1:0-100
ATCGATCGATCGATCGATCG

When what you want seems to be:

>chr1 0-100
ATCGATCGATCGATCGATCG

Use one of the Replace Text tools to modify the “>” lines. Just be aware that renaming the sequences without the coordinates included in the fasta identifiers will create duplicates.
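For anyone who prefers to see the header change spelled out, here is a minimal Python sketch of the same idea. It is not the Replace Text tool itself. The names `rewrite_header` and `duplicated_ids` are hypothetical, and the `name:start-end` header pattern is assumed from the example above.

```python
import re
from collections import Counter

# Assumed header layout: ">name:start-end" (as in ">chr1:0-100").
HEADER = re.compile(r"^>(?P<name>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)\s*$")

def rewrite_header(line, keep_coords=True):
    """Turn '>chr1:0-100' into '>chr1 0-100', or into '>chr1' if coordinates are dropped."""
    m = HEADER.match(line)
    if m is None:
        # Not a header in the assumed layout: return it unchanged.
        return line.rstrip("\n")
    if keep_coords:
        return f">{m['name']} {m['start']}-{m['end']}"
    return f">{m['name']}"

def duplicated_ids(header_lines):
    """Return identifiers (first word after '>') that occur more than once."""
    counts = Counter(h[1:].split()[0] for h in header_lines if len(h) > 1)
    return sorted(name for name, n in counts.items() if n > 1)

headers = [">chr1:0-100", ">chr1:101-200", ">chr2:0-50"]
renamed = [rewrite_header(h) for h in headers]
print(renamed)
print(duplicated_ids(renamed))
```

Because most tools treat only the first word after ">" as the identifier, both the space-separated form and the coordinate-free form leave `chr1` appearing twice here, which is the duplicate problem mentioned above.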

FAQ for Fasta format: https://galaxyproject.org/learn/datatypes/#fasta


#3

Hello,

Thank you for your comment, but I still don’t understand the answer you provided. My FASTA file shows chromosome identifiers like Py17x_01_v3:0-1073, Py17x_01_v3:1102-1255, Py17x_01_v3:1259-1587 … Py17x_01_v3:813740-815147 (figure above). What I need is a FASTA file that shows Py17x_01_v3:0-815147 for Chr1, Py17x_02_v3:0-xxxxxxxx for Chr2, and so on through Chr14, because I want to see the whole sequence in one window when I analyze the data in IGV. I don’t want the data for one chromosome split into separate records. I am sorry if my question was not clear. Could you explain in detail how I can modify it, please? Thank you so much.
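To make the request concrete, this is roughly the merge being described, written as a minimal Python sketch rather than a Galaxy tool. It makes several assumptions: headers look like `>name:start-end`, starts are 0-based, uncovered positions should be padded with `N`, and overlapping fragments can simply be trimmed. The function names `read_fragments` and `merge_fragments` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed header layout: ">name:start-end", e.g. ">Py17x_01_v3:0-1073".
HEADER = re.compile(r"^>(?P<name>.+):(?P<start>\d+)-(?P<end>\d+)\s*$")

def read_fragments(lines):
    """Yield (name, start, sequence) for every record in a fragment FASTA."""
    name, start, chunks = None, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, start, "".join(chunks)
            m = HEADER.match(line)
            if m is None:
                raise ValueError(f"unexpected header: {line}")
            name, start, chunks = m["name"], int(m["start"]), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, start, "".join(chunks)

def merge_fragments(lines, pad="N"):
    """Stitch fragments into one sequence per name, padding gaps with `pad`."""
    by_name = defaultdict(list)
    for name, start, seq in read_fragments(lines):
        by_name[name].append((start, seq))
    merged = {}
    for name, frags in by_name.items():
        parts, pos = [], 0
        for start, seq in sorted(frags):
            if start < pos:
                # Overlap: drop the part of this fragment that is already covered.
                seq = seq[pos - start:]
                start = pos
            parts.append(pad * (start - pos))  # fill the uncovered gap
            parts.append(seq)
            pos = start + len(seq)
        merged[name] = "".join(parts)
    return merged

example = [">Py17x_01_v3:0-4", "ACGT", ">Py17x_01_v3:6-8", "TT"]
print(merge_fragments(example))
```

The padded result only lines up with other data in IGV if the coordinates refer to the same genome that everything else was mapped against.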


#4

Look at your custom genome/BAM hit data. My guess is that it has chromosomes named like “Py17x_01_v3”, not “chr1”.

You’ll need to start off by mapping against a genome whose chromosomes are named the way you want to use or group by in later steps. Otherwise, the identifiers/coordinates won’t match.
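One way to check for a naming mismatch before going further is to compare the identifiers in two FASTA files. The sketch below is an illustration only: `sequence_names` and `name_mismatches` are hypothetical helpers, and it assumes any `:start-end` suffix should be ignored when comparing names.

```python
def sequence_names(fasta_lines):
    """Collect the distinct sequence names from FASTA '>' lines."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                # Keep the first word only and drop any ':start-end' suffix.
                names.add(words[0].split(":")[0])
    return names

def name_mismatches(reference_lines, other_lines):
    """Return (names only in other, names only in reference), each sorted."""
    ref = sequence_names(reference_lines)
    other = sequence_names(other_lines)
    return sorted(other - ref), sorted(ref - other)

reference = [">Py17x_01_v3", "ACGT", ">Py17x_02_v3", "ACGT"]
fragments = [">Py17x_01_v3:0-1073", "ACGT", ">chr2:0-100", "ACGT"]
print(name_mismatches(reference, fragments))
```

Two empty lists mean the names agree. Anything else points to identifiers that will not line up in later steps.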


#5

Yes, the chromosome names are Py17x_01_v3, Py17x_02_v3…, Py17x_14_v3 (14 chromosomes). As I understand it, I need to start by mapping against a genome with the same chromosome names to generate the BAM file, and then convert it to a FASTA file, right? The reason I want to generate the FASTA file is that I will use it as the template (reference genome) for my further analysis (mapping my unknown sample against this reference genome). Thank you.


#6

I’m not quite sure I understand, but it sounds like you are trying to build a new reference transcriptome (or other “-ome”). If so, it may help to review the Galaxy tutorials.

Note: It is technically possible to rename identifiers in any dataset, but some are more difficult to transform than others. All datasets used together in visualization or other analysis steps need to have the identifiers that represent the exact same underlying data modified with a precise method that fits each of the datatypes involved. I wouldn’t recommend that anyone attempt modifications like this unless they already know how to do it and are able to detect and troubleshoot any problems that come up. The steps are too complicated (especially with BAM data) and much can go wrong. This is why I think the best approach is to start over with the identifiers you want to use, then work through your analysis with consistent data at every step. Everything will go much more smoothly!

Since you are starting over, this might be a good time to consider updating the workflow you shared. It uses older versions of a few tools, and the latest versions are generally the better choice. Tools can be updated within the workflow editor.