Merging FASTA files

Hello everyone. This might be a very basic question, but I am analyzing my transcriptomic data for the first time.

My samples were sequenced on 4 lanes, so I have 4 FASTA files per sample. How do I merge them to get a complete summary of each sample's sequencing?

Thank you for your help.

Hi @Goncalo_Pinheiro

A single sample run on multiple lanes can be combined with the Concatenate tool. This stacks the reads top to bottom.
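
Outside Galaxy, the same "stack top to bottom" idea can be sketched in a few lines of Python. This is only an illustration, not what the Concatenate tool runs internally. It assumes uncompressed files that each end with a newline, and the function names `concatenate_lanes` and `count_fastq_reads` are made up for this example.

```python
import shutil


def concatenate_lanes(lane_paths, merged_path):
    """Append each lane file to merged_path, in the order given."""
    with open(merged_path, "wb") as merged:
        for lane_path in lane_paths:
            with open(lane_path, "rb") as lane:
                # Copy the bytes unchanged; nothing is parsed or reordered.
                shutil.copyfileobj(lane, merged)


def count_fastq_reads(path):
    """Count reads, assuming the usual 4 lines per FASTQ record."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle) // 4
```

A quick sanity check after merging is that the read count of the merged file equals the sum of the read counts of the lane files.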

For QA or summary information like quality metrics, the protocol varies by the analysis domain.

Tutorials: https://training.galaxyproject.org/
Start here: NGS data logistics


Hello jennaj

First, I need to make a small correction to what I had said. I have each sample separated into 4 lanes, each producing a separate FASTQ (not FASTA) file.

I have tried to use the Concatenate tool, but it does not seem to do what I want (combine the reads from each FASTQ file into a single file). I then converted the FASTQ files into FASTA files and used the Merge FASTA tool, which did (at least I think it did) what I need. But now I do not know how to convert the result back to FASTQ, nor how to obtain a QUAL file for the merged FASTA (which seems to be what I am missing).

Does anyone have a clue how I can solve this issue? Is there a tool that merges FASTQ files, and I just missed it?

Thank you for the help.

The Concatenate tools should definitely work on FASTQ data. Try uploading the data and let Galaxy guess the datatype. It should be fastqsanger or fastqsanger.gz for current NGS sequencing methods. You might want to uncompress the read data once uploaded (pencil icon → Convert).

Concatenate per sample into one file each. Keep samples distinct. If the data is paired-end interleaved, you can leave it interleaved for some tools, and others will expect one file per end. Then if you want to batch process, put the data into a collection.
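
To make "per sample, keep samples distinct" concrete, here is a minimal Python sketch that groups lane files before concatenating. It assumes Illumina-style file names containing a lane token such as `_L001_` (for example `SampleA_S1_L001_R1_001.fastq`); your files may be named differently. The helper `group_by_sample` is hypothetical and not part of Galaxy.

```python
import re
from collections import defaultdict

# Assumption: the lane appears as "_L" plus three digits, followed by "_".
LANE_TOKEN = re.compile(r"_L\d{3}(?=_)")


def group_by_sample(filenames):
    """Map each lane-free file name to the sorted lane files behind it."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        # Drop the lane token so all lanes of one sample share a key.
        key = LANE_TOKEN.sub("", name, count=1)
        groups[key].append(name)
    return dict(groups)
```

Because only the lane token is removed, forward (R1) and reverse (R2) files of the same sample land in separate groups, so each end is concatenated on its own.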

Use one of these tools (the examples are at UseGalaxy.org, but most public Galaxy servers host the same tools by default):

  • This version allows you to choose the order: Galaxy
  • This version uses the existing order of the datasets in the history: Galaxy

Don’t use the convert-to-FASTA and Merge FASTA method: you lose information (the quality scores), and you don’t need the extra functionality. All the sequences should already have distinct identifier names when in the fastq format.
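
If you want to confirm that identifiers are distinct after concatenating, a rough Python check could look like the sketch below. It assumes well-formed, uncompressed FASTQ with exactly 4 lines per record and takes the identifier to be the first whitespace-delimited token after the `@`. The function `duplicated_read_ids` is a hypothetical helper. For interleaved paired-end data the two mates can share that token, so they would be reported here even though the file is fine.

```python
from collections import Counter


def duplicated_read_ids(fastq_lines):
    """Return the sorted read identifiers that occur more than once."""
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 == 0:  # header line of each 4-line record
            fields = line[1:].split()  # drop the leading "@"
            if fields:
                counts[fields[0]] += 1
    return sorted(read_id for read_id, n in counts.items() if n > 1)
```

It accepts any iterable of lines, so an open file handle can be passed directly. An empty list means every identifier was seen only once.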