Tool for merging 2x single-read Illumina sequencing files (fastq) into one?

Dear support team at Galaxy,

I recently performed a ChIP-seq experiment and have just received my raw sequencing files from the sequencing service. Each of my samples was sequenced twice to reach a workable total number of reads (the concentration of my ChIPed DNA was very low), so each sample has two sequence files. I am looking for a tool in Galaxy that will let me merge the two files from the same sample into one, which I will then use for mapping with Bowtie2, peak calling with MACS2, etc. Could you please suggest a tool that can properly perform this kind of merging?

Many thanks in advance,

Greetings,

Manolis

Hi @Manolis1

Try this tool (per sample pair): Concatenate datasets tail-to-head (cat)

Hi @jennaj,

Many thanks for your fast response.

I did that, but I got the result below:

pastedImage.png

When you look into it:

pastedImage.png

For info, the format of the two files I was trying to merge was:

pastedImage.png

Looking into file 1 (the first 3 lines):

pastedImage.png

File 2 (the first 3 lines):

pastedImage.png

What am I doing wrong?

Greetings,

Manolis

Hi @Manolis1

This tool tends to work best with uncompressed inputs.

Please try uncompressing the inputs, then run the tool. Galaxy does some intermediate steps to transform the data during processing (in this case: fastq > plain text > fastq). When the input is compressed fastq, the compression variant is sometimes not exactly what the tool expects at a technical level. Uncompressing first will usually avoid problems with any text manipulation tool.

To uncompress a fastqsanger.gz dataset:

  1. Click on the pencil icon to reach the Edit Attributes functions
  2. Click into the Convert tab
  3. The drop-down menu will include a choice to produce an uncompressed version of the data (these will be new datasets)
  4. Run the Concatenate tool on the uncompressed fastqsanger datasets
  5. After everything is completely done and confirmed to be correct, go back and permanently delete (purge) the original compressed data to remove it from your quota usage.
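
For anyone who wants to reproduce the uncompress-then-concatenate steps above outside Galaxy, here is a minimal Python sketch. It is only a local equivalent, not what Galaxy runs internally, and the names `open_maybe_gzip` and `concat_fastq` are hypothetical helpers made up for this example.

```python
import gzip


def open_maybe_gzip(path):
    """Open a file for binary reading, decompressing it if it is gzipped."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    # gzip files start with the two magic bytes 0x1f 0x8b
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")


def concat_fastq(inputs, output):
    """Write the uncompressed contents of each input, in order, to output.

    Returns the number of lines written. For well-formed FASTQ with
    4 lines per record, the read count is that number divided by 4.
    """
    n_lines = 0
    with open(output, "wb") as out:
        for path in inputs:
            last = b"\n"
            with open_maybe_gzip(path) as handle:
                for line in handle:
                    out.write(line)
                    n_lines += 1
                    last = line
            # If a file lacks a final newline, add one so the next
            # file's first header starts on its own line.
            if not last.endswith(b"\n"):
                out.write(b"\n")
    return n_lines
```

With two made-up file names, `concat_fastq(["sample1_run1.fastq.gz", "sample1_run2.fastq.gz"], "sample1_merged.fastq")` would write one uncompressed file containing the reads of run 1 followed by the reads of run 2. For single-read data the order does not affect mapping, but the result should contain every read from both inputs.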

I tested this out last week for another person (same exact tool/problem) and it worked, but please let us know if that doesn’t work for your case and we can troubleshoot more. :slight_smile:

An alternative to @jennaj's suggestion is Collapse Collection (from the Collection operations section).
Despite its name, it also works with multiple regular datasets (although if the order of concatenation matters to you, you should first build a list collection from your replicates). This tool should work directly with your gzipped data.

Hi @jennaj,

Many thanks! It worked beautifully :)

pastedImage.png

I am going to continue now with the mapping with Bowtie2 etc.

Best wishes for a lovely weekend,

Manolis

Hi @wm75,

Thank you for your response.

I also tried what you suggested (Collapse Collection > Merge Collection), but neither of my files (gzipped or uncompressed) was recognized as input by this tool. However, @jennaj had previously suggested that I first unzip the gzip files and then use them in the Concatenate datasets tail-to-head tool, and that worked nicely.

Wishes for a great weekend,

Manolis

Both tools work in some basic tests on fastqsanger.gz files, but either one can fail for odd reasons. Most of the cases I’ve reviewed seem to be related to the version of “gzip” used for the original compression. So my go-to advice is to just uncompress and see what happens, since that usually resolves compatibility problems.

That also does a “sanity check” on the data. If it won’t uncompress in Galaxy with the Convert function directly, that usually means the file is truncated or corrupted (good to know!), either upstream from Galaxy or during Upload (due to an interrupted connection, which tends to affect larger files). Reloading the data or checking the upstream files at the source is how to fix that kind of problem. MD5 and other Secure Hash / Message Digest tools can also be run before and after loading to Galaxy, and the results compared. I included that in the test below to show how it is done (and to confirm both tools produce exactly the same output).
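
The same two checks can be sketched locally in Python before uploading. This is only an illustration under my own assumptions, not the Galaxy tools themselves; `md5sum` and `gzip_is_intact` are hypothetical helper names. Note that an MD5 comparison is only meaningful between byte-identical forms of the data: a compressed file and its uncompressed version will have different digests.

```python
import gzip
import hashlib
import zlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error.

    A truncated file raises EOFError; a corrupted or non-gzip file
    raises OSError (gzip.BadGzipFile) or zlib.error.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

If `gzip_is_intact` returns False for a local file, the problem is upstream of Galaxy. If the local copy is fine but the digest of the uploaded dataset differs from `md5sum` of the local file, the upload is the likely culprit.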

Anyway, glad you got this to work :rocket:

Test/example history here: Galaxy | Accessible History | test cat + collapse w fastqsanger.gz. I’ll leave it shared for anyone who wants to review it for example use cases. The whole history is really small (3 kb). If you import it, click on the “rerun” button to see exactly how the tool forms were set up, then purge the copy when it is no longer needed to clear the clutter.
