Comparing two contig assemblies: LASTZ failure

Hi, I have zero knowledge of coding, so I really enjoy using Galaxy to analyze my NGS data. I am interested in comparing two genome contig assemblies. Both assemblies are more than 2 Gb in size, and when I ran LASTZ on them for comparison, I got the error message below.

FAILURE: in load_fasta_sequence for /galaxy-repl/main/files/048/440/dataset_48440756.dat, sequence length 2,147,296,059+425,431 exceeds maximum (2,147,483,637)

How can I compare the two contig assemblies in Galaxy then? I appreciate your help.

-Sandip

I’m sorry, it looks like the Galaxy version of this tool does not support genomes larger than 2Gb. The command line version does, if you build a special version.

If you want to do the analysis in Galaxy, you’ll have to use a workaround like comparing portions of your assemblies to each other. Are 2,147,296,059 and 425,431 the sizes of your two assemblies? That puts you only slightly over the maximum size, so you could just do it in two alignments: Cut your first assembly (the 2,147,296,059bp one) in half and compare the other to one half, then the other.

Actually, since that would put you well under the limit, you could make each alignment more complete by just cutting off the last 250kb of your big assembly for one alignment, then the first 250kb for the second.
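To make the arithmetic behind that suggestion concrete, here is a quick check in Python. It assumes the two numbers in the error message are added together and compared against the stated maximum, which is how the message reads but is an assumption on my part, not something taken from the LASTZ source. The names `LASTZ_MAX`, `big`, `small`, and `fits` are just for illustration.

```python
# Numbers copied from the error message above.
LASTZ_MAX = 2_147_483_637   # "maximum" reported by LASTZ
big = 2_147_296_059         # first number in the message
small = 425_431             # second number in the message


def fits(length_a, length_b, limit=LASTZ_MAX):
    """Return True if the combined length stays within the reported maximum.

    Assumption: the tool compares the sum of the two lengths to the limit.
    """
    return length_a + length_b <= limit


overshoot = big + small - LASTZ_MAX
print(overshoot)                    # 237853 bases over the limit
print(fits(big, small))             # False
print(fits(big - 250_000, small))   # True: trimming 250 kb is enough
```

So under that assumption the overshoot is a bit under 250 kb, which is why trimming 250 kb from one end would be enough.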

Hi Nick, thanks for your reply. The two contig assemblies I am comparing are 2.324 Gb and 2.059 Gb in size. Can I split each genome into two parts in Galaxy and compare them? I am interested in generating a dotplot from this analysis. How would splitting the genomes affect the dotplot output?

Is there any other software in Galaxy for assembly comparison?

Also, how do I split the genomes in Galaxy?

Ah, so I guessed wrong about the sizes of the assemblies. Then you might have to split both assemblies into halves and do four comparisons (each pair of halves). If you’re looking to get a dotplot out of this, that should be perfect! The dotplot of half 1 of sequence 1 vs half 1 of sequence 2 should be the same as the first quadrant of the dotplot of sequence 1 vs sequence 2.
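If you later want to stitch the four results back into one full-size dotplot, the idea can be sketched like this in Python. This is only an illustration: the function names and the example lengths are made up, and it assumes each half is a contiguous block of the original assembly, so a position in half 2 is just shifted by the length of half 1.

```python
from itertools import product


def quadrant_offsets(len_a1, len_b1):
    """Map each (half of A, half of B) pair to the (x, y) offset of its quadrant.

    len_a1 and len_b1 are the lengths of the FIRST halves of assemblies A and B.
    """
    x_off = {1: 0, 2: len_a1}
    y_off = {1: 0, 2: len_b1}
    return {(i, j): (x_off[i], y_off[j]) for i, j in product((1, 2), repeat=2)}


def to_full_coords(pos_a, pos_b, pair, offsets):
    """Shift a position from a half-vs-half alignment into full-assembly coordinates."""
    dx, dy = offsets[pair]
    return pos_a + dx, pos_b + dy


# Hypothetical first-half lengths, chosen only to keep the numbers readable.
offsets = quadrant_offsets(len_a1=1000, len_b1=800)
print(sorted(offsets))                          # the four comparisons to run
print(to_full_coords(5, 7, (2, 1), offsets))    # (1005, 7)
```

Each of the four keys is one LASTZ run, and the offsets say where its dots belong on the combined plot.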

I think there should be enough tools in the “FASTA/FASTQ” section of Galaxy to do this. The “Split Fasta” tool might be all you need. Depending on whether your assemblies are tons of little sequences or a few big ones, you might want the “Each sequence in its own dataset” or the “Split into a number of chunks” option. If you’re not sure of the composition of your assemblies, you can use the “Compute sequence length” tool. It will tell you how long each sequence in a FASTA file is. You’ll probably also want to run it on the result of the “Split Fasta” tool, just to double-check that the results are what you expected.
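For anyone curious what those two steps amount to, here is a rough Python sketch. It is not how the Galaxy tools are implemented, just the same idea in a few lines: list the length of every sequence, then split the records into chunks of roughly equal total size while keeping the original order. The function names and the tiny example data are made up for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:].strip(), []
        elif line:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def split_in_order(records, n_chunks):
    """Split records into n_chunks of roughly equal total length, preserving order."""
    records = list(records)
    total = sum(len(seq) for _, seq in records)
    target = max(total / n_chunks, 1)   # bases per chunk; guard against empty input
    bins = [[] for _ in range(n_chunks)]
    running = 0
    for name, seq in records:
        index = min(int(running / target), n_chunks - 1)
        bins[index].append((name, seq))
        running += len(seq)
    return bins


# Toy input standing in for a real assembly file.
example = [">c1", "ACGT", "AC", ">c2", "GG", ">c3", "TTTT", ">c4", "AA"]
records = list(read_fasta(example))
lengths = [(name, len(seq)) for name, seq in records]
halves = split_in_order(records, 2)
chunk_totals = [sum(len(seq) for _, seq in chunk) for chunk in halves]
print(lengths)        # [('c1', 6), ('c2', 2), ('c3', 4), ('c4', 2)]
print(chunk_totals)   # [8, 6]
```

Checking the chunk totals at the end is the same double-check as running the length tool on the split output: each chunk should come in comfortably under the LASTZ limit.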

Hi Nick, thank you for your input 👍. I split each FASTA file into two sub-files and am comparing them using LASTZ in Galaxy. The analysis is taking a long time (the LASTZ job has been running since the night of Dec 11), but I am looking forward to seeing the results.
