Extract a subsequence from the whole genome assembly?

I have managed to map my gene region of interest against my home-built whole genome assembly, and the results returned a best-matching scaffold ID. Is it possible to extract the target scaffold sequence from the assembly so I can study it more closely?

There are some ways to extract it, but just to be clear, can you tell us which files/formats you are talking about?

OK, I have a whole genome assembly in FASTA format which includes many contigs and scaffolds. If I want to extract one of the scaffolds or contigs, let’s say scaffold_100 or contig_50, how should I do it?

You can use the Filter sequences by ID tool.
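
If you ever need to do the same thing outside Galaxy, here is a minimal Python sketch of the idea. It is not how the Galaxy tool is implemented; the function name `extract_fasta_record` and the example records are made up for illustration, and it assumes the record ID is the first whitespace-delimited word after `>`.

```python
import io


def extract_fasta_record(handle, target_id):
    """Return (header, sequence) for the first record whose ID equals
    target_id, or None if no such record exists.

    Assumption: the record ID is the first whitespace-delimited word
    on the header line, after the leading '>'.
    """
    header = None  # set once the wanted record has been found
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                # The next record starts here, so the wanted one is complete.
                return header, "".join(chunks)
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            if record_id == target_id:
                header = line[1:]
                chunks = []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        return header, "".join(chunks)
    return None


# Hypothetical toy assembly; a real one would be opened with open("assembly.fasta").
example = io.StringIO(
    ">contig_50 len=8\nACGTACGT\n"
    ">scaffold_100 len=12\nACGTAC\nGTACGT\n"
    ">scaffold_101\nTTTT\n"
)
result = extract_fasta_record(example, "scaffold_100")
print(result)
```

Reading line by line keeps memory use low, since only the requested record is held in memory rather than the whole assembly.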

Thank you very much, David! It is working!

You’re welcome!

There is a new problem. The scaffold file I retrieved is too large for NCBI BLAST to align two sequences. Is there any way to compare an mRNA sequence against a larger scaffold sequence? Thanks again!

@daikez, too large? That seems odd. How are you doing it?

Yes, the scaffold I extracted is 33.5 Mb. When I tried to BLAST my gene sequence of interest against this scaffold, the NCBI website said it won’t accept files larger than 10 Mb, so I had to divide the scaffold into several smaller parts. Are there any tools in Galaxy which can directly align a GenBank entry against a saved dataset?
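
For reference, the manual splitting described above can be sketched in Python as follows. This is only an illustration: the function name `split_sequence` is hypothetical, and the overlap between neighbouring pieces is an assumption added so that a gene spanning a cut point still appears whole in at least one piece. Each piece keeps its 0-based start offset so that hit coordinates can be mapped back to the full scaffold.

```python
def split_sequence(seq, chunk_size, overlap=0):
    """Split seq into pieces of at most chunk_size characters.

    Consecutive pieces share `overlap` characters. Returns a list of
    (start_offset, piece) tuples, where start_offset is 0-based.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    pieces = []
    start = 0
    while start < len(seq):
        end = min(start + chunk_size, len(seq))
        pieces.append((start, seq[start:end]))
        if end == len(seq):
            break  # the last piece reaches the end of the sequence
        start += step
    return pieces


# Toy example; for a real scaffold, chunk_size would be chosen to stay
# under the upload limit and overlap to exceed the expected gene length.
pieces = split_sequence("ACGTACGTAC", chunk_size=4, overlap=1)
print(pieces)
```

A hit at position `p` within a piece corresponds to position `start_offset + p` in the original scaffold.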

Ok. That’s the problem.
Why are you using BLAST at NCBI instead of in Galaxy? The NCBI web version of BLAST has many known limitations.

I suggest you take some time to work through the Galaxy Training materials.

If it is possible to fetch an NCBI entry directly into Galaxy and align it with the saved dataset, that would be perfect! Any suggestions?


It is possible.

You’ll find what you need to fetch database entries in the Get Data section of the tools panel (on the left of the Galaxy page), while the BLAST tools are in the NCBI BLAST+ section.

Remember to read some of the Galaxy Training materials so you’ll get acquainted with Galaxy.
