Reference Genome

I have a question about Bowtie2. When I map against a genome from the built-in list, it seems to work fine. But when I use an .fna genome from NCBI or elsewhere, it gives me an error. Am I doing something wrong?

Welcome, @mycojon

This FAQ has instructions for how to format a custom reference genome.

Please let us know if that is enough or not :slight_smile:

So if I understand the instructions correctly, I need to do this with a SAM or BAM file. So I can’t use the .fna file that I downloaded? If I can’t use the .fna file, where do I get the source file? Or is the SAM or BAM only for when I want to create my own file?

If it matters, my use is to look at a couple of shotgun sequences that have a lot of diversity. I want to selectively look for specific microbes in the sample, which I assume may be more accurate than trying to pick them up by identifying everything. Does this make sense, or is there a better way? Also, with Plasmodium, there is the issue of cross-identification with human DNA.

I personally am looking at two microbes.

  1. Klebsiella pneumoniae: this is probably fairly straightforward.
  2. Plasmodium ovale, maybe other Plasmodium.

Another question on these two microbes: when I have run these through a platform, they only show up on the NCBI nr translated protein database.

If your goal is to map against something, that would be a fasta file (same as .fna but given the “fasta” datatype in Galaxy). Nucleotide sequences in fasta format. I just re-wrote that FAQ so the instructions should work. If any part is unclear, share some data that you have and ask about the part that is unclear please and I can clarify more. The “format rules” are the same for use as a target with most tools.
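To make the “format rules” concrete, here is a minimal Python sketch of a cleanup pass over a downloaded .fna file. The specific rules it applies (keep only the identifier on each `>` line, require unique identifiers, drop blank lines, wrap sequence at a fixed width) are my assumptions about what commonly trips up mapping tools, so check them against the FAQ. The function name `normalize_fasta` and the default width of 60 are hypothetical choices for illustration.

```python
def normalize_fasta(text, width=60):
    """Return FASTA text with ID-only title lines, wrapped sequence, no blank lines.

    Assumed rules (verify against the FAQ): identifiers must be unique,
    descriptions after the first whitespace are dropped, and sequence
    lines are re-wrapped to a fixed width.
    """
    records = []            # (identifier, sequence) pairs in input order
    seen = set()            # identifiers already used
    name, chunks = None, []

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue        # drop blank lines anywhere in the file
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            if not fields:
                raise ValueError("title line with no identifier")
            name = fields[0]            # keep the ID, drop the description
            if name in seen:
                raise ValueError(f"duplicate identifier: {name}")
            seen.add(name)
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data before the first '>' line")
            chunks.append(line)

    if name is not None:
        records.append((name, "".join(chunks)))

    out = []
    for ident, seq in records:
        out.append(">" + ident)
        # slice the joined sequence into fixed-width lines
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

This reads the whole file into memory, which is fine for a bacterial genome but may not suit a very large one. The result would then be uploaded with the “fasta” datatype.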

The source can be from anywhere. But a bunch of short reads extracted from a SAM/BAM file used as a target … maybe not with Bowtie2, since the hits are curated for the “best match” and a semi non-redundant target is assumed. Those same short reads would be completely fine as a query (fastq format).

Instead, BLASTN is one example tool where the target could be short reads. Tune parameters to capture everything over some minimum criteria – likely both coverage and percent identity for this one to avoid exploding the results.
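As a rough illustration of the “minimum criteria” idea, the sketch below filters BLAST tabular output by percent identity and query coverage after the search has run. It assumes the standard 12-column tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Since that layout has no query length column, the caller supplies read lengths in a dictionary. The function name and both thresholds are hypothetical placeholders, not recommended values.

```python
def filter_hits(tabular_lines, query_lengths, min_identity=90.0, min_coverage=80.0):
    """Keep hits that meet both a percent-identity and a query-coverage cutoff.

    tabular_lines: iterable of tab-separated lines in the standard
                   12-column BLAST tabular layout (an assumption).
    query_lengths: dict mapping query ID -> query length in bases.
    """
    kept = []
    for line in tabular_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                    # skip blank and comment lines
        cols = line.split("\t")
        qseqid, sseqid = cols[0], cols[1]
        pident = float(cols[2])         # percent identity of the alignment
        qstart, qend = int(cols[6]), int(cols[7])
        # share of the query covered by this single alignment, in percent
        coverage = 100.0 * (abs(qend - qstart) + 1) / query_lengths[qseqid]
        if pident >= min_identity and coverage >= min_coverage:
            kept.append((qseqid, sseqid, pident, round(coverage, 1)))
    return kept
```

Coverage here is computed per alignment, so a query split across several partial alignments would be judged on each piece separately.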

Would a metagenomics classification tool be a better fit for this? Kraken is one example. More are in our tutorials, or review under that section of the tool panel. Some are specific to read types – scroll down into the help section of each to quickly review what type of reads it was designed to work with.

That was using BLASTN or Megablast, correct?

Hits against the translated protein, and not the nucleotide, is likely a clue. That could be due to legitimate sub-species/strain variation, poor quality query reads (or untrimmed artifact!), species variation (you would know more about these genomes and potential variation than I will!), or incomplete/poor quality in the NR nucleotide target (it also contains some “junk”, e.g. adaptors and mislabeled species, including human).

I would suggest checking for artifact first, and remove it if you can. You want super clean reads to capture meaningful hits.

Then, choose your tool and create your target database. GenBank/NCBI fasta (.fna) files work as the source.

And, consider exploring metagenomics, ecology and maybe microbiome tools – see our tutorials for guides, and the tool panel (not everything is included in a tutorial – but if it is, that will be listed down near the end of the form – easier than searching the GTN site sometimes).

I want to selectively look specific microbes in the sample

To be clear: searching against those suspected targets is what BLASTN will do. Just keep in mind that some other species may be the true source, or your sample WGS reads may be too short individually to tell which one is the true source.

I would probably try Kraken/Kraken2 first to see what happens, then make decisions about which to pull the full genome from when doing a closer single-species mapping.

Maybe this helps :slight_smile:

OK, I understand for the most part. I will explain the situation more. These are actually two shotgun samples from myself; one is whole blood. Everyone basically says that very few bacteria can be found in whole blood. Published studies say it’s rare to identify even one bacterium to the species level, I believe with either 16S or shotgun sequencing. My assumption is that microbial DNA is generally so degraded that it can’t be effectively assembled. Quite a lot has been identified to species.

So for 16S, Plasmodium isn’t bacteria. I don’t believe Klebsiella is identified well by 16S, as I see 16S-ITS-23S is recommended for identification. Mycobacterium is also not identified well by 16S.

For the two shotgun sequences below.

Metagenomics analysis, yes I have done that.

Plasmodium is a eukaryote, and most classifiers don’t do eukaryotes.

One Codex: identified significant Plasmodium, also Klebsiella. Also a high amount of Mycobacterium leprae.

CZ ID: identified significant Plasmodium (8500), but only on the NCBI nr database, and Klebsiella (150) on the nr database. Validated with NCBI blastx; looks like a good hit.

Kaiju: translates all reads to protein before analyzing, and is reportedly one of the most sensitive classifiers. Plasmodium reads were around 400,000, Klebsiella around 250,000.

EDGE Bioinformatics: no eukaryotes. But Kraken2 identified 360,000 Klebsiella reads.

So you can see the counts are very high, but I’m not dead yet :grinning:. But you can see my confusion trying to do some confirmation by trying to map with single species. Maybe you can recommend a better process for me?

Also a question on mapping. Since my sequences include a high proportion of human reads (95%+), I was trying to accurately remove the human reads with Bowtie2 using the very sensitive end-to-end setting, hopefully without removing non-human reads. Is there a better program for this? The files are so large that it’s difficult to load them through online programs.

Interesting project!

We have SARS-CoV-2 protocols in tutorials that screen out the human reads, then map to specific genome, plus variation analysis. Parts of this will probably be useful for you. See https://training.galaxyproject.org/training-material/search2?query=sars

This is the one specifically for the screening process and uses BWA-MEM for the mapping. It is a sort of all-purpose method. → https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/human-reads-removal/tutorial.html

That would at least get you “clean reads” free of human.
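For anyone curious what “free of human” means at the alignment level, here is a minimal Python sketch. It is not the tutorial's workflow. It assumes the mapper output against the human genome is available as SAM text, and it keeps only reads where the read itself is unmapped and, for paired data, the mate is unmapped too. The FLAG bit values come from the SAM format; the function name is hypothetical.

```python
# FLAG bits defined by the SAM format
PAIRED = 0x1          # read is part of a pair
UNMAPPED = 0x4        # this read did not align
MATE_UNMAPPED = 0x8   # the mate did not align


def non_human_read_names(sam_lines):
    """Return names of reads with no alignment to the (human) reference.

    For paired reads, both mates must be unmapped, so a pair is dropped
    as soon as either end hits human (a conservative assumption).
    """
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue                      # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue                      # not a complete alignment record
        name, flag = fields[0], int(fields[1])
        if not (flag & UNMAPPED):
            continue                      # this read aligned to human
        if (flag & PAIRED) and not (flag & MATE_UNMAPPED):
            continue                      # the mate aligned to human
        names.add(name)
    return names
```

The returned names could then be used to pull the matching records out of the original fastq files. Within Galaxy, the tutorial's own filtering steps do this for you, which matters for files as large as yours.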

The prior replies I sent can help you to get a single genome into the right format for use as a custom genome fasta target across tools: Bowtie2, BWA/BWA-MEM, Blast, really any tool that expects a genome target (or transcriptome target). Meaning, the basic fasta formatting is the same for all.

In short:

  • query reads can be fastq or fasta, it depends on the tool, and the form will clarify.
  • target genome/transcriptome is usually in fasta across tools
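If you are unsure which of the two formats a file is in, the first non-blank line usually tells you: fasta records start with `>` and fastq records start with `@`. A minimal Python sketch, assuming uncompressed text input (the function name is hypothetical):

```python
def guess_sequence_format(lines):
    """Guess 'fasta' or 'fastq' from the first non-blank line.

    lines: any iterable of text lines, e.g. an open file handle.
    Returns 'unknown' when the first non-blank line matches neither.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue                # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        return "unknown"
    return "unknown"                # empty input
```

Usage would look like `with open("reads.txt") as handle: print(guess_sequence_format(handle))`, where the file name is a placeholder. This checks only the first record, so it will not catch problems further down the file.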

Happy hunting! If you get stuck with a particular tool like before, and you think the fasta format is not right, and the FAQs didn’t help, you can post a few sequences from the very top of the file for format feedback, along with a screenshot of what the expanded dataset looks like in Galaxy (exposes the metadata), and the full tool name you are using. That should be enough for troubleshooting. Same for problematic query reads as needed.
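For large files, one way to grab “a few sequences from the very top” without opening the whole file in an editor is a short Python sketch like this one. The function name and the default of three records are hypothetical.

```python
def first_records(lines, n=3):
    """Return the text of the first n FASTA records from an iterable of lines.

    Anything before the first '>' line is kept as well, since stray
    leading content is exactly the kind of format problem worth showing.
    """
    out, count = [], 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            if count > n:
                break               # stop reading once n records are collected
        out.append(line.rstrip("\n"))
    return "\n".join(out)
```

Because it stops at the start of record n+1, only the top of the file is read when passed an open file handle.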
