Running Diamond

I’m having problems running DIAMOND on an R1/R2 pair from shotgun sequencing. I have searched but haven’t found an answer yet. I have been able to get it to run a couple of times, but I’m not sure why it worked then and not the other times.

  1. Assuming I get it to run, I may get a tabular file with 60,000 lines or so. How do I turn this into something like a kraken2 report?

Getting to run

  1. File input for paired-end reads. I assume I can just include R1, or interleaved R1/R2, or joined reads. Do these have to be in a specific file format in order to run? Do they need to be uncompressed or normalized? I have tried a couple of ways, but even with default settings and the standard database it still hits an error.

Hi @mycojon

Searching the tool panel finds a tool named Diamond view that works on Diamond outputs. One of the conversion choices is “taxonomic”.

If you need something else, you would need to custom parse the file yourself with text manipulation tools. This isn’t automatic, and would probably involve more than one tool. For worked examples, see Tutorials → GTN Materials Search.
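As a rough illustration of that kind of custom parsing, here is a minimal Python sketch. It assumes the DIAMOND output is in the 12-column BLAST-style tabular layout (query ID in column 1, subject ID in column 2, bit score in column 12); if your output uses different columns, the indexes need to change. The function names are made up for this example. It keeps the best-scoring hit per read and counts reads per subject sequence, so it is only a summary by subject ID, not a real kraken2 report with taxonomy.

```python
import csv
import io
from collections import Counter


def best_hits(rows):
    """Keep the highest bit score hit for each query.

    Assumes 12-column BLAST-style tabular rows:
    column 1 = query ID, column 2 = subject ID, column 12 = bit score.
    """
    best = {}
    for row in rows:
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def summarize(rows):
    """Return (subject, read_count, percent_of_reads_with_hits), most common first."""
    best = best_hits(rows)
    counts = Counter(subject for subject, _ in best.values())
    total = sum(counts.values())
    return [(subject, n, 100.0 * n / total) for subject, n in counts.most_common()]


if __name__ == "__main__":
    # Tiny made-up example; in practice open the real tabular file instead.
    demo = (
        "read1\tprotA\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-10\t50\n"
        "read1\tprotB\t95.0\t50\t2\t0\t1\t150\t1\t50\t1e-20\t80\n"
        "read2\tprotB\t92.0\t50\t4\t0\t1\t150\t1\t50\t1e-15\t60\n"
        "read3\tprotC\t85.0\t50\t7\t0\t1\t150\t1\t50\t1e-05\t40\n"
    )
    rows = csv.reader(io.StringIO(demo), delimiter="\t")
    for subject, n, pct in summarize(rows):
        print(f"{pct:6.2f}\t{n}\t{subject}")
```

To get from subject IDs to taxa you would still need a mapping step, which is what the “taxonomic” conversion in Diamond view is meant to cover.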

Didn’t we discuss this in another thread already? You were experimenting since your reads were technically not what the tool was expecting to work with (scientifically).

The tool wants one file of reads in fastqsanger or fastqsangergz format (technically).
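If you want to try the interleaved option you mentioned, one way to get a single file from an R1/R2 pair is sketched below in Python. This assumes plain 4-line FASTQ records (no wrapped sequence lines) and that R1 and R2 list the reads in the same order; the function names are just for this example. It only addresses the “one file” technical requirement. Whether interleaved pairs are a sensible input for your analysis is a separate, scientific question.

```python
import io
from itertools import islice, zip_longest


def read_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_handle, r2_handle, out_handle):
    """Write R1 and R2 records alternately; return the number of pairs written."""
    pairs = 0
    for rec1, rec2 in zip_longest(read_records(r1_handle), read_records(r2_handle)):
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        for line in rec1 + rec2:
            # Normalize line endings so the last record is always terminated.
            out_handle.write(line.rstrip("\n") + "\n")
        pairs += 1
    return pairs


if __name__ == "__main__":
    # Made-up two-read example; for real data open the R1/R2 files in text
    # mode (use gzip.open(path, "rt") for .gz files) and an output file.
    r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nGGGG\n+\nIIII\n")
    r2 = io.StringIO("@a/2\nTTTT\n+\nIIII\n@b/2\nCCCC\n+\nIIII\n")
    out = io.StringIO()
    print(interleave(r1, r2, out), "pairs written")
```

After uploading the result, check that the datatype is set to fastqsanger (or fastqsangergz if you compressed it).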

In short: If the basic technical requirements are met, and a tool is still failing, then you probably need to find a method that better fits your data scientifically. Publications are probably the best resource when deciding how to process your own data, even when developing novel methods.

Thanks Jennifer, that’s helpful. I’m still not sure why the files were not running.

Another question you may have an answer for. Most of the online bioinformatics pipelines ask you to enter just the raw R1 + R2 reads. So for a basic 16S sample, they do the QC, filtering, etc. I don’t find the results very informative, and they are inconsistent between platforms. I just tried bbmerge and uploaded only the contigs, and this tremendously improved the results. I can only assume that the pipelines use suboptimal parameters so that they have fewer issues running a wider variety of samples.

This leads to my next question. From reading Standard Operating Procedures, it appears to be common with paired-end reads to use only the reads that merge and discard those that don’t, though some also use the singles. This brings me to mate-pair reads, where some overlap and many don’t: are these same programs discarding mate-pair reads that don’t overlap? That would fit with some results I see, which appear not to identify 90% of what some others do. Am I better off doing the QC and merging into contigs myself, then concatenating the singles?

Thanks,
Jon