de-interleave issues

Hi, I’m having a file problem.
I was processing some 16S FASTQs. After QC I used BBTools bbmerge to merge the reads; the reads that didn’t merge were output as a single interleaved file. Then I de-interleaved them so I have forward and reverse files. It seemed to work fine, but when I uploaded them to an online classifier it said the files don’t match. Any ideas what to do? I would prefer two files.
Thanks

Hi @Jon_Colman
I could not find an interleaved output option in BBTools bbmerge; bbmerge returns unmerged reads in what appears to be interleaved format. Do you mean these unmerged reads? If yes, have you run QC on the de-interleaved reads, for example with FastQC? Do you see identical numbers of forward and reverse reads? If yes, the read order might differ between the files. Run the de-interleaved files through a proper interleave/de-interleave tool to get output in the same order (see the sketch below). It is also possible that the read names are not compatible with the tool. It is hard to say without looking at the data.
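If the counts match but the order differs, something like BBTools repair.sh should restore pairing. A minimal sketch, assuming BBTools is installed; the file names are placeholders:

# count reads in each file (a FASTQ record is 4 lines)
echo $(( $(wc -l < reads_R1.fastq) / 4 ))
echo $(( $(wc -l < reads_R2.fastq) / 4 ))

# re-pair reads that are present in both files but out of order;
# reads without a mate go to the singletons file
repair.sh in1=reads_R1.fastq in2=reads_R2.fastq \
    out1=fixed_R1.fastq out2=fixed_R2.fastq outs=singletons.fastq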
Kind regards,
Igor

Hi Igor, maybe you can help with my other problem. Say I have WGS reads that I can map to various reference sequences. For example tuberculosis: say I can map 1,000 reads using any of the following mapping programs (HISAT2, BBMap, Bowtie2). Yet human host removal is removing all of my mapped reads. I have no idea what the problem is. Suggestions?

I have also thought: if I take my mapped reads and pull out one of the FASTA sequences, is there a program that can show me whether a certain part isn’t matching up, so that maybe I can trim it off? Then, if the same artifact is on all the reads, I can process them all together. Any ideas? It’s a bit over my head.
Thanks, Jon

Hi @Jon_Colman

Just to confirm: you have a TB sample, you mapped 1k reads to the human genome, and all of the reads mapped. Is that right? Do you have stats for the mapping? Note that BAM files usually contain all reads, including unmapped ones. If you are confident that all reads map to the host genome, the sample might be heavily contaminated with host reads. Maybe try more reads, map the reads to the TB genome, or check the sample with Kraken2 (use the MiniKraken database for faster processing; see the sketch below). Do you see reads classified as TB?
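For example, something along these lines; the BAM file name and the database path are placeholders:

# mapping statistics, including how many reads actually aligned
samtools flagstat aligned.bam

# quick taxonomic check of the raw reads with Kraken2
kraken2 --db minikraken_db --paired \
    --report kraken_report.txt --output kraken_output.txt \
    reads_R1.fastq reads_R2.fastq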

Practically all read mapping programs support so-called soft clipping. Soft clipping is recorded in the CIGAR string in BAM/SAM files. Adapter sequences present at read ends are usually soft clipped. Adapters can be removed using Cutadapt, Trimmomatic, fastp, or similar tools (see the sketch below). I don’t know whether this is related to your situation or not.
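To check whether your alignments are heavily soft-clipped, you could inspect the CIGAR strings, then trim adapters if that turns out to be the cause. A rough sketch; the adapter shown is the common Illumina TruSeq sequence, so substitute your own:

# count alignments whose CIGAR string (SAM field 6) contains a soft clip (S)
samtools view aligned.bam | awk '$6 ~ /S/' | wc -l

# trim a known adapter from both reads of each pair with Cutadapt
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
    -o trimmed_R1.fastq -p trimmed_R2.fastq reads_R1.fastq reads_R2.fastq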

Kind regards,
Igor

I had three recent WGS samples done: actually myself, my wife, and my recently deceased dog, about 100M reads each. I don’t know if you are aware of the Kaiju translated-protein database. My wife and I both had 50,000-80,000 reads of tuberculosis (could be leprosy). My dog had 150,000 reads of tuberculosis (could be leprosy) as well as a very high level of Plasmodium. From what I have researched, a positive result for WGS is 5-10 reads per million after host removal. The numbers are astronomical! One of the online classifiers, One Codex, showed about 50,000 reads of leprosy on a sample last summer; they didn’t have the species of Plasmodium in their database, so there was a fair number of other Plasmodium species (a CSF-leak-from-nose sample showed similar results).

After fastp and BBDuk I was mapping over 100,000 reads each to leprosy and leprosy/tuberculosis. Basically the same species as the samples from last summer, only more.

The samples are fully trimmed, plus an extra 2 bases off each end, with fastp followed by BBDuk. Here is another example: one sample is canine, from one of my dogs that recently died, so there should not be any human DNA. I did canine host removal with HISAT2, roughly as sketched below.
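The host-removal step looked something like this; the index name is a placeholder and the exact flags are from memory:

# map to the canine genome; read pairs that do NOT align concordantly
# (presumed non-host) are written out, with % replaced by 1 or 2
hisat2 -x canine_index -1 trimmed_R1.fastq -2 trimmed_R2.fastq \
    --un-conc-gz nonhost_R%.fastq.gz -S /dev/null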

Hi Jon,

Thank you for the additional information. I feel that I am not qualified to answer your questions. I am sorry that I cannot help much.

The best option for diagnostics is to follow established protocols. Sometimes even a different version of the software can produce different results.

Kind regards,
Igor

Hi Igor, is there a way to download my files to my Google Drive? I’m over my storage limit and I’m stuck right now. Or can I save to the larger storage on Galaxy? Any suggestions?

I’ve already deleted everything I can.

Hi @Jon_Colman
Try User (in the top Galaxy menu) > Preferences > Storage location. The ORG server provides temporary 1 TB storage for one month.
Do you have command-line access to your Google Drive? If yes, try wget or curl. You can get links to Galaxy files by clicking Copy link (the chain icon). The links might require minor editing: remove the question mark and everything after it. As a test, try downloading a tiny dataset with:
curl https://usegalaxy.org.au/api/datasets/a6e389a98c2d1678cc8cd0618d7489cc/display -o peaks.bed
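Or, equivalently, with wget:

wget -O peaks.bed https://usegalaxy.org.au/api/datasets/a6e389a98c2d1678cc8cd0618d7489cc/display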
Hope this helps.
Kind regards,
Igor

I was able to make it work. The next morning some space opened back up, and I restarted one of my other jobs on the USA site. Thanks for your help. Hopefully I’ll finish these up tomorrow.