Downloading abundance (frequency) table of ASVs

I had 104 single-end FASTQ files without primers but with adapters (FastQC detected only the Illumina small RNA 3’ adapter, in half of the files). I followed this pipeline to get to ASVs and frequency data, but I can’t download them as readable files. I want to download a CSV or TSV file because I want to build a machine learning model that analyzes the 16S region to diagnose diseases.

My pipeline:

  1. Pasted SRA accessions as txt in Galaxy

  2. Created a collection: Single-end data (fastq-dump) a list with 104 fastqsanger.gz datasets

  3. Ran FASTQC on all 104 files to get quality report and detect adapters

  4. Ran MULTIQC on 104 individual FASTQC reports to have a combined report

  5. qiime2 cutadapt trim-single by giving the adapter sequence as input to trim the adapters

  6. qiime2 tools export to export the trimmed fastq files

  7. qiime2 dada2 denoise-single on the trimmed fastq files to get the ASVs (abundance table, sequences and frequencies). This tool gave 3 outputs:
    representative_sequences.qza,
    table.qza,
    denoising_stats.qza

  8. qiime2 feature-table summarize on both representative_sequences.qza and table.qza to get corresponding qzv files.

  9. Finally I opened qzv files and I saw the information that I need.

How do I download this as TSV or CSV, or in any file type that I can use for ML algorithms? I cannot download it from Galaxy. I tried these tools:

    qiime2 tools export on data 2210 as BIOMV210DirFmt (feature-table) converted to biom1
    qiime2 feature-table tabulate-seqs

They did not help and I don’t have any files. Please help me.
I appreciate your suggestions.

Have a nice day, everyone.

Hello @cancamuz,

The .qza and .qzv files that Qiime2 makes are just .zip files. You can rename them to .zip, then Windows or OSX will extract them automatically when you double click on them!

Because they are zip archives, you can use Linux CLI tools too.

unzip -lx example.qzv
unzip -lx denoising_stats.qza
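
If you would rather do the same thing from Python, here is a minimal sketch using only the standard library. It assumes the usual archive layout, where the payload sits under `<uuid>/data/` (for example `feature-table.biom` for a feature table). The function names `list_members` and `extract_data_files` are made up for this example and are not part of QIIME 2.

```python
import zipfile
from pathlib import Path, PurePosixPath


def list_members(archive_path):
    """Return the file names stored in a .qza/.qzv archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()


def extract_data_files(archive_path, out_dir):
    """Extract only the files under '<uuid>/data/' and return their paths."""
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            parts = PurePosixPath(info.filename).parts
            # Assumed layout: <uuid>/data/<payload files>; skip everything else
            # (metadata, provenance, directory entries).
            if info.is_dir() or len(parts) < 3 or parts[1] != "data":
                continue
            extracted.append(Path(archive.extract(info, out_dir)))
    return extracted


# Example usage (hypothetical file names):
#   print(list_members("table.qza"))
#   print(extract_data_files("table.qza", "table_extracted"))
```

For a feature table, the extracted file is in BIOM format rather than plain text. If the `biom-format` package is installed, `biom convert -i feature-table.biom -o table.tsv --to-tsv` should turn it into a TSV.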

Let me know how that works for you!

That worked well! Did not know that the solution could be that simple! Thank you so so much!

I would like to ask for your personal opinion on this pipeline. I have built some ML models; basically, I want to classify disease stages based on the microbiome.

I used the ASV table to get the taxa that the sequences belong to, and built my models using taxa as features. I am sharing my training dataset. Since I have only 104 patient samples, I know that it is a hard task. Still, my accuracy is lower than I expected. Can you tell me if I missed something? I also did feature selection and elimination based on prevalence, abundance, and variance.

As the next step, I am planning to build a pathway (functional importance) table using PICRUSt. Maybe that provides more explanatory information and I can get a higher classification accuracy? What do you suggest?

Thank you and have a nice day.

Image: rows = patient samples, columns = taxons, values = abundance in that sample.

I do this professionally if you need a hand! If you have funding for a domain expert, I can help more directly. (I can also get under NDA and look at the real data.)

I can also offer you some free advice to get started:

This is a good idea!

Conversely, it may also be helpful to use the ASVs as features directly, as this will not be limited by the taxonomy resolution. (Noise from the database will be introduced during taxonomy assignment, so avoiding taxonomy is worth trying.)
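
Here is a rough sketch of what using ASVs directly could look like. It assumes the table was exported to a TSV in which rows are ASVs and columns are samples, with a header cell starting with `#` (such as `#OTU ID`). The function names and the 10% prevalence cutoff are illustrative choices, not recommendations.

```python
import csv


def read_feature_table(tsv_path):
    """Parse a feature-table TSV into (sample_ids, feature_ids, counts).

    counts[i][j] is the count of feature j in sample i, so rows are samples
    and columns are ASVs.
    """
    with open(tsv_path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    # Assumed layout: optional single-cell '# ...' comment line, then a header
    # whose first cell starts with '#', then one row per ASV.
    header_index = next(
        i for i, row in enumerate(rows) if row[0].startswith("#") and len(row) > 1
    )
    sample_ids = rows[header_index][1:]
    body = rows[header_index + 1:]
    feature_ids = [row[0] for row in body]
    # Transpose from ASV-by-sample (file) to sample-by-ASV (ML convention).
    counts = [[float(row[j + 1]) for row in body] for j in range(len(sample_ids))]
    return sample_ids, feature_ids, counts


def to_relative_abundance(counts):
    """Scale each sample's counts to sum to 1 (all-zero samples stay zero)."""
    result = []
    for sample in counts:
        total = sum(sample)
        result.append([value / total if total else 0.0 for value in sample])
    return result


def filter_by_prevalence(feature_ids, matrix, min_prevalence=0.1):
    """Keep features that are non-zero in at least min_prevalence of samples."""
    n_samples = len(matrix)
    keep = [
        j for j in range(len(feature_ids))
        if sum(1 for sample in matrix if sample[j] > 0) / n_samples >= min_prevalence
    ]
    kept_ids = [feature_ids[j] for j in keep]
    kept_matrix = [[sample[j] for j in keep] for sample in matrix]
    return kept_ids, kept_matrix
```

The resulting sample-by-ASV matrix can be passed to whatever ML library you already use. With only 104 samples, the number of ASV columns will likely exceed the number of rows, so some filtering like this is probably still needed.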

I would use PICRUSt2 (picrust/picrust2 on GitHub), though you may be doing that already because you are using ASVs.

Predicting functional capability using only amplicons is amazing, but it’s an inference and does contain noise. Actual shotgun data helps confirm or refute PICRUSt2 predictions.