Problem processing paired-end transcriptomic data, uploaded as a collection, with Cutadapt

Hi
I have uploaded Illumina-sequenced paired-end transcriptomic data to Galaxy (http://usegalaxy.org) as a collection using the rule-based uploader. I pasted the .txt file containing the study accession, sample accession, experiment accession & fastq files, defined the columns as list identifiers & URL, marked the datatype as fastqsanger.gz & set the genome as required. It created a simple dataset list.
Now, to use Cutadapt, how should I treat the generated collection: paired-end or paired collection? I have tried paired-end collection, but it shows that no fastqsanger, fastqsanger.gz, or fast collection is available.

Hello @Debarati_Chakraborty

Was the data originally interleaved/interlaced fastqsanger.gz? That is the only way paired-end data would end up as a single “list” collection (assuming your rules are OK). If so, not many tools will work with interlaced fastq directly. The data needs to be de-interlaced first. Tools: FASTQ de-interlacer (run once) or seqtk_seq (run twice, once to extract forward, once to extract reverse).
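
To make "interlaced" concrete, here is a minimal Python sketch of what de-interlacing does. It assumes the records strictly alternate forward, reverse, forward, reverse, and the names `read_fastq` and `deinterlace` are made up for illustration. This is not how the Galaxy FASTQ de-interlacer is implemented (that tool also matches reads by identifier and reports unpaired reads separately); it only shows the idea.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, "+" line, qualities.
Record = Tuple[str, ...]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records, skipping blank lines."""
    stripped = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        chunk = tuple(islice(stripped, 4))
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        yield chunk


def deinterlace(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Split strictly alternating records into forward and reverse lists.

    Assumption: record 0 is forward, record 1 is its reverse mate, and so on.
    """
    forward: List[Record] = []
    reverse: List[Record] = []
    for index, record in enumerate(read_fastq(lines)):
        (forward if index % 2 == 0 else reverse).append(record)
    if len(forward) != len(reverse):
        raise ValueError("odd number of records; input is not cleanly interlaced")
    return forward, reverse


# Tiny in-memory example (hypothetical reads).
interlaced = [
    "@read1/1", "ACGT", "+", "IIII",
    "@read1/2", "TGCA", "+", "IIII",
    "@read2/1", "GGCC", "+", "IIII",
    "@read2/2", "CCGG", "+", "IIII",
]
forward, reverse = deinterlace(interlaced)
```

For a real file, the same functions could be fed an open handle, for example one from `gzip.open(path, "rt")` for a .gz file. Inside Galaxy, the tools named above are the practical route.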

Paired-end data is best organized in one of these two ways. And you might decide or need to use both formats in the same workflow. It depends on the tools and manipulations. Converting from Interlaced to De-interlaced could be part of your workflow (early step), as could the Zip and Unzip collection tools (within the workflow).

  • One collection for the forward reads and one for the reverse reads. Each will be a collection type “list” and entered as two “paired-end” inputs on tool forms.

  • One collection that contains both ends, structured correctly. The collection type is “list-paired” and entered as one “paired collection” input on tool forms.
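
As a rough picture of the difference between the two layouts, the Python sketch below groups file names into a nested "sample, then forward/reverse" structure, which is the shape a “list-paired” collection has. The `_1.fastq.gz` / `_2.fastq.gz` naming, the sample names, and the function `group_pairs` are assumptions for illustration only; in Galaxy itself this grouping is done with the uploader rules or the Zip collection tool, not with code.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Assumed naming convention: <sample>_1.fastq.gz (forward), <sample>_2.fastq.gz (reverse).
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<end>[12])\.fastq\.gz$")


def group_pairs(filenames: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Group file names by sample into {"forward": ..., "reverse": ...}."""
    pairs: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match is None:
            raise ValueError(f"cannot tell which end this file is: {name}")
        end = "forward" if match["end"] == "1" else "reverse"
        pairs[match["sample"]][end] = name
    incomplete = sorted(
        sample for sample, ends in pairs.items()
        if set(ends) != {"forward", "reverse"}
    )
    if incomplete:
        raise ValueError(f"samples missing one end: {incomplete}")
    return dict(pairs)


# Hypothetical sample names.
files = [
    "sampleA_1.fastq.gz", "sampleA_2.fastq.gz",
    "sampleB_1.fastq.gz", "sampleB_2.fastq.gz",
]
paired = group_pairs(files)

# The "two lists" layout is the same data, split by end instead of by sample.
forward_list = {sample: ends["forward"] for sample, ends in paired.items()}
reverse_list = {sample: ends["reverse"] for sample, ends in paired.items()}
```

Either layout keeps the sample identifier as the element name, which is what lets tools match the forward and reverse reads for the same sample.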

Tutorials that will help:

Thanks!

Hi Jennifer
I have tried it and got 4 files, right and left singles and right and left mates. What is the meaning of “mates”? I guess “singles” means data without mates? Sorry for the naive question!

Thanks
Debarati


Hi @jennaj This is what I got on doing multiQC with HiSat2 output. % Aligned is 30.3%. Guess this is very bad, right?

Can you please help?

Thanks
Debarati

Mates are paired-end reads with long inserts. Got this article for mate pair sequencing. https://www.ecseq.com/support/ngs/what-is-mate-pair-sequencing-useful-for

Many tools (especially fastq QA tools) will produce 4 outputs: two datasets for the forward reads + reverse reads that are still paired, and two for those that are no longer paired.
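
The four-output pattern can be sketched in Python as below. This is only an illustration of the logic, not the code of any particular tool: the function name `split_by_pairing` and the read names are made up, and it assumes read identifiers have already had any `/1` or `/2` suffix removed so that the two ends of a pair share the same name.

```python
from typing import Dict, List


def split_by_pairing(forward_ids: List[str], reverse_ids: List[str]) -> Dict[str, List[str]]:
    """Sort read names into still-paired and no-longer-paired groups."""
    forward_set = set(forward_ids)
    reverse_set = set(reverse_ids)
    return {
        # Both ends survived: these stay paired.
        "forward_paired": [r for r in forward_ids if r in reverse_set],
        "reverse_paired": [r for r in reverse_ids if r in forward_set],
        # Only one end survived: these are the "singles".
        "forward_singles": [r for r in forward_ids if r not in reverse_set],
        "reverse_singles": [r for r in reverse_ids if r not in forward_set],
    }


# Hypothetical reads left after a QA step: read2 lost its reverse end,
# read3 lost its forward end.
result = split_by_pairing(
    forward_ids=["read1", "read2", "read4"],
    reverse_ids=["read1", "read3", "read4"],
)
```

Here `read1` and `read4` land in both paired outputs, while `read2` and `read3` each land in one of the singles outputs.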

Some tools require that both ends of a pair are input at runtime (and will fail if a pair is missing one end – or “mate”).

Other tools will accept paired + unpaired ends at runtime but any unpaired are ignored at that step, or possibly not output at all, or output but annotated as unpaired. Unpaired can also be filtered out as an independent step.

The presence of intact pairs depends on the tools/manipulations already applied to the data. Whether intact pairs matter or not for downstream tools/manipulations depends on the requirements of those tools/analysis goals.

Data grouped into a paired-end dataset collection will always have both ends present. Any pairs that become unpaired are broken out into distinct outputs.

If you are losing many intact pairs, that could indicate a true data quality problem or filtering/QA criteria that are too strict. Each case is different. Running FastQC or fastp (along with MultiQC for a summary) on read data before and after QA can help to interpret how that step/tool is impacting your particular data.

Long reply for a short answer, but hopefully, this helps! Tool form help and linked manuals usually cover how paired data is handled. If you are not sure about a particular tool, we can try to help more.

Thanks!

The goal of mapping steps is to maximize the percentage of properly paired reads.

Run FastQC on your original data. If there is a quality issue, that tool will likely report it (for how to interpret the reports, see Babraham Bioinformatics - FastQC: A Quality Control tool for High Throughput Sequence Data). QC/QA tools like Cutadapt, Trimmomatic, and fastp can address some of those issues to improve mapping rates. Other times the data has problems that cannot be addressed, such as library construction issues or sequencing problems.

Low mapping rates can also have technical causes: the wrong target genome/transcriptome/exome was mapped against, the target “-ome” is highly fragmented or a draft assembly, an incorrect annotation dataset was incorporated during mapping, and/or the settings used with the mapping tool are not a match for the data (read orientation is not correct? mapping criteria too strict?).

The tutorials linked above have example usage. You may also wish to review these FAQs to eliminate technical issues:

To find more Q&A about HISAT2 that covers common issues others have run into, search the forum with the keyword “mapping” or “hisat2”.

Thanks!
