Trying to create a workflow to analyze my nanopore sequencing data

nanohacker1 · December 7, 2024, 12:11am

Hello, I'm very new to the bioinformatics field!
I am working with nanopore sequencing data and trying to analyze it.
I've tried to create a new workflow, but I'm not 100% sure that what I am doing is correct. Could you please help me out?

I've started out with the following steps:

single fastq files → porechop (Galaxy Version 0.2.4+galaxy0) → fastp (Galaxy Version 0.24.0+galaxy3)

Afterwards, I wasn't sure which of these 3 ways would be the correct continuation:

Test 2.1) Minimap2 (Galaxy Version 2.28+galaxy0) (including the reference Genome: Homo_sapiens.GRCh38.cdna.all.fa) → featureCounts (Galaxy Version 2.0.6+galaxy0) (using Homo_sapiens.GRCh38.111.gtf)

Test 2.2A) Minimap2 (Galaxy Version 2.28+galaxy0) (including the reference Transcriptome: RefTranscriptRealShitGalaxy370-ob__gencode.v47.transcripts.fa.gz__cb.fasta uncompressed) → featureCounts (Galaxy Version 2.0.6+galaxy0) (using Homo_sapiens.GRCh38.111.gtf)

Test 2.2B) Minimap2 (including the reference Transcriptome: RefTranscriptRealShitGalaxy370-ob__gencode.v47.transcripts.fa.gz__cb.fasta uncompressed) → Sambamba sort (Galaxy Version 1.0.1+galaxy1) → Salmon quant (Galaxy Version 1.10.1+galaxy2) (using RefTranscriptome: RefTranscriptRealShitGalaxy370-ob__gencode.v47.transcripts.fa.gz__cb.fasta uncompressed and the BioMart Gene list: RefTranscriptRealShitGalaxy370-[gencode.v47.transcripts.fa.gz].fasta.gz)

The following inputs were generally used, as shown in the attachment:
Reference Genome fasta: Homo_sapiens.GRCh38.cdna.all.fa
GTFfile Genome: Homo_sapiens.GRCh38.111.gtf
Reference Transcriptome fasta: RefTranscriptRealShitGalaxy370-ob__gencode.v47.transcripts.fa.gz__cb.fasta uncompressed
BioMart Genelist: RefTranscriptRealShitGalaxy370-[gencode.v47.transcripts.fa.gz].fasta.gz

Also, another question: porechop is taking a really long time (>24 h for a single file). Is this normal? How can I make it run faster?

Many thanks for your help!

jennaj · December 9, 2024, 9:33pm

Hi @nanohacker1

For the part about porechop taking more than 24 hours: it could just mean the job was queued for a while and is now executing. Or maybe it is still queued? This is how to check:

Then for your questions: you could try comparing how the genomic versus transcriptomic DE results turn out. But in general, if you are working with a known reference genome and known annotated genes, using that is usually a good idea.

featureCounts → Map against a reference genome first, then generate the counts with this tool.
Salmon → Maps against a transcriptome at runtime.
DESeq2 can be used with either (counts, TPM values), but edgeR and limma will expect just counts.
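
To make the difference between raw counts and TPM values a bit more concrete, here is a minimal Python sketch of the textbook TPM calculation. The function name `counts_to_tpm` and the toy transcripts are made up for illustration, and Salmon works with effective lengths and its own model, so its reported values are not computed exactly this way.

```python
def counts_to_tpm(counts, lengths):
    """Convert raw read counts to TPM, given feature lengths in bases.

    Both arguments are dicts keyed by feature name (hypothetical inputs).
    """
    # Reads per kilobase: divide each count by the feature length in kb.
    rpk = {name: counts[name] / (lengths[name] / 1000) for name in counts}
    # Scale so that all values sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    if scale == 0:
        return {name: 0.0 for name in counts}
    return {name: value / scale for name, value in rpk.items()}


# Toy data: txA and txB have the same count, but txB is twice as long.
counts = {"txA": 100, "txB": 100, "txC": 0}
lengths = {"txA": 1000, "txB": 2000, "txC": 500}
tpm = counts_to_tpm(counts, lengths)

for name, value in tpm.items():
    print(name, round(value, 2))
```

The two transcripts with equal counts end up with different TPM values because of the length normalization, which is why count-based tools and TPM-based inputs should not be mixed without care.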

We have a couple of guides that can help to get the reference data in order. I’m not sure why you are mixing up data sources. You can also use the built-in hg38 reference genome instead of loading up that fasta separately.
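
One quick way to see whether reference files belong together is to compare the sequence names in a FASTA with the first column of the GTF. Below is a rough Python sketch, assuming plain uncompressed text input. The helper names and the toy records are hypothetical and are not real annotation entries.

```python
import io


def fasta_names(handle):
    """Return the set of sequence names (first word of each '>' header)."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_seqnames(handle):
    """Return the set of values in column 1 of a GTF, skipping comments."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def shared_names(fasta_handle, gtf_handle):
    """Names present in both files; an empty set suggests a mismatch."""
    return fasta_names(fasta_handle) & gtf_seqnames(gtf_handle)


# Toy stand-ins: a transcript-style FASTA, a genome-style FASTA, and a GTF.
CDNA_FASTA = ">TX0001.1 cdna toy\nACGT\n>TX0002.1 cdna toy\nGGCC\n"
GENOME_FASTA = ">1 dna toy\nACGTACGT\n>2 dna toy\nTTGGCCAA\n"
GTF = (
    "#!toy annotation\n"
    '1\ttoy\tgene\t1\t8\t.\t+\t.\tgene_id "G1";\n'
    '2\ttoy\tgene\t1\t8\t.\t-\t.\tgene_id "G2";\n'
)

print(shared_names(io.StringIO(CDNA_FASTA), io.StringIO(GTF)))
print(shared_names(io.StringIO(GENOME_FASTA), io.StringIO(GTF)))
```

If the FASTA used for mapping contains transcript sequences, the reference names in the alignments will be transcript IDs rather than chromosome names, so a genome GTF will most likely have nothing to match against and the counts come out empty.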

If you want to review some existing workflows for ideas, please start here.

Hope this helps! You should be able to get a good solid core of reference data organized for human. From there you can then explore and compare all of these tools in your workflow and make choices. The same reference data can be used with all of them.