scRNAseq Galaxy workflow

Hello Galaxy folks,

I have been using Galaxy for a while to analyse RNAseq data and I think it’s incredible. Now I was wondering whether there is a workflow for single-cell RNAseq data that starts from FASTQ files.

I have been trying to follow these tutorials

But so far I couldn’t finish them, so I was wondering if any of you know of a complete workflow.

I liked this one from the EMBL-EBI https://www.youtube.com/watch?v=1w4sA-qyO3g
but they use data from GXA as the source, so the preprocessing of the data is not covered, and that is where I am struggling.

Hope you can help me.

Best,

When you say “couldn’t finish”, do you mean that the workflow failed? And if so, how did it fail?

Hello Astrov, thanks for replying.

It failed because I tried to build a hybrid workflow from the pre-processing workflow and the one explained in the YouTube video from EMBL-EBI by Wendy. Where I fail is the pre-processing from the FASTQ files.

I followed this workflow:


In it they specify the FASTQ files “read1” and “read2” as containing the barcodes and the sequence, respectively, along with a Cell Ranger whitelist, which they provide via the Zenodo link. (picture attached)

When I try to do these steps, Galaxy says wrong input file, and this is where my first very naive question comes in. I think I don’t fully understand the output of the sequencing for scRNAseq analyses.

I sent two samples and what I received is 8 FASTQ files.
I was counting on 2 FASTQ files (paired-end sequencing) per sample, but I have 4 per sample, i.e. (picture attached):

Looking at other workflows that allow you to use pre-loaded scRNAseq files, like this 10X STAR SOLO workflow (picture attached), I could understand every read generating two files, which would make 4 files per sample, but in my case I have 8 and I don’t fully understand why.

Any ideas about this?

Thanks in advance.

Javi

Hi Javi,

When you preview your FASTQ files (the eye icon), can you read them in plain text or is it garbled?

If garbled, can you change the datatype to fastqsanger.gz?

If not garbled, can you change the datatype to fastqsanger?
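
If you would rather check this outside of Galaxy, here is a minimal sketch, assuming you have the files on your own machine. The function name `guess_fastq_datatype` is made up for this example; it only looks at the first bytes of a file and does not verify the quality encoding that fastqsanger implies.

```python
# gzip-compressed files always start with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"

def guess_fastq_datatype(first_bytes):
    """Guess a Galaxy datatype name from the first bytes of a file.

    Hypothetical helper: "garbled" in the preview usually means compressed
    data, while a plain FASTQ record starts with an '@' header line.
    """
    if first_bytes[:2] == GZIP_MAGIC:
        return "fastqsanger.gz"
    if first_bytes[:1] == b"@":
        return "fastqsanger"
    return "unknown"

# Usage with a local file (the path is a placeholder):
#   with open("my_sample_read1.fastq", "rb") as handle:
#       print(guess_fastq_datatype(handle.read(2)))
```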

If you private message me your Galaxy history, I can debug this further.

Best,
Mehmet

Hello mtekman,

I think they are not garbled, it’s the normal structure of a FASTQ file. I tried to send you a private message with the history but it didn’t let me, dunno why. Sorry man.

Thanks!

Can you share your history with me?

Table of Contents
_________________

1. Determining the 10x inputs
.. 1. Order of inputs
.. 2. Barcode Size and Chemistry
2. Obtaining annotation data
.. 1. GTF and Whitelist
3. Running the tool
.. 1. Setting Basic Parameters
..... 1. Setting the input pairs
..... 2. (Optional) Multiple input pairs
..... 3. GTF and Whitelist
.. 2. Setting Advanced Parameters


Hi Javi, thanks for sharing your history with me.


1 Determining the 10x inputs
============================

1.1 Order of inputs
~~~~~~~~~~~~~~~~~~~

  The first thing you need to do is to determine which of your paired
  reads is the cDNA read, and which is the barcode read. Typically
  barcode is Read1 and cDNA is Read2, but we should confirm this
  manually:

  * By picking two _read1.fastq's at random, we can see that these are
    the barcoded reads, since most of the sequences end with poly-T tails.
  * By picking two _read2.fastq's at random, we can infer that these are
    cDNA reads because the sequences are more varied.
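
  The same eyeball check can be sketched in a few lines of Python, if
  you prefer counting over looking. This is only an illustration: the
  function names and the "8 or more trailing T's" threshold are
  assumptions of this sketch, not part of any tool.

```python
import re
from itertools import islice

# A read "ends with a poly-T tail" here if its last 8+ bases are all T.
# The threshold of 8 is an arbitrary choice for this sketch.
POLY_T_TAIL = re.compile(r"T{8,}$")

def sequence_lines(fastq_lines, limit=1000):
    """Return up to `limit` sequences (2nd line of each 4-line FASTQ record)."""
    seqs = (line.strip() for i, line in enumerate(fastq_lines) if i % 4 == 1)
    return list(islice(seqs, limit))

def poly_t_fraction(seqs):
    """Fraction of sequences that end with a poly-T tail."""
    if not seqs:
        return 0.0
    return sum(1 for s in seqs if POLY_T_TAIL.search(s)) / len(seqs)

# Tiny in-memory example with two made-up records:
# the first ends in ten T's, the second does not.
example = [
    "@read_a", "ACGTACGA" + "T" * 10, "+", "I" * 18,
    "@read_b", "ACGTACGTACGA", "+", "I" * 12,
]
example_fraction = poly_t_fraction(sequence_lines(example))
```

  With real data you would pass an open text handle instead of the
  list (for example `gzip.open(path, "rt")` for a .gz file). A file
  where this fraction is high is likely the barcode read; a file where
  it is close to zero is likely the cDNA read.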


1.2 Barcode Size and Chemistry
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Next, we need to find out what 10x chemistry we are using, and we can
  determine that by looking at the size of our barcodes, as given in the
  [tutorial].

  * By picking a few random reads in a _read1.fastq file, we can count
    from the start of a sequence to the first 'TTTT' (allowing for a few
    small mismatches), and we see that the barcode size is 26bp, meaning
    this is v2 10x chemistry.
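
  The counting described above can be sketched like this. It is a
  rough estimate under stated assumptions: the helper names are made
  up, the first run of four T's is taken as the start of the poly-T
  tail, and the most common position across reads is used so that
  reads whose UMI happens to end in T (which shifts the position by a
  base or two) do not skew the answer.

```python
from collections import Counter

# Combined cell barcode + UMI lengths for the two chemistries
# described in the tutorial.
CHEMISTRY_BY_LENGTH = {
    26: "v2 (16 bp barcode + 10 bp UMI)",
    28: "v3 (16 bp barcode + 12 bp UMI)",
}

def first_poly_t_position(seq, min_run=4):
    """Index of the first run of `min_run` T's, or None if there is none."""
    idx = seq.find("T" * min_run)
    return idx if idx >= 0 else None

def most_common_barcode_length(seqs):
    """Most frequent poly-T start position across the given reads."""
    positions = (first_poly_t_position(s) for s in seqs)
    counts = Counter(p for p in positions if p is not None)
    return counts.most_common(1)[0][0] if counts else None

# Three made-up barcode reads: 26 bases followed by a poly-T tail.
# The third one ends its 26 bases with a T, so its position is 25.
example_reads = [
    "ACGTACGTACGTACGTACGTACGTAC" + "T" * 30,
    "ACGTACGTACGTACGTACGTACGTAG" + "T" * 30,
    "ACGTACGTACGTACGTACGTACGTAT" + "T" * 30,
]
estimated = most_common_barcode_length(example_reads)
chemistry = CHEMISTRY_BY_LENGTH.get(estimated, "unknown")
```

  For the made-up reads above the estimate comes out as 26, which the
  table maps to v2.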


[tutorial]
<https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/scrna-preprocessing-tenx/tutorial.html#10x-chemistries>


2 Obtaining annotation data
===========================

  To run the input FASTQ data on STARsolo, we need the whitelist of cell
  barcodes and the GTF file with the gene annotations.


2.1 GTF and Whitelist
~~~~~~~~~~~~~~~~~~~~~

  From the [Zenodo link]:
  * Copy the URLs of the *3M-february-2018.txt.gz* and
    *Homo_sapiens.GRCh37.75.gtf* files
  * Paste them into Galaxy using the upload data dialog
  * Wait for them to be imported into your history


[Zenodo link] <https://zenodo.org/record/3457880>


3 Running the tool
==================

  Finally we can run the tool using all the inputs and additional files
  we have obtained.


3.1 Setting Basic Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3.1.1 Setting the input pairs
-----------------------------

  Since your FASTQ files are individual files and not collections, we
  set the *Input Type* to "Single files". Then, for each Input Pair we
  set
  * *Barcode reads* to a <prefix>_read1.fastq.gz file
  * *cDNA reads* to a <prefix>_read2.fastq.gz file

  where the <prefix> should be identical for each pair of inputs.
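
  If you have many files, matching them up by prefix can be sketched
  as below. The file names in the example are hypothetical (I am only
  assuming the `_read1` / `_read2` suffix pattern described above),
  and `pair_by_prefix` is a name invented for this sketch.

```python
import re
from collections import defaultdict

# Matches "<prefix>_read1.fastq" / "<prefix>_read2.fastq", optionally ".gz".
READ_FILE = re.compile(r"^(?P<prefix>.+)_read(?P<read>[12])\.fastq(\.gz)?$")

def pair_by_prefix(filenames):
    """Group file names into (barcode read, cDNA read) pairs by prefix.

    Files that do not match the pattern, or that have no partner with
    the same prefix, are left out of the result.
    """
    groups = defaultdict(dict)
    for name in filenames:
        match = READ_FILE.match(name)
        if match is not None:
            groups[match.group("prefix")][match.group("read")] = name
    return {
        prefix: (reads["1"], reads["2"])
        for prefix, reads in groups.items()
        if set(reads) == {"1", "2"}
    }

# Hypothetical file names, for illustration only.
example_files = [
    "sampleA_L001_read1.fastq.gz",
    "sampleA_L001_read2.fastq.gz",
    "sampleA_L002_read1.fastq.gz",
    "orphan_read2.fastq.gz",
]
example_pairs = pair_by_prefix(example_files)
```

  Each resulting pair gives you the *Barcode reads* file first and the
  *cDNA reads* file second for one Input Pair.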


3.1.2 (Optional) Multiple input pairs
-------------------------------------

  If you have multiple input pairs you can insert them all at once by
  clicking on the *+ Insert Input Pairs* button and repeating the above
  for a different <prefix>. This approach comes with some
  advantages/disadvantages:

  * Pro - Multiple input pairs at once has the benefit of having one
    large all-inclusive count matrix at the end of the run.
  * Con - If just *one* of your inputs is malformed, then the entire
    tool fails and it will be hard to know which dataset was
    responsible.


  Normally one should first try setting all inputs at once and, if that
  run fails, fall back to running each input pair separately.


3.1.3 GTF and Whitelist
-----------------------

  Now we need to use our additional files. For *RNA-Seq Cell Barcode
  Whitelist* set this to your "3M-february-2018" dataset.

  Set the *Custom or built-in reference genome* to "Use a built-in
  index", and set the *Reference genome with or without an annotation*
  to "use genome reference without builtin gene-model".

  For *Select reference genome* set this to "Human Dec. .... hg19", since
  GRCh37, the build of our GTF file, corresponds to hg19.

  Then for *Gene model (gff3,gtf) file for splice junctions* set this to
  our GTF file.


3.2 Setting Advanced Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Now, under *Configure Chemistry Options*, set this to "Cell Ranger v2",
  as we determined from our barcode size before.

  Finally we need to tell RNA STARsolo that the barcode size is variable
  due to all the poly-T bases in the barcode sequence. If you scroll
  down in the RNA STARsolo tool there is an option *Barcode Size is same
  size of the Read*, which is set to Yes by default. Set this to "No".

  After that hit execute, and your jobs should commence!

  Best, Mehmet

Hello again mtekman,

Thanks for the reply, it’s really detailed. I’m definitely gonna try your suggestions and let you know if they work for me.

Again, thanks a lot for your time.

best, Javi

Happy to help! Sorry if the format of the last reply was weird, it’s just how I write sometimes :wink:

Hello mtekman!

Here I go again, sorry for the late response. Your pipeline looks very good, I understood everything, it’s very self-explanatory.

I still have an issue running the pipeline, though. The issue is related to the GTF file. When running STARsolo it fails and the error message says the following:

Fatal INPUT FILE error, no valid exon lines in the GTF file: /data/dnb02/galaxy_db/files/020/723/dataset_20723381.dat
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.

Jun 12 09:27:16 ...... FATAL ERROR, exiting

gzip: stdout: Broken pipe

gzip: stdout: Broken pipe

gzip: stdout: Broken pipe

gzip: stdout: Broken pipe

What can I do to solve this? Can I somehow modify the GTF file itself?

Cheers man!

Javi

Hi @SciJrb

The GTF may be a mismatch with the reference genome you are using, or the GTF is not annotated with the features that are set on the tool form to “group by”, or, if a custom genome is used, it may not be formatted in a way that tools can interpret. The first and last problems usually show up after mapping, when downstream steps that include annotation are incorporated, but they could show up earlier if you are incorporating annotation during the mapping step.

Tools use chromosome identifiers to match data up – and they must be an exact match from the same genome build/version – to produce correct results. Tools may not even fail when there is a mismatch problem – just produce odd results (not always easy to detect).
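
As an illustration of what a naming mismatch looks like, here is a minimal sketch. It assumes the GTF uses Ensembl-style names ("1", "2", "MT") while the built-in hg19 index uses UCSC-style names ("chr1", "chr2", "chrM"); whether that is the cause in this particular run is an assumption. The function names are made up for the example, and unplaced scaffolds would need a proper mapping table rather than a simple prefix.

```python
def gtf_seqnames(lines):
    """Collect the chromosome names (first column) used in a GTF file."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names

def to_ucsc_name(name):
    """Convert an Ensembl-style chromosome name to UCSC style.

    Only handles the simple cases: "MT" becomes "chrM" and everything
    else gets a "chr" prefix unless it already has one.
    """
    if name.startswith("chr"):
        return name
    if name == "MT":
        return "chrM"
    return "chr" + name

def rename_gtf_line(line):
    """Return a GTF line with its chromosome name converted to UCSC style."""
    if line.startswith("#") or not line.strip():
        return line
    seqname, sep, rest = line.partition("\t")
    return to_ucsc_name(seqname) + sep + rest

# Made-up GTF content for illustration.
example_gtf = [
    "#!genome-build GRCh37\n",
    '1\thavana\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972";\n',
    'MT\tensembl\texon\t577\t647\t.\t+\t.\tgene_id "ENSG00000210049";\n',
]
example_names = gtf_seqnames(example_gtf)
renamed = [rename_gtf_line(line) for line in example_gtf]
```

If the names collected from your GTF do not appear in the reference genome, that is the kind of mismatch the error message is pointing at.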

These FAQs may help:

Hope that helps!

Hi, I have a similar problem. I am doing single-cell pre-processing, but I think I uploaded my data the wrong way because it’s not available for this workflow. Can you help me please?

Hi, is this related to STRT-seq data?

Sure, we have talked about it.