Table of Contents
1. Determining the 10x inputs
.. 1. Order of inputs
.. 2. Barcode Size and Chemistry
2. Obtaining annotation data
.. 1. GTF and Whitelist
3. Running the tool
.. 1. Setting Basic Parameters
..... 1. Setting the input pairs
..... 2. (Optional) Multiple input pairs
..... 3. GTF and Whitelist
.. 2. Setting Advanced Parameters
Hi Javi, thanks for sharing your history with me.
1 Determining the 10x inputs
1.1 Order of inputs
The first thing you need to do is determine which of your paired
reads is the cDNA read and which is the barcode read. Typically the
barcode is Read1 and the cDNA is Read2, but we should confirm this:
* By picking two _read1.fastq's at random, we can see that these are
the barcoded reads since most of the sequences end with poly-T tails.
* By picking two _read2.fastq's at random, we can infer that these are
cDNA reads because the sequence is more varied.
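The same spot check can be scripted. Below is a minimal Python sketch, assuming standard four-line FASTQ records in a plain or gzipped file. The function names `read_sequences` and `polyt_fraction` and the 8-base poly-T threshold are illustrative choices, not part of any tool.

```python
import gzip
import itertools

def read_sequences(path, limit=1000):
    """Yield up to `limit` sequence lines from a FASTQ file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # In a four-line FASTQ record the sequence is the 2nd line (index 1).
        seq_lines = itertools.islice(handle, 1, None, 4)
        for line in itertools.islice(seq_lines, limit):
            yield line.strip()

def polyt_fraction(sequences, run_length=8):
    """Fraction of sequences containing a run of at least `run_length` Ts."""
    sequences = list(sequences)
    if not sequences:
        return 0.0
    hits = sum(1 for seq in sequences if "T" * run_length in seq)
    return hits / len(sequences)
```

A file whose `polyt_fraction` is close to 1 is most likely the barcode read; a much lower value suggests the cDNA read.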
1.2 Barcode Size and Chemistry
Next, we need to find out what 10x chemistry we are using, and we can
determine that by looking at the size of our barcodes, as given in the
* By picking a few random reads in a _read1.fastq file, we can count
from the start of a sequence to the first 'TTTT' (allowing for a few
small mismatches), and we see that the barcode size is 26 bp, meaning
this is 10x v2 chemistry (a 16 bp cell barcode plus a 10 bp UMI).
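Counting by eye works for a handful of reads; the sketch below does the same over many reads and takes the most common offset. It is a rough illustration only: it looks for an exact run of four Ts (no mismatch tolerance), and the names `barcode_length`, `guess_chemistry` and `CHEMISTRY_BY_LENGTH` are hypothetical. Reads whose barcode happens to contain 'TTTT' give a shorter offset, which is why the most frequent value is used.

```python
from collections import Counter

# Cell barcode + UMI lengths for 10x chemistries: v2 = 16 + 10, v3 = 16 + 12.
CHEMISTRY_BY_LENGTH = {26: "v2", 28: "v3"}

def barcode_length(seq, min_run=4):
    """Return the offset of the first run of `min_run` Ts, or None if absent."""
    pos = seq.find("T" * min_run)
    return pos if pos >= 0 else None

def guess_chemistry(sequences):
    """Take the most common offset across reads and map it to a chemistry."""
    lengths = Counter(
        n for n in (barcode_length(s) for s in sequences) if n is not None
    )
    if not lengths:
        return None, None
    length, _ = lengths.most_common(1)[0]
    return length, CHEMISTRY_BY_LENGTH.get(length)
```

The function returns both the offset and the chemistry label, so an unexpected length (one that maps to `None`) is easy to spot.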
2 Obtaining annotation data
To run the input FASTQ data on STARsolo, we need the whitelist of cell
barcodes and the GTF file with the gene annotations.
2.1 GTF and Whitelist
From the [Zenodo link]:
* Copy the URLs of the *3M-february-2018.txt.gz* whitelist and the GTF file
* Paste them into Galaxy using the upload data dialog
* Wait for them to be imported into your history
[Zenodo link] <https://zenodo.org/record/3457880>
3 Running the tool
Finally we can run the tool using all the inputs and additional files
we have obtained.
3.1 Setting Basic Parameters
3.1.1 Setting the input pairs
Since your FASTQ files are individual files and not collections, we
set the *Input Type* to "Single files". Then, for each Input Pair, we set:
* *Barcode reads* to a <prefix>_read1.fastq.gz file
* *cDNA reads* to a <prefix>_read2.fastq.gz file
where the <prefix> should be identical for each pair of inputs.
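If you have many files, it can help to check the pairing before filling in the form. The sketch below groups file names by their shared prefix and complains about any prefix that lacks one of the two reads. It assumes the `_read1.fastq.gz` / `_read2.fastq.gz` naming used above; `pair_inputs` and `SUFFIXES` are hypothetical names.

```python
from collections import defaultdict

# Map each file-name suffix to the tool field it belongs in.
SUFFIXES = {"_read1.fastq.gz": "barcode", "_read2.fastq.gz": "cdna"}

def pair_inputs(filenames):
    """Group file names by prefix into {prefix: {"barcode": ..., "cdna": ...}}."""
    pairs = defaultdict(dict)
    for name in filenames:
        for suffix, role in SUFFIXES.items():
            if name.endswith(suffix):
                pairs[name[: -len(suffix)]][role] = name
    # Report prefixes that lack one of the two reads.
    incomplete = sorted(p for p, roles in pairs.items() if len(roles) != 2)
    if incomplete:
        raise ValueError(f"Unpaired inputs for prefix(es): {', '.join(incomplete)}")
    return dict(pairs)
```

Each entry in the result corresponds to one Input Pair: the "barcode" file goes into *Barcode reads* and the "cdna" file into *cDNA reads*.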
3.1.2 (Optional) Multiple input pairs
If you have multiple input pairs, you can insert them all at once by
clicking on the *+ Insert Input Pairs* button and repeating the above
for a different <prefix>. This approach comes with some trade-offs:
* Pro - Running multiple input pairs at once gives you one large
all-inclusive count matrix at the end of the run.
* Con - If just *one* of your inputs is malformed, then the entire
tool fails and it will be hard to know which dataset was responsible.
Normally one should first try setting all inputs at once and, if that
run fails, fall back to running each input pair separately.
3.1.3 GTF and Whitelist
Now we need to use our additional files. For *RNA-Seq Cell Barcode
Whitelist* set this to your "3M-february-2018" dataset.
Set the *Custom or built-in reference genome* to "Use a built-in
index", and set the *Reference genome with or without an annotation*
to "use genome reference without builtin gene-model".
For *Select reference genome*, set this to "Human Dec. .... hg19", since
this is the version of our GTF file.
Then for *Gene model (gff3,gtf) file for splice junctions* set this to
our GTF file.
3.2 Setting Advanced Parameters
Now, under *Configure Chemistry Options*, set this to "Cell Ranger v2",
as we determined from our barcode size before.
Finally we need to tell RNA STARsolo that the barcode size is variable
due to all the poly-T bases in the barcode sequence. If you scroll
down in the RNA STARsolo tool there is an option *Barcode Size is same
size of the Read*, which is set to Yes by default. Set this to "No".
After that hit execute, and your jobs should commence!
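For reference, the settings above correspond roughly to the following STARsolo options. This is a sketch under stated assumptions: the exact command the Galaxy wrapper builds is not shown in this document, the function name `starsolo_command` is hypothetical, and the file and directory names passed in are placeholders.

```python
def starsolo_command(genome_dir, gtf, whitelist, pairs):
    """Build a STAR argument list; `pairs` is a list of (barcode, cdna) file names."""
    barcode_files = ",".join(barcode for barcode, _ in pairs)
    cdna_files = ",".join(cdna for _, cdna in pairs)
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--sjdbGTFfile", gtf,
        # STARsolo expects the cDNA read first and the barcode read second.
        "--readFilesIn", cdna_files, barcode_files,
        "--readFilesCommand", "zcat",
        "--soloType", "CB_UMI_Simple",
        "--soloCBwhitelist", whitelist,
        # 10x v2 layout: 16 bp cell barcode followed by a 10 bp UMI.
        "--soloCBstart", "1", "--soloCBlen", "16",
        "--soloUMIstart", "17", "--soloUMIlen", "10",
        # 0 disables the check that the barcode read is exactly CB + UMI long.
        "--soloBarcodeReadLength", "0",
    ]
```

Setting `--soloBarcodeReadLength` to 0 mirrors choosing "No" for *Barcode Size is same size of the Read*. Multiple input pairs are passed as comma-separated lists, with the cDNA and barcode files listed in the same order.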