RNA STARsolo parameters for 10x 5' data

Hello

I’m trying to align FASTQ files that were generated using the 10x 5’ protocol with RNA STARsolo.

Data was obtained from here: https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-8474

I’ve been trying for days to work out what parameters to put in STARsolo in Galaxy to align these. The other 3 patients in the study were 10x v2 and gave me no issues!

Here’s a link to my Galaxy history, if you would like to have a look at the reads and/or play with the files.

Also, do I need a barcode whitelist for this, or can STARsolo work without one as long as I tell it how long the barcode and UMI are (16 bp and 10 bp respectively)?

Thanks in advance for any help you can offer me.

Welcome, @miRlyKayleigh

I’m not sure I see what the problem is. Did you share the history with the error? Could you explain a bit more about what is going wrong, or what you think needs to be changed? All of the reference data looks good and the other inputs seem to be in the right places.

If you are not getting what you consider enough hits, you could look at the alignment parameters. For example: these reads appear to be shorter than 100 bases once the 5’ trimmed region is removed, so that could impact the value used for “length around splice junctions” (which seems to be at the default currently).
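In case it helps to see the same knobs outside of Galaxy, here is a rough sketch of an equivalent command-line STARsolo call for 10x 5’ chemistry. The whitelist file, read order, and file names are assumptions based on the standard 5’ kit (16 bp CB + 10 bp UMI at the start of R1), so double-check them against your data:

```bash
# Sketch only -- mirrors the Galaxy form fields, assuming standard 10x 5' geometry.
# File names are placeholders. Note the read order: cDNA read (R2) first,
# then the barcode read (R1).
STAR --runMode alignReads \
     --genomeDir ./star_index \
     --readFilesIn Sample_R2.fastq.gz Sample_R1.fastq.gz \
     --readFilesCommand zcat \
     --soloType CB_UMI_Simple \
     --soloCBstart 1 --soloCBlen 16 \
     --soloUMIstart 17 --soloUMIlen 10 \
     --soloCBwhitelist 737K-august-2016.txt \
     --soloStrand Reverse
# On your whitelist question: --soloCBwhitelist None is accepted when you only
# know the CB/UMI lengths, but the real 10x v2 list (737K-august-2016.txt, the
# same list for the 3' v2 and 5' kits) gives more reliable cell calling.
# --soloStrand Reverse because in the 5' assay the cDNA read is antisense.
```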

Thanks for sharing the history! Maybe you worked out what to try between the time asking the question and now?

Hello,

Sorry, the history with the problems is here:

history 1

Patients #417 and #390c are the ones affected. Caveat: these files look OK and the jobs went green. However, downstream, when I clustered the data, the gene expression doesn’t look anything like the other 3 patients’.

history 2

Can you see what I mean in plots 148 and 149 of the above history? The cell numbers are also lower than I would expect, and I know this data contains B and T cells because I clustered it in matrix form before going back to the raw data (the matrix has certain genes missing, as if it has already been processed).

Does this make sense? I thought everything was working, and it looked like it had, but something is clearly wrong here.

I’ve run a couple of the read files from the first history I linked to through MultiQC, but I’m not experienced enough to know if the QC looks normal or not.

Could you suggest a value for “length around splice junctions” that might work? Sorry for so many questions :')

Hi @miRlyKayleigh

I don’t think the histories are fully shared this time, could you check? When working on the EU server, some settings are more private by default! Try toggling the share slider off and on, and look for any options that apply to the datasets.

For this part:

See the help text under the option. The read length to subtract 1 from is the length of the reads after any trimming is applied. The tool builds an index around each of the splice junctions: it extends the sequence in either direction far enough that a read can map across the junction wherever the junction falls within the read. Setting this too short matters more than setting it too long. Tuning it to fit the data was just one example of the input not matching an option, and the first one I noticed; using all defaults is unlikely to give the best results with many tools, and other parameters can matter more.
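If it helps to pin this to the underlying tool: in command-line STAR, “length around splice junctions” corresponds to --sjdbOverhang, set when the genome index is built. A minimal sketch with placeholder file names (the value shown is illustrative, not a recommendation):

```bash
# "Length around splice junctions" == STAR's --sjdbOverhang, applied at the
# index-building step. Rule of thumb: (post-trim read length) - 1.
STAR --runMode genomeGenerate \
     --genomeDir ./star_index \
     --genomeFastaFiles genome.fa \
     --sjdbGTFfile annotation.gtf \
     --sjdbOverhang 99
# 99 = 100 bp reads minus 1, before accounting for clipping; after clipping the
# barcode region, the effective cDNA length (and so this value) would be smaller.
```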

The best advice for tuning the settings is to try to find a publication that uses data like yours, and this same tool, and then try out what they did as a starting place.

This is one of the files we can’t see.

It is entirely possible that some data was mixed up, of course, even with the tagging.

Try this: extract a workflow from your history, then try to run it (send the results to a new history to keep things organized). If it won’t run at all, go into the workflow editor and see if you can visually spot where things may have been mixed up. It’s a way to graph what you did, and also a way to make a small change and re-run without all the tedious clicking.

Thanks for your reply. I’ve set the histories to totally public now.

What length does it trim? I notice that the first patient’s reads are 100 bp long and the second’s are 91 bp, but I’m trying to work out how much of that gets trimmed off. Apparently not clipping at all isn’t a valid option, because that errors out and tells me to select a clip type.

You can estimate this by adding up the lengths of your barcodes and anything else you are trimming off (i.e. what is supplied in the other inputs).
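As a worked example, assuming the standard 10x 5’ read 1 layout of 16 bp cell barcode + 10 bp UMI + 13 bp template-switch oligo (worth confirming against the kit used for this study):

```bash
# Assumed 10x 5' R1 layout: 16 bp CB + 10 bp UMI + 13 bp TSO = 39 bp to clip.
# Remaining cDNA per patient:
#   100 bp reads: 100 - 39 = 61 bp
#    91 bp reads:  91 - 39 = 52 bp
# On the command line this clipping would be expressed as, e.g.:
#   --clip5pNbases 39 0    # clip 39 bases from the 5' end of mate 1 only
```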

For a review of how this works, these are really nice short resources → GTN Materials Search (query=barcodes)

So, review for mix-ups; then, if you need to rerun, you can adjust parameters in your workflow, rerun, and send the results to a new history. You could do this a few times and compare the summaries. You’ll see this in some publications that compare methods/tools: they’ll do a matrix of runs. Even outside of publications, it is pretty common just for analysis/scientific reasons.
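If you do end up comparing a few parameter values, outside of Galaxy that kind of sweep looks roughly like this (values and paths are placeholders):

```bash
# Sketch: build one index per candidate overhang, map with each, then compare
# the run summaries (Log.final.out and Solo.out/Gene/Summary.csv) side by side.
for ovh in 51 60 99; do
  STAR --runMode genomeGenerate \
       --genomeDir "index_ovh${ovh}" \
       --genomeFastaFiles genome.fa \
       --sjdbGTFfile annotation.gtf \
       --sjdbOverhang "${ovh}"
done
```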