Hi all,
I have been trying to run a pipeline recently. Unfortunately, I don’t know the strandedness of the samples (fastq files) I have. Is there any way to check this? I already tried converting my fastqs to BAMs with Bowtie and then using Infer Experiment to check. Is there any other way?
Hi, I am new to this area too! I have been reading about the same thing. I did an RNA-seq experiment using the Illumina TruSeq Stranded mRNA kit. As part of this protocol, strand specificity is achieved by replacing dTTP with dUTP in the SMM (Second Strand Marking Mix), followed by second strand cDNA synthesis using DNA Polymerase I and RNase H. The incorporation of dUTP in second strand synthesis quenches the second strand during amplification. I think that means only the first strand cDNA, which is complementary to the original RNA, is carried forward, so read 1 comes from the strand opposite the transcript and the library is what most tools call reverse stranded rather than forward stranded. Hope this helps or that I can be corrected!
Alysha
To check actual strandedness, as a QA/QC step for data with unknown protocols, or for your own data with a known protocol (to confirm it), try this:
Map the reads to the reference genome as unstranded
Run the tool Infer Experiment on the resulting bam
Note this will require a bed dataset with 12 columns
Depending on how you obtained/created the bed data, it may have the datatype bed or bed12 assigned. Either is OK, but check to make sure the data really has 12 columns
UCSC’s Table Browser is one source for complete bed data (tool: Get Data > UCSC main).
If your genome is not supported by UCSC, or you want to base the gene model on some other data provider’s genome annotation, you can convert any gtf dataset to a bed12 dataset (tool: Convert GTF to BED12)
Do not obtain gtf data from the UCSC Table Browser for most purposes
UCSC has a gtf appropriate for RNA-seq tools for selected genomes in their Downloads area
Map the reads again to your reference genome setting the strand correctly based on the results of Infer Experiment. Use this bam result for analysis.
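As a rough sketch of how the Infer Experiment report could be turned into a strand call before the second mapping run: the code below assumes the plain-text report layout that RSeQC's infer_experiment.py prints (lines such as `Fraction of reads explained by "++,--": 0.9669`). The function name `classify_strandedness`, the 0.8 cutoff, and the numbers in the example report are illustrative assumptions, not tool defaults or real results.

```python
import re

# Matches report lines such as: Fraction of reads explained by "++,--": 0.9669
_FRACTION_LINE = re.compile(r'explained by "([^"]+)":\s*([0-9.]+)')

# Patterns where the read (single-end) or read 1 (paired-end) is on the gene's strand.
_SENSE_PATTERNS = {"++,--", "1++,1--,2+-,2-+"}


def classify_strandedness(report_text, cutoff=0.8):
    """Return 'forward', 'reverse', or 'unstranded' for an Infer Experiment report.

    cutoff is an assumed, illustrative threshold. Anything that does not reach it,
    including a roughly even split, is reported as 'unstranded'.
    """
    sense = antisense = 0.0
    for pattern, value in _FRACTION_LINE.findall(report_text):
        if pattern in _SENSE_PATTERNS:
            sense = float(value)
        else:
            antisense = float(value)
    if sense >= cutoff:
        return "forward"
    if antisense >= cutoff:
        return "reverse"
    return "unstranded"


# Made-up numbers, only to show the expected layout of the report.
example_report = '''This is SingleEnd Data
Fraction of reads failed to determine: 0.0170
Fraction of reads explained by "++,--": 0.0161
Fraction of reads explained by "+-,-+": 0.9669
'''
print(classify_strandedness(example_report))
```

For the made-up report above this prints `reverse`. A result where the two fractions are close to each other is best treated as unstranded, or checked by eye before trusting any cutoff.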
More Help:
UCSC Table Browser options to get a bed with 12 columns
Set the “clade + genome + assembly”
Pick a track from the “group” Gene and Gene Predictions
Set “region” = genome
Set “output format” = bed and check the box to send the output to Galaxy
Submit the query by clicking on the button “get output”
Not done yet! There will be another sub-form presented next to specify more details…
Choose “Create one BED record per” > “Whole Gene” to output a bed with 12 columns
Finish by clicking on the button for “get bed”
If you are logged into a public Galaxy server known by UCSC (usegalaxy.org, usegalaxy.eu, and usegalaxy.org.au are “known”), the output will be sent to your active History.
If working somewhere else, or if you want a downloaded copy, you can use the UCSC Table Browser directly and leave the box to send the output to Galaxy unchecked. Once the file is downloaded, use the Upload tool to get it into Galaxy.
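If you downloaded the bed file, a quick local check that it really has 12 columns can save a failed Infer Experiment run. This is a minimal sketch, assuming a tab-separated file with optional `track`, `browser`, or `#` header lines; the function names and the file name `genes.bed` in the usage note are hypothetical.

```python
def count_bed_columns(path):
    """Return the set of column counts seen across the data lines of a BED file."""
    counts = set()
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            # Skip blank lines and the optional header lines a BED file may carry.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            counts.add(len(line.split("\t")))
    return counts


def is_bed12(path):
    """True only if every data line has exactly 12 tab-separated columns."""
    return count_bed_columns(path) == {12}
```

Calling `is_bed12("genes.bed")` returns False for an empty file or for any file with mixed column counts; `count_bed_columns` shows which counts were actually found.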
Generating and Interpreting the results from `Infer Experiment`
See these “Galaxy Training Network” (GTN) tutorials:
A search with the keyword “gtf” will find more topics, too.
Converting Fastq > BAM produces a bam dataset without any mapping information. If you actually mapped already with Bowtie (this wasn’t clear to me!), maybe there was some other problem. RNA-seq data might not map well enough with a DNA mapping tool – use an RNA mapping tool like HISAT2 instead to see if that helps.
Try the method above to produce a bam with mapping results for QA/QC purposes, run Infer Experiment on that, interpret the results, then run the analysis mapping with strand settings that match your data.
Hello,
HISAT2 has three options under “specify strand information” with the single-end library:
Unstranded
Forward (F)
Reverse (R)
However, the strandedness and software settings table in the tutorial “Reference-based RNA-Seq data analysis” does not list the same parameters as above.
For example, after I run Infer Experiment, the result (+-,-+) refers me to “First Strand R/RF” for running HISAT2. So, in that case, am I supposed to select “Reverse (R)”?
Yes, from what you explain, “Reverse (R)” is probably correct for the HISAT2 choices. You could try it and compare the results to confirm.
From the HISAT2 help for --rna-strandness: ‘F’ means a read corresponds to a transcript. ‘R’ means a read corresponds to the reverse complemented counterpart of a transcript. With this option being used, every read alignment will have an XS attribute tag: ‘+’ means a read belongs to a transcript on ‘+’ strand of genome. ‘-’ means a read belongs to a transcript on ‘-’ strand of genome.
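To make the correspondence between the two tools explicit, here is a small lookup sketch. The keys are the strand patterns as Infer Experiment prints them and the values are the matching HISAT2 --rna-strandness settings (F/R for single-end, FR/RF for paired-end). The names `HISAT2_STRANDNESS` and `hisat2_strand_args` are made up for this example.

```python
# Infer Experiment strand pattern -> HISAT2 --rna-strandness value.
HISAT2_STRANDNESS = {
    "++,--": "F",             # single-end, read is sense to the transcript
    "+-,-+": "R",             # single-end, read is antisense to the transcript
    "1++,1--,2+-,2-+": "FR",  # paired-end, read 1 is sense
    "1+-,1-+,2++,2--": "RF",  # paired-end, read 1 is antisense
}


def hisat2_strand_args(dominant_pattern):
    """Return the extra HISAT2 arguments for the dominant strand pattern.

    Pass None when neither pattern dominates: an unstranded library needs no flag.
    """
    if dominant_pattern is None:
        return []
    return ["--rna-strandness", HISAT2_STRANDNESS[dominant_pattern]]


print(hisat2_strand_args("+-,-+"))
```

For the single-end result discussed above (+-,-+) this prints `['--rna-strandness', 'R']`, which matches the “Reverse (R)” choice in the Galaxy tool form.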