I have a question regarding the “CLIP-Seq data analysis from pre-processing to motif detection” training. I understand the reasoning behind switching the R2 mates with the R1 mates in the “umi-tools extract” and “STAR” steps. However, I do not understand why Van Nostrand et al. 2016 do not switch R2 with R1 as well. Even the processing pipeline associated with the libraries I am interested in (https://www.encodeproject.org/documents/739ca190-8d43-4a68-90ce-1a0ddfffc6fd/@@download/attachment/eCLIP_analysisSOP_v2.2.pdf) does not switch R1 and R2, although it is very similar to the training pipeline. I hope I have been clear enough.
Thanks in advance. I also want to congratulate you on the amazing job you are doing!
Neither the analysis in Van Nostrand et al. 2016 nor the one in the Galaxy training material is wrong. Two points:
(a) They keep only the second read of each pair and do the peak calling with the single-read peak caller Clipper. This is the step
samtools view: Takes output from sortSam. Only outputs the second read in each pair for use with single stranded peak caller. This is the final bam file to perform analysis on.
in their analysis protocol. In the Galaxy training material we use a paired-end peak caller, which takes both reads into consideration.
(b) If you kept R2 as the reverse read in STAR, your peaks would appear in reverse orientation when you inspect them in a genome browser. This is not incorrect, but for a clearer demonstration and to avoid confusion about the peak orientation it is better to switch R1 and R2 in STAR.
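To make the samtools view step quoted in point (a) more concrete: in the SAM format, the flag bit 0x80 (decimal 128) marks the second mate of a pair, so keeping only R2 amounts to keeping alignments with that bit set. As far as I know this corresponds to `samtools view -f 128`, but please check the exact command in the SOP. Below is a minimal Python sketch of the same idea on plain SAM text lines; the function names are made up for illustration and are not part of any pipeline.

```python
# SAM flag bits as defined in the SAM specification
FIRST_IN_PAIR = 0x40   # read is the first mate (R1)
SECOND_IN_PAIR = 0x80  # read is the second mate (R2)


def is_second_in_pair(flag: int) -> bool:
    """Return True if the SAM flag marks the read as the second mate."""
    return bool(flag & SECOND_IN_PAIR)


def keep_second_mates(sam_lines):
    """Yield header lines and those alignment lines whose flag has bit 0x80 set.

    Hypothetical helper for illustration; it expects uncompressed,
    tab-separated SAM text lines, not BAM.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.split("\t")
        if is_second_in_pair(int(fields[1])):  # column 2 is the FLAG
            yield line
```

For example, a flag of 163 (1 + 2 + 32 + 128) is a second mate and would be kept, while a flag of 83 (1 + 2 + 16 + 64) is a first mate and would be dropped.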
The training material is a bit outdated, and a few steps can be done differently nowadays. For example, Cutadapt supports adapter trimming for both reads R1 and R2, so a double call is no longer necessary. However, the double call still helps somewhat with double ligation events, which were still a problem for eCLIP at that time.
The group behind Van Nostrand et al. 2016 frequently updated their analysis protocol, so some steps have changed over time compared with the description in the paper.
I hope this helps.
Thank you for your answer. I have another question regarding CLIP-seq analyses. I am dealing with a single-end library. The authors report that the Rand103Tr3 sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC) is at the 3’ end of the reads and that there is a 6 nt UMI. Should I assume that the UMI is at the 3’ end of the reads and not at the 5’ end? It would be a little strange, but as far as I know the UMI is located right before the Rand103Tr3 sequence. I am referring to this paper in particular (A novel class of microRNA-recognition elements that function only within open reading frames | Nature Structural & Molecular Biology). I hope my question makes sense.
Thank you in advance
I have not read the paper. Making assumptions is never a good idea in data analysis because it leads to errors. To me, it seems a bit unlikely that the UMI is at the 3’ end of a single-end read, because the sequencing error rate is highest at the 3’ end.
Maybe contact the authors to clarify the read library design. I would probably also try to locate the adapter sequence in the reads to get an idea of where the UMI might be.
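As a rough illustration of what I mean by locating the adapter, here is a minimal Python sketch. It searches each read for the first bases of the Rand103Tr3 sequence you quoted (exact match only, no mismatches), counts where the adapter starts, and collects the 6 bases directly in front of it. The function names, the 12-base seed length and the toy reads are my own assumptions for illustration, not something taken from the paper.

```python
from collections import Counter

# Adapter sequence as quoted in the question above
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def adapter_start_positions(reads, adapter=ADAPTER, seed_len=12):
    """Count the 0-based start position of the adapter seed in each read.

    Only an exact match of the first `seed_len` adapter bases is searched;
    reads without a match are counted under position -1.
    """
    seed = adapter[:seed_len]
    return Counter(read.find(seed) for read in reads)


def bases_before_adapter(reads, n=6, adapter=ADAPTER, seed_len=12):
    """Collect the n bases directly upstream of the adapter seed.

    Reads without the seed, or with fewer than n bases before it, are skipped.
    """
    seed = adapter[:seed_len]
    upstream = []
    for read in reads:
        pos = read.find(seed)
        if pos >= n:
            upstream.append(read[pos - n:pos])
    return upstream


if __name__ == "__main__":
    # Toy reads, invented purely to show the calls; use real reads from the FASTQ.
    toy_reads = [
        "ACGTACGTAC" + "TTGACA" + ADAPTER[:20],
        "GGGGGGGGGGGGGGGGGGGG",
    ]
    print(adapter_start_positions(toy_reads))
    print(bases_before_adapter(toy_reads))
```

If the 6 bases in front of the adapter look random across many reads, that would be consistent with a UMI sitting right before the adapter, but it is only a hint and does not replace confirmation from the authors.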