Question about genome-wide alternative splicing analysis with my own samples

Hi, I have to analyse alternative splicing in my own samples.
I have followed the Genome-wide alternative splicing analysis tutorial.
However, I’m struggling to continue with my own samples:

  • At the start of the tutorial, there are some “additional datasets”
    (Genome-wide alternative splicing analysis).
    Are these additional datasets specific to the tutorial (e.g. the file names specifically mention chr 5)? If so, do I need similar additional datasets for my own samples?
  • The tutorial starts from .fastq files. Our samples were delivered as .bam and .bai files, already mapped to hg38. Can I continue with our .bam files, or do we really need to start from the .fastq files?

Thanks so much for any help!

Hello @jnguyen1

If you are running this with your own samples, then yes, you’ll need to provide reference data. The genome can be either a reference genome already indexed on the server where you are working, or a custom reference genome in fasta format that you upload. The reference annotation is something you will always provide yourself. In either case, both must be based on exactly the same genome assembly, because the tool matches up the identifiers between the files (chromosome names) and interprets genomic coordinates (locations in the fasta).
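As a rough illustration of the identifier matching described above, here is a minimal Python sketch that collects the sequence names from a fasta file and from the first column of a GTF file and reports which names are shared. The function names are made up for this example, and this is not how any particular Galaxy tool performs the check; it only shows the idea.

```python
from typing import Dict, Iterable, List, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence identifiers: the text after '>' up to the first whitespace."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(lines: Iterable[str]) -> Set[str]:
    """Collect the values of column 1 (sequence name) from GTF lines, skipping comments."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(genome: Set[str], annotation: Set[str]) -> Dict[str, List[str]]:
    """Report which identifiers are shared and which appear in only one file."""
    return {
        "shared": sorted(genome & annotation),
        "only_in_genome": sorted(genome - annotation),
        "only_in_annotation": sorted(annotation - genome),
    }
```

Both reader functions accept any iterable of lines, so an open file handle works, for example `compare_names(fasta_names(open("genome.fa")), gtf_names(open("annotation.gtf")))`, where both file names are placeholders. An empty "shared" list usually means the two files use different naming schemes.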

If you already have the mapped reads, yes, you can use them. But keep in mind the first point above: all the data must be based on the same reference data. Whatever genome you mapped your reads against will need to be supplied to downstream tools. If any conflicts are introduced … well … that is what most of the discussion at this forum is about resolving :)
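One way to see which genome a BAM file was mapped against is to look at the @SQ lines in its header (for example the text printed by `samtools view -H`), which list each sequence name (SN) and length (LN). The sketch below, with hypothetical function names, compares those against the first two columns of a fasta index (.fai) file for the reference you plan to supply. Matching names with different lengths would suggest a different assembly.

```python
from typing import Dict, Iterable, List


def bam_header_lengths(header_lines: Iterable[str]) -> Dict[str, int]:
    """Parse @SQ lines of a SAM/BAM text header into {sequence name: length}."""
    lengths = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        if "SN" in tags and "LN" in tags:
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def fai_lengths(fai_lines: Iterable[str]) -> Dict[str, int]:
    """Parse a .fai index: column 1 is the sequence name, column 2 its length."""
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def find_conflicts(bam: Dict[str, int], ref: Dict[str, int]) -> List[str]:
    """List BAM sequences that are missing from, or differ in length from, the reference."""
    problems = []
    for name, length in sorted(bam.items()):
        if name not in ref:
            problems.append(f"{name}: in BAM header but not in reference")
        elif ref[name] != length:
            problems.append(f"{name}: length {length} in BAM vs {ref[name]} in reference")
    return problems
```

An empty list from `find_conflicts` means every sequence in the BAM header is present in the reference with the same length. It does not prove the sequences themselves are identical, so treat it as a first check only.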

If you map in Galaxy, then all of the reference data is already in Galaxy for reuse. That can make the analysis easier but isn’t required. Just check yourself to make sure that you are using the same reference data for all steps, including upstream steps run outside of Galaxy.

You can search this forum with keywords to find help with reference data in many contexts. If you get stuck, we can follow up. Share the first few lines of each reference file, or better, put those in a shared history first, then screenshot or explain what you are having trouble with and we can try to help more.

The version of hg38 natively indexed at the public Galaxy servers was sourced from UCSC. You could compare your version against that one to decide whether any adjustments are needed. UCSC also hosts reference annotation (GTF) for that genome, and so does Gencode. See: UCSC Genome Browser Downloads

Hi @jennaj

Thanks so much for your reply!

Great to hear that I can use the .bam files. Sorry for the many questions, but if I’m using the .bam files, where should I start in the tutorial? Would that be from the start as well, or perhaps from the transcriptome assembly part?

I believe none of our samples were mapped in Galaxy; they were mapped with CLCBio. I received a gtf and a bed12 file (HG38_dec2013_GTF and HG38_dec2013_bed12). Are those the reference data that I should upload as well? Would I need any other file?

Sorry for the many questions and thanks so much for helping me out!

Hi @jnguyen1

Yes, load up the data you have, then double-check that the reference genome underlying your files is exactly the same as the UCSC hg38 assembly. Methods for the comparison, and potential adjustments if needed, are here: Mismatched Chromosome identifiers and how to avoid them
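To give a feel for what such an adjustment can involve, here is a small Python sketch that proposes a renaming when one file uses names like "1" or "MT" and the target uses UCSC-style names like "chr1" or "chrM". Whether your files actually differ in this way is an assumption to verify first, and the function name is invented for this example. The FAQ above remains the reference for the actual methods.

```python
from typing import Dict, Iterable, List, Tuple


def propose_renames(
    own_names: Iterable[str], target_names: Iterable[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Suggest own-name -> target-name pairs; return names with no obvious match separately."""
    target = set(target_names)
    mapping: Dict[str, str] = {}
    unresolved: List[str] = []
    for name in own_names:
        if name in target:
            continue  # already matches, nothing to rename
        # Assumed convention: add a "chr" prefix, and treat "MT" as "chrM".
        candidate = "chrM" if name == "MT" else "chr" + name
        if candidate in target:
            mapping[name] = candidate
        else:
            unresolved.append(name)
    return mapping, unresolved
```

Names that end up in the unresolved list (often unplaced scaffolds or alternate haplotypes) need a manual decision. A renaming is only safe when the underlying sequences are the same assembly, so check sequence lengths as well as names before applying one.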

To reproduce the protocol, you’ll probably want to do all the same steps, except for the read QA and mapping parts at the start since those are already done.