Hi,
I am trying to retrieve the genomic coordinates of proteins identified by mass spectrometry. The protein database used to identify them was built as follows:
1. Align RNA-Seq paired-end read files to the human reference genome (input = 43 x 2 paired-end read files; output = 43 x BAM files).
2. Assemble transcripts with StringTie, guided by the Ensembl GTF (output = 43 x GTF files).
3. Merge all 43 GTFs into a single GTF file (output = 1 x merged.gtf file).
4. Extract the transcript sequences in merged.gtf using the human reference genome (output = 1 x merged.fasta file).
5. Translate the sequences in 3 frames (output = 1 x 3-frame-translated.fasta).
6. Split the translated sequences at every stop codon (output = 1 x protein_db.fasta).
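To make steps 5 and 6 concrete, here is a minimal Python sketch of a 3-frame translation followed by splitting at stop codons. It is only an illustration under my own assumptions, not the exact tool I ran: the function names are hypothetical, it uses the standard genetic code, translates the forward strand of each transcript only, and writes any codon containing an ambiguous base as X. The point is that each piece keeps the nucleotide offset at which it starts within the transcript, which is the information needed later to get back to the genome.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame):
    """Translate seq starting at 0-based offset `frame` (0, 1 or 2)."""
    seq = seq.upper()
    # Codons with ambiguous bases (e.g. N) are written as X.
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def split_at_stops(protein, frame, min_len=1):
    """Split a translated frame at '*' and return (peptide, nt_start) pairs.

    nt_start is the 0-based transcript offset of the peptide's first codon.
    """
    pieces = []
    aa_start = 0
    for chunk in protein.split("*"):
        if len(chunk) >= min_len:
            pieces.append((chunk, frame + 3 * aa_start))
        aa_start += len(chunk) + 1  # +1 skips the stop codon itself
    return pieces


def three_frame_pieces(seq, min_len=1):
    """Run all three forward frames and collect the stop-delimited pieces."""
    out = []
    for frame in range(3):
        out.extend(split_at_stops(translate(seq, frame), frame, min_len))
    return out


if __name__ == "__main__":
    for peptide, nt_start in three_frame_pieces("ATGGCCTAAATGAAA"):
        print(nt_start, peptide)
```

If the transcript ID, frame and offset were written into each FASTA header of protein_db.fasta, a peptide could be traced back to its transcript without re-aligning anything.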
I would now like to map the proteins from step 6 back to their genomic coordinates. I tried to follow this Proteogenomics tutorial, starting from the “Transcript Assembly” section. At the “Evaluate the assembly with annotated transcripts” section, I input the merged.gtf from step 3 above. I then followed this with the “Translate transcripts” section, which takes merged.gtf and Homo_sapiens.GRCh38.dna.primary_assembly.2bit as input. Next, I tried to follow the instructions in the “Creating FASTA Databases” section, but I cannot do so because I do not have a genomic_mapping.sqlite database to use as input.
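To clarify what I mean by mapping back, the conversion I am after looks roughly like the sketch below. It assumes that each peptide's position within its transcript is known as a 0-based, half-open nucleotide interval, that the exon coordinates come from merged.gtf (1-based, inclusive), and that transcripts on the minus strand were extracted as reverse complements, so transcript offset 0 is the highest genomic position. The function name is hypothetical and this is not taken from the tutorial.

```python
def transcript_to_genome(exons, strand, t_start, t_end):
    """Map a transcript interval to genomic blocks.

    exons:   list of (start, end) tuples, 1-based inclusive, as in a GTF.
    strand:  "+" or "-".
    t_start, t_end: 0-based half-open interval in transcript coordinates.
    Returns a list of (start, end) genomic blocks, 1-based inclusive,
    sorted by ascending genomic position.
    """
    # Walk the exons in transcription order (5' to 3').
    ordered = sorted(exons, reverse=(strand == "-"))
    blocks = []
    offset = 0  # transcript offset of the first base of the current exon
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start + 1
        # Overlap between the requested interval and this exon.
        lo = max(t_start, offset)
        hi = min(t_end, offset + ex_len)
        if lo < hi:
            if strand == "+":
                g_start = ex_start + (lo - offset)
                g_end = ex_start + (hi - offset) - 1
            else:
                g_end = ex_end - (lo - offset)
                g_start = ex_end - (hi - offset) + 1
            blocks.append((g_start, g_end))
        offset += ex_len
    return sorted(blocks)


if __name__ == "__main__":
    exons = [(100, 109), (200, 209)]  # made-up exon coordinates
    # A peptide of 3 residues starting at transcript offset 5 spans
    # nucleotides [5, 5 + 3 * 3) = [5, 14) and crosses the splice junction.
    print(transcript_to_genome(exons, "+", 5, 14))
```

A peptide that starts at nucleotide offset nt_start and has n residues would be passed in as the interval from nt_start to nt_start + 3 * n. Peptides that span a splice junction come back as more than one block.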
Can anyone advise how I might go about solving this problem? I am open to using non-Galaxy tools as well, although I have had much less success with those.