I am currently working on annotating a new genome.
The genome of this crop is extremely large, approximately 15 to 18 Gb.
My approach is to align RNA-seq data from multiple samples to this genome,
then pass the BAM files to StringTie to assemble a GTF file for each sample, and finally use StringTie Merge to combine them into a single comprehensive genome annotation file.
To do this, I have designed the following workflow:
{
"a_galaxy_workflow": "true",
"annotation": "As a basis for this workflow, I used the Galaxy tutorial made by Anthony Bretaudeau.\n\nReferences: - https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/annotation-with-maker/tutorial.html",
"comments": [
{
"child_steps": [
4,
5,
8
],
"color": "black",
"data": {
"title": "Structural Annotation"
},
"id": 1,
"position": [
600,
560
],
"size": [
240,
740
],
"type": "frame"
},
{
"child_steps": [
2,
3
],
"color": "black",
"data": {
"title": "Genome Assembly Quality Analysis"
},
"id": 0,
"position": [
600,
0
],
"size": [
250,
440
],
"type": "frame"
}
],
"creator": [
{
"class": "Person",
"email": "mailto:teixeiratuchinski@gmail.com",
"name": "Giovanna Teixeira Tuchinski"
}
],
"format-version": "0.1",
"license": "MIT",
"name": "Plant Genome Structural Annotation and NLR Annotation",
"report": {
"markdown": "\n# Workflow Execution Report\n\n## Workflow Inputs\n```galaxy\ninvocation_inputs()\n```\n\n## Workflow Outputs\n```galaxy\ninvocation_outputs()\n```\n\n## Workflow\n```galaxy\nworkflow_display()\n```\n"
},
"steps": {
"0": {
"annotation": "Input a genome file in fasta format. \n\nUnless changes are made, for this workflow are best suited genomes from plants genetically similar to Solanum lycopersicum.",
"content_id": null,
"errors": null,
"id": 0,
"input_connections": {},
"inputs": [
{
"description": "Input a genome file in fasta format. \n\nUnless changes are made, for this workflow are best suited genomes from plants genetically similar to Solanum lycopersicum.",
"name": "Input genome"
}
],
"label": "Input genome",
"name": "Input dataset",
"outputs": [],
"position": {
"left": 0,
"top": 960
},
"tool_id": null,
"tool_state": "{\"optional\": false, \"format\": [\"fasta\"], \"tag\": null}",
"tool_version": null,
"type": "data_input",
"uuid": "4a353865-cafd-4c54-be39-2e4d762786c2",
"when": null,
"workflow_outputs": []
},
"1": {
"annotation": "Use the NLR-Annotator tool to predict NLR-associated loci in a plant genome.\n\nSteuernagel, Burkhard, et al. \u201cThe NLR-Annotator Tool Enables Annotation of the Intracellular Immune Receptor Repertoire.\u201d Plant Physiology, vol. 183, no. 2, Oxford University Press, Mar. 2020, pp. 468\u201382, https://doi.org/10.1104/pp.19.01273.",
"content_id": null,
"errors": null,
"id": 1,
"input_connections": {},
"inputs": [
{
"description": "Use the NLR-Annotator tool to predict NLR-associated loci in a plant genome.\n\nSteuernagel, Burkhard, et al. \u201cThe NLR-Annotator Tool Enables Annotation of the Intracellular Immune Receptor Repertoire.\u201d Plant Physiology, vol. 183, no. 2, Oxford University Press, Mar. 2020, pp. 468\u201382, https://doi.org/10.1104/pp.19.01273.",
"name": " NLR-Annotator"
}
],
"label": " NLR-Annotator",
"name": "Input dataset",
"outputs": [],
"position": {
"left": 1000,
"top": 600
},
"tool_id": null,
"tool_state": "{\"optional\": false, \"format\": [\"gff\"], \"tag\": null}",
"tool_version": null,
"type": "data_input",
"uuid": "3ad456db-a124-40b3-b216-a102e2b48b45",
"when": null,
"workflow_outputs": []
},
"2": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/iuc/fasta_stats/fasta-stats/2.0",
"errors": null,
"id": 2,
"input_connections": {
"fasta": {
"id": 0,
"output_name": "output"
}
},
"inputs": [],
"label": null,
"name": "Fasta Statistics",
"outputs": [
{
"name": "stats_output",
"type": "tabular"
}
],
"position": {
"left": 625.3913880237574,
"top": 65.90583574348494
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/fasta_stats/fasta-stats/2.0",
"tool_shed_repository": {
"changeset_revision": "0dbb995c7d35",
"name": "fasta_stats",
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"fasta\": {\"__class__\": \"ConnectedValue\"}, \"gaps_option\": false, \"genome_size\": null, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "2.0",
"type": "tool",
"uuid": "5acf6d10-ec62-46df-8b00-e56d5af7622d",
"when": null,
"workflow_outputs": []
},
"3": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.5.0+galaxy0",
"errors": null,
"id": 3,
"input_connections": {
"input": {
"id": 0,
"output_name": "output"
}
},
"inputs": [],
"label": null,
"name": "Busco",
"outputs": [
{
"name": "busco_sum",
"type": "txt"
},
{
"name": "busco_table",
"type": "tabular"
},
{
"name": "busco_missing",
"type": "tabular"
},
{
"name": "summary_image",
"type": "png"
},
{
"name": "busco_miniprot",
"type": "gff3"
}
],
"position": {
"left": 630,
"top": 180
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.5.0+galaxy0",
"tool_shed_repository": {
"changeset_revision": "ea8146ee148f",
"name": "busco",
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"adv\": {\"evalue\": \"0.001\", \"limit\": \"3\", \"contig_break\": \"10\"}, \"busco_mode\": {\"mode\": \"geno\", \"__current_case__\": 0, \"miniprot\": true, \"use_augustus\": {\"use_augustus_selector\": \"yes\", \"__current_case__\": 1, \"aug_prediction\": {\"augustus_mode\": \"builtin\", \"__current_case__\": 2, \"augustus_species\": \"tomato\"}, \"long\": true}}, \"input\": {\"__class__\": \"ConnectedValue\"}, \"lineage\": {\"lineage_mode\": \"select_lineage\", \"__current_case__\": 1, \"lineage_dataset\": \"eudicots_odb10\"}, \"lineage_conditional\": {\"selector\": \"download\", \"__current_case__\": 1}, \"outputs\": [\"short_summary\", \"missing\", \"image\"], \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "5.5.0+galaxy0",
"type": "tool",
"uuid": "8c3bccc3-711c-47b5-a080-1322487c9c6a",
"when": null,
"workflow_outputs": []
},
"4": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/bgruening/augustus/augustus/3.4.0+galaxy1",
"errors": null,
"id": 4,
"input_connections": {
"input_genome": {
"id": 0,
"output_name": "output"
}
},
"inputs": [],
"label": null,
"name": "Augustus",
"outputs": [
{
"name": "output",
"type": "gtf"
},
{
"name": "protein_output",
"type": "fasta"
},
{
"name": "codingseq_output",
"type": "fasta"
}
],
"position": {
"left": 620,
"top": 610
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/bgruening/augustus/augustus/3.4.0+galaxy1",
"tool_shed_repository": {
"changeset_revision": "28433faa6e42",
"name": "augustus",
"owner": "bgruening",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"genemodel\": \"complete\", \"gff\": true, \"hints\": {\"usehints\": \"F\", \"__current_case__\": 1}, \"input_genome\": {\"__class__\": \"ConnectedValue\"}, \"model\": {\"augustus_mode\": \"builtin\", \"__current_case__\": 1, \"organism\": \"tomato\"}, \"noInFrameStop\": false, \"outputs\": [\"protein\", \"codingseq\", \"introns\", \"start\", \"stop\", \"cds\"], \"range\": {\"userange\": \"F\", \"__current_case__\": 1}, \"singlestrand\": false, \"softmasking\": false, \"strand\": \"both\", \"utr\": false, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "3.4.0+galaxy1",
"type": "tool",
"uuid": "2e533046-439d-4076-99aa-b31fda343d71",
"when": null,
"workflow_outputs": [
{
"label": "annotation",
"output_name": "output",
"uuid": "67f1c1df-af43-4115-af38-bc5f3b68c1ee"
}
]
},
"5": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.4+galaxy0",
"errors": null,
"id": 5,
"input_connections": {
"input": {
"id": 4,
"output_name": "output"
},
"reference_genome|genome_fasta": {
"id": 0,
"output_name": "output"
}
},
"inputs": [
{
"description": "runtime parameter for tool gffread",
"name": "chr_replace"
},
{
"description": "runtime parameter for tool gffread",
"name": "reference_genome"
}
],
"label": null,
"name": "gffread",
"outputs": [
{
"name": "output_exons",
"type": "fasta"
}
],
"position": {
"left": 620,
"top": 810
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.4+galaxy0",
"tool_shed_repository": {
"changeset_revision": "3e436657dcd0",
"name": "gffread",
"owner": "devteam",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"chr_replace\": {\"__class__\": \"RuntimeValue\"}, \"decode_url\": true, \"expose\": true, \"filtering\": null, \"full_gff_attribute_preservation\": true, \"gffs\": {\"gff_fmt\": \"none\", \"__current_case__\": 0}, \"input\": {\"__class__\": \"ConnectedValue\"}, \"maxintron\": null, \"merging\": {\"merge_sel\": \"none\", \"__current_case__\": 0}, \"reference_genome\": {\"source\": \"history\", \"__current_case__\": 2, \"genome_fasta\": {\"__class__\": \"ConnectedValue\"}, \"ref_filtering\": null, \"fa_outputs\": [\"-w exons.fa\"]}, \"region\": {\"region_filter\": \"none\", \"__current_case__\": 0}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "2.2.1.4+galaxy0",
"type": "tool",
"uuid": "0cf62ad6-a114-420f-bbe0-a5dbf5b86e46",
"when": null,
"workflow_outputs": []
},
"6": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_intersectbed/2.31.1+galaxy0",
"errors": null,
"id": 6,
"input_connections": {
"inputA": {
"id": 4,
"output_name": "output"
},
"reduce_or_iterate|inputB": {
"id": 1,
"output_name": "output"
}
},
"inputs": [
{
"description": "runtime parameter for tool bedtools Intersect intervals",
"name": "inputA"
},
{
"description": "runtime parameter for tool bedtools Intersect intervals",
"name": "reduce_or_iterate"
}
],
"label": null,
"name": "bedtools Intersect intervals",
"outputs": [
{
"name": "output",
"type": "input"
}
],
"position": {
"left": 1000,
"top": 690
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_intersectbed/2.31.1+galaxy0",
"tool_shed_repository": {
"changeset_revision": "64e2edfe7a2c",
"name": "bedtools",
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"bed\": false, \"count\": false, \"fraction_cond\": {\"fraction_select\": \"default\", \"__current_case__\": 0}, \"genome_file_opts\": {\"genome_file_opts_selector\": \"loc\", \"__current_case__\": 0, \"genome\": null}, \"header\": false, \"inputA\": {\"__class__\": \"RuntimeValue\"}, \"invert\": false, \"once\": false, \"overlap_mode\": [\"-wa\"], \"reduce_or_iterate\": {\"reduce_or_iterate_selector\": \"iterate\", \"__current_case__\": 0, \"inputB\": {\"__class__\": \"RuntimeValue\"}}, \"sorted\": false, \"split\": false, \"strand\": \"-s\", \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "2.31.1+galaxy0",
"type": "tool",
"uuid": "68ec912f-3eaf-446c-af71-a1a950689c5c",
"when": null,
"workflow_outputs": []
},
"7": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.11+galaxy1",
"errors": null,
"id": 7,
"input_connections": {
"reference_genome|genome": {
"id": 0,
"output_name": "output"
},
"track_groups_0|data_tracks_0|data_format|annotation": {
"id": 4,
"output_name": "output"
}
},
"inputs": [
{
"description": "runtime parameter for tool JBrowse",
"name": "reference_genome"
}
],
"label": null,
"name": "JBrowse",
"outputs": [
{
"name": "output",
"type": "html"
}
],
"position": {
"left": 1000,
"top": 960
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.11+galaxy1",
"tool_shed_repository": {
"changeset_revision": "a6e57ff585c0",
"name": "jbrowse",
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"action\": {\"action_select\": \"create\", \"__current_case__\": 0}, \"gencode\": \"1\", \"jbgen\": {\"defaultLocation\": \"\", \"trackPadding\": \"20\", \"shareLink\": true, \"aboutDescription\": \"\", \"show_tracklist\": true, \"show_nav\": true, \"show_overview\": true, \"show_menu\": true, \"hideGenomeOptions\": false}, \"plugins\": {\"BlastView\": true, \"ComboTrackSelector\": false, \"GCContent\": false}, \"reference_genome\": {\"genome_type_select\": \"history\", \"__current_case__\": 1, \"genome\": {\"__class__\": \"ConnectedValue\"}}, \"standalone\": \"minimal\", \"track_groups\": [{\"__index__\": 0, \"category\": \"Default\", \"data_tracks\": [{\"__index__\": 0, \"data_format\": {\"data_format_select\": \"gene_calls\", \"__current_case__\": 2, \"annotation\": {\"__class__\": \"ConnectedValue\"}, \"match_part\": {\"match_part_select\": false, \"__current_case__\": 1}, \"index\": false, \"track_config\": {\"track_class\": \"NeatHTMLFeatures/View/Track/NeatFeatures\", \"__current_case__\": 3, \"html_options\": {\"topLevelFeatures\": null}}, \"jbstyle\": {\"style_classname\": \"feature\", \"style_label\": \"product,name,id\", \"style_description\": \"note,description\", \"style_height\": \"10px\", \"max_height\": \"600\"}, \"jbcolor_scale\": {\"color_score\": {\"color_score_select\": \"none\", \"__current_case__\": 0, \"color\": {\"color_select\": \"automatic\", \"__current_case__\": 0}}}, \"jb_custom_config\": {\"option\": []}, \"jbmenu\": {\"track_menu\": []}, \"track_visibility\": \"default_off\", \"override_apollo_plugins\": \"False\", \"override_apollo_drag\": \"False\"}}]}], \"uglyTestingHack\": \"\", \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "1.16.11+galaxy1",
"type": "tool",
"uuid": "f7335355-3558-43c9-b672-c9e0b3dcc5df",
"when": null,
"workflow_outputs": []
},
"8": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.5.0+galaxy0",
"errors": null,
"id": 8,
"input_connections": {
"input": {
"id": 5,
"output_name": "output_exons"
}
},
"inputs": [],
"label": null,
"name": "Busco",
"outputs": [
{
"name": "busco_sum",
"type": "txt"
},
{
"name": "busco_table",
"type": "tabular"
},
{
"name": "busco_missing",
"type": "tabular"
},
{
"name": "summary_image",
"type": "png"
},
{
"name": "busco_gff",
"type": "gff3"
}
],
"position": {
"left": 620,
"top": 1040
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.5.0+galaxy0",
"tool_shed_repository": {
"changeset_revision": "ea8146ee148f",
"name": "busco",
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"adv\": {\"evalue\": \"0.001\", \"limit\": \"3\", \"contig_break\": \"10\"}, \"busco_mode\": {\"mode\": \"tran\", \"__current_case__\": 1}, \"input\": {\"__class__\": \"ConnectedValue\"}, \"lineage\": {\"lineage_mode\": \"select_lineage\", \"__current_case__\": 1, \"lineage_dataset\": \"eudicots_odb10\"}, \"lineage_conditional\": {\"selector\": \"download\", \"__current_case__\": 1}, \"outputs\": [\"short_summary\", \"missing\", \"image\", \"gff\"], \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "5.5.0+galaxy0",
"type": "tool",
"uuid": "764c2a67-8c20-43f1-ba5c-ce54b4db4dc4",
"when": null,
"workflow_outputs": []
},
"9": {
"annotation": "In a basic editor or spreadsheet, filter the gene IDs into a txt file.",
"content_id": "export_remote",
"errors": null,
"id": 9,
"input_connections": {
"export_type|infiles": {
"id": 6,
"output_name": "output"
}
},
"inputs": [
{
"description": "runtime parameter for tool Export datasets",
"name": "export_type"
}
],
"label": "ID list",
"name": "Export datasets",
"outputs": [
{
"name": "out",
"type": "txt"
}
],
"position": {
"left": 1300,
"top": 650
},
"post_job_actions": {},
"tool_id": "export_remote",
"tool_state": "{\"d_uri\": \"\", \"export_type\": {\"export_type_selector\": \"datasets_auto\", \"__current_case__\": 0, \"infiles\": {\"__class__\": \"RuntimeValue\"}}, \"include_metadata_files\": true, \"invalid_chars\": \"/\", \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "0.1.0",
"type": "tool",
"uuid": "f6a1f688-0c13-4af0-ae03-f4dc45441cde",
"when": null,
"workflow_outputs": []
},
"10": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/galaxyp/filter_by_fasta_ids/filter_by_fasta_ids/2.3",
"errors": null,
"id": 10,
"input_connections": {
"header_criteria|identifiers": {
"id": 9,
"output_name": "out"
},
"input": {
"id": 4,
"output_name": "codingseq_output"
}
},
"inputs": [
{
"description": "runtime parameter for tool Filter FASTA",
"name": "header_criteria"
},
{
"description": "runtime parameter for tool Filter FASTA",
"name": "input"
}
],
"label": null,
"name": "Filter FASTA",
"outputs": [
{
"name": "output",
"type": "fasta"
}
],
"position": {
"left": 1600,
"top": 550
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/galaxyp/filter_by_fasta_ids/filter_by_fasta_ids/2.3",
"tool_shed_repository": {
"changeset_revision": "dff7df6fcab5",
"name": "filter_by_fasta_ids",
"owner": "galaxyp",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"dedup\": false, \"header_criteria\": {\"header_criteria_select\": \"id_list\", \"__current_case__\": 1, \"identifiers\": {\"__class__\": \"RuntimeValue\"}, \"id_regex\": {\"find\": \"beginning\", \"__current_case__\": 0}}, \"input\": {\"__class__\": \"RuntimeValue\"}, \"output_discarded\": false, \"sequence_criteria\": {\"sequence_criteria_select\": \"\", \"__current_case__\": 0}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "2.3",
"type": "tool",
"uuid": "46ac74cf-b556-425a-ba21-09cf24f06eb0",
"when": null,
"workflow_outputs": [
{
"label": "predicted-nlr-CDS",
"output_name": "output",
"uuid": "48d672c0-bb28-49bb-99bd-413b09ac0bce"
}
]
},
"11": {
"annotation": "",
"content_id": "toolshed.g2.bx.psu.edu/repos/galaxyp/filter_by_fasta_ids/filter_by_fasta_ids/2.3",
"errors": null,
"id": 11,
"input_connections": {
"header_criteria|identifiers": {
"id": 9,
"output_name": "out"
},
"input": {
"id": 4,
"output_name": "protein_output"
}
},
"inputs": [
{
"description": "runtime parameter for tool Filter FASTA",
"name": "header_criteria"
},
{
"description": "runtime parameter for tool Filter FASTA",
"name": "input"
}
],
"label": null,
"name": "Filter FASTA",
"outputs": [
{
"name": "output",
"type": "fasta"
}
],
"position": {
"left": 1600,
"top": 730
},
"post_job_actions": {},
"tool_id": "toolshed.g2.bx.psu.edu/repos/galaxyp/filter_by_fasta_ids/filter_by_fasta_ids/2.3",
"tool_shed_repository": {
"changeset_revision": "dff7df6fcab5",
"name": "filter_by_fasta_ids",
"owner": "galaxyp",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"dedup\": false, \"header_criteria\": {\"header_criteria_select\": \"id_list\", \"__current_case__\": 1, \"identifiers\": {\"__class__\": \"RuntimeValue\"}, \"id_regex\": {\"find\": \"beginning\", \"__current_case__\": 0}}, \"input\": {\"__class__\": \"RuntimeValue\"}, \"output_discarded\": false, \"sequence_criteria\": {\"sequence_criteria_select\": \"\", \"__current_case__\": 0}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
"tool_version": "2.3",
"type": "tool",
"uuid": "ef6ee489-a704-46bf-849d-5c0379ba423f",
"when": null,
"workflow_outputs": [
{
"label": "predicted-nlr-proteins",
"output_name": "output",
"uuid": "bcd73363-1da0-4566-9c6e-55b5c7c3eabc"
}
]
}
},
"tags": [
"Genome",
"Plant",
"DNA",
"QualityAssessment",
"genome-annotation"
],
"uuid": "0901a080-c94b-4f68-8923-3a6e5059420b",
"version": 19
}
The difficulty I am currently facing is the following.
For mapping, STAR has to load the index of this extremely large genome into memory along with the sequencing data, so a single STAR process can consume over 200 GB of RAM.
When I invoke this workflow, multiple independent STAR mappings are launched in parallel, which easily exhausts the available memory and interrupts the run.
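To make the memory constraint concrete, here is a rough back-of-the-envelope sketch. The 200 GB per-job figure comes from the observation above; the node size (512 GB) and the reserve (16 GB) are purely hypothetical values chosen for illustration, and `max_concurrent_jobs` is a made-up helper name, not part of Galaxy or STAR.

```python
def max_concurrent_jobs(total_ram_gb: float, per_job_gb: float,
                        reserve_gb: float = 0.0) -> int:
    """Return how many jobs needing per_job_gb each fit into the RAM
    left after setting aside reserve_gb for the OS and other tools."""
    if per_job_gb <= 0:
        raise ValueError("per_job_gb must be positive")
    usable_gb = total_ram_gb - reserve_gb
    # Floor division: a job that only partly fits does not count.
    return max(0, int(usable_gb // per_job_gb))


if __name__ == "__main__":
    # Hypothetical 512 GB node, ~200 GB per STAR job, 16 GB reserved.
    print(max_concurrent_jobs(512, 200, reserve_gb=16))
```

Under these assumed numbers only two independent STAR jobs fit at once, so launching one job per sample in parallel will run out of memory as soon as there are more than two samples.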
I have thought of two solutions:
- Enable shared memory when calling STAR (via the STAR --genomeLoad option) so that multiple STAR processes share a single copy of the index. However, this would require modifying the rg_rnaStar.xml file.
- Prevent the STAR jobs in the workflow from running concurrently. Instead, each STAR + StringTie assembly would be launched sequentially, one after another in a queue, and StringTie Merge would run only after the last STAR + StringTie assembly has finished. However, I am not sure whether workflows offer a way to make one step wait for another in this manner.
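For reference, the sequential ordering described in the second option can be sketched outside Galaxy as a plain Python driver. This is only an illustration of the intended order (STAR, then StringTie, per sample, then one StringTie merge at the end), not a Galaxy feature. The sample names, file paths, and the helpers `build_commands` and `run_sequentially` are hypothetical; the STAR and StringTie flags shown are standard ones, but should be checked against the versions in use. For the first option, `--genomeLoad LoadAndKeep` could in principle be appended to the STAR arguments, although the STAR manual should be consulted for its restrictions.

```python
import shlex
import subprocess
from pathlib import Path


def build_commands(samples, genome_dir, out_dir, threads=8):
    """Build an ordered list of commands (argv lists).

    samples maps a sample name to a (read1, read2) pair of FASTQ paths.
    Order: STAR then StringTie for each sample, then one final merge.
    """
    out = Path(out_dir)
    commands, gtfs = [], []
    for name, (read1, read2) in samples.items():
        prefix = out / f"{name}_"
        # STAR appends this fixed suffix to --outFileNamePrefix.
        bam = f"{prefix}Aligned.sortedByCoord.out.bam"
        gtf = str(out / f"{name}.gtf")
        commands.append([
            "STAR", "--runThreadN", str(threads),
            "--genomeDir", str(genome_dir),
            "--readFilesIn", read1, read2,
            "--outSAMtype", "BAM", "SortedByCoordinate",
            "--outFileNamePrefix", str(prefix),
        ])
        commands.append(["stringtie", bam, "-p", str(threads), "-o", gtf])
        gtfs.append(gtf)
    # The merge runs last, after every per-sample assembly exists.
    commands.append(["stringtie", "--merge", "-o",
                     str(out / "merged.gtf"), *gtfs])
    return commands


def run_sequentially(commands, dry_run=True):
    """Run commands strictly one after another; stop on the first failure."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical sample sheet; dry_run only prints the commands.
    demo = {"sampleA": ("A_1.fastq", "A_2.fastq"),
            "sampleB": ("B_1.fastq", "B_2.fastq")}
    run_sequentially(build_commands(demo, "star_index", "results"))
```

Because each command only starts after the previous one has returned, at most one STAR process holds the index in memory at any time, at the cost of a longer total runtime.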