FAQ: NCBI reference data

:speech_balloon: How to find NCBI Reference data?

Q: Where can I find the URLs for a reference genome FASTA, annotation, or GenBank file?

A: Use the NCBI Genome Assembly page for the organism of interest. For example:
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000819615.1/

This page provides:

  • Download links for genome sequences (FASTA)
  • Annotation files (GFF, GTF)
  • GenBank format files (used with SnpEff build)
  • Direct links to the associated FTP directories

Screenshot of an NCBI genome assembly view for GCF_000819615.1

Screenshot of an NCBI genome FTP directory view for GCF_000819615.1

File name          Description
genomic.gbff.gz    Reference GenBank record (for SnpEff build)
genomic.gtf.gz     Reference gene annotation (GTF format)
genomic.fna.gz     Reference genome FASTA sequence
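
If you prefer to assemble the file URLs yourself rather than copying them from the FTP directory view, the sketch below shows one way to do it. It assumes the commonly seen NCBI layout in which the nine accession digits are split into three-digit folders under genomes/all, and it assumes the assembly name shown in the example call (ViralProj14015). Both are assumptions, so confirm them against the FTP directory link on the assembly page. The function names are hypothetical helpers, not part of any NCBI or Galaxy library.

```python
# Assumed base location of NCBI assembly directories; verify on the assembly page.
BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all"

# File name endings listed in the table above.
SUFFIXES = ("genomic.gbff.gz", "genomic.gtf.gz", "genomic.fna.gz")


def ncbi_ftp_dir(accession: str, assembly_name: str) -> str:
    """Build the (assumed) FTP directory URL for an assembly accession."""
    prefix, rest = accession.split("_", 1)      # "GCF", "000819615.1"
    digits = rest.split(".", 1)[0]              # "000819615"
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession!r}")
    # Split the nine digits into three folders: 000/819/615
    folders = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    return f"{BASE}/{prefix}/{folders}/{accession}_{assembly_name}"


def reference_urls(accession: str, assembly_name: str) -> dict:
    """Map each file name ending to its full (assumed) download URL."""
    directory = ncbi_ftp_dir(accession, assembly_name)
    stem = f"{accession}_{assembly_name}"
    return {suffix: f"{directory}/{stem}_{suffix}" for suffix in SUFFIXES}


if __name__ == "__main__":
    # The assembly name here is an assumption; copy the real one from the FTP view.
    for url in reference_urls("GCF_000819615.1", "ViralProj14015").values():
        print(url)
```

Each printed URL can then be pasted into Galaxy, provided the file actually exists in the directory.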

:link: How to get the data into Galaxy?

  1. Copy the URL for each file you need
  2. Paste the URL(s) into the :up_arrow: Upload tool
  3. Keep all default settings and click Start
  4. The files will be loaded into your currently active history

:hammer_and_wrench: Standardizing the format

  1. With some tools, getting the data loaded is enough!
  2. With others, you may want to remove the description text from the FASTA title lines (the lines starting with >) so that only the sequence identifier remains, and/or remove the header lines starting with # from the GTF.
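
The two clean-up steps above can also be done outside Galaxy before uploading. The following is a minimal sketch, assuming that "description content" means everything after the first whitespace on a FASTA title line and that every GTF line starting with # is a header line. The function names and file names are hypothetical.

```python
import gzip
from typing import Callable, Iterable, Iterator


def simplify_fasta_headers(lines: Iterable[str]) -> Iterator[str]:
    """Keep only the identifier (text before the first whitespace) on > lines."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            yield f">{identifier}\n"
        else:
            yield line  # sequence lines pass through unchanged


def drop_gtf_comments(lines: Iterable[str]) -> Iterator[str]:
    """Remove every line that starts with #, keeping the annotation records."""
    for line in lines:
        if not line.startswith("#"):
            yield line


def rewrite_gz(src: str, dst: str,
               transform: Callable[[Iterable[str]], Iterator[str]]) -> None:
    """Read a gzipped text file, apply a transform, write plain text."""
    with gzip.open(src, "rt") as fin, open(dst, "w") as fout:
        fout.writelines(transform(fin))


# Example usage with hypothetical local file names:
# rewrite_gz("genomic.fna.gz", "genome.fasta", simplify_fasta_headers)
# rewrite_gz("genomic.gtf.gz", "genes.gtf", drop_gtf_comments)
```

Trimming the title lines to the bare identifier keeps the sequence names in the FASTA consistent with the first column of the GTF, which is usually what downstream tools compare.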

:scientist: How to use the data with tools or a workflow?

  1. Use the input option to choose reference data from the history.
  2. Select your dataset file.
  3. If the data requires a tool index (for example, with mapping tools), the index will be created automatically at runtime.
  4. If you need a custom database key for visualization options, you can create and assign one!

:blue_book: Resources