Splitting interleaved/interlaced fastq data and Extracting fastq data from an sra archive

Hi, I am using paired-end data. I downloaded the SRA file from NCBI and uploaded it to the Galaxy server. Can I use this data directly for assembly, or do I first have to split the files?

Hi @NITISH_DAVE

It depends on which assembly tool you are using, but some do require that interleaved reads are split into forward/reverse first. None accept an sra archive directly (that I am aware of). QA/QC is also strongly recommended (FastQC, fastp, Trimmomatic).
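To illustrate what "splitting interleaved reads" means, here is a minimal Python sketch. It assumes a plain-text FASTQ with exactly four lines per record, no blank lines, and mates stored as alternating records (forward, reverse, forward, reverse, ...). The function names `read_fastq` and `deinterleave` are hypothetical and are not part of any Galaxy tool; on Galaxy you would normally use an existing tool for this rather than a script.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # clean end of file
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def deinterleave(interleaved, forward_out, reverse_out):
    """Write alternating records to the forward and reverse handles; return the pair count."""
    records = read_fastq(interleaved)
    pairs = 0
    for fwd in records:
        # The record right after a forward read is assumed to be its mate.
        rev = next(records, None)
        if rev is None:
            raise ValueError("odd number of records: the last read has no mate")
        forward_out.write("\n".join(fwd) + "\n")
        reverse_out.write("\n".join(rev) + "\n")
        pairs += 1
    return pairs


# Tiny made-up example: one read pair stored as two consecutive records.
example = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r1/2\nTTGA\n+\nIIII\n")
fwd, rev = io.StringIO(), io.StringIO()
n_pairs = deinterleave(example, fwd, rev)
```

With real data you would pass open file handles instead of `io.StringIO` objects. The sketch does not check that mate names match, so QA/QC is still needed.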

Please see these FAQs, in particular, the last one:

Also, there are two primary tools to extract fastq data from NCBI. The first outputs data as it is formatted at NCBI. The second splits any pairs, places all data into Dataset Collections, and can extract fastq sequences from sra data that has been Uploaded (probably the option you need).

  • Download and Extract Reads in FASTA/Q format from NCBI SRA
  • Faster Download and Extract Reads in FASTQ format from NCBI SRA

Thanks!

Thank you for the reply. I used Faster Download and Extract Reads in FASTQ format from NCBI SRA and it yielded 4 files. However, according to the metadata, the archive contains data from two different tissues, but I only see paired-end data with separate forward and reverse files. Where can I find the separate data for each tissue, given that the data is presented as a single SRA file at NCBI?

I’m not sure that I understand.

  • Did you enter a distinct accession for each distinct sample? The form accepts individual or lists of accessions.
  • Or, did you extract data from the SRA archive that you had already loaded? Which archive (URL)?

Thanks!

I used one accession, which I extracted using Faster Download and Extract Reads in FASTQ format from NCBI SRA. Does that single accession contain data for both tissues? The metadata is below for reference.

Run, Assay Type, AvgSpotLen, Bases, BioProject, BioSample, BioSampleModel, Bytes, Center Name, Collection_Date, Consent, DATASTORE filetype, DATASTORE provider, DATASTORE region, dev_stage, Experiment, geo_loc_name_country, geo_loc_name_country_continent, geo_loc_name, Instrument, Isolate, Isolation_source, Library Name, LibraryLayout, LibrarySelection, LibrarySource, Organism, Platform, ReleaseDate, Sample Name, sample_type, SRA Study, tissue
SRR4299636, CLONE, 298, 9557003942, PRJNA344442, SAMN05823576, Plant, 3924705078, TIANJIN UNIVERSITY OF SCIENCE AND TECHNOLOGY, 2015-11-27, public, sra, "ncbi,gs,s3", "gs.US,s3.us-east-1,ncbi.public", Young leaf, SRX2194189, China, Asia, China, NextSeq 500, Leaves, soil, PRJNA344442SAMN05823576, PAIRED, cDNA, TRANSCRIPTOMIC, Azadirachta indica, ILLUMINA,2017-04-26T00:00:00Z, Neem advewntitioius root, Tropic plant, SRP090539, Neem adventitious root in MS medium

admin edit: text reformat

The RUN accession represents a single sequencing event. In this case, it is paired-end fastq data. At the highest level, the BioProject accession only includes one sequencing run. Same for the SRA Study accession. There is no information indicating that more than one tissue was sequenced (just Neem adventitious root in MS medium).
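The same check can be done programmatically on a run table. This is only a sketch: it assumes the metadata is comma-separated text with `Run` and `tissue` columns, as in the table posted above. The helper name `tissues_by_run` is hypothetical, and the example uses only a few of the posted columns to stay short.

```python
import csv
import io
from collections import defaultdict


def tissues_by_run(run_table_text):
    """Map each Run accession to the set of tissue values listed for it."""
    # skipinitialspace handles the ", " separators and quoted fields such as "ncbi,gs,s3".
    reader = csv.DictReader(io.StringIO(run_table_text), skipinitialspace=True)
    runs = defaultdict(set)
    for row in reader:
        runs[row["Run"].strip()].add(row["tissue"].strip())
    return dict(runs)


# Subset of the columns and values from the metadata posted in this thread.
run_table = (
    "Run, BioProject, DATASTORE provider, LibraryLayout, tissue\n"
    'SRR4299636, PRJNA344442, "ncbi,gs,s3", PAIRED, Neem adventitious root in MS medium\n'
)

result = tissues_by_run(run_table)
n_runs = len(result)
```

If a project really contained two tissues, you would expect either two Run accessions or two distinct tissue values here. For this table there is one of each.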

Thanks again for the reply. In their research manuscript, the authors mention both neem leaf and neem adventitious root transcriptomes, so I thought that they had sequenced both leaf and root samples. I have attached an image; please have a look. Can you suggest any way to compare differential gene expression between these two tissues?

[Attached image: manuscript acc]

Publications should clearly reference all data.

The SRR accession appears to be a match for the AR (adventitious root) reads, and these are the only reads submitted for the BioProject.

Choices:

  1. Contact the author(s) for clarification.
  2. Consider the publication, and associated data, flawed and avoid both. The SRR submission has some issues: Library Name is oddly annotated. Sample Name has a typo. ReleaseDate is after the paper’s publication date.