Splitting interleaved/interlaced fastq data and Extracting fastq data from an sra archive

Hi, I am using paired-end data. I downloaded the SRA file from NCBI and uploaded it to the Galaxy server. Can I use this data directly for assembly, or do I first have to split the files and then use them?

Hi @NITISH_DAVE

It depends on which assembly tool you are using, but some do require that interleaved reads are split into forward/reverse first. None accept an sra archive directly (that I am aware of). QA/QC is also strongly recommended (FastQC, fastp, Trimmomatic).
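
To illustrate what "splitting interleaved reads" means, here is a minimal Python sketch. It is not what any Galaxy tool runs internally. It assumes plain 4-line FASTQ records (no wrapped sequence lines) that strictly alternate forward, reverse, forward, reverse. The function names `read_fastq` and `split_interleaved` and the example reads are made up for illustration.

```python
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-tuples: (header, sequence, plus, quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_interleaved(lines):
    """Split strictly alternating records (R1, R2, R1, R2, ...) into two lists."""
    records = list(read_fastq(lines))
    if len(records) % 2 != 0:
        raise ValueError("odd number of records: input is not interleaved pairs")
    # Even positions are forward reads, odd positions are reverse reads.
    return records[0::2], records[1::2]


# Toy interleaved input (hypothetical reads, for illustration only).
interleaved = [
    "@read1/1", "ACGT", "+", "IIII",
    "@read1/2", "TGCA", "+", "IIII",
    "@read2/1", "GGCC", "+", "IIII",
    "@read2/2", "CCGG", "+", "IIII",
]
forward, reverse = split_interleaved(interleaved)
for fwd, rev in zip(forward, reverse):
    print(fwd[0], rev[0])
```

For real data, the Galaxy tools are the safer route, since they also handle compressed inputs and naming conventions that this sketch ignores.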

Please see these FAQs, in particular, the last one:

Also, there are two primary tools to extract fastq data from NCBI. The first will output data as it is formatted at NCBI. The second will split any pairs, place all data into Dataset Collections, and can extract fastq sequences from sra data that has been Uploaded (probably the option you need).

  • Download and Extract Reads in FASTA/Q format from NCBI SRA
  • Faster Download and Extract Reads in FASTQ format from NCBI SRA

Thanks!

Thank you for the reply. I used Faster Download and Extract Reads in FASTQ format from NCBI SRA and it yielded 4 files. However, according to the metadata, the run contains data from two different tissues, but I only see paired-end data with separate forward and reverse files. Where can I find the separate data for each tissue, given that it is presented as a single SRA file at NCBI?

I’m not sure that I understand.

  • Did you enter a distinct accession for each distinct sample? The form accepts individual or lists of accessions.
  • Or, did you extract data from the SRA archive that you had already loaded? Which archive (URL)?

Thanks!

I used one accession, which I extracted using Faster Download and Extract Reads in FASTQ format from NCBI SRA. Does that single accession contain data from both tissues? The metadata is below for reference.

Run, Assay Type, AvgSpotLen, Bases, BioProject, BioSample, BioSampleModel, Bytes, Center Name, Collection_Date, Consent, DATASTORE filetype, DATASTORE provider, DATASTORE region, dev_stage, Experiment, geo_loc_name_country, geo_loc_name_country_continent, geo_loc_name, Instrument, Isolate, Isolation_source, Library Name, LibraryLayout, LibrarySelection, LibrarySource, Organism, Platform, ReleaseDate, Sample Name, sample_type, SRA Study, tissue
SRR4299636, CLONE, 298, 9557003942, PRJNA344442, SAMN05823576, Plant, 3924705078, TIANJIN UNIVERSITY OF SCIENCE AND TECHNOLOGY, 2015-11-27, public, sra, "ncbi,gs,s3", "gs.US,s3.us-east-1,ncbi.public", Young leaf, SRX2194189, China, Asia, China, NextSeq 500, Leaves, soil, PRJNA344442SAMN05823576, PAIRED, cDNA, TRANSCRIPTOMIC, Azadirachta indica, ILLUMINA,2017-04-26T00:00:00Z, Neem advewntitioius root, Tropic plant, SRP090539, Neem adventitious root in MS medium
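
The two comma-separated lines above are hard to read by eye. A small Python sketch like the following pairs each field name with its value. It copies the two lines verbatim (including the Sample Name typo) and assumes they are standard CSV with a space after most commas, which is why `skipinitialspace=True` is used so the quoted fields parse correctly.

```python
import csv
import io

# Header and value lines copied verbatim from the post above.
header_line = (
    "Run, Assay Type, AvgSpotLen, Bases, BioProject, BioSample, BioSampleModel, "
    "Bytes, Center Name, Collection_Date, Consent, DATASTORE filetype, "
    "DATASTORE provider, DATASTORE region, dev_stage, Experiment, "
    "geo_loc_name_country, geo_loc_name_country_continent, geo_loc_name, "
    "Instrument, Isolate, Isolation_source, Library Name, LibraryLayout, "
    "LibrarySelection, LibrarySource, Organism, Platform, ReleaseDate, "
    "Sample Name, sample_type, SRA Study, tissue"
)
value_line = (
    'SRR4299636, CLONE, 298, 9557003942, PRJNA344442, SAMN05823576, Plant, '
    '3924705078, TIANJIN UNIVERSITY OF SCIENCE AND TECHNOLOGY, 2015-11-27, '
    'public, sra, "ncbi,gs,s3", "gs.US,s3.us-east-1,ncbi.public", Young leaf, '
    'SRX2194189, China, Asia, China, NextSeq 500, Leaves, soil, '
    'PRJNA344442SAMN05823576, PAIRED, cDNA, TRANSCRIPTOMIC, Azadirachta indica, '
    'ILLUMINA,2017-04-26T00:00:00Z, Neem advewntitioius root, Tropic plant, '
    'SRP090539, Neem adventitious root in MS medium'
)

# skipinitialspace=True drops the space after each comma, so a field such as
# "ncbi,gs,s3" is recognised as one quoted value instead of three.
reader = csv.reader(io.StringIO(header_line + "\n" + value_line),
                    skipinitialspace=True)
fields, values = list(reader)
record = dict(zip(fields, values))

# Print the fields most relevant to the "which tissue?" question.
for key in ("Run", "LibraryLayout", "dev_stage", "Isolate", "Sample Name", "tissue"):
    print(f"{key}: {record[key]}")
```

There is a single data row, so a single Run accession, which is the point made in the reply that follows.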

The RUN accession represents a single sequencing event. In this case, it is paired-end fastq data. At the highest level, the BioProject accession only includes one sequencing run. Same for the SRA Study accession. There is no information indicating that more than one tissue was sequenced (just Neem adventitious root in MS medium).

Thanks again for the reply. In their research manuscript they mention both neem leaf and neem adventitious root transcriptomes, so I thought that they had sequenced both leaf and root samples. I have attached the image, please have a look. Can you suggest any way to compare differential gene expression between these two tissues?

[Image: manuscript acc]

Publications should clearly reference all data.

The SRR accession appears to be a match for the AR reads, and these are the only reads submitted for the BioProject.

Choices:

  1. Contact the author(s) for clarification.
  2. Consider the publication, and associated data, flawed and avoid it. The SRR submission has some issues: Library Name is oddly annotated. Sample Name has a typo. ReleaseDate is after the paper’s publication date.

For your Trinity questions:

  1. Trinity requires that paired-end inputs are “matched pairs”, meaning that both ends of the same read are input.
  2. If one of the ends fails QA/QC (Trimmomatic, fastp, others), then the other associated end cannot be used, even if it happens to pass QA.
  3. When extracting from SRR, the original data will be paired.
  4. When running through QA tools, the data can become unpaired. That said, QA tools such as Trimmomatic sort their output into four datasets:
    1. Paired forward + Paired reverse = use these for assembly inputs
    2. Single forward + Single reverse = do not use these for assembly inputs. One end of the original pair did not pass QA, and the assembly will fail if input.
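
The four-way sorting described above can be sketched in Python as follows. This is only an illustration of the bookkeeping, not Trimmomatic's actual algorithm. The `passes_qc` filter (a simple minimum-length check), the function name `sort_pairs`, and the toy reads are all assumptions made up for the example.

```python
def sort_pairs(forward, reverse, passes_qc):
    """Partition matched read pairs into paired and single outputs.

    forward and reverse are lists of (name, sequence) tuples in the same
    order, so forward[i] and reverse[i] are the two ends of one pair.
    """
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse inputs are not matched pairs")
    out = {
        "paired_forward": [], "paired_reverse": [],
        "single_forward": [], "single_reverse": [],
    }
    for fwd, rev in zip(forward, reverse):
        fwd_ok, rev_ok = passes_qc(fwd), passes_qc(rev)
        if fwd_ok and rev_ok:
            # Both ends survive: keep them together, in the same order.
            out["paired_forward"].append(fwd)
            out["paired_reverse"].append(rev)
        elif fwd_ok:
            out["single_forward"].append(fwd)
        elif rev_ok:
            out["single_reverse"].append(rev)
        # If both ends fail, the pair is dropped entirely.
    return out


def passes_qc(read, min_len=4):
    """Hypothetical stand-in for a real QC filter: keep reads of min_len or more."""
    return len(read[1]) >= min_len


# Toy reads, for illustration only.
forward = [("r1", "ACGTACGT"), ("r2", "AC"), ("r3", "GGGGCCCC"), ("r4", "A")]
reverse = [("r1", "TTTTAAAA"), ("r2", "CCCCGGGG"), ("r3", "GG"), ("r4", "T")]

result = sort_pairs(forward, reverse, passes_qc)
for name, reads in result.items():
    print(name, [r[0] for r in reads])
```

Only `paired_forward` and `paired_reverse` stay the same length and in the same order, which is why those two are the ones to give to the assembler.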

Please be aware of a few current factors that can impact assembly success/failures when using the public Galaxy Main https://usegalaxy.org server right now. There is a banner on the server explaining. More details:

  1. Trinity and Unicycler are running with reduced memory allocation at this time.
  2. Make sure to use the most current version of all tools, or unexpected problems can occur. The most current version of any tool’s form will load from the Tool Panel.
  3. If your job fails, confirm that you are using the most current tool version.
  • If not, rerun using the updated version.
  • If yes, then the failure may be due to the reduced memory resources. Try one rerun. If that fails again, there may be some other problem with your inputs. How to check for common input problems is discussed in the topic below.
  • My inputs are Ok – How to work-around the reduced memory allocation? a) Consider using an alternative public Galaxy server b) Decide if down/sub-sampling your inputs will meet your goals (see Seqtk tools).
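
If down/sub-sampling is the route you choose, the important detail for paired-end data is that the same reads must be kept in both files, or the pairs will no longer match. The sketch below shows that idea in Python. It is not how the Seqtk tools are implemented. The function name `subsample_pairs`, the `fraction` and `seed` parameters, and the toy reads are assumptions for illustration.

```python
import random


def subsample_pairs(forward, reverse, fraction, seed=11):
    """Keep roughly `fraction` of the pairs, using one keep/drop decision per pair."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse inputs are not matched pairs")
    rng = random.Random(seed)
    # One decision per pair, applied to both ends, so pairing is preserved.
    keep = [rng.random() < fraction for _ in forward]
    kept_forward = [read for read, k in zip(forward, keep) if k]
    kept_reverse = [read for read, k in zip(reverse, keep) if k]
    return kept_forward, kept_reverse


# Toy (name, sequence) reads, for illustration only.
forward = [(f"r{i}", "ACGT") for i in range(1, 11)]
reverse = [(f"r{i}", "TGCA") for i in range(1, 11)]

sub_forward, sub_reverse = subsample_pairs(forward, reverse, fraction=0.5)
print([r[0] for r in sub_forward])
print([r[0] for r in sub_reverse])
```

Whatever tool you use, check afterwards that the forward and reverse outputs have the same number of reads and the same read names in the same order.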

Hope that helps!