Deleting sequence identifier line

Hello,

I have a FASTA file with identifier lines between the sequences. It looks like this:

>A00597:359:H53WCDSXC:3:1101:1127:23390/1
CCAGTACCCACTTAGAAAGAAATAAAAAAACAAATCAGACAACAAAGGCTTAATCTCAGCAGATCGTAACAACAAGGCTACTCTACTGCTTACAATACCCCGTTGTACATCTAAGTCGTATACAAATGATT
>A00597:359:H53WCDSXC:3:1101:1127:23390/2
CCAGTACCCACTTAGAAAGAAATAAAAAAACAAATCAGACAACAAAGGCTTAATCTCAGCAGATCGTAACAACAAGGCTACTCTACTGCTTACAATACCCCGTTGTACATCTAAGTCGTATACAAATGATT

I was wondering if it is possible to remove these identifier lines, and how I can do that. I was also wondering why I get two lines with the same sequence. I can see that one is labeled /1 and one /2, but what is the difference?
Hopefully someone can help me. Thanks in advance!

Kind regards,
Isa

Hi @Isa

I’m going to show you how to navigate our tutorials to frame your questions a bit more, then answer…



Sequences in fasta format will need something on the > title line: minimally an identifier, and optionally a description.
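To make the title-line structure concrete, here is a minimal Python sketch that splits each > line into an identifier and an optional description. It assumes the common convention that the identifier is the first whitespace-delimited token; `parse_fasta` and the example records are hypothetical names made up for illustration, not part of any Galaxy tool.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            # Assumption: first token is the identifier, the rest is the description.
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines may be wrapped, so collect them until the next title.
            chunks.append(line)
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# Made-up example data: one record with a description, one without.
example = """>read1 sample description
ACGT
TTGA
>read2
CCGG
"""
records = list(parse_fasta(example))
```

Dropping the title lines entirely would leave bare sequences that are no longer valid fasta, which is why most tools expect them to stay.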

Sequences in fastq format will also need something on the @ lines but the + lines are usually left blank.

The /1 and /2 suffixes are nomenclature for paired-end data: /1 marks the forward read and /2 the reverse read of the same fragment. You will usually want to keep them intact since tools will use them.
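As a rough illustration of how a tool might use that suffix, the sketch below groups read identifiers by their shared base name. The helper `split_mate` is a hypothetical name, and the code assumes the mate number is always the last `/`-separated field, as in the identifiers posted above.

```python
def split_mate(read_id):
    """Return (base_name, mate), where mate is '1', '2', or None if no suffix."""
    base, sep, mate = read_id.rpartition("/")
    if sep and mate in ("1", "2"):
        return base, mate
    return read_id, None


ids = [
    "A00597:359:H53WCDSXC:3:1101:1127:23390/1",
    "A00597:359:H53WCDSXC:3:1101:1127:23390/2",
]

# Group mates under their shared base name.
pairs = {}
for read_id in ids:
    base, mate = split_mate(read_id)
    pairs.setdefault(base, {})[mate] = read_id
```

Both identifiers end up under one base name, which is how the two mates of a pair can be matched up again later.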

Your example looks like NGS reads from a fastq file rather than fasta. However, I don’t see any lines for the quality scores. Maybe you know why that is (upstream manipulations?).
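If you want a quick way to check which format a file is in, a small sketch like the following looks at the first record only. It assumes unwrapped, four-line fastq records (title, sequence, +, qualities); `guess_format` is a hypothetical helper, not an existing tool, and the test strings are made up.

```python
def guess_format(text):
    """Guess 'fasta', 'fastq', or 'unknown' from the first record of a file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    if lines[0].startswith(">"):
        return "fasta"
    # Assumption: four-line fastq, with a quality string as long as the sequence.
    if (
        lines[0].startswith("@")
        and len(lines) >= 4
        and lines[2].startswith("+")
        and len(lines[1]) == len(lines[3])
    ):
        return "fastq"
    return "unknown"


fasta_text = ">read1/1\nCCAGTACC\n"
fastq_text = "@read1/1\nCCAGTACC\n+\nFFFFFFFF\n"
fasta_guess = guess_format(fasta_text)
fastq_guess = guess_format(fastq_text)
```

Reads shaped like the ones posted above, starting with > and lacking + and quality lines, would be reported as fasta by this check.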

Please review that NGS tutorial, then ask more questions here if you get stuck. We would be interested in seeing 1) the original data and learning about 2) where it came from and 3) what you plan to do with the data.

Let’s start there :slight_smile: