Assembling Einkorn genome

I have Illumina sequence data from Einkorn wheat that I have grown in the Arctic. I have done several analyses on it in Illumina BaseSpace and also here in Galaxy. I have four files: two of them, R1 and R2, are 171 GB each, and the other two are 1.8 GB and were sequenced on another Illumina instrument.

The 200 GB space limit in Galaxy does not allow me to join these together into a single collection. I have tried several trimming tools, but the file size is the same after running the trimming programs.
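For anyone following along, the quota problem can be checked with simple arithmetic. This is only a sketch: the sizes and the 200 GB limit are the figures from the post, it assumes the 1.8 GB figure is per file, and the names `FILE_SIZES_GB`, `QUOTA_GB` and `quota_report` are made up for illustration.

```python
# Sizes in GB as stated in the post.
# Assumption: each of the two smaller files is 1.8 GB.
FILE_SIZES_GB = {
    "large_R1": 171.0,
    "large_R2": 171.0,
    "small_R1": 1.8,
    "small_R2": 1.8,
}
QUOTA_GB = 200.0  # account limit mentioned in the post


def quota_report(sizes_gb, quota_gb):
    """Return the total input size and how far it exceeds the quota."""
    total = sum(sizes_gb.values())
    return {
        "total_gb": total,
        "fits": total <= quota_gb,
        "shortfall_gb": max(0.0, total - quota_gb),
    }


report = quota_report(FILE_SIZES_GB, QUOTA_GB)
print(f"total input: {report['total_gb']:.1f} GB, fits in quota: {report['fits']}")
print(f"extra space needed for the inputs alone: {report['shortfall_gb']:.1f} GB")
```

The inputs alone are larger than the quota, before any trimmed copies or assembly output are counted, so extra quota would be needed whichever tool is used.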

I also have a PacBio long-read sequence of Einkorn, and it has been decided that this sequence is to be the reference. Nevertheless, I want to assemble the Illumina reads against a reference and make the assembly available through the IWGSC.

Any suggestions on which tool to choose, and how to join the two Illumina files? Any tips or help are appreciated. Also, if someone would like to join me and take part in this, please let me know.

Hi @AOEynes

You should run this at the EU server. Find the quota request form on the homepage of the server under Our Data Policy.

Tutorials for assembly are available, and they include workflows you can import and adjust for your species.

Wheat is a bit tricky with such long chromosomes but I suppose you already know about that :slight_smile:

For collaboration opportunities, I’ll let the community comment. There is a vertebrate genome assembly working group but not one for plants or I just don’t know about it yet. You could also ask at broader scientific forums and state that you plan to do this using Galaxy with extra quota and workflows.

Best wishes!

Ok, thanks for the advice. I am running my analyses on the NO server; it doesn’t seem like I have access to EU. It should be ok to run them on the Nordic machine cluster as well, but I don’t know.

I have seen the tutorials and they look like they could be helpful, but I am also having problems with the size of the files: they are too big to be joined and analyzed together in the Galaxy workspace. The trimming tool I tried on them did not change the size of the files.
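File size on disk can be a rough indicator, so one way to see whether a trimming step changed anything is to compare read and base counts before and after. The sketch below assumes plain four-lines-per-record FASTQ (optionally gzipped); `fastq_stats` and `fastq_stats_from_path` are hypothetical helper names, and the tiny in-memory reads are invented purely for the example. For files of 171 GB this would be slow, so it is better tried on a subsample first.

```python
import gzip
import io


def fastq_stats(handle):
    """Count reads and bases in a 4-line-per-record FASTQ text stream."""
    reads = 0
    bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            reads += 1
            bases += len(line.strip())
    return reads, bases


def fastq_stats_from_path(path):
    """Open a FASTQ file (gzipped if the name ends in .gz) and count it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return fastq_stats(handle)


# Made-up example data: the "trimmed" copy has two bases removed from read r1.
raw = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n")
trimmed = io.StringIO("@r1\nACGTAC\n+\nIIIIII\n@r2\nACGT\n+\nIIII\n")

raw_stats = fastq_stats(raw)
trimmed_stats = fastq_stats(trimmed)
print("raw (reads, bases):", raw_stats)
print("trimmed (reads, bases):", trimmed_stats)
```

If the two counts are identical for the real data, the trimming settings probably did not remove anything; if they differ, the trimming worked even if the file sizes look similar.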

I originally thought that this wheat had two chromosomes, since it is a diploid, but no, it has 7 chromosomes. It will be interesting to find them in the data.

I have seen the Vertebrate project, it looks very interesting. Fluid genetics. I would think that assembly of my einkorn genome, actually the northernmost wheat in the world, would be an interesting little task and not very time consuming, since I am so fortunate that I also have a very high quality PacBio sequence as reference. Originally I wanted mine to be a de novo assembly, building a library from scratch, tailoring the bases so to speak, and to have this as a reference genome for other similar sequences, but the goal now is firstly to assemble this decently and deposit it at the IWGSC. I would very much like someone to take an interest in this and help me out a bit. After assembly and submission to the consortium I hope to start analyzing the structures more thoroughly. Anyone with knowledge of assembling with a reference is welcome, I would be very thankful.

I suggested the EU server since it has been scaled up to handle assembly and they can grant extra working space. Available working memory on the cluster nodes is a different technical consideration. If the NO server admins can provide the resources, too, then you are good to go!

And, just so you know … anyone can create an account at any of the usegalaxy.* servers, or any other public server. Having an account at each provides access to distinct resources :slight_smile: Institutional login can be server-specific but just a regular email registration should work anywhere that is truly “public”.

Ok, perhaps I’ll apply for a second account, but I am still looking for someone who wishes to help and join this small project. Do you think I should start a new thread? Also, I have been running VelvetOptimiser for 72 hours now, and the file is only 3.6 GB. Should I cancel the analysis? I personally think it has gone wrong. I am only a few hundred MB away from my 200 GB limit.

That tool is only available for training purposes at the other usegalaxy.* server. Not sure if the NO admins are fully supporting it or not (consumes a LOT of compute resources).

That said, a tool that is still executing still has some chance of succeeding unless the admins tell you differently.

And, sure you can start up another topic here but I’m not sure what it will gain. This topic’s title is pretty clear. You decide. And, try casting your net wider. You are looking for more plant people to do an assembly with – so find out where they chat online and post there too.

How you achieve that assembly would be the Galaxy part. Tools and resources.

Why to spend effort on that assembly is the scientist part, along with the analysis details and goals. Not everyone will be at this forum already … and others could certainly be interested, especially if you are organizing the technical resources/support.

The group was working on crops at one point, so maybe reach out to them too. The other smaller servers tend to focus on a domain – I see a few about plants/crops with a search. And … that’s all I know about that! Hope that gives you more ideas. Galaxy Platform Directory: Servers, Clouds, and Deployable Resources - Galaxy Community Hub

Ok, I have registered a new account at the *.eu servers. I have started to upload my two sangerfiles. I have applied for a quota of 1 TB for a month for assembling the larger Illumina and PacBio sequences.

I have started browsing for other forums that discuss these and similar topics.

If you would like to tinker with these files, I would be very comfortable and relieved to let you do it. I am sure you have enough to do, but you would be fully credited, of course. The goal is to assemble and submit to the IWGSC, the International Wheat Genome Sequencing Consortium. All the data is owned and acquired by myself, except for the PacBio, which I have received a copy of from their group for use in assembling.

At least I have asked. Anyway, you have been very helpful, thanks a lot.

If someone is interested in wheat genomics and wants to help out, please let me know. Some work has been done with this, but I still want to get in contact with someone who could take it a bit further and make the most of the very solid raw data material in my possession.