Assembling Einkorn genome

AOEynes · July 24, 2023, 8:33pm

I have an Illumina sequence of Einkorn wheat that I grew in the Arctic. I have done several analyses on it in Illumina BaseSpace and also here in Galaxy. I have four files: two of them, R1 and R2, are 171 GB each, and the other two files, of 1.8 GB, were sequenced on another Illumina instrument.

The 200 GB space limit in Galaxy does not allow me to join these together into a single collection. I have tried several trimming tools, but the file size is the same after running them.
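For scale, the arithmetic behind the quota problem can be sketched as below. This is only an illustration: the dictionary labels are made up, the sizes are the ones quoted in the post, and the 1.8 GB figure is assumed to be per file.

```python
# Sizes in GB as quoted in the post. The labels are hypothetical, and the
# 1.8 GB figure is assumed to apply to each of the two smaller files.
file_sizes_gb = {
    "large_R1": 171.0,
    "large_R2": 171.0,
    "small_R1": 1.8,
    "small_R2": 1.8,
}
QUOTA_GB = 200.0  # the space limit mentioned in the post


def quota_report(sizes_gb, quota_gb):
    """Summarise how the raw inputs compare with the storage quota."""
    total = sum(sizes_gb.values())
    return {
        "total_gb": total,
        "fits": total <= quota_gb,
        "over_by_gb": max(0.0, total - quota_gb),
    }


report = quota_report(file_sizes_gb, QUOTA_GB)
print(f"raw inputs: {report['total_gb']:.1f} GB")
print(f"fits in quota: {report['fits']}")
print(f"over by: {report['over_by_gb']:.1f} GB")
```

Under these assumptions the raw inputs alone are well above the limit, before any trimmed or intermediate datasets are counted.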

I also have a PacBio long-read sequence of Einkorn, and it has been decided that this sequence is to be the reference. Nevertheless, I want to assemble the Illumina reads to a reference and make the result available through the IWGSC.

Any suggestions on which tool to choose, and how to join the two Illumina files? Any tips or help are appreciated. Also, if someone would like to join me and take part in this, please let me know.
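One possible way to join runs outside Galaxy, sketched here under assumptions, is plain concatenation: R1 files are appended to R1 files and R2 files to R2 files, in the same run order, so that read pairs stay in sync. Gzip files can be appended byte for byte and the result is still a valid gzip file. The file names below are hypothetical, and whether reads from two instruments should be pooled at all is a separate scientific decision.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concatenate_files(parts, destination):
    """Append each file in `parts` to `destination`, in the order given."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, low memory
    return Path(destination)


# Tiny demonstration with made-up reads standing in for two sequencing runs.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    run_a = "@a1\nACGT\n+\nIIII\n"
    run_b = "@b1\nTTGA\n+\nIIII\n"
    for name, text in (("runA_R1.fastq.gz", run_a), ("runB_R1.fastq.gz", run_b)):
        with gzip.open(tmp / name, "wt") as handle:
            handle.write(text)

    merged = concatenate_files(
        [tmp / "runA_R1.fastq.gz", tmp / "runB_R1.fastq.gz"],
        tmp / "merged_R1.fastq.gz",
    )
    # Reading the merged file back yields both runs, one after the other.
    with gzip.open(merged, "rt") as handle:
        merged_text = handle.read()

print(merged_text == run_a + run_b)
```

The same call would be repeated for the R2 files with the runs listed in the same order. Note that the merged file needs as much space as its inputs combined, so this does not get around a storage quota.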

jennaj · July 25, 2023, 6:18pm

Hi @AOEynes

You should run this at UseGalaxy.eu. Find the quota request form on the homepage of the server under Our Data Policy.

Tutorials for assembly are available here, and include workflows you can import and adjust for your species: https://training.galaxyproject.org/

Wheat is a bit tricky with such long chromosomes, but I suppose you already know about that.

For collaboration opportunities, I’ll let the community comment. There is a vertebrate genome assembly working group but not one for plants or I just don’t know about it yet. You could also ask at broader scientific forums and state that you plan to do this using Galaxy with extra quota and workflows.

Best wishes!

AOEynes · July 26, 2023, 9:23am

Ok, thanks for the advice. I am running my analyses on usegalaxy.no; it doesn’t seem like I have access to the EU server. It should be ok to run them on the Nordic machine cluster as well, but I don’t know.

I have seen the tutorials, and they looked like they could be helpful, but I am also having problems with the size of the files: they are too big to be joined and analyzed together in the Galaxy workspace. The trimming tools I have tried on them did not change the size of the files.
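File size alone may not show whether trimming did anything, since clipping a few bases from some reads changes the size very little. A more direct check is to compare read and base counts before and after. The sketch below assumes standard four-line FASTQ records; the function names and the tiny demo files are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def open_fastq(path):
    """Open a FASTQ file as text, handling .gz files transparently."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fastq_stats(path):
    """Return (read_count, total_bases), assuming 4 lines per record."""
    reads = 0
    bases = 0
    with open_fastq(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 1:  # second line of each record is the sequence
                reads += 1
                bases += len(line.strip())
    return reads, bases


# Demonstration on two made-up files: a raw one and a trimmed one.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.fastq"
    trimmed = Path(tmp) / "trimmed.fastq.gz"
    raw.write_text("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n")
    with gzip.open(trimmed, "wt") as out:
        out.write("@r1\nACGTAC\n+\nIIIIII\n@r2\nACGT\n+\nIIII\n")
    raw_stats = fastq_stats(raw)
    trimmed_stats = fastq_stats(trimmed)

print("raw (reads, bases):", raw_stats)
print("trimmed (reads, bases):", trimmed_stats)
```

If the two counts are identical, the trimming step probably removed nothing; if only the base count drops slightly, trimming worked but will not help much with storage.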

I originally thought that this wheat had two chromosomes, since it is a diploid, but no, it has 7 chromosomes. It will be interesting to find them in the data.

I have seen the Vertebrate project, and it looks very interesting. Fluid genetics. I would think that assembly of my einkorn genome, actually the northernmost wheat in the world, would be an interesting little task and not very time-consuming, since I am so fortunate that I also have a very high-quality PacBio sequence as reference. Originally I wanted mine to be a de novo assembly, building a library from scratch, tailoring the bases so to speak, and to have this as a reference genome for other similar sequences, but the goal now is firstly to assemble this decently and deposit it at the IWGSC. I would very much like someone to take an interest in this and help me out a bit. After assembly and submission to the consortium I hope to start analyzing the structures more thoroughly. Anyone with knowledge of assembling with a reference is welcome; I would be very thankful.

jennaj · July 26, 2023, 7:05pm

I suggested the EU server since it has been scaled up to handle assembly, and they can grant extra working space. Available working memory on the cluster nodes is a different technical consideration. If the NO server admins can provide the resources, too, then you are good to go!

And, just so you know … anyone can create an account at any of the usegalaxy.* servers, or any other public server. Having an account at each provides access to distinct resources. Institutional login can be server-specific, but just a regular email registration should work anywhere that is truly “public”.

AOEynes · July 27, 2023, 9:13pm

Ok, perhaps I’ll apply for a second account, but I am still looking for someone who wishes to help and join this small project. Do you think I should start a new thread? Also, I have been running Velvet optimizer for 72 hours now, and the file is only 3.6 GB. Should I cancel the analysis? I personally think it has gone wrong. I am only a few hundred MB away from my 200 GB limit.

jennaj · July 27, 2023, 11:44pm

That tool is only available for training purposes at the other usegalaxy.* servers. Not sure if the NO admins are fully supporting it or not (it consumes a LOT of compute resources).

That said, a tool that is still executing still has some chance of succeeding unless the admins tell you differently.

And, sure you can start up another topic here but I’m not sure what it will gain. This topic’s title is pretty clear. You decide. And, try casting your net wider. You are looking for more plant people to do an assembly with – so find out where they chat online and post there too.

How you achieve that assembly would be the Galaxy part. Tools and resources.

Why to spend effort on that assembly is the scientist part, along with the analysis details and goals. Not everyone will be at this forum already … and others could certainly be interested, especially if you are organizing the technical resources/support.

The UseGalaxy.be group was working on crops at one point, so maybe reach out to them too. The other smaller servers tend to focus on a domain – I see a few about plants/crops with a search. And … that’s all I know about that! Hope that gives you more ideas.

Galaxy Platform Directory: Servers, Clouds, and Deployable Resources - Galaxy Community Hub

AOEynes · July 29, 2023, 4:16pm

Ok, I have registered a new account at the *.eu server. I have started to upload my two sanger files. I have applied for a quota of 1 TB for a month for assembling the larger Illumina and PacBio sequences.

I have started browsing for other forums that discuss these and similar topics: soil.com, crop.com and agronomy.com.

If you would like to tinker with these files, I would very comfortably, and with relief, let you do it. I am sure you have enough to do, but you would be fully credited, of course. The goal is to assemble and submit to the IWGSC, the International Wheat Genome Sequencing Consortium. All the data is owned and acquired by myself, except for the PacBio sequence, which I have received a copy of from their group for use with assembling.

At least I have asked you. Anyway, you have been very helpful, and thanks a lot.

AOEynes · October 7, 2023, 12:36pm

If someone is interested in wheat genomics and wants to help out, please let me know. Some work has been done with this, but I still want to get in contact with someone who could take it a bit further and make the most of the very solid raw data in my possession.
