Almost everything about ChIP-Seq data analysis, with a focus on beginners

1113 · June 14, 2019, 10:51am

Hi all,

I am new to Galaxy and need to analyse my ChIP-seq data. I found very useful Galaxy training resources:
https://galaxyproject.org/tutorials/chip/
https://galaxyproject.org/tutorials/ngs/
and also about using Galaxy for Chip-Seq data analysis from Abcam:

However, they are not enough for a complete beginner. I am missing a discussion where I can ask about specific cases and how to apply the tools. For example, why does the Galaxy ChIP-seq tutorial use BWA-MEM rather than Bowtie2 for mapping? Why are adapters not trimmed with Trimmomatic? What should I do with the 4 files produced by Trimmomatic (when you trim adapters it produces 2 files with paired reads and 2 files with unpaired reads)? There is a whole section on ChIP-seq in Galaxy; what is in it? And of course: what do my results mean?

Some answers can be found on other forums, but it would be very useful to have a good discussion in one place, for example here, where we can share our experiences of using Galaxy for ChIP-seq data.

I plan to post my questions here.

My first question: after using Trimmomatic (removing TruSeq3 adapters, plus a sliding window), I mapped the reads with BWA-MEM. I removed duplicates and had a peek at the data, and saw that my sequences mapped only to chromosome 1, in both input and IP samples. That sounds very wrong. Maybe the BAM file cannot be viewed in full in Galaxy? There is no indication that the file is not fully visible. Or did my trimming remove a significant share of the good reads?

I tried plotFingerprint, and my input and IP samples look as if my protein binds to very broad regions. I have heard that data like that needs a specific peak caller. Which one?

Thank you all!

jennaj · June 14, 2019, 7:01pm

Hello @1113

This help below will aid when getting started with Galaxy. It includes links to the Galaxy GTN Tutorials, which include a ChIP-seq batch of tutorials that cover many tool options, workflows, and examples. If you find a problem with a tutorial, there is a feedback form at the end of each one. If you want to discuss a tutorial for clarification, try the GTN Gitter chat: https://gitter.im/Galaxy-Training-Network/Lobby

For your other question:

BAM datasets are very large. A message is shown at the top of the window (after clicking on the “eye” icon) that describes that only part of the data is displayed.

To learn how to generate and interpret statistics on datasets, including BAM datasets, please see this prior Q&A: How to interpret and generate statistics for any dataset, including BAM or SAM or Tabular data
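
One way to check whether reads really map to a single chromosome, rather than judging from the truncated preview, is to count mapped reads per reference sequence. The sketch below is only an illustration: it assumes you already have tab-separated per-chromosome statistics in the four-column layout that `samtools idxstats` writes (name, length, mapped reads, unmapped reads), and the function names and example numbers are made up for this post.

```python
from typing import Dict


def mapped_reads_per_chromosome(idxstats_text: str) -> Dict[str, int]:
    """Parse idxstats-style text: name, length, mapped, unmapped (tab-separated)."""
    counts: Dict[str, int] = {}
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            continue  # the "*" row holds reads with no reference assigned
        counts[name] = int(mapped)
    return counts


def fraction_on(counts: Dict[str, int], chrom: str) -> float:
    """Share of all mapped reads that fall on one chromosome."""
    total = sum(counts.values())
    return counts.get(chrom, 0) / total if total else 0.0


# Made-up example values, not real data.
example = "chr1\t248956422\t1200\t3\nchr2\t242193529\t1100\t2\n*\t0\t0\t50\n"
counts = mapped_reads_per_chromosome(example)
print(counts)
print(round(fraction_on(counts, "chr1"), 3))
```

If the share for chr1 is close to 1.0 in the real statistics, the mapping itself deserves a closer look; if reads are spread across chromosomes, the "only chromosome 1" impression came from the partial preview.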

Thanks!

1113 · June 17, 2019, 10:22am

Thank you very much jennaj

I will try Galaxy Training Network/Lobby to clarify how to deal with my data.

Also, I found the ChIP-Seq data processing standards approved by ENCODE. They have compiled a list of tools (a pipeline) to be used for the different ChIP-seq cases:
https://www.encodeproject.org/chip-seq/transcription_factor/#standards
Unfortunately, the description is not detailed enough for a beginner. I am afraid they skip some operations between the different tools.

As for paired reads trimming with trimmomatic, I found a comment on Biostars:
“…After trimming with trimmomatic, some reads can be discared, and their mate become “unpaired”. Usually, only a small fraction of the reads become unpaired, and they can be ignored. Alternatively, you can map them as single reads along with the remaining paired reads…”
https://www.biostars.org/p/287477/
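
To judge whether the unpaired reads are really "a small fraction", one can count the records in the paired and unpaired output files. The following is a rough sketch, assuming plain (uncompressed) FASTQ with exactly four lines per record; the function names are hypothetical and the tiny in-memory files stand in for real Trimmomatic outputs.

```python
import io
from typing import TextIO


def count_fastq_records(handle: TextIO) -> int:
    """Count records in a FASTQ stream, assuming 4 lines per record."""
    lines = sum(1 for _ in handle)
    if lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return lines // 4


def unpaired_fraction(pairs: int, forward_only: int, reverse_only: int) -> float:
    """Share of surviving fragments that kept only one of their two mates."""
    survivors = pairs + forward_only + reverse_only
    return (forward_only + reverse_only) / survivors if survivors else 0.0


# Tiny stand-ins for the forward paired and forward unpaired outputs.
record = "@read\nACGT\n+\nIIII\n"
forward_paired = io.StringIO(record * 9)
forward_unpaired = io.StringIO(record * 1)

pairs = count_fastq_records(forward_paired)
forward_only = count_fastq_records(forward_unpaired)
print(unpaired_fraction(pairs, forward_only, reverse_only=0))
```

With real data you would open the four files instead of the `io.StringIO` objects; the forward and reverse paired files should contain the same number of records.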

Indeed, when a BAM file is loaded as an Excel sheet, it lists all chromosomes.

eyoungman · June 17, 2019, 7:10pm

Hmm, I am getting notifications about this thread as if I asked the question, but this was asked by a different “E.” Should this happen?

1113 · June 17, 2019, 10:06pm

Well, @eyoungman do you still get notifications? My account was deleted with all my data

jennaj · June 18, 2019, 2:47pm

@eyoungman Check your account settings and update your preferences. You can adjust which threads to follow.

The forum is still a bit new, so let us know if you need a closer review.

jennaj · June 18, 2019, 2:48pm

@1113 Please see: Accounts deleted from Galaxy Main to enforce posted Terms and Conditions

1113 · June 18, 2019, 9:38pm

@jennaj, thank you for helping,

Q1. I feel this topic might become something like my blog on using Galaxy, i.e., not only about using Galaxy for ChIP-Seq data but also about general questions. Is such a deviation acceptable?

Q2. About data processing speed. How fast should data processing be on Galaxy? For example, trimming (Trimmomatic) 5 sets of paired-read data can take 4 or more hours. So if I do FastQC, filtering, FastQC, mapping, and peak calling, it could take me about 2 days (not to mention that tools sometimes fail). Is that how fast Galaxy should work? What language are the majority of the scripts written in? For example, if it is Python and I run the same Python scripts on a local computer, will my data be processed faster? What specifications would my local computer need to process data like mine (5-10 paired-read files of ~3.5 Gb) faster than Galaxy does?

Q3. About MACS2 callpeak for paired reads (Galaxy tutorial on ChIP-Seq data processing).
Do I understand correctly that if I run paired-read data through MACS2, I do not need to calculate the parameter d with MACS2 predictd?

Q4. I hope I can formulate this question: my data come from cell lines with multiple gene rearrangements. I have two controls (an input and IgG-ChIPed DNA). Is it correct to map the ChIPed DNA samples to a standard genome (e.g., hg38)? I am asking because I guess it might be difficult to map (align? I am not sure I understand the difference between map and align) the input to the normal human genome. For example, there are inter-chromosomal translocations, duplications, and deletions. Would it be more correct to use the input as the standard genome? Is it possible to create a custom genome from the input (40-50 million reads, one repeat)?

jennaj · June 19, 2019, 4:26pm

Q1: Feel free to post whatever you want as long as it is related to Galaxy and your experience using it. You might not get an answer, or replies may be delayed, if the question has already been asked/answered in easily found places online, including but not limited to this forum and the tool form help itself.

Important: If you find tutorials, other reputable public forum Q&A, publications, or tool manuals that cover the same topic or answer your question, but you need clarification on how to do the same thing in Galaxy (tool form settings, etc.), reference them in your question. It saves time and makes your questions clearer, so more people can help quickly. This is a community forum.

Q2: Some analysis will be slower when using a public Galaxy server. These are shared resources across communities, which means jobs will queue in many cases. It depends on how busy the server is.

Actual execution time for a job (once the job turns yellow) can vary widely based on your input size/content, the parameters used, and how that tool normally runs (outside of Galaxy).

In general, a tool run in Galaxy requires the same resources as when run on the command line. Most 3rd party tool manuals/publications, usually linked from the tool form near the bottom, will specify the minimum resources needed.

Tools are written in all sorts of programming languages. See the ToolShed for details about a particular tool. All tools are “wrapped” for Galaxy: underneath, some have tools written for Galaxy specifically and some have 3rd party authored tools. Galaxy | Tool Shed

A local Galaxy should have at least 16GB of RAM and enough disk storage space for your data. Many tools need more RAM – you’ll need to check the tool manuals to see what each requires. If you are running tools concurrently, or using memory-intensive tools or large data, customizing your local Galaxy into “production mode” or using a cloud Galaxy is often better. Galaxy Choices - Galaxy Community Hub && Galaxy Platform Directory: Servers, Clouds, and Deployable Resources - Galaxy Community Hub

Q3: True. Estimating fragment size is not needed for paired-end data. The model is not created/used when using paired-end inputs. From the tool form help:

For Paired-end BAM (BAMPE) the ‘Build model step’ will be ignored and the real fragments will be used for each template defined by leftmost and rightmost mapping positions. Default: Single-end BAM (--format)
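
To make "defined by leftmost and rightmost mapping positions" concrete, here is a small sketch of the idea. It is not MACS2's own code; the function name and the 0-based, end-exclusive coordinates are assumptions chosen for illustration.

```python
from typing import Tuple


def fragment_span(read1: Tuple[int, int], read2: Tuple[int, int]) -> Tuple[int, int, int]:
    """Return (start, end, length) of the fragment covered by a read pair.

    Each read is a (start, end) interval on the same chromosome,
    0-based with an exclusive end (an assumption for this sketch).
    """
    start = min(read1[0], read2[0])  # leftmost mapped position of the pair
    end = max(read1[1], read2[1])    # rightmost mapped position of the pair
    return start, end, end - start


# Forward mate covers 100-150, reverse mate covers 250-300.
print(fragment_span((100, 150), (250, 300)))  # (100, 300, 200)
```

Because each pair already gives an observed fragment length like this, there is nothing left to estimate, which is why the fragment-size model is skipped for paired-end input.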

Q4: Map = Align. Map all reads to the same reference genome, or you will have problems. NGS reads will not work as a custom genome. The mapping jobs will fail for exceeding computing resources and would create noisy data anyway even if somehow successful on some other server. My job ended with an error. What can I do? >> Job and Tool Error Help - Galaxy Community Hub

1113 · June 20, 2019, 11:44am

A comment on how fast Galaxy runs jobs:
I found a very useful reference (hopefully up to date) here:
https://galaxyproject.org/main/#user-data-and-job-quotas
For example, the average BWA-MEM running time is ~5 h, and no more than 6 jobs per account run at a time.
Potentially one can check computation requirements on Stampede, but their website looks scarily complicated for a novice.
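
Taking those two figures at face value, a back-of-the-envelope estimate for a batch of mapping jobs could look like the sketch below. It assumes every job takes the quoted average and ignores time spent waiting in the queue, so treat the result as a rough lower bound; the function name is hypothetical.

```python
import math


def batch_hours(n_jobs: int, hours_per_job: float, max_concurrent: int) -> float:
    """Rough wall-clock estimate: jobs run in waves of at most max_concurrent."""
    waves = math.ceil(n_jobs / max_concurrent)
    return waves * hours_per_job


# 10 mapping jobs, ~5 h each, at most 6 running at once: two waves.
print(batch_hours(10, 5, 6))  # 10
```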

RECOMMENDATION for Galaxy: I wish Galaxy had some indicator of how computationally demanding a job could be (and how much data it would produce, to estimate whether a user will go over their quota).

DATA and space issues:
It looks like I need more space.
Q1. I thought of temporarily removing datasets from my Galaxy (Main) account, but downloading takes ages. I upload data via FTP, and that works faster. Is there a way to download data from Galaxy faster?

Q2. Looking for solutions to my “space” problem, I read this:
https://galaxyproject.org/cloudman/
I guess running a cloud cluster is not something for users with 1 project only.

I am trying to understand how I can use Amazon for temporary data storage. I checked the calculator,
https://calculator.s3.amazonaws.com/index.html
but I cannot figure out what my inputs should be.
Does anybody know how much space (and what type of storage) I might need on Amazon, bearing in mind download/upload from Galaxy:
initial files (10) ~ 100 Gb;
trimmed reads I guess have similar size;
mapped - about 8-10 Gb each
I do not know the size of the peak files yet
I skip quality reports etc. here
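
As a rough starting point, the sizes listed above can simply be added up. This sketch reads "~100 Gb" as the total for all 10 initial files and assumes up to 10 mapped files; both readings are assumptions, and peak files and quality reports are left out because their sizes are unknown.

```python
def estimate_storage_gb(raw_total_gb: float, trimmed_total_gb: float,
                        bam_count: int, bam_gb_each: float) -> float:
    """Sum raw reads, trimmed reads, and mapped (BAM) files, in GB."""
    return raw_total_gb + trimmed_total_gb + bam_count * bam_gb_each


# Assumed inputs: 100 GB raw, ~100 GB trimmed, 10 BAM files of 8-10 GB each.
low = estimate_storage_gb(100, 100, bam_count=10, bam_gb_each=8)
high = estimate_storage_gb(100, 100, bam_count=10, bam_gb_each=10)
print(low, high)  # 280 300
```

If only one mapped file is produced per pair of read files, `bam_count` would be 5 and the total correspondingly smaller.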

jennaj · June 20, 2019, 7:43pm

This is almost impossible to predict, for reasons I already stated:

Try using wget or curl to download larger datasets. FAQ: Downloading Data
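
If you would rather script the download than type wget or curl commands, the same streaming approach can be sketched with Python's standard library. The dataset URL in the comment is a placeholder, not a real Galaxy address, and the function name is made up for this example.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, destination: Path, chunk_size: int = 1024 * 1024) -> int:
    """Stream a URL to disk in chunks, so large files never sit fully in memory.

    Returns the number of bytes written.
    """
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return destination.stat().st_size


# Hypothetical usage; replace the URL with the dataset link copied from your history:
# download("https://example.org/path/to/dataset", Path("sample.bam"))
```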

Actually, many scientists use Cloudman/AWS for all sorts of reasons, including personal use when their needs are more than the public server can supply. Much of it is pre-configured, resource allocation is flexible, and you’ll be the admin, so you can add (or remove) tools as you want. If costs are a concern, AWS offers grants for research work. Most of the configuration is web-based, except for some of the initial steps involved in setting up an AWS account. We have detailed FAQs to help with that part, as does AWS.

FAQs & links:

Galaxy Choices - Galaxy Community Hub
Galaxy Platform Directory: Servers, Clouds, and Deployable Resources - Galaxy Community Hub
This area is for teaching but includes advice about what types of resources to allocate – even if just using a cloud version of Galaxy for your own work: Galaxy Community Hub - Galaxy Community Hub
The broader Galaxy Ecosystem overview for reference: https://galaxyproject.github.io/

To manage your account quota space at any Galaxy server (in particular, public Galaxy servers), be aware that you need to permanently delete (purge) data to have it not count toward quota. Deleting is not enough. FAQ: Checking for active vs deleted vs permanently deleted (purged) datasets and histories

It seems like a lot of your questions have already come up or have existing FAQs/tutorials. The post I sent before (Troubleshooting resources for errors or unexpected results) has much of this type of help information consolidated in one place. You can also search prior Q&A and all other Galaxy resources directly: Galaxy Support - Galaxy Community Hub >> the first option points to here: Galaxy Community Hub - Galaxy Community Hub