Giving workflow parameters to BioBlend (params)

api
usegalaxyorg
bioblend

#1

I am using the Python package BioBlend to programmatically interact with Galaxy and run a workflow (which I made in the GUI). I now want to run this workflow, but this time with parameters that are “set at runtime”.

I have a simple workflow which takes 2 input files: input-counts.txt and input-factors.txt. These files are given to edgeR and differential expression is performed. Originally, I had set the following parameters to fixed default values: “Minimum CPM” and “Minimum Samples”, both under the “Filter Low Counts” section in edgeR. I have now changed these to “set at runtime”. In BioBlend, my workflow now looks like this:

    {'inputs': {'0': {'uuid': '83376380-45b0-44a8-aaf5-a124b9737082',
   'value': '',
   'label': 'counts'},
  '1': {'uuid': 'f1c7d923-8d1e-44cb-b97d-c01dc7082d8c',
   'value': '',
   'label': 'factors'}},
 'name': 'rnaseq-postqc-starting-matrix-180119',
 'tags': [],
 'deleted': False,
 'latest_workflow_uuid': '9ad01a66-5b6d-4df9-9ad6-f138b4e75017',
 'annotation': 'This workflow is an attempt to start from the matrix count rather than the BAM files (thereby bypassing the need for featureCounts)\n',
 'url': '/api/workflows/3ee04d3ad6be7c54',
 'version': 4,
 'steps': {'0': {'tool_id': None,
   'tool_version': None,
   'id': 0,
   'input_steps': {},
   'tool_inputs': {},
   'type': 'data_input',
   'annotation': None},
  '1': {'tool_id': None,
   'tool_version': None,
   'id': 1,
   'input_steps': {},
   'tool_inputs': {},
   'type': 'data_input',
   'annotation': None},
  '2': {'tool_id': 'toolshed.g2.bx.psu.edu/repos/iuc/edger/edger/3.20.7.2',
   'tool_version': '3.20.7.2',
   'id': 2,
   'input_steps': {'input|counts': {'step_output': 'output', 'source_step': 0},
    'input|fact|finfo': {'step_output': 'output', 'source_step': 1}},
   'tool_inputs': {'adv': {'pAdjust': 'BH',
     'lrtOption': 'false',
     'lfc': '0.0',
     'robOption': 'true',
     'pVal': '0.05',
     'normalisationOption': 'TMM'},
    '__page__': None,
    'f': {'filt': {'cformat': {'cpmReq': {'__class__': 'RuntimeValue'},
       'format_select': 'cpm',
       'cpmSampleReq': {'__class__': 'RuntimeValue'},
       '__current_case__': 0},
      '__current_case__': 0,
      'filt_select': 'yes'}},
    '__rerun_remap_job_id__': None,
    'anno': {'annoOpt': 'no', '__current_case__': 1},
    'rep_contrast': [{'__index__': 0, 'contrast': 'OMD-WT'},
     {'__index__': 1, 'contrast': 'PRELP-WT'}],
    'input': {'counts': {'__class__': 'RuntimeValue'},
     '__current_case__': 1,
     'fact': {'ffile': 'yes',
      'finfo': {'__class__': 'RuntimeValue'},
      '__current_case__': 0},
     'format': 'matrix'},
    'out': {'rscript': 'true', 'normCounts': 'false', 'rdaOption': 'true'}},
   'type': 'tool',
   'annotation': None}},
 'published': False,
 'owner': 'm93',
 'model_class': 'StoredWorkflow',
 'id': '3ee04d3ad6be7c54'}

My question is: how can I run my workflow and set these parameters? I cannot find an example in the BioBlend documentation. My code is:

# When all parameters are already set (works), there is no need for "params"
gi.workflows.run_workflow(wf["id"], history_id=hi["id"], dataset_map=datamap)

# When I need to set parameters at runtime (like here) (not working)
tool_id = "toolshed.g2.bx.psu.edu/repos/iuc/edger/edger/3.20.7.2"
param_dict = dict()
param_dict["cpmReq"] = 10
param_dict["cpmSampleReq"] = 3
params = {tool_id: param_dict}

print(params)
{'toolshed.g2.bx.psu.edu/repos/iuc/edger/edger/3.20.7.2': {'cpmReq': 10, 'cpmSampleReq': 3}}

gi.workflows.run_workflow(wf["id"], history_id=hi["id"], dataset_map=datamap, params=params)

This command is unsuccessful. The job is sent to Galaxy, but in the GUI one of the datasets is empty and the other 2 show up with “An error occurred in this dataset: a real number is required”.

Any help would be deeply appreciated; I am quite lost as to how to build this params dictionary. Thanks.

UPDATE
After many attempts, I find that using something like this works:

runtime_params = dict()
runtime_params["2"] = {"f|filt|cformat": {"cpmReq": "10", "cpmSampleReq":"3"}}

gi.workflows.run_workflow(wf["id"], history_id=hi["id"], dataset_map=datamap, params=runtime_params)
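To spell out the shape that worked for me (the dataset ids below are placeholders; the datamap is the usual BioBlend mapping from input step id to a {'src', 'id'} dict):

```python
# Placeholder dataset ids; the real ones come from the history, e.g. via
# gi.histories.show_history(history_id, contents=True)
datamap = {
    "0": {"src": "hda", "id": "HDA_ID_OF_COUNTS"},   # the 'counts' input step
    "1": {"src": "hda", "id": "HDA_ID_OF_FACTORS"},  # the 'factors' input step
}

# params: step id (as a string) -> {"|"-joined path to the section: {param: value}}
runtime_params = {
    "2": {"f|filt|cformat": {"cpmReq": "10", "cpmSampleReq": "3"}},
}
```

With those in hand, the call is `gi.workflows.run_workflow(wf["id"], history_id=hi["id"], dataset_map=datamap, params=runtime_params)`.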

However, I would now like to set not 2 but 4 runtime parameters, and 2 of the 4 are contrasts I want to test in edgeR. After modifying my workflow, it looks like this (just showing the part with the runtime inputs):

'f': {'filt': {'cformat': {'cpmReq': {'__class__': 'RuntimeValue'},
       'format_select': 'cpm',
       'cpmSampleReq': {'__class__': 'RuntimeValue'},
       '__current_case__': 0},
      '__current_case__': 0,
      'filt_select': 'yes'}},
    '__rerun_remap_job_id__': None,
    'anno': {'annoOpt': 'no', '__current_case__': 1},
    'rep_contrast': [{'__index__': 0,
      'contrast': {'__class__': 'RuntimeValue'}},
     {'__index__': 1, 'contrast': {'__class__': 'RuntimeValue'}}],

Notice the last 3 lines, from “rep_contrast”. I am struggling to format my runtime_params to account for this. I have tried this, but it doesn’t work:

runtime_params["2"] = {"f|filt|cformat": {"cpmReq": "10", "cpmSampleReq":"3"}, "rep_contrast|0": {"contrast":"OMD-WT"}, "rep_contrast|1": {"contrast":"PRELP-WT"}}

I am still quite confused about how all these nested dictionaries work; how do I get past these “__index__” entries? Thanks.
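PS: for reference, comparing with the dump above, rep_contrast comes back from the API as a plain Python list of dicts, one per repeat, each tagged with its position via __index__ (not as extra pipe-joined keys), i.e. something like:

```python
# Shape of the repeated "rep_contrast" section as the workflows API returns it
rep_contrast = [
    {"__index__": 0, "contrast": {"__class__": "RuntimeValue"}},
    {"__index__": 1, "contrast": {"__class__": "RuntimeValue"}},
]

# Each repeat is addressed by list index, not by a "rep_contrast|0" style key
rep_contrast[0]["contrast"] = "OMD-WT"
rep_contrast[1]["contrast"] = "PRELP-WT"
```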


#2

Can you please post the workflow JSON of the complete step with the edgeR tool?


#3

How do I obtain the workflow JSON? The only format I can get when I go to the GUI and click on “Download” is a .ga file.


#4

I meant that one; it is in JSON format.


#5

Here it is. I can’t upload it as a file because this forum only seems to accept jpeg and png file types, not even .txt files.

{"uuid": "32142421-09ef-40d0-885e-6c479a3e0cfc", "tags": [], "format-version": "0.1", "name": "rnaseq-postqc-starting-matrix-180119", "version": 8, "steps": {"0": {"tool_id": null, "tool_version": null, "outputs": [], "workflow_outputs": [{"output_name": "output", "uuid": "2ae9e77d-1766-4b7e-910a-954e46191e27", "label": null}], "input_connections": {}, "tool_state": "{}", "id": 0, "uuid": "83376380-45b0-44a8-aaf5-a124b9737082", "errors": null, "name": "Input dataset", "label": "counts", "inputs": [], "position": {"top": 116.5, "left": 246.5}, "annotation": "", "content_id": null, "type": "data_input"}, "1": {"tool_id": null, "tool_version": null, "outputs": [], "workflow_outputs": [{"output_name": "output", "uuid": "9295aa50-602e-4749-aad6-566854e2eaed", "label": null}], "input_connections": {}, "tool_state": "{}", "id": 1, "uuid": "f1c7d923-8d1e-44cb-b97d-c01dc7082d8c", "errors": null, "name": "Input dataset", "label": "factors", "inputs": [], "position": {"top": 383.5, "left": 251.5}, "annotation": "", "content_id": null, "type": "data_input"}, "2": {"tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/edger/edger/3.20.7.2", "tool_version": "3.20.7.2", "outputs": [{"type": "input", "name": "outTables"}, {"type": "html", "name": "outReport"}, {"type": "txt", "name": "rscript"}], "workflow_outputs": [{"output_name": "rscript", "uuid": "c1a0be32-b99f-4876-9839-e09e58d7c36a", "label": null}, {"output_name": "outTables", "uuid": "a6ac515f-196c-4adf-9bd6-f8731c6f3a44", "label": null}, {"output_name": "outReport", "uuid": "e13adeb1-ee61-454d-a934-3a285edbc8ea", "label": null}], "input_connections": {"input|counts": {"output_name": "output", "id": 0}, "input|fact|finfo": {"output_name": "output", "id": 1}}, "tool_state": "{\"adv\": \"{\\\"lfc\\\": \\\"0.0\\\", \\\"lrtOption\\\": \\\"false\\\", \\\"normalisationOption\\\": \\\"TMM\\\", \\\"pAdjust\\\": \\\"BH\\\", \\\"pVal\\\": \\\"0.05\\\", \\\"robOption\\\": \\\"true\\\"}\", \"__page__\": null, \"f\": \"{\\\"filt\\\": 
{\\\"__current_case__\\\": 0, \\\"cformat\\\": {\\\"__current_case__\\\": 0, \\\"cpmReq\\\": {\\\"__class__\\\": \\\"RuntimeValue\\\"}, \\\"cpmSampleReq\\\": {\\\"__class__\\\": \\\"RuntimeValue\\\"}, \\\"format_select\\\": \\\"cpm\\\"}, \\\"filt_select\\\": \\\"yes\\\"}}\", \"__rerun_remap_job_id__\": null, \"anno\": \"{\\\"__current_case__\\\": 1, \\\"annoOpt\\\": \\\"no\\\"}\", \"rep_contrast\": \"[{\\\"__index__\\\": 0, \\\"contrast\\\": {\\\"__class__\\\": \\\"RuntimeValue\\\"}}, {\\\"__index__\\\": 1, \\\"contrast\\\": {\\\"__class__\\\": \\\"RuntimeValue\\\"}}]\", \"input\": \"{\\\"__current_case__\\\": 1, \\\"counts\\\": {\\\"__class__\\\": \\\"RuntimeValue\\\"}, \\\"fact\\\": {\\\"__current_case__\\\": 0, \\\"ffile\\\": \\\"yes\\\", \\\"finfo\\\": {\\\"__class__\\\": \\\"RuntimeValue\\\"}}, \\\"format\\\": \\\"matrix\\\"}\", \"out\": \"{\\\"normCounts\\\": \\\"false\\\", \\\"rdaOption\\\": \\\"true\\\", \\\"rscript\\\": \\\"true\\\"}\"}", "id": 2, "tool_shed_repository": {"owner": "iuc", "changeset_revision": "d79ed3ec25fe", "name": "edger", "tool_shed": "toolshed.g2.bx.psu.edu"}, "uuid": "2094692c-7183-4388-91d3-c19d3df8b4ea", "errors": null, "name": "edgeR", "post_job_actions": {}, "label": null, "inputs": [{"name": "input", "description": "runtime parameter for tool edgeR"}], "position": {"top": 200.5, "left": 496.5}, "annotation": "", "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/edger/edger/3.20.7.2", "type": "tool"}}, "annotation": "This is the same workflow as \"rnaseq-postqc-starting-matrix-051218\". The only difference is that the \"cpmReq\" and \"cpmSampleReq\" (i.e. the 2 gene filtering parameters of edgeR), as well as the contrasts (also in edgeR) \nneed to be set at Runtime.", "a_galaxy_workflow": "true"}

#6

This is true. For larger files, you can share a link. Gist is one good choice and preserves formatting: https://gist.github.com/


#7

@marten have you had the chance to look at my workflow by any chance? I am desperate to understand how to set multiple parameters at once and I am confused by these nested dictionary structures. Many thanks.


#8

You can find out the structure of the needed param dictionary and modify it with:

import copy
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance('galaxy_url', 'api_key')
wf_id = 'some_val'
wf_data = gi.workflows.show_workflow(wf_id)
step2 = wf_data['steps']['2']
step2_params = copy.deepcopy(step2['tool_inputs'])
step2_params[''][''] = ''  # modify your step params here
params = {'2': step2_params}
ret = gi.workflows.run_workflow(wf_id, dataset_map=datamap, params=params, history_id=hist_id)
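For example, applied to the edgeR step discussed in this thread (the tool_inputs dict below is a trimmed mock of just the runtime-value part of step 2 shown earlier, so this sketch runs standalone):

```python
import copy
import json

# Trimmed mock of step 2's 'tool_inputs' (only the RuntimeValue parts)
tool_inputs = {
    "f": {"filt": {"cformat": {"cpmReq": {"__class__": "RuntimeValue"},
                               "cpmSampleReq": {"__class__": "RuntimeValue"}}}},
    "rep_contrast": [{"__index__": 0, "contrast": {"__class__": "RuntimeValue"}},
                     {"__index__": 1, "contrast": {"__class__": "RuntimeValue"}}],
}

step2_params = copy.deepcopy(tool_inputs)
# Overwrite each RuntimeValue placeholder with a concrete value
step2_params["f"]["filt"]["cformat"]["cpmReq"] = "10"
step2_params["f"]["filt"]["cformat"]["cpmSampleReq"] = "3"
step2_params["rep_contrast"][0]["contrast"] = "OMD-WT"
step2_params["rep_contrast"][1]["contrast"] = "PRELP-WT"

# Pretty-print to double-check no placeholder is left
print(json.dumps(step2_params, indent=2))
```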

#9

Thanks nsoranzo. I tried following your instructions but I still get an error unfortunately.

My code to make the params is:

step2 = wf['steps']['2']
step2_params = copy.deepcopy(step2['tool_inputs'])
step2_params["f"]["filt"]["cformat"]["cpmReq"] = "10"
step2_params["f"]["filt"]["cformat"]["cpmSampleReq"] = "3"
step2_params["rep_contrast"][0]["contrast"] = "OMD-WT"
step2_params["rep_contrast"][1]["contrast"] = "PRELP-WT"
params = {'2': step2_params}

However, I still get an error:

tool error
An error occurred with this dataset:
expected string or buffer

I am confused, but I think the error has to do with the last 2 parameters I am setting (“OMD-WT” and “PRELP-WT”). Am I doing something wrong?

Thanks


#10

Not sure. Where do you see this error (web interface? Galaxy logs? tool standard error? BioBlend?)


#11

I see it in the web interface (see picture). The run_workflow() command technically works and outputs:

{'inputs': {},
 'update_time': '2019-01-31T12:57:09.897014',
 'uuid': 'b82d2c54-2557-11e9-90ed-005056ba55fb',
 'outputs': ['bbd44e69cb8906b5b2f13a93d3d0c653',
  '14c8edd62f074b56',
  'bbd44e69cb8906b50535d253e88003a1'],
 'history_id': 'a0562deccb82ea5d',
 'state': 'scheduled',
 'output_collections': {},
 'workflow_id': '293d97b5305f655e',
 'steps': [{'workflow_step_uuid': '2094692c-7183-4388-91d3-c19d3df8b4ea',
   'update_time': '2019-01-31T12:57:09.950920',
   'job_id': 'bbd44e69cb8906b5da2193de7e8f39ef',
   'order_index': 2,
   'workflow_step_label': '',
   'state': 'scheduled',
   'action': None,
   'model_class': 'WorkflowInvocationStep',
   'workflow_step_id': '33de5b1bd7179fad',
   'id': 'ada44a5b483e9eba'},
  {'workflow_step_uuid': 'f1c7d923-8d1e-44cb-b97d-c01dc7082d8c',
   'update_time': '2019-01-31T12:57:09.519068',
   'job_id': None,
   'order_index': 1,
   'workflow_step_label': 'factors',
   'state': 'scheduled',
   'action': None,
   'model_class': 'WorkflowInvocationStep',
   'workflow_step_id': '1b3bb61148c4fa4a',
   'id': 'a140f92f0d9ee939'},
  {'workflow_step_uuid': '83376380-45b0-44a8-aaf5-a124b9737082',
   'update_time': '2019-01-31T12:57:09.506708',
   'job_id': None,
   'order_index': 0,
   'workflow_step_label': 'counts',
   'state': 'scheduled',
   'action': None,
   'model_class': 'WorkflowInvocationStep',
  'workflow_step_id': 'ed2a73fa3492644d',
   'id': '0a147bfaf7320890'}],
 'model_class': 'WorkflowInvocation',
 'id': '1defcd0dfb58de44',
 'history': 'a0562deccb82ea5d'}

#12

Can you check with the i button if the parameters are filled correctly?
Also, is that on a server where you are an admin or a public one?


#13

Here is the information I get from the “i” page. The inputs look correct to me. This job previously ran well when I only set 2 parameters in params: “Minimum CPM” and “Minimum Samples”. I started getting errors when I also tried to set the contrasts for edgeR - which I just realised don’t appear in the “i” page.

No, I am just using the public https://usegalaxy.org/ server.


#14

Mmmh, can you try to replace:

step2_params["rep_contrast"][0]["contrast"] = "OMD-WT"
step2_params["rep_contrast"][1]["contrast"] = "PRELP-WT"

with:

step2_params["rep_contrast_0"]["contrast"] = "OMD-WT"
step2_params["rep_contrast_1"]["contrast"] = "PRELP-WT"

in your code and see if that works?


#15

No, it doesn’t work. For both commands, I get:

KeyError: 'rep_contrast_0' (or KeyError: 'rep_contrast_1').

Do you think it’s something to do with not accessing the “contrast” properly? When I do the following:

step2 = wf['steps']['2']
step2_params = copy.deepcopy(step2['tool_inputs'])
step2_params["f"]["filt"]["cformat"]["cpmReq"] = "10.0"
step2_params["f"]["filt"]["cformat"]["cpmSampleReq"] = "3"
step2_params["rep_contrast"][0]["contrast"] = "OMD-WT"
step2_params["rep_contrast"][1]["contrast"] = "PRELP-WT"
params = {"2": step2_params}

The resulting “params” looks like this (see below). It looks good to me: all 4 variables (cpmReq, cpmSampleReq, contrast 1 and contrast 2) look good. So I don’t know why Galaxy is giving an error about a string or buffer (I don’t even know what a buffer is).

{'2': {'adv': {'pAdjust': 'BH',
   'lrtOption': 'false',
   'lfc': '0.0',
   'robOption': 'true',
   'pVal': '0.05',
   'normalisationOption': 'TMM'},
  '__page__': None,
  'f': {'filt': {'cformat': {'cpmReq': '10.0',
     'format_select': 'cpm',
     'cpmSampleReq': '3',
     '__current_case__': 0},
    '__current_case__': 0,
    'filt_select': 'yes'}},
  '__rerun_remap_job_id__': None,
  'anno': {'annoOpt': 'no', '__current_case__': 1},
  'rep_contrast': [{'__index__': 0, 'contrast': 'OMD-WT'},
   {'__index__': 1, 'contrast': 'PRELP-WT'}],
  'input': {'counts': {'__class__': 'RuntimeValue'},
   '__current_case__': 1,
   'fact': {'ffile': 'yes',
    'finfo': {'__class__': 'RuntimeValue'},
    '__current_case__': 0},
   'format': 'matrix'},
  'out': {'rscript': 'true', 'normCounts': 'false', 'rdaOption': 'true'}}}

#16

Hi nsoranzo, I was thinking: is it possible that the error is coming from the edgeR script itself? I.e. the R script that Galaxy uses to run edgeR? It seems as if it can’t process the contrasts and somehow doesn’t see them as strings.

I have managed to download the script that Galaxy uses (through another analysis which worked - Galaxy only provides an output Rscript if the run is successful, which is a bit annoying). I am pasting it below (sorry it’s long).

When looking at:

  • lines 62-6: the sanitiseEquation() function
  • lines 153-175: the spec matrix and getopt commands
  • lines 300-303: the sanitisation using sanitiseEquation() of the input contrast

It seems to me that maybe the R script is not built to handle contrasts that have been given as RuntimeValues? I.e. that it expects you to set them manually at runtime? Not sure at all, just an idea.
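To check my understanding, the script’s handling of the contrast argument (quoted further down, around the lines I listed) boils down to splitting one comma-separated string and stripping whitespace around operators; in Python terms, roughly:

```python
import re

def sanitise_equation(equation: str) -> str:
    # Rough mirror of the R sanitiseEquation(): remove spaces around
    # + - / * and strip leading/trailing whitespace
    equation = re.sub(r" *([+\-/*]) *", r"\1", equation)
    return equation.strip()

# The -C option receives all contrasts as a single comma-separated string
contrast_data = "OMD-WT, PRELP-WT"
contrasts = [sanitise_equation(c) for c in contrast_data.split(",")]
```

So whatever params I pass would presumably have to end up as one string like that by the time the wrapper builds the command line.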

# This tool takes in a matrix of feature counts as well as gene annotations and
# outputs a table of top expressions as well as various plots for differential
# expression analysis
#
# ARGS: htmlPath", "R", 1, "character"      -Path to html file linking to other outputs
#       outPath", "o", 1, "character"       -Path to folder to write all output to
#       filesPath", "j", 2, "character"     -JSON list object if multiple files input
#       matrixPath", "m", 2, "character"    -Path to count matrix
#       factFile", "f", 2, "character"      -Path to factor information file
#       factInput", "i", 2, "character"     -String containing factors if manually input  
#       annoPath", "a", 2, "character"      -Path to input containing gene annotations
#       contrastData", "C", 1, "character"  -String containing contrasts of interest
#       cpmReq", "c", 2, "double"           -Float specifying cpm requirement
#       cntReq", "z", 2, "integer"          -Integer specifying minimum total count requirement
#       sampleReq", "s", 2, "integer"       -Integer specifying cpm requirement
#       normCounts", "x", 0, "logical"      -String specifying if normalised counts should be output 
#       rdaOpt", "r", 0, "logical"          -String specifying if RData should be output
#       lfcReq", "l", 1, "double"           -Float specifying the log-fold-change requirement   
#       pValReq", "p", 1, "double"          -Float specifying the p-value requirement
#       pAdjOpt", "d", 1, "character"       -String specifying the p-value adjustment method 
#       normOpt", "n", 1, "character"       -String specifying type of normalisation used 
#       robOpt", "b", 0, "logical"          -String specifying if robust options should be used 
#       lrtOpt", "t", 0, "logical"          -String specifying whether to perform LRT test instead 
#
# OUT: 
#       MDS Plot 
#       BCV Plot
#       QL Plot
#       MD Plot
#       Expression Table
#       HTML file linking to the ouputs
# Optional:
#       Normalised counts Table
#       RData file
#
# Author: Shian Su - registertonysu@gmail.com - Jan 2014
# Modified by: Maria Doyle - Oct 2017 (some code taken from the DESeq2 wrapper)

# Record starting time
timeStart <- as.character(Sys.time())

# setup R error handling to go to stderr
options( show.error.messages=F, error = function () { cat( geterrmessage(), file=stderr() ); q( "no", 1, F ) } )

# we need that to not crash galaxy with an UTF8 error on German LC settings.
loc <- Sys.setlocale("LC_MESSAGES", "en_US.UTF-8")

# Load all required libraries
library(methods, quietly=TRUE, warn.conflicts=FALSE)
library(statmod, quietly=TRUE, warn.conflicts=FALSE)
library(splines, quietly=TRUE, warn.conflicts=FALSE)
library(edgeR, quietly=TRUE, warn.conflicts=FALSE)
library(limma, quietly=TRUE, warn.conflicts=FALSE)
library(scales, quietly=TRUE, warn.conflicts=FALSE)
library(getopt, quietly=TRUE, warn.conflicts=FALSE)

################################################################################
### Function Delcaration
################################################################################
# Function to sanitise contrast equations so there are no whitespaces
# surrounding the arithmetic operators, leading or trailing whitespace
sanitiseEquation <- function(equation) {
    equation <- gsub(" *[+] *", "+", equation)
    equation <- gsub(" *[-] *", "-", equation)
    equation <- gsub(" *[/] *", "/", equation)
    equation <- gsub(" *[*] *", "*", equation)
    equation <- gsub("^\\s+|\\s+$", "", equation)
    return(equation)
}

# Function to sanitise group information
sanitiseGroups <- function(string) {
    string <- gsub(" *[,] *", ",", string)
    string <- gsub("^\\s+|\\s+$", "", string)
    return(string)
}

# Function to change periods to whitespace in a string
unmake.names <- function(string) {
    string <- gsub(".", " ", string, fixed=TRUE)
    return(string)
}

# Generate output folder and paths
makeOut <- function(filename) {
    return(paste0(opt$outPath, "/", filename))
}

# Generating design information
pasteListName <- function(string) {
    return(paste0("factors$", string))
}

# Create cata function: default path set, default seperator empty and appending
# true by default (Ripped straight from the cat function with altered argument
# defaults)
cata <- function(..., file=opt$htmlPath, sep="", fill=FALSE, labels=NULL, 
                                 append=TRUE) {
    if (is.character(file)) 
        if (file == "") 
            file <- stdout()
    else if (substring(file, 1L, 1L) == "|") {
        file <- pipe(substring(file, 2L), "w")
        on.exit(close(file))
    }
    else {
        file <- file(file, ifelse(append, "a", "w"))
        on.exit(close(file))
    }
    .Internal(cat(list(...), file, sep, fill, labels, append))
}

# Function to write code for html head and title
HtmlHead <- function(title) {
    cata("<head>\n")
    cata("<title>", title, "</title>\n")
    cata("</head>\n")
}

# Function to write code for html links
HtmlLink <- function(address, label=address) {
    cata("<a href=\"", address, "\" target=\"_blank\">", label, "</a><br />\n")
}

# Function to write code for html images
HtmlImage <- function(source, label=source, height=600, width=600) {
    cata("<img src=\"", source, "\" alt=\"", label, "\" height=\"", height)
    cata("\" width=\"", width, "\"/>\n")
}

# Function to write code for html list items
ListItem <- function(...) {
    cata("<li>", ..., "</li>\n")
}

TableItem <- function(...) {
    cata("<td>", ..., "</td>\n")
}

TableHeadItem <- function(...) {
    cata("<th>", ..., "</th>\n")
}

################################################################################
### Input Processing
################################################################################

# Collect arguments from command line
args <- commandArgs(trailingOnly=TRUE)

# Get options, using the spec as defined by the enclosed list.
# Read the options from the default: commandArgs(TRUE).
spec <- matrix(c(
    "htmlPath", "R", 1, "character",
    "outPath", "o", 1, "character",
    "filesPath", "j", 2, "character",
    "matrixPath", "m", 2, "character",
    "factFile", "f", 2, "character",
    "factInput", "i", 2, "character",
    "annoPath", "a", 2, "character",
    "contrastData", "C", 1, "character",
    "cpmReq", "c", 1, "double",
    "totReq", "y", 0, "logical",
    "cntReq", "z", 1, "integer",
    "sampleReq", "s", 1, "integer",
    "normCounts", "x", 0, "logical",
    "rdaOpt", "r", 0, "logical",
    "lfcReq", "l", 1, "double",
    "pValReq", "p", 1, "double",
    "pAdjOpt", "d", 1, "character",
    "normOpt", "n", 1, "character",
    "robOpt", "b", 0, "logical",
    "lrtOpt", "t", 0, "logical"),
    byrow=TRUE, ncol=4)
opt <- getopt(spec)


if (is.null(opt$matrixPath) & is.null(opt$filesPath)) {
    cat("A counts matrix (or a set of counts files) is required.\n")
    q(status=1)
}

if (is.null(opt$cpmReq)) {
    filtCPM <- FALSE
} else {
    filtCPM <- TRUE
}

if (is.null(opt$cntReq) || is.null(opt$sampleReq)) {
    filtSmpCount <- FALSE
} else {
    filtSmpCount <- TRUE
}

if (is.null(opt$totReq)) {
    filtTotCount <- FALSE
} else {
    filtTotCount <- TRUE
}

if (is.null(opt$lrtOpt)) {
    wantLRT <- FALSE
} else {
    wantLRT <- TRUE
}

if (is.null(opt$rdaOpt)) {
    wantRda <- FALSE
} else {
    wantRda <- TRUE   
}

if (is.null(opt$annoPath)) {
    haveAnno <- FALSE
} else {
    haveAnno <- TRUE
}

if (is.null(opt$normCounts)) {
    wantNorm <- FALSE
} else {   
    wantNorm <- TRUE
}

if (is.null(opt$robOpt)) {
    wantRobust <- FALSE
} else {
    wantRobust <- TRUE
}


if (!is.null(opt$filesPath)) {
    # Process the separate count files (adapted from DESeq2 wrapper)
    library("rjson")
    parser <- newJSONParser()
    parser$addData(opt$filesPath)
    factorList <- parser$getObject()
    factors <- sapply(factorList, function(x) x[[1]])
    filenamesIn <- unname(unlist(factorList[[1]][[2]]))
    sampleTable <- data.frame(sample=basename(filenamesIn),
                            filename=filenamesIn,
                            row.names=filenamesIn,
                            stringsAsFactors=FALSE)
    for (factor in factorList) {
        factorName <- factor[[1]]
        sampleTable[[factorName]] <- character(nrow(sampleTable))
        lvls <- sapply(factor[[2]], function(x) names(x))
        for (i in seq_along(factor[[2]])) {
            files <- factor[[2]][[i]][[1]]
            sampleTable[files,factorName] <- lvls[i]
        }
        sampleTable[[factorName]] <- factor(sampleTable[[factorName]], levels=lvls)
    }
    rownames(sampleTable) <- sampleTable$sample
    rem <- c("sample","filename")
    factors <- sampleTable[, !(names(sampleTable) %in% rem), drop=FALSE]
    
    #read in count files and create single table
    countfiles <- lapply(sampleTable$filename, function(x){read.delim(x, row.names=1)})
    counts <- do.call("cbind", countfiles)
    
} else {
    # Process the single count matrix
    counts <- read.table(opt$matrixPath, header=TRUE, sep="\t", stringsAsFactors=FALSE)
    row.names(counts) <- counts[, 1]
    counts <- counts[ , -1]
    countsRows <- nrow(counts)

    # Process factors
    if (is.null(opt$factInput)) {
            factorData <- read.table(opt$factFile, header=TRUE, sep="\t")
            factors <- factorData[, -1, drop=FALSE]
    }  else {
            factors <- unlist(strsplit(opt$factInput, "|", fixed=TRUE))
            factorData <- list()
            for (fact in factors) {
                newFact <- unlist(strsplit(fact, split="::"))
                factorData <- rbind(factorData, newFact)
            } # Factors have the form: FACT_NAME::LEVEL,LEVEL,LEVEL,LEVEL,... The first factor is the Primary Factor.

            # Set the row names to be the name of the factor and delete first row
            row.names(factorData) <- factorData[, 1]
            factorData <- factorData[, -1]
            factorData <- sapply(factorData, sanitiseGroups)
            factorData <- sapply(factorData, strsplit, split=",")
            factorData <- sapply(factorData, make.names)
            # Transform factor data into data frame of R factor objects
            factors <- data.frame(factorData)
    }
}

 # if annotation file provided
if (haveAnno) {
    geneanno <- read.table(opt$annoPath, header=TRUE, sep="\t", stringsAsFactors=FALSE)
}

#Create output directory
dir.create(opt$outPath, showWarnings=FALSE)

# Split up contrasts separated by comma into a vector then sanitise
contrastData <- unlist(strsplit(opt$contrastData, split=","))
contrastData <- sanitiseEquation(contrastData)
contrastData <- gsub(" ", ".", contrastData, fixed=TRUE)

bcvOutPdf <- makeOut("bcvplot.pdf")
bcvOutPng <- makeOut("bcvplot.png")
qlOutPdf <- makeOut("qlplot.pdf")
qlOutPng <- makeOut("qlplot.png")
mdsOutPdf <- character()   # Initialise character vector
mdsOutPng <- character()
for (i in 1:ncol(factors)) {
    mdsOutPdf[i] <- makeOut(paste0("mdsplot_", names(factors)[i], ".pdf"))
    mdsOutPng[i] <- makeOut(paste0("mdsplot_", names(factors)[i], ".png"))
}
mdOutPdf <- character()
mdOutPng <- character()
topOut <- character()
for (i in 1:length(contrastData)) {
    mdOutPdf[i] <- makeOut(paste0("mdplot_", contrastData[i], ".pdf"))
    mdOutPng[i] <- makeOut(paste0("mdplot_", contrastData[i], ".png"))
    topOut[i] <- makeOut(paste0("edgeR_", contrastData[i], ".tsv"))
}   # Save output paths for each contrast as vectors
normOut <- makeOut("edgeR_normcounts.tsv")
rdaOut <- makeOut("edgeR_analysis.RData")
sessionOut <- makeOut("session_info.txt")

# Initialise data for html links and images, data frame with columns Label and 
# Link
linkData <- data.frame(Label=character(), Link=character(), stringsAsFactors=FALSE)
imageData <- data.frame(Label=character(), Link=character(), stringsAsFactors=FALSE)

# Initialise vectors for storage of up/down/neutral regulated counts
upCount <- numeric()
downCount <- numeric()
flatCount <- numeric()

################################################################################
### Data Processing
################################################################################

# Extract counts and annotation data
data <- list()
data$counts <- counts
if (haveAnno) {
  # order annotation by genes in counts (assumes gene ids are in 1st column of geneanno)
  annoord <- geneanno[match(row.names(counts), geneanno[,1]), ]
  data$genes <- annoord
} else {
    data$genes <- data.frame(GeneID=row.names(counts))
}

# If filter crieteria set, filter out genes that do not have a required cpm/counts in a required number of
# samples. Default is no filtering
preFilterCount <- nrow(data$counts)

if (filtCPM || filtSmpCount || filtTotCount) {

    if (filtTotCount) {
        keep <- rowSums(data$counts) >= opt$cntReq
    } else if (filtSmpCount) {
        keep <- rowSums(data$counts >= opt$cntReq) >= opt$sampleReq
    } else if (filtCPM) {
        keep <- rowSums(cpm(data$counts) >= opt$cpmReq) >= opt$sampleReq
    }

    data$counts <- data$counts[keep, ]
    data$genes <- data$genes[keep, , drop=FALSE]
}

postFilterCount <- nrow(data$counts)
filteredCount <- preFilterCount-postFilterCount

# Creating naming data
samplenames <- colnames(data$counts)
sampleanno <- data.frame("sampleID"=samplenames, factors)


# Generating the DGEList object "data"
data$samples <- sampleanno
data$samples$lib.size <- colSums(data$counts)
data$samples$norm.factors <- 1
row.names(data$samples) <- colnames(data$counts)
data <- new("DGEList", data)

# Name rows of factors according to their sample
row.names(factors) <- names(data$counts)
factorList <- sapply(names(factors), pasteListName)

formula <- "~0" 
for (i in 1:length(factorList)) {
    formula <- paste(formula, factorList[i], sep="+")
}

formula <- formula(formula)
design <- model.matrix(formula)

for (i in 1:length(factorList)) {
    colnames(design) <- gsub(factorList[i], "", colnames(design), fixed=TRUE)
}

# Calculating normalising factor, estimating dispersion
data <- calcNormFactors(data, method=opt$normOpt)

if (wantRobust) {
    data <- estimateDisp(data, design=design, robust=TRUE)
} else {
    data <- estimateDisp(data, design=design)
}

# Generate contrasts information
contrasts <- makeContrasts(contrasts=contrastData, levels=design)

################################################################################
### Data Output
################################################################################

# Plot MDS
labels <- names(counts)

# MDS plot
png(mdsOutPng, width=600, height=600)
plotMDS(data, labels=labels, col=as.numeric(factors[, 1]), cex=0.8, main=paste("MDS Plot:", names(factors)[1]))
imgName <- paste0("MDS Plot_", names(factors)[1], ".png")
imgAddr <- paste0("mdsplot_", names(factors)[1], ".png")
imageData[1, ] <- c(imgName, imgAddr)
invisible(dev.off())

pdf(mdsOutPdf)
plotMDS(data, labels=labels, col=as.numeric(factors[, 1]), cex=0.8, main=paste("MDS Plot:", names(factors)[1]))
linkName <- paste0("MDS Plot_", names(factors)[1], ".pdf")
linkAddr <- paste0("mdsplot_", names(factors)[1], ".pdf")
linkData[1, ] <- c(linkName, linkAddr)
invisible(dev.off())

# If additional factors create additional MDS plots coloured by factor
if (ncol(factors) > 1) {
    for (i in 2:ncol(factors)) {
        png(mdsOutPng[i], width=600, height=600)
        plotMDS(data, labels=labels, col=as.numeric(factors[, i]), cex=0.8, main=paste("MDS Plot:", names(factors)[i]))
        imgName <- paste0("MDS Plot_", names(factors)[i], ".png")
        imgAddr <- paste0("mdsplot_", names(factors)[i], ".png")
        imageData <- rbind(imageData, c(imgName, imgAddr))
        invisible(dev.off())

        pdf(mdsOutPdf[i])
        plotMDS(data, labels=labels, col=as.numeric(factors[, i]), cex=0.8, main=paste("MDS Plot:", names(factors)[i]))
        linkName <- paste0("MDS Plot_", names(factors)[i], ".pdf")
        linkAddr <- paste0("mdsplot_", names(factors)[i], ".pdf")
        linkData <- rbind(linkData, c(linkName, linkAddr))
        invisible(dev.off())
    }
}

# BCV Plot
png(bcvOutPng, width=600, height=600)
plotBCV(data, main="BCV Plot")
imgName <- "BCV Plot"
imgAddr <- "bcvplot.png"
imageData <- rbind(imageData, c(imgName, imgAddr))
invisible(dev.off())

pdf(bcvOutPdf)
plotBCV(data, main="BCV Plot")
linkName <- "BCV Plot.pdf"
linkAddr <- "bcvplot.pdf"
linkData <- rbind(linkData, c(linkName, linkAddr))
invisible(dev.off())

# Generate fit
if (wantLRT) {
    
    fit <- glmFit(data, design)
    
} else {
    
    if (wantRobust) {
        fit <- glmQLFit(data, design, robust=TRUE)
    } else {
        fit <- glmQLFit(data, design)
    }

    # Plot QL dispersions
    png(qlOutPng, width=600, height=600)
    plotQLDisp(fit, main="QL Plot")
    imgName <- "QL Plot"
    imgAddr <- "qlplot.png"
    imageData <- rbind(imageData, c(imgName, imgAddr))
    invisible(dev.off())

    pdf(qlOutPdf)
    plotQLDisp(fit, main="QL Plot")
    linkName <- "QL Plot.pdf"
    linkAddr <- "qlplot.pdf"
    linkData <- rbind(linkData, c(linkName, linkAddr))
    invisible(dev.off())
}

# Save normalised counts (log2cpm)
if (wantNorm) {
    normalisedCounts <- cpm(data, normalized.lib.sizes=TRUE, log=TRUE)
    normalisedCounts <- data.frame(data$genes, normalisedCounts)
    write.table(normalisedCounts, file=normOut, row.names=FALSE, sep="\t", quote=FALSE)
    linkData <- rbind(linkData, c("edgeR_normcounts.tsv", "edgeR_normcounts.tsv"))
}


for (i in 1:length(contrastData)) {
    if (wantLRT) {
        res <- glmLRT(fit, contrast=contrasts[, i])
    } else {
        res <- glmQLFTest(fit, contrast=contrasts[, i])
    }

    status <- decideTestsDGE(res, adjust.method=opt$pAdjOpt, p.value=opt$pValReq,
                             lfc=opt$lfcReq)
    sumStatus <- summary(status)

    # Collect counts for differential expression
    upCount[i] <- sumStatus["Up", ]
    downCount[i] <- sumStatus["Down", ]
    flatCount[i] <- sumStatus["NotSig", ]
                                             
    # Write top expressions table
    top <- topTags(res, n=Inf, sort.by="PValue")
    write.table(top, file=topOut[i], row.names=FALSE, sep="\t", quote=FALSE)
    
    linkName <- paste0("edgeR_", contrastData[i], ".tsv")
    linkAddr <- paste0("edgeR_", contrastData[i], ".tsv")
    linkData <- rbind(linkData, c(linkName, linkAddr))
    
    # Plot MD (log ratios vs mean difference) using limma package
    pdf(mdOutPdf[i])
    limma::plotMD(res, status=status,
                  main=paste("MD Plot:", unmake.names(contrastData[i])),
                  hl.col=alpha(c("firebrick", "blue"), 0.4), values=c(1, -1),
                  xlab="Average Expression", ylab="logFC")
    
    abline(h=0, col="grey", lty=2)
    
    linkName <- paste0("MD Plot_", contrastData[i], ".pdf")
    linkAddr <- paste0("mdplot_", contrastData[i], ".pdf")
    linkData <- rbind(linkData, c(linkName, linkAddr))
    invisible(dev.off())
    
    png(mdOutPng[i], height=600, width=600)
    limma::plotMD(res, status=status,
                  main=paste("MD Plot:", unmake.names(contrastData[i])),
                  hl.col=alpha(c("firebrick", "blue"), 0.4), values=c(1, -1),
                  xlab="Average Expression", ylab="logFC")
    
    abline(h=0, col="grey", lty=2)
    
    imgName <- paste0("MD Plot_", contrastData[i], ".png")
    imgAddr <- paste0("mdplot_", contrastData[i], ".png")
    imageData <- rbind(imageData, c(imgName, imgAddr))
    invisible(dev.off())
}
sigDiff <- data.frame(Up=upCount, Flat=flatCount, Down=downCount)
row.names(sigDiff) <- contrastData

# Save relevant items as rda object
if (wantRda) {
    if (wantNorm) {
        save(counts, data, status, normalisedCounts, labels, factors, fit, res, top, contrasts, design,
             file=rdaOut, ascii=TRUE)
    } else {
        save(counts, data, status, labels, factors, fit, res, top, contrasts, design,
             file=rdaOut, ascii=TRUE)
    }
    linkData <- rbind(linkData, c("edgeR_analysis.RData", "edgeR_analysis.RData"))
}

# Record session info
writeLines(capture.output(sessionInfo()), sessionOut)
linkData <- rbind(linkData, c("Session Info", "session_info.txt"))

# Record ending time and calculate total run time
timeEnd <- as.character(Sys.time())
timeTaken <- capture.output(round(difftime(timeEnd, timeStart), digits=3))
timeTaken <- gsub("Time difference of ", "", timeTaken, fixed=TRUE)

################################################################################
### HTML Generation
################################################################################

# Clear file
cat("", file=opt$htmlPath)

cata("<html>\n")

cata("<body>\n")
cata("<h3>edgeR Analysis Output:</h3>\n")
cata("Links to PDF copies of plots are in 'Plots' section below.<br />\n")

for (i in 1:nrow(imageData)) {
    HtmlImage(imageData$Link[i], imageData$Label[i])
}

cata("<h4>Differential Expression Counts:</h4>\n")

cata("<table border=\"1\" cellpadding=\"4\">\n")
cata("<tr>\n")
TableItem()
for (i in colnames(sigDiff)) {
    TableHeadItem(i)
}
cata("</tr>\n")
for (i in 1:nrow(sigDiff)) {
    cata("<tr>\n")
    TableHeadItem(unmake.names(row.names(sigDiff)[i]))
    for (j in 1:ncol(sigDiff)) {
        TableItem(as.character(sigDiff[i, j]))
    }
    cata("</tr>\n")
}
cata("</table>")

cata("<h4>Plots:</h4>\n")
for (i in 1:nrow(linkData)) {
    if (grepl(".pdf", linkData$Link[i])) {
        HtmlLink(linkData$Link[i], linkData$Label[i])
    }
}

cata("<h4>Tables:</h4>\n")
for (i in 1:nrow(linkData)) {
    if (grepl(".tsv", linkData$Link[i])) {
        HtmlLink(linkData$Link[i], linkData$Label[i])
    }
}

if (wantRda) {
    cata("<h4>R Data Objects:</h4>\n")
    for (i in 1:nrow(linkData)) {
        if (grepl(".RData", linkData$Link[i])) {
            HtmlLink(linkData$Link[i], linkData$Label[i])
        }
    }
}

cata("<p>Alt-click links to download file.</p>\n")
cata("<p>Click floppy disc icon associated history item to download ")
cata("all files.</p>\n")
cata("<p>.tsv files can be viewed in Excel or any spreadsheet program.</p>\n")

cata("<h4>Additional Information</h4>\n")
cata("<ul>\n")

if (filtCPM || filtSmpCount || filtTotCount) {
    if (filtCPM) {
        tempStr <- paste("Genes without more than", opt$cpmReq,
                         "CPM in at least", opt$sampleReq, "samples are insignificant",
                         "and filtered out.")
    } else if (filtSmpCount) {
        tempStr <- paste("Genes without more than", opt$cntReq,
                         "counts in at least", opt$sampleReq, "samples are insignificant",
                         "and filtered out.")
    } else if (filtTotCount) {
        tempStr <- paste("Genes without more than", opt$cntReq,
                         "counts, after summing counts for all samples, are insignificant",
                         "and filtered out.")
    }

    ListItem(tempStr)
    filterProp <- round(filteredCount/preFilterCount*100, digits=2)
    tempStr <- paste0(filteredCount, " of ", preFilterCount, " (", filterProp,
                      "%) genes were filtered out for low expression.")
    ListItem(tempStr)
}
ListItem(opt$normOpt, " was the method used to normalise library sizes.")
if (wantLRT) {
    ListItem("The edgeR likelihood ratio test was used.")
} else {
    if (wantRobust) {
        ListItem("The edgeR quasi-likelihood test was used with robust settings (robust=TRUE with estimateDisp and glmQLFit).")
    } else {
            ListItem("The edgeR quasi-likelihood test was used.")
    }
}
if (opt$pAdjOpt!="none") {
    if (opt$pAdjOpt=="BH" || opt$pAdjOpt=="BY") {
        tempStr <- paste0("MD-Plot highlighted genes are significant at FDR ",
                          "of ", opt$pValReq, " and exhibit log2-fold-change of at ",
                          "least ", opt$lfcReq, ".")
        ListItem(tempStr)
    } else if (opt$pAdjOpt=="holm") {
        tempStr <- paste0("MD-Plot highlighted genes are significant at adjusted ",
                          "p-value of ", opt$pValReq, " by the Holm (1979) ",
                          "method, and exhibit log2-fold-change of at least ",
                          opt$lfcReq, ".")
        ListItem(tempStr)
    }
} else {
    tempStr <- paste0("MD-Plot highlighted genes are significant at p-value ",
                      "of ", opt$pValReq, " and exhibit log2-fold-change of at ",
                      "least ", opt$lfcReq, ".")
    ListItem(tempStr)
}
cata("</ul>\n")

cata("<h4>Summary of experimental data:</h4>\n")

cata("<p>*CHECK THAT SAMPLES ARE ASSOCIATED WITH CORRECT GROUP(S)*</p>\n")

cata("<table border=\"1\" cellpadding=\"3\">\n")
cata("<tr>\n")
TableHeadItem("SampleID")
TableHeadItem(names(factors)[1], " (Primary Factor)")

if (ncol(factors) > 1) {
    for (i in names(factors)[2:length(names(factors))]) {
        TableHeadItem(i)
    }
}
cata("</tr>\n")

for (i in 1:nrow(factors)) {
    cata("<tr>\n")
    TableHeadItem(row.names(factors)[i])
    for (j in 1:ncol(factors)) {
        TableItem(as.character(unmake.names(factors[i, j])))
    }
    cata("</tr>\n")
}
cata("</table>")

for (i in 1:nrow(linkData)) {
    if (grepl("session_info", linkData$Link[i])) {
        HtmlLink(linkData$Link[i], linkData$Label[i])
    }
}

cata("<table border=\"0\">\n")
cata("<tr>\n")
TableItem("Task started at:"); TableItem(timeStart)
cata("</tr>\n")
cata("<tr>\n")
TableItem("Task ended at:"); TableItem(timeEnd)
cata("</tr>\n")
cata("<tr>\n")
TableItem("Task run time:"); TableItem(timeTaken)
cata("<tr>\n")
cata("</table>\n")

cata("</body>\n")
cata("</html>")

#17

I may be wrong, but I don’t think it’s the edgeR script; I think it’s more likely a BioBlend/Galaxy issue. The planemo test that exercises multiple contrasts for the edgeR tool uses something similar to what you’ve got: "rep_contrast": "[{\"__index__\": 0, \"contrast\": \"Mut-WT\"}, {\"__index__\": 1, \"contrast\": \"WT-Mut\"}]"
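For anyone else hitting this: that planemo value is just a JSON-encoded list of repeat instances, so it can be built programmatically instead of hand-escaping quotes. A minimal Python sketch, using only the `rep_contrast` structure quoted above (the contrast names `Mut-WT`/`WT-Mut` come from that test):

```python
import json

# Build the repeat structure as plain Python objects.
contrasts = ["Mut-WT", "WT-Mut"]
rep_contrast = [{"__index__": i, "contrast": c} for i, c in enumerate(contrasts)]

# Galaxy's repeat parameter is a JSON-encoded string; json.dumps handles
# all the quote escaping that is done by hand in the planemo test file.
rep_contrast_json = json.dumps(rep_contrast)
print(rep_contrast_json)
```

This reproduces the escaped string from the test file exactly, which makes it easier to spot whether a mismatch comes from the payload or from how Galaxy interprets it.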


#18

@mblue9 thanks. You’re right, it’s probably BioBlend or Galaxy. I’ve tried various combinations of BioBlend code, but I’ve run out of ideas now.
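In case it helps others searching: BioBlend’s `WorkflowClient.invoke_workflow` accepts, besides `inputs`, a `params` dict keyed by step ID (or tool ID) for supplying non-dataset tool parameters, which is one way to set “at runtime” values like the “Filter Low Counts” options. A sketch under assumptions: the workflow and step IDs come from the dump at the top of this thread, the dataset IDs are placeholders, and the nested names `filt|cpmReq` and `filt|sampleReq` are illustrative guesses, not confirmed against the edgeR tool XML:

```python
# Build the invocation payload; only the call at the bottom needs bioblend.

workflow_id = "3ee04d3ad6be7c54"  # from the workflow dump above

# Datasets for the two data_input steps (the dataset IDs are placeholders).
inputs = {
    "0": {"id": "COUNTS_DATASET_ID", "src": "hda"},   # counts
    "1": {"id": "FACTORS_DATASET_ID", "src": "hda"},  # factors
}

# Runtime tool parameters, keyed by step ID ("2" is the edgeR step).
# "filt|cpmReq" / "filt|sampleReq" are hypothetical nested parameter
# paths for the "Filter Low Counts" section -- check the tool XML for
# the real names before relying on them.
params = {
    "2": {"filt|cpmReq": 1.0, "filt|sampleReq": 2},
}

def invoke(galaxy_url, api_key):
    # Imported here so the payload above can be inspected without bioblend.
    from bioblend.galaxy import GalaxyInstance
    gi = GalaxyInstance(url=galaxy_url, key=api_key)
    return gi.workflows.invoke_workflow(workflow_id, inputs=inputs, params=params)
```

If the values still do not reach the tool, comparing this payload against what the Galaxy UI submits (visible in the browser’s network tab when running the workflow by hand) is usually the quickest way to find the discrepancy.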