Hey, I encountered an issue: I was working on Galaxy and trying to run Lefse, but the output files were different when I changed the way the microorganisms are named.
Here is everything that I know:
The dataset is green, the job did not fail.
The lef_usual.txt is an input file whose first column contains many names of microorganisms, and these names have some special characters like “()”, “-” and “'”, for example “Clostridium_sp._DL-VIII”.
The lef_usual_result.svg is the output file. There are 5 biomarkers for “meNorChild”.
Then, I replaced all the names of microorganisms in the first column in lef_usual.txt with another way of naming, such as s1, s2 and s3. The new input file was named lef_number.txt.
The lef_number_result.svg is the new output file. There are 11 biomarkers for “meNorChild”.
I replied to your email, but for others reading, please follow this formatting advice for any tool / analysis if you get odd results.
Why? Tools can be picky about format. Cleaning up data to the most basic of formats, and making sure that all inputs exactly match when a common key (identifier) is involved, can eliminate technical problems so you can focus on scientific results and parameter tuning.
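As a rough illustration of checking that a common key matches exactly across inputs, here is a minimal Python sketch. The function name `compare_keys` and the example identifiers are hypothetical, not taken from any Galaxy tool; the idea is only to list identifiers that appear in one input but not the other.

```python
def compare_keys(keys_a, keys_b):
    """Return the identifiers that are present in only one of the two inputs."""
    set_a, set_b = set(keys_a), set(keys_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Hypothetical identifiers from two inputs that are supposed to share a key column.
abundance_ids = ["Clostridium_sp_DL_VIII", "Bacteroides_fragilis"]
metadata_ids = ["Clostridium_sp._DL-VIII", "Bacteroides_fragilis"]

only_in_abundance, only_in_metadata = compare_keys(abundance_ids, metadata_ids)
print(only_in_abundance)  # identifiers missing from the metadata
print(only_in_metadata)   # identifiers missing from the abundance table
```

If both lists come back empty, the keys match exactly. Anything listed is a candidate for a formatting mismatch rather than a scientific difference.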
Changing identifiers in the middle of an analysis could cause all sorts of odd problems with any tool.
It would be best to adjust identifier names at the very start of processing for any analysis.
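One way to do that, sketched here in Python under the assumption that short placeholder IDs such as s1, s2, s3 are acceptable to the tool, is to build a single lookup table up front and keep it, so the readable names can be restored in the final results. The helper `build_id_map` and the second organism name are hypothetical and only for illustration.

```python
def build_id_map(names, prefix="s"):
    """Map each distinct name to a short ID (s1, s2, ...) in first-seen order."""
    id_map = {}
    for name in names:
        if name not in id_map:
            id_map[name] = f"{prefix}{len(id_map) + 1}"
    return id_map


# Hypothetical first-column values; a repeated name gets the same ID.
names = ["Clostridium_sp._DL-VIII", "Bacteroides_fragilis", "Clostridium_sp._DL-VIII"]

id_map = build_id_map(names)
# Reverse lookup, used to translate short IDs back to the original names.
reverse_map = {short_id: name for name, short_id in id_map.items()}

print(id_map)
print(reverse_map["s1"])
```

Saving `id_map` alongside the data means every later step uses the same identifiers, and the original names stay searchable.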
Some tools prefer that labels include only alphanumeric characters (A-Z, a-z, 0-9). Underscores (_) are usually OK. Avoid using a number (0-9) at the very start of an identifier, and avoid including any whitespace (spaces or tabs).
If you include other characters, especially a dot (.) or pipe (|), any content after that character could be truncated, or the tool could behave in other unexpected ways. It is difficult to predict.
This is general advice – some tools do expect that special characters are used. The tool documentation will usually have examples and explanations.
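The general naming rules above could be sketched in Python roughly as follows. This is an assumption-laden example, not what Galaxy or any specific tool does internally: `sanitize_label`, `find_collisions`, and the `id_` prefix are hypothetical choices, and whether a given tool treats underscores as plain characters is tool-specific.

```python
import re


def sanitize_label(label):
    """Replace each run of non-alphanumeric characters with a single underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_")
    # Avoid a digit at the very start of an identifier (the prefix is an arbitrary choice).
    if cleaned and cleaned[0].isdigit():
        cleaned = "id_" + cleaned
    return cleaned


def find_collisions(labels):
    """Return cleaned labels that more than one original label maps to."""
    seen = {}
    for label in labels:
        seen.setdefault(sanitize_label(label), []).append(label)
    return {clean: originals for clean, originals in seen.items() if len(originals) > 1}


print(sanitize_label("Clostridium_sp._DL-VIII"))  # Clostridium_sp_DL_VIII
print(find_collisions(["A-B", "A.B", "C"]))       # {'A_B': ['A-B', 'A.B']}
```

The collision check matters because two different original names can collapse into the same cleaned label, which would silently merge or duplicate rows.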
Cleaning up data is one of the most important steps for any technical processing, not just bioinformatics analysis. Try an internet search with the term “data cleaning”. It is a vast topic!
Hi, thanks for your reply. I have replaced all the special characters with underscores, since you said that “Clostridium_sp_DL_VIII” is enough for Lefse. The result is uploaded.
I found there were still differences between these two results. It seems that changing all the special characters to underscores does not help. But cleaning up to the most basic format, like “ClostridiumspDLVIII”, is hard to read, and it is also difficult to search for “ClostridiumspDLVIII” on the Internet. So, can Galaxy solve this problem thoroughly? Can the Lefse analysis program be optimized?
Many thanks.
Ok, it does look like Lefse is actually interpreting the underscores, not just masking some characters to an underscore. The original tool authors would have the best advice, and they are the ones who would modify the tool. They host a slightly different (custom) version of Lefse on their own server. I’m not sure where you are running the analysis, but the usage is known to be different from what is available at other public Galaxy servers.
I’ll modify the original reply. I also added a tag that points to prior Q&A involving the Huttenhower server. Most include that same contact information. The authors are the best people to clarify the usage details. If you want to post their answer back here, along with where you are running the tool (+ version), that would be helpful.
I went to find out about Huttenhower Lab - Galaxy Community Hub but the forum doesn’t seem to be active. Do you have the email address of the author?
Thanks.