The way microorganisms are named affects the final results of LEfSe

Hey, I encountered an issue: I was working on Galaxy and trying to run lefse but the output files were different when I changed the way microorganisms are named.

Here is everything that I know:

The dataset is green, the job did not fail.
The lef_usual.txt is an input file whose first column contains many microorganism names, and these names include special characters such as “()”, “-” and “'”, for example “Clostridium_sp._DL-VIII”.
The lef_usual_result.svg is the output file. There are 5 biomarkers for “meNorChild”.

Then, I replaced all the names of microorganisms in the first column in lef_usual.txt with another way of naming, such as s1, s2 and s3. The new input file was named lef_number.txt.
The lef_number_result.svg is the new output file. There are 11 biomarkers for “meNorChild”.

The relevant parameters of the two tasks are the same, only the naming method of the first column of the input file has changed.
So, why does just changing the way microorganisms are named affect the final results? What format of naming will give the most correct results?
The process screenshots, input files and output files have been uploaded.

Could you help me?

Hi @hlsw2001

I replied to your email, but for others reading, please follow this formatting advice for any tool / analysis if you get odd results.

Why? Tools can be picky about format. Cleaning up data to the most basic of formats, and making sure that all inputs exactly match when a common key (identifier) is involved, can eliminate technical problems so you can focus on scientific results and parameter tuning.

  1. Changing identifiers in the middle of an analysis could cause all sorts of odd problems with any tool.
  2. It would be best to adjust identifier names at the very start of processing for any analysis.
  3. Some tools prefer that labels include only alphanumeric characters A-Z a-z 0-9. Underscores _ are usually OK. Avoid a number 0-9 at the very start of an identifier. Avoid any whitespace (spaces or tabs).
  4. If you include other characters, especially a dot . or pipe |, any content after that character could be truncated, or the tool could behave in other unexpected ways. It is difficult to predict.
  5. Lefse in particular doesn’t handle some special characters well – whether used in Galaxy or directly. Example Q&A: Problem with LEfSe - LEfSe - The bioBakery help forum
  6. This is general advice – some tools do expect that special characters are used. The tool documentation will usually have examples and explanations.
  7. Cleaning up data is one of the most important steps for any technical processing, not just bioinformatics analysis. Try an internet search with the term “data cleaning”. It is a vast topic!
  8. FAQ

What to try for your use case:

original

Clostridium_sp._DL-VIII

cleaned up to a basic format (enough for Lefse)

Clostridium_sp_DL_VIII

cleaned up to the most basic of formats

ClostridiumspDLVIII
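
The two clean-up levels above can be scripted so that every identifier is treated the same way. This is only a minimal Python sketch, not part of Lefse or Galaxy: the function names `clean_identifier` and `clean_first_column` are made up for illustration, the input is assumed to be tab-delimited with the identifier in the first column, and the "f" prefix for names that start with a digit is an arbitrary choice. Note that it rewrites the first column of every row, including any metadata rows.

```python
import re


def clean_identifier(name: str, keep_underscores: bool = True) -> str:
    """Reduce a label to A-Z a-z 0-9, optionally keeping underscores."""
    # Replace each run of non-alphanumeric characters with one underscore,
    # then drop underscores left at either end.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
    if not keep_underscores:
        cleaned = cleaned.replace("_", "")
    # Avoid a digit at the very start of an identifier (arbitrary prefix).
    if cleaned and cleaned[0].isdigit():
        cleaned = ("f_" if keep_underscores else "f") + cleaned
    return cleaned


def clean_first_column(lines, keep_underscores: bool = True):
    """Clean the first tab-delimited field of each line.

    Raises ValueError if two different original names collapse to the
    same cleaned name, since that would silently merge features.
    """
    seen = {}  # cleaned name -> original name
    result = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        original = fields[0]
        cleaned = clean_identifier(original, keep_underscores)
        if cleaned in seen and seen[cleaned] != original:
            raise ValueError(
                f"{original!r} and {seen[cleaned]!r} both become {cleaned!r}"
            )
        seen[cleaned] = original
        fields[0] = cleaned
        result.append("\t".join(fields))
    return result


print(clean_identifier("Clostridium_sp._DL-VIII"))
# Clostridium_sp_DL_VIII
print(clean_identifier("Clostridium_sp._DL-VIII", keep_underscores=False))
# ClostridiumspDLVIII
```

The collision check matters because stripping characters can make two distinct names identical, which would be another source of changed results.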


Hope that helps :slight_smile:

Hi, thanks for your reply. I have replaced all the special characters with underscores, since you said that “Clostridium_sp_DL_VIII” is enough for Lefse. The result is uploaded.

To test it, I also cleaned up the data to the most basic of formats, like “ClostridiumspDLVIII”. The result is uploaded here.

I found there were still differences between these two results. It seems that changing all the special characters to underscores does not help. But a name cleaned up to the most basic format, like “ClostridiumspDLVIII”, is hard to read, and it is also difficult to search for “ClostridiumspDLVIII” on the Internet. So, can Galaxy solve this problem thoroughly? Can the Lefse analysis program be optimized?
Many thanks.


Ok, it does look like Lefse is actually interpreting the underscores, not just masking some characters to an underscore. The original tool authors would have the best advice, and they are the ones who would modify the tool. They host a slightly different (custom) version of Lefse at their own server. I’m not sure where you are running the analysis, but the usage there is known to be different from what is available at other public Galaxy servers.

Author’s public Galaxy server details, including the URL and help resources/forum: Huttenhower Lab - Galaxy Community Hub

I’ll modify the original reply. I also added a tag that points to prior Q&A involving the Huttenhower server. Most include that same contact information. The authors are the best people to clarify the usage details. If you want to post that back here, along with where you are running the tool (+ version), that would be helpful.
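
Until the authors clarify how names are parsed, one possible workaround for the readability concern is to run the tool on neutral identifiers such as s1, s2, s3 and keep a lookup table so reported biomarkers can be translated back to the full names. This is only a sketch in Python: `assign_short_ids` and `translate_back` are hypothetical helper names, and “Escherichia_coli” is just an illustrative second name. It does not say which naming scheme gives the correct Lefse result.

```python
def assign_short_ids(names, prefix="s"):
    """Map each distinct original name to s1, s2, ... in first-seen order."""
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = f"{prefix}{len(mapping) + 1}"
    return mapping


def translate_back(short_ids, mapping):
    """Turn short identifiers from the tool output back into original names."""
    reverse = {short: name for name, short in mapping.items()}
    return [reverse[short] for short in short_ids]


# Illustrative names only; in practice these come from the first column.
names = ["Clostridium_sp._DL-VIII", "Escherichia_coli"]
mapping = assign_short_ids(names)
print(mapping["Clostridium_sp._DL-VIII"])
# s1
print(translate_back(["s2", "s1"], mapping))
# ['Escherichia_coli', 'Clostridium_sp._DL-VIII']
```

Saving the mapping alongside the input file keeps the original, searchable names available for reporting.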

I looked at the Huttenhower Lab - Galaxy Community Hub page, but the forum doesn’t seem to be active. Do you have the author’s email address?
Thanks.

The forum has several Q&A threads from this week, and I see your post. You’ll probably need to be patient for an answer. The way microorganisms are named affect the final results of lefse - LEfSe - The bioBakery help forum

The homepage of their server has contact information for the lab. For Q&A, the forum is better.