HISAT2 reporting an error on hg19 full genome. Local install

Hello,
I am getting an error when running HISAT2 with hg19 full specifically as the genome. When HISAT2 is run with hg19 canonical or hg38, the job completes without any issues. I re-ran HISAT2 on a previously completed job using hg19 full, and it now throws an error. The error seems to stem from the hg19 index on CVMFS. I am posting the error and the command used below:

Error reading _ebwt[] array: 14118, 15360
Error: Encountered internal HISAT2 exception (#1)
Command: /home/galaxyp_user/galaxyp/galaxy/database/dependencies/_conda/envs/mulled-v1-ad7c0e574419219598c842c5e534a388c10d33e19c5718205a00928715733608/bin/hisat2-align-s --wrapper basic-0 -p 1 -x /cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19 --pen-cansplice 0 --pen-noncansplice 12 --pen-canintronlen G,-8.0,1.0 --pen-noncanintronlen G,-8.0,1.0 --min-intronlen 20 --max-intronlen 500000 --dta --read-lengths 151 -1 /tmp/3452387.inpipe1 -2 /tmp/3452387.inpipe2 
(ERR): hisat2-align exited with value 1
samtools sort: failed to read header from "-"
[main_samview] fail to read the header from "-".

The command is as follows:

set -o pipefail;  ln -f -s '/home/galaxyp_user/galaxyp/galaxy/database/objects/d/2/b/dataset_d2b3484d-ef9f-4e56-9735-2d45a1493ed3.dat' input_f.fastq.gz &&  ln -f -s '/home/galaxyp_user/galaxyp/galaxy/database/objects/d/a/5/dataset_da5930b7-36d4-467e-afe4-76062554ab50.dat' input_r.fastq.gz &&     hisat2  -p ${GALAXY_SLOTS:-1}  -x '/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19'    -1 'input_f.fastq.gz' -2 'input_r.fastq.gz'                  --pen-cansplice 0 --pen-noncansplice 12 --pen-canintronlen G,-8.0,1.0 --pen-noncanintronlen G,-8.0,1.0  --min-intronlen 20 --max-intronlen 500000 --dta                 | samtools sort --no-PG -l 0 -T "${TMPDIR:-.}" -O bam | samtools view --no-PG -O bam -@ ${GALAXY_SLOTS:-1} -o '/home/galaxyp_user/galaxyp/galaxy/database/objects/e/a/c/dataset_eacad32c-e139-4a60-a286-a364d28d6d70.dat'

My research keeps pointing to a corrupt hg19 index file. When we test on the command line and point Galaxy and HISAT2 at local hg19 index files (downloaded from Index of /managed/hisat2_index/hg19/), no error is thrown, which suggests that the hg19 index on CVMFS is corrupt.
Since we are running a local Galaxy instance and use CVMFS, is there a way to unmount and remount CVMFS to correct this issue?
Any other suggestions that you may have to solve this?

Thanks

Hi @prao123

(correction was here)

The database key for this index is exactly hg19 in CVMFS (and at UseGalaxy.org). Are you assigning a different database? Some servers use hg19full. That can work, but the loc file needs to map hg19full (variant assembly) → hg19 (primary assembly).

If you decided to use CVMFS directly instead of downloading indexes, that would all match up. If you download and customize the location (.loc) tables and index paths locally, then maybe double check that your tables are set up correctly?

The HISAT2 index files are here if interested in that part, and the entire site can be navigated. http://datacache.galaxyproject.org/managed/hisat2_index/

Tutorials are here https://training.galaxyproject.org/training-material/search?query=cvmfs

Hello @jennaj
Thanks for your response. I wanted to clarify a few things I may have missed in my post:
Yes, the database key I have used is hg19 (shown below) in CVMFS, which results in an error (Error reading _ebwt[] array: 14118, 15360).

When I use another database key, for example hg19canon, also from CVMFS, HISAT2 does not give me an error. Similarly for hg38, HISAT2 does not produce an error. I want to state that HISAT2 was working well and producing the BAM files. But when I rerun it on the same data files that were processed successfully before, the error comes up, indicating that something is wrong with the hg19 index.

Do you have any recommendations for addressing the error that I get when using hg19 through CVMFS?

Thanks

So this part used to work, and is now failing? When did that start? Were any more indexes added in to your Galaxy during this time frame (local indexes)? If so, maybe a duplicated database/dbkey was introduced? (I’ve done that! It breaks the new and old index … the tables expect unique primary keys). When there are multiple tables per index, those are merged, and can create unexpected duplicates.
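
One way to look for a duplicated dbkey is to count keys across the HISAT2 .loc files that feed the data table. This is only a sketch under assumptions: it treats the .loc files as tab-separated with the dbkey in column 2 (check tool_data_table_conf.xml for the real column order on your server), and the function name find_duplicate_dbkeys is made up for illustration.

```shell
# Print every dbkey that appears more than once across the given .loc files.
# ASSUMPTION: tab-separated columns with the dbkey in column 2; change col=2 if
# your data table defines a different order.
find_duplicate_dbkeys() {
    awk -F '\t' -v col=2 '
        /^#/ || NF < col { next }    # skip comment lines and blank/short lines
        { count[$col]++; where[$col] = where[$col] " " FILENAME }
        END {
            for (key in count)
                if (count[key] > 1)
                    printf "%s\t%d\t%s\n", key, count[key], where[key]
        }
    ' "$@" | sort
}

# Example call (placeholder paths, adjust to your install):
#   find_duplicate_dbkeys /path/to/tool-data/hisat2_indexes.loc /path/to/other/hisat2_indexes.loc
```

Each output line lists the dbkey, how many rows use it, and the files those rows came from. No output means no duplicates were found in the files given.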

Just to be clear – this is at the mapping step, correct? Was that a fresh run or a “rerun” (using :arrows_counterclockwise: icon) of a prior job? Do both ways fail, or just one of those? Which?

Under the Admin tab the data tables will be presented. How many for HISAT2 do you have at your Galaxy?

The CVMFS resource hasn’t changed on our side. But comment back on those questions and we can dig more. It really should “just work” but there are ways to complicate it of course :slight_smile: You could disconnect and reconnect, and the server should be restarted after adding local indexes (ancillary to CVMFS), but I’m assuming you have done that already.

No, no new indexes were introduced.

Yes, this is a mapping step. I have tried both a fresh run on a new dataset (unsuccessful) and a rerun using an old dataset (unsuccessful now, but it worked before).

I have three (as shown in the picture below).

Yes, we did restart Galaxy. Please ignore my comment on local indexes (it is just complicating things). Nothing on the Galaxy configuration was changed to include any local indexes.

I am sorry, I forgot to mention that I first noticed this issue on June 12th.

Ok, thanks.

I’m going to ask in the Admin chat to see if anyone recognizes what is going on. They may reply here or there, and feel free to join the chat: You're invited to talk on Matrix

And please check one more thing and post that back here, since it does one more duplicate check:

  1. for any dataset, click on the pencil icon
  2. click into the “database” drop down menu
  3. type in hg19
  4. how many lines have the database hg19 listed? Exact, not the variations. The value is in the latter part of the genome identifier line: Genus Species Release (altkey/hg19)

Thanks. Will do that.

I followed those steps and I see two: Human Feb. 2009 (GRCh37/hg19) (hg19).

Thank you!

Ah, I see the duplicate. Do you? I also see this at UseGalaxy.org. It is probably coming in from the galaxyp_user HISAT2 loc table.

Guess: Maybe that isn’t a problem anymore in the most current release but still impacts older releases. What release of Galaxy are you running?

I’m also running a test in here https://usegalaxy.org/u/jen-galaxyproject/h/copy-of-test-data-human-mapping-rnaseq (updated link!!)

Hello @jennaj

I do see two instances of GRCh37/hg19 shown in the drop down menu. I am running Galaxy ver 22.05
What is the current version of Galaxy? Hope this is the issue that is causing the tool to fail.

Thanks

Releases: see "Releases" in the Galaxy Project 23.0.2.dev0 documentation.

Updating is usually the first step.

I restarted the test history above. It has a HISAT2 job running. If that is successful, then the CVMFS indexes are OK (we also use it).

Hi @jennaj

I just talked with my collaborator who is maintaining the server. I wanted to come back to this point about the duplicate hg19 databases in the “file upload” section. In HISAT2, when we select the reference genome, we do not see the same duplication of hg19. There is only one instance of hg19 (please see attached pic below):

Does this still create an issue?

We think that the CVMFS mounted on our system may have some issues.

Right, this is how the problem used to present. The hg19 selected is not the hg19 with the indexes, it is the other one. So, we hid some of the merged database items, and unhid them a little over a month ago. Which is why I was asking about the release you are running.

The test run at ORG in my history was successful. Maybe update your server? 23.0 has other important fixes, not just enhancements.

Hello @jennaj
I really appreciate the prompt responses from you. Updating the server is a task that may take some coordination at my workplace to complete due to the number of users. Would it be possible to fix the hg19 reference so that it points to the correct index, since this problem shows up with just one genome and not across the whole Galaxy system?

Thank you

Hi @prao123

You could try removing the .loc file for the index that is creating the duplicate. It seems to be the /home/galaxyp_user/… one, but you can check in the UI. It might be the /home/data.galaxyproject.org/byhand/… file instead (which is what we added back last month).

The idea here is to test: when hg19 is only listed once under the pencil icon, and the index for HISAT2 is still listed on the tool form drop-down, does that point to the right place. Some fiddling around is needed. Want to try that?

Hello @jennaj
We can try that. I had posted this picture before:


And I see that there are three HISAT2 indexes. And all of them have the address “/home/galaxyp_user/galaxyp/galaxy/tool-data”. Do you think I should remove the one that has /byhand/?

When I click on the pencil icon, the UI does not give any information about which file it is referencing: the “managed” or the “byhand” one.

Thanks

Just one has the path /home/.. and the other two are from /cvmfs/... Keeping only the latter is what I was suggesting as a first pass try.

You would only need /home/galaxyp_user/ if using proteomics tools, and maybe some plant genomes. The remainder would be in /cvmfs. But I can be proven wrong.

You can try removing that one instead (or, also).

The /byhand/ contains some indexes that were not able to be moved into the /managed/ area yet. Technical reasons, plus the whole system is pending an update later this year so a bit of risk management on our side to keep what is working for most people, still working.

I can’t remember all new/duplicated indexes involved in /byhand/ (even though I made most of it!), but do know that it involves hg19. We added it back to specifically make the 2bit/twoBit index that UCSC tools depend on available for other people running local servers, and at the US + AU public servers.

Interestingly … AU is running the same version of Galaxy you are, and I’m not aware of any problems using HISAT2 with hg19 at that server. This points back to /home/galaxyp_user/ being the potential problem! That’s why I suggested trying combinations to see what fixes this for you.

If none solve the problem, share back the combinations tried and what resulted for each. That might give more clues that will help avoid a server update. If not, at least we can confirm that an update is needed.

Great! Thanks for those details. We will give them a try!


Hello @jennaj
These are the things that we tried and the outputs
  1. We removed the HISAT2 index loc file under /home/….

Output: This did not solve our issue. The duplicate hg19 entries still appeared under the pencil icon, and HISAT2 still produced the same error. We cannot edit the CVMFS files since they are read-only.

  2. Next, we downloaded the hg19 index files from Index of /managed/hisat2_index/ that you had referenced. We then ran the HISAT2 align command directly on our server, first against the CVMFS index:
/home/galaxyp_user/galaxyp/galaxy/database/dependencies/_conda/pkgs/hisat2-2.2.1-h87f3376_4/bin/hisat2 -x '/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19' -1 'input_f.fastq.gz' -2 'input_r.fastq.gz'

Output: It produces the same error as in Galaxy:
Error reading _ebwt[] array: 14118, 15360
Error: Encountered internal HISAT2 exception (#1)

However, when we run the following command on the local server against the HISAT2 index files downloaded from Index of /managed/hisat2_index/, it does not produce an error:

/home/galaxyp_user/galaxyp/galaxy/database/dependencies/_conda/pkgs/hisat2-2.2.1-h87f3376_4/bin/hisat2 -x '/home/galaxyp_user/galaxyp/z_troubleshoot/tmp/hisat2_index/hg19/hg19' -1 'input_f.fastq.gz' -2 'input_r.fastq.gz'
  3. Next, we tried to copy each individual index file from /cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19 to our local server.
    Output: One file, “hg19.5.ht2”, could not be copied and produced an error. Our commands and their output are copied below for your reference:
cp -vr  /cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19 ./
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19' -> './hg19'
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.8.ht2' -> './hg19/hg19.8.ht2'
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.7.ht2' -> './hg19/hg19.7.ht2'
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.3.ht2' -> './hg19/hg19.3.ht2'
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.6.ht2' -> './hg19/hg19.6.ht2'
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.4.ht2' -> './hg19/hg19.4.ht2'
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.2.ht2' -> './hg19/hg19.2.ht2'
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.1.ht2' -> './hg19/hg19.1.ht2'
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.5.ht2' -> './hg19/hg19.5.ht2'
cp: error reading '/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.5.ht2': Input/output error
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.fa' -> './hg19/hg19.fa'

Repeating it only on one file:

cp -vr  /cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19//hg19.5.ht2 ./
'/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19//hg19.5.ht2' -> './hg19.5.ht2'
cp: error reading '/cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19//hg19.5.ht2': Input/output error

It seems that one specific file hg19.5.ht2 is giving issues on cvmfs and is the root cause of this error. Since it is mounted by cvmfs, we are not able to do anything else with it.
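
The copy test above can be turned into a read check that does not need any local disk space. This is a rough sketch, not an official CVMFS tool: it simply reads every regular file in a directory to the end and reports the ones that fail, on the assumption that an unreadable file shows up as a read error the same way it did with cp. The function name check_index_files is made up for illustration.

```shell
# Read every regular file in a directory from start to end and report failures.
# A file that triggers a read error (such as the Input/output error seen with cp)
# is listed as FAIL; everything else is listed as OK.
check_index_files() {
    local dir=$1 status=0 f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue              # skip subdirectories and non-files
        if cat -- "$f" > /dev/null 2>&1; then
            printf 'OK\t%s\n' "$f"
        else
            printf 'FAIL\t%s\n' "$f"
            status=1
        fi
    done
    return "$status"                          # non-zero if any file failed
}

# Example call, using the index directory from this thread:
#   check_index_files /cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19
```

Running the same check on other index directories would show whether more files are affected.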

  4. We also ran the command to restart the autofs service:
>> systemctl restart autofs

Output: This did not solve the error.

Hope this helps towards pointing to the potential root of the issue that we are facing.

Thank you!


Ah, super helpful! Thank you.

Not sure exactly why this is going on, but we should definitely be able to fix it. I’m sorting things out in the chat here: You're invited to talk on Matrix. Resolution will be sometime next week, and I’ll post back any tickets + updates.

Whew! :grimacing:


Details:

  • Corrupted file: /cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19/hg19.5.ht2
  • Correct (source) file: /managed/hg19/hg19/hisat2_index/hg19/hg19.5.ht2
  • Solution: move the source to cvmfs again, retest, check for others with similar problem
  • Who: Jen
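
For the "retest, check for others with similar problem" part, one possible approach is to compare checksums between a known-good copy of an index directory and the copy served from CVMFS. The sketch below assumes GNU coreutils sha256sum is available and that both directories use the same file names. The function name compare_index_copies is hypothetical.

```shell
# Compare each regular file in a reference directory with the same-named file in
# a second directory, using SHA-256 checksums.
# ASSUMPTION: sha256sum (GNU coreutils) is installed.
compare_index_copies() {
    local src=$1 dst=$2 status=0 f name a b
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        # An unreadable or missing file yields an empty checksum.
        a=$(sha256sum -- "$f" 2>/dev/null | cut -d ' ' -f1)
        b=$(sha256sum -- "$dst/$name" 2>/dev/null | cut -d ' ' -f1)
        if [ -n "$a" ] && [ "$a" = "$b" ]; then
            printf 'MATCH\t%s\n' "$name"
        else
            printf 'DIFFER\t%s\n' "$name"
            status=1
        fi
    done
    return "$status"                          # non-zero if any file differs
}

# Example call, with the local download as reference and CVMFS as the copy under test:
#   compare_index_copies /home/galaxyp_user/galaxyp/z_troubleshoot/tmp/hisat2_index/hg19 \
#       /cvmfs/data.galaxyproject.org/managed/hisat2_index/hg19
```

A file that is missing or cannot be read on either side is reported as DIFFER, so a read error like the one on hg19.5.ht2 would be flagged as well.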