Saturday, August 22, 2009

WebSphere Java process hangs and freezes

We recently had an issue where the WebSphere Java process hung on four servers at almost the same time. Three of the servers are nodes in a cluster and the fourth is standalone. Restarting the WebSphere AppServer fixed the issue, but it was puzzling that all the servers hung at once, including the one that is not part of the cluster. We investigated and found one thing the servers have in common: on each of them the WebSphere installation directory is NFS-mounted from a NAS (network-attached storage) device. We suspected that either the NFS mount or the NAS had had problems, since nothing else explained all the servers going down together. We checked /var/log/messages on the servers and found these NFS messages logged around the time the servers went down:

Aug 20 04:09:45 appserver01 kernel: nfs: server nasserver01 OK
Aug 20 04:10:51 appserver01 kernel: nfs: server nasserver01 not responding, still trying
Aug 20 04:10:51 appserver01 kernel: nfs: server nasserver01 not responding, still trying
Aug 20 04:10:53 appserver01 kernel: nfs: server nasserver01 OK
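
When the same check has to be repeated on several servers, pairing each "not responding" message with the following "OK" makes it easier to compare outage times across hosts. Here is a minimal Python sketch of that idea. It assumes the log lines look exactly like the ones above; the function name nfs_outages and the year argument are my own, and the year is needed only because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches kernel NFS client messages in the format shown above.
NFS_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"nfs: server (?P<server>\S+) (?P<state>not responding|OK)"
)


def nfs_outages(lines, year=2009):
    """Return (host, server, start, end, seconds) for each outage found."""
    pending = {}   # (host, server) -> time of the first "not responding"
    outages = []
    for line in lines:
        m = NFS_LINE.match(line)
        if not m:
            continue
        # syslog timestamps have no year, so the caller supplies one
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["server"])
        if m["state"] == "not responding":
            pending.setdefault(key, ts)   # keep the earliest, ignore repeats
        elif key in pending:
            start = pending.pop(key)
            outages.append((*key, start, ts, (ts - start).total_seconds()))
    return outages
```

Feeding it the lines of /var/log/messages from each server (for example via open("/var/log/messages")) gives one tuple per outage, so the start times on the four hosts can be lined up side by side.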

These messages point to NFS timeouts. Since there was no problem with the NAS device itself, the NFS client timing out was the most likely cause of the hang. We changed the mounts to use NFS version 3 over TCP, which is more reliable than UDP, and added some tuning parameters. Since remounting with the new parameters, the problem has not recurred. Here are the new settings for the NFS mount over TCP.

/etc/fstab:

nasserver01:/app/WebSphere /mnt/WebSphere nfs rw,noatime,hard,intr,tcp,nfsvers=3,retrans=5,rsize=8192,wsize=8192,timeo=14,addr=10.10.1.20 0 0
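
After remounting, it is worth confirming that the options really took effect on every server. The following Python sketch compares the options the kernel reports for a mount point against an expected list. It assumes the input is text in the /proc/mounts layout (device, mount point, type, options, ...); the function names are hypothetical. Note that the kernel may report synonyms such as proto=tcp and vers=3 rather than tcp and nfsvers=3, so the expected list should be written in whatever form your kernel prints.

```python
def parse_opts(opt_string):
    """Turn 'rw,hard,timeo=14' into {'rw': None, 'hard': None, 'timeo': '14'}."""
    opts = {}
    for item in opt_string.split(","):
        key, _, value = item.partition("=")
        opts[key] = value if value else None
    return opts


def missing_options(mounts_text, mount_point, expected):
    """Compare active mount options with the expected ones.

    Returns {option: (expected_value, actual_value)} for every mismatch,
    an empty dict if everything matches, or None if the mount point
    does not appear in mounts_text.
    """
    want = parse_opts(expected)
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            actual = parse_opts(fields[3])
            return {
                key: (value, actual.get(key, "<absent>"))
                for key, value in want.items()
                if key not in actual or actual[key] != value
            }
    return None
```

A typical call would be missing_options(open("/proc/mounts").read(), "/mnt/WebSphere", "hard,proto=tcp,vers=3,timeo=14,retrans=5"), where an empty result means the mount matches.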

If the problem persists after this tuning, nfsstat or tcpdump traces can be used to analyze it further.
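
One number worth watching is the share of client RPC calls that had to be retransmitted. As a rough sketch, and assuming a Linux client that exposes its RPC counters in /proc/net/rpc/nfs with a line of the form "rpc <calls> <retransmissions> <authrefresh>" (my understanding of where nfsstat gets its client RPC figures; verify on your system), the ratio can be computed like this. The function name is hypothetical.

```python
def rpc_retrans_ratio(stats_text):
    """Return (calls, retrans, ratio) from the 'rpc' line, or None if absent."""
    for line in stats_text.splitlines():
        fields = line.split()
        # Assumed layout: rpc <calls> <retransmissions> <authrefresh>
        if len(fields) >= 3 and fields[0] == "rpc":
            calls, retrans = int(fields[1]), int(fields[2])
            ratio = retrans / calls if calls else 0.0
            return calls, retrans, ratio
    return None
```

Calling it with open("/proc/net/rpc/nfs").read() before and after a busy period shows whether retransmissions are still climbing after the switch to TCP.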
