Friday, October 29, 2010

Oracle 11G : root.sh fails with "Failure at final check of Oracle CRS stack. 10"

I was setting up an Oracle 11G RAC on a two-node Linux cluster and ran into an issue while running root.sh on the second node of the cluster, as shown below:


/rdbms/crs/root.sh
Checking to see if Oracle CRS stack is already configured
/etc/oracle does not exist. Creating it now.

Setting the permissions on OCR backup directory
Setting up Network socket directories
Oracle Cluster Registry configuration upgraded successfully
clscfg: EXISTING configuration version 4 detected.
clscfg: version 4 is 11 Release 1.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node <nodenumber>: <nodename> <private interconnect name> <hostname>
node 1: devdb03b devdb03b-priv devdb03b
node 2: devdb03a devdb03a-priv devdb03a
clscfg: Arguments check out successfully.

NO KEYS WERE WRITTEN. Supply -force parameter to override.
-force is destructive and will destroy any previous cluster
configuration.
Oracle Cluster Registry for cluster has already been initialized
Startup will be queued to init within 30 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
Failure at final check of Oracle CRS stack.
10
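
At this point the overall state of the stack can be checked with crsctl; a quick check, assuming CRS_HOME points at the Clusterware home:

$CRS_HOME/bin/crsctl check crs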


After the error I manually verified that the basic setup was right; there were a couple of trivial issues which had escaped the cluvfy verification:

1. The private and virtual host names were commented out in the /etc/hosts file on one node (a typical layout is sketched after this list).
2. The time was not synchronized between the two nodes, which can cause node eviction (see the date output further down).
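
A minimal /etc/hosts layout for this two-node cluster, with placeholder addresses (and assuming -vip names for the virtual IPs), would look roughly like this on both nodes:

# Public
192.168.1.1    devdb03a
192.168.1.2    devdb03b
# Private interconnect
10.0.0.1       devdb03a-priv
10.0.0.2       devdb03b-priv
# Virtual IPs
192.168.1.11   devdb03a-vip
192.168.1.12   devdb03b-vip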

[oracle@devdb03b cssd]$ date
Thu Oct 28 10:49:38 GMT 2010
[oracle@devdb03a ~]$ date
Thu Oct 28 10:48:39 GMT 2010
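
The clocks were about a minute apart. A quick way to bring them in line is a one-shot ntpdate against a time server (the server name below is a placeholder), followed by enabling ntpd on both nodes as the proper long-term fix:

# as root on each node
ntpdate -u ntp.example.com
service ntpd start
chkconfig ntpd on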


After the required changes were made, the installation was cleaned up following the note below and then reinstalled.

Note: How to Clean Up After a Failed 10g or 11.1 Oracle Clusterware Installation [ID 239998.1]
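
From memory, the cleanup in that note boils down to roughly the steps below; the note itself is authoritative, and the paths assume an 11.1 Clusterware home:

# as root, on each node
$CRS_HOME/install/rootdelete.sh
# on the node the installer was run from only
$CRS_HOME/install/rootdeinstall.sh
# remove the socket and config directories left behind
rm -rf /etc/oracle /var/tmp/.oracle /tmp/.oracle
# also remove the init.crs/init.crsd/init.cssd/init.evmd entries from /etc/inittab and /etc/init.d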

This still didn't resolve the issue. After some more analysis of the trace dumps from ocssd, we could see that the network heartbeat was not coming through for some other reason, such as a port block or a firewall issue; checking /etc/services and iptables confirmed it.

[ CSSD]2010-10-28 10:55:58.709 [1098586432] >TRACE: clssnmReadDskHeartbeat: node 1, devdb03a, has a disk HB, but no network HB, DHB has rcfg 183724820, wrtcnt, 476, LATS 51264, lastSeqNo 476, timestamp 1288262691/387864

(PROTOCOL=tcp)(HOST=devdb03b-priv)(PORT=49895))
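
These messages come from the CSS daemon log, which for a 10.2/11.1 stack lives under the Clusterware home (the hostname directory varies per node):

grep "no network HB" $CRS_HOME/log/devdb03b/cssd/ocssd.log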

iptables was enabled and had many restrictions, so the following rules were added to iptables and the nodes were restarted (on one node the CRS restart was hanging forever).

On node devdb03a


ACCEPT all -- devdb03b anywhere

On node devdb03b


ACCEPT all -- devdb03a anywhere
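
The listings above are from iptables -L output. One way to add such rules (a sketch, run as root; -I inserts at the top of the chain so the ACCEPT lands before any existing REJECT/DROP rules):

# on devdb03a
iptables -I INPUT -s devdb03b -j ACCEPT
service iptables save

# on devdb03b
iptables -I INPUT -s devdb03a -j ACCEPT
service iptables save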

After this the CRS stack became healthy, but no resources were configured in it.

This was due to the root.sh failure on the second node. To fix it, vipca was run as the root user from the first node, everything fell into place quickly, and all the VIP, ONS, and GSD resources came up fine.
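
For reference, this is roughly what that looked like (CRS_HOME stands in for the Clusterware home; vipca is a GUI tool, so DISPLAY must point at a working X server):

# as root on the first node
export DISPLAY=:0
$CRS_HOME/bin/vipca

# then verify the resources
$CRS_HOME/bin/crs_stat -t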
