Friday, September 9, 2011

Oracle Clusterware restore

ASM and Clusterware are interdependent. ASM cannot start without Clusterware running first, yet Clusterware can store its voting disks and OCR in ASM.
This is a fairly complex integration. I was in a scenario where I lost all my disks in ASM. When I say lost, we were trying a proof of concept. :)

When I tried to start ASM standalone with a pfile, it complained that it couldn't communicate with CSS.
And I couldn't start CRS without the voting disks and OCR, which I had now lost.
I was still able to see all the LUNs on the server under /dev/rdsk.

Below is how to recover.


Start Clusterware in exclusive mode with the -nocrs option, which brings up the stack (including ASM) without starting the CRS daemon. Run this on only one node of the cluster.


oracle@NODEA:/var/opt/oracle sudo crsctl start crs -excl -nocrs
Password:
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'NODEA'
CRS-2676: Start of 'ora.mdnsd' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'NODEA'
CRS-2676: Start of 'ora.gpnpd' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'NODEA'
CRS-2672: Attempting to start 'ora.gipcd' on 'NODEA'
CRS-2676: Start of 'ora.cssdmonitor' on 'NODEA' succeeded
CRS-2676: Start of 'ora.gipcd' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'NODEA'
CRS-2672: Attempting to start 'ora.diskmon' on 'NODEA'
CRS-2676: Start of 'ora.diskmon' on 'NODEA' succeeded
CRS-2676: Start of 'ora.cssd' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'NODEA'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'NODEA'
CRS-2672: Attempting to start 'ora.ctssd' on 'NODEA'
CRS-2676: Start of 'ora.ctssd' on 'NODEA' succeeded
CRS-2676: Start of 'ora.drivers.acfs' on 'NODEA' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'NODEA'
CRS-2676: Start of 'ora.asm' on 'NODEA' succeeded
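
With this much output it is easy to miss a resource that was attempted but never came up. Below is a minimal sketch, assuming the output has been captured as text, that pairs each CRS-2672 "Attempting to start" line with its CRS-2676 "succeeded" line. The function name unconfirmed_starts is my own, not part of any Oracle tool.

```python
import re

# Message formats taken from the crsctl transcript above.
ATTEMPT = re.compile(r"CRS-2672: Attempting to start '([^']+)' on '([^']+)'")
SUCCESS = re.compile(r"CRS-2676: Start of '([^']+)' on '([^']+)' succeeded")


def unconfirmed_starts(output):
    """Return resources that were attempted but never reported as started."""
    attempted = []
    succeeded = set()
    for line in output.splitlines():
        match = ATTEMPT.search(line)
        if match:
            attempted.append(match.group(1))
            continue
        match = SUCCESS.search(line)
        if match:
            succeeded.add(match.group(1))
    # Keep the original attempt order so the report reads like the log.
    return [res for res in attempted if res not in succeeded]


sample = """\
CRS-2672: Attempting to start 'ora.cssd' on 'NODEA'
CRS-2672: Attempting to start 'ora.diskmon' on 'NODEA'
CRS-2676: Start of 'ora.diskmon' on 'NODEA' succeeded
CRS-2676: Start of 'ora.cssd' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'NODEA'
"""
print(unconfirmed_starts(sample))  # prints ['ora.asm']
```

An empty list means every attempted start was confirmed, which is what the full transcript above shows.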



Re-create all the diskgroups.


SQL> CREATE DISKGROUP ORCLCLU HIGH REDUNDANCY
2 FAILGROUP fg1 DISK '/dev/rdsk/c6t001738000CE8002Ad0s0'
3 FAILGROUP fg2 DISK '/dev/rdsk/c6t001738000CE8002Bd0s0'
4 FAILGROUP fg3 DISK '/dev/rdsk/c6t001738000CE8002Cd0s0'
5 FAILGROUP fg4 DISK '/dev/rdsk/c6t001738000CE8002Dd0s0'
6 QUORUM FAILGROUP fg5 DISK '/dev/rdsk/c6t001738000CE8002Ed0s0'
7 ATTRIBUTE 'compatible.asm' = '11.2.0.0.0';

Diskgroup created.

create diskgroup DATA_FAST external redundancy disk '/dev/rdsk/c6t001738000CE80026d0s0' attribute 'COMPATIBLE.ASM' = '11.2';

create diskgroup ARCHFLASH external redundancy disk '/dev/rdsk/c6t001738000CE80029d0s0' attribute 'COMPATIBLE.ASM' = '11.2';

create diskgroup DATA_RW external redundancy disk '/dev/rdsk/c6t001738000CE80028d0s0' attribute 'COMPATIBLE.ASM' = '11.2';
............




Query V$ASM_DISK and V$ASM_DISKGROUP to check that all disks and disk groups are available.
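
As a rough illustration of that check, here is a small sketch. It assumes you have already run the disk group query in the ASM instance with whatever client you like and have the rows as (name, state) pairs. The query strings use standard columns of these views; the helper name missing_or_unmounted and the choice of columns are mine.

```python
# Queries to run in the ASM instance (for example from sqlplus / as sysasm).
DISKGROUP_QUERY = "SELECT name, state FROM v$asm_diskgroup"
DISK_QUERY = "SELECT group_number, name, path, mount_status, header_status FROM v$asm_disk"


def missing_or_unmounted(expected, diskgroup_rows):
    """Compare expected disk group names with (name, state) rows.

    Returns a dict mapping each problem disk group to a short reason.
    """
    states = {name.upper(): state.upper() for name, state in diskgroup_rows}
    problems = {}
    for name in expected:
        state = states.get(name.upper())
        if state is None:
            problems[name] = "not found"
        elif state != "MOUNTED":
            problems[name] = "state is " + state
    return problems


# Hypothetical rows, as they might come back from DISKGROUP_QUERY.
rows = [("ORCLCLU", "MOUNTED"), ("DATA_FAST", "MOUNTED"),
        ("ARCHFLASH", "MOUNTED"), ("DATA_RW", "MOUNTED")]
print(missing_or_unmounted(["ORCLCLU", "DATA_FAST", "ARCHFLASH", "DATA_RW"], rows))  # prints {}
```

An empty result means every disk group you re-created is visible and mounted in this ASM instance.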

Time to restore OCR and voting disks.
OCR is first.
Find out where the OCR backups are located.


oracle@NODEA:/var/opt/oracle sudo ocrconfig -showbackup
Password:
PROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy

NODEA 2011/09/05 17:36:47 /apps/product/11gr2/grid/cdata/test-clu/backup00.ocr

NODEA 2011/09/05 13:36:46 /apps/product/11gr2/grid/cdata/test-clu/backup01.ocr

NODEA 2011/09/05 09:36:44 /apps/product/11gr2/grid/cdata/test-clu/backup02.ocr

NODEA 2011/09/04 01:36:33 /apps/product/11gr2/grid/cdata/test-clu/day.ocr

NODEA 2011/08/23 01:08:30 /apps/product/11gr2/grid/cdata/test-clu/week.ocr
PROT-25: Manual backups for the Oracle Cluster Registry are not available

oracle@NODEA:/var/opt/oracle sudo ocrconfig -restore /apps/product/11gr2/grid/cdata/test-clu/backup00.ocr
oracle@NODEA:/var/opt/oracle
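
I picked backup00.ocr by eye because it was the newest automatic backup. If you want to script that choice, the sketch below parses output in the format shown above (node, date, time, path) and returns the most recent entry. The exact format may differ between versions, so treat the parsing as an assumption; latest_ocr_backup is a made-up helper name.

```python
from datetime import datetime


def latest_ocr_backup(showbackup_output):
    """Return (timestamp, path) of the newest backup listed, or None."""
    backups = []
    for line in showbackup_output.splitlines():
        parts = line.split()
        # Backup lines look like: NODE YYYY/MM/DD HH:MM:SS /path/to/file.ocr
        if len(parts) != 4 or not parts[3].endswith(".ocr"):
            continue  # skips blank lines and PROT-nn messages
        try:
            stamp = datetime.strptime(parts[1] + " " + parts[2], "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue
        backups.append((stamp, parts[3]))
    return max(backups) if backups else None


sample = """\
PROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy
NODEA 2011/09/05 17:36:47 /apps/product/11gr2/grid/cdata/test-clu/backup00.ocr
NODEA 2011/09/05 13:36:46 /apps/product/11gr2/grid/cdata/test-clu/backup01.ocr
NODEA 2011/09/04 01:36:33 /apps/product/11gr2/grid/cdata/test-clu/day.ocr
NODEA 2011/08/23 01:08:30 /apps/product/11gr2/grid/cdata/test-clu/week.ocr
PROT-25: Manual backups for the Oracle Cluster Registry are not available
"""
stamp, path = latest_ocr_backup(sample)
print(path)  # prints /apps/product/11gr2/grid/cdata/test-clu/backup00.ocr
```

The newest backup is not always the right one to restore. If the configuration was already damaged when it was taken, an older file such as day.ocr or week.ocr may be the better choice.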



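Now the voting disks. In 11g Release 2, voting disk data is backed up automatically as part of the OCR, so there is no separate backup to restore. Replacing the voting disks into the re-created disk group is enough.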


oracle@NODEA:/var/opt/oracle sudo crsctl replace votedisk +ORCLCLU
Successful addition of voting disk 142da6336d704f7fbf0fbebd482506d2.
Successful addition of voting disk 2477e787c9dd4f72bfa50bc3e88fc8d1.
Successful addition of voting disk 72c3a35d987a4f2ebf2e27c97ee69946.
Successful addition of voting disk 6c3a9fd3b0b84f06bfed66c9b9604f1b.
Successful addition of voting disk 56db0703046b4f22bf614ec7ab1bd716.
Successfully replaced voting disk group with +ORCLCLU.
CRS-4266: Voting file(s) successfully replaced



Create a SPFILE for ASM from its local PFILE. Place it on the shared storage.


sqlplus / as sysasm

create spfile='+DATA_RW' from pfile;



The Clusterware restore is done. Time to stop Clusterware, which is still running in exclusive mode, and bring it up normally for all nodes.

Stop Clusterware. The -f option forces the stop.


oracle@NODEA:/apps/product/11gr2/grid/dbs sudo crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'NODEA'
CRS-2673: Attempting to stop 'ora.ctssd' on 'NODEA'
CRS-2673: Attempting to stop 'ora.asm' on 'NODEA'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'NODEA'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'NODEA'
CRS-2677: Stop of 'ora.asm' on 'NODEA' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'NODEA'
CRS-2677: Stop of 'ora.drivers.acfs' on 'NODEA' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'NODEA' succeeded
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'NODEA' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'NODEA' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'NODEA'
CRS-2677: Stop of 'ora.cssd' on 'NODEA' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'NODEA'
CRS-2677: Stop of 'ora.gipcd' on 'NODEA' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'NODEA'
CRS-2677: Stop of 'ora.gpnpd' on 'NODEA' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'NODEA' has completed
CRS-4133: Oracle High Availability Services has been stopped.



Start the clusterware.


oracle@NODEA:/apps/product/11gr2/grid/dbs sudo crsctl start crs
CRS-4123: Oracle High Availability Services has been started.



Finally, check the status of the cluster on all nodes.


oracle@NODEB:/apps/product/11gr2/grid/log/NODEB/evmd sudo crsctl check cluster -all
Password:
**************************************************************
NODEA:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
NODEB:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
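
On a two-node cluster this is easy to read, but with more nodes a quick check helps. The sketch below assumes the output layout shown above (a "NODE:" header followed by CRS-nnnn lines) and reports any node that is missing one of the three "online" messages. The function name and the REQUIRED set are my own choices for illustration.

```python
import re

# Message codes from the output above: CRS, CSS and EVM reported online.
REQUIRED = {"CRS-4537", "CRS-4529", "CRS-4533"}


def nodes_not_fully_online(output):
    """Return {node: sorted list of missing 'online' message codes}."""
    seen = {}
    node = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line or set(line) == {"*"}:
            continue  # blank line or row of asterisks
        header = re.fullmatch(r"(\S+):", line)
        if header:
            node = header.group(1)
            seen[node] = set()
            continue
        code = re.match(r"(CRS-\d+):", line)
        if code and node is not None:
            seen[node].add(code.group(1))
    return {n: sorted(REQUIRED - codes) for n, codes in seen.items() if REQUIRED - codes}


sample = """\
**************************************************************
NODEA:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
NODEB:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
"""
print(nodes_not_fully_online(sample))  # prints {}
```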



My first attempt at restarting Clusterware did not work because the server still had some hung cluster threads. To clear them, I had to reboot the server and let the init scripts start Clusterware cleanly.

Use crsctl stat res -t to check the status of all cluster resources.
