
Sunday, July 28, 2013

Setting up High Availability for GoldenGate


The first step is to set up a virtual IP.
The VIP needs to reside on the public network subnet. To find the subnet, run the following:


crsctl stat res -p | grep -i .network

NETWORK = ora.net1.network

crsctl stat res -p | grep -i subnet
USR_ORA_SUBNET=10.2.170.128



Add the VIP to the /etc/hosts file


10.2.170.157 prdgg-vip.prd1.company.com prdgg-vip



Add the VIP to the cluster


GRID_HOME/bin/appvipcfg create -network=1 -ip=10.2.170.157 -vipname=prdgg-vip -user=root
Production Copyright 2007, 2008, Oracle. All rights reserved
2013-05-29 13:51:03: Creating Resource Type
2013-05-29 13:51:03: Executing cmd: /u01/app/crs/11.2.0/grid/bin/crsctl add type app.appvip.type -basetype cluster_resource -file /u01/app/crs/11.2.0/grid/crs/template/appvip.type
2013-05-29 13:51:04: Create the Resource
2013-05-29 13:51:04: Executing cmd: /u01/app/crs/11.2.0/grid/bin/crsctl add resource prdgg-vip -type app.appvip.type -attr USR_ORA_VIP=10.2.170.157,START_DEPENDENCIES=hard(ora.net1.network) pullup(ora.net1.network),STOP_DEPENDENCIES=hard(ora.net1.network),ACL='owner:root:rwx,pgrp:root:r-x,other::r--,user:root:r-x'



Check that the VIP has been added to the cluster


GRID_HOME/bin/crsctl stat res prdgg-vip
NAME=prdgg-vip
TYPE=app.appvip.type
TARGET=OFFLINE
STATE=OFFLINE



Grant the oracle user read/execute permission on the VIP resource.


GRID_HOME/bin/crsctl setperm resource prdgg-vip -u user:oracle:r-x



Start the VIP


GRID_HOME/bin/crsctl start res prdgg-vip
CRS-2672: Attempting to start 'prdgg-vip' on 'prdnode2'
CRS-2676: Start of 'prdgg-vip' on 'prdnode2' succeeded



Test relocating the VIP between the nodes


GRID_HOME/bin/crsctl relocate res prdgg-vip
CRS-2673: Attempting to stop 'prdgg-vip' on 'prdnode2'
CRS-2677: Stop of 'prdgg-vip' on 'prdnode2' succeeded
CRS-2672: Attempting to start 'prdgg-vip' on 'prdnode1'
CRS-2676: Start of 'prdgg-vip' on 'prdnode1' succeeded




GRID_HOME/bin/crsctl stat res prdgg-vip
NAME=prdgg-vip
TYPE=app.appvip.type
TARGET=ONLINE
STATE=ONLINE on prdnode1



Create the action script


#!/bin/sh
set -x
# GoldenGate auto-restart action script
OGG_HOME=/u01/app/ogg/GG11g
export OGG_HOME
start_delay_secs=5

# Include the GoldenGate home in the library path to start GGSCI
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${OGG_HOME}
export LD_LIBRARY_PATH

# Set the Oracle home of the database so GoldenGate picks up the
# right environment settings to be able to connect to the database
ORACLE_HOME=/u01/app/oracle/product/11.2.0/db1
JAVA_HOME=/usr/java/default
PATH=$PATH:/usr/java/default/bin
export ORACLE_HOME JAVA_HOME PATH

# check_process validates that a manager process is running at the PID
# that GoldenGate records in dirpcs/MGR.pcm
check_process () {
    if [ -f "${OGG_HOME}/dirpcs/MGR.pcm" ]
    then
        pid=`cut -f8 "${OGG_HOME}/dirpcs/MGR.pcm"`
        if [ "${pid}" = "`ps -e | grep ${pid} | grep mgr | cut -d ' ' -f2`" ]
        then
            # manager process is running on the PID - exit success
            exit 0
        else
            if [ "${pid}" = "`ps -e | grep ${pid} | grep mgr | cut -d ' ' -f1`" ]
            then
                # manager process is running on the PID - exit success
                exit 0
            else
                # manager process is not running on the PID
                exit 1
            fi
        fi
    else
        # manager is not running because there is no PID file
        exit 1
    fi
}

# call_ggsci is a generic routine that executes a ggsci command
call_ggsci () {
    ggsci_command=$1
    ggsci_output=`${OGG_HOME}/ggsci << EOF
${ggsci_command}
exit
EOF`
}

case $1 in
'start')
    # start manager
    call_ggsci 'start manager'
    # there is a small delay between issuing the start manager command
    # and the process being spawned on the OS - wait before checking
    sleep ${start_delay_secs}
    # start the Extract and Replicat processes
    call_ggsci 'start *'
    # check whether manager is running and exit accordingly
    check_process
    ;;

'stop')
    # attempt a clean stop for all non-manager processes
    call_ggsci 'stop er *'
    # ensure everything is stopped
    call_ggsci 'stop er *!'
    # stop manager without (y/n) confirmation
    call_ggsci 'stop manager!'
    # exit success
    exit 0
    ;;

'check')
    check_process
    ;;

'clean')
    # attempt a clean stop for all non-manager processes
    call_ggsci 'stop er *'
    # ensure everything is stopped
    call_ggsci 'stop er *!'
    # in case there are lingering processes
    call_ggsci 'kill er *'
    # stop manager without (y/n) confirmation
    call_ggsci 'stop manager!'
    # exit success
    exit 0
    ;;

'abort')
    # ensure everything is stopped
    call_ggsci 'stop er *!'
    # in case there are lingering processes
    call_ggsci 'kill er *'
    # stop manager without (y/n) confirmation
    call_ggsci 'stop manager!'
    # exit success
    exit 0
    ;;
esac



[prdnode1.prd1.company: /u01/oraogg/prdnode/scripts] ls -ltr 11gr2_gg_action.scr
-rwxr-xr-x 1 oracle oinstall 2353 May 29 13:59 11gr2_gg_action.scr
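

Before registering it with the cluster, it is worth sanity-checking the action script by hand. A minimal sketch (run as the oracle user on the node where GoldenGate is installed; an exit code of 0 from 'check' means the manager is running):


/u01/oraogg/prdnode/scripts/11gr2_gg_action.scr check
echo $?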



Add the GoldenGate Application to the cluster


GRID_HOME/bin/crsctl add resource ggateapp -type cluster_resource -attr "ACTION_SCRIPT=/u01/oraogg/prdnode/scripts/11gr2_gg_action.scr,CHECK_INTERVAL=30, START_DEPENDENCIES='hard(prdgg-vip)pullup(prdgg-vip)', STOP_DEPENDENCIES='hard(prdgg-vip)'"



Check the GoldenGate Application in the cluster


GRID_HOME/bin/crsctl stat res ggateapp -p
NAME=ggateapp
TYPE=cluster_resource
ACL=owner:oracle:rwx,pgrp:oinstall:rwx,other::r--
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=/u01/oraogg/prdnode/scripts/11gr2_gg_action.scr
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/scriptagent
AUTO_START=restore
CARDINALITY=1
CHECK_INTERVAL=30
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION=
ENABLED=1
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
LOAD=1
LOGGING_LEVEL=1
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PLACEMENT=balanced
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=1
SCRIPT_TIMEOUT=60
SERVER_POOLS=
START_DEPENDENCIES=hard(prdgg-vip)pullup(prdgg-vip)
START_TIMEOUT=0
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(prdgg-vip)
STOP_TIMEOUT=0
UPTIME_THRESHOLD=1h



Start the cluster resource on the node where GoldenGate is running.


GRID_HOME/bin/crsctl start resource ggateapp -n prdnode1
CRS-2672: Attempting to start 'ggateapp' on 'prdnode1'
CRS-2676: Start of 'ggateapp' on 'prdnode1' succeeded



Check that it is running


GRID_HOME/bin/crsctl stat res ggateapp
NAME=ggateapp
TYPE=cluster_resource
TARGET=ONLINE
STATE=ONLINE on prdnode1
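

As a final test, you may also want to fail the whole GoldenGate stack over to the other node. A hedged sketch (the -f flag is expected to relocate the dependent VIP along with the resource):


GRID_HOME/bin/crsctl relocate resource ggateapp -f
GRID_HOME/bin/crsctl stat res ggateapp
GRID_HOME/bin/crsctl stat res prdgg-vip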


Tuesday, October 18, 2011

Adaptive Cursor Sharing

Oracle 11g introduced a new mode of cursor sharing, adaptive cursor sharing, to handle the side effects of bind peeking.
Bind peeking means the optimizer looks at the actual bind values on the first hard parse and builds the plan for those values; when the resulting cursor is shared, that plan can be non-optimal for later bind values.
For example:

Take the SQL "select name from emp where dept=:B", executed from a package that is called many times with different values.
On the first execution, the cursor is parsed and cached in the shared pool. That first execution is effectively


'select name from emp where dept=5'


and returns 3 rows. The execution plan uses an index, which is optimal given that only 3 out of 253 rows are returned.

On the second execution, the cached cursor gets reused.


'select name from emp where dept=10'


This returns 250 of the 253 rows in the table, but the same index plan is reused because the cursor is shared.

In 11g, adaptive cursor sharing allows Oracle to create additional child cursors for a matching cursor in the shared pool when it observes that different bind values have very different selectivities, so a better plan can be chosen instead of blindly reusing the parent's plan.
Hence, when upgrading, it is worth allowing some extra memory for the shared pool. In Oracle 11g, cursor_sharing should only be used in two modes: EXACT and FORCE.
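

You can watch adaptive cursor sharing at work in V$SQL. A minimal sketch (the literal SQL text is only illustrative):


select sql_id, child_number, is_bind_sensitive, is_bind_aware, executions
from v$sql
where sql_text like 'select name from emp where dept=%';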

In EXACT mode, the default in Oracle 11g, statements share a cursor only when their text matches exactly, and adaptive cursor sharing creates child cursors when different bind values warrant different plans. This is the desirable approach: the application makes proper use of bind variables, and Oracle makes sure bind peeking does not lock in a bad plan.

In FORCE mode, Oracle replaces literals with system-generated bind variables and forces statements that differ only in their literals to reuse the same cursor. This is best for legacy applications where bind variables are not used effectively.
In this mode Oracle does not generate child cursors for different bind values.
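

Checking and changing the mode is straightforward; a sketch, assuming an spfile is in use:


show parameter cursor_sharing

alter system set cursor_sharing='FORCE' scope=both;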

The SIMILAR mode, although still available in Oracle 11g, is not recommended: it forces Oracle to replace literals with binds yet still spawns separate child cursors for 'similar' statements. Oracle also says it heavily inflates child cursor creation and upsets the optimizer's operation.

Although there is no explicit child cursor limit (it can be capped with an underscore parameter), child cursors can fill up your shared pool quite rapidly from a single bad SQL. Oracle reckons that excessive child cursors may be more detrimental to performance than using separate cursors. We noticed that SIMILAR floods the shared pool and leads to bugs like Bug 11657468 - Excessive mutex waits with adaptive cursor sharing. A quick way to spot the offenders is sketched below.
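

A minimal sketch for finding statements with many child cursors (V$SQL has one row per child cursor):


select sql_id, count(*) as child_cursors
from v$sql
group by sql_id
having count(*) > 20
order by child_cursors desc;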

Thanks to Senthil for assisting with the research.

Friday, September 9, 2011

Oracle Clusterware restore

ASM and Clusterware are interleaved: ASM cannot start up without the clusterware running first, and the clusterware can store its voting disks and OCR in ASM.
This is a fairly complex integration. I was in a scenario where I lost all my disks in ASM. When I say lost, we were trying a proof of concept. :)

When I tried to start ASM standalone with a pfile, it complained that it couldn't communicate with CSS.
And I couldn't start CRS without the voting disks and OCR, which I had now lost.
I was still able to see all the LUNs on the server under /dev/rdsk.

Below is how to recover.


Start CRS in exclusive, no-CRS mode. This should be run on only one node of the cluster.


oracle@NODEA:/var/opt/oracle sudo crsctl start crs -excl -nocrs
Password:
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'NODEA'
CRS-2676: Start of 'ora.mdnsd' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'NODEA'
CRS-2676: Start of 'ora.gpnpd' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'NODEA'
CRS-2672: Attempting to start 'ora.gipcd' on 'NODEA'
CRS-2676: Start of 'ora.cssdmonitor' on 'NODEA' succeeded
CRS-2676: Start of 'ora.gipcd' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'NODEA'
CRS-2672: Attempting to start 'ora.diskmon' on 'NODEA'
CRS-2676: Start of 'ora.diskmon' on 'NODEA' succeeded
CRS-2676: Start of 'ora.cssd' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'NODEA'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'NODEA'
CRS-2672: Attempting to start 'ora.ctssd' on 'NODEA'
CRS-2676: Start of 'ora.ctssd' on 'NODEA' succeeded
CRS-2676: Start of 'ora.drivers.acfs' on 'NODEA' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'NODEA' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'NODEA'
CRS-2676: Start of 'ora.asm' on 'NODEA' succeeded



Re-create all the diskgroups.


SQL> CREATE DISKGROUP ORCLCLU HIGH REDUNDANCY
2 FAILGROUP fg1 DISK '/dev/rdsk/c6t001738000CE8002Ad0s0'
3 FAILGROUP fg2 DISK '/dev/rdsk/c6t001738000CE8002Bd0s0'
4 FAILGROUP fg3 DISK '/dev/rdsk/c6t001738000CE8002Cd0s0'
5 FAILGROUP fg4 DISK '/dev/rdsk/c6t001738000CE8002Dd0s0'
6 QUORUM FAILGROUP fg5 DISK '/dev/rdsk/c6t001738000CE8002Ed0s0'
7 ATTRIBUTE 'compatible.asm' = '11.2.0.0.0';

Diskgroup created.

create diskgroup DATA_FAST external redundancy disk '/dev/rdsk/c6t001738000CE80026d0s0' attribute 'COMPATIBLE.ASM' = '11.2';

create diskgroup ARCHFLASH external redundancy disk '/dev/rdsk/c6t001738000CE80029d0s0' attribute 'COMPATIBLE.ASM' = '11.2';

create diskgroup DATA_RW external redundancy disk '/dev/rdsk/c6t001738000CE80028d0s0' attribute 'COMPATIBLE.ASM' = '11.2';
............




Run queries on V$ASM_DISK and V$ASM_DISKGROUP to check that all disks and diskgroups are available.
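

Something along these lines works (the column list is trimmed for readability):


select group_number, name, path, header_status, state from v$asm_disk order by group_number, name;

select group_number, name, state, type, total_mb, free_mb from v$asm_diskgroup;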

Time to restore OCR and voting disks.
OCR is first.
Find out where the OCR backups are located.


oracle@NODEA:/var/opt/oracle sudo ocrconfig -showbackup
Password:
PROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy

NODEA 2011/09/05 17:36:47 /apps/product/11gr2/grid/cdata/test-clu/backup00.ocr

NODEA 2011/09/05 13:36:46 /apps/product/11gr2/grid/cdata/test-clu/backup01.ocr

NODEA 2011/09/05 09:36:44 /apps/product/11gr2/grid/cdata/test-clu/backup02.ocr

NODEA 2011/09/04 01:36:33 /apps/product/11gr2/grid/cdata/test-clu/day.ocr

NODEA 2011/08/23 01:08:30 /apps/product/11gr2/grid/cdata/test-clu/week.ocr
PROT-25: Manual backups for the Oracle Cluster Registry are not available

oracle@NODEA:/var/opt/oracle sudo ocrconfig -restore /apps/product/11gr2/grid/cdata/test-clu/backup00.ocr
oracle@NODEA:/var/opt/oracle
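

At this point it is worth verifying the restored registry; a quick sketch (run as root):


ocrcheck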



Now the voting disks. In 11gR2, the voting disk data is automatically backed up as part of the OCR backup.


oracle@NODEA:/var/opt/oracle sudo crsctl replace votedisk +ORCLCLU
Successful addition of voting disk 142da6336d704f7fbf0fbebd482506d2.
Successful addition of voting disk 2477e787c9dd4f72bfa50bc3e88fc8d1.
Successful addition of voting disk 72c3a35d987a4f2ebf2e27c97ee69946.
Successful addition of voting disk 6c3a9fd3b0b84f06bfed66c9b9604f1b.
Successful addition of voting disk 56db0703046b4f22bf614ec7ab1bd716.
Successfully replaced voting disk group with +ORCLCLU.
CRS-4266: Voting file(s) successfully replaced
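

You can confirm the new voting files afterwards; a sketch:


crsctl query css votedisk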



Create an SPFILE for ASM from the local PFILE and place it on shared storage.


sqlplus / as sysasm

create spfile='+DATA_RW' from pfile;
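

To confirm where the ASM SPFILE is now registered, asmcmd can be queried; a sketch:


asmcmd spget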



The clusterware restore is done. Time to stop the exclusively running clusterware and bring it back up normally on all nodes.

Stop the clusterware; the -f flag forces the stop.


oracle@NODEA:/apps/product/11gr2/grid/dbs sudo crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'NODEA'
CRS-2673: Attempting to stop 'ora.ctssd' on 'NODEA'
CRS-2673: Attempting to stop 'ora.asm' on 'NODEA'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'NODEA'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'NODEA'
CRS-2677: Stop of 'ora.asm' on 'NODEA' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'NODEA'
CRS-2677: Stop of 'ora.drivers.acfs' on 'NODEA' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'NODEA' succeeded
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'NODEA' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'NODEA' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'NODEA'
CRS-2677: Stop of 'ora.cssd' on 'NODEA' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'NODEA'
CRS-2677: Stop of 'ora.gipcd' on 'NODEA' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'NODEA'
CRS-2677: Stop of 'ora.gpnpd' on 'NODEA' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'NODEA' has completed
CRS-4133: Oracle High Availability Services has been stopped.



Start the clusterware.


oracle@NODEA:/apps/product/11gr2/grid/dbs sudo crsctl start crs
CRS-4123: Oracle High Availability Services has been started.



Finally, check the status of the cluster on all nodes.


oracle@NODEB:/apps/product/11gr2/grid/log/NODEB/evmd sudo crsctl check cluster -all
Password:
**************************************************************
NODEA:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
NODEB:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************



My first try at restarting the clusterware did not work because the server had some hung cluster processes. To clear them, I had to reboot the server and let the init scripts start the clusterware cleanly.

Use crsctl stat res -t to check the status of all cluster resources.

Monday, September 5, 2011

DB Health Check:: 11gR2 New Feature

A relatively unknown new feature in 11g is the Health Monitor checks.

The feature ships with 27 checks, of which 6 are not internal. The internal checks are managed by the RDBMS software itself.
The non-internal checks can be run by DBAs to find and fix problems. They can also be listed from V$HM_CHECK, as sketched after the list below.
The non-internal checks are:

DB Structure Integrity Check
Data Block Integrity Check
Redo Integrity Check
Transaction Integrity Check
Undo Segment Integrity Check
Dictionary Integrity Check
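

A minimal sketch for listing the checks that can be run manually:


select name from v$hm_check where internal_check = 'N';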

Some of these checks previously required guidance from Oracle Support, such as the data dictionary consistency checks in MOS note 456468.1, so it is very useful to have them packaged with the software now.


DBAs should make this part of their routine checks every few months. Some of the checks are resource intensive, so it is recommended to run them during low-load periods.

To run a check, use:


BEGIN
  DBMS_HM.RUN_CHECK('Dictionary Integrity Check', to_char(sysdate,'DDMONYY')||'_dict_check');
END;
/


BEGIN
  DBMS_HM.RUN_CHECK('DB Structure Integrity Check', to_char(sysdate,'DDMONYY')||'_dbstruct');
END;
/




These can also be added to the scheduler, as in the sketch below.
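

A hedged sketch of a weekly scheduler job wrapping the same call (the job name and schedule are illustrative):


BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'WEEKLY_DICT_HM_CHECK',
    job_type        => 'PLSQL_BLOCK',
    job_action      => q'[BEGIN
                            DBMS_HM.RUN_CHECK('Dictionary Integrity Check',
                                              to_char(sysdate,'DDMONYY')||'_dict_check');
                          END;]',
    start_date      => systimestamp,
    repeat_interval => 'FREQ=WEEKLY; BYDAY=SUN; BYHOUR=3',
    enabled         => TRUE);
END;
/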

ADRCI can then be used to view the reports.


adrci> show hm_run

ADR Home = /apps/oracle/diag/rdbms/XXXX/XXXX4:
*************************************************************************

**********************************************************
HM RUN RECORD 1
**********************************************************
RUN_ID 1
RUN_NAME 01SEP11_check
CHECK_NAME Dictionary Integrity Check
NAME_ID 24
MODE 0
START_TIME 2011-09-01 09:25:46.844074 +01:00
RESUME_TIME
END_TIME 2011-09-01 09:35:15.977181 +01:00
MODIFIED_TIME 2011-09-01 09:37:45.385591 +01:00
TIMEOUT 0
FLAGS 0
STATUS 5
SRC_INCIDENT_ID 0
NUM_INCIDENTS 0
ERR_NUMBER 0
REPORT_FILE /apps/oracle/diag/rdbms/XXXX/XXXX4/hm/HMREPORT_01SEP11_check.hm
1 rows fetched




adrci> show report hm_run 01SEP11_check


This should then return the errors, if any, like:



Dictionary Inconsistency
6386
FAILURE
OPEN
CRITICAL
0
2011-09-01 09:35:11.868297 +01:00
SQL dictionary health check: syn$.owner fk 95 on object SYN$ failed
Damaged rowid is AAAAA9AABAAAC5BAER - description: Synonymn ORDDCM_DOC_REGS is referenced
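

The same report can also be pulled from SQL*Plus via DBMS_HM; a sketch, using the run name from above:


set long 100000 longchunksize 100000 pagesize 0
select dbms_hm.get_run_report('01SEP11_check') from dual;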