Oracle内部错误:ORA-00600[kccchb_3]一例

一套Linux x86-64上的11.1.0.7 RAC系统,该RAC使用Netapps NFS作为共享存储。其中一个节点出现ORA-00600: internal error code, arguments: [kccchb_3]内部错误并导致实例意外终止,详细日志如下:

Mon Dec 27 00:03:13 2010
Error: Controlfile sequence number in file header is different from the one in memory
Please check that the correct mount optionsare used if controlfile is located on NFS
Errors in file /d02/oratst/TSMOTEBS/db/11.1.0/admin/TSMOTEBS1_l118464dcss3001/diag/rdbms/tsmotebs/TSMOTEBS1/trace/TSMOTEBS1_ckpt_15907.trc (incident=11353):
ORA-00600: internal error code, arguments: [kccchb_3], [555541], [555540], [555540], [], [], [], [], [], [], [], []
Incident details in: /d02/oratst/TSMOTEBS/db/11.1.0/admin/TSMOTEBS1_l118464dcss3001/diag/rdbms/tsmotebs/TSMOTEBS1/incident/incdir_11353/TSMOTEBS1_ckpt_15907_i11353.trc
Errors in file /d02/oratst/TSMOTEBS/db/11.1.0/admin/TSMOTEBS1_l118464dcss3001/diag/rdbms/tsmotebs/TSMOTEBS1/trace/TSMOTEBS1_ckpt_15907.trc:
ORA-00600: internal error code, arguments: [kccchb_3], [555541], [555540], [555540], [], [], [], [], [], [], [], []
CKPT (ospid: 15907): terminating the instance due to error 469

根据Metalink文档Note 303986.1 RAC instance using NFS via Netapps report Ora-600 [kccchb_3]:

Symptoms
RAC instance using NFS via Netapps occasionally reports:
ORA-00600 [kccchb_3], [689], [0], [],[],[],[]
ORA-00600 [kclchkblk_3], [7], [738515663], [14], [], [], [], []

Cause
This is most probably a Netapp issue wherein 'noac' doesn't work as expected.
NetApp recommended mount options for Oracle9i RAC on Solaris:
rw,bg,vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,forcedirectio,noac

NetApp recommended mount options for Linux SLES9, RHEL4, RHEL3 QU3 and later:
rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=ORACLE_BLOCK_SIZE,
wsize=ORACLE_BLOCK_SIZE,actimeo=0

NetApp recommended mount options for Linux RHAS2.1, RHEL3 pre QU3:
rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=ORACLE_BLOCK_SIZE,
wsize=ORACLE_BLOCK_SIZE,noac

And for AS 2.1 in /etc/modules.conf add:
options nfs nfs_uncached_io=1
Solution

Workaround
set filesystemio_options=DIRECTIO for all instances (to bypass any buffer cache)

以上介绍了该ORA-00600 [kccchb_3]内部错误可以通过设置初始化参数filesystemio_options为DIRECTIO来workaround;实际上更好的选择可以是设置filesystemio_options为SETALL,即同时使用DIRECTIO和ASYNC异步IO。

顺便提一下在Linux上启用异步ASYNC IO不仅需要设置该filesystemio_options参数,我们还需要同时保证以下几点:
1.filesystemio_options参数设置为ASYNC或者SETALL

2.合理设置Linux Kernel内核参数fs.aio-max-nr,例如设置为3145728;以及参数fs.file-max =6815744

3.disk_asynch_io初始化参数设置为TRUE

4.在10gR2下oracle binary可能没有正确以async_on选项编译,我们需要手动重新make(The reason behind this behavior is that the LIBAIO_0.1 Linux OS library is not attached for io_getevents.)

SQL>shutdown immediate
[maclean@rh8 ~]$ cd $ORACLE_HOME/rdbms/lib
[maclean@rh8 lib]$ ln -s /usr/lib/libaio.so skgaio.o
[maclean@rh8 lib]$ make PL_ORALIBS=-laio -f ins_rdbms.mk async_on
rm -f /u01/oracle/product/10.2.0/db_1/rdbms/lib/skgaioi.o
cp /u01/oracle/product/10.2.0/db_1/rdbms/lib/skgaio.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/skgaioi.o
chmod 755 /u01/oracle/product/10.2.0/db_1/bin

 - Linking Oracle 
rm -f /u01/oracle/product/10.2.0/db_1/rdbms/lib/oracle
gcc  -o /u01/oracle/product/10.2.0/db_1/rdbms/lib/oracle -L/u01/oracle/product/10.2.0/db_1/rdbms/lib/ -L/u01/oracle/product/10.2.0/db_1/lib/ -L/u01/oracle/product/10.2.0/db_1/lib/stubs/ -L/usr/lib -lirc   -Wl,-E `test -f /u01/oracle/product/10.2.0/db_1/rdbms/lib/sskgpsmti.o && echo /u01/oracle/product/10.2.0/db_1/rdbms/lib/sskgpsmti.o` /u01/oracle/product/10.2.0/db_1/rdbms/lib/opimai.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/ssoraed.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/ttcsoi.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/defopt.o  -Wl,--whole-archive -lperfsrv10 -Wl,--no-whole-archive /u01/oracle/product/10.2.0/db_1/lib/nautab.o /u01/oracle/product/10.2.0/db_1/lib/naeet.o /u01/oracle/product/10.2.0/db_1/lib/naect.o /u01/oracle/product/10.2.0/db_1/lib/naedhs.o /u01/oracle/product/10.2.0/db_1/rdbms/lib/config.o  -lserver10 -lodm10 -lnnet10 -lskgxp10 -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 -lhasgen10 -lcore10 -lskgxn2 -locr10 -locrb10 -locrutl10 -lhasgen10 -lcore10 -lskgxn2   -lclient10  -lvsn10 -lcommon10 -lgeneric10 -lknlopt `if /usr/bin/ar tv /u01/oracle/product/10.2.0/db_1/rdbms/lib/libknlopt.a | grep xsyeolap.o > /dev/null 2>&1 ; then echo "-loraolap10" ; fi` -lslax10 -lpls10  -lplp10 -lserver10 -lclient10  -lvsn10 -lcommon10 -lgeneric10 -lknlopt -lslax10 -lpls10  -lplp10 -ljox10 -lserver10 -lclsra10 -ldbcfg10 -locijdbcst10 -lwwg  `cat /u01/oracle/product/10.2.0/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /u01/oracle/product/10.2.0/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lmm -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 `cat /u01/oracle/product/10.2.0/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /u01/oracle/product/10.2.0/db_1/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10   -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 `if /usr/bin/ar tv /u01/oracle/product/10.2.0/db_1/rdbms/lib/libknlopt.a | grep "kxmnsd.o" > /dev/null 2>&1 ; then echo " " ; else echo "-lordsdo10"; fi` -lctxc10 -lctx10 -lzx10 -lgx10 -lctx10 -lzx10 -lgx10 -lordimt10 -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 -lsnls10 -lunls10  -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 -laio    `cat /u01/oracle/product/10.2.0/db_1/lib/sysliblist` -Wl,-rpath,/u01/oracle/product/10.2.0/db_1/lib -lm    `cat /u01/oracle/product/10.2.0/db_1/lib/sysliblist` -ldl -lm   -L/u01/oracle/product/10.2.0/db_1/lib
/u01/oracle/product/10.2.0/db_1/lib//libcore10.a(lcd.o): In function `lcdprm':
lcd.c:(.text+0x8e8): warning: the `gets' function is dangerous and should not be used.
mv -f /u01/oracle/product/10.2.0/db_1/bin/oracle /u01/oracle/product/10.2.0/db_1/bin/oracleO
mv /u01/oracle/product/10.2.0/db_1/rdbms/lib/oracle /u01/oracle/product/10.2.0/db_1/bin/oracle
chmod 6751 /u01/oracle/product/10.2.0/db_1/bin/oracle

SQL> startup;

/* 通过以下方式可以验证ASYNC IO的工作情况 */

cat /proc/slabinfo | grep kio

[maclean@rh8 ~]$ cat /proc/slabinfo | grep kio
kioctx                56     72    320   12    1 : tunables   54   27    8 : slabdata      6      6      0
kiocb                 15     15    256   15    1 : tunables  120   60    8 : slabdata      1      1      0

/* 这里kioctx的值大于零,说明正在使用异步ASYNC IO */

如何是11g,那么无需那么麻烦要重新编译async_on,同时可以通过以下手段了解异步ASYNC IO的使用情况:

SQL> col name for a60 
SQL> SELECT name, asynch_io FROM v$datafile f,v$iostat_file i 
WHERE f.file# = i.file_no
AND filetype_name = 'Data File' ; 

NAME							     ASYNCH_IO
------------------------------------------------------------ ---------
/standby/oradata/PROD/datafile/o1_mf_system_6q9dwgwh_.dbf    ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_sysaux_6q9dwgyp_.dbf    ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_undotbs1_6q9dwh0r_.dbf  ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_users_6q9dwh2x_.dbf     ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_example_6q9dzhh1_.dbf   ASYNC_ON
/standby/oradata/PROD/datafile/o1_mf_enc_6q9fdzcx_.dbf	     ASYNC_ON

6 rows selected.

/* 可以看到ASYNCH_IO的状态为ON,说明正使用异步IO */


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *