面向Oracle 19c/21c的全链路故障诊断指南
Oracle RAC安装是企业级数据库部署中最复杂的任务之一,涉及Grid Infrastructure、Clusterware、ASM和数据库四个紧密耦合的层次。本文提供系统化的故障诊断方法论,通过ASCII流程图和伪代码形式呈现诊断逻辑,帮助DBA快速定位和解决安装过程中的各类故障。全文覆盖从CVU预检失败到数据库实例无法启动的完整故障链,并针对19c/21c版本特有问题给出补丁建议。
第一部分:Grid Infrastructure安装故障诊断
1.1 全局诊断决策树
+------------------------------------------------------------------+
| GI INSTALLATION FAILURE DIAGNOSTIC FLOWCHART |
+------------------------------------------------------------------+
|
v
+----------------------------------+
| 运行 runcluvfy.sh 预检查 |
| cluvfy stage -pre crsinst |
+----------------------------------+
|
+---------------+---------------+
| |
[PASSED] [FAILED]
| |
v v
+------------------+ +---------------------------+
| 执行 gridSetup.sh| | 分析 CVU 输出 |
+------------------+ | 定位 PRVF-* 错误码 |
| +---------------------------+
| |
+---------+---------+ +-------------+-------------+
| | | | |
[成功] [失败] [SSH失败] [网络失败] [存储失败]
| | | | |
v v v v v
+--------+ +------------------+ +-----------+ +-----------+
|完成安装| |检查INS-*错误码 | |INS-06006 | |INS-41112 |
+--------+ |$ORACLE_INVENTORY | |SSH配置 | |防火墙问题 |
|/logs/ | +-----------+ +-----------+
+------------------+
|
+-------------+-------------+
| |
[root.sh失败] [其他INS错误]
| |
v v
+------------------+ +----------------------+
|检查rootcrs*.log | | 参考错误码解析表 |
|CLSRSC-* 错误 | | 执行针对性修复 |
+------------------+ +----------------------+
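上面的决策树入口可以部分脚本化。下面是一个示意性的bash片段(非官方工具,其中GRID_HOME、节点列表均为假设值),先运行CVU预检查,再从输出中提取PRVF-/PRVG-/INS-错误码,作为对照后文速查表的入口:
bash
#!/bin/bash
# 示意脚本: 运行CVU预检查并汇总失败项 (GRID_HOME与节点列表为假设值, 以grid用户执行)
GRID_HOME=${GRID_HOME:-/u01/app/19.0.0/grid}
NODELIST="node1,node2"
CVU_OUT=/tmp/cvu_precheck_$(date +%Y%m%d%H%M).log
"$GRID_HOME"/runcluvfy.sh stage -pre crsinst -n "$NODELIST" -verbose > "$CVU_OUT" 2>&1
# 先列出失败的检查项, 再统计出现过的错误码
grep -n "FAILED" "$CVU_OUT" | head -20
grep -oE "PRV[FG]-[0-9]+|INS-[0-9]+" "$CVU_OUT" | sort | uniq -c | sort -rn
echo "完整输出: $CVU_OUT"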
1.2 CVU预检查诊断伪代码
pseudo
FUNCTION diagnose_cvu_failure(cvu_output):
# 阶段1: 解析CVU输出
errors = parse_prvf_errors(cvu_output)
FOR each error IN errors:
SWITCH error.code:
CASE "PRVF-5311": # SSH验证失败
PRINT "诊断: SSH互信未正确配置"
EXECUTE check_ssh_connectivity()
IF openssh_version >= 8.0:
APPLY_FIX "MOS Note 2555697.1 - OpenSSH 8.x兼容性"
ADD_TO_SSHD_CONFIG "KexAlgorithms +diffie-hellman-group1-sha1"
ENDIF
RUN "$GRID_HOME/oui/prov/resources/scripts/sshUserSetup.sh"
CASE "PRVF-5150": # 磁盘路径不一致
PRINT "诊断: ASM磁盘在节点间路径不一致"
FOR each node IN cluster_nodes:
VERIFY disk_paths_exist(node)
VERIFY udev_rules_identical(node)
ENDFOR
CASE "PRVF-5636": # 用户等价性检查失败
PRINT "诊断: 用户SSH等价性未建立"
EXECUTE "ssh-copy-id grid@remote_node"
VERIFY "ssh remote_node hostname" returns_without_password
CASE "PRVF-7617": # NTP/Chrony配置问题
PRINT "诊断: 时钟同步服务配置错误"
IF using_ntpd:
VERIFY "-x" slew_mode IN /etc/sysconfig/ntpd
ELIF using_chrony:
VERIFY "-x" IN OPTION of /etc/sysconfig/chronyd
ENDIF
CASE "PRVF-9652": # 内核参数不满足
PRINT "诊断: 内核参数配置不符合要求"
EXECUTE "cluvfy comp sys -n all -p crs -verbose"
APPLY_FIXUP_SCRIPT from CVU output
ENDSWITCH
ENDFOR
RETURN remediation_summary
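对应伪代码中的PRVF-7617分支,下面给出一个最小的bash检查示意(假设操作系统为使用systemd的OL7/OL8,配置文件路径以实际发行版为准):
bash
# 示意片段: 检查时钟同步服务是否带 -x (slew) 选项
if systemctl is-active --quiet chronyd; then
    echo "chronyd 在运行, 当前OPTIONS:"
    grep -H "OPTIONS" /etc/sysconfig/chronyd
elif systemctl is-active --quiet ntpd; then
    echo "ntpd 在运行, 当前OPTIONS:"
    grep -H "OPTIONS" /etc/sysconfig/ntpd
    grep -q -- "-x" /etc/sysconfig/ntpd || echo "警告: 未发现 -x slew 选项"
else
    echo "未检测到 ntpd/chronyd, 集群将由CTSS以active模式接管时钟同步"
fi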
1.3 常见INS错误码速查表
| 错误码 | 错误描述 | 根因分析 | 解决方案 |
|---|---|---|---|
| INS-06006 | Passwordless SSH connectivity not set up | OpenSSH 8.x兼容性问题 | 参考MOS 2555697.1添加KexAlgorithms |
| INS-06003 | Failed to setup passwordless SSH | SSH配置不完整 | 运行sshUserSetup.sh脚本 |
| INS-30131 | Initial setup failed for installer | /tmp空间不足或权限问题 | 确保/tmp至少400MB可用空间 |
| INS-32052 | ORACLE_BASE和ORACLE_HOME相同 | 目录结构配置错误 | 设置不同路径,如BASE=/u01/app/grid, HOME=/u01/app/19.0.0/grid |
| INS-41112 | Network interface不通 | Linux防火墙启用 | systemctl stop firewalld; systemctl disable firewalld |
| INS-40926 | 节点连接问题 | 网络配置错误 | 验证public/private网络、VIP子网配置 |
| INS-13001 | OS版本不支持 | 未认证的操作系统 | 检查认证矩阵或使用-ignorePrereq参数 |
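表中多条SSH相关错误都可以用同一个快速验证来确认。下面是一个示意脚本(节点列表为假设值),在安装用户(grid/oracle)下于每个节点各执行一遍,即可验证双向免密互信:
bash
# 示意脚本: 验证到所有节点的免密SSH (对应 INS-06003/INS-06006 的排查)
NODES="node1 node2"
for n in $NODES; do
    # 必须不提示密码并立即返回远端主机名
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$n" hostname \
        || echo "FAILED: 到 $n 的免密SSH未建立"
done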
1.4 root.sh故障诊断流程
+------------------------------------------------------------------+
| root.sh FAILURE DIAGNOSTIC FLOW |
+------------------------------------------------------------------+
|
v
+----------------------------------+
| 执行 $GRID_HOME/root.sh |
+----------------------------------+
|
[失败]
|
v
+----------------------------------+
| 检查日志文件: |
| $ORACLE_BASE/crsdata/<host>/ |
| crsconfig/rootcrs_<host>_*.log |
+----------------------------------+
|
v
+----------------------------------+
| 解析 CLSRSC-* 错误码 |
+----------------------------------+
|
+----------------------+----------------------+
| | |
[CLSRSC-115] [CLSRSC-117] [CLSRSC-507]
| | |
v v v
+----------------+ +------------------+ +------------------+
|时钟同步失败 | |CRS启动失败 | |节点通信失败 |
|检查NTP/chrony | |网络/DNS/磁盘问题 | |等待Node1完成 |
+----------------+ +------------------+ +------------------+
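定位CLSRSC错误时,可以先用下面的示意片段找到最近一次rootcrs日志并提取错误行(ORACLE_BASE与主机名解析方式为假设值):
bash
# 示意片段: 提取最近一次 rootcrs 日志中的 CLSRSC-*/CRS-* 错误
ORACLE_BASE=/u01/app/grid
HOST=$(hostname -s)
LOG=$(ls -t "$ORACLE_BASE"/crsdata/"$HOST"/crsconfig/rootcrs_"$HOST"_*.log 2>/dev/null | head -1)
if [ -n "$LOG" ]; then
    echo "分析日志: $LOG"
    grep -nE "CLSRSC-[0-9]+|CRS-[0-9]+|FAILED" "$LOG" | tail -50
else
    echo "未找到 rootcrs 日志, 请确认 ORACLE_BASE 设置"
fi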
root.sh恢复命令序列:
bash
# 步骤1: 在所有节点(除最后一个)执行deconfigure
# 12.2及以上版本使用rootcrs.sh; 更早版本为 perl rootcrs.pl
$GRID_HOME/crs/install/rootcrs.sh -verbose -deconfig -force
# 步骤2: 在最后一个节点执行(清理OCR/Voting Disk)
$GRID_HOME/crs/install/rootcrs.sh -verbose -deconfig -force -lastnode
# 步骤3: 清理ASM磁盘头(如需要)
dd if=/dev/zero of=/dev/sdX1 bs=1024 count=100
# 步骤4: 重新运行root.sh
$GRID_HOME/root.sh
1.5 关键日志文件定位
bash
# CVU预检日志
/tmp/CVU_<version>_<user>/
$ORACLE_BASE/crsdata/<hostname>/cvu/cvutrace.log.0
# 安装主日志
$ORACLE_INVENTORY/logs/installActions<timestamp>.log
# Grid Setup日志
$GRID_HOME/cfgtoollogs/gridSetupActions<timestamp>.log
# root.sh执行日志 (最关键)
$ORACLE_BASE/crsdata/<hostname>/crsconfig/rootcrs_<hostname>_<timestamp>.log
# OUI日志
$GRID_HOME/cfgtoollogs/oui/
# 查找Oracle Inventory位置
cat /etc/oraInst.loc
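oraInst.loc中的inventory_loc即中央Inventory路径,可以顺手定位最近一次安装主日志,示意如下:
bash
# 示意片段: 自动定位中央Inventory与最近一次安装主日志
INV_LOC=$(grep '^inventory_loc=' /etc/oraInst.loc | cut -d= -f2)
echo "Oracle Inventory: $INV_LOC"
ls -t "$INV_LOC"/logs/installActions*.log 2>/dev/null | head -1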
第二部分:Clusterware故障诊断
2.1 Clusterware启动层次与诊断决策树
+------------------------------------------------------------------+
| CLUSTERWARE STARTUP HIERARCHY |
+------------------------------------------------------------------+
Level 1: OHASD (Oracle High Availability Services Daemon)
|
+---> cssdagent (启动 CSSD)
+---> orarootagent (root级别资源)
+---> oraagent (oracle用户资源)
+---> cssdmonitor (监控节点健康)
Level 2a: OHASD oraagent 启动:
|
+---> MDNSD (多播DNS)
+---> GIPCD (进程/节点间通信)
+---> GPNPD (Grid Plug & Play)
+---> EVMD (事件监控)
+---> ASM实例
Level 2b: OHASD rootagent 启动:
|
+---> CRSD (集群资源管理)
+---> CTSSD (时钟同步)
+---> ACFS (ASM集群文件系统)
+---> Diskmon
Level 3+: CRSD 启动应用资源:
|
+---> VIP
+---> 数据库实例
+---> 监听器
+---> 服务
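上述层次中的daemon大多对应OHASD管理的底层(-init)资源,可以直接用crsctl查看它们的状态,与各层逐一对应(资源名以实际环境为准):
bash
# 以 -init 方式查看OHASD这一层管理的底层资源
crsctl stat res -t -init
# 或逐个查看关键daemon资源的状态
for r in ora.cssd ora.ctssd ora.crsd ora.evmd ora.asm ora.gipcd ora.gpnpd ora.mdnsd; do
    crsctl stat res "$r" -init | grep -E "^NAME|^STATE"
done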
2.2 Clusterware启动故障诊断流程
+------------------------------------------------------------------+
| CLUSTERWARE STARTUP FAILURE DIAGNOSTIC |
+------------------------------------------------------------------+
|
v
+----------------------------------+
| crsctl check crs |
+----------------------------------+
|
+----------------------+----------------------+
| | |
[CRS-4639] [CRS-4535] [CRS-4530]
OHASD不可达 CRSD不可达 CSSD不可达
| | |
v v v
+----------------+ +------------------+ +------------------+
|检查OHASD进程 | |检查CRSD进程 | |检查CSSD进程 |
|ps -ef|grep ohas| |检查crsd.log | |检查ocssd.log |
+----------------+ +------------------+ +------------------+
| | |
v v v
+----------------+ +------------------+ +------------------+
|验证OLR: | |验证OCR: | |验证Voting Disk: |
|ocrcheck -local | |ocrcheck | |crsctl query css |
| | | | |votedisk |
+----------------+ +------------------+ +------------------+
| | |
v v v
+----------------+ +------------------+ +------------------+
|启动HAS: | |启动CRSD: | |检查磁盘权限 |
|crsctl start has| |crsctl start res | |ASM diskgroup |
| | |ora.crsd -init | |mount状态 |
+----------------+ +------------------+ +------------------+
2.3 诊断伪代码:Clusterware启动故障
pseudo
FUNCTION diagnose_clusterware_startup():
# 阶段1: 检查CRS整体状态
status = EXECUTE "crsctl check crs"
IF status.contains("CRS-4639"):
# OHASD不可达
PRINT "诊断: Oracle High Availability Services未运行"
# 检查OHASD进程
IF NOT process_running("ohasd.bin"):
# 检查OLR完整性
olr_status = EXECUTE "ocrcheck -local"
IF olr_status.contains("FAILED"):
EXECUTE_RECOVERY "ocrconfig -local -restore <backup>"
ENDIF
# 检查启动脚本
VERIFY "/etc/init.d/init.ohasd" or "systemctl status oracle-ohasd"
# 尝试启动
EXECUTE "crsctl start has"
ENDIF
ELIF status.contains("CRS-4535"):
# CRSD不可达
PRINT "诊断: Cluster Ready Services未运行"
# 检查CRSD日志
ANALYZE "$GRID_HOME/log/<hostname>/crsd/crsd.log"
# 常见原因: ASM磁盘权限
VERIFY asm_disk_permissions() == "660 grid:asmadmin"
# 检查OCR
ocr_status = EXECUTE "ocrcheck"
IF ocr_status.contains("FAILED"):
EXECUTE_RECOVERY "OCR恢复流程"
ENDIF
# 尝试启动CRSD
EXECUTE "crsctl start res ora.crsd -init"
ELIF status.contains("CRS-4530"):
# CSSD不可达
PRINT "诊断: Cluster Synchronization Services未运行"
# 检查Voting Disk
vd_status = EXECUTE "crsctl query css votedisk"
IF vd_status.contains("no voting"):
PRINT "Voting Disk不可访问"
CHECK_ASM_DISKGROUP "containing voting disk"
ENDIF
# 检查网络心跳
CHECK_NETWORK_HEARTBEAT()
# 分析CSSD日志
ANALYZE "$GRID_HOME/log/<hostname>/cssd/ocssd.log"
ENDIF
RETURN diagnosis_report
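上面的伪代码可以简化成一个最小的bash分流脚本(仅做状态识别与提示,不自动执行修复),示意如下:
bash
# 示意脚本: diagnose_clusterware_startup() 的最小bash对应
out=$(crsctl check crs 2>&1)
echo "$out"
case "$out" in
    *CRS-4639*) echo "OHASD未运行: 先 ocrcheck -local, 再 crsctl start has" ;;
    *CRS-4535*) echo "CRSD未运行: 检查ASM磁盘权限与 ocrcheck, 再 crsctl start res ora.crsd -init" ;;
    *CRS-4530*) echo "CSSD未运行: crsctl query css votedisk, 并检查私网心跳" ;;
    *)          echo "CRS整体状态正常或为其他错误, 请结合alert.log分析" ;;
esac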
2.4 常见CRS错误码速查表
| 错误码 | 错误描述 | 根因分析 | 解决方案 |
|---|---|---|---|
| CRS-4639 | Cannot contact Oracle High Availability Services | OHASD进程未运行 | crsctl start has; 检查OLR |
| CRS-4535 | Cannot communicate with Cluster Ready Services | CRSD未运行 | 检查ASM磁盘权限; ocrcheck |
| CRS-4530 | Communications failure with CSS daemon | CSSD未运行 | 检查Voting Disk; 验证网络心跳 |
| CRS-1714 | Unable to discover any voting files | Voting Disk不可访问 | 检查ASM diskgroup; 验证磁盘权限 |
| CRS-4124 | Oracle High Availability Services startup failed | OHASD初始化失败 | 检查OLR; 验证文件权限 |
| CRS-0184 | Cannot communicate with CRS daemon | 网络问题 | 检查private interconnect |
| CRS-2674 | Start of resource failed | 资源依赖失败 | crsctl status resource -dependency |
2.5 网络心跳与节点驱逐诊断
+------------------------------------------------------------------+
| NODE EVICTION DIAGNOSTIC FLOWCHART |
+------------------------------------------------------------------+
|
v
+----------------------------------+
| 分析 ocssd.log 确定驱逐原因 |
| grep "evict\|reboot\|split" |
+----------------------------------+
|
+----------------------+----------------------+
| | |
[网络心跳超时] [磁盘心跳超时] [Split-brain]
| | |
v v v
+----------------+ +------------------+ +------------------+
|检查interconnect| |检查存储I/O | |检查网络分区 |
|ping -I <priv> | |Voting Disk访问 | |两节点都声称 |
|<remote_priv> | | | |membership |
+----------------+ +------------------+ +------------------+
| | |
v v v
+----------------+ +------------------+ +------------------+
|验证MTU设置 | |crsctl query css | |检查交换机配置 |
|检查网络延迟 | |votedisk | |验证VLAN隔离 |
|调整misscount | |检查ASM diskgroup | | |
+----------------+ +------------------+ +------------------+
心跳参数说明与调整:
bash
# 查看当前心跳参数
crsctl get css misscount # 默认30秒
crsctl get css disktimeout # 默认200秒
crsctl get css reboottime # 默认3秒
# 调整misscount (需要以root执行)
crsctl set css misscount 60
crsctl set css disktimeout 200 -force
# 网络诊断命令
oifcfg getif # 查看接口配置
oifcfg iflist -p -n # 列出所有接口
olsnodes -n -i -s -t # 查看节点状态
# MTU测试 (私网MTU通常为9000 jumbo frame)
ping -M do -s 8972 -c 2 -I <source_ip> <remote_node>   # -M do 禁止分片, 才能真实验证MTU
2.6 OCR/Voting Disk恢复流程
pseudo
PROCEDURE recover_ocr_voting_disk():
# 场景1: OCR损坏恢复
IF ocr_corrupted:
# Step 1: 停止所有节点的CRS
FOR each node IN cluster:
EXECUTE "crsctl stop crs -f" ON node
ENDFOR
# Step 2: 在一个节点以exclusive模式启动
EXECUTE "crsctl start crs -excl"
# Step 3: 停止CRSD
EXECUTE "crsctl stop res ora.crsd -init"
# Step 4: 恢复OCR
backup_file = GET_LATEST_BACKUP "ocrconfig -showbackup"
EXECUTE "ocrconfig -restore " + backup_file
# Step 5: 验证并重启
EXECUTE "ocrcheck"
EXECUTE "crsctl stop crs"
EXECUTE "crsctl start crs"
# 场景2: Voting Disk恢复
ELIF voting_disk_corrupted:
# Step 1: 停止CRS
EXECUTE "crsctl stop crs -f"
# Step 2: 禁用自动启动
EXECUTE "crsctl disable crs"
# Step 3: exclusive模式启动
EXECUTE "crsctl start crs -excl"
# Step 4: 替换Voting Disk位置
EXECUTE "crsctl replace votedisk +<NEW_DISKGROUP>"
# Step 5: 重启
EXECUTE "crsctl stop crs"
EXECUTE "crsctl enable crs"
EXECUTE "crsctl start crs"
ENDIF
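上述恢复流程依赖可用的OCR备份。日常可以用下面的命令确认自动备份并补充手动备份与逻辑导出(以root执行,导出路径为示例):
bash
# 预防性检查: 确认OCR备份并补充一次手动备份/逻辑导出
ocrconfig -showbackup
ocrconfig -manualbackup
ocrconfig -export /backup/ocr_export_$(date +%Y%m%d).dmp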
2.7 关键日志文件
bash
# ADR结构 (19c/21c推荐)
$ORACLE_BASE/diag/crs/<hostname>/crs/trace/
├── alert.log # 主告警日志
├── ohasd.trc # OHASD跟踪
├── crsd.trc # CRSD跟踪
├── ocssd.trc # CSSD跟踪
├── evmd.trc # EVMD跟踪
├── gpnpd.trc # GPNPD跟踪
└── gipcd.trc # GIPCD跟踪
# 传统日志位置 (旧版本使用; 19c/21c中大部分日志已迁移至上述ADR目录)
$GRID_HOME/log/<hostname>/
├── alert<hostname>.log # Clusterware告警日志
├── ohasd/ohasd.log
├── crsd/crsd.log
├── cssd/ocssd.log
└── agent/ # 代理日志
# 日志分析命令
grep -E "CRS-|ORA-|TNS-" $GRID_HOME/log/<host>/alert<host>.log
grep -E "evict|reboot|split|brain" ocssd.log
grep -E "network HB|missing for|timeout" ocssd.log
第三部分:ASM实例故障诊断
3.1 ASM故障诊断决策树
+------------------------------------------------------------------+
| ASM INSTANCE FAILURE DIAGNOSTIC |
+------------------------------------------------------------------+
|
v
+----------------------------------+
| srvctl status asm |
| (检查ASM实例状态) |
+----------------------------------+
|
+----------------------+----------------------+
| | |
[实例未运行] [实例运行中] [磁盘组问题]
| | |
v v v
+----------------+ +------------------+ +------------------+
|检查CSS状态 | |检查磁盘发现 | |ALTER DISKGROUP |
|crsctl check css| |asmcmd lsdsk | |xxx MOUNT |
+----------------+ |--discovery | +------------------+
| +------------------+ |
v | v
+----------------+ v +------------------+
|启动ASM实例 | +------------------+ |ORA-15040 |
|srvctl start asm| |检查asm_diskstring| |磁盘组不完整 |
+----------------+ |v$asm_disk | |尝试MOUNT FORCE |
+------------------+ +------------------+
3.2 ASM磁盘发现诊断伪代码
pseudo
FUNCTION diagnose_asm_disk_discovery():
# 阶段1: 检查ASM_DISKSTRING配置
diskstring = QUERY "SHOW PARAMETER asm_diskstring"
PRINT "当前ASM_DISKSTRING: " + diskstring
# 阶段2: 使用kfod验证磁盘可见性
kfod_output = EXECUTE "kfod op=disks asm_diskstring='" + diskstring + "'"
IF kfod_output.count == 0:
PRINT "诊断: 未发现任何ASM磁盘"
# 检查发现路径
IF using_asmlib:
VERIFY "/etc/init.d/oracleasm status" == "loaded"
EXECUTE "/etc/init.d/oracleasm listdisks"
ELIF using_udev:
VERIFY udev_rules_exist("/etc/udev/rules.d/99-oracle-asm*.rules")
EXECUTE "udevadm info --query=all --name=/dev/sdX"
ELIF using_afd:
EXECUTE "asmcmd afd_state"
EXECUTE "asmcmd afd_lsdsk"
ENDIF
ELSE:
# 阶段3: 检查磁盘头状态
FOR each disk IN discovered_disks:
header = EXECUTE "kfed read " + disk.path
IF header.kfbh.type == "KFBTYP_INVALID":
PRINT disk.path + " 不是有效的ASM磁盘"
ELIF header.kfdhdb.hdrsts == "KFDHDR_MEMBER":
PRINT disk.path + " 是diskgroup " + header.kfdhdb.grpname + " 的成员"
ELIF header.kfdhdb.hdrsts == "KFDHDR_CANDIDATE":
PRINT disk.path + " 是候选磁盘,可用于创建diskgroup"
ENDIF
ENDFOR
ENDIF
# 阶段4: 检查磁盘权限
FOR each disk IN discovered_disks:
perms = GET_PERMISSIONS(disk.path)
IF perms.mode != "660" OR perms.owner != "grid" OR perms.group != "asmadmin":
PRINT "权限错误: " + disk.path
PRINT "修复: chown grid:asmadmin " + disk.path
PRINT "修复: chmod 660 " + disk.path
ENDIF
ENDFOR
RETURN discovery_report
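对应伪代码的阶段4,下面是一个批量检查磁盘属主/权限的示意片段(设备路径模式、属主与组名为假设值,请按实际asm_diskstring调整):
bash
# 示意片段: 批量检查ASM磁盘设备的属主与权限
for d in /dev/oracleasm/*; do
    [ -e "$d" ] || continue
    stat -Lc "%n %U:%G %a" "$d"
done | awk '$2 != "grid:asmadmin" || $3 != "660" {print "权限异常: " $0}'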
3.3 常见ORA-15xxx错误速查表
| 错误码 | 错误描述 | 根因分析 | 解决方案 |
|---|---|---|---|
| ORA-15001 | diskgroup does not exist or is not mounted | 磁盘组不存在或未挂载 | ALTER DISKGROUP xxx MOUNT |
| ORA-15017 | diskgroup cannot be mounted | 磁盘缺失或配置问题 | 检查v$asm_disk; 尝试MOUNT FORCE |
| ORA-15018 | diskgroup cannot be created | 发现字符串问题或权限 | 验证asm_diskstring; 检查磁盘权限 |
| ORA-15025 | could not open disk | 权限被拒绝或路径错误 | chown grid:asmadmin; chmod 660 |
| ORA-15040 | diskgroup is incomplete | 部分磁盘丢失 | ALTER DISKGROUP xxx MOUNT FORCE |
| ORA-15042 | ASM disk is missing | 磁盘从系统移除 | 检查OS日志; 验证存储连接 |
| ORA-15063 | insufficient number of disks | 磁盘数不满足冗余要求 | 验证磁盘可见性; 添加磁盘 |
| ORA-15077 | could not locate ASM instance | ASM实例未运行 | srvctl start asm |
| ORA-15081 | failed to submit I/O | ASMLIB驱动问题 | 检查oracleasm状态; 验证磁盘I/O |
3.4 udev规则配置与故障排查
bash
# 标准udev规则文件位置
/etc/udev/rules.d/99-oracle-asmdevices.rules
# 推荐的udev规则格式 (使用SCSI ID)
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL}=="360001234567890000111222333444554", \
SYMLINK+="oracleasm/DATA01", OWNER="grid", GROUP="asmadmin", MODE="0660"
# 获取SCSI ID
/usr/lib/udev/scsi_id -g -u -d /dev/sdb # OL7/8
/lib/udev/scsi_id --page=0x83 --whitelisted --device=/dev/mapper/mpath0 # multipath
# 应用udev规则
udevadm control --reload-rules
udevadm trigger --type=devices
partprobe /dev/sdX
# 验证
ls -la /dev/oracleasm/
常见udev配置错误:
| 错误类型 | 表现 | 修复方法 |
|---|---|---|
| OWNER/GROUP/MODE顺序错误 | 权限未正确应用 | 将这些参数放在SYMLINK之前 |
| scsi_id语法错误 | 不同OL版本语法不同 | 根据OS版本使用正确语法 |
| 缺少SUBSYSTEM | 规则不生效 | 添加SUBSYSTEM=="block" |
| VMware UUID未启用 | 无法获取唯一标识 | VMX文件添加disk.EnableUUID="TRUE" |
| 规则未同步到所有节点 | 节点间磁盘名不一致 | 所有RAC节点规则文件必须相同 |
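针对表中最后一类问题,可以用一个简单的校验脚本确认规则文件在各节点一致(节点列表与文件名为假设值):
bash
# 示意脚本: 校验udev规则文件在所有RAC节点上是否一致
RULE=/etc/udev/rules.d/99-oracle-asmdevices.rules
for n in node1 node2; do
    printf "%s: " "$n"
    ssh "$n" "md5sum $RULE" 2>/dev/null || echo "无法读取 $RULE"
done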
3.5 ASM Filter Driver (AFD) 配置 (19c/21c推荐)
bash
# 检查AFD支持
$GRID_HOME/bin/afddriverstate supported
# 预期输出: AFD-9200: Supported
# 安装前标记磁盘 (以root执行)
export ORACLE_HOME=/u01/app/grid/product/19c
export ORACLE_BASE=/tmp
$ORACLE_HOME/bin/asmcmd afd_label DATA01 /dev/sdb --init
$ORACLE_HOME/bin/asmcmd afd_label DATA02 /dev/sdc --init
# 验证标签
$ORACLE_HOME/bin/asmcmd afd_lslbl /dev/sd*
# 检查AFD状态
asmcmd afd_state
# 预期输出: ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'ENABLED'
# 更新ASM_DISKSTRING以包含AFD
ALTER SYSTEM SET asm_diskstring='AFD:*' SCOPE=BOTH;
3.6 Diskgroup Mount故障恢复流程
+------------------------------------------------------------------+
| DISKGROUP MOUNT FAILURE RECOVERY |
+------------------------------------------------------------------+
|
v
+----------------------------------+
| ALTER DISKGROUP xxx MOUNT |
+----------------------------------+
|
[失败]
|
v
+----------------------------------+
| 检查错误码 |
+----------------------------------+
|
+----------------------+----------------------+
| | |
[ORA-15040] [ORA-15042] [ORA-15017]
磁盘组不完整 磁盘缺失 无法挂载
| | |
v v v
+------------------+ +------------------+ +------------------+
|检查丢失磁盘数量 | |kfed read检查磁盘 | |检查asm_diskstring|
|SELECT * FROM | |头是否损坏 | |验证磁盘权限 |
|v$asm_disk | | | | |
+------------------+ +------------------+ +------------------+
| | |
v v v
+------------------+ +------------------+ +------------------+
|尝试FORCE mount: | |尝试修复磁盘头: | |修正配置后重试 |
|ALTER DISKGROUP | |kfed repair | | |
|xxx MOUNT FORCE | |/dev/xxx | | |
+------------------+ +------------------+ +------------------+
挂载命令层级:
sql
-- 正常挂载
ALTER DISKGROUP data MOUNT;
-- 强制挂载 (部分磁盘不可用)
ALTER DISKGROUP data MOUNT FORCE;
-- 受限模式挂载 (单实例, 维护用)
ALTER DISKGROUP data MOUNT RESTRICTED;
-- 受限强制挂载 (严重故障恢复)
ALTER DISKGROUP data MOUNT RESTRICTED FORCE;
-- 紧急恢复挂载 (最后手段, 可能丢数据)
-- 仅用于多磁盘故障且标准恢复失败的情况
ALTER DISKGROUP data MOUNT RESTRICTED FORCE FOR RECOVERY;
3.7 关键诊断命令
bash
# ASMCMD磁盘发现
asmcmd lsdsk --discovery
asmcmd lsdsk -p --discovery # 包含路径
# ASMCMD磁盘组列表
asmcmd lsdg
# KFOD磁盘发现
kfod op=disks asm_diskstring='/dev/sd*' disks=all
kfod op=groups
kfod disks=all status=TRUE dscvgroup=TRUE
# KFED磁盘头读取
kfed read /dev/sdb1
# 关键字段:
# kfbh.type: KFBTYP_DISKHEAD (有效) 或 KFBTYP_INVALID
# kfdhdb.hdrsts: KFDHDR_MEMBER (在diskgroup中) 或 KFDHDR_CANDIDATE
# kfdhdb.dskname: 磁盘名
# kfdhdb.grpname: 磁盘组名
# SQL查询
SELECT name, state, type, total_mb, free_mb, offline_disks
FROM v$asm_diskgroup;
SELECT group_number, disk_number, name, path,
header_status, mount_status, mode_status, state
FROM v$asm_disk ORDER BY group_number, disk_number;
# ASM日志位置
$ORACLE_BASE/diag/asm/+asm/<instance>/trace/alert_<instance>.log
$GRID_HOME/log/<hostname>/asm/
第四部分:RAC Database安装/创建故障诊断
4.1 DBCA故障诊断决策树
+------------------------------------------------------------------+
| DBCA FAILURE DIAGNOSTIC FLOWCHART |
+------------------------------------------------------------------+
|
v
+----------------------------------+
| 执行 DBCA 创建数据库 |
+----------------------------------+
|
[失败]
|
v
+----------------------------------+
| 检查DBCA日志: |
| $ORACLE_BASE/cfgtoollogs/dbca/ |
| <DB_NAME>/ |
+----------------------------------+
|
+----------------------+----------------------+
| | |
[模板问题] [ASM连接失败] [监听器问题]
| | |
v v v
+----------------+ +------------------+ +------------------+
|PRCS-1046 | |ORA-15055 | |ORA-12514 |
|检查模板完整性 | |ORA-15001 | |ORA-12541 |
|应用最新补丁 | |检查ASM实例状态 | |检查监听器配置 |
+----------------+ +------------------+ +------------------+
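无论落在哪个分支,第一步都是拿到DBCA日志中的具体错误码。下面是一个定位与提取错误的示意片段(ORACLE_BASE与DB_NAME为假设值):
bash
# 示意片段: 定位最近一次DBCA日志并提取错误码
ORACLE_BASE=/u01/app/oracle
DB_NAME=ORCL
DBCA_DIR="$ORACLE_BASE/cfgtoollogs/dbca/$DB_NAME"
ls -lt "$DBCA_DIR" | head
grep -RnE "ORA-[0-9]+|PRC[A-Z]-[0-9]+|SEVERE" "$DBCA_DIR" | tail -50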
4.2 RAC实例启动故障诊断伪代码
pseudo
FUNCTION diagnose_rac_instance_startup(db_name, instance_name):
# 阶段1: 检查集群资源状态
res_status = EXECUTE "crsctl stat res ora." + db_name + ".db -t"
IF res_status.state == "OFFLINE":
# 尝试启动
EXECUTE "srvctl start database -d " + db_name
IF startup_failed:
# 检查告警日志获取具体错误
alert_log = "$ORACLE_BASE/diag/rdbms/" + db_name + "/" + instance_name + "/trace/alert_" + instance_name + ".log"
error = PARSE_LATEST_ERROR(alert_log)
SWITCH error.code:
CASE "ORA-01078": # SPFILE处理失败
PRINT "诊断: SPFILE不存在或不可访问"
VERIFY spfile_exists_in_asm()
IF spfile_corrupted:
CREATE_PFILE_FROM_BACKUP()
STARTUP_WITH_PFILE()
ENDIF
CASE "ORA-01102": # 无法以EXCLUSIVE模式挂载
PRINT "诊断: 另一实例已挂载或残留进程存在"
orphan_procs = EXECUTE "ps -ef | grep ora_.*_" + instance_name
IF orphan_procs.count > 0:
KILL_ORPHAN_PROCESSES(orphan_procs)
ENDIF
# 使用srvctl而非sqlplus
EXECUTE "srvctl start instance -d " + db_name + " -i " + instance_name
CASE "ORA-29702": # Cluster Group Service错误
PRINT "诊断: 集群资源组问题 (常见于GI升级后)"
# 方案1: 重启整个clusterware
EXECUTE "crsctl stop crs" ON all_nodes
EXECUTE "crsctl start crs" ON all_nodes
# 方案2: 如仍失败, 应用GI补丁
CHECK_MOS_FOR_PATCHES("ORA-29702 after GI upgrade")
CASE "ORA-00600": # 内部错误
PRINT "诊断: 内部错误, 需分析trace文件"
trace_file = GET_TRACE_FILE_FROM_ALERT(alert_log)
EXECUTE "tfactl diagcollect -srdc ORA-00600"
# 检查常见RAC相关ORA-600
IF error.arguments.contains("kjbrwreq"):
PRINT "Cache Fusion问题, 检查interconnect"
ELIF error.arguments.contains("kccsbck"):
PRINT "db_unique_name不一致问题"
ENDIF
ENDSWITCH
ENDIF
ENDIF
# 阶段2: 验证interconnect
interconnect = QUERY "SELECT * FROM V$CLUSTER_INTERCONNECTS"
IF interconnect.IS_PUBLIC == "YES":
PRINT "警告: RAC使用public网络作为interconnect, 性能将严重下降"
VERIFY private_network_configuration()
ENDIF
# 阶段3: 验证监听器注册
listener_status = EXECUTE "lsnrctl status"
IF service_not_registered:
EXECUTE "ALTER SYSTEM REGISTER" IN sqlplus
VERIFY remote_listener_parameter()
ENDIF
RETURN diagnosis_report
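对应伪代码的阶段2,可以用下面的查询确认各实例实际使用的interconnect(需在oracle用户下且环境变量已指向目标实例):
bash
# 确认RAC各实例使用的interconnect及其来源
sqlplus -s / as sysdba <<'EOF'
SET LINESIZE 200 PAGESIZE 100
SELECT inst_id, name, ip_address, is_public, source
FROM   gv$cluster_interconnects
ORDER  BY inst_id;
EOF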
4.3 常见DBCA和实例启动错误
| 错误码 | 错误描述 | 根因分析 | 解决方案 |
|---|---|---|---|
| ORA-01078 | Failure in processing system parameters | SPFILE不存在或损坏 | 验证ASM中spfile位置; 从备份创建pfile |
| ORA-01102 | Cannot mount database in EXCLUSIVE mode | 残留进程或另一实例已挂载 | kill残留进程; 使用srvctl启动 |
| ORA-29702 | Error in Cluster Group Service | GI升级后资源组问题 | 重启clusterware; 应用GI补丁 |
| ORA-15055 | Unable to connect to ASM instance | 二进制权限问题 | 运行setasmgidwrap; 验证组成员 |
| ORA-12514 | Listener does not know of service | 服务未注册 | ALTER SYSTEM REGISTER; 检查remote_listener |
| ORA-00600 | Internal Error | 多种原因 | 分析trace文件; 使用MOS ORA-600 Lookup |
| INS-30014 | Unable to check CFS location | 权限问题或主机名过长 | 检查目录权限; 主机名≤15字符 |
4.4 集群资源注册命令
bash
# 添加Administrator-managed RAC数据库
srvctl add database -db orcl \
-oraclehome /u01/app/oracle/product/19c/dbhome_1 \
-dbtype RAC \
-dbname orcl \
-spfile +DATA/ORCL/spfileorcl.ora \
-pwfile +DATA/ORCL/orapworcl \
-diskgroup "DATA,FRA" \
-startoption OPEN \
-stopoption IMMEDIATE \
-role PRIMARY
# 添加实例
srvctl add instance -db orcl -instance orcl1 -node node1
srvctl add instance -db orcl -instance orcl2 -node node2
# 添加Policy-managed RAC数据库
srvctl add database -db orcl \
-oraclehome /u01/app/oracle/product/19c/dbhome_1 \
-dbtype RAC \
-serverpool ora_pool \
-spfile +DATA/ORCL/spfileorcl.ora
# 查看配置
srvctl config database -d orcl
# 启动/停止数据库
srvctl start database -d orcl
srvctl stop database -d orcl -stopoption immediate
# 删除数据库资源 (不删除数据文件)
srvctl remove database -d orcl
4.5 Listener和服务注册诊断
bash
# 检查监听器状态
lsnrctl status LISTENER
lsnrctl services LISTENER
# 验证监听器参数
SQL> show parameter listener
# 关键参数:
# local_listener: 指向本地VIP
# remote_listener: 指向SCAN地址
# 强制服务注册
SQL> ALTER SYSTEM REGISTER;
# 19c最佳实践: remote_listener配置
ALTER SYSTEM SET remote_listener='orcl-scan:1521' SCOPE=BOTH;
# 如果SCAN解析有问题, 使用IP列表
ALTER SYSTEM SET remote_listener='(DESCRIPTION=(ADDRESS_LIST=
(ADDRESS=(PROTOCOL=TCP)(HOST=<scan_ip1>)(PORT=1521))
(ADDRESS=(PROTOCOL=TCP)(HOST=<scan_ip2>)(PORT=1521))
(ADDRESS=(PROTOCOL=TCP)(HOST=<scan_ip3>)(PORT=1521))))' SCOPE=BOTH;
4.6 关键日志文件
bash
# DBCA日志
$ORACLE_BASE/cfgtoollogs/dbca/<DB_NAME>/
$ORACLE_HOME/cfgtoollogs/dbca/
# 数据库告警日志
$ORACLE_BASE/diag/rdbms/<db>/<sid>/trace/alert_<sid>.log
# 查找告警日志位置
SQL> SELECT name, value FROM V$DIAG_INFO WHERE name IN ('Diag Trace', 'Diag Alert');
# 监听器日志
$GRID_HOME/log/<hostname>/listener/listener.log
$ORACLE_BASE/diag/tnslsnr/<hostname>/listener/trace/listener.log
# Clusterware代理日志 (数据库资源相关)
$GRID_HOME/log/<hostname>/agent/crsd/oraagent_oracle/oraagent_oracle.log
# 使用ADRCI分析
adrci
ADRCI> show homes
ADRCI> set home diag/rdbms/orcl/orcl1
ADRCI> show alert -tail 100
ADRCI> show alert -p "MESSAGE_TEXT LIKE '%ORA-%' AND ORIGINATING_TIMESTAMP > SYSTIMESTAMP - 1"
第五部分:19c/21c版本特有问题与补丁建议
5.1 关键MOS Note参考
| MOS Note | 描述 | 适用场景 |
|---|---|---|
| 555.1 | Oracle Database 19c Important Recommended Patches | 所有19c环境必读 |
| 2523220.1 | Database 19 Release Updates Bugs Fixed | 选择RU版本时参考 |
| 2523221.1 | Grid Infrastructure 19 Release Updates Bugs Fixed | GI补丁选择 |
| 2555697.1 | INS-06006 GI RunInstaller Fails with OpenSSH 8.x | OpenSSH 8.x兼容性 |
| 2858850.1 | INS-41814 on RHEL8/OL8 | RHEL8 locale问题 |
| 2835152.1 | CRS-5037 after GI upgrade to 19.14 | GI 19.14+数据库启动失败 |
5.2 19c/21c已知问题与解决方案
| 问题描述 | 影响版本 | 解决方案 |
|---|---|---|
| OpenSSH 8.x导致INS-06006 | 19c base | 应用最新RU或添加KexAlgorithms配置 |
| RHEL8 LC_ALL locale问题 | 19c/21c on RHEL8 | 确保en_US.UTF-8 locale正确配置 |
| GI 19.14+ CRS-5037 | 19.14及以上 | 数据库也需要应用最低补丁级别 |
| ACFS/AFD内核驱动问题 | 内核更新后 | 运行acfsroot install重新编译 |
| 21c非CDB架构不支持 | 21c | 所有数据库必须为CDB架构 |
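针对表中ACFS/AFD驱动与RHEL8 locale两类问题,可以用下面的示意片段快速核对(GRID_HOME为假设值,驱动异常时按表中建议以root重新执行acfsroot install):
bash
# 示意片段: 内核升级后核对驱动状态与locale配置
GRID_HOME=/u01/app/19.0.0/grid
"$GRID_HOME"/bin/acfsdriverstate installed
"$GRID_HOME"/bin/afddriverstate loaded
# RHEL8/OL8 locale快速核对
locale | grep -E "^LANG=|^LC_ALL="
localectl status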
5.3 推荐的补丁策略
bash
# 检查当前补丁级别
$GRID_HOME/OPatch/opatch lspatches
$ORACLE_HOME/OPatch/opatch lspatches
# 安装时应用GI RU
./gridSetup.sh -applyRU /path/to/GI_RU_patch
# 使用opatchauto打补丁 (需要root)
opatchauto apply /path/to/patch -oh $GRID_HOME
# 数据库打补丁 (rolling方式)
# 停止一个实例, 打补丁, 启动, 然后下一个实例
srvctl stop instance -d orcl -i orcl1
$ORACLE_HOME/OPatch/opatch apply /path/to/patch
srvctl start instance -d orcl -i orcl1
# 重复其他实例
# Monthly Recommended Patches (MRP) - Oracle推荐方式
# 从MOS获取最新MRP并按说明应用
第六部分:综合故障诊断快速参考
6.1 Master诊断流程图
+==================================================================+
| ORACLE RAC INSTALLATION MASTER DIAGNOSTIC FLOW |
+==================================================================+
|
[1] 确定故障阶段
|
+----------+-----------+-----------+-----------+
| | | | |
v v v v v
+----------+ +----------+ +----------+ +----------+ +----------+
|GI安装 | |CW启动 | |ASM问题 | |DB创建 | |实例启动 |
|INS-*错误 | |CRS-*错误 | |ORA-15* | |DBCA失败 | |ORA-*错误 |
+----------+ +----------+ +----------+ +----------+ +----------+
| | | | |
v v v v v
+----------+ +----------+ +----------+ +----------+ +----------+
|runcluvfy | |crsctl | |asmcmd | |DBCA日志 | |alert.log |
|CVU日志 | |check crs | |lsdsk | | | |srvctl |
+----------+ +----------+ +----------+ +----------+ +----------+
| | | | |
+----------+-----------+-----------+-----------+
|
[2] 收集诊断信息
|
+------------------------------------------+
| |
v v
+----------------------+ +----------------------+
| 手动收集: | | 自动收集: |
| - 相关日志文件 | | - tfactl diagcollect |
| - crsctl commands | | - diagcollection.pl |
| - sqlplus queries | | - ADRCI incident |
+----------------------+ +----------------------+
|
[3] 分析与修复
|
+------------------------------------------+
| |
v v
+----------------------+ +----------------------+
| 本文档参考: | | 外部资源: |
| - 错误码速查表 | | - MOS知识库 |
| - 诊断伪代码 | | - Oracle文档 |
| - 恢复流程 | | - SR (服务请求) |
+----------------------+ +----------------------+
6.2 常用诊断命令速查表
bash
# ========== Cluster状态检查 ==========
crsctl check cluster -all # 检查所有节点集群状态
crsctl stat res -t # 列出所有资源状态
crsctl check crs # 检查本地CRS状态
olsnodes -n -i -s -t # 列出集群节点信息
# ========== OCR/Voting Disk ==========
ocrcheck # 验证OCR完整性
ocrcheck -local # 验证OLR完整性
ocrconfig -showbackup # 查看OCR备份
crsctl query css votedisk # 查看Voting Disk状态
# ========== ASM ==========
srvctl status asm # ASM实例状态
asmcmd lsdsk --discovery # 发现ASM磁盘
asmcmd lsdg # 列出磁盘组
kfod op=disks # KFOD磁盘发现
kfed read /dev/xxx # 读取磁盘头
# ========== Database ==========
srvctl status database -d <db> # 数据库状态
srvctl config database -d <db> # 数据库配置
lsnrctl status # 监听器状态
# ========== 日志收集 ==========
tfactl diagcollect -node all # TFA诊断收集
diagcollection.pl --collect --crs # CRS诊断收集
adrci # ADR命令行接口
6.3 故障解决检查清单
GI安装前检查:
- runcluvfy.sh stage -pre crsinst 全部通过
- SSH互信双向无密码 (ssh hostname "hostname")
- NTP/chrony配置slew模式 (-x选项)
- 防火墙已禁用 (firewalld, iptables)
- SELinux禁用或permissive
- DNS解析正确 (所有hostname, VIP, SCAN)
- 内核参数符合要求
- cvuqdisk RPM已安装
- 共享存储所有节点可访问且权限正确
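上述清单中的部分项目可以脚本化快速自检,下面是一个示意片段(以root在每个节点执行,SCAN名称为假设值,完整检查仍以runcluvfy为准):
bash
# 示意脚本: GI安装前清单的快速自检
echo "== /tmp 可用空间 ==";  df -h /tmp
echo "== 防火墙状态 ==";     systemctl is-active firewalld
echo "== SELinux ==";        getenforce
echo "== 时钟同步 ==";        chronyc tracking 2>/dev/null || ntpq -p 2>/dev/null
echo "== cvuqdisk ==";       rpm -q cvuqdisk
echo "== SCAN解析 ==";        nslookup rac-scan 2>/dev/null | tail -5   # rac-scan为假设的SCAN名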
实例启动故障快速定位:
- 检查 crsctl stat res -t 确认资源状态
- 检查 alert log 获取具体ORA错误
- 验证 ASM实例运行 (srvctl status asm)
- 验证 interconnect配置 (V$CLUSTER_INTERCONNECTS)
- 验证 监听器状态和服务注册
结论
Oracle RAC安装故障诊断是一项需要系统化方法的复杂任务。本文提供的分层诊断框架(从GI安装到Clusterware、ASM,最后到数据库)为DBA提供了结构化的问题定位路径。关键要点包括:大多数启动类故障最终可归结为磁盘空间、网络连接或文件权限这三类基础因素;日志分析是诊断的核心,alert.log应作为首要检查对象;19c/21c将AFD作为推荐的磁盘管理方式,同时OpenSSH 8.x兼容性问题需要特别关注。建议DBA在生产环境部署前完整执行CVU预检查并解决所有警告,并保持GI和数据库软件处于Oracle推荐的补丁级别(MRP),以避免已知问题影响系统稳定性。