面向Oracle 19c/21c的全链路故障诊断指南

Oracle RAC安装是企业级数据库部署中最复杂的任务之一,涉及Grid Infrastructure、Clusterware、ASM和数据库四个紧密耦合的层次。本文提供系统化的故障诊断方法论,通过ASCII流程图伪代码形式呈现诊断逻辑,帮助DBA快速定位和解决安装过程中的各类故障。全文覆盖从CVU预检失败到数据库实例无法启动的完整故障链,并针对19c/21c版本特有问题给出补丁建议。


第一部分:Grid Infrastructure安装故障诊断

1.1 全局诊断决策树

 
 
+------------------------------------------------------------------+
|           GI INSTALLATION FAILURE DIAGNOSTIC FLOWCHART           |
+------------------------------------------------------------------+
                              |
                              v
            +----------------------------------+
            |   运行 runcluvfy.sh 预检查       |
            |   cluvfy stage -pre crsinst     |
            +----------------------------------+
                              |
              +---------------+---------------+
              |                               |
         [PASSED]                        [FAILED]
              |                               |
              v                               v
    +------------------+        +---------------------------+
    | 执行 gridSetup.sh|        | 分析 CVU 输出             |
    +------------------+        | 定位 PRVF-* 错误码        |
              |                 +---------------------------+
              |                               |
    +---------+---------+       +-------------+-------------+
    |                   |       |             |             |
[成功]              [失败]   [SSH失败]    [网络失败]   [存储失败]
    |                   |       |             |             |
    v                   v       v             v             v
+--------+    +------------------+  +-----------+  +-----------+
|完成安装|    |检查INS-*错误码   |  |INS-06006  |  |INS-41112  |
+--------+    |$ORACLE_INVENTORY |  |SSH配置    |  |防火墙问题 |
              |/logs/            |  +-----------+  +-----------+
              +------------------+
                       |
         +-------------+-------------+
         |                           |
    [root.sh失败]              [其他INS错误]
         |                           |
         v                           v
+------------------+      +----------------------+
|检查rootcrs*.log  |      | 参考错误码解析表     |
|CLSRSC-* 错误     |      | 执行针对性修复       |
+------------------+      +----------------------+

1.2 CVU预检查诊断伪代码

 
 
pseudo
FUNCTION diagnose_cvu_failure(cvu_output):
    
    # 阶段1: 解析CVU输出
    errors = parse_prvf_errors(cvu_output)
    
    FOR each error IN errors:
        SWITCH error.code:
            
            CASE "PRVF-5311":  # SSH验证失败
                PRINT "诊断: SSH互信未正确配置"
                EXECUTE check_ssh_connectivity()
                IF openssh_version >= 8.0:
                    APPLY_FIX "MOS Note 2555697.1 - OpenSSH 8.x兼容性"
                    ADD_TO_SSHD_CONFIG "KexAlgorithms +diffie-hellman-group1-sha1"
                ENDIF
                RUN "$GRID_HOME/oui/prov/resources/scripts/sshUserSetup.sh"
                
            CASE "PRVF-5150":  # 磁盘路径不一致
                PRINT "诊断: ASM磁盘在节点间路径不一致"
                FOR each node IN cluster_nodes:
                    VERIFY disk_paths_exist(node)
                    VERIFY udev_rules_identical(node)
                ENDFOR
                
            CASE "PRVF-5636":  # 用户等价性检查失败
                PRINT "诊断: 用户SSH等价性未建立"
                EXECUTE "ssh-copy-id grid@remote_node"
                VERIFY "ssh remote_node hostname" returns_without_password
                
            CASE "PRVF-7617":  # NTP/Chrony配置问题
                PRINT "诊断: 时钟同步服务配置错误"
                IF using_ntpd:
                    VERIFY "-x" slew_mode IN /etc/sysconfig/ntpd
                ELIF using_chrony:
                    VERIFY "-x" IN OPTION of /etc/sysconfig/chronyd
                ENDIF
                
            CASE "PRVF-9652":  # 内核参数不满足
                PRINT "诊断: 内核参数配置不符合要求"
                EXECUTE "cluvfy comp sys -n all -p crs -verbose"
                APPLY_FIXUP_SCRIPT from CVU output
                
    ENDFOR
    
    RETURN remediation_summary

1.3 常见INS错误码速查表

错误码 错误描述 根因分析 解决方案
INS-06006 Passwordless SSH connectivity not set up OpenSSH 8.x兼容性问题 参考MOS 2555697.1添加KexAlgorithms
INS-06003 Failed to setup passwordless SSH SSH配置不完整 运行sshUserSetup.sh脚本
INS-30131 Initial setup failed for installer /tmp空间不足或权限问题 Oracle 确保/tmp至少400MB可用空间
INS-32052 ORACLE_BASE和ORACLE_HOME相同 目录结构配置错误 设置不同路径,如BASE=/u01/app/grid, HOME=/u01/app/19.0.0/grid
INS-41112 Network interface不通 Linux防火墙启用 the gruffdba systemctl stop firewalld; systemctl disable firewalld
INS-40926 节点连接问题 网络配置错误 验证public/private网络、VIP子网配置
INS-13001 OS版本不支持 未认证的操作系统 检查认证矩阵或使用-ignorePrereq参数

1.4 root.sh故障诊断流程

 
 
+------------------------------------------------------------------+
|                root.sh FAILURE DIAGNOSTIC FLOW                    |
+------------------------------------------------------------------+
                              |
                              v
            +----------------------------------+
            |  执行 $GRID_HOME/root.sh         |
            +----------------------------------+
                              |
                         [失败]
                              |
                              v
            +----------------------------------+
            |  检查日志文件:                    |
            |  $ORACLE_BASE/crsdata/<host>/    |
            |  crsconfig/rootcrs_<host>_*.log  |
            +----------------------------------+
                              |
                              v
            +----------------------------------+
            |  解析 CLSRSC-* 错误码            |
            +----------------------------------+
                              |
       +----------------------+----------------------+
       |                      |                      |
  [CLSRSC-115]           [CLSRSC-117]          [CLSRSC-507]
       |                      |                      |
       v                      v                      v
+----------------+   +------------------+   +------------------+
|时钟同步失败    |   |CRS启动失败       |   |节点通信失败      |
|检查NTP/chrony  |   |网络/DNS/磁盘问题 |   |等待Node1完成     |
+----------------+   +------------------+   +------------------+

root.sh恢复命令序列

 
 
bash
# 步骤1: 在所有节点(除最后一个)执行deconfigure
perl $GRID_HOME/crs/install/rootcrs.pl -verbose -deconfig -force

# 步骤2: 在最后一个节点执行(清理OCR/Voting Disk)
perl $GRID_HOME/crs/install/rootcrs.pl -verbose -deconfig -force -lastnode

# 步骤3: 清理ASM磁盘头(如需要)
dd if=/dev/zero of=/dev/sdX1 bs=1024 count=100

# 步骤4: 重新运行root.sh
$GRID_HOME/root.sh

1.5 关键日志文件定位

 
 
bash
# CVU预检日志
/tmp/CVU_<version>_<user>/
$ORACLE_BASE/crsdata/<hostname>/cvu/cvutrace.log.0

# 安装主日志
$ORACLE_INVENTORY/logs/installActions<timestamp>.log

# Grid Setup日志
$GRID_HOME/cfgtoollogs/gridSetupActions<timestamp>.log

# root.sh执行日志 (最关键)
$ORACLE_BASE/crsdata/<hostname>/crsconfig/rootcrs_<hostname>_<timestamp>.log

# OUI日志
$GRID_HOME/cfgtoollogs/oui/

# 查找Oracle Inventory位置
cat /etc/oraInst.loc

第二部分:Clusterware故障诊断

2.1 Clusterware启动层次与诊断决策树

 
 
+------------------------------------------------------------------+
|            CLUSTERWARE STARTUP HIERARCHY                          |
+------------------------------------------------------------------+

Level 1: OHASD (Oracle High Availability Services Daemon)
    |
    +---> cssdagent (启动 CSSD)
    +---> orarootagent (root级别资源)
    +---> oraagent (oracle用户资源)
    +---> cssdmonitor (监控节点健康)
    
Level 2a: OHASD oraagent 启动:
    |
    +---> MDNSD (多播DNS)
    +---> GIPCD (进程/节点间通信)
    +---> GPNPD (Grid Plug & Play)
    +---> EVMD (事件监控)
    +---> ASM实例
    
Level 2b: OHASD rootagent 启动:
    |
    +---> CRSD (集群资源管理)
    +---> CTSSD (时钟同步)
    +---> ACFS (ASM集群文件系统)
    +---> Diskmon
    
Level 3+: CRSD 启动应用资源:
    |
    +---> VIP
    +---> 数据库实例
    +---> 监听器
    +---> 服务

2.2 Clusterware启动故障诊断流程

 
 
+------------------------------------------------------------------+
|          CLUSTERWARE STARTUP FAILURE DIAGNOSTIC                   |
+------------------------------------------------------------------+
                              |
                              v
            +----------------------------------+
            |     crsctl check crs             |
            +----------------------------------+
                              |
       +----------------------+----------------------+
       |                      |                      |
  [CRS-4639]             [CRS-4535]            [CRS-4530]
  OHASD不可达            CRSD不可达            CSSD不可达
       |                      |                      |
       v                      v                      v
+----------------+   +------------------+   +------------------+
|检查OHASD进程   |   |检查CRSD进程      |   |检查CSSD进程      |
|ps -ef|grep ohas|   |检查crsd.log      |   |检查ocssd.log     |
+----------------+   +------------------+   +------------------+
       |                      |                      |
       v                      v                      v
+----------------+   +------------------+   +------------------+
|验证OLR:        |   |验证OCR:          |   |验证Voting Disk:  |
|ocrcheck -local |   |ocrcheck          |   |crsctl query css  |
|                |   |                  |   |votedisk          |
+----------------+   +------------------+   +------------------+
       |                      |                      |
       v                      v                      v
+----------------+   +------------------+   +------------------+
|启动HAS:        |   |启动CRSD:         |   |检查磁盘权限      |
|crsctl start has|   |crsctl start res  |   |ASM diskgroup     |
|                |   |ora.crsd -init    |   |mount状态         |
+----------------+   +------------------+   +------------------+

2.3 诊断伪代码:Clusterware启动故障

 
 
pseudo
FUNCTION diagnose_clusterware_startup():
    
    # 阶段1: 检查CRS整体状态
    status = EXECUTE "crsctl check crs"
    
    IF status.contains("CRS-4639"):
        # OHASD不可达
        PRINT "诊断: Oracle High Availability Services未运行"
        
        # 检查OHASD进程
        IF NOT process_running("ohasd.bin"):
            # 检查OLR完整性
            olr_status = EXECUTE "ocrcheck -local"
            IF olr_status.contains("FAILED"):
                EXECUTE_RECOVERY "ocrconfig -local -restore <backup>"
            ENDIF
            
            # 检查启动脚本
            VERIFY "/etc/init.d/init.ohasd" or "systemctl status oracle-ohasd"
            
            # 尝试启动
            EXECUTE "crsctl start has"
        ENDIF
        
    ELIF status.contains("CRS-4535"):
        # CRSD不可达
        PRINT "诊断: Cluster Ready Services未运行"
        
        # 检查CRSD日志
        ANALYZE "$GRID_HOME/log/<hostname>/crsd/crsd.log"
        
        # 常见原因: ASM磁盘权限
        VERIFY asm_disk_permissions() == "660 grid:asmadmin"
        
        # 检查OCR
        ocr_status = EXECUTE "ocrcheck"
        IF ocr_status.contains("FAILED"):
            EXECUTE_RECOVERY "OCR恢复流程"
        ENDIF
        
        # 尝试启动CRSD
        EXECUTE "crsctl start res ora.crsd -init"
        
    ELIF status.contains("CRS-4530"):
        # CSSD不可达
        PRINT "诊断: Cluster Synchronization Services未运行"
        
        # 检查Voting Disk
        vd_status = EXECUTE "crsctl query css votedisk"
        IF vd_status.contains("no voting"):
            PRINT "Voting Disk不可访问"
            CHECK_ASM_DISKGROUP "containing voting disk"
        ENDIF
        
        # 检查网络心跳
        CHECK_NETWORK_HEARTBEAT()
        
        # 分析CSSD日志
        ANALYZE "$GRID_HOME/log/<hostname>/cssd/ocssd.log"
        
    ENDIF
    
    RETURN diagnosis_report

2.4 常见CRS错误码速查表

错误码 错误描述 根因分析 解决方案
CRS-4639 Cannot contact Oracle High Availability Services OHASD进程未运行 crsctl start has; 检查OLR
CRS-4535 Cannot communicate with Cluster Ready Services What-When-How CRSD未运行 检查ASM磁盘权限; ocrcheck
CRS-4530 Communications failure with CSS daemon CSSD未运行 检查Voting Disk; 验证网络心跳
CRS-1714 Unable to discover any voting files Voting Disk不可访问 检查ASM diskgroup; 验证磁盘权限
CRS-4124 Oracle High Availability Services startup failed Parnassusdata OHASD初始化失败 检查OLR; 验证文件权限
CRS-0184 Cannot communicate with CRS daemon 网络问题 检查private interconnect
CRS-2674 Start of resource failed 资源依赖失败 crsctl status resource -dependency

2.5 网络心跳与节点驱逐诊断

 
 
+------------------------------------------------------------------+
|              NODE EVICTION DIAGNOSTIC FLOWCHART                   |
+------------------------------------------------------------------+
                              |
                              v
            +----------------------------------+
            |  分析 ocssd.log 确定驱逐原因     |
            |  grep "evict\|reboot\|split"    |
            +----------------------------------+
                              |
       +----------------------+----------------------+
       |                      |                      |
  [网络心跳超时]         [磁盘心跳超时]        [Split-brain]
       |                      |                      |
       v                      v                      v
+----------------+   +------------------+   +------------------+
|检查interconnect|   |检查存储I/O       |   |检查网络分区      |
|ping -I <priv>  |   |Voting Disk访问   |   |两节点都声称      |
|<remote_priv>   |   |                  |   |membership        |
+----------------+   +------------------+   +------------------+
       |                      |                      |
       v                      v                      v
+----------------+   +------------------+   +------------------+
|验证MTU设置     |   |crsctl query css  |   |检查交换机配置    |
|检查网络延迟    |   |votedisk          |   |验证VLAN隔离      |
|调整misscount   |   |检查ASM diskgroup |   |                  |
+----------------+   +------------------+   +------------------+

心跳参数说明与调整

 
 
bash
# 查看当前心跳参数
crsctl get css misscount      # 默认30秒
crsctl get css disktimeout    # 默认200秒
crsctl get css reboottime     # 默认3秒

# 调整misscount (需要以root执行)
crsctl set css misscount 60
crsctl set css disktimeout 200 -force

# 网络诊断命令
oifcfg getif                  # 查看接口配置
oifcfg iflist -p -n           # 列出所有接口
olsnodes -n -i -s -t          # 查看节点状态

# MTU测试 (私网MTU通常为9000 jumbo frame)
ping -s 8972 -c 2 -I <source_ip> <remote_node>

2.6 OCR/Voting Disk恢复流程

 
 
pseudo
PROCEDURE recover_ocr_voting_disk():
    
    # 场景1: OCR损坏恢复
    IF ocr_corrupted:
        # Step 1: 停止所有节点的CRS
        FOR each node IN cluster:
            EXECUTE "crsctl stop crs -f" ON node
        ENDFOR
        
        # Step 2: 在一个节点以exclusive模式启动
        EXECUTE "crsctl start crs -excl"
        
        # Step 3: 停止CRSD
        EXECUTE "crsctl stop res ora.crsd -init"
        
        # Step 4: 恢复OCR
        backup_file = GET_LATEST_BACKUP "ocrconfig -showbackup"
        EXECUTE "ocrconfig -restore " + backup_file
        
        # Step 5: 验证并重启
        EXECUTE "ocrcheck"
        EXECUTE "crsctl stop crs"
        EXECUTE "crsctl start crs"
        
    # 场景2: Voting Disk恢复
    ELIF voting_disk_corrupted:
        # Step 1: 停止CRS
        EXECUTE "crsctl stop crs -f"
        
        # Step 2: 禁用自动启动
        EXECUTE "crsctl disable crs"
        
        # Step 3: exclusive模式启动
        EXECUTE "crsctl start crs -excl"
        
        # Step 4: 替换Voting Disk位置
        EXECUTE "crsctl replace votedisk +<NEW_DISKGROUP>"
        
        # Step 5: 重启
        EXECUTE "crsctl stop crs"
        EXECUTE "crsctl enable crs"
        EXECUTE "crsctl start crs"
        
    ENDIF

2.7 关键日志文件

 
 
bash
# ADR结构 (19c/21c推荐)
$ORACLE_BASE/diag/crs/<hostname>/crs/trace/
├── alert.log              # 主告警日志
├── ohasd.trc              # OHASD跟踪
├── crsd.trc               # CRSD跟踪
├── ocssd.trc              # CSSD跟踪
├── evmd.trc               # EVMD跟踪
├── gpnpd.trc              # GPNPD跟踪
└── gipcd.trc              # GIPCD跟踪

# 传统日志位置 (仍有效)
$GRID_HOME/log/<hostname>/
├── alert<hostname>.log    # Clusterware告警日志
├── ohasd/ohasd.log
├── crsd/crsd.log
├── cssd/ocssd.log
└── agent/                 # 代理日志

# 日志分析命令
grep -E "CRS-|ORA-|TNS-" $GRID_HOME/log/<host>/alert<host>.log
grep -E "evict|reboot|split|brain" ocssd.log
grep -E "network HB|missing for|timeout" ocssd.log

第三部分:ASM实例故障诊断

3.1 ASM故障诊断决策树

 
 
+------------------------------------------------------------------+
|               ASM INSTANCE FAILURE DIAGNOSTIC                     |
+------------------------------------------------------------------+
                              |
                              v
            +----------------------------------+
            |   srvctl status asm              |
            |   (检查ASM实例状态)              |
            +----------------------------------+
                              |
       +----------------------+----------------------+
       |                      |                      |
  [实例未运行]           [实例运行中]          [磁盘组问题]
       |                      |                      |
       v                      v                      v
+----------------+   +------------------+   +------------------+
|检查CSS状态     |   |检查磁盘发现      |   |ALTER DISKGROUP   |
|crsctl check css|   |asmcmd lsdsk     |   |xxx MOUNT         |
+----------------+   |--discovery       |   +------------------+
       |             +------------------+           |
       v                      |                     v
+----------------+            v            +------------------+
|启动ASM实例     |   +------------------+  |ORA-15040        |
|srvctl start asm|   |检查asm_diskstring|  |磁盘组不完整     |
+----------------+   |v$asm_disk        |  |尝试MOUNT FORCE  |
                     +------------------+  +------------------+

3.2 ASM磁盘发现诊断伪代码

 
 
pseudo
FUNCTION diagnose_asm_disk_discovery():
    
    # 阶段1: 检查ASM_DISKSTRING配置
    diskstring = QUERY "SHOW PARAMETER asm_diskstring"
    PRINT "当前ASM_DISKSTRING: " + diskstring
    
    # 阶段2: 使用kfod验证磁盘可见性
    kfod_output = EXECUTE "kfod op=disks asm_diskstring='" + diskstring + "'"
    
    IF kfod_output.count == 0:
        PRINT "诊断: 未发现任何ASM磁盘"
        
        # 检查发现路径
        IF using_asmlib:
            VERIFY "/etc/init.d/oracleasm status" == "loaded"
            EXECUTE "/etc/init.d/oracleasm listdisks"
        ELIF using_udev:
            VERIFY udev_rules_exist("/etc/udev/rules.d/99-oracle-asm*.rules")
            EXECUTE "udevadm info --query=all --name=/dev/sdX"
        ELIF using_afd:
            EXECUTE "asmcmd afd_state"
            EXECUTE "asmcmd afd_lsdsk"
        ENDIF
        
    ELSE:
        # 阶段3: 检查磁盘头状态
        FOR each disk IN discovered_disks:
            header = EXECUTE "kfed read " + disk.path
            
            IF header.kfbh.type == "KFBTYP_INVALID":
                PRINT disk.path + " 不是有效的ASM磁盘"
            ELIF header.kfdhdb.hdrsts == "KFDHDR_MEMBER":
                PRINT disk.path + " 是diskgroup " + header.kfdhdb.grpname + " 的成员"
            ELIF header.kfdhdb.hdrsts == "KFDHDR_CANDIDATE":
                PRINT disk.path + " 是候选磁盘,可用于创建diskgroup"
            ENDIF
        ENDFOR
    ENDIF
    
    # 阶段4: 检查磁盘权限
    FOR each disk IN discovered_disks:
        perms = GET_PERMISSIONS(disk.path)
        IF perms.mode != "660" OR perms.owner != "grid" OR perms.group != "asmadmin":
            PRINT "权限错误: " + disk.path
            PRINT "修复: chown grid:asmadmin " + disk.path
            PRINT "修复: chmod 660 " + disk.path
        ENDIF
    ENDFOR
    
    RETURN discovery_report

3.3 常见ORA-15xxx错误速查表

错误码 错误描述 根因分析 解决方案
ORA-15001 diskgroup does not exist or is not mounted 磁盘组不存在或未挂载 ALTER DISKGROUP xxx MOUNT
ORA-15017 diskgroup cannot be mounted 磁盘缺失或配置问题 检查v$asm_disk; 尝试MOUNT FORCE
ORA-15018 diskgroup cannot be created 发现字符串问题或权限 验证asm_diskstring; 检查磁盘权限
ORA-15025 could not open disk 权限被拒绝或路径错误 Oraexcel chown grid:asmadmin; chmod 660
ORA-15040 diskgroup is incomplete 部分磁盘丢失 ALTER DISKGROUP xxx MOUNT FORCE
ORA-15042 ASM disk is missing 磁盘从系统移除 检查OS日志; 验证存储连接
ORA-15063 insufficient number of disks 磁盘数不满足冗余要求 验证磁盘可见性; 添加磁盘
ORA-15077 could not locate ASM instance ASM实例未运行 srvctl start asm
ORA-15081 failed to submit I/O ASMLIB驱动问题 检查oracleasm状态; 验证磁盘I/O

3.4 udev规则配置与故障排查

 
 
bash
# 标准udev规则文件位置
/etc/udev/rules.d/99-oracle-asmdevices.rules

# 推荐的udev规则格式 (使用SCSI ID)
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL}=="360001234567890000111222333444554", \
  SYMLINK+="oracleasm/DATA01", OWNER="grid", GROUP="asmadmin", MODE="0660"

# 获取SCSI ID
/usr/lib/udev/scsi_id -g -u -d /dev/sdb       # OL7/8
/lib/udev/scsi_id --page=0x83 --whitelisted --device=/dev/mapper/mpath0  # multipath

# 应用udev规则
udevadm control --reload-rules
udevadm trigger --type=devices
partprobe /dev/sdX

# 验证
ls -la /dev/oracleasm/

常见udev配置错误

错误类型 表现 修复方法
OWNER/GROUP/MODE顺序错误 权限未正确应用 将这些参数放在SYMLINK之前
scsi_id语法错误 不同OL版本语法不同 根据OS版本使用正确语法
缺少SUBSYSTEM 规则不生效 添加SUBSYSTEM=="block"
VMware UUID未启用 无法获取唯一标识 VMX文件添加disk.EnableUUID="TRUE"
规则未同步到所有节点 节点间磁盘名不一致 所有RAC节点规则文件必须相同

3.5 ASM Filter Driver (AFD) 配置 (19c/21c推荐)

 
 
bash
# 检查AFD支持
$GRID_HOME/bin/afddriverstate supported
# 预期输出: AFD-9200: Supported

# 安装前标记磁盘 (以root执行)
export ORACLE_HOME=/u01/app/grid/product/19c
export ORACLE_BASE=/tmp
$ORACLE_HOME/bin/asmcmd afd_label DATA01 /dev/sdb --init
$ORACLE_HOME/bin/asmcmd afd_label DATA02 /dev/sdc --init

# 验证标签
$ORACLE_HOME/bin/asmcmd afd_lslbl /dev/sd*

# 检查AFD状态
asmcmd afd_state
# 预期输出: ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'ENABLED'

# 更新ASM_DISKSTRING以包含AFD
ALTER SYSTEM SET asm_diskstring='AFD:*' SCOPE=BOTH;

3.6 Diskgroup Mount故障恢复流程

 
 
+------------------------------------------------------------------+
|            DISKGROUP MOUNT FAILURE RECOVERY                       |
+------------------------------------------------------------------+
                              |
                              v
            +----------------------------------+
            |  ALTER DISKGROUP xxx MOUNT       |
            +----------------------------------+
                              |
                         [失败]
                              |
                              v
            +----------------------------------+
            |  检查错误码                       |
            +----------------------------------+
                              |
       +----------------------+----------------------+
       |                      |                      |
  [ORA-15040]            [ORA-15042]           [ORA-15017]
  磁盘组不完整           磁盘缺失              无法挂载
       |                      |                      |
       v                      v                      v
+------------------+  +------------------+  +------------------+
|检查丢失磁盘数量  |  |kfed read检查磁盘 |  |检查asm_diskstring|
|SELECT * FROM     |  |头是否损坏        |  |验证磁盘权限      |
|v$asm_disk        |  |                  |  |                  |
+------------------+  +------------------+  +------------------+
       |                      |                      |
       v                      v                      v
+------------------+  +------------------+  +------------------+
|尝试FORCE mount:  |  |尝试修复磁盘头:   |  |修正配置后重试    |
|ALTER DISKGROUP   |  |kfed repair      |  |                  |
|xxx MOUNT FORCE   |  |/dev/xxx         |  |                  |
+------------------+  +------------------+  +------------------+

挂载命令层级

 
 
sql
-- 正常挂载
ALTER DISKGROUP data MOUNT;

-- 强制挂载 (部分磁盘不可用)
ALTER DISKGROUP data MOUNT FORCE;

-- 受限模式挂载 (单实例, 维护用)
ALTER DISKGROUP data MOUNT RESTRICTED;

-- 受限强制挂载 (严重故障恢复)
ALTER DISKGROUP data MOUNT RESTRICTED FORCE;

-- 紧急恢复挂载 (最后手段, 可能丢数据)
-- 仅用于多磁盘故障且标准恢复失败的情况
ALTER DISKGROUP data MOUNT RESTRICTED FORCE FOR RECOVERY;

3.7 关键诊断命令

 
 
bash
# ASMCMD磁盘发现
asmcmd lsdsk --discovery
asmcmd lsdsk -p --discovery    # 包含路径

# ASMCMD磁盘组列表
asmcmd lsdg

# KFOD磁盘发现
kfod op=disks asm_diskstring='/dev/sd*' disks=all
kfod op=groups
kfod status=TRUE disk=all dscvgroup=TRUE

# KFED磁盘头读取
kfed read /dev/sdb1
# 关键字段:
# kfbh.type: KFBTYP_DISKHEAD (有效) 或 KFBTYP_INVALID
# kfdhdb.hdrsts: KFDHDR_MEMBER (在diskgroup中) 或 KFDHDR_CANDIDATE
# kfdhdb.dskname: 磁盘名
# kfdhdb.grpname: 磁盘组名

# SQL查询
SELECT name, state, type, total_mb, free_mb, offline_disks 
FROM v$asm_diskgroup;

SELECT group_number, disk_number, name, path, 
       header_status, mount_status, mode_status, state
FROM v$asm_disk ORDER BY group_number, disk_number;

# ASM日志位置
$ORACLE_BASE/diag/asm/+asm/<instance>/trace/alert_<instance>.log
$GRID_HOME/log/<hostname>/asm/

第四部分:RAC Database安装/创建故障诊断

4.1 DBCA故障诊断决策树

 
 
+------------------------------------------------------------------+
|              DBCA FAILURE DIAGNOSTIC FLOWCHART                    |
+------------------------------------------------------------------+
                              |
                              v
            +----------------------------------+
            |  执行 DBCA 创建数据库            |
            +----------------------------------+
                              |
                         [失败]
                              |
                              v
            +----------------------------------+
            |  检查DBCA日志:                   |
            |  $ORACLE_BASE/cfgtoollogs/dbca/  |
            |  <DB_NAME>/                      |
            +----------------------------------+
                              |
       +----------------------+----------------------+
       |                      |                      |
  [模板问题]             [ASM连接失败]         [监听器问题]
       |                      |                      |
       v                      v                      v
+----------------+   +------------------+   +------------------+
|PRCS-1046       |   |ORA-15055         |   |ORA-12514        |
|检查模板完整性  |   |ORA-15001         |   |ORA-12541        |
|应用最新补丁    |   |检查ASM实例状态   |   |检查监听器配置   |
+----------------+   +------------------+   +------------------+

4.2 RAC实例启动故障诊断伪代码

 
 
pseudo
FUNCTION diagnose_rac_instance_startup(db_name, instance_name):
    
    # 阶段1: 检查集群资源状态
    res_status = EXECUTE "crsctl stat res ora." + db_name + ".db -t"
    
    IF res_status.state == "OFFLINE":
        # 尝试启动
        EXECUTE "srvctl start database -d " + db_name
        
        IF startup_failed:
            # 检查告警日志获取具体错误
            alert_log = "$ORACLE_BASE/diag/rdbms/" + db_name + "/" + instance_name + "/trace/alert_" + instance_name + ".log"
            error = PARSE_LATEST_ERROR(alert_log)
            
            SWITCH error.code:
                
                CASE "ORA-01078":  # SPFILE处理失败
                    PRINT "诊断: SPFILE不存在或不可访问"
                    VERIFY spfile_exists_in_asm()
                    IF spfile_corrupted:
                        CREATE_PFILE_FROM_BACKUP()
                        STARTUP_WITH_PFILE()
                    ENDIF
                    
                CASE "ORA-01102":  # 无法以EXCLUSIVE模式挂载
                    PRINT "诊断: 另一实例已挂载或残留进程存在"
                    orphan_procs = EXECUTE "ps -ef | grep ora_.*_" + instance_name
                    IF orphan_procs.count > 0:
                        KILL_ORPHAN_PROCESSES(orphan_procs)
                    ENDIF
                    # 使用srvctl而非sqlplus
                    EXECUTE "srvctl start instance -d " + db_name + " -i " + instance_name
                    
                CASE "ORA-29702":  # Cluster Group Service错误
                    PRINT "诊断: 集群资源组问题 (常见于GI升级后)"
                    # 方案1: 重启整个clusterware
                    EXECUTE "crsctl stop crs" ON all_nodes
                    EXECUTE "crsctl start crs" ON all_nodes
                    # 方案2: 如仍失败, 应用GI补丁
                    CHECK_MOS_FOR_PATCHES("ORA-29702 after GI upgrade")
                    
                CASE "ORA-00600":  # 内部错误
                    PRINT "诊断: 内部错误, 需分析trace文件"
                    trace_file = GET_TRACE_FILE_FROM_ALERT(alert_log)
                    EXECUTE "tfactl diagcollect -srdc ORA-00600"
                    # 检查常见RAC相关ORA-600
                    IF error.arguments.contains("kjbrwreq"):
                        PRINT "Cache Fusion问题, 检查interconnect"
                    ELIF error.arguments.contains("kccsbck"):
                        PRINT "db_unique_name不一致问题"
                    ENDIF
                    
            ENDSWITCH
        ENDIF
    ENDIF
    
    # 阶段2: 验证interconnect
    interconnect = QUERY "SELECT * FROM V$CLUSTER_INTERCONNECTS"
    IF interconnect.IS_PUBLIC == "YES":
        PRINT "警告: RAC使用public网络作为interconnect, 性能将严重下降"
        VERIFY private_network_configuration()
    ENDIF
    
    # 阶段3: 验证监听器注册
    listener_status = EXECUTE "lsnrctl status"
    IF service_not_registered:
        EXECUTE "ALTER SYSTEM REGISTER" IN sqlplus
        VERIFY remote_listener_parameter()
    ENDIF
    
    RETURN diagnosis_report

4.3 常见DBCA和实例启动错误

错误码 错误描述 根因分析 解决方案
ORA-01078 Failure in processing system parameters SPFILE不存在或损坏 验证ASM中spfile位置; 从备份创建pfile
ORA-01102 Cannot mount database in EXCLUSIVE mode 残留进程或另一实例已挂载 kill残留进程; 使用srvctl启动
ORA-29702 Error in Cluster Group Service GI升级后资源组问题 重启clusterware; 应用GI补丁
ORA-15055 Unable to connect to ASM instance 二进制权限问题 运行setasmgidwrap; 验证组成员
ORA-12514 Listener does not know of service 服务未注册 ALTER SYSTEM REGISTER; 检查remote_listener
ORA-00600 Internal Error 多种原因 分析trace文件; 使用MOS ORA-600 Lookup
INS-30014 Unable to check CFS location 权限问题或主机名过长 检查目录权限; 主机名≤15字符

4.4 集群资源注册命令

 
 
bash
# 添加Administrator-managed RAC数据库
srvctl add database -db orcl \
  -oraclehome /u01/app/oracle/product/19c/dbhome_1 \
  -dbtype RAC \
  -dbname orcl \
  -spfile +DATA/ORCL/spfileorcl.ora \
  -pwfile +DATA/ORCL/orapworcl \
  -diskgroup "DATA,FRA" \
  -startoption OPEN \
  -stopoption IMMEDIATE \
  -role PRIMARY

# 添加实例
srvctl add instance -db orcl -instance orcl1 -node node1
srvctl add instance -db orcl -instance orcl2 -node node2

# 添加Policy-managed RAC数据库
srvctl add database -db orcl \
  -oraclehome /u01/app/oracle/product/19c/dbhome_1 \
  -dbtype RAC \
  -serverpool ora_pool \
  -spfile +DATA/ORCL/spfileorcl.ora

# 查看配置
srvctl config database -d orcl

# 启动/停止数据库
srvctl start database -d orcl
srvctl stop database -d orcl -stopoption immediate

# 删除数据库资源 (不删除数据文件)
srvctl remove database -d orcl

4.5 Listener和服务注册诊断

 
 
bash
# 检查监听器状态
lsnrctl status LISTENER
lsnrctl services LISTENER

# 验证监听器参数
SQL> show parameter listener
# 关键参数:
# local_listener: 指向本地VIP
# remote_listener: 指向SCAN地址

# 强制服务注册
SQL> ALTER SYSTEM REGISTER;

# 19c最佳实践: remote_listener配置
ALTER SYSTEM SET remote_listener='orcl-scan:1521' SCOPE=BOTH;

# 如果SCAN解析有问题, 使用IP列表
ALTER SYSTEM SET remote_listener='(DESCRIPTION=(ADDRESS_LIST=
  (ADDRESS=(PROTOCOL=TCP)(HOST=<scan_ip1>)(PORT=1521))
  (ADDRESS=(PROTOCOL=TCP)(HOST=<scan_ip2>)(PORT=1521))
  (ADDRESS=(PROTOCOL=TCP)(HOST=<scan_ip3>)(PORT=1521))))' SCOPE=BOTH;

4.6 关键日志文件

 
 
bash
# DBCA日志
$ORACLE_BASE/cfgtoollogs/dbca/<DB_NAME>/
$ORACLE_HOME/cfgtoollogs/dbca/

# 数据库告警日志
$ORACLE_BASE/diag/rdbms/<db>/<sid>/trace/alert_<sid>.log

# 查找告警日志位置
SQL> SELECT name, value FROM V$DIAG_INFO WHERE name IN ('Diag Trace', 'Diag Alert');

# 监听器日志
$GRID_HOME/log/<hostname>/listener/listener.log
$ORACLE_BASE/diag/tnslsnr/<hostname>/listener/trace/listener.log

# Clusterware代理日志 (数据库资源相关)
$GRID_HOME/log/<hostname>/agent/crsd/oraagent_oracle/oraagent_oracle.log

# 使用ADRCI分析
adrci
ADRCI> show homes
ADRCI> set home diag/rdbms/orcl/orcl1
ADRCI> show alert -tail 100
ADRCI> show alert -p "MESSAGE_TEXT LIKE '%ORA-%' AND ORIGINATING_TIMESTAMP > SYSTIMESTAMP - 1"

第五部分:19c/21c版本特有问题与补丁建议

5.1 关键MOS Note参考

MOS Note 描述 适用场景
555.1 Oracle Database 19c Important Recommended Patches 所有19c环境必读
2523220.1 Database 19 Release Updates Bugs Fixed 选择RU版本时参考
2523221.1 Grid Infrastructure 19 Release Updates Bugs Fixed GI补丁选择
2555697.1 INS-06006 GI RunInstaller Fails with OpenSSH 8.x OpenSSH 8.x兼容性
2858850.1 INS-41814 on RHEL8/OL8 RHEL8 locale问题
2835152.1 CRS-5037 after GI upgrade to 19.14 GI 19.14+数据库启动失败

5.2 19c/21c已知问题与解决方案

问题描述 影响版本 解决方案
OpenSSH 8.x导致INS-06006 19c base 应用最新RU或添加KexAlgorithms配置
RHEL8 LC_ALL locale问题 19c/21c on RHEL8 确保en_US.UTF-8 locale正确配置
GI 19.14+ CRS-5037 19.14及以上 数据库也需要应用最低补丁级别
ACFS/AFD内核驱动问题 内核更新后 运行acfsroot install重新编译
21c非CDB架构不支持 21c 所有数据库必须为CDB架构

5.3 推荐的补丁策略

 
 
bash
# 检查当前补丁级别
$GRID_HOME/OPatch/opatch lspatches
$ORACLE_HOME/OPatch/opatch lspatches

# 安装时应用GI RU
./gridSetup.sh -applyRU /path/to/GI_RU_patch

# 使用opatchauto打补丁 (需要root)
opatchauto apply /path/to/patch -oh $GRID_HOME

# 数据库打补丁 (rolling方式)
# 停止一个实例, 打补丁, 启动, 然后下一个实例
srvctl stop instance -d orcl -i orcl1
$ORACLE_HOME/OPatch/opatch apply /path/to/patch
srvctl start instance -d orcl -i orcl1
# 重复其他实例

# Monthly Recommended Patches (MRP) - Oracle推荐方式
# 从MOS获取最新MRP并按说明应用

第六部分:综合故障诊断快速参考

6.1 Master诊断流程图

 
 
+==================================================================+
|         ORACLE RAC INSTALLATION MASTER DIAGNOSTIC FLOW            |
+==================================================================+
                              |
            [1] 确定故障阶段
                              |
       +----------+-----------+-----------+-----------+
       |          |           |           |           |
       v          v           v           v           v
+----------+ +----------+ +----------+ +----------+ +----------+
|GI安装    | |CW启动    | |ASM问题   | |DB创建    | |实例启动  |
|INS-*错误 | |CRS-*错误 | |ORA-15*   | |DBCA失败  | |ORA-*错误 |
+----------+ +----------+ +----------+ +----------+ +----------+
       |          |           |           |           |
       v          v           v           v           v
+----------+ +----------+ +----------+ +----------+ +----------+
|runcluvfy | |crsctl    | |asmcmd    | |DBCA日志  | |alert.log |
|CVU日志   | |check crs | |lsdsk     | |          | |srvctl    |
+----------+ +----------+ +----------+ +----------+ +----------+
       |          |           |           |           |
       +----------+-----------+-----------+-----------+
                              |
            [2] 收集诊断信息
                              |
       +------------------------------------------+
       |                                          |
       v                                          v
+----------------------+              +----------------------+
| 手动收集:            |              | 自动收集:            |
| - 相关日志文件       |              | - tfactl diagcollect |
| - crsctl commands    |              | - diagcollection.pl  |
| - sqlplus queries    |              | - ADRCI incident     |
+----------------------+              +----------------------+
                              |
            [3] 分析与修复
                              |
       +------------------------------------------+
       |                                          |
       v                                          v
+----------------------+              +----------------------+
| 本文档参考:          |              | 外部资源:            |
| - 错误码速查表       |              | - MOS知识库          |
| - 诊断伪代码         |              | - Oracle文档         |
| - 恢复流程           |              | - SR (服务请求)      |
+----------------------+              +----------------------+

6.2 常用诊断命令速查表

 
 
bash
# ========== Cluster状态检查 ==========
crsctl check cluster -all        # 检查所有节点集群状态
crsctl stat res -t               # 列出所有资源状态
crsctl check crs                 # 检查本地CRS状态
olsnodes -n -i -s -t             # 列出集群节点信息

# ========== OCR/Voting Disk ==========
ocrcheck                         # 验证OCR完整性
ocrcheck -local                  # 验证OLR完整性
ocrconfig -showbackup            # 查看OCR备份
crsctl query css votedisk        # 查看Voting Disk状态

# ========== ASM ==========
srvctl status asm                # ASM实例状态
asmcmd lsdsk --discovery         # 发现ASM磁盘
asmcmd lsdg                      # 列出磁盘组
kfod op=disks                    # KFOD磁盘发现
kfed read /dev/xxx               # 读取磁盘头

# ========== Database ==========
srvctl status database -d <db>   # 数据库状态
srvctl config database -d <db>   # 数据库配置
lsnrctl status                   # 监听器状态

# ========== 日志收集 ==========
tfactl diagcollect -node all     # TFA诊断收集
diagcollection.pl --collect --crs  # CRS诊断收集
adrci                            # ADR命令行接口

6.3 故障解决检查清单

GI安装前检查

  • runcluvfy.sh stage -pre crsinst 全部通过
  • SSH互信双向无密码 (ssh hostname "hostname")
  • NTP/chrony配置slew模式 (-x选项)
  • 防火墙已禁用 (firewalld, iptables)
  • SELinux禁用或permissive
  • DNS解析正确 (所有hostname, VIP, SCAN)
  • 内核参数符合要求
  • cvuqdisk RPM已安装
  • 共享存储所有节点可访问且权限正确

实例启动故障快速定位

  • 检查 crsctl stat res -t 确认资源状态
  • 检查 alert log 获取具体ORA错误
  • 验证 ASM实例运行 (srvctl status asm)
  • 验证 interconnect配置 (V$CLUSTER_INTERCONNECTS)
  • 验证 监听器状态和服务注册

结论

Oracle RAC安装故障诊断是一项需要系统化方法的复杂任务。本文提供的分层诊断框架——从GI安装到Clusterware、ASM、最后到数据库——为DBA提供了结构化的问题定位路径。关键要点包括:80%的启动问题源于磁盘空间、网络连接或文件权限这三个基本因素;日志分析是诊断的核心,alert.log应作为首要检查对象;19c/21c版本引入了AFD作为推荐的磁盘管理方式,同时OpenSSH 8.x兼容性问题需要特别关注。建议DBA在生产环境部署前,完整执行CVU预检查并解决所有警告,同时保持GI和数据库软件处于Oracle推荐的补丁级别(MRP),以避免已知问题影响系统稳定性。