Troubleshooting Oracle CRS: CSSD fails to start and the log shows "unable to escalate to real time"

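After an Oracle 11.2.0.4 RAC installation, CRS failed to start following a reboot because ocssd could not start. The ocssd log shows the error below, which means the CSSD process could not be raised to real-time scheduling.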

clssscSetPrivEnv: unable to set priority to 4
SLOS: cat=-2, opn=scls_set_priority_realtime, dep=1, loc=setsched
unable to escalate to real time

Troubleshooting steps

1. Check the CRS process status

# /u01/app/11.2.0/grid/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Instance Shutdown   
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                                                   
ora.crf
      1        ONLINE  ONLINE       anbob1                                      
ora.crsd
      1        ONLINE  OFFLINE                                                   
ora.cssd
      1        ONLINE  OFFLINE                               STARTING            
ora.cssdmonitor
      1        ONLINE  ONLINE       anbob1                                      
ora.ctssd
      1        ONLINE  OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.evmd
      1        ONLINE  OFFLINE                                                   
ora.gipcd
      1        ONLINE  ONLINE       anbob1                                      
ora.gpnpd
      1        ONLINE  ONLINE       anbob1                                      
ora.mdnsd
      1        ONLINE  ONLINE       anbob1                   
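
With this many resources it can help to filter the listing down to the ones that should be up but are not. The following is a minimal sketch, not an Oracle tool: the function name stalled_resources is hypothetical, and it assumes the two-line layout printed above (a resource name line followed by a line with instance number, TARGET, and STATE).

```shell
# Read "crsctl stat res -t -init" output on stdin and print the resources
# whose TARGET is ONLINE while their STATE is OFFLINE.
stalled_resources() {
    awk '
        /^ora\./ { name = $1; next }      # remember the resource name line
        name != "" && $1 ~ /^[0-9]+$/ && $2 == "ONLINE" && $3 == "OFFLINE" { print name }
    '
}
```

It could be used as `/u01/app/11.2.0/grid/bin/crsctl stat res -t -init | stalled_resources`. For the listing above it would report ora.cssd together with the resources that depend on it.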

2. Check the CSS log

Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
2023-06-01 17:21:07.270: [    CSSD][906602304]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2023-06-01 17:21:07.270: [    CSSD][906602304]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = 2, TraceLevel = 0
2023-06-01 17:21:07.270: [    CSSD][906602304]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2023-06-01 17:21:07.270: [    CSSD][906602304]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2023-06-01 17:21:07.270: [    CSSD][906602304]clsu_load_ENV_levels: Module = CLSF, LogLevel = 0, TraceLevel = 0
2023-06-01 17:21:07.270: [    CSSD][906602304]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
2023-06-01 17:21:07.270: [    CSSD][906602304]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2023-06-01 17:21:07.270: [    CSSD][906602304]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0
[    CSSD][906602304]clsugetconf : Configuration type [4].
2023-06-01 17:21:07.270: [    CSSD][906602304]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (exclusive) mode with uniqueness value 1685611267
2023-06-01 17:21:07.270: [    CSSD][906602304]clssscmain: Environment is production
2023-06-01 17:21:07.270: [    CSSD][906602304]clssscmain: Core file size limit extended
2023-06-01 17:21:07.272: [    CSSD][906602304]clssscmain: GIPCHA down 0
2023-06-01 17:21:07.272: [    CSSD][906602304]clssscGetParameterOLR: OLR fetch for parameter logsize (8) failed with rc 21
2023-06-01 17:21:07.272: [    CSSD][906602304]clssscExtendLimits: The current soft limit for file descriptors is 65536, hard limit is 65536
2023-06-01 17:21:07.272: [    CSSD][906602304]clssscExtendLimits: The current soft limit for locked memory is 4294967295, hard limit is 4294967295
2023-06-01 17:21:07.272: [    CSSD][906602304]clssscGetParameterOLR: OLR fetch for parameter priority (15) failed with rc 21
2023-06-01 17:21:07.272: [    CSSD][906602304]clssscSetPrivEnv: Setting priority to 4
2023-06-01 17:21:07.276: [    CSSD][906602304]clssscSetPrivEnv: unable to set priority to 4
2023-06-01 17:21:07.276: [    CSSD][906602304]SLOS: cat=-2, opn=scls_set_priority_realtime, dep=1, loc=setsched
unable to escalate to real time

2023-06-01 17:21:07.276: [    CSSD][906602304](:CSSSC00011:)clssscExit: A fatal error occurred during initialization
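
The startup log is noisy, so a quick filter for the lines around the priority failure can save time. This is only a sketch: the function name find_rt_failure is hypothetical, and the patterns are taken from the log excerpt above rather than from any Oracle documentation.

```shell
# Print (with line numbers) the ocssd log lines that relate to the
# real-time priority failure.
# Usage: find_rt_failure /path/to/ocssd.log
find_rt_failure() {
    grep -n -E 'clssscSetPrivEnv|scls_set_priority_realtime|unable to escalate to real time|clssscExit' "$1"
}
```

The log path is passed in as an argument because its location depends on the installation; in 11.2 it is usually found under the Grid home's log directory, but treat that as an assumption and adjust it to your environment.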

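3. Check whether cgroups CPU accounting is configured at the system level

systemd applies resource control through unit configuration files. Six unit types support it: services, slices, scopes, sockets, mount points, and swap devices. Under the hood, systemd relies on Linux Control Groups (cgroups) to enforce resource control. There are two cgroup versions: the newer cgroup v2, also known as the unified cgroup hierarchy, and the legacy cgroup v1.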

Linux: GI OCSSD Fails to Start After cgroups Setting Change (Doc ID 1577784.1)
Grid Infrastructure: CSSD Fails to Start on Solaris Local Containers (zones) (Doc ID 1340694.1)

$ cat /cgroups/sysdefault/cpu.rt_*
$ cat /etc/cgconfig.conf
$ rpm -qa|grep libcgroup-tools

For example, /etc/cgconfig.conf may contain:

group sysdefault {
cpu {
cpu.shares = 1024;
cpu.rt_period_us = 1000000;
cpu.rt_runtime_us = 950000; 
}
}

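sched_rt_period_us: the period over which real-time task bandwidth enforcement is measured. The default is 1000000 (microseconds).
sched_rt_runtime_us: the quantum allotted to real-time tasks within each sched_rt_period_us. Setting it to -1 disables RT bandwidth enforcement. By default, RT tasks may consume 95% of the CPU each second, leaving 5% (0.05 seconds) for SCHED_OTHER tasks. The default is 950000 (microseconds).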
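
The relationship between the two values is simple arithmetic, sketched below. The function name rt_share_percent is hypothetical; the /proc/sys/kernel paths are the standard kernel locations for these two settings and are read only if they exist.

```shell
# Print the percentage of each period that real-time tasks may use.
# Prints "unlimited" when the runtime is -1 (enforcement disabled).
# $1: runtime in microseconds, $2: period in microseconds
rt_share_percent() {
    runtime=$1
    period=$2
    if [ "$runtime" -eq -1 ]; then
        echo unlimited
    else
        echo $(( runtime * 100 / period ))
    fi
}

# Report the live kernel values when they are readable (Linux only).
if [ -r /proc/sys/kernel/sched_rt_runtime_us ] && [ -r /proc/sys/kernel/sched_rt_period_us ]; then
    rt_share_percent "$(cat /proc/sys/kernel/sched_rt_runtime_us)" \
                     "$(cat /proc/sys/kernel/sched_rt_period_us)"
fi
```

With the defaults, `rt_share_percent 950000 1000000` prints 95, matching the 95% figure above. The same calculation applies to the cpu.rt_runtime_us and cpu.rt_period_us values of an individual cgroup.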

4. Check for a CRS/OS incompatibility bug
GI Installation On Linux 7 Fails with ‘CSS Startup Failed With Return Code 1’ And CRS-1656 (Doc ID 2935061.1)

This note covers the case where CPU accounting related services have been disabled and the failure is not caused by a cgroup setting.
It recommends installing the 11.2.0.4.201020 (Oct 2020) PSU or later for GI.

$ opatch lspatches -oh /u01/app/oracle/product/11.2.0/dbhome_1 
31668908;OJVM PATCH SET UPDATE 11.2.0.4.201020
31537677;Database Patch Set Update : 11.2.0.4.201020 (31537677)
29938455;OCW Patch Set Update : 11.2.0.4.191015 (29938455)

This PSU is already installed here, so this cause is ruled out.

5. Check whether CPU accounting is enabled at the service level

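Check for related settings such as CPU accounting or CPU quota. CPU accounting is disabled by default. Once it is enabled, implicitly or explicitly, processes can no longer be started with real-time scheduling without extra configuration, because the CPU cgroup controller node they are attached to has no real-time quantum (cpu.rt_runtime_us) allocated. CPU accounting is enabled only when a unit is activated that either sets CPUAccounting=yes explicitly in its unit file or sets it implicitly through a CPU* property such as CPUQuota or CPUShares. Specifying Delegate=yes in a service unit has the same effect.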

$ ls /sys/fs/cgroup/cpu,cpuacct/*.slice

$ find /etc/systemd/system.conf /etc/systemd/system /usr/lib/systemd -type f | xargs grep -i -e CPUAccounting -e CPUWeight -e StartupCPUWeight -e CPUShares -e StartupCPUShares -e CPUQuota -e Delegate|grep -v "^Bin"

For example:

# find /etc/systemd/system.conf /etc/systemd/system /usr/lib/systemd -type f | xargs grep -e CPUAccounting -e CPUWeight -e StartupCPUWeight -e CPUShares -e StartupCPUShares -e CPUQuota
/etc/systemd/system.conf:#DefaultCPUAccounting=no
/etc/systemd/system/asm_pms.service.d/50-CPUQuota.conf:CPUQuota=1200%
/etc/systemd/system/hms.service.d/50-CPUQuota.conf:CPUQuota=1200%
/etc/systemd/system/node_exporter.service.d/50-CPUQuota.conf:CPUQuota=1200%
/etc/systemd/system/rce_agent.service.d/50-CPUQuota.conf:CPUQuota=1200%

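CPUQuota=: sets the cgroup v2 cpu.max parameter or the cgroup v1 cpu.cfs_quota_us parameter. It is the share of CPU time the unit may use, expressed as a percentage. For example, 20% means at most 20% of a single CPU core. The value can exceed 100%; 1200% allows the unit to use up to 12 CPU cores.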
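
To make the percentage concrete, the conversion to a cgroup v1 quota can be sketched as follows. The function name cpuquota_to_cfs_us is hypothetical, and the 100000 microsecond period is an assumption (it is the usual default CFS period); pass a different period as the second argument if your system uses one.

```shell
# Convert a CPUQuota percentage into a CFS quota in microseconds.
# $1: quota percentage without the % sign
# $2: CFS period in microseconds (assumed to be 100000 when omitted)
cpuquota_to_cfs_us() {
    percent=$1
    period_us=${2:-100000}
    echo $(( percent * period_us / 100 ))
}

cpuquota_to_cfs_us 20      # 20% of one core
cpuquota_to_cfs_us 1200    # the CPUQuota=1200% seen in the drop-in files
```

Under that assumption, 20% corresponds to 20000 microseconds per period and 1200% to 1200000 microseconds per period, which is twelve full cores' worth of CPU time.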

If such services exist, stop the service or comment out the setting, then reboot the OS and try starting CRS again.
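
To decide which services to stop, the grep output can be reduced to a list of unit names. This is a rough sketch under stated assumptions: the function name units_with_cpu_settings is hypothetical, input lines are expected in the "path:setting" form printed by grep, and a file inside a "<unit>.d" directory is treated as a drop-in for that unit.

```shell
# Read "path:setting" lines on stdin and print each affected unit once,
# skipping settings that are commented out.
units_with_cpu_settings() {
    awk -F: '
        $2 ~ /^[ \t]*#/ { next }                  # commented-out setting
        {
            n = split($1, parts, "/")
            unit = parts[n]                       # plain unit file
            if (parts[n-1] ~ /\.d$/) {            # drop-in: use the parent directory
                unit = parts[n-1]
                sub(/\.d$/, "", unit)
            }
            if (!(unit in seen)) { seen[unit] = 1; print unit }
        }'
}
```

Piping the find and grep command from this step into units_with_cpu_settings would, for the sample output above, list asm_pms.service, hms.service, node_exporter.service, and rce_agent.service, while ignoring the commented DefaultCPUAccounting line in system.conf.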

6. Analyze the log files
You can analyze the generated core.xx file with gdb and check the cssdOUT.log log. An earlier post covers tools for analyzing core dump files.

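Summary
SysV services, even those running with root privileges, cannot obtain real-time scheduling when the CPUAccounting option is enabled. Once CPUAccounting is enabled for any service, systemd uses the cgroup CPU bandwidth controller globally, and subsequent sched_setscheduler() calls that request a real-time scheduling priority fail unexpectedly. To keep the error from recurring, set the cgroup cpu.rt_runtime_us option for services that need real-time scheduling. The problem is not limited to SysV services; the same restriction applies to native systemd services and user programs.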
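
One way to see which cgroups are affected is to look for groups whose real-time budget is zero. The sketch below assumes a cgroup v1 CPU hierarchy such as the /sys/fs/cgroup/cpu,cpuacct path used earlier; the function name cgroups_without_rt_budget is hypothetical.

```shell
# Print every cgroup directory under $1 whose cpu.rt_runtime_us is 0,
# meaning processes placed there cannot obtain real-time scheduling.
cgroups_without_rt_budget() {
    root=$1
    find "$root" -name cpu.rt_runtime_us -type f 2>/dev/null |
    while IFS= read -r f; do
        value=$(cat "$f" 2>/dev/null)
        if [ "$value" = "0" ]; then
            dirname "$f"
        fi
    done
}

# Scan the cgroup v1 CPU hierarchy only if it is mounted at this path.
if [ -d /sys/fs/cgroup/cpu,cpuacct ]; then
    cgroups_without_rt_budget /sys/fs/cgroup/cpu,cpuacct
fi
```

If the slice that hosts the CRS processes appears in the output, that is consistent with the failure described here. Assigning that group a non-zero cpu.rt_runtime_us, or removing the CPU accounting settings as in step 5, are the two options the article points to.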
