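Troubleshooting Oracle CRS: CSSD fails to start with “unable to escalate to real time” in the log
After installing Oracle 11.2.0.4 RAC and rebooting, CRS failed to start and reported that ocssd could not be started. The ocssd log showed the errors below, which indicate that escalating the CSSD process to real-time mode failed.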
clssscSetPrivEnv: unable to set priority to 4
SLOS: cat=-2, opn=scls_set_priority_realtime, dep=1, loc=setsched
unable to escalate to real time
Troubleshooting steps
1 Check the CRS process status
# /u01/app/11.2.0/grid/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Instance Shutdown
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE
ora.crf
      1        ONLINE  ONLINE       anbob1
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  OFFLINE                               STARTING
ora.cssdmonitor
      1        ONLINE  ONLINE       anbob1
ora.ctssd
      1        ONLINE  OFFLINE
ora.diskmon
      1        OFFLINE OFFLINE
ora.evmd
      1        ONLINE  OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       anbob1
ora.gpnpd
      1        ONLINE  ONLINE       anbob1
ora.mdnsd
      1        ONLINE  ONLINE       anbob1
2 Check the CSS log
Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
2023-06-01 17:21:07.270: [ CSSD][906602304]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2023-06-01 17:21:07.270: [ CSSD][906602304]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = 2, TraceLevel = 0
2023-06-01 17:21:07.270: [ CSSD][906602304]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2023-06-01 17:21:07.270: [ CSSD][906602304]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2023-06-01 17:21:07.270: [ CSSD][906602304]clsu_load_ENV_levels: Module = CLSF, LogLevel = 0, TraceLevel = 0
2023-06-01 17:21:07.270: [ CSSD][906602304]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
2023-06-01 17:21:07.270: [ CSSD][906602304]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2023-06-01 17:21:07.270: [ CSSD][906602304]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0
[ CSSD][906602304]clsugetconf : Configuration type [4].
2023-06-01 17:21:07.270: [ CSSD][906602304]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (exclusive) mode with uniqueness value 1685611267
2023-06-01 17:21:07.270: [ CSSD][906602304]clssscmain: Environment is production
2023-06-01 17:21:07.270: [ CSSD][906602304]clssscmain: Core file size limit extended
2023-06-01 17:21:07.272: [ CSSD][906602304]clssscmain: GIPCHA down 0
2023-06-01 17:21:07.272: [ CSSD][906602304]clssscGetParameterOLR: OLR fetch for parameter logsize (8) failed with rc 21
2023-06-01 17:21:07.272: [ CSSD][906602304]clssscExtendLimits: The current soft limit for file descriptors is 65536, hard limit is 65536
2023-06-01 17:21:07.272: [ CSSD][906602304]clssscExtendLimits: The current soft limit for locked memory is 4294967295, hard limit is 4294967295
2023-06-01 17:21:07.272: [ CSSD][906602304]clssscGetParameterOLR: OLR fetch for parameter priority (15) failed with rc 21
2023-06-01 17:21:07.272: [ CSSD][906602304]clssscSetPrivEnv: Setting priority to 4
2023-06-01 17:21:07.276: [ CSSD][906602304]clssscSetPrivEnv: unable to set priority to 4
2023-06-01 17:21:07.276: [ CSSD][906602304]SLOS: cat=-2, opn=scls_set_priority_realtime, dep=1, loc=setsched
unable to escalate to real time
2023-06-01 17:21:07.276: [ CSSD][906602304](:CSSSC00011:)clssscExit: A fatal error occurred during initialization
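The failure signature in the excerpt above can be pulled out of a long ocssd.log with a small filter. This is only a minimal sketch: `find_rt_escalation_errors` is a hypothetical helper name, and the log path in the usage comment assumes the usual 11.2 layout under the Grid home, so adjust it to your environment.

```bash
#!/usr/bin/env bash
# Print, with line numbers, the ocssd.log lines that mark a failed attempt
# to switch CSSD to real-time scheduling.
# find_rt_escalation_errors is a hypothetical helper, not an Oracle tool.
find_rt_escalation_errors() {
    local logfile="$1"
    grep -n -E 'unable to set priority|scls_set_priority_realtime|unable to escalate to real time' "$logfile"
}

# Example call (path is an assumption: <GRID_HOME>/log/<hostname>/cssd/ocssd.log):
# find_rt_escalation_errors /u01/app/11.2.0/grid/log/anbob1/cssd/ocssd.log
```

If the function prints nothing and returns non-zero, the log does not contain this signature and the start failure probably has a different cause.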
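3 Check whether cgroups CPU accounting is configured at the system level
systemd configures resource control through unit configuration files. Six unit types are involved: services, slices, scopes, sockets, mount points, and swap devices. Under the hood, systemd relies on Linux Control Groups (cgroups) to implement resource control. There are two cgroup versions: the newer cgroup v2, also known as the unified cgroup hierarchy, and the traditional cgroup v1.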
Linux: GI OCSSD Fails to Start After cgroups Setting Change (Doc ID 1577784.1)
Grid Infrastructure: CSSD Fails to Start on Solaris Local Containers (zones) (Doc ID 1340694.1)
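$ cat /cgroups/sysdefault/cpu.rt_*
$ cat /etc/cgconfig.conf
$ rpm -qa|grep libcgroup-tools
For example, cat /etc/cgconfig.conf shows:
group sysdefault {
    cpu {
        cpu.shares = 1024;
        cpu.rt_period_us = 1000000;
        cpu.rt_runtime_us = 950000;
    }
}
sched_rt_period_us: the period over which real-time task bandwidth enforcement is measured. The default is 1000000 (microseconds).
sched_rt_runtime_us: the quantum allotted to real-time tasks within each sched_rt_period_us period. Setting it to -1 disables RT bandwidth enforcement. By default, RT tasks may consume up to 95% of CPU per second, leaving 5% of CPU per second (0.05 s) for SCHED_OTHER tasks. The default is 950000 (microseconds).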
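The relationship between the two values is simple arithmetic, sketched below. `rt_share_percent` is a hypothetical helper name. The /proc paths in the usage comment are the standard kernel sysctl files, and the function itself only does integer math on whatever two numbers it is given.

```bash
#!/usr/bin/env bash
# rt_share_percent RUNTIME_US PERIOD_US
# Prints the share of each period that real-time tasks may use, as a whole
# percentage, or "unlimited" when the runtime is -1 (enforcement disabled).
# rt_share_percent is a hypothetical helper name.
rt_share_percent() {
    local runtime_us="$1" period_us="$2"
    if [ "$runtime_us" -eq -1 ]; then
        echo "unlimited"
        return 0
    fi
    echo $(( runtime_us * 100 / period_us ))
}

# On a live host the global values could be read like this:
# rt_share_percent "$(cat /proc/sys/kernel/sched_rt_runtime_us)" \
#                  "$(cat /proc/sys/kernel/sched_rt_period_us)"
```

With the defaults quoted above (950000 and 1000000) the result is 95. A cgroup whose cpu.rt_runtime_us is 0 yields 0, which is the situation where a process in that group cannot switch to real-time scheduling.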
4 CRS and OS incompatibility bug
GI Intallation On Linux 7 Fails with ‘CSS Startup Failed With Return Code 1’ And CRS-1656 (Doc ID 2935061.1)
CPU accounting related service(s) has been disabled and this is not caused by cgroup setting.
The recommendation is to install GI PSU 11.2.0.4.201020 (Oct 2020) or later.
$ opatch lspatches -oh /u01/app/oracle/product/11.2.0/dbhome_1
31668908;OJVM PATCH SET UPDATE 11.2.0.4.201020
31537677;Database Patch Set Update : 11.2.0.4.201020 (31537677)
29938455;OCW Patch Set Update : 11.2.0.4.191015 (29938455)
This PSU is already installed, so this cause is ruled out.
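5 CPU Accounting enabled at the service level
Check for related configuration such as CPU Accounting or CPU Quota. CPU Accounting is disabled by default. Once it is enabled, implicitly or explicitly, you can no longer create processes that run with real-time scheduling unless you add extra configuration, because the CPU cgroup controller node the process is attached to has no real-time quantum (cpu.rt_runtime_us) allocated. CPU Accounting takes effect only when a unit is activated that either sets CPUAccounting=yes explicitly in its unit file or implies it through CPU* properties (for example CPUQuota or CPUShares). The same happens when a service unit specifies Delegate=yes.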
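One way to see the effect described above is to look for cgroups that have no real-time quantum. The sketch below assumes a cgroup v1 host whose kernel exposes cpu.rt_runtime_us (not every kernel does) and uses the /sys/fs/cgroup/cpu,cpuacct mount point as the default root. `list_zero_rt_cgroups` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# list_zero_rt_cgroups [ROOT]
# Prints every cgroup directory under ROOT whose cpu.rt_runtime_us is 0,
# i.e. groups whose member processes have no real-time budget.
# list_zero_rt_cgroups is a hypothetical helper; the default ROOT assumes
# a cgroup v1 layout.
list_zero_rt_cgroups() {
    local root="${1:-/sys/fs/cgroup/cpu,cpuacct}"
    local f
    find "$root" -name cpu.rt_runtime_us 2>/dev/null | sort | while IFS= read -r f; do
        if [ "$(cat "$f" 2>/dev/null)" = "0" ]; then
            dirname "$f"
        fi
    done
}

# Example call on a live host:
# list_zero_rt_cgroups
```

If system.slice or a service directory beneath it shows up in the output, processes started from that group are likely to hit the same setsched failure as ocssd.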
$ ls /sys/fs/cgroup/cpu,cpuacct/*.slice
$ find /etc/systemd/system.conf /etc/systemd/system /usr/lib/systemd -type f | xargs grep -i -e CPUAccounting -e CPUWeight -e StartupCPUWeight -e CPUShares -e StartupCPUShares -e CPUQuota -e Delegate|grep -v "^Bin"
For example:
# find /etc/systemd/system.conf /etc/systemd/system /usr/lib/systemd -type f | xargs grep -e CPUAccounting -e CPUWeight -e StartupCPUWeight -e CPUShares -e StartupCPUShares -e CPUQuota
/etc/systemd/system.conf:#DefaultCPUAccounting=no
/etc/systemd/system/asm_pms.service.d/50-CPUQuota.conf:CPUQuota=1200%
/etc/systemd/system/hms.service.d/50-CPUQuota.conf:CPUQuota=1200%
/etc/systemd/system/node_exporter.service.d/50-CPUQuota.conf:CPUQuota=1200%
/etc/systemd/system/rce_agent.service.d/50-CPUQuota.conf:CPUQuota=1200%
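CPUQuota=: sets the cpu.max parameter on cgroup v2 or the cpu.cfs_quota_us parameter on cgroup v1. It expresses the CPU time quota as a percentage. For example, 20% means the unit may use at most 20% of a single CPU core. The value can exceed 100%: 1200% allows the equivalent of 12 CPU cores.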
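To make the percentage concrete, the sketch below converts a CPUQuota value into the CFS quota in microseconds. It assumes the CFS period is 100000 µs (100 ms), which is the usual default but can be changed, so the period is an optional second argument. `cpuquota_to_cfs_us` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# cpuquota_to_cfs_us PERCENT [PERIOD_US]
# Converts a CPUQuota percentage (with or without a trailing %) into the
# CPU time budget per period in microseconds.
# cpuquota_to_cfs_us is a hypothetical helper; the 100000 us default period
# is an assumption.
cpuquota_to_cfs_us() {
    local percent="${1%\%}"          # strip a trailing % sign if present
    local period_us="${2:-100000}"
    echo $(( percent * period_us / 100 ))
}

# Example calls:
# cpuquota_to_cfs_us 20%      # budget for a 20% quota
# cpuquota_to_cfs_us 1200%    # budget for the drop-ins shown above
```

Under that assumption, 1200% works out to 1200000 µs of CPU time per 100000 µs period, which is where the figure of 12 cores comes from.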
If such services exist, stop the service or comment out the configuration parameter, then reboot the OS and try starting CRS again.
6 Analyze the log files
You can analyze the generated core.xx file with gdb and check the cssdOUT.log log. An earlier post covers the tooling: analyzing a core dump file.
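Summary
SysV services, even those running with root privileges, cannot obtain real-time scheduling when the CPUAccounting option is enabled. Once CPUAccounting is enabled for any service, systemd uses the cgroup CPU bandwidth controller globally, and subsequent sched_setscheduler() system calls that request a real-time scheduling priority fail unexpectedly. To keep the error from recurring, set the cgroup cpu.rt_runtime_us option for services that need real-time scheduling. The problem is not limited to SysV services: the same restriction applies to native systemd services and to user programs.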
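A quick way to check whether a given shell or service context can obtain real-time scheduling is to ask chrt (from util-linux) to run a trivial command under SCHED_FIFO, which exercises the same sched_setscheduler() path. This is only a sketch: `probe_realtime` is a hypothetical helper name, and the result reflects the cgroup of the process that runs it, which may not be the one ocssd is started in.

```bash
#!/usr/bin/env bash
# probe_realtime
# Prints "ok" if a SCHED_FIFO priority-1 process can be started from the
# current context, "denied" if the kernel refuses, or "chrt-missing" if the
# chrt utility is not installed.
# probe_realtime is a hypothetical helper name.
probe_realtime() {
    if ! command -v chrt >/dev/null 2>&1; then
        echo "chrt-missing"
        return 2
    fi
    if chrt -f 1 true 2>/dev/null; then
        echo "ok"
    else
        echo "denied"
        return 1
    fi
}

# Example call, run as root from the context you want to test:
# probe_realtime
```

A "denied" result as root is consistent with a missing cpu.rt_runtime_us allocation. It can also come from other restrictions, so treat it as a hint and not as proof.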