Performance tuning the 'gc cr/current grant 2-way' events (after a host CPU expansion)
Last week I ran into a case where, after a host resource expansion, gc cr grant 2-way waits increased noticeably. While node 1 was already slow, someone manually flushed the buffer cache, which only poured fuel on the fire: node 1 hung shortly afterwards. Killing all of the LOCAL=NO processes did not help, and in the end the background processes were killed to force an instance restart. Below is the one-hour AWR report for node 1 taken just before the restart.
Cache Sizes
                        Begin       End
Buffer Cache:          41,984M     41,984M     Std Block Size:    8K
Shared Pool Size:       8,192M      8,192M     Log Buffer:        264,632K

Load Profile
                         Per Second   Per Transaction   Per Exec   Per Call
DB Time(s):                   107.3             0.6        0.03       0.07
DB CPU(s):                      2.1             0.0        0.00       0.00
Redo size:              1,521,947.0         8,505.6
Logical reads:            392,261.2         2,192.2
Block changes:              6,776.0            37.9
Physical reads:             2,567.2            14.4
Physical writes:              575.0             3.2
User calls:                 1,540.5             8.6
Parses:                       121.2             0.7
Hard parses:                   16.0             0.1
W/A MB processed:               0.5             0.0
Logons:                         2.1             0.0
Executes:                   3,139.9            17.6
Rollbacks:                      0.1             0.0
Transactions:                 178.9

Top 5 Timed Foreground Events
Event                             Waits    Time(s)   Avg wait (ms)   % DB time   Wait Class
gc cr grant 2-way               957,312    113,999        119           29.50     Cluster
gc current block 2-way          566,707     68,269        120           17.67     Cluster
gc current grant 2-way          385,571     47,160        122           12.20     Cluster
gc cr multi block request       196,918     42,349        215           10.96     Cluster
gc buffer busy acquire          326,623     40,036        123           10.36     Cluster

Global Cache Load Profile
                                   Per Second   Per Transaction
Global Cache blocks received:          221.24              1.24
Global Cache blocks served:            162.41              0.91
GCS/GES messages received:           1,639.74              9.16
GCS/GES messages sent:               2,984.85             16.68
DBWR Fusion writes:                     13.10              0.07
Estd Interconnect traffic (KB):      3,972.45

Global Cache Efficiency Percentages (Target local+remote 100%)
Buffer access - local cache %:    99.39
Buffer access - remote cache %:    0.06
Buffer access - disk %:            0.55

Global Cache and Enqueue Services - Workload Characteristics
Avg global enqueue get time (ms):                        0.3
Avg global cache cr block receive time (ms):           114.8
Avg global cache current block receive time (ms):      121.8
Avg global cache cr block build time (ms):               0.0
Avg global cache cr block send time (ms):                0.0
Global cache log flushes for cr blocks served %:         3.8
Avg global cache cr block flush time (ms):               1.1
Avg global cache current block pin time (ms):            0.1
Avg global cache current block send time (ms):           0.0
Global cache log flushes for current blocks served %:    0.0
Avg global cache current block flush time (ms):          1.2

Global Cache and Enqueue Services - Messaging Statistics
Avg message sent queue time (ms):             154.4
Avg message sent queue time on ksxp (ms):       0.3
Avg message received queue time (ms):           0.0
Avg GCS message process time (ms):              0.0
Avg GES message process time (ms):              0.0
% of direct sent messages:                    21.07
% of indirect sent messages:                  60.94
% of flow controlled messages:                17.99
gc cr/current grant 2-way is the transfer of a grant message package: when a session needs a CR or current block, it asks the block's master instance for the X or S privilege. If the requested block is not found in the buffer cache of any instance, the LMS process grants the request and tells the foreground (FG) process to read the block from disk into the local buffer cache.
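A quick way to cross-check the "Avg global cache cr/current block receive time" figures from the AWR is to query gv$sysstat directly. A minimal sketch, assuming 11g-style statistic names; the receive-time statistics are in centiseconds, so multiplying by 10 converts them to milliseconds:

-- Average global cache block receive times per instance (cumulative since startup).
-- Receive-time statistics are in centiseconds; *10 converts them to milliseconds.
select inst_id,
       round(10 * sum(decode(name, 'gc cr block receive time', value, 0)) /
             nullif(sum(decode(name, 'gc cr blocks received', value, 0)), 0), 2)
         as avg_cr_block_receive_ms,
       round(10 * sum(decode(name, 'gc current block receive time', value, 0)) /
             nullif(sum(decode(name, 'gc current blocks received', value, 0)), 0), 2)
         as avg_current_block_receive_ms
  from gv$sysstat
 where name in ('gc cr block receive time',      'gc cr blocks received',
                'gc current block receive time', 'gc current blocks received')
 group by inst_id
 order by inst_id;

On a healthy interconnect these averages are normally only a few milliseconds; in this AWR they are well over 100 ms.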
If this wait time becomes excessive, typical causes include:
- SQL doing excessive I/O, driving cr grants;
- large insert volumes, driving current grants;
- a buffer cache that is far too small;
- flushing the buffer cache, which aggravates gc cr/current grant 2-way;
- excessive inter-node access to the same data;
- very poor interconnect/network performance;
- Oracle bugs…
A gc grant is normally a very small grant-function message packet sent by the LMS process, so inter-node grant traffic does not consume much bandwidth. Taken together with the "gc buffer busy acquire" event and the figures in the Global Cache Load Profile, a network problem can essentially be ruled out.
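To watch these cluster waits live rather than waiting for an AWR snapshot, a query against gv$system_event is enough. A minimal sketch; the counters are cumulative since instance startup:

-- Cluster-class waits per instance, worst first.
select inst_id,
       event,
       total_waits,
       round(time_waited_micro / 1e6)                               as time_waited_s,
       round(time_waited_micro / 1000 / nullif(total_waits, 0), 1)  as avg_wait_ms
  from gv$system_event
 where wait_class = 'Cluster'
 order by time_waited_micro desc;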
Searching MOS for this event, it is easy to find a known situation: when the CPU counts of the two nodes differ, they start different numbers of LMS processes, which can also produce this problem. We then asked the customer whether the host resource expansion had included CPUs. The answer was yes, and this had never been communicated; only the memory increase had been announced.
High "gc cr grant 2-way" / "gc current block 2-way" Wait due to Different CPU Count on Cluster Nodes (Doc ID 1911398.1)
Below is the situation on the two nodes.
# node 1
SQL> show parameter cpu

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     128
parallel_threads_per_cpu             integer     2
resource_manager_cpu_allocation      integer     128

SQL> show parameter gcs_server_processes

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
gcs_server_processes                 integer     6

# node 2
SQL> show parameter cpu

PARAMETER_NAME                                               TYPE        VALUE
------------------------------------------------------------ ----------- -----------
cpu_count                                                    integer     64
parallel_threads_per_cpu                                     integer     2
resource_manager_cpu_allocation                              integer     64

SQL> show parameter gcs_server

PARAMETER_NAME                                               TYPE        VALUE
------------------------------------------------------------ ----------- ---------------------
gcs_server_processes                                         integer     4
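The same comparison can be done from a single session instead of logging into each node; a minimal sketch using gv$parameter:

-- Compare the CPU count and LMS-related setting across all instances in one query.
select inst_id, name, value
  from gv$parameter
 where name in ('cpu_count', 'gcs_server_processes')
 order by name, inst_id;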
Cause:
The output above shows that node 1's CPU count was doubled by that day's expansion, and that node 1 runs gcs_server_processes = 6 while node 2 runs 4. By default gcs_server_processes is derived from the CPU count, so the two instances ended up with an unbalanced number of LMS processes. The node with more LMS processes (node 1 here) can generate cache fusion requests faster than the node with fewer LMS processes (node 2) can service them; node 2 becomes overloaded and cannot keep up symmetrically, and this performance problem appears.
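The number of LMS processes actually running on each instance can be confirmed from gv$process; a minimal sketch that matches them by the "(LMSn)" tag in the PROGRAM column:

-- Count the LMS (global cache service) processes running on each instance.
select inst_id, count(*) as lms_processes
  from gv$process
 where program like '%(LMS%'
 group by inst_id
 order by inst_id;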
Solution:
Set gcs_server_processes to the same value on both nodes and restart the instances. In this case node 1's gcs_server_processes was lowered to 4, and after the parameter change the performance problem did not recur.
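A sketch of the change, assuming an spfile is in use; gcs_server_processes is a static parameter, so it only takes effect after an instance restart, and the instance name in the second example is hypothetical:

-- Pin gcs_server_processes to the same value on all instances (static parameter).
alter system set gcs_server_processes = 4 scope = spfile sid = '*';

-- Or, as in this case, change only node 1 to match node 2 (instance name is illustrative):
-- alter system set gcs_server_processes = 4 scope = spfile sid = 'prod1';

-- Then restart the affected instance(s) for the new value to take effect.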