Investigating a CPU Spike and Resource/Cluster Hang in an Oracle RAC Environment

Introduction:

In a complex Oracle Real Application Clusters (RAC) environment, unexpected issues can arise, leading to system instability and performance degradation. In this blog post, we will delve into a specific incident involving a CPU spike and resource hang that occurred during the execution of a high-parallelism SQL job. We will analyze the root causes and explore potential solutions to mitigate such problems in the future.


The Incident:

On a particular day between 2:00 PM and 6:30 PM IST, an SQL job with a Degree of Parallelism (DoP) set to 200 was initiated on Node#3 around 2:12 PM IST. Soon after, the CPU Idle% dropped to single digits, reaching as low as 1% at 3:10 PM IST. This performance degradation led to Node#6 being evicted following a dirty detach reconfiguration.


During the incident, it was observed that the shutdown process on Node#3 remained hung for an extended period, and the node was only restarted at 4:44 PM IST per the syslog. The Cluster Ready Services (CRS) stack came back online at 4:59 PM IST. Throughout the duration of the problem, the CRS PE MASTER (master node) remained on Node#3, which left the other nodes stuck restarting resources or experiencing issues with srvctl.


Furthermore, the peer nodes reported that their CRS daemon (crsd) slaves were unable to contact the crsd master on Node#3 during the two-hour period and, as a result, could not complete resource actions. It was evident that the node experienced high CPU utilization due to the 200 DoP query, causing all processes to become stuck and ultimately leading to the system shutdown.


Node#3 was in a partially functional state: it was not evicted from the cluster, since network communication was still operational (cssd runs at high priority), yet crsd could not schedule tasks properly due to the excessive CPU usage. Eventually, the crsd master role was moved to other nodes to address the issue.


Node#3 was the first node affected by the issue; it experienced scheduling problems, and its stack was shut down at 4:06 PM IST. As noted above, the node restarted at 4:44 PM IST and its CRS stack came online at 4:59 PM IST, while the CRS PE MASTER remained on Node#3, leaving the other nodes stuck restarting resources or with hanging srvctl commands.


At 5:25 PM IST, Node#1 encountered a similar problem. Several resources went into an UNKNOWN state because the master CRS daemon (crsd) was unreachable from Node#1. Even after Node#1 was restarted, those resources remained damaged and in an UNKNOWN state. The ASM alert log showed no issues with the disk groups, and the cause appeared to be related to the initial hang on Node#3.


Root Causes and Observations:

1. The SQL query executed with a DoP of 200, but the parallel_force_local parameter was set to TRUE, causing all of the parallel execution to happen on Node#3 instead of being distributed across the RAC nodes (see the diagnostic sketch after this list). This resulted in high CPU utilization on Node#3 and eventually led to the shutdown of the system.

https://docs.oracle.com/en/database/oracle/oracle-database/19/vldbg/parameters-parallel-exec.html#GUID-5E724762-3BA3-4D6F-9836-4285960545B1


2. The parallel_level parameter of DBMS_PARALLEL_EXECUTE.RESUME_TASK was set to 200, spawning a large number of scheduler jobs (the exact count is determined at runtime). Ideally, this value should be restricted to a reasonable number, such as below 50, to prevent excessive execution time or CPU choking (see the DBMS_PARALLEL_EXECUTE sketch after this list).

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_PARALLEL_EXECUTE.html#GUID-E43F62E7-0DD7-4AB8-96D7-16677FDF1E1F


3. Another contributing factor was the PARALLEL_FORCE_LOCAL parameter being set to TRUE for the customer's instance, forcing all of the spawned jobs to run on the same RAC instance (Node#3). This confined the parallel execution to a single node, exacerbating the high CPU utilization on Node#3.


4. The JOB_QUEUE_PROCESSES parameter was set to 999, which caps the maximum number of job slave processes that can run concurrently at the database level. Increasing this value to its default of 4000 would have allowed more jobs to run and prevented the given DB operation from hanging due to the lack of available job slots (the parameter check is included in the diagnostic sketch below).
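As a quick way to confirm root causes 1, 3, and 4 on a similar system, a minimal diagnostic sketch along the following lines can be used (run as a DBA user, for example from SQL*Plus). The parameter names are standard Oracle names, but the values and instance numbers returned will of course depend on the affected environment.

    -- 1) Are parallel execution and job queues constrained per instance?
    SELECT inst_id, name, value
    FROM   gv$parameter
    WHERE  name IN ('parallel_force_local', 'job_queue_processes', 'cpu_count')
    ORDER  BY name, inst_id;

    -- 2) On which instances are the PX slaves of running statements spawned?
    --    With parallel_force_local = TRUE, all rows come from a single instance.
    SELECT inst_id, COUNT(*) AS px_slaves
    FROM   gv$px_session
    GROUP  BY inst_id
    ORDER  BY inst_id;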
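For root cause 2, the shape of the job was roughly as follows. This is a minimal sketch only, assuming a ROWID-chunked task; the task name, table name, and UPDATE statement are hypothetical placeholders rather than the customer's actual job. The key point is the parallel_level argument, which directly controls how many scheduler jobs are spawned.

    DECLARE
      l_task CONSTANT VARCHAR2(30)  := 'DEMO_PX_TASK';      -- hypothetical task name
      l_sql  CONSTANT VARCHAR2(400) :=
        'UPDATE demo_tab SET processed = ''Y'' ' ||
        'WHERE rowid BETWEEN :start_id AND :end_id';        -- hypothetical statement
    BEGIN
      DBMS_PARALLEL_EXECUTE.CREATE_TASK(task_name => l_task);
      -- Split the work into ROWID ranges of about 10,000 rows each.
      DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
        task_name   => l_task,
        table_owner => USER,
        table_name  => 'DEMO_TAB',
        by_row      => TRUE,
        chunk_size  => 10000);
      -- parallel_level is the number of scheduler jobs spawned for the task.
      -- The incident job used 200; a value below ~50 is the safer starting point.
      DBMS_PARALLEL_EXECUTE.RUN_TASK(
        task_name      => l_task,
        sql_stmt       => l_sql,
        language_flag  => DBMS_SQL.NATIVE,
        parallel_level => 40);
      DBMS_PARALLEL_EXECUTE.DROP_TASK(task_name => l_task);
    END;
    /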


Recommended Solutions:

To address the issues encountered in this incident and prevent similar problems in the future, the following actions are recommended:


1. Review and optimize the parallelism settings: Evaluate the necessity of a high DoP (Degree of Parallelism) and ensure it aligns with the available CPU and memory resources. Consider adjusting the parallel_level parameter in DBMS_PARALLEL_EXECUTE.RESUME_TASK to a reasonable value to avoid overloading the system (see the sizing sketch after this list).


2. Adjust the PARALLEL_FORCE_LOCAL parameter: Assess the impact of running all parallel jobs on a single instance versus distributing them across the RAC nodes. Set this parameter to FALSE to allow for better load balancing and resource utilization across the cluster (see the parameter-change sketch below).
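For recommendation 1, one pragmatic approach is to derive parallel_level from the instance's CPU capacity rather than hard-coding a large number. The 50% ceiling and the cap of 50 in the sketch below are illustrative rules of thumb, not Oracle-prescribed values.

    DECLARE
      l_cpu   NUMBER;
      l_level PLS_INTEGER;
    BEGIN
      -- Read the CPU count visible to this instance.
      SELECT TO_NUMBER(value) INTO l_cpu
      FROM   v$parameter
      WHERE  name = 'cpu_count';

      -- Illustrative cap: at most half the CPUs, and never more than 50 jobs.
      l_level := LEAST(GREATEST(TRUNC(l_cpu / 2), 1), 50);

      DBMS_OUTPUT.PUT_LINE('Suggested parallel_level: ' || l_level);
      -- This value would then be passed as parallel_level to
      -- DBMS_PARALLEL_EXECUTE.RUN_TASK or RESUME_TASK.
    END;
    /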
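For recommendation 2 (and for root cause 4), the parameter changes themselves are simple. The sketch below assumes an spfile-managed RAC database; it should be validated outside production first, since allowing cross-instance parallel execution shifts load onto the other nodes.

    -- Allow PX slaves to be spawned on any instance instead of only the invoking one.
    ALTER SYSTEM SET parallel_force_local = FALSE SCOPE=BOTH SID='*';

    -- Restore the job-queue limit to the documented default of 4000.
    ALTER SYSTEM SET job_queue_processes = 4000 SCOPE=BOTH SID='*';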
