Resolving 19c Oracle Database Performance Woes: Unraveling the Mystery of 'Free Buffer Waits'

 Introduction:

 

A recent upgrade to version 19.21 left a customer grappling with a perplexing performance decline, marked by a surge in "free buffer waits." Jobs that once completed in a manageable two hours were suddenly stretching past ten hours, prompting a closer examination of the underlying issue.

The customer had implemented a multi-tenant architecture utilizing a single Pluggable Database (PDB).

 

Unveiling the Culprit:

 

The initial investigation pinpointed a conspicuous increase in "free buffer waits," which was degrading throughput and disrupting the normal flow of operations. Recognizing the critical nature of this bottleneck, the focus shifted to a thorough hanganalyze dump to unravel the intricacies of the problem.

Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'rdbms ipc message'<='free buffer waits'
     Chain 1 Signature Hash: 0x8e1d98d3
 [b] Chain 2 Signature: 'rdbms ipc message'<='free buffer waits'
     Chain 2 Signature Hash: 0x8e1d98d3
 [c] Chain 3 Signature: 'rdbms ipc message'<='free buffer waits'
     Chain 3 Signature Hash: 0x8e1d98d3
 
Oracle session identified by:
    {
                instance: 2 (Test)
                   os id: 142493
              process id: 129, oracle@Test
              session id: 123
        session serial #: 54038
             module name: 3 (Test)
                  pdb id: 3 (Test)
    }
    is waiting for 'free buffer waits' with wait info:
    {
 
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (Test)
                   os id: 34845
              process id: 38, oracle@Test (DBW1)
              session id: 4599
        session serial #: 42097
                  pdb id: 1 (CDB$ROOT)
    }
    which is waiting for 'rdbms ipc message' with wait info:
    {
                      p1: 'timeout'=0xeb
            time in wait: 0.000000 sec
      heur. time in wait: 0.005517 sec
           timeout after: 2.350000 sec
                 wait id: 6852613

                blocking: 7 sessions
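
For reference, a system-wide hang analysis dump of the kind quoted above is typically captured from SQL*Plus with oradebug; the level and scope shown here are a common choice, not necessarily the exact commands used in this case:

-- Connect as SYSDBA on any instance; -g all covers every RAC instance.
sqlplus / as sysdba
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
-- The dump is written to the trace directory under the diagnostic destination.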

 

We initiated an investigation into why DBWR was waiting on 'rdbms ipc message.' As we delved deeper into the troubleshooting, a significant discovery emerged: several known bugs with matching call stacks were identified.

 

One such bug, BUG 32577147, pertains specifically to 12.2 Real Application Clusters (RAC) environments. It manifests as DBWR slaves waiting on 'rdbms ipc message' while blocking sessions on both 'free buffer waits' and 'latch: cache buffers chains.'

We meticulously examined the load profile to determine if there was any discernible increase when compared to a previously stable period. However, our investigation failed to reveal any notable changes or spikes in the load profile during the identified problematic timeframe.

A comprehensive solution for the issue is outlined in the standard MOS document titled "Resolving Issues Where 'free buffer waits' Are Observed Waiting for Buffers to be Freed" with Document ID 1476046.1.

 

Reducing Waits:

There are two main scenarios in which high 'free buffer waits' occur. First, make sure that DBWR is able to write out dirty blocks quickly enough. If write throughput is within acceptable thresholds, the other possibility is that the buffer cache is too small. Use the following to address each scenario (a short diagnostic sketch follows the list):

 

DBWR is not fast enough clearing dirty blocks.

Check whether the CPU is saturated. A saturated CPU can amplify wait events because background processes do not get enough CPU to make progress.

Check for slow I/O (poor data file write performance). Some file systems have poor write performance (writes take too long), which impacts DBWR's ability to keep enough clean buffers in the buffer cache.

DBWR achieves optimal throughput when asynchronous I/O is available to it and may not be able to keep up with buffer demands otherwise. If your platform does not support asynchronous I/O, adding multiple DB writer processes can help divide the workload.

Tune checkpoints so that dirty buffers are flushed quickly enough.

Tune the SQL to eliminate excess I/O.

Buffer cache is too small:

If the buffer cache is too small and filled with hot blocks, sessions will be starved for free buffers (clean, cold blocks) and will spend too much time looking for free buffers and/or posting DBWR to write dirty blocks and free them. Increase the DB_CACHE_SIZE parameter and monitor the effect of the change.

Tune the SQL to eliminate excess I/O.
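
The following sketch (not part of the MOS note) helps decide which scenario applies by comparing time spent on 'free buffer waits' with DBWR's own write event; the parameter values at the end are illustrative only:

SELECT event,
       total_waits,
       ROUND(time_waited_micro / 1e6) AS seconds_waited,
       ROUND(average_wait * 10, 2)    AS avg_wait_ms   -- average_wait is in centiseconds
FROM   v$system_event
WHERE  event IN ('free buffer waits', 'db file parallel write', 'write complete waits')
ORDER  BY time_waited_micro DESC;

-- If writes are the bottleneck and the platform supports it:
-- ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE=SPFILE;  -- enable asynchronous/direct I/O
-- ALTER SYSTEM SET db_writer_processes  = 4        SCOPE=SPFILE;  -- example value only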

 

We conducted a thorough analysis with OSWatcher to investigate any potential I/O slowdown, but no anomalies were detected: I/O latency was consistently well below 2 ms.
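
As a cross-check from inside the database, average data file write latency can be estimated from v$filestat with a query along these lines (a sketch; writetim is recorded in centiseconds):

SELECT d.name,
       f.phywrts,
       ROUND(f.writetim * 10 / NULLIF(f.phywrts, 0), 2) AS avg_write_ms
FROM   v$filestat f
       JOIN v$datafile d ON d.file# = f.file#
ORDER  BY avg_write_ms DESC NULLS LAST;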

 

Presently, SGA_MAX_SIZE and SGA_TARGET are configured at 80 GB and 60 GB, respectively. To address the identified issue, the team decided to increase DB_CACHE_SIZE from 0 (unset) to 25 GB.
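
The change itself amounts to a single parameter adjustment at the CDB level; the exact statement is not shown in the original, but it would be along these lines:

ALTER SYSTEM SET db_cache_size = 25G SID='*' SCOPE=BOTH;  -- explicit 25 GB buffer cache floor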

Updating the parameter from 0 to 25 GB alleviated the 'free buffer waits' concern. Can the issue be considered resolved at this point?


The snapshot above illustrates the chronological progression of the 'free buffer waits' observations over time and outlines the step-by-step measures taken to address and resolve the issue.

 


What Next??

While the report time has significantly decreased from 10 hours to 4 hours after the parameter adjustment, it's noteworthy that there remains a performance degradation compared to the previously optimal 2-hour timeframe.

Additionally, the team has begun encountering intermittent instances of ORA-04031 errors.

Upon a more in-depth examination of the ORA-04031 error, it was revealed that, of the seven shared pool sub-pools, the seventh was the one triggering the issue. Notably, both the SQLA and KGLH0 components within this sub-pool were growing significantly.
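
Growth of these components can be tracked with a query such as the following sketch against v$sgastat (the per-sub-pool breakdown itself comes from the ORA-04031 trace):

-- Top shared pool consumers; SQLA is the cursor SQL area, KGLH0 the library cache heap 0.
SELECT name,
       ROUND(bytes / 1024 / 1024 / 1024, 1) AS gb
FROM   v$sgastat
WHERE  pool = 'shared pool'
ORDER  BY bytes DESC
FETCH FIRST 20 ROWS ONLY;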

In an effort to resolve the identified issue, the team increased the shared pool size to 57 GB, which successfully mitigated the ORA-04031 problem. However, "free buffer waits" subsequently resurfaced, accompanied by a notable decline in overall performance. The chart above shows, on the 15th, a recurrence of the yellow portion corresponding to Configuration-class waits.

 

Solution:

 

After scrutinizing numerous trace files for discernible patterns, no anomalies were found. However, a closer examination of a DBWR trace file revealed a potentially significant detail: the "target_pdbs" value was recorded as 160. This raised concerns, considering the database had only a single PDB. The high value of "target_pdbs" became a focal point of investigation, prompting questions about its implications on memory allocation for each PDB.

 

Shifting the perspective towards a configuration-related issue rather than a bug, further exploration unearthed several parameters set at elevated values both at the Container Database (CDB) and Pluggable Database (PDB) levels, contrary to recommended practices. This discovery hinted at a potential configuration defect contributing to the observed performance issues.

Furthermore, it was observed that after the team increased SGA_MAX_SIZE, the "target_pdbs" value automatically adjusted to 200 on the 15th. This prompted the team to make further accommodations by increasing the shared pool and buffer cache to align with the elevated "target_pdbs" value, and consequently to raise SGA_MAX_SIZE again in response.

 



While initially investigating the change in the "target_pdbs" value, we discovered that this parameter is automatically adjusted based on SGA_MAX_SIZE when it is not explicitly set. When the parameter is left unspecified, its default value is calculated dynamically as the maximum SGA size allocated for the Container Database (CDB) divided by 512 MB per Pluggable Database (PDB).
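
Assuming that derivation, the observed values are consistent:

-- Default target_pdbs = SGA_MAX_SIZE / 512 MB per PDB (derivation described above)
--   80 GB SGA_MAX_SIZE :  80 * 1024 / 512 = 160  (value seen before the change)
--  100 GB SGA_MAX_SIZE : 100 * 1024 / 512 = 200  (value seen on the 15th)
-- The current value can be checked directly, if the parameter is exposed on your release:
SELECT name, value, isdefault FROM v$parameter WHERE name = 'target_pdbs';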

 

Now What About ORA-04031??

As you may be aware, the ORA-04031 error signifies a memory constraint in the System Global Area (SGA): one of the SGA pools could not be grown because insufficient memory was available. In this scenario, the ORA-04031 occurred within the shared pool.

Upon scrutinizing the trace files, it became evident that a substantial portion of the shared pool memory was being allocated to the "SQLA 0003" and "KGLH0 0003" memory components, accounting for 48% and 21%, respectively.

TOP 20 MEMORY USES ACROSS SGA HEAP 1 - 7
----------------------------------------------
"SQLA 0003" 26 GB 48% >>>>>>>>
"KGLH0 0003" 11 GB 21% >>>>>>>>
"free memory " 10 GB 19%
 
PDB-MODIFIED PARAMETERS:
...
sga_min_size = 0

shared_pool_size = 32212254720 >>>>>>

 

 

It's important to note a known issue in Multitenant environments (CDB/PDB), where elevated SQLA and KGLH0 memory usage occurs when SHARED_POOL_SIZE or SGA_MIN_SIZE is configured at the PDB level. This concern is elucidated in the following document:

 

"ORA-04031 on Multitenant Database with Excessive Amounts of KGLH0 and / or SQLA Memory and Parameter SHARED_POOL_SIZE or SGA_MIN_SIZE Set at the PDB Level" (Doc ID 2590172.1)

 

Analysis of the trace files clearly shows that shared_pool_size is indeed set at the PDB level, as seen in the PDB-MODIFIED PARAMETERS excerpt above.
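
Which parameters are persisted at the PDB level can also be confirmed from the root container with a query such as this sketch (pdb_spfile$ is the internal table that stores PDB-specific spfile entries; column names may vary by release):

SELECT pdb_uid, name, value$
FROM   pdb_spfile$
ORDER  BY pdb_uid, name;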

To address the ORA-04031 issue at its root, it is recommended to remove the shared_pool_size setting from the PDB configuration.

 

Addressing the configuration-level concerns, the team opted to reset the values for the following parameters:

 

-- Modify parameters at the CDB level

ALTER SYSTEM SET sga_max_size = 107374182400 SID='*' SCOPE=SPFILE;

ALTER SYSTEM SET sga_target = 107374182400 SID='*' SCOPE=SPFILE;

ALTER SYSTEM SET shared_pool_size = 12G SID='*' SCOPE=SPFILE;

ALTER SYSTEM SET db_cache_size = 12G SID='*' SCOPE=SPFILE;

ALTER SYSTEM SET "_fix_control" = '31626438:0' SID='*' SCOPE=SPFILE;

 

-- Log in to the PDB and reset the following two parameters:

ALTER SYSTEM RESET sga_target SID='*' SCOPE=SPFILE;

ALTER SYSTEM RESET shared_pool_size SID='*' SCOPE=SPFILE;

 

Additionally, it is recommended to apply the patch for Bug 35403248, which addresses sessions hanging on 'free buffer waits.' The patch should be applied promptly on top of the 19.21 RU, as recommended in MOS Note 555.1.

 

Following the implementation of these changes, the application team reported a notable improvement in both database and application performance. The parameter resets, combined with the patch for Bug 35403248, resolved the identified issues and restored the overall health and performance of the database and its applications.


