Resolving 19c Oracle Database Performance Woes: Unraveling the Mystery of 'Free Buffer Waits'
Introduction:
A recent upgrade to version 19.21 left a customer
grappling with a perplexing performance decline, marked by a surge in
"free buffer waits." Jobs that once completed in a manageable two hours were
suddenly stretching beyond the ten-hour mark, prompting a closer
examination of the underlying issues.
The customer had implemented a multi-tenant architecture
utilizing a single Pluggable Database (PDB).
Unveiling the Culprit:
The initial investigation pinpointed a conspicuous increase
in "free buffer waits," casting a shadow on system efficiency and
causing disruptions in the normal flow of operations. Recognizing the critical
nature of this bottleneck, the focus shifted to conducting a thorough
hanganalyze to unravel the intricacies of the problem.
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'rdbms ipc message'<='free buffer waits'
     Chain 1 Signature Hash: 0x8e1d98d3
 [b] Chain 2 Signature: 'rdbms ipc message'<='free buffer waits'
     Chain 2 Signature Hash: 0x8e1d98d3
 [c] Chain 3 Signature: 'rdbms ipc message'<='free buffer waits'
     Chain 3 Signature Hash: 0x8e1d98d3

Oracle session identified by:
{
   instance: 2 (Test)
   os id: 142493
   process id: 129, oracle@Test
   session id: 123
   session serial #: 54038
   module name: 3 (Test)
   pdb id: 3 (Test)
}
is waiting for 'free buffer waits' with wait info: {
and is blocked by
=> Oracle session identified by:
   {
      instance: 2 (Test)
      os id: 34845
      process id: 38, oracle@Test (DBW1)
      session id: 4599
      session serial #: 42097
      pdb id: 1 (CDB$ROOT)
   }
   which is waiting for 'rdbms ipc message' with wait info:
   {
      p1: 'timeout'=0xeb
      time in wait: 0.000000 sec
      heur. time in wait: 0.005517 sec
      timeout after: 2.350000 sec
      wait id: 6852613
      blocking: 7 sessions
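For readers who want to reproduce this step: the exact options used in this case were not recorded, but a system-wide hang analysis of this kind is typically generated with oradebug from a SYSDBA session. A generic sketch:

-- Run from SQL*Plus connected as SYSDBA
oradebug setmypid
oradebug unlimit
-- level 3 is the usual starting point; the -g all form covers all RAC instances
oradebug -g all hanganalyze 3
-- repeat after a short interval so the two dumps can be compared
oradebug -g all hanganalyze 3

The resulting trace files (written to the diagnostic trace directory of each instance) contain the blocking chains shown above.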
We initiated an investigation into why the DBWR process was
waiting on 'rdbms ipc message.' As we delved deeper into the troubleshooting
process, a significant discovery unfolded: several known bugs associated with
the same stack came to light.
One such bug, BUG 32577147, specifically
pertains to 12.2 Real Application Clusters (RAC) environments. It
manifests as DBWR slaves waiting on 'rdbms ipc message,' leaving sessions
blocked on both 'free buffer waits' and 'latch: cache buffers chains.'
We meticulously examined the load profile to determine if
there was any discernible increase when compared to a previously stable period.
However, our investigation failed to reveal any notable changes or spikes in
the load profile during the identified problematic timeframe.
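For context, one common way to produce such a side-by-side load-profile comparison (assuming AWR snapshots exist for both the stable and the problematic window, which is an assumption rather than something stated in the post) is the AWR Compare Periods report:

-- AWR Compare Periods report: prompts for two snapshot ranges and
-- produces a side-by-side load profile (run from SQL*Plus as a DBA)
@?/rdbms/admin/awrddrpt.sql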
A comprehensive solution for the issue is outlined in the
standard MOS document titled "Resolving Issues Where 'free buffer waits'
Are Observed Waiting for Buffers to be Freed" with Document ID 1476046.1.
Reducing Waits: There are two main scenarios in which high 'free buffer
waits' occur. First, make sure that DBWR is able to write out blocks
quickly enough; if write throughput is within acceptable thresholds, then
the likely alternative is that the buffer cache is too small. Use the
following to address each scenario:

DBWR is not clearing dirty blocks fast enough (a sketch of basic checks for
this scenario follows the list):
- Check whether the CPU is saturated. A saturated CPU can amplify wait
  events because a background process does not get enough CPU to make progress.
- Check for slow I/O (poor data file write performance). Some file systems
  have poor write performance (writes take too long), which impairs
  DBWriter's ability to keep enough clean buffers in the buffer cache.
- DBWriter achieves optimal throughput when asynchronous I/O is available to
  it, and may not be able to keep up with buffer demands otherwise. If the
  platform does not support asynchronous I/O, adding multiple DBWriters can
  help divide the workload.
- Tune checkpoints so that dirty buffers are flushed fast enough.
- Tune the SQL to eliminate excess I/O.

Buffer cache is too small:
- If the buffer cache is too small and filled with hot blocks, sessions will
  be starved for free buffers (clean, cold blocks) and will spend too much
  time looking for free buffers and/or posting DBWR to write dirty blocks and
  free them. Increase DB_CACHE_SIZE and monitor the effect of the change.
- Tune the SQL to eliminate excess I/O.
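The first scenario (DBWR not keeping up) can be narrowed down with a few quick checks. The following is only a minimal sketch, assuming SYSDBA access, and not the exact commands used in this case:

-- DBWR and I/O related settings
SHOW PARAMETER db_writer_processes
SHOW PARAMETER filesystemio_options
SHOW PARAMETER disk_asynch_io

-- Confirm whether asynchronous I/O is actually in effect for the data files
SELECT f.name, i.asynch_io
  FROM v$datafile f
  JOIN v$iostat_file i
    ON i.file_no = f.file#
 WHERE i.filetype_name = 'Data File';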
We conducted a thorough analysis using OSwatcher to
investigate any potential slowdown in IO, but no anomalies were detected, with
IO consistently performing well below 2ms.
Presently, the SGA_MAX_SIZE and SGA_TARGET are configured at
80GB and 60GB, respectively. To address the identified issue, the decision
was made to raise DB_CACHE_SIZE from its default of 0 (i.e., left to
automatic SGA management) to an explicit 25GB.
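The exact statement used was not captured in the post; a minimal sketch of such a change, assuming it is applied at the CDB level for all instances, would be:

-- Set an explicit 25GB floor for the buffer cache (previously 0, i.e.
-- sized automatically within SGA_TARGET)
ALTER SYSTEM SET db_cache_size = 25G SID='*' SCOPE=BOTH;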
Updating the parameter from 0 to 25GB has successfully
alleviated the 'free buffer waits' concern. Can we consider the issue resolved
at this point?
The snapshot above illustrates the chronological
progression of 'free buffer waits' over time and
outlines the step-by-step measures taken to address the issue.
What Next??
While the report time has significantly decreased from 10
hours to 4 hours after the parameter adjustment, it's noteworthy that there
remains a performance degradation compared to the previously optimal 2-hour
timeframe.
Additionally, the team has begun encountering intermittent
instances of ORA-04031 errors.
Upon a more in-depth examination of the ORA-04031 error, it
was revealed that, among the seven sub-pools, the 7th sub-pool was responsible
for triggering the issue. Notably, both the SQLA and KGLH0 components within
this sub-pool were experiencing significant growth.
In an effort to resolve the issue, the team opted
to increase the shared pool size to 57GB, successfully mitigating the ORA-04031
problem. However, a subsequent development revealed the resurgence of
"free buffer waits" in the system, accompanied by a notable decline
in overall performance. The chart above, dated the 15th, shows a recurrence
of the yellow band corresponding to Configuration class waits.
Solution:
After scrutinizing numerous trace files for discernible
patterns, no anomalies were found. However, a closer examination of a DBWR
trace file revealed a potentially significant detail: the
"target_pdbs" value was recorded as 160. This raised concerns,
considering the database had only a single PDB. The high value of
"target_pdbs" became a focal point of investigation, prompting
questions about its implications on memory allocation for each PDB.
Shifting the perspective towards a configuration-related
issue rather than a bug, further exploration unearthed several parameters set
at elevated values both at the Container Database (CDB) and Pluggable Database
(PDB) levels, contrary to recommended practices. This discovery hinted at a
potential configuration defect contributing to the observed performance issues.
Furthermore, it was observed that after the team increased
SGA_MAX_SIZE, the "target_pdbs" value automatically adjusted to 200 on the
15th. This set off a cycle of further accommodations: the shared pool and
buffer cache were enlarged to keep pace with the elevated "target_pdbs"
value, which in turn required SGA_MAX_SIZE to be raised yet again.
While initially investigating the change in the
"target_pdbs" value, we later discovered that this parameter is
automatically adjusted based on SGA_MAX_SIZE when it is not explicitly set.
When the parameter is left unspecified, its default value is calculated
dynamically as the maximum SGA size allocated to the Container Database (CDB)
divided by 512MB per Pluggable Database (PDB).
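Applying that formula to the values in this case illustrates the behaviour (the arithmetic below is ours, not taken from the trace files, and the check query assumes the parameter is exposed in V$PARAMETER on this release):

-- Default TARGET_PDBS = SGA_MAX_SIZE / 512MB per PDB
--   80 GB  / 512 MB = 160   (the value seen before the SGA increase)
--   100 GB / 512 MB = 200   (the value seen after SGA_MAX_SIZE was raised)
-- Current value, and whether it was defaulted:
SELECT name, value, isdefault
  FROM v$parameter
 WHERE name = 'target_pdbs';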
Now What About ORA-04031??
As you may be aware, the ORA-04031 error signifies a memory
constraint in the System Global Area (SGA): it is raised when one of the SGA
pools cannot grow because insufficient memory is available to satisfy an
allocation. In this specific scenario, the ORA-04031 incidents occurred
within the shared pool.
Upon scrutinizing the trace files, it became evident that a
substantial portion of the shared pool memory was being allocated to the
"SQLA 0003" and "KGLH0 0003" memory components, accounting
for 48% and 21%, respectively.
TOP 20 MEMORY USES ACROSS SGA HEAP 1 - 7
----------------------------------------------
"SQLA 0003"       26 GB   48%   >>>>>>>>
"KGLH0 0003"      11 GB   21%   >>>>>>>>
"free memory "    10 GB   19%

PDB-MODIFIED PARAMETERS:
...
sga_min_size = 0
shared_pool_size = 32212254720   >>>>>>
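A similar breakdown can be obtained without a heap dump by querying the shared pool statistics directly; this is a simple sketch (aggregated across all sub-pools rather than per heap, so the figures will not match the trace exactly):

-- Largest shared pool consumers, aggregated across sub-pools
SELECT name, ROUND(bytes / 1024 / 1024 / 1024, 1) AS gb
  FROM v$sgastat
 WHERE pool = 'shared pool'
 ORDER BY bytes DESC
 FETCH FIRST 20 ROWS ONLY;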
It's important to note a known issue in Multitenant
environments (CDB/PDB), where elevated SQLA and KGLH0 memory usage occurs when
SHARED_POOL_SIZE or SGA_MIN_SIZE is configured at the PDB level. This concern
is elucidated in the following document:
"ORA-04031 on Multitenant Database with Excessive
Amounts of KGLH0 and / or SQLA Memory and Parameter SHARED_POOL_SIZE or
SGA_MIN_SIZE Set at the PDB Level" (Doc ID 2590172.1)
Analysis of the trace files unequivocally reveals that
shared_pool_size is indeed set at the PDB level (see the PDB-modified
parameters in the trace excerpt above).
To preemptively address the ORA-4031 issue, it is recommended
to remove the shared_pool_size setting from the PDB configuration.
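One way to confirm which parameters have been persisted at the PDB level is to check the PDB spfile entries from the root container. This is a generic check rather than output from this case, and it assumes access to the dictionary base table PDB_SPFILE$:

-- Parameters stored in the spfile on behalf of individual PDBs
SELECT pdb_uid, name, value$
  FROM pdb_spfile$
 WHERE name IN ('shared_pool_size', 'sga_min_size');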
Addressing the configuration-level concerns, the team opted
to reset the values for the following parameters:
-- Modify parameters at the CDB level
ALTER SYSTEM SET sga_max_size = 107374182400 SID='*' SCOPE=SPFILE;
ALTER SYSTEM SET sga_target = 107374182400 SID='*' SCOPE=SPFILE;
ALTER SYSTEM SET shared_pool_size = 12G SID='*' SCOPE=SPFILE;
ALTER SYSTEM SET db_cache_size = 12G SID='*' SCOPE=SPFILE;
ALTER SYSTEM SET "_fix_control" = '31626438:0' SID='*' SCOPE=SPFILE;

-- Log in to the PDB and reset the following two parameters:
ALTER SYSTEM RESET sga_target SID='*' SCOPE=SPFILE;
ALTER SYSTEM RESET shared_pool_size SID='*' SCOPE=SPFILE;
Additionally, it is recommended to apply the patch for Bug 35403248,
which addresses sessions hanging while waiting for 'free buffer waits.'
This patch should be applied promptly on top of the 19.21 RU, as recommended
in MOS Note 555.1.
Following the implementation of the aforementioned changes,
the application team reported a notable improvement in both database and
application performance. The comprehensive adjustments made to parameters,
including the reset values and the application of the recommended patch for Bug
35403248, collectively contributed to the successful resolution of the
identified issues. The collaborative efforts in reconfiguring the system
parameters and addressing potential bugs have effectively restored the overall
health and performance of the database and associated applications.