Unveiling the Mystery Behind a Critical Downtime Incident: Default TRANSMIT_MODE Behavior in Oracle Database 19c's Log Archive Destinations

Recently, one of our valued customers experienced a significant downtime event: their entire 6-node RAC cluster hung under a massive pile-up of more than 6,000 sessions.

The gravity of the situation was immediately apparent: the downtime persisted intermittently for nearly four hours, from 2:10 PM to 5:30 PM, impacting critical transactions and leading to transaction declines.

When they reached out to us for assistance, our troubleshooting journey began, with the aim of uncovering the root cause and recommending a fix.


Problem Statement:

The 6-node cluster remained in a hang state despite attempted reboots, resulting in a prolonged outage with severe implications for transaction processing.


Unraveling the Culprit:

The issue presented itself as a perplexing anomaly, requiring thorough analysis of diagnostic data on the live system. Initial investigation revealed a high occurrence of gc (global cache) and 'log file sync' wait events amidst other contention factors.



Navigating a system inundated with thousands of active sessions made pinpointing the precise cause difficult, necessitating a methodical approach to troubleshooting. After ruling out a sudden increase in application load, the focus shifted to the gc-related wait events and an inspection of the cluster interconnect, albeit without success.
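
As a first triage step on a system flooded with active sessions, a simple count of non-idle sessions by wait event across all instances helps separate the dominant symptom from the noise (a sketch using GV$SESSION):

-- Count non-idle active sessions by wait event across all RAC instances
SELECT event, COUNT(*) AS sessions
FROM gv$session
WHERE status = 'ACTIVE'
AND wait_class <> 'Idle'
GROUP BY event
ORDER BY sessions DESC;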

Further scrutiny homed in on the 'log file sync' wait event, an uncommon occurrence in this environment. Comparing historical and real-time data revealed a noticeable increase in 'log file sync' waits.
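
One way to make that comparison is against AWR history. The sketch below differences consecutive snapshots (DBA_HIST_SYSTEM_EVENT stores cumulative counters) so a sudden jump in 'log file sync' waits stands out:

-- Per-snapshot increase in 'log file sync' waits from AWR history
SELECT s.snap_id, s.instance_number, s.end_interval_time,
       e.total_waits
         - LAG(e.total_waits) OVER (PARTITION BY e.instance_number
                                    ORDER BY e.snap_id) AS waits_in_interval
FROM dba_hist_system_event e
JOIN dba_hist_snapshot s
  ON s.snap_id = e.snap_id AND s.instance_number = e.instance_number
WHERE e.event_name = 'log file sync'
ORDER BY s.snap_id, s.instance_number;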

Delving deeper, the LGWR trace files hinted at heightened write activity, prompting an investigation into potential storage issues. However, collaboration with the storage vendor and a review of OSWatcher data yielded no evidence of abnormal disk response times.

While 'log file parallel write' times remained within expected parameters, the mystery persisted: why did the 6-node cluster remain unresponsive?
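
The latency distribution behind that conclusion can be read straight from the event histogram (a sketch; bucket boundaries are in milliseconds):

-- Wait-time distribution for LGWR's disk writes; healthy storage keeps
-- most waits in the low-millisecond buckets
SELECT inst_id, wait_time_milli, wait_count
FROM gv$event_histogram
WHERE event = 'log file parallel write'
ORDER BY inst_id, wait_time_milli;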


A Critical Discovery:

Examination of the database alert log unearthed a pivotal entry: a seemingly innocuous alteration to the archive log destination DEST_2, coinciding with the onset of the issue.


ALTER SYSTEM SET log_archive_dest_2='SERVICE=TEST reopen=10 LGWR NOAFFIRM MAX_CONNECTIONS=5 DB_UNIQUE_NAME=TEST COMPRESSION=ENABLE' SCOPE=BOTH SID='*';
2024-02-06T14:10:21.052822+05:30
2024-02-06T14:10:23.411087+05:30
Starting background process NSS2. <--------- NSS process started.
2024-02-06T14:10:23.481823+05:30
NSS2 started with pid=1257, OS id=21331


Though seemingly inconsequential, this change bore significant ramifications, as evidenced by subsequent events. Notably, the NSSn (network server SYNC) background process is started only when redo is shipped to a destination synchronously, so its appearance immediately after the ALTER SYSTEM was the first hint that the destination was no longer operating in ASYNC mode.


Further Analysis:

AWR reports covering the affected window revealed unexpected wait events, notably 'SYNC Remote Write', contrary to the system's usual asynchronous redo transmission to the standby.

Hang analysis (hanganalyze) data corroborated suspicions that LGWR was the blocker, with over 1,500 sessions stuck behind it, which explained the session spike observed during the incident.
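
For reference, a cluster-wide hang analysis of this kind can be captured from a SYSDBA SQL*Plus session with oradebug (a sketch; level 3 is the commonly used detail level):

-- Capture a RAC-wide hang analysis dump; output goes to the trace file
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3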


Chain 2:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 2 (TEST)
os id: 10962
process id: 1334, oracle@TEST
session id: 41
session serial #: 28464
module name: 0 (JDBC Thin Client)
}
is waiting for 'log file sync' with wait info:
{
p1: 'buffer#'=0x4bdd
p2: 'sync scn'=0x2efd6fd41a0
time in wait: 0.883070 sec
timeout after: never
wait id: 26683
blocking: 0 sessions
current sql_id: 0
wait history:
* time between current wait and wait #1: 0.000829 sec
1. event: 'SQL*Net message from client'
time waited: 0.000318 sec
wait id: 26682 p1: 'driver id'=0x28444553
p2: '#bytes'=0x1
* time between wait #1 and #2: 0.000109 sec
2. event: 'SQL*Net message to client'
time waited: 0.000004 sec
wait id: 26681 p1: 'driver id'=0x28444553
p2: '#bytes'=0x1
* time between wait #2 and #3: 0.000076 sec
3. event: 'db file sequential read'
time waited: 0.000268 sec
wait id: 26680 p1: 'file#'=0xee3
p2: 'block#'=0x5802d
p3: 'blocks'=0x1
}
and is blocked by
=> Oracle session identified by:
{
instance: 2 (TEST)
os id: 2037
process id: 130, oracle@TEST (LGWR)
session id: 26521
session serial #: 820
}
which is waiting for 'SYNC Remote Write' with wait info:
{
time in wait: 1.048944 sec
timeout after: never
wait id: 34221638
blocking: 1490 sessions ◄---- blocking a large number of sessions
current sql_id: 0
wait history:
* time between current wait and wait #1: 0.000021 sec
1. event: 'log file parallel write'
time waited: 0.003813 sec
wait id: 34221637 p1: 'files'=0x2
p2: 'blocks'=0x3986
p3: 'requests'=0x18
* time between wait #1 and #2: 0.000067 sec
2. event: 'Redo Writer Remote Sync Notify'
time waited: 0.000163 sec
wait id: 34221636
* time between wait #2 and #3: 0.002770 sec
3. event: 'SYNC Remote Write'
time waited: 0.000000 sec (last interval)
time waited: 1.152942 sec (total)
wait id: 34221633
}
Chain 2 Signature: 'SYNC Remote Write'<='log file sync'
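
The same blocker/waiter picture is also available live, without parsing trace files, from V$WAIT_CHAINS (a sketch):

-- Live wait chains: the blocker, its wait event, and the number of
-- sessions queued behind it
SELECT chain_signature, instance, sid, blocker_sid, num_waiters
FROM v$wait_chains
ORDER BY num_waiters DESC;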


Understanding the Root Cause:

A critical realization emerged: when a log archive destination is defined with the LGWR attribute but neither SYNC nor ASYNC is specified, the transport defaults to SYNC, contrary to expectations.

10g documentation: https://docs.oracle.com/cd/B19306_01/server.102/b14239/log_arch_dest_param.htm#i78506

Subsequent testing in a controlled environment validated this behavior, aligning with historical documentation for earlier database versions.

As per the 10g documentation, this is expected behaviour: when LGWR is specified without SYNC or ASYNC, SYNC is assumed. Changing that default in current releases would break databases upgraded from older versions, in particular configurations running in maximum protection mode that depend on the implicit SYNC.
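
The default is easy to confirm in a sandbox (a sketch; the SERVICE and DB_UNIQUE_NAME values are placeholders for a test standby): define the destination without SYNC or ASYNC, then check the transport mode the database actually applies:

-- Destination defined with LGWR but no SYNC/ASYNC attribute
ALTER SYSTEM SET log_archive_dest_2='SERVICE=TEST LGWR NOAFFIRM DB_UNIQUE_NAME=TEST' SCOPE=MEMORY SID='*';

-- TRANSMIT_MODE reports a synchronous mode (SYNCHRONOUS, or PARALLELSYNC
-- with NOAFFIRM) rather than ASYNCHRONOUS, confirming the implicit SYNC
SELECT dest_id, transmit_mode, affirm
FROM v$archive_dest
WHERE dest_id = 2;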


A Call to Action:

Recognizing the need for clarity, we raised a formal request with the development team to enhance the Oracle Database 19c documentation to reflect this behavior.


Key Takeaways:

The incident underscored the importance of specifying destination attributes explicitly to ensure expected system behavior, and highlighted the significance of meticulous configuration management in mission-critical environments. The fix was to restate the destination with ASYNC spelled out:

ALTER SYSTEM SET log_archive_dest_2='SERVICE=TEST reopen=10 LGWR ASYNC NOAFFIRM MAX_CONNECTIONS=5 DB_UNIQUE_NAME=TEST COMPRESSION=ENABLE' SCOPE=BOTH SID='*';
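
After the change, a quick sanity check (a sketch) is to confirm which redo transport background processes are running: in recent releases ASYNC shipping is handled by TTnn processes, while an NSSn process indicates a destination still operating in SYNC:

-- ASYNC transport uses TTnn slaves; NSSn implies a SYNC destination
SELECT inst_id, pname, spid
FROM gv$process
WHERE pname LIKE 'NSS%' OR pname LIKE 'TT%';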

