OGG(12.3) Extract long time lag after Oracle RAC a instance Crash
Oracle 12c RAC 1个实例因硬件原因突然crash, 另一实例上的goldengate Extract 进程Lag At Chkpt 持续增加,read checkpoint并不动,开始是因为归档日志被rman备份任务备份后删除,但restore 归档日志从Recovery Checkpoint到当前确认都存在后依旧hang, 清除了BR 文件再次启动extract进程恢复正常。记录一下这个问题。
ggsci> info extract ext2 Program status group lag at chkpt time since chkpt extract running ext2 02:30:30 00:00:09 GGSCI (anbob2) 4> info ext2,showch EXTRACT EXT2 Last Started 2021-06-09 12:50 Status RUNNING Checkpoint Lag 02:26:53 (updated 00:00:01 ago) Process ID 24143 Log Read Checkpoint Oracle Redo Logs 2021-06-09 11:45:59 Thread 1, Seqno 10483, RBA 2239768 SCN 3955.3574738225 (16990170393905) Log Read Checkpoint Oracle Redo Logs 2021-06-09 08:22:30 Thread 2, Seqno 9046, RBA 2146205960 SCN 3955.3346607681 (16989942263361) Startup Checkpoint (starting position in the data source): Thread #: 2 Sequence #: 7494 RBA: 79945744 Timestamp: 2021-04-27 22:19:11.000000 SCN: 3943.2230339182 (16937286387310) Redo File: +DATADG/orcl/onlinelog/group_12.530.1048923467 Recovery Checkpoint (position of oldest unprocessed transaction in the data source): Thread #: 2 Sequence #: 9040 RBA: 685766672 Timestamp: 2021-06-09 04:00:09.000000 SCN: 3955.3186747512 (16989782403192) Redo File: Not Available Current Checkpoint (position of last record read in the data source): Thread #: 2 Sequence #: 9046 RBA: 2146205960 Timestamp: 2021-06-09 08:22:30.000000 SCN: 3955.3346607681 (16989942263361) Redo File: Not Available BR Previous Recovery Checkpoint: Thread #: 2 Sequence #: 8415 RBA: 3733384208 Timestamp: 2021-05-27 22:05:37.000000 SCN: 3952.1571584933 (16975282338725) Redo File: BR Begin Recovery Checkpoint: Thread #: 2 Sequence #: 9045 RBA: 4044758120 Timestamp: 2021-06-09 08:07:12.000000 SCN: 3955.3332660599 (16989928316279) Redo File: BR End Recovery Checkpoint: Thread #: 2 Sequence #: 9045 RBA: 4044758120 Timestamp: 2021-06-09 08:07:12.000000 SCN: 3955.3332660599 (16989928316279) Redo File:
OGG extract checkpoints
STARTUP CHECKPOINT
Whenever an Extract process is started, it makes it’s first checkpoint in the data source. The first checkpoint made in the data source when the extract process starts is the STARTUP CHECKPOINT.
Current Read Checkpoint
The Extract process reads from the Redo Logs or Archive Log files. This Checkpoint shows the current read position of the Extract process. It shows the information of the extract process where currently it is reading like, which is the current redo log or archive log file.
Recovery Checkpoint
the extract process not only captures the committed changes from the redo or archive log files but also keeps track of all the open transaction. This is the checkpoint which shows the Oldest Open transaction in the database.
CURRENT WRITE CHECKPOINT
Extract process captures the changes from the redo log or archive log files and writes these changes to the Trail files called Local Trails. The CURRENT WRITE CHECKPOINT show the information of the extract process where it is currently writing. It means the information like, to which trail file sequence and RBA it is currently writing to.
BR Recovery Checkpoint
“Bounded Recovery” is only for Oracle databases,“Bounded Recovery” is a component of the general extract checkpointing facility. This component of the extract guarantees an efficient recovery after an extract stops for any reason, no matter how many uncommitted transactions are currently outstanding. The Bounded Recovery parameter sets an upper boundary for the maximum amount of time that an extract is needed to recover to the point where it stopped before continuing normal processing. The default settings for “Bounded Recovery” is set to 4 hours
在Oracle GoldenGate版本11.x中,引入了Bounded Recovery(BR)的概念,即允许extract对于长事务(long running transaction 比BRINTERVAL指定值更长的事务)写入到本地BR目录。当extract重启时,它会首先读取BR文件,取而代之读取恢复检查点指定的归档日志,这样有助于提升性能以及减少对旧归档文件的依赖。 但是当在RAC环境中使用Bounded Recovery(BR)特性来恢复一个异常abend掉的extract的话,小概率可能会遇到extract hang住或丢失特性的事务。
What defines a long-running transaction for Oracle GoldenGate?
Transactions in Oracle GoldenGate are long-running if the transaction has been open longer than one (1) “Bounded Recovery” interval.
A “bounded recovery interval” is the amount of time between “Bounded Recovery checkpoints” which persists the current state and data of the extract to disk. “Bounded Recovery checkpoints” are used to identify a recovery position between tow “Bounded Recovery intervals”. The extract will pick up from the last “bounded recovery checkpoint”, instead of processing from the log position where the open long-running transaction first appeared.
What is the maximum Bounded Recovery time?
The maximum bounded recovery time is no more than twice the current “Bounded Recovery checkpoint” interval.
“Bounded Recovery” can be changed by updating the extract parameter file with the following parameter:
1
2
3
4
5
6
7
|
BR [, BRDIR directory] [, BRINTERVAL number {M | H}] [, BRKEEPSTALEFILES] [, BROFF] [, BROFFONFAILURE] [, BRRESET] |
extract hung when using BR and new objects are added to extract With BR setup, when new objects (table, sequence, DDL, et al) are including in the extract, restarted extract will pick up more data that causes the producer queue limit (a fixed number) used by BR be reached. Because the extract is still in BR recovery, the consumer thread is stopped and not processing data from the producer queues. This caused a deadlock, and the extract will appear hung.
when new object is added to an extract. All the archived logs since recovery checkpoint need to be available. ggsci> start extract xxx, BRRESET
Cause Under these conditions, this may be a problem with the Bounded Recovery Checkpoint file. It is likely corrupted. Solution The solution is to reset the Bounded Recovery Checkpoint file when restarting the extract like: GGSCI> start extract xxx BRRESET
BRRESET
BRRESET forces Extract to use normal recovery for the current run, and then turn Bounded Recovery back on after the recovery is complete. Its purpose is for the rare cases when Bounded Recovery does not revert to normal recovery if it encounters an error. Bounded Recovery will be enabled during runtime. Consult Oracle Support before using this option.
Extract is in recovery mode until it has read all records from the recovery checkpoint to the current checkpoint. This is expected behavior.
If this is a planned STOP for EXTRACT, do
ggsci > info <extract>, showch
For each Thread, check that the difference between the Recovery Checkpoint Sequence and the Current Checkpoint Sequence is not too great.
This implies that extract is currently tracking a long transaction. To check what transactions are currently being tracked on a RUNNING extract, do
ggsci > send <extract> , showtrans
Check the Transaction ID (XID for Oracle DB) from the database and wait for the long transactions to commit.
Therefore before planning to stop extract check for long transactions before doing so to prevent long recovery for extract.
这个案例是开始因为有实例突然crash, 应该是存在大事务回滚等,后来因为备份任务删除了归档日志,extract进程重启后无法读取,在restore archivelog后,extract进程依旧无法移动,判断是BR损坏或其它原因导致extract进程hang, 在移除BR文件恢复到normal recover后,读取归档进程运行正常, 当然也可以尝试ggsci> start extract xxx, BRRESET 强制extract进程使用normal recovery启动。
常用检查方法
# check checkpoint
GGSCI > info xxx, showch
# check transaction
GGSCI > info xxx, showch
# check log
GGSCI > view report xxx
# check transaction in db
select b.inst_id,b.sid,b.serial#, b.username,wait_class, state,SECONDS_IN_WAIT, XIDUSN||'.'||XIDSLOT||'.'||XIDSQN "XID", (START_SCNW * 4294967296) + START_SCNB "Start SCN" ,SES_ADDR from gv$transaction a, gv$session b where a.SES_ADDR=b.saddr;
对不起,这篇文章暂时关闭评论。