SMON的作用还包括清理死事务:Recover Dead transaction。当服务进程在提交事务(commit)前就意外终止的话会形成死事务(dead transaction),PMON进程负责轮询Oracle进程,找出这类意外终止的死进程(dead process),通知SMON将与该dead process相关的dead transaction回滚清理,并且PMON还负责恢复dead process原本持有的锁和latch。
我们来具体了解dead transaction的恢复过程:
SQL> select * from v$version; BANNER ---------------------------------------------------------------- Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi PL/SQL Release 10.2.0.4.0 - Production CORE 10.2.0.4.0 Production TNS for Linux: Version 10.2.0.4.0 - Production NLSRTL Version 10.2.0.4.0 - Production SQL> select * from global_name; GLOBAL_NAME -------------------------------------------------------------------------------- www.askmac.cn SQL>alter system set fast_start_parallel_rollback=false; System altered. 设置10500,10046事件以跟踪SMON进程的行为 SQL> alter system set events '10500 trace name context forever,level 8'; System altered. SQL> oradebug setospid 4424 Oracle pid: 8, Unix process pid: 4424, image: [email protected] (SMON) SQL> oradebug event 10046 trace name context forever,level 8; Statement processed. 在一个新的terminal中执行大批量的删除语句,在执行一段时间后使用操作系统命令将执行该删除操作的 服务进程kill掉,模拟一个大的dead transaction的场景 SQL> delete large_rb; delete large_rb [oracle@rh2 bdump]$ kill -9 4535 等待几秒后pmon进程会找出dead process: [claim lock for dead process][lp 0x7000003c70ceff0][p 0x7000003ca63dad8.1290666][hist x9a514951] 在x$ktube内部视图中出现ktuxecfl(Transaction flags)标记为DEAD的记录: SQL> select sum(distinct(ktuxesiz)) from x$ktuxe where ktuxecfl = 'DEAD'; SUM(DISTINCT(KTUXESIZ)) ----------------------- 29386 SQL> / SUM(DISTINCT(KTUXESIZ)) ----------------------- 28816 以上KTUXESIZ代表事务所使用的undo块总数(number of undo blocks used by the transaction) ==================smon trace content================== SMON: system monitor process posted WAIT #0: nam='log file switch completion' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243332801935 WAIT #0: nam='log file switch completion' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243332815568 WAIT #0: nam='latch: row cache objects' ela= 95 address=2979418792 number=200 tries=1 obj#=1 tim=1278243333332734 WAIT #0: nam='latch: row cache objects' ela= 83 address=2979418792 number=200 tries=1 obj#=1 tim=1278243333356173 WAIT #0: nam='latch: undo global data' ela= 104 address=3066991984 number=187 tries=1 obj#=1 tim=1278243347987705 WAIT #0: nam='latch: object queue header operation' ela= 89 address=3094817048 number=131 tries=0 obj#=1 tim=1278243362468042 WAIT #0: nam='log file switch (checkpoint incomplete)' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243419588202 Dead transaction 0x00c2.008.0000006d recovered by SMON ===================== PARSING IN CURSOR #3 len=358 dep=1 uid=0 oct=3 lid=0 tim=1278243423594568 hv=3186851936 ad='ae82c1b8' select smontabv.cnt, smontab.time_mp, smontab.scn, smontab.num_mappings, smontab.tim_scn_map, smontab.orig_thread from smon_scn_time smontab, (select max(scn) scnmax, count(*) + sum(NVL2(TIM_SCN_MAP, NUM_MAPPINGS, 0)) cnt from smon_scn_time where thread = 0) smontabv where smontab.scn = smontabv.scnmax and thread = 0 END OF STMT PARSE #3:c=0,e=1354526,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1278243423594556 EXEC #3:c=0,e=106,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=1278243423603269 FETCH #3:c=0,e=47065,p=0,cr=319,cu=0,mis=0,r=1,dep=1,og=4,tim=1278243423650375 *** 2011-06-24 21:19:25.899 WAIT #0: nam='smon timer' ela= 299999999 sleep time=300 failed=0 p3=0 obj#=1 tim=1278243716699171 kglScanDependencyHandles4Unpin(): cumscan=3 cumupin=4 time=776 upinned=0
以上SMON回滚清理Dead transaction的过程从”system monitor process posted”开始到”Dead transaction 0x00c2.008.0000006d recovered by SMON”结束。另外可以看到在恢复过程中SMON先后请求了’latch: row cache objects’、’latch: undo global data’、’latch: object queue header operation’三种不同类型的latch。
现象
fast_start_parallel_rollback参数决定了SMON在回滚事务时使用的并行度,若将该参数设置为false那么并行回滚将被禁用,若设置为Low(默认值)那么会以2*CPU_COUNT数目的并行度回滚,当设置为High则4*CPU_COUNT数目的回滚进程将参与进来。当我们通过以下查询发现系统中存在大的dead tranacation需要回滚时我们可以通过设置fast_start_parallel_rollback为HIGH来加速恢复:
select sum(distinct(ktuxesiz)) from x$ktuxe where ktuxecfl = 'DEAD'; ==============parallel transaction recovery=============== *** 2011-06-24 20:31:01.765 SMON: system monitor process posted msgflag:0x0000 (-/-/-/-/-/-/-) *** 2011-06-24 20:31:01.765 SMON: process sort segment requests begin *** 2011-06-24 20:31:01.765 SMON: process sort segment requests end *** 2011-06-24 20:31:01.765 SMON: parallel transaction recovery begin WAIT #0: nam='DFS lock handle' ela= 504 type|mode=1413545989 id1=3 id2=11 obj#=2 tim=1308918661765715 WAIT #0: nam='DFS lock handle' ela= 346 type|mode=1413545989 id1=3 id2=12 obj#=2 tim=1308918661766135 WAIT #0: nam='DFS lock handle' ela= 565 type|mode=1413545989 id1=3 id2=13 obj#=2 tim=1308918661766758 WAIT #0: nam='DFS lock handle' ela= 409 type|mode=1413545989 id1=3 id2=14 obj#=2 tim=1308918661767221 WAIT #0: nam='DFS lock handle' ela= 332 type|mode=1413545989 id1=3 id2=15 obj#=2 tim=1308918661767746 WAIT #0: nam='DFS lock handle' ela= 316 type|mode=1413545989 id1=3 id2=16 obj#=2 tim=1308918661768146 WAIT #0: nam='DFS lock handle' ela= 349 type|mode=1413545989 id1=3 id2=17 obj#=2 tim=1308918661768549 WAIT #0: nam='DFS lock handle' ela= 258 type|mode=1413545989 id1=3 id2=18 obj#=2 tim=1308918661768858 WAIT #0: nam='DFS lock handle' ela= 310 type|mode=1413545989 id1=3 id2=19 obj#=2 tim=1308918661769224 WAIT #0: nam='DFS lock handle' ela= 281 type|mode=1413545989 id1=3 id2=20 obj#=2 tim=1308918661769555 *** 2011-06-24 20:31:01.769 SMON: parallel transaction recovery end
但是在real world的实践中可以发现当fast_start_parallel_rollback= Low/High,即启用并行回滚时常有并行进程因为各种资源互相阻塞导致回滚工作停滞的例子,当遭遇到这种问题时将fast_start_parallel_rollback设置为FALSE一般可以保证恢复工作以串行形式在较长时间内完成。
如何禁止SMON Recover Dead transaction
可以设置10513事件来临时禁止SMON恢复死事务,这在我们做某些异常恢复的时候显得异常有效,当然不建议在一个正常的生产环境中设置这个事件:
SQL> alter system set events '10513 trace name context forever, level 2';
System altered.
10531 -- event disables transaction recovery which was initiated by SMON
SQL> select ktuxeusn,
2 to_char(sysdate, 'DD-MON-YYYY HH24:MI:SS') "Time",
3 ktuxesiz,
4 ktuxesta
5 from x$ktuxe
6 where ktuxecfl = 'DEAD';
KTUXEUSN Time KTUXESIZ KTUXESTA
---------- -------------------------- ---------- ----------------
17 24-JUN-2011 22:03:10 0 INACTIVE
66 24-JUN-2011 22:03:10 0 INACTIVE
105 24-JUN-2011 22:03:10 0 INACTIVE
193 24-JUN-2011 22:03:10 33361 ACTIVE
194 24-JUN-2011 22:03:10 0 INACTIVE
194 24-JUN-2011 22:03:10 0 INACTIVE
197 24-JUN-2011 22:03:10 20171 ACTIVE
7 rows selected.
SQL> /
KTUXEUSN Time KTUXESIZ KTUXESTA
---------- -------------------------- ---------- ----------------
17 24-JUN-2011 22:03:10 0 INACTIVE
66 24-JUN-2011 22:03:10 0 INACTIVE
105 24-JUN-2011 22:03:10 0 INACTIVE
193 24-JUN-2011 22:03:10 33361 ACTIVE
194 24-JUN-2011 22:03:10 0 INACTIVE
194 24-JUN-2011 22:03:10 0 INACTIVE
197 24-JUN-2011 22:03:10 20171 ACTIVE
7 rows selected.
================smon disabled trans recover trace==================
SMON: system monitor process posted
*** 2011-06-24 22:02:57.980
SMON: Event 10513 is level 2, trans recovery disabled.
Leave a Reply