Oracle Job Failure: Root-Cause Analysis and Resolution (2)
Author: 100test  Published: 2007-03-14 13:59:47
Source: 100Test.Com
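Recovery attempts
I suspected that the CJQ0 process had failed. The first step is to set JOB_QUEUE_PROCESSES to 0, which makes Oracle kill CJQ0 and the associated job processes:
SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 0;
Wait 2 to 3 minutes, then set it again:
SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 5;
At this point PMON restarts the CJQ0 process, as the alert log shows: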
Thu Nov 18 11:59:50 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 12:01:30 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 12:01:30 2004
Restarting dead background process CJQ0
CJQ0 started with pid=8
However, the job still did not run, and when I changed the parameter again, CJQ0 simply died:
Thu Nov 18 13:52:05 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 14:09:30 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 14:10:27 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 14:10:42 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 14:31:07 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 14:40:14 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 14:40:28 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 14:40:33 2004
ALTER SYSTEM SET job_queue_processes=1 SCOPE=MEMORY;
Thu Nov 18 14:40:40 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 15:00:42 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 15:01:36 2004
ALTER SYSTEM SET job_queue_processes=15 SCOPE=MEMORY;
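When the alert log gets this repetitive, it can help to summarize it mechanically: which values were set, and whether a "CJQ0 started" line ever followed. The following is only a minimal sketch in Python. The function name `summarize` and the embedded excerpt are illustrative, and the timestamp format is assumed to match the lines quoted above.

```python
import re
from datetime import datetime

# Short excerpt in the same shape as the alert log quoted above (illustrative).
ALERT_LOG_EXCERPT = """\
Thu Nov 18 11:59:50 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 12:01:30 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 12:01:30 2004
Restarting dead background process CJQ0
CJQ0 started with pid=8
Thu Nov 18 13:52:05 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 14:09:30 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
"""

TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"   # assumed alert-log timestamp layout
SET_PATTERN = re.compile(r"job_queue_processes=(\d+)", re.IGNORECASE)


def summarize(log_text):
    """Return (parameter changes, CJQ0 start times) found in an alert-log excerpt."""
    changes, cjq0_starts = [], []
    current = None  # timestamp of the most recent timestamp line
    for raw in log_text.splitlines():
        line = raw.strip()
        try:
            current = datetime.strptime(line, TIMESTAMP_FORMAT)
            continue
        except ValueError:
            pass  # not a timestamp line, treat it as a log message
        match = SET_PATTERN.search(line)
        if match:
            changes.append((current, int(match.group(1))))
        elif line.startswith("CJQ0 started"):
            cjq0_starts.append(current)
    return changes, cjq0_starts


changes, cjq0_starts = summarize(ALERT_LOG_EXCERPT)
for when, value in changes:
    print(f"{when}  job_queue_processes={value}")
print(f"CJQ0 starts logged: {len(cjq0_starts)}")
```

In this excerpt only the first bounce is followed by a CJQ0 start, which matches what the log above shows for the later attempts.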
Next I tried restarting the database, which had to be done at night:
PMON started with pid=2
DBW0 started with pid=3
LGWR started with pid=4
CKPT started with pid=5
SMON started with pid=6
RECO started with pid=7
CJQ0 started with pid=8
QMN0 started with pid=9
....
CJQ0 started normally, but the job still did not run. I was out of options...
I kept digging, and to my surprise found that Oracle has the following bug:
1. Clear description of the problem encountered:
slgcsf() / slgcs() on Solaris will stop incrementing after
497 days 2 hrs 28 mins (approx) machine uptime.
2. Pertinent configuration information
No special configuration other than long machine uptime.
3. Indication of the frequency and predictability of the problem
100% but only after 497 days.
4. Sequence of events leading to the problem
If the gethrtime() OS call returns a value > 42949672950000000
nanoseconds then slgcs() stays at 0xffffffff. This can
cause some problems in parts of the code which rely on
slgcs() to keep moving.
eg: In kkjssrh() does "now = slgcs(&se)" and compares that
to a previous timestamp. After 497 days uptime slgcs()
keeps returning 0xffffffff so "now - kkjlsrt" will
always return 0.
5. Technical impact on the customer. Include persistent after effects.
In this case DBMS JOBS stopped running after 497 days uptime.
Other symptoms could occur in various places in the code.
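To make the mechanism in the bug report concrete, here is a small Python simulation of a centisecond counter that saturates at 32 bits. It is a sketch under stated assumptions, not Oracle's code: `saturating_centiseconds` is a hypothetical stand-in for slgcs(), and `job_queue_sees_time_passing` is a hypothetical stand-in for the "now - kkjlsrt" comparison. Only the threshold value is taken from the report.

```python
CENTISECOND_NS = 10_000_000      # one centisecond expressed in nanoseconds
UINT32_MAX = 0xFFFFFFFF          # largest value a 32-bit counter can hold


def saturating_centiseconds(hrtime_ns: int) -> int:
    """Hypothetical stand-in for slgcs(): centiseconds of uptime, clamped to 32 bits."""
    return min(hrtime_ns // CENTISECOND_NS, UINT32_MAX)


def job_queue_sees_time_passing(last_ns: int, now_ns: int) -> bool:
    """Hypothetical stand-in for the 'now - kkjlsrt' check: did the counter advance?"""
    return saturating_centiseconds(now_ns) - saturating_centiseconds(last_ns) > 0


# The nanosecond threshold quoted in the bug report.
threshold_ns = UINT32_MAX * CENTISECOND_NS
one_minute_ns = 60 * 1_000_000_000

# One minute apart, shortly before the threshold: the counter still moves.
before = job_queue_sees_time_passing(threshold_ns - 2 * one_minute_ns,
                                     threshold_ns - one_minute_ns)
# One minute apart, after the threshold: both readings are clamped to 0xffffffff.
after = job_queue_sees_time_passing(threshold_ns + one_minute_ns,
                                    threshold_ns + 2 * one_minute_ns)

print(threshold_ns)   # 42949672950000000
print(before, after)  # True False
```

Once both readings are clamped, their difference is always 0, so any code that waits for the counter to advance waits forever.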
So the timer had overflowed. I checked my host:
bash-2.03$ uptime
10:00pm up 500 day(s), 14:57, 1 user, load average: 1.31, 1.09, 1.08
bash-2.03$ date
Fri Nov 19 22:00:14 CST 2004
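The uptime above can be compared against the overflow threshold with a little arithmetic. The Python sketch below assumes the counter is a 32-bit count of centiseconds since boot, as the bug report describes. The helper name `days_hours_minutes` is illustrative, and because `uptime` reports only whole minutes the result is an estimate.

```python
from datetime import datetime, timedelta

# 0xffffffff centiseconds, expressed exactly in milliseconds (1 cs = 10 ms).
overflow_after = timedelta(milliseconds=0xFFFFFFFF * 10)

# Values read from the `uptime` and `date` output above.
uptime = timedelta(days=500, hours=14, minutes=57)
checked_at = datetime(2004, 11, 19, 22, 0, 14)


def days_hours_minutes(delta: timedelta):
    """Split a timedelta into whole days, hours and minutes (seconds dropped)."""
    hours, remainder = divmod(delta.seconds, 3600)
    return delta.days, hours, remainder // 60


time_since_overflow = uptime - overflow_after
estimated_overflow = checked_at - time_since_overflow

print(days_hours_minutes(overflow_after))       # (497, 2, 27)
print(days_hours_minutes(time_since_overflow))  # (3, 12, 29)
print(estimated_overflow)
```

Under these assumptions the counter saturated about three and a half days before the check, on the morning of 16 November 2004.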
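When the problem first appeared, uptime was just over 497 days. I scheduled a reboot of the host. This one was frustrating enough; who would have expected something like this from Oracle...
Oracle's final statement: the fix made it into the 9.2.0.6 patchset. But 9.2.0.6 for Solaris had not been released yet... Oh well, chalk it up to experience. If a problem seems truly inexplicable, do not be afraid to suspect Oracle itself: it may well be a bug.
After the reboot the problem was resolved. The state afterwards: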
$ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.3.0 - Production on Fri Nov 26 09:21:21 2004
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.3.0 - Production
SQL> select job,last_date,last_sec,next_date,next_sec from user_jobs;
JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC
 70 26-NOV-04 09:21:04 26-NOV-04 09:26:00
SQL> /
JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC
 70 26-NOV-04 09:26:01 26-NOV-04 09:31:00
SQL>
SQL> select * from v$timer;
HSECS
3388153
SQL> select * from v$timer;
HSECS
3388319
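The two HSECS readings show that the timer is advancing again. As a rough check, the Python sketch below converts them to elapsed time, using the fact that V$TIMER.HSECS counts hundredths of a second. The helper name `hsecs_to_timedelta` is illustrative. Reading the first sample as "time since the counter started" is an assumption, because the epoch of this counter is platform specific.

```python
from datetime import timedelta


def hsecs_to_timedelta(hsecs: int) -> timedelta:
    """Convert hundredths of a second to a timedelta (1 hsec = 10 ms)."""
    return timedelta(milliseconds=hsecs * 10)


# The two V$TIMER samples shown above.
first_sample = 3388153
second_sample = 3388319

# Time that passed between the two queries.
elapsed_between_queries = hsecs_to_timedelta(second_sample - first_sample)
# Age of the counter at the first query, ASSUMING it started from zero.
counter_age = hsecs_to_timedelta(first_sample)

print(elapsed_between_queries.total_seconds())  # 1.66
print(counter_age)                              # 9:24:41.530000
```

A nonzero difference between consecutive samples is the point here: before the reboot the underlying counter was stuck at its maximum.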
SQL>