Something that is not made very clear in the Oracle Scheduler documentation is that shell redirection (<, >, >>, |, &&, ||) cannot be used in jobs, and many developers have stumbled over this. So keep in mind that redirection is not available, neither in 11g nor in older releases of the database.
The scripts must be executable, so don't forget to set the execution bits. This might seem like stating the obvious, but it is easily forgotten.
The user (who is the process owner of the external job, nobody:nobody by default) should be able to execute the $ORACLE_HOME/bin/extjob file. On Unix, this means that the user needs execute permissions on all the parent directories of this file. This is not something specific to Oracle; it is just how a Unix file system works. Really! Check it out. Since 10gR1, Oracle does not give execute privileges to others. A simple test for this is to try starting SQL*Plus as a user who is neither the Oracle installation user nor a member of the DBA group, but a regular user. If you get all kinds of errors, the permissions are not correct, assuming that the environment variables (ORACLE_HOME and PATH) are set up correctly.
The $ORACLE_HOME/install/changePerm.sh script can fix the permissions within ORACLE_HOME (for 10g). In Oracle 11g, this changed again and the script is no longer needed. The Scheduler interprets the return code of external jobs and records it in the *_scheduler_job_run_details view. This interpretation can be very misleading, especially when you use your own exit codes. For example, when your script checks the number of arguments and performs an exit 1 when it finds an incorrect number of arguments, Oracle translates the error number, using the codes in errno.h, to ORA-27301: OS failure message: No such file or directory. In 11g, the Scheduler also records the return code in the error# column. This makes it easier to recognize the error code and to find where it is raised in the script, provided the exit codes are unique within the script.
In the early days of the Scheduler, Oracle made some quick changes between releases. The most important changes are listed here; they can cause problems when the definitions of the mentioned files are not exactly as listed:
- 10.2.0.1: $ORACLE_HOME/bin/extjob should be owned by the user who runs the jobs (the process owner) and have 6550 permissions (setuid and setgid for the process owner). In the notation that ls -l shows, the privileges should be -r-sr-s---.
- 10.2.0.2: $ORACLE_HOME/rdbms/admin/externaljob.ora should be owned by root and the Oracle install group, with 644 permissions or -rw-r--r--, as shown by ls -l. This file controls which operating system user becomes the process owner, that is, which user runs the job. The default contents of this file are as shown in the following screenshot:
$ORACLE_HOME/bin/extjob must be setuid root (permissions 4750 or -rwsr-x---) and executable for the Oracle install group. Setuid root means that root owns the file and the set-user-ID bit is set, so while executing this binary, we temporarily get root privileges on the system.
$ORACLE_HOME/bin/extjobo should have normal 755 or -rwxr-xr-x permissions, and be owned by the normal Oracle software owner and group. If this file is missing, just copy it from $ORACLE_HOME/bin/extjob.
On AIX, this is the first release that has external job support.
- 11g release: In 11g, the same files as in 10.2.0.2 exist with the same permissions. In addition, $ORACLE_HOME/bin/jssu should be owned by root and the Oracle install group with setuid root (permissions 4750 or -rwsr-x---).
It is undoubtedly best to stop using the old 10g external jobs and migrate to the 11g external jobs with credentials as soon as possible.
The security of the remote external jobs is better because of the use of credentials instead of falling back to the contents of a single file in $ORACLE_HOME/, and the flexibility is much better. In 11g, the process owner of the remote external jobs is controlled by the credential and not by a file.
On Windows, this is a little easier with regard to file system security. The OracleJobscheduler service must exist in a running state, and the user who runs this service should have the "Log on as a batch job" privilege. A .bat file cannot be run directly, but should be called as an argument of cmd.exe, for example:
begin
  dbms_scheduler.create_job
  ( job_name            => 'env_windows',
    job_type            => 'EXECUTABLE',
    number_of_arguments => 2,
    job_action          => 'C:\windows\system32\cmd.exe',
    auto_drop           => FALSE,
    enabled             => FALSE
  );
end;
/
This job named env_windows calls cmd.exe, which in turn runs the script named test.bat that we created in d:\temp. When the script we want to call needs arguments, they should be listed from argument number 3 onwards.
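The two arguments for cmd.exe can be supplied with set_job_argument_value before the job is enabled. This is a minimal sketch, assuming the script lives in d:\temp\test.bat and that /c is the switch used to make cmd.exe run it:

```sql
begin
  -- argument 1: the cmd.exe switch, argument 2: the script to run
  dbms_scheduler.set_job_argument_value ('env_windows', 1, '/c');
  dbms_scheduler.set_job_argument_value ('env_windows', 2, 'd:\temp\test.bat');
  -- arguments for the script itself, if any, would follow from number 3 onwards
  dbms_scheduler.enable ('env_windows');
end;
/
```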
In addition to the permissions on these files, there also happen to be bugs in the Scheduler. If you ever plan to use chains, apply the patch for bug 5705385. It should be fixed in 18.104.22.168 and 10.2.0.5, and is available as a one-off patch for 10.2.0.3 and 10.2.0.4. This bug causes a situation in which a running chain exists without a running job. If you don't have this patch and you start a job that runs a chain, it ends successfully. The next run terminates upon startup. The third start has the chain running again, and the fourth run again terminates upon startup.
By definition, no running chain can exist when there is no running job that runs the chain. Without this fix, the situation is a little different; a chain can exist in *_SCHEDULER_RUNNING_CHAINS without a job in *_SCHEDULER_RUNNING_JOBS.
Starting the job again cleans this situation up. In case you cannot apply this patch, it is possible to use the job event queue to trigger the detection of this situation and have it cleaned up again. Let's see how this happens:
declare
  cursor bugs is
    select c.job_name, c.chain_name, c.start_date
      from user_scheduler_running_chains c
     where not exists (select 'x'
                         from user_scheduler_running_jobs j
                        where j.job_name = c.job_name)
       and c.start_date = (select min(rc.start_date)
                             from user_scheduler_running_chains rc
                            where c.chain_name = rc.chain_name)
     order by c.start_date;
  r bugs%rowtype;
begin
  open bugs;
  fetch bugs into r;
  while bugs%found
  loop
    -- insert into event_log (dat_sys,error_msg,object_name,
    -- values (sysdate,'kick '||r.job_name||' to
    -- starting the job again cleans up the orphaned running chain
    dbms_scheduler.run_job (r.job_name, use_current_session => false);
    fetch bugs into r;
  end loop;
  close bugs;
end;
/
This code was called from the general event catcher routine and was able to deal with this specific bug. The code checks for all occurrences of a job in the buggy state and gives them a little kick. The result is that the situation is as it was intended to be—the job is not running, the chain is not running, and when the job is started again, it really does start.
Whenever it is available for your platform, it is advisable to apply the patch for bug 5705385.
Again, in 10gR2, there is a problem when using Resource Manager. The combination of Oracle Scheduler and Oracle Resource Manager should make a very powerful team, because it would enable us to schedule jobs whenever we want, without having to worry about the load on the system. In Resource Manager, we configure how jobs are prioritized, depending on the resource consumer groups to which they are mapped. At the moment, scheduling stops after a few hundred jobs have been processed, and there seems to be no way to get these jobs running other than disabling Resource Manager by activating the default Resource Manager plan, internal_plan. This can happen when you use the job_weight attribute for the jobs. The job_weight attribute is used as a multiplier when selecting a job to be executed. For unknown reasons, the Scheduler stops selecting jobs when the active session pool is relatively small. At first sight, using job_weight may seem a smart choice. However, its usage is not clearly defined, and in the end we are better off forgetting about this attribute.
In 10gR2, it does happen every now and then that locks block the cjq0 process. This job is the job queue coordinator, so it does have an impact. A workaround for this is to kill the cjq0 process. It will automatically get restarted.
Also, on all releases and platforms until version 22.214.171.124, messages may appear in the alert log about restarting the dead cjq0 process, with ORA-07445 and ORA-00600 errors, if the call stack is one of:
- ORA-7445 [kslgetl()] : kslgetl kkjqspl kkjcjexe kkjssrh ksbcti ksbabs ksbrdp
- ORA-7445 [kkjqfre()] : kkjcjpr kkjssrh ksbcti ksbabs ksbrdp
- ORA-600 [ksl_invalid_latch], [kslges] : kslges kslgetl kkjqspl kkjcjexe kkjssrh ksbcti ksbabs ksbrdp
This can very well be the bug 7147411 that is fixed in 126.96.36.199. If you are hit by this bug, apply the patch for it. It is also available for 10g.
Another funny bug is the one in which the cjq0 process ends up in a loop if any other database object exists with the same name as the active window. In that case, the solution is to rename or drop that object and bounce the database. You can use the following query to check if you have this problem:
select object_name, object_type
  from dba_objects            -- assumption: the source views were elided here
 where exists (select window_name
                 from dba_scheduler_windows
                where window_name = object_name
                  and object_type <> 'WINDOW');
If this query returns rows, take the mentioned action. The problem is one of the effects of the bug 5911728. The other symptom is that you get ORA-01422 when you try to drop a chain that has the same name as another database object. To prevent this, make sure that all your chain names and chain rule names are unique within the database. This problem is fixed in 10.2.0.5 and 11gR2.
Sometimes you may see an ORA-00600 in the alert log when the cjq0 process starts. The cjq0 process does start, but the error is strange and should normally not occur. This is bug 7649136, which is fixed in 10.2.0.5 and 11gR2. The problem is a very rare condition with special timing involved, where a NULL pointer happens to slip through.
When a window is being switched, the cjq0 process may get ORA-01000: maximum open cursors exceeded. This could very well be bug 8285404, which is fixed in 11gR2, with a few backports available for 188.8.131.52.
In 184.108.40.206, there still are some support procedures in dbms_scheduler that are not working as expected, especially on remote external jobs. Most of these problems are solved in 11gR2 and for most problems, a backport is available. When calling copy_job, you can get an exception instead of a copy of the job if it happens to be a remote external job. When the copy does take place, it is not a complete copy; everything that has to do with the fact that it is a remote external job is left uncopied.
The definition of a job allows us to temporarily disable a job by setting the end_date attribute to a date in the future. If end_date is set and it passes, the job is disabled. To re-enable the job, set a new start_date and make sure that end_date is either NULL or further in the future than the current system date. However, the NULL case does not work; setting end_date to a date very far in the future is a reasonable workaround.
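A sketch of this workaround, assuming a hypothetical job named my_job and an arbitrary far-future date:

```sql
begin
  -- give the job a fresh start_date (here: one hour from now)
  dbms_scheduler.set_attribute ('my_job', 'start_date',
       systimestamp + interval '1' hour);
  -- since resetting end_date to NULL does not work,
  -- push it far into the future instead
  dbms_scheduler.set_attribute ('my_job', 'end_date',
       to_timestamp_tz ('4000-01-01 00:00:00 +00:00',
                        'YYYY-MM-DD HH24:MI:SS TZH:TZM'));
  dbms_scheduler.enable ('my_job');
end;
/
```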
In all the releases so far, there are problems with reading job events from the Scheduler event queue. The problem is that some events are repeatedly read from the queue. When building a system that reacts upon events, this is very awkward. The bug in question is 7593937; the patch is available for a backport to 220.127.116.11 and is fixed in 11gR2.
Remote external jobs do get an SSL certificate when they register to the database that they are supposed to perform work for. In the current (18.104.22.168) Scheduler Agent, there is a little bug that generates a certificate that expires after 90 days. A workaround is to re-register the agent. The problem is fixed in 22.214.171.124. If you are running the 126.96.36.199 agent, you will have to upgrade it and re-register the agent.
All of the bugs I found were generic for all the platforms. This list is certainly not a complete list, but it will be usable for many readers who are planning to use the Scheduler. The intention of this list is not to scare people away from the Scheduler, but more to show that problems are solved by Oracle development when we provide them with a case.
Don't make any assumptions about the environment your job runs in. Oracle has made lots of changes with regard to the environment a job gets when it is launched outside the database. In 9i, there is a Java method that enables us to start a job on the operating system with lots of information in the environment. This environment got poorer with each upgrade. In earlier releases, ORACLE_HOME, PATH, and ORACLE_SID were defined, and the working directory (PWD) was $ORACLE_HOME/dbs.
In 10g, these three variables disappeared. In 11gR1, the current directory of processes scheduled using the remote external job agent is the current directory at the time the agent was started. So, you should not rely on any variable in the environment.
Before you start using external jobs, think about how the job should define its own environment. Don't forget about the ulimits; they also tend to change between releases. Even something as simple as a change in the stack size (ulimit -s) can have surprising consequences for some programs. The same is true for the working directory. In Oracle 9i, the working directory was $ORACLE_HOME/dbs. In 10g, this changed to root (/), and in 11g, we are back in the directory from where the remote job agent was started. Don't be surprised if this changes back to root in a future release.
To see what environment your script has to cope with, a simple test script could be helpful. Things to include here are:
#!/bin/sh
echo "hostname is: $(hostname)"
echo "ulimits are: $(ulimit -a)"
echo "the environment is: $(env)"
echo "connected user is: $(id)"
Checks to do in the database
If a job does not run, check the enabled state of the job, the chain, and the program(s) it uses. Jobs can become invalid during normal object invalidation actions such as changing objects on which the job depends.
Check the chain definition. dbms_scheduler now contains a procedure, analyze_chain, that analyzes a chain and deposits the results in a PL/SQL table and on server output. This can be very useful, and some example code is shown as follows:
declare
  rl  sys.scheduler$_rule_list       := sys.scheduler$_rule_list();
  stl sys.scheduler$_step_type_list  := sys.scheduler$_step_type_list();
  spl sys.scheduler$_chain_link_list := sys.scheduler$_chain_link_list();
begin
  select sys.scheduler$_rule (rule_name, condition, action)
    bulk collect into rl
    from user_scheduler_chain_rules  -- assumption: source view was elided
   where chain_name = 'YOUR_CHAIN_TO_CHECK';
  select sys.scheduler$_step_type (step_name, step_type)
    bulk collect into stl
    from user_scheduler_chain_steps  -- assumption: source view was elided
   where chain_name = 'YOUR_CHAIN_TO_CHECK';
  dbms_scheduler.analyze_chain
  ( chain_name => 'YOUR_CHAIN_TO_CHECK',
    rules      => rl,
    steps      => stl,
    step_pairs => spl
  );
  for i in spl.first .. spl.last
  loop
    null; -- inspect spl(i) here; analyze_chain also writes to server output
  end loop;
end;
/
The following screenshot shows sample output of this code:
The output shows where the chain could start and what the next steps could be. Sometimes, a chain rule cannot be evaluated immediately because a condition is not (yet) true. In such a case, the chain will stall and never reach an end. This status can be read from the *_scheduler_running_chains view, and it could very well be that only a manual intervention can help the chain into the next step. In a case where a condition might become true after a step ends, the chain rules will only be evaluated again when this is requested manually by calling the evaluate_running_chain procedure. That is OK during development and debugging, but normally one is expected to set evaluation_interval in the create_chain procedure, or set this attribute using the alter_chain procedure, so that the rules are re-evaluated periodically.
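A sketch of both options, assuming the JOB01 and CH01 names used elsewhere in this article; the five-minute interval is an arbitrary choice:

```sql
begin
  -- one-off: force the rules of the stalled running chain to be evaluated now
  dbms_scheduler.evaluate_running_chain ('JOB01');
  -- structural: let the chain re-evaluate its rules periodically
  dbms_scheduler.alter_chain ('CH01', 'evaluation_interval',
                              interval '5' minute);
end;
/
```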
When things get bad, and a forced or timed evaluation does not achieve our goal, there is an even more powerful tool that can help: the alter_running_chain procedure. Using this procedure, we can set the state of a job step to whatever we want. Again, this is nice for development and debugging, but normally one would expect the rules to be complete enough to handle all situations. Here is an example showing the use of this procedure:
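A sketch of such a call, assuming the chain runs as job JOB01 and we want the failed step_03 to count as succeeded so that the rules can fire:

```sql
begin
  dbms_scheduler.alter_running_chain
  ( job_name  => 'JOB01',
    step_name => 'STEP_03',
    attribute => 'STATE',
    value     => 'SUCCEEDED'
  );
end;
/
```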
This is useful if step_03 failed and there was no rule for this possible occurrence. We can also start the following step without bothering with step_03 at all:
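A sketch, again assuming job JOB01; setting the step state this way makes step_04 run without touching step_03:

```sql
begin
  dbms_scheduler.alter_running_chain
  ( job_name  => 'JOB01',
    step_name => 'STEP_04',
    attribute => 'STATE',
    value     => 'RUNNING'
  );
end;
/
```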
Here we just made step_04 run while keeping the status of step_03 as it was. This is very powerful, so use it with care; otherwise, before you know it, your problems will have grown instead of being solved as you had hoped. Without knowing the other rules of the chain, you might not get what you wanted to achieve. What should the end result be when step_03 has a failed status and step_04 has a success status? Check the rules of the chain you are messing with.
Situations in which a chain runs another chain are a little more difficult. In this case, the job_name has to be built up from the parent job name, followed by the sub-job name(s), down to the sub-job that identifies the chain to be altered.
Let's take an example. A job JOB01 starts a chain CH01. The chain gets its own job that runs the chain, just as for normal job steps. This sub_job_name can be found in *_scheduler_running_chains. In order to manipulate a step in the chain CH01, we need to specify it as owner.job_name.job_subname for alter_running_chain to work on the right chain. In this case, it would look like: JOB01.CH01.
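The resulting call could then be sketched as follows (schema owner omitted for brevity; the step name is a hypothetical example):

```sql
begin
  -- address a step inside chain CH01, which runs as a sub-job of JOB01
  dbms_scheduler.alter_running_chain
  ( job_name  => 'JOB01.CH01',
    step_name => 'STEP_03',
    attribute => 'STATE',
    value     => 'SUCCEEDED'
  );
end;
/
```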
As always, test your jobs, test your chains, and make sure that every possible result of every chain step can be handled by the accompanying set of chain rules. When all the rules and the mentioned patches are in place, there is not much to worry about.
If a job is no longer being started and it is supposed to run under Resource Manager control, you might get the job running by effectively disabling Resource Manager, that is, by activating internal_plan. This assumes that the job still has the status SCHEDULED.
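A sketch of activating the default plan, to be run as a suitably privileged user:

```sql
-- activate the default plan, effectively disabling Resource Manager
alter system set resource_manager_plan = 'INTERNAL_PLAN';
```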
In a Real Application Clusters (RAC) database, if a job is not running when expected, check in which instance the job ran for the last time and whether instance stickiness has been used. If the instance_stickiness job attribute is set, the Scheduler will only run the job on the instance where the job ran the last time, as long as that instance is running. According to the Scheduler documentation, the job will be run by another instance if the instance to which the job sticks is not able to run the job for a significant time. The problem is that no definition is given for the term "significant".
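When stickiness turns out to be the problem, it can be switched off per job. A sketch with a hypothetical job name:

```sql
begin
  -- let any available RAC instance run the job instead of the sticky one
  dbms_scheduler.set_attribute ('my_rac_job', 'instance_stickiness', false);
end;
/
```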
The checks listed here should normally get you up and running again but the list is not complete. No living creature can ever meet every possible problem, so over time this list will evolve. When development continues working on the issues, we can expect this list to shrink in the later releases.
Debugging jobs is no more difficult than debugging normal PL/SQL or scripts. Especially for the scripts that run as external jobs, don't forget the limitations of the environment definitions. Scripters used to scripting for cron know how to handle a "poor" environment. We hope this article gave enough tools to start debugging. The things learned here should be very useful in real life. In this article, we saw:
- How the file privileges for external jobs should be:
- In Unix systems using various releases of Oracle
- In Windows systems
- What the externaljob.ora file looks like
- Normal Unix redirection cannot be used in jobs
- Windows scripts should be called using cmd.exe
- A list of Scheduler bugs with solutions or workarounds
- How to analyze a job chain using PL/SQL
- How to solve a stalled chain
- How to address a chain step
- How to prevent a stalled chain situation
- What to check in the database when a job unexpectedly does not run
- How to check the environment of a job
- How to cope with problems of the cjq0 process
- How to get such a job running again