Short and Long-Running Processes
As a process moves from activity to activity, it consumes time, and each activity adds to the overall duration. But different sorts of activities have different durations, and it's not uncommon to observe a ten-step process that outpaces, say, a five-step one. It depends, of course, on what those activities are doing.
In SOA, process cycle times range from one second or less to one or more years! The latter sort need not have a large number of activities. The pyramids might have been built rock-by-rock over several decades, but protracted SOA processes typically span only a few dozen tasks, a handful of which consume almost the entire interval.
As we discuss in this article, most of that time is spent waiting. The disputes process often requires several months to complete, because at various times it sits idle waiting for information from the customer, the merchant, or the back office. Even though business processes crawl along at human speed, it often makes sense to let SOA manage the end-to-end flow.
It's not easy to build an SOA process engine that can blaze through a sub-second process and, at the same time, keep on top of one that hasn't moved in weeks. On the other hand, when a long-running process rouses, we expect the engine to race very quickly to the next milestone. The central argument of this article is that both long-running and short-running processes run in very quick bursts, but whereas a short-running process runs in a single burst, a long-running process might have several bursts, separated by long waits. To support long-running processes, the process engine needs a strategy to keep state.
In this article, we examine the fundamental differences between long-running and short-running processes. We discuss how to model state, and demonstrate how to build a long-running process as a combination of several short-running processes tied together by state. We also show how to compile short-running BPEL processes to improve the execution speed of a burst.
Process Duration—the Long and Short of It
SOA processes have the following types of activities:
- Tasks to extract, manipulate, or transform process data
- Scripts or inline code snippets
- Calls to systems and services, both synchronous and asynchronous
- Events, including timed events, callbacks, and unsolicited notifications from systems
The first three sorts of activities execute quickly: the first two on the order of milliseconds, the third often sub-second but seldom more than a few seconds (in the case of a synchronous call to a slow system). These activities are active: as the process navigates through them, it actively performs work, and in doing so ties up the process engine. Event times are generally much longer and more variable. Events come from other systems, so (with the exception of timed events) the process cannot control how quickly they arrive. The process passively waits for events, in effect going to sleep until they come.
An event can occur at the beginning of a process—indeed, every SOA process starts with an event—or in the middle. An event in the middle is called an intermediate event. The segment of a process between two events is called a burst. In the following figure, events are drawn as circles, activities as boxes, and bursts as bounding boxes that contain activities. Process (a), for example, starts with an event and is followed by two activities—Set Data and Sync Call—which together form a burst. Process (b) starts with an event, continues with a burst (consisting of the activities Set Data and Call System Async), proceeds to an intermediate event (Fast Response), and concludes with a burst containing the activity Sync Call. Process (c) has two intermediate events and three bursts, and (d) has a single intermediate event and two bursts.
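The definition of a burst can be made concrete with a short sketch. The following Python is only an illustration, not part of any process engine: the Step type and the split_into_bursts function are hypothetical names, and the start event of process (b) is labeled "Start" here because the text does not name it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    is_event: bool = False  # True for start and intermediate events


def split_into_bursts(steps):
    """Group consecutive activities into bursts; every event closes the current burst."""
    bursts, current = [], []
    for step in steps:
        if step.is_event:
            if current:
                bursts.append(current)
            current = []
        else:
            current.append(step.name)
    if current:
        bursts.append(current)
    return bursts


# Process (b) from the figure: start event, burst, intermediate event, burst.
process_b = [
    Step("Start", is_event=True),  # label assumed; the text does not name it
    Step("Set Data"),
    Step("Call System Async"),
    Step("Fast Response", is_event=True),
    Step("Sync Call"),
]
```

Calling split_into_bursts(process_b) yields two bursts, matching the description of process (b) above.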
Processes are classified by duration as follows:
- Short-running: The process runs comparatively quickly, for not more than a few seconds. Most short-running processes run in a single burst (as in process (a) in the figure), but some have intermediate events with fast arrival times and thus run in multiple bursts; in (b), for example, the intermediate event, a response to an asynchronous system call, arrives in about two seconds. TIBCO's BusinessWorks and the BPEL compiler described later in the article are optimized to run both single-burst and multiple-burst short-running processes. BEA's Weblogic Integration can run single-burst, short-running processes with limited overhead, but, as discussed further next, treats cases like (b) as long-running.
- Long-running: The process has multiple bursts, and the waiting times of its intermediate events are longer than the process engine itself is expected to run before its next restart! In process (d), for example, the engine is restarted for maintenance while the process waits two days for a human action. The process survives the restart because its state is persisted. At the end of its first burst (that is, after the Assign Work step), the engine writes the state to a database, recording the fact that the process is now waiting on an event for a human action. When the engine comes back up, it fetches the state from the database to remember where it left off. Most BPEL processes are long-running. In Weblogic Integration, stateful processes can run for arbitrarily long durations.
- Mid-running: The process has multiple bursts, but the waiting times of its intermediate events last no more than a few minutes, and its state does not need to be persisted. Stakeholders accept the risk that if the process engine goes down, in-flight processes are lost. Chordiant's Foundation Server uses mid-running processes to orchestrate the interaction between agent and customer when the customer dials into a call center. The call is modeled as a conversation, somewhat like a sequence of questions and answers. A burst, in this design, processes the previous answer (for example, the Process Answer activity in (c)) and prepares the next question (Prepare Question). Intermediate events (Get Answer) wait for the customer to answer. State is held in memory.
Stateful and Stateless Processes in BEA's Weblogic Integration
In Weblogic Integration, single-burst processes are stateless, but multiple-burst processes, even short-running ones, are stateful. Even if the wait between bursts is very small (one or two seconds perhaps), Weblogic Integration nonetheless persists process state to a database. The distinction is subtle, but Weblogic Integration provides visual clues to help us detect the difference. In the next figure, the process on the left is stateless. The process on the right is the same as that on the left except for the addition of an event step called Control Receive; the step, in effect, puts the process in a wait state until it receives a specific event. When this step is added, Weblogic Integration changes the appearance of its start step—Start—from a circle with a thin border to one with a thick border, indicating that the process has changed from being stateless to stateful.
Those who designed Weblogic Integration thought process state so important that they worked into their notation whether a process is stateful or stateless. We now study one of the most critical pieces of any process engine: how it keeps state.
How to Keep Long-Running State
In this section, we study the data models for long-running process state in two commercial process integration platforms: Oracle's BPEL Process Manager and BEA's Weblogic Integration. We also develop our own model, a generalization of the Oracle and BEA approaches, which enables us to achieve the effect of a long-running SOA process from a group of short-running processes. We put this model to practical use later in this article, in the email money transfer example.
SOA process state models contain information about the following:
- Process metadata, including the types of processes currently deployed, their versions, and how their activities are assembled.
- Process instances, including status, start time and end time, and the position of the instance in a call graph (that is, parent/child relationships). Some models also track the status of individual activities.
- Pending events, and how to correlate them with process instances.
State in Oracle's BPEL Process Manager
The following figure shows the core tables in the Oracle BPEL model (version 10.1.2).
In this model, process metadata is held in two tables: Process_Default and Process_Revision. The former lists all deployed BPEL processes and their current revision numbers; the process_id field is not a technical key but the name of the process specified by the developer. The latter lists all of the revisions; for a given process, each revision has a distinct GUID, given by the field process_guid.
The seemingly-misnamed table Cube_Instance—actually, cube is synonymous with process in the internals of the product—has information about current and completed process instances. The instance has a unique key, given by cikey. From process_guid we can deduce, by joining with Process_Revision, the process type and revision of the instance. Other important information includes the instance creation date, its parent instance, and its current state. Possible states are active, aborted, stale, and completed, although the state field uses numeric codes for these values.
The Work_Item table tracks the status of instance activities. Cikey indicates the instance to which the activity belongs. Within an instance the activity is identified by the combination of node_id, scope_id, and count_id. The first two of these indicate the position of the activity in the process graph and the scope level to which it belongs; the label column is a friendlier alternative to these, assuming that the developer applied a useful label to the activity. Count_id is required in case the activity executes more than once. Work_Item has its own state field (again numeric), which indicates whether the activity is completed or pending, was cancelled, or encountered an exception.
Dlv_Subscription records pending events and correlates them with instances. Conv_id is a conversation identifier known to both the BPEL process and its partner service. To trigger the event, the partner service passes this identifier as part of its message. The process matches it to a subscriber_id, which uniquely identifies the activity that is waiting on the event. Thus, when the event arrives, the process knows exactly from which point to continue. (Technically, subscriber_id is a delimited string, which encodes as part of its structure the values of cikey, node_id, scope_id, and count_id that point to a unique Work_Item record.) The partner also specifies an operation name, which specifies which type of event it is firing. If the process is waiting on several events in the same conversation (as part of an event pick, also known as a deferred choice), operation_name determines which path to follow. The combination of operation_name and conv_id points to a unique activity (that is, to a unique subscriber_id).
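The correlation step can be pictured with a toy lookup. This Python sketch is not Oracle code and does not use Oracle's actual encoding: the "|" delimiter, the field order inside subscriber_id, and all of the sample values are invented for illustration. It shows only the idea that the pair of conv_id and operation_name leads to one subscriber_id, which in turn identifies one Work_Item.

```python
from typing import NamedTuple, Optional


class WorkItemKey(NamedTuple):
    cikey: int
    node_id: str
    scope_id: str
    count_id: int


# Hypothetical Dlv_Subscription rows: (conv_id, operation_name) -> subscriber_id.
# The "|" delimiter and the field order are assumptions made for this sketch.
SUBSCRIPTIONS = {
    ("conv-42", "approve"): "1001|node-7|scope-2|1",
    ("conv-42", "reject"): "1001|node-9|scope-2|1",
}


def find_waiting_activity(conv_id: str, operation_name: str) -> Optional[WorkItemKey]:
    """Return the work item waiting on this event, or None if nothing is subscribed."""
    subscriber_id = SUBSCRIPTIONS.get((conv_id, operation_name))
    if subscriber_id is None:
        return None
    cikey, node_id, scope_id, count_id = subscriber_id.split("|")
    return WorkItemKey(int(cikey), node_id, scope_id, int(count_id))
```

The two sample rows share a conversation, as they would in an event pick: the operation name alone decides which activity resumes.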
State in BEA's Weblogic Integration
The following figure shows three important tables in the Weblogic Integration model:
WLI_Process_Def has metadata about types of deployed processes and their activities. The table has one row for each activity. Process_type is the human-readable name of a process. Activity_id is the numeric identifier of an activity in the process, although user_node_name, the descriptive name provided by the developer, is more intuitive.
Process instance information is held in WLI_Process_Instance_Info. Each instance has a unique numeric identifier, given by process_instance. Process_type specifies the process definition on which the instance is based. Process_status specifies, in a numeric code, whether the instance is active, pending, or aborted. The table also tracks process start and end times, as well as time in excess of the SLA (sla_exceed_time). Through Weblogic Integration's administration console, the administrator can configure an SLA on process cycle time.
In Weblogic Integration a process instance can receive intermediate events by several means. One of the most important of these is by listening for messages published by Weblogic Integration's message broker system. The table WLI_Message_Broker_Dynamic keeps track of specific events waiting on broker messages. The column subscriber_instance is the process instance identifier; it matches the process_instance value in WLI_Process_Instance_Info. Rule_name is, in effect, a pointer to the event in that instance. Filter_value is an XQuery expression that checks the content of the message to determine whether to accept the event. When a message arrives, the broker checks for any subscription events, and triggers those whose filter test passes.
Our Own State Model
Our own model, shown in the next figure, follows a design approach similar to that of the Oracle and BEA models.
To begin, the model features a single metadata table, called ProcessStarter, which enumerates the types of processes deployed (processType) and specifies for each the type of event that can start it (triggeringEventType). The table's main purpose is to route start events: when an external event arrives, if ProcessStarter can map it to a process, then a new instance of that process is created from the event.
Several tables track the state of process instances. The Process table assigns a unique identifier to each instance (procID), indicates its type (processType), locates it in a conversation (convID), and records its start time, end time, and status (pending, completed, or aborted). The ProcessVariable table persists process variables, ensuring that instance-specific data survives system restarts. A variable is identified by a name (name) that is unique within its level of scope (scope) in a process instance (procID). The ProcessAudit table keeps a chronological list of important occurrences in a process instance. It is tied to a specific instance (procID), and has both a timestamp and a text entry. The entry can optionally be associated with a specific process activity (activityID). Implementations can extend the model by providing a custom state table (such as the hypothetical MyAppState in the diagram) that associates application-specific fields (myState, in this example) with an instance.
Finally, the PendingEvent table assists in correlating intermediate events. An event is identified by the combination of its process instance (procID), its activity node in the process (activityID), and if it is part of a deferred choice, the identity of that choice (choiceActivityID). (If the event is not part of a choice, choiceActivityID is zero or null.) There are two types of events: timed events and events triggered by a message. If the event is a timed event, timeToFire specifies the date and time (somewhere in the future) when the event should fire. If the event is message-based, triggeringEventType indicates the type of message that triggers it. When the event is created, the Boolean field isDone is set to false. When the event fires, isDone switches to true. If the event is part of a choice, isDone is set to true for all events in the choice, thereby ensuring that only one event is chosen.
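The tables described above might be declared roughly as follows. This is a sketch using Python's built-in sqlite3 module, not the schema from the figure: table and field names follow the text where it gives them, but the column types, the keys, and the names startTime, endTime, value, loggedAt, and entry are assumptions.

```python
import sqlite3

# Column types, keys, and the names startTime, endTime, value, loggedAt,
# and entry are assumptions; the text names only the other fields.
SCHEMA = """
CREATE TABLE ProcessStarter (
    processType         TEXT NOT NULL,
    triggeringEventType TEXT NOT NULL,
    PRIMARY KEY (processType, triggeringEventType)
);
CREATE TABLE Process (
    procID      INTEGER PRIMARY KEY,
    processType TEXT NOT NULL,
    convID      TEXT NOT NULL,
    startTime   TEXT,
    endTime     TEXT,
    status      TEXT NOT NULL CHECK (status IN ('PENDING', 'COMPLETED', 'ABORTED'))
);
CREATE TABLE ProcessVariable (
    procID INTEGER NOT NULL REFERENCES Process(procID),
    scope  TEXT NOT NULL,
    name   TEXT NOT NULL,
    value  TEXT,
    PRIMARY KEY (procID, scope, name)
);
CREATE TABLE ProcessAudit (
    procID     INTEGER NOT NULL REFERENCES Process(procID),
    loggedAt   TEXT NOT NULL,
    entry      TEXT NOT NULL,
    activityID TEXT
);
CREATE TABLE PendingEvent (
    procID              INTEGER NOT NULL REFERENCES Process(procID),
    activityID          TEXT NOT NULL,
    choiceActivityID    INTEGER,            -- NULL when not part of a choice
    timeToFire          TEXT,               -- set for timed events only
    triggeringEventType TEXT,               -- set for message events only
    isDone              INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (procID, activityID)
);
"""


def create_state_db(path=":memory:"):
    """Create the state tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

An application-specific table such as MyAppState would simply add one more table keyed by procID.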
The model assumes that all messages carry the following fields:
- Event Type
- Recipient Process Type
- Conversation ID
When a message arrives, the following logic determines how to route it:
If there is an instance of the process in the conversation (that is, if there are rows in Process where processType and convID match the values from the message), check whether it has a pending event of the given event type (that is, check for rows in PendingEvent where procID matches the value from Process, isDone is false, and triggeringEventType matches the event type). If it does, fire the event. Otherwise, discard the event.
If there is no instance of the process in the conversation, check whether the process can be started by this type of event. (That is, check for rows in ProcessStarter where processType and triggeringEventType match those from the message.) If so, instantiate the process. Otherwise, discard the event.
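The two routing rules above can be sketched as a single function. The Python below is a minimal illustration against an in-memory SQLite database; the function names, the return values, and the column types are assumptions, and only the columns that the routing logic reads are created. Where several instances match, this sketch looks at the first one only.

```python
import sqlite3


def open_demo_db():
    """Create only the columns the routing logic needs (types are assumptions)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ProcessStarter (processType TEXT, triggeringEventType TEXT);
        CREATE TABLE Process (procID INTEGER PRIMARY KEY, processType TEXT, convID TEXT);
        CREATE TABLE PendingEvent (procID INTEGER, activityID TEXT,
                                   triggeringEventType TEXT, isDone INTEGER);
    """)
    return conn


def route_message(conn, event_type, process_type, conv_id):
    """Return ('fire', procID, activityID), ('start', processType), or ('discard',)."""
    instance = conn.execute(
        "SELECT procID FROM Process WHERE processType = ? AND convID = ?",
        (process_type, conv_id),
    ).fetchone()
    if instance is not None:
        # An instance exists in this conversation: look for a matching pending event.
        event = conn.execute(
            "SELECT activityID FROM PendingEvent "
            "WHERE procID = ? AND isDone = 0 AND triggeringEventType = ?",
            (instance[0], event_type),
        ).fetchone()
        return ("fire", instance[0], event[0]) if event else ("discard",)
    # No instance yet: the event is useful only if it can start the process.
    starter = conn.execute(
        "SELECT 1 FROM ProcessStarter WHERE processType = ? AND triggeringEventType = ?",
        (process_type, event_type),
    ).fetchone()
    return ("start", process_type) if starter else ("discard",)
```

The caller decides what firing or starting means; the function only classifies the message.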
Combining Short-Running Processes with State in TIBCO's BusinessWorks
The next discussion covers the TIBCO implementation of the email transfer process.
Our Use Case—Sending Money by Email
With this model in place, we build a process that spans several days as a set of short-running processes, none of which lasts more than a few seconds. The use case we consider is email money transfer. In a transfer there are four main parties: the sender, the sender's bank, the recipient, and the recipient's bank. We build the process for the sender's bank.
The following figure depicts the required flow of events:
When the bank receives the request to send funds from the sender (Sender's Request), it validates the request (Validate Request), and rejects it if it discovers a problem (Send Reject to Sender). If the request is valid, the bank informs the sender of its acceptance (Send Accept to Sender), notifies the recipient by email (Send Email To Recipient), and sets aside funds from the sender's account (Allocate Funds). The first burst is complete, but several possible paths can follow:
1. There is a time limit on the transfer, and if it expires the transfer is aborted.
2. The sender may cancel the transfer.
3. The sender's bank may reject the recipient's bank's request to move the funds into the recipient's account. The recipient may try again later.
4. The sender's bank may accept the recipient's bank's request to move the funds into the recipient's account.
The control flow to support this logic is a deferred choice inside a loop. The loop runs for as long as the variable loopExit is false. The process initializes the value to false (Set loopExit=false) immediately before entering the loop. Paths 1, 2, and 4 set it to true (Set loopExit=true) when they complete, signaling that there is no further work to do and the loop need not make another iteration. Path 3 leaves the loopExit flag alone, keeping it as false, thus allowing another iteration (and another chance to complete the transfer). Each iteration is a burst.
There are three events in the deferred choice, one for expiry (path 1), one for cancellation (path 2), and one for the recipient's bank transfer request (paths 3 and 4). The logic for cancellation and expiry (headed by the events Sender's Cancellation and Expired respectively) is identical: the process sends a cancellation email to the recipient (Send Email Recipient), informs the sender that the transfer is aborted (Send Abort to Sender), and restores the funds to the sender's account (Restore Funds). In the transfer request path (starting with the event Recipient Bank's Transfer Request), the sender bank validates the transfer (Validate Transfer) and sends the outcome to the recipient's bank (Send Reject to Recipient Bank or Send Accept to Recipient Bank). If validation passes, the process also notifies the sender that the transfer is complete (Send Completion to Sender) and commits the funds it had earlier allocated (Commit Funds).
The sender's bank's process is long-running, typically spanning several days from start to finish. To build it using a short-running process engine, such as TIBCO's BusinessWorks, we need to break it into smaller processes: one to handle the sender's request to send funds, one to handle the recipient's bank's request to complete the transfer, one to handle the sender's cancellation, one to handle expiry, and one to manage the overall event routing. In dividing the process into pieces, we lose the loop and deferred choice, but we add housekeeping responsibility to each piece.
The Router Process
The next figure shows the BusinessWorks process to handle the overall routing.
When it receives an inbound message on a JMS queue in GetEvent, the router process checks the event type to determine to which BusinessWorks process to route the event. There are three event types:
- Request: Sent by the account holder (known as the sender). Because this request starts the process, it must not contain a conversation identifier. If it does, the route process immediately logs the event as an error and discards it (Log Illegal Input). Otherwise, it queries the ProcessStarter table, in the step Check Starter Enabled, to verify that the email transfer process may be started by this type of event. (It checks that there is a record in the table that matches the given event type and process type.) If this check passes, the route process creates a unique conversation identifier (Set Conv ID) and calls the request process to handle the event (Call Request Process).
- Transfer: Sent by the recipient bank. The route process checks that the message has a conversation identifier. If it does, it calls the transfer process (Call Transfer Process) to handle the event. Otherwise, it logs the event and discards it (Log Illegal Input).
- Cancel: Sent by the sender or internally by the timer process (discussed further next). The route process checks that the message has a conversation identifier. If it does, it calls the cancellation process (Call Cancel Process) to handle the event. Otherwise, it logs the event and discards it (Log Illegal Input).
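The routing rules for the three event types can be summarized in a few lines of Python. This is a sketch of the decision logic only, not BusinessWorks code: the message is modeled as a dictionary, the JMS queue is left out, and route_event, starter_enabled, handlers, log, and the process type name are all hypothetical. The text does not say what happens when the ProcessStarter check fails or when an unknown event type arrives, so the handling of those two cases here is an assumption.

```python
import uuid

PROCESS_TYPE = "EmailTransfer"  # hypothetical process type name


def route_event(message, starter_enabled, handlers, log):
    """Dispatch one inbound message and return the name of the step taken."""
    event_type = message.get("eventType")
    conv_id = message.get("convID")

    if event_type == "Request":
        # A request starts the process, so it must not carry a conversation ID.
        if conv_id is not None:
            log(f"illegal input: {message}")
            return "Log Illegal Input"
        if not starter_enabled(event_type, PROCESS_TYPE):  # Check Starter Enabled
            return "Discard"  # assumption: the text does not cover this case
        message = dict(message, convID=str(uuid.uuid4()))  # Set Conv ID
        handlers["Request"](message)
        return "Call Request Process"

    if event_type in ("Transfer", "Cancel"):
        # Both of these continue an existing conversation.
        if conv_id is None:
            log(f"illegal input: {message}")
            return "Log Illegal Input"
        handlers[event_type](message)
        return f"Call {event_type} Process"

    log(f"unknown event type: {message}")  # assumption: treated as illegal input
    return "Log Illegal Input"
```

Passing the handlers in as a dictionary keeps the router independent of the processes it calls, which mirrors how the router process only decides where an event goes.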
The Request Process
The next figure shows the BusinessWorks process to handle the sender's request to send funds:
The process begins by creating a unique process identifier (Set Proc ID) and then validates the request (Validate Request). If the request is invalid, the process sends a rejection to the sender (Send Reject to Sender) and writes three records to the database:
- A record in the Process table (using Add Process Record Aborted) that sets the status of the instance to ABORTED. The process identifier is the one created in Set Proc ID.
- A log of the validation failure (using Add Audit Invalid Req) in the ProcessAudit table.
- A copy of the inbound message in the ProcessVariable table, using Add Variable Request. The earlier step RequestAsString converts the message from XML to string form.
Thus, there is a record that the instance was aborted, an explanation in the audit trail why it failed, and a copy of its message data.
The happy path, in which the request passes validation, contains three steps that we described earlier: Send Email Recipient, Send Accept to Sender, and Allocate Funds. It also creates the following records in the database:
- A record in the Process table (using Add Process Record Pending) about the instance, with a status of PENDING and the identifier created in Set Proc ID.
- An indication that the validation passed (using Add Audit Valid Request) in the ProcessAudit table.
- A copy of the inbound message (using Add Variable Request 2) in the ProcessVariable table.
- Three PendingEvent records, for transfer, expiry, and cancel respectively (using the steps Add Transfer Event, Add Expiry Event, Add Cancel Event). The records share a common choiceActivityID, and for each the isDone field is set to false.
- A record in the custom table EXState (using Add EXState), which extends the Process table with information specific to email transfers. The next figure shows the EXState table and its relationship to Process. The table adds one field to the mix, numRejects, which is initialized here to zero and is incremented each time the sender's bank rejects the recipient's bank's transfer request.
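The seven writes on the happy path can be sketched as one function. The Python below uses an in-memory SQLite database as a stand-in for the state tables; the column types, the process type name, the variable scope and name, and the audit text are assumptions. Grouping the writes into a single transaction is a choice made for this sketch, not something the BusinessWorks process is stated to do.

```python
import sqlite3
from datetime import datetime


def open_demo_db():
    """Create cut-down versions of the state tables (column types are assumptions)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Process (procID INTEGER PRIMARY KEY, processType TEXT,
                              convID TEXT, startTime TEXT, status TEXT);
        CREATE TABLE ProcessAudit (procID INTEGER, loggedAt TEXT, entry TEXT);
        CREATE TABLE ProcessVariable (procID INTEGER, scope TEXT, name TEXT, value TEXT);
        CREATE TABLE PendingEvent (procID INTEGER, activityID TEXT, choiceActivityID INTEGER,
                                   timeToFire TEXT, triggeringEventType TEXT, isDone INTEGER);
        CREATE TABLE EXState (procID INTEGER PRIMARY KEY, numRejects INTEGER);
    """)
    return conn


def record_accepted_request(conn, proc_id, conv_id, request_text, now, expires_at):
    """Write the happy-path records; 'with conn' commits, or rolls back on error."""
    stamp = now.isoformat()
    with conn:
        conn.execute("INSERT INTO Process VALUES (?, 'EmailTransfer', ?, ?, 'PENDING')",
                     (proc_id, conv_id, stamp))                      # Add Process Record Pending
        conn.execute("INSERT INTO ProcessAudit VALUES (?, ?, 'Request passed validation')",
                     (proc_id, stamp))                               # Add Audit Valid Request
        conn.execute("INSERT INTO ProcessVariable VALUES (?, 'process', 'request', ?)",
                     (proc_id, request_text))                        # Add Variable Request 2
        conn.executemany(
            "INSERT INTO PendingEvent VALUES (?, ?, 1, ?, ?, 0)",    # shared choice, isDone false
            [
                (proc_id, "Transfer", None, "EX.Transfer"),          # Add Transfer Event
                (proc_id, "Expiry", expires_at.isoformat(), None),   # Add Expiry Event
                (proc_id, "Cancel", None, "EX.Cancel"),              # Add Cancel Event
            ],
        )
        conn.execute("INSERT INTO EXState VALUES (?, 0)", (proc_id,))  # Add EXState
```

For example, record_accepted_request(conn, 123, "c1", "<request/>", datetime(2008, 12, 6), datetime(2008, 12, 13)) leaves three pending events for instance 123, all in the same deferred choice.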
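When the happy path completes, the PendingEvent table has, among its contents, three records similar to the following:
Proc ID | Activity ID | Choice Activity ID | Is Done | Time To Fire | Triggering Event Type
123 | Cancel | 1 | false | (none) | EX.Cancel
123 | Expiry | 1 | false | Dec 13, 2008 | (none)
123 | Transfer | 1 | false | (none) | EX.Transfer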
According to this information, process instance 123 has three pending events, whose activityIDs are Cancel, Expiry, and Transfer respectively. These events are set in a single deferred choice, whose choiceActivityID is 1. None of these events has occurred, as indicated by isDone being false. The Cancel and Transfer events are triggered by the inbound event types EX.Cancel and EX.Transfer respectively. The Expiry event does not have a triggering event type, but has a timeToFire configured for December 13, 2008; Expiry is a timed event.
When one of these events arrives, it is processed only if the isDone field is false; otherwise it is discarded. When it is processed, the isDone flag is set to true for all three events. Marking all three true in effect marks the whole deferred choice as complete, and prevents a second event from occurring.
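The isDone rule can be sketched as a small function that either claims the deferred choice for an arriving event or reports that the event should be discarded. This Python uses an in-memory SQLite table preloaded with the three sample records for instance 123; the function name, the column types, and the stored date format are assumptions, and the sketch assumes the event belongs to a choice.

```python
import sqlite3


def open_demo_db():
    """PendingEvent preloaded with the three sample records for instance 123."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PendingEvent (procID INTEGER, activityID TEXT, choiceActivityID INTEGER, "
        "timeToFire TEXT, triggeringEventType TEXT, isDone INTEGER)"
    )
    conn.executemany(
        "INSERT INTO PendingEvent VALUES (123, ?, 1, ?, ?, 0)",
        [
            ("Cancel", None, "EX.Cancel"),
            ("Expiry", "2008-12-13", None),
            ("Transfer", None, "EX.Transfer"),
        ],
    )
    conn.commit()
    return conn


def claim_choice(conn, proc_id, activity_id):
    """Return True and close the whole choice if the event is still pending, else False."""
    row = conn.execute(
        "SELECT choiceActivityID FROM PendingEvent "
        "WHERE procID = ? AND activityID = ? AND isDone = 0",
        (proc_id, activity_id),
    ).fetchone()
    if row is None:
        return False  # already done (or unknown): the caller discards the event
    with conn:
        # Mark every event in the same choice as done, not just the one that fired.
        conn.execute(
            "UPDATE PendingEvent SET isDone = 1 WHERE procID = ? AND choiceActivityID = ?",
            (proc_id, row[0]),
        )
    return True
```

After a Transfer event claims the choice for instance 123, a later Cancel event for the same instance is refused. A production version would also need to guard against two events racing between the check and the update; that is omitted here.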
The Transfer Process
The process that handles the recipient's bank's request for transfer is shown in the following figure.
The process begins immediately by querying the PendingEvent table to check that its event is still pending (FindEvent). If it has already been marked as completed, the process rejects the request (Send Reject to Recipient Bank Event Not Found) and quits. Assuming the event is permitted, the process marks the choice as completed (Remove Event) and validates the request (Validate). If validation passes, the process, as already discussed, sends an acceptance to the recipient's bank (Send Accept Recipient Bank) and a completion notification to the sender (Send Completion Sender), commits the funds (Commit Funds), and then performs the following table updates:
- In the Process table, it sets the instance status to COMPLETED (using Close Process).
- It adds an entry to the ProcessAudit table (using Add Audit), indicating that the transfer succeeded.
- It saves the transfer request message to the ProcessVariable table. If a previous version of the message is already there, the process overwrites it (Update Variable); otherwise, it inserts a new message (Insert Variable).
If validation fails, the process sends a rejection message to the recipient bank (Send Reject Recipient Bank) and makes four table updates:
- It restores the deferred choice (using Restore Event), setting isDone to false for each of the three events.
- It increments the numRejects field in the EXState table (Add Reject).
- It adds an entry to the ProcessAudit table (using Add Audit), indicating that the transfer failed.
- It saves the transfer request message to the ProcessVariable table, using the same logic as above.
The successful validation path effectively terminates the larger process by removing all of its pending events. The failed validation path effectively loops back in the larger process to an earlier point, giving each of the events another chance to fire.
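The two outcomes can be sketched in Python as follows. This is an illustration of the table updates only: messaging, funds handling, and the ProcessVariable save are left out, the validation result is passed in as a flag, and the function name, return values, and column types are assumptions.

```python
import sqlite3


def open_demo_db():
    """Cut-down state tables with one pending instance (column types are assumptions)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Process (procID INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE PendingEvent (procID INTEGER, activityID TEXT,
                                   choiceActivityID INTEGER, isDone INTEGER);
        CREATE TABLE EXState (procID INTEGER PRIMARY KEY, numRejects INTEGER);
        CREATE TABLE ProcessAudit (procID INTEGER, entry TEXT);
        INSERT INTO Process VALUES (123, 'PENDING');
        INSERT INTO EXState VALUES (123, 0);
        INSERT INTO PendingEvent VALUES (123, 'Cancel', 1, 0), (123, 'Expiry', 1, 0),
                                        (123, 'Transfer', 1, 0);
    """)
    return conn


def handle_transfer(conn, proc_id, is_valid):
    """Apply the transfer burst's table updates; is_valid stands in for Validate."""
    found = conn.execute(
        "SELECT choiceActivityID FROM PendingEvent "
        "WHERE procID = ? AND activityID = 'Transfer' AND isDone = 0",
        (proc_id,),
    ).fetchone()                                          # FindEvent
    if found is None:
        return "event not found"
    choice = found[0]
    set_done = "UPDATE PendingEvent SET isDone = ? WHERE procID = ? AND choiceActivityID = ?"
    with conn:
        conn.execute(set_done, (1, proc_id, choice))      # Remove Event
        if is_valid:
            conn.execute("UPDATE Process SET status = 'COMPLETED' WHERE procID = ?",
                         (proc_id,))                      # Close Process
            conn.execute("INSERT INTO ProcessAudit VALUES (?, 'Transfer succeeded')",
                         (proc_id,))                      # Add Audit
            return "accepted"
        conn.execute(set_done, (0, proc_id, choice))      # Restore Event
        conn.execute("UPDATE EXState SET numRejects = numRejects + 1 WHERE procID = ?",
                     (proc_id,))                          # Add Reject
        conn.execute("INSERT INTO ProcessAudit VALUES (?, 'Transfer failed validation')",
                     (proc_id,))                          # Add Audit
        return "rejected"
```

Two failed attempts followed by a successful one leave numRejects at 2 and the instance COMPLETED; a further transfer request then finds no pending event.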
The Cancellation Process
The process to handle cancellation, shown in the next figure, starts out much the same way.
The process first checks that the event is still pending (Find Event), and if so, disables the deferred choice (Remove Event). The process then notifies the sender and the recipient of the cancellation (Send Recipient Email and Send Abort to Sender), restores the funds (Restore Funds), and updates the tables as follows:
- It marks the status of the instance as ABORTED (Close Process).
- It adds an audit entry indicating cancellation (Add Audit).
- It saves the cancellation event to the ProcessVariable table (Save Variable).
The Expiration Process
The process to handle expired transfers, shown in the next figure, is somewhat different.
The expiration process is not designed to handle the expiry of a single transfer. Rather, it scans the PendingEvent table for all expired transfers (Get Expired Transfers), and fires a cancellation event for each of them. The outer box labeled For Each Expired is a for loop that, for each record returned by the query, constructs a cancellation message (Create Cancellation Message) and launches a cancellation process (Launch Cancellation Process) to handle the message. It launches the process by sending a message on the JMS queue to which the routing process listens. The routing process, when it receives the event, routes it to the cancellation process. Thus, it is the cancellation process that will disable the deferred choice and abort the instance, not the timer process.
The timer process runs on a predefined schedule. The Poller step defines how often it runs (every fifteen minutes, for example). The timer process is not designed to run at the very moment a particular transfer expires. BusinessWorks manages the schedule internally; the schedule is not configured in our process state model.
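One pass of the timer process might look like the following Python sketch. It is not BusinessWorks code: the JMS queue is replaced by a send callback, the schedule is left to the caller, and the function name, message fields, column types, and ISO-format timestamps are assumptions.

```python
import sqlite3
from datetime import datetime


def open_demo_db():
    """Two pending transfers with different expiry times (column types are assumptions)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Process (procID INTEGER PRIMARY KEY, processType TEXT, convID TEXT);
        CREATE TABLE PendingEvent (procID INTEGER, activityID TEXT,
                                   timeToFire TEXT, isDone INTEGER);
        INSERT INTO Process VALUES (123, 'EmailTransfer', 'c1'), (124, 'EmailTransfer', 'c2');
        INSERT INTO PendingEvent VALUES (123, 'Expiry', '2008-12-13T00:00:00', 0),
                                        (124, 'Expiry', '2008-12-20T00:00:00', 0);
    """)
    return conn


def fire_expired(conn, now, send):
    """Send one cancellation message per expired timed event; return how many were sent."""
    rows = conn.execute(
        "SELECT p.processType, p.convID "
        "FROM PendingEvent e JOIN Process p ON p.procID = e.procID "
        "WHERE e.isDone = 0 AND e.timeToFire IS NOT NULL AND e.timeToFire <= ?",
        (now.isoformat(),),  # ISO-format strings sort in chronological order
    ).fetchall()                                             # Get Expired Transfers
    for process_type, conv_id in rows:
        # Create Cancellation Message, then hand it to the router's queue.
        send({"eventType": "Cancel", "processType": process_type, "convID": conv_id})
    return len(rows)
```

The function never updates isDone. That is left to the cancellation process, so a poll that runs again before the cancellation is handled sends a duplicate, which the cancellation process then ignores because the event is no longer pending.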
A Note on Implementation
TIBCO's BusinessWorks is designed for performance, and admittedly our processes make database updates rather liberally. (The request process has seven updates in the happy path!) More efficient alternatives are to flatten the data model (so that there are fewer tables to update) or build stored procedures to bundle updates (resulting in less IO to the database server).
Another option is to use TIBCO's proprietary checkpoint mechanism to serialize process state to disk. The checkpoint feature is clumsy but is often an efficient way to achieve the effect of long-running state in an engine that is designed for short-running processes. As a proprietary capability, it does not work as part of a generalized state model, which is why we did not demonstrate it here.
In this article, we have learned about:
- Process duration
- How to keep long-running state
- Combining short-running processes with state in TIBCO's BusinessWorks
In the next part, you will learn about Fast Short-Running BPEL.