Scheduler Agent Supervisor Pattern

Akka.Net implementation pattern for high-level retry logic for major fails (Long-lasting)

Context and Problem

An application performs tasks that comprise a number of steps, some of which may invoke remote services or access remote resources. The individual steps may be independent of each other, but they are orchestrated by the application logic that implements the task.

Whenever possible, the application should ensure that the task runs to completion and resolve any failures that might occur when accessing remote services or resources. These failures could occur for a variety of reasons.

For example, the network might be down, communications could be interrupted, a remote service may be unresponsive or in an unstable state, or a remote resource might be temporarily inaccessible—perhaps due to resource constraints. In many cases these failures may be transient and can be handled by using the Retry Pattern.

If the application detects a more permanent fault from which it cannot easily recover, it must be able to restore the system to a consistent state and ensure integrity of the entire end-to-end operation.

Solution

The Scheduler Agent Supervisor pattern defines the following actors. These actors orchestrate the steps (individual items of work) to be performed as part of the task (the overall process):

The Scheduler arranges for the individual steps that comprise the overall task to be executed and orchestrates their operation. These steps can be combined into a pipeline or workflow, and the Scheduler is responsible for ensuring that the steps in this workflow are performed in the appropriate order. The Scheduler maintains information about the state of the workflow as each step is performed (such as “step not yet started,” “step running,” or “step completed”) and records information about this state. This state information should also include an upper limit of the time allowed for the step to finish (referred to as the Complete By time). If a step requires access to a remote service or resource, the Scheduler invokes the appropriate Agent, passing it the details of the work to be performed. The Scheduler typically communicates with an Agent by using asynchronous request/response messaging. This can be implemented by using queues, although other distributed messaging technologies could be used instead.
The Scheduler, Agent, and Supervisor are logical components and their physical implementation depends on the technology being used. For example, several logical agents may be implemented as part of a single web service.The Agent contains logic that encapsulates a call to a remote service, or access to a remote resource referenced by a step in a task. Each Agent typically wraps calls to a single service or resource, implementing the appropriate error handling and retry logic (subject to a timeout constraint, described later). If the steps in the workflow being run by the Scheduler utilize several services and resources across different steps, each step might reference a different Agent (this is an implementation detail of the pattern).
The Supervisor monitors the status of the steps in the task being performed by the Scheduler. It runs periodically (the frequency will be system-specific), examines the status of steps as maintained by the Scheduler. If it detects any that have timed out or failed, it arranges for the appropriate Agent to recover the step or execute the appropriate remedial action (this may involve modifying the status of a step). Note that the recovery or remedial actions are typically implemented by the Scheduler and Agents. The Supervisor should simply request that these actions be performed.

The Scheduler maintains information about the progress of the task and the state of each step in a durable data store, referred to as the State Store. The Supervisor can use this information to help determine whether a step has failed. The below illustrates the relationship between the Scheduler, the Agents, the Supervisor, and the State Store.

scheduler

The actors in the Scheduler Agent Supervisor pattern

When an application wishes to run a task, it submits a request to the Scheduler. The Scheduler records initial state information about the task and its steps (for example, “step not yet started”) in the State Store and then commences performing the operations defined by the workflow. As the Scheduler starts each step, it updates the information about the state of that step in the State Store (for example, “step running”).

If a step references a remote service or resource, the Scheduler sends a message to the appropriate Agent. The message may contain the information that the Agent needs to pass to the service or access the resource, in addition to the Complete By time for the operation. If the Agent completes its operation successfully, it returns a response to the Scheduler. The Scheduler can then update the state information in the State Store (for example, “step completed”) and perform the next step. This process continues until the entire task is complete.

An Agent can implement any retry logic that is necessary to perform its work. However, if the Agent does not complete its work before the Complete By period expires the Scheduler will assume that the operation has failed. In this case, the Agent should stop its work and not attempt to return anything to the Scheduler (not even an error message), or attempt any form of recovery. The reason for this restriction is that, after a step has timed out or failed, another instance of the Agent may be scheduled to run the failing step (this process is described later).

If the Agent itself fails, the Scheduler will not receive a response. The pattern may not make a distinction between a step that has timed out and one that has genuinely failed.

If a step times out or fails, the State Store will contain a record that indicates that the step is running (“step running”), but the Complete By time will have passed. The Supervisor looks for steps such as this and attempts to recover them. One possible strategy is for the Supervisor to update the Complete By value to extend the time available to complete the step, and then send a message to the Scheduler identifying the step that has timed out . The Scheduler can then attempt to repeat this step. However, such a design requires the tasks to be idempotent.

It may be necessary for the Supervisor to prevent the same step from being retried if it continually fails or times out. To achieve this, the Supervisor could maintain a retry count for each step, along with the state information, in the State Store. If this count exceeds a predefined threshold the Supervisor can adopt a strategy such as waiting for an extended period before notifying the Scheduler that it should retry the step, in the expectation that the fault will be resolved during this period. Alternatively, the Supervisor can send a message to the Scheduler to request the entire task be undone by implementing a Compensating Transaction (this approach will depend on the Scheduler and Agents providing the information necessary to implement the compensating operations for each step that completed successfully).

If the Scheduler is restarted after a failure, or the workflow being performed by the Scheduler terminates unexpectedly, the Scheduler should be able to determine the status of any in-flight task that it was handling when it failed, and be prepared to resume this task from the point at which it failed. The implementation details of this process are likely to be system specific. If the task cannot be recovered, it may be necessary to undo the work already performed by the task. This may also require implementing a Compensating Transaction.

The key advantage of this pattern is that the system is resilient in the event of unexpected temporary or unrecoverable failures. The system can be constructed to be self-healing. For example, if an Agent or the Scheduler crashes, a new one can be started and the Supervisor can arrange for a task to be resumed. If the Supervisor fails, another instance can be started and can take over from where the failure occurred. If the Supervisor is scheduled to run periodically, a new instance may be automatically started after a predefined interval. The State Store may be replicated to achieve an even greater degree of resiliency.

When to Use this Pattern

Use this pattern when a process that runs in a distributed environment such as the cloud must be resilient to communications failure and/or operational failure.

This pattern might not be suitable for tasks that do not invoke remote services or access remote resources.

Close example of an Akka implementation

Akka uses actors which closely resemble the scheduler

Akka scheduler