Leader Election Pattern

 

Comments

 

Active/Passive election pattern for custom designed services. Acquires worker lock once acquired is active. More than two services would introduce complexity, to overcome this use.

 

 

Bully Algorithm

Ring Algorithm

 

Code

 

The BlobDistributedMutex class in the example contains the RunTaskWhenMutexAquired method that enables a role instance to attempt to obtain a lease over a specified blob.

 

The details of the blob (the name, container, and storage account) are passed to the constructor in a BlobSettings object when the BlobDistributedMutex object is created (this object is a simple struct that is included in the sample code).

 

The constructor also accepts aTask that references the code that the role instance should run if it successfully acquires the lease over the blob and is elected the leader.

 

Note that the code that handles the low-level details of obtaining the lease is implemented in a separate helper class named BlobLeaseManager.

 

public class BlobDistributedMutex
{
  ...
  private readonly BlobSettings blobSettings;
  private readonly Func taskToRunWhenLeaseAcquired;
  ...

  public BlobDistributedMutex(BlobSettings blobSettings, 
           Func taskToRunWhenLeaseAquired)
  {
    this.blobSettings = blobSettings;
    this.taskToRunWhenLeaseAquired = taskToRunWhenLeaseAquired;
  }
  
  public async Task RunTaskWhenMutexAcquired(CancellationToken token)
  {
    var leaseManager = new BlobLeaseManager(blobSettings);
    await this.RunTaskWhenBlobLeaseAcquired(leaseManager, token);    
  }

The RunTaskWhenMutexAquired method in the code sample above invokes the RunTaskWhenBlobLeaseAcquired method shown in the following code sample to actually acquire the lease.

 

The RunTaskWhenBlobLeaseAcquired method runs asynchronously. If the lease is successfully acquired, the role instance has been elected the leader. The purpose of the taskToRunWhenLeaseAcquired delegate is to perform the work that coordinates the other role instances.

 

If the lease is not acquired, another role instance has been elected as the leader and the current role instance remains a subordinate. Note that the TryAcquireLeaseOrWait method is a helper method that uses the BlobLeaseManager object to obtain the lease.

 

private async Task RunTaskWhenBlobLeaseAcquired(
    BlobLeaseManager leaseManager, CancellationToken token)
  {
    while (!token.IsCancellationRequested)
    {
      // Try to acquire the blob lease. 
      // Otherwise wait for a short time before trying again.
      string leaseId = await this.TryAquireLeaseOrWait(leaseManager, token);

      if (!string.IsNullOrEmpty(leaseId))
      {
        // Create a new linked cancellation token source so that if either the 
        // original token is cancelled or the lease cannot be renewed, the
        // leader task can be cancelled.
        using (var leaseCts = 
          CancellationTokenSource.CreateLinkedTokenSource(new[] { token }))
        {
          // Run the leader task.
          var leaderTask = this.taskToRunWhenLeaseAquired.Invoke(leaseCts.Token);          
          ...
        }
      }
    }
  }

The task started by the leader also executes asynchronously. While this task is running, the RunTaskWhenBlobLeaseAquired method shown in the following code sample periodically attempts to renew the lease. This action helps to ensure that the role instance remains the leader.

 

In the sample solution, the delay between renewal requests is less than the time specified for the duration of the lease in order to prevent another role instance from being elected the leader. If the renewal fails for any reason, the task is cancelled.

 

If the lease fails to be renewed or the task is cancelled (possibly as a result of the role instance shutting down), the lease is released. At this point, this or another role instance might be elected as the leader. The code extract below shows this part of the process.

<?div>
private async Task RunTaskWhenBlobLeaseAcquired(
    BlobLeaseManager leaseManager, CancellationToken token)
  {
    while (...)
    {
      ...
      if (...)
      {
        ...
        using (var leaseCts = ...)
        {
          ...
          // Keep renewing the lease in regular intervals. 
          // If the lease cannot be renewed, then the task completes.
          var renewLeaseTask = 
            this.KeepRenewingLease(leaseManager, leaseId, leaseCts.Token);

          // When any task completes (either the leader task itself or when it could
          // not renew the lease) then cancel the other task.
          await CancelAllWhenAnyCompletes(leaderTask, renewLeaseTask, leaseCts);
        }
      }
    }
  }
  ...
}

The KeepRenewingLease method is another helper method that uses the BlobLeaseManager object to renew the lease.

 

TheCancelAllWhenAnyCompletes method cancels the tasks specified as the first two parameters.

 

The following code example shows how to use the BlobDistributedMutex class in a worker role. This code obtains a lease over a blob named MyLeaderCoordinatorTask in the leases container in development storage, and specifies that the code defined in the MyLeaderCoordinatorTaskmethod should run if the role instance is elected the leader.

 

var settings = new BlobSettings(CloudStorageAccount.DevelopmentStorageAccount, 
  "leases", "MyLeaderCoordinatorTask");
var cts = new CancellationTokenSource();
var mutex = new BlobDistributedMutex(settings, MyLeaderCoordinatorTask);
mutex.RunTaskWhenMutexAcquired(this.cts.Token);

// Method that runs if the role instance is elected the leader
private static async Task MyLeaderCoordinatorTask(CancellationToken token)
{
  ...
}

Note the following points about the sample solution:

 

  • The blob is a potential single point of failure. If the blob service becomes unavailable, or the blob is inaccessible, the leader will be unable to renew the lease and no other role instance will be able to obtain the lease. In this case, no role instance will be able to act as the leader. However, the blob service is designed to be resilient, so complete failure of the blob service is considered to be extremely unlikely.
  • If the task being performed by the leader stalls, the leader might continue to renew the lease, preventing any other role instance from obtaining the lease and taking over the leader role in order to coordinate tasks. In the real world, the health of the leader should be checked at frequent intervals.
  • The election process is non-deterministic. You cannot make any assumptions about which role instance will obtain the blob lease and become the leader.
  • The blob used as the target of the blob lease should not be used for any other purpose. If a role instance attempts to store data in this blob, this data will not be accessible unless the role instance is the leader and holds the blob lease.