The load on a cloud application typically varies over time based on the number of active users or the types of activities they are performing. For example, more users are likely to be active during business hours, or the system may be required to perform computationally expensive analytics at the end of each month. There may also be sudden and unanticipated bursts in activity. If the processing requirements of the system exceed the capacity of the resources that are available, it will suffer from poor performance and may even fail. The system may be obliged to meet an agreed level of service, and such failure could be unacceptable.
There are many strategies available for handling varying load in the cloud, depending on the business goals for the application. One strategy is to use autoscaling to match the provisioned resources to the user needs at any given time. This has the potential to consistently meet user demand, while optimizing running costs. However, while autoscaling may trigger the provisioning of additional resources, this provisioning is not instantaneous. If demand grows quickly, there may be a window of time where there is a resource deficit.
An alternative strategy to autoscaling is to allow applications to use resources only up to some soft limit, and then throttle them when this limit is reached. The system should monitor how it is using resources so that, when usage exceeds some system-defined threshold, it can throttle requests from one or more users to enable the system to continue functioning and meet any service level agreements (SLAs) that are in place. For more information on monitoring resource usage, see the Instrumentation and Telemetry Guidance.
The system could implement several throttling strategies, including:
Belowshows an area graph for resource utilization (a combination of memory, CPU, bandwidth, and other factors) against time for applications that are making use of three features. A feature is an area of functionality, such as a component that performs a specific set of tasks, a piece of code that performs a complex calculation, or an element that provides a service such as an in-memory cache. These features are labeled A, B, and C.
Graph showing resource utilization against time for applications running on behalf of three users
The graph above illustrates the effects of deferring operations. Just prior to time T1, the total resources allocated to all applications using these features reach a threshold (the soft limit of resource utilization). At this point, the applications are in danger of exhausting the resources available. In this system, Feature B is less critical than Feature A or Feature C, so it is temporarily disabled and the resources that it was using are released. Between times T1 and T2, the applications using Feature A and Feature C continue running as normal. Eventually, the resource use of these two features diminishes to the point when, at time T2, there is sufficient capacity to enable Feature B again.
The autoscaling and throttling approaches can also be combined to help keep the applications responsive and within SLAs. If the demand is expected to remain high, throttling may provide a temporary solution while the system scales out. At this point, the full functionality of the system can be restored.
Below shows an area graph of the overall resource utilization by all applications running in a system against time, and illustrates how throttling can be combined with autoscaling.
Graph showing the effects of combining throttling with autoscaling
At time T1, the threshold specifying the soft limit of resource utilization is reached. At this point, the system can start to scale out. However, if the new resources do not become available sufficiently quickly then the existing resources may be exhausted and the system could fail. To prevent this from occurring, the system is temporarily throttled, as described earlier. When autoscaling has completed and the additional resources are available, throttling can be relaxed.
Use this pattern: