January 14th, 2016 - by Emmanuel Stalling
What is Stolen CPU?
Stolen CPU refers to CPU cycles that a virtual machine was ready to use but did not receive, because its hypervisor had reached the physical processor's maximum capacity servicing other tasks, typically other virtual machines. In effect, processing time is reallocated to cover a shortfall elsewhere on the host. Hypervisors measure stolen CPU as “CPU steal” or “steal time”: how long a virtual CPU waits for a physical CPU to run its processes. That makes it a good way to rule out possible causes of idle CPUs. If steal time is near zero while idle time remains relatively high, something else is causing the CPUs to stall. If steal time and idle time rise and fall together, stolen CPU is probably to blame.
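On a Linux guest, steal time is exposed as the eighth counter on the aggregate cpu line of /proc/stat (the same figure top shows in its st column). The sketch below assumes a Linux guest. It samples that line twice and reports steal and idle as percentages of the elapsed CPU time, which are the two numbers the comparison above relies on. The function names are illustrative and not taken from any particular monitoring tool.

```python
import time
from typing import Dict, Tuple

# Order of the counters on the aggregate "cpu" line of Linux /proc/stat
# (documented in proc(5)); values are cumulative clock ticks since boot.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu ...' line into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels report fewer columns; treat missing counters as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def steal_and_idle_percent(before: str, after: str) -> Tuple[float, float]:
    """Return (steal %, idle %) of the CPU time elapsed between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: b[name] - a[name] for name in FIELDS}
    # guest and guest_nice are already counted inside user and nice,
    # so they are deliberately left out of the total.
    total = sum(delta.values())
    if total <= 0:
        return 0.0, 0.0
    return 100.0 * delta["steal"] / total, 100.0 * delta["idle"] / total


def sample(interval: float = 1.0, path: str = "/proc/stat") -> Tuple[float, float]:
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    def first_line() -> str:
        with open(path) as f:
            return f.readline()

    before = first_line()
    time.sleep(interval)
    return steal_and_idle_percent(before, first_line())
```

On a Linux VM, calling `sample(5.0)` returns the steal and idle percentages over a five-second window. Working from counter deltas rather than the raw totals keeps the result about recent activity instead of averaging over the whole uptime.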
Why Stolen CPU Might Be a Critical Issue
Stolen CPU is a critical issue for the same reason a physical CPU running constantly at full load is. If you’ve ever tried to perform a simple task on a desktop PC, like opening a web browser, while a large archive is being unzipped, you have an idea of the slowdown that occurs when CPUs reach maximum capacity. Because virtual machines often serve as the backend for web applications, CPU steal can make an application slow to load, or keep it from loading at all if heavy steal brings the system down. Over time, this can hurt your website traffic and customer retention.
Similarly, when virtual machines are part of enterprise database applications, the resulting slowdown can prevent mission-critical or time-sensitive data from reaching its destination. What begins as an inconvenience for employees who rely on the system can quickly turn into lower operational efficiency as overloaded CPUs bog the system down. Operational efficiency can be the difference between retaining customers and losing them to faster competitors.
The Causes of Stolen CPU
The main causes of CPU steal are poor allocation and insufficient resources. Unlike RAM, which has hard limits, CPU cycles are not inherently divided among the virtual machines on a physical server. Many administrators presume the hypervisor already guards against contention and neglect to set hard limits on how much CPU each virtual machine can use; without those limits, virtual CPUs will “borrow” cycles from the shared pool to complete their tasks.
Insufficient resources are a more straightforward cause. Sometimes the problem starts with the physical CPU, which may be too old or have too few cores to handle the workload of its hosted VMs. It may also result from a hosting service overselling the capacity of its physical servers, so that all of its clients suffer the slowdown of shared CPU resources. Stolen CPU is particularly common in cloud computing, which makes heavy use of virtualization and runs many concurrent tasks on the same physical CPUs.
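A quick way to see how overselling turns into steal is to compare the vCPUs promised to guests with the physical cores behind them. The sketch below is a back-of-envelope estimate, not a model of any particular hypervisor's scheduler: it assumes every allocated vCPU is busy at the same moment, that cores are shared evenly, and it ignores hyperthreading and hypervisor overhead. The function names are hypothetical.

```python
def overcommit_ratio(vcpus_allocated: int, physical_cores: int) -> float:
    """vCPUs promised to guests per physical core on one host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return vcpus_allocated / physical_cores


def worst_case_steal_percent(vcpus_allocated: int, physical_cores: int) -> float:
    """Idealized steal when every allocated vCPU wants to run at once.

    With even sharing, each vCPU gets physical_cores / vcpus_allocated of a
    core; the rest of its wall-clock time is spent waiting, i.e. stolen.
    """
    ratio = overcommit_ratio(vcpus_allocated, physical_cores)
    if ratio <= 1.0:
        return 0.0  # enough cores for every vCPU to run simultaneously
    return 100.0 * (1.0 - 1.0 / ratio)
```

For example, a host with 16 physical cores that has sold 32 vCPUs leaves each vCPU half a core once every guest is busy, so `worst_case_steal_percent(32, 16)` returns 50.0. Real steal is usually lower because guests are rarely all busy at once, which is exactly why overselling can go unnoticed until load peaks.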
Troubleshooting Stolen CPU Issues
For service providers:
Resource Limiting - Limit the processing power each virtual machine can use according to how many virtual machines run on a particular server. For instance, if four virtual machines share a server, the throttle may be set to 25% each to prevent any one of them from stealing CPU from the others. With three, set it at roughly 33%, and so forth.
Server Performance Monitoring - A reliable monitoring solution ensures that the numbers you work from are accurate. It may let you pinpoint spikes in CPU activity that lead to CPU steal, such as a search engine’s web crawler parsing your website, and plan ways to offset the load during those periods.
Upgrade Software - Sometimes the problem is software related, e.g. a hypervisor that cannot properly allocate resources across the number of virtual machines it hosts. Upgrading to a newer version, or to different software altogether, might be the answer.
Extend Computing Resources - Replacing the processor(s) with models that have a higher clock speed (GHz) or more cores increases the resources available to your virtual machines.
Migrate Virtual Machines - Moving VMs to a different server, or removing non-critical VMs, is a viable option if you have the time and hardware. Spreading tasks across more physical CPUs reduces contention on each one, so virtual CPUs spend less time waiting.
Check with Hosting Provider - If CPU usage is low but CPU steal is consistently higher than acceptable norms, the hosting provider may have oversold its capacity. If the provider offers no remedy, it may be necessary to switch to a new one.
Monitor End-Users - Make sure end-user activity isn’t driving up CPU usage; if it is, consider moving that work to more powerful machines to take the load off the server.
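The Resource Limiting item's even split can be turned into a concrete cap. The sketch below assumes a Linux/KVM host where each VM's process runs in its own cgroup v2 group; the article does not name a hypervisor, and tools such as libvirt expose their own cap settings. In cgroup v2, a group's cpu.max file takes "<quota> <period>" in microseconds, and the quota is counted across all CPUs, so a share of the whole host is scaled by the CPU count. The helper names are hypothetical.

```python
def equal_share_percent(vm_count: int) -> float:
    """Share of the host (in %) each VM gets when capacity is split evenly."""
    if vm_count <= 0:
        raise ValueError("vm_count must be positive")
    return 100.0 / vm_count


def cgroup_cpu_max(share_percent: float, host_cpus: int,
                   period_us: int = 100_000) -> str:
    """Build a cgroup v2 cpu.max value ("<quota> <period>") for a host share.

    The group may use at most <quota> microseconds of CPU time, summed over
    all CPUs, in every <period> microseconds.
    """
    if not 0 < share_percent <= 100:
        raise ValueError("share_percent must be in (0, 100]")
    if host_cpus <= 0:
        raise ValueError("host_cpus must be positive")
    quota_us = int(period_us * host_cpus * share_percent / 100)
    return f"{quota_us} {period_us}"
```

For four VMs on an 8-CPU host, `cgroup_cpu_max(equal_share_percent(4), 8)` returns "200000 100000", i.e. two CPUs' worth of time per 100 ms period for each VM. An even split is the simplest policy; weighting shares by each VM's expected load works the same way with different percentages.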
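The Server Performance Monitoring and Check with Hosting Provider items both come down to reading a series of usage and steal samples. The sketch below assumes samples from whatever monitoring tool you already use; the `Sample` type is hypothetical, and the thresholds (10% steal, 30% usage, 80% of samples) are placeholders, since the article does not define "acceptable norms". It finds sustained steal spikes and flags the low-usage, high-steal pattern that suggests an oversold host.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    timestamp: int        # e.g. Unix seconds
    busy_percent: float   # CPU time the guest actually used
    steal_percent: float  # CPU time the guest spent waiting on the host


def steal_spikes(samples: List[Sample], threshold: float = 10.0,
                 min_run: int = 3) -> List[Tuple[int, int]]:
    """(start, end) timestamps of runs where steal stays above `threshold`
    for at least `min_run` consecutive samples."""
    spikes: List[Tuple[int, int]] = []
    start: Optional[int] = None
    end: Optional[int] = None
    run = 0
    for s in samples:
        if s.steal_percent > threshold:
            if run == 0:
                start = s.timestamp
            end = s.timestamp
            run += 1
        else:
            if run >= min_run:
                spikes.append((start, end))
            run = 0
    if run >= min_run:  # close a run that lasts until the final sample
        spikes.append((start, end))
    return spikes


def looks_oversold(samples: List[Sample], busy_max: float = 30.0,
                   steal_min: float = 10.0, fraction: float = 0.8) -> bool:
    """True if most samples show low usage combined with high steal."""
    if not samples:
        return False
    hits = sum(1 for s in samples
               if s.busy_percent < busy_max and s.steal_percent > steal_min)
    return hits / len(samples) >= fraction
```

Requiring several consecutive samples above the threshold keeps a single noisy reading from being reported as a spike, while the timestamps of each run can be matched against known events such as crawler visits.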