Part 1. The Case for Energy-Proportional Computing

by: Luiz André Barroso and Urs Hölzle
Google

Energy-proportional designs would enable large energy savings in servers, potentially doubling their efficiency in real-life use. Achieving energy proportionality will require significant improvements in the energy usage profile of every system component, particularly the memory and disk subsystems.

Energy efficiency, a new focus for general-purpose computing, has been a major technology driver in the mobile and embedded areas for some time. Earlier work emphasized extending battery life, but it has since expanded to include peak power reduction because thermal constraints began to limit further CPU performance improvements.

Energy management has now become a key issue for servers and data center operations, focusing on the reduction of all energy-related costs, including capital, operating expenses, and environmental impacts. Many energy-saving techniques developed for mobile devices became natural candidates for tackling this new problem space. Although servers clearly provide many parallels to the mobile space, we believe that they require additional energy-efficiency innovations.

In current servers, the lowest energy-efficiency region corresponds to their most common operating mode. Addressing this mismatch will require significant rethinking of components and systems. To that end, we propose that energy proportionality should become a primary design goal. Although our experience in the server space motivates these observations, we believe that energy-proportional computing also will benefit other types of computing devices.
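This mismatch can be made concrete with a small model. The linear power curve below is an illustrative assumption (a machine drawing half its peak power when idle), not measured data from the article; it shows how work delivered per unit of power collapses in the low-utilization region where servers spend most of their time.

```python
# A minimal sketch of the efficiency mismatch described above. The linear
# power model is an illustrative assumption (idle power at 50 percent of
# peak), not measured data from any particular server.

def power(utilization, idle_fraction=0.5):
    """Assumed power draw as a fraction of peak power."""
    return idle_fraction + (1.0 - idle_fraction) * utilization

def efficiency(utilization):
    """Work delivered per unit of power, normalized to 1.0 at peak load."""
    return utilization / power(utilization)

# Efficiency is worst exactly where servers operate most often.
for u in (0.1, 0.3, 0.5, 1.0):
    print(f"utilization {u:4.0%} -> relative efficiency {efficiency(u):.2f}")
```

Under this assumed power curve, a server at 30 percent utilization delivers less than half the work per joule that it delivers at peak; an energy-proportional design (zero idle power) would make the ratio constant.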

Dollars & CO₂
Recent reports 1,2 highlight a growing concern with computer energy consumption and show how current trends could make energy a dominant factor in the total cost of ownership. 3 Besides the server electricity bill, TCO includes other energy-dependent components such as the cost of energy for the cooling infrastructure and provisioning costs, specifically the data center infrastructure's cost. To a first-order approximation, both cooling and provisioning costs are proportional to the average energy that servers consume; therefore, energy-efficiency improvements should benefit all energy-dependent TCO components.
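As a rough illustration of this first-order approximation, the sketch below treats cooling and provisioning costs as directly proportional to average server energy. All rates here (electricity price, overhead factors) and the function name are hypothetical, chosen only to show that an efficiency gain reduces every energy-dependent TCO component by the same proportion.

```python
# Hedged sketch of the first-order TCO model described above: cooling and
# provisioning costs are taken as proportional to average server energy.
# All rates below are illustrative assumptions, not figures from the article.

def energy_dependent_tco(avg_power_w, hours, price_per_kwh=0.10,
                         cooling_overhead=0.5, provisioning_per_kwh=0.05):
    """Return (server energy cost, cooling cost, provisioning cost) in dollars."""
    kwh = avg_power_w * hours / 1000.0          # average energy consumed
    server = kwh * price_per_kwh                # direct electricity bill
    cooling = server * cooling_overhead         # scales with average energy
    provisioning = kwh * provisioning_per_kwh   # amortized infrastructure
    return server, cooling, provisioning

# One server averaging 200 W for a year; a 10 percent efficiency gain cuts
# every component by 10 percent, since each is proportional to average energy.
s, c, p = energy_dependent_tco(avg_power_w=200, hours=24 * 365)
print(f"electricity ${s:.2f}, cooling ${c:.2f}, provisioning ${p:.2f}")
```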

Efforts such as the Climate Savers Computing Initiative (www.climatesaverscomputing.org) could help lower worldwide computer energy consumption by promoting widespread adoption of high-efficiency power supplies and encouraging the use of power-saving features already present in users' equipment. The introduction of more efficient CPUs based on chip multiprocessing has also contributed positively toward more energy-efficient servers. 3 However, long-term technology trends invariably indicate that higher performance means increased energy usage. As a result, energy efficiency must improve as fast as computing performance to avoid a significant growth in computers' energy footprint.

Servers versus Laptops
Many of the low-power techniques developed for mobile devices directly benefit general-purpose servers, including multiple voltage planes, an array of energy-efficient circuit techniques, clock gating, and dynamic voltage-frequency scaling. Mobile devices require high performance for short periods while the user awaits a response, followed by relatively long idle intervals of seconds or minutes. Many embedded computers, such as sensor network agents, present a similar bimodal usage model. 4

This kind of activity pattern steers designers to emphasize high energy efficiency at peak performance levels and in idle mode, supporting inactive low-energy states, such as sleep or standby, that consume near-zero energy. However, the usage model for servers, especially those used in large-scale Internet services, has very different characteristics.



Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum utilization levels.

Figure 1 shows the distribution of CPU utilization levels for thousands of servers during a six-month interval. 5 Although the actual shape of the distribution varies significantly across services, two key observations from Figure 1 can be generalized: Servers are rarely completely idle and seldom operate near their maximum utilization. Instead, servers operate most of the time at between 10 and 50 percent of their maximum utilization levels. Such behavior is not accidental, but results from observing sound service provisioning and distributed systems design principles.

An Internet service provisioned such that the average load approaches 100 percent will likely have difficulty meeting throughput and latency service-level agreements because minor traffic fluctuations or any internal disruption, such as hardware or software faults, could tip it over the edge. Moreover, the lack of a reasonable amount of slack makes regular operations exceedingly complex because any maintenance task has the potential to cause serious service disruptions. Similarly, well-provisioned services are unlikely to spend significant amounts of time completely idle because doing so would represent a substantial waste of capital.

Even during periods of low service demand, servers are unlikely to be fully idle. Large-scale services usually require hundreds of servers and distribute the load over these machines. In some cases, it might be possible to completely idle a subset of servers during low-activity periods by, for example, shrinking the number of active front ends. Often, though, this is hard to accomplish because data, not just computation, is distributed among machines. For example, common practice calls for spreading user data across many databases to eliminate the bottleneck that a central database holding all users poses.

Spreading data across multiple machines improves data availability as well because it reduces the likelihood that a crash will cause data loss. It can also help hasten recovery from crashes by spreading the recovery load across a greater number of nodes, as is done in the Google File System. 6 As a result, all servers must be available, even during low-load periods. In addition, networked servers frequently perform many small background tasks that make it impossible for them to enter a sleep state.

With few windows of complete idleness, servers cannot take advantage of the existing inactive energy-savings modes that mobile devices otherwise find so effective. Although developers can sometimes restructure applications to create useful idle intervals during periods of reduced load, in practice this is often difficult and even harder to maintain. The Tickless kernel 7 exemplifies some of the challenges involved in creating and maintaining idleness. Moreover, the most attractive inactive energy-savings modes tend to be those with the highest wake-up penalties, such as disk spin-up time, and thus their use complicates application deployment and greatly reduces their practicality.
