To improve energy efficiency and stay within power budgets, it is important to be able to measure the power consumption of cloud computing servers. Intel's Running Average Power Limit (RAPL) interface is a powerful tool for this purpose. RAPL provides power-limiting features and accurate energy readings for CPUs and DRAM, which are easily accessible through different interfaces on large distributed computing systems. Since its introduction, RAPL has been used extensively in power measurement and modeling. However, the advantages and disadvantages of RAPL have not been well investigated yet. To fill this gap, we conduct a series of experiments to disclose the underlying strengths and weaknesses of the RAPL interface, using both customized microbenchmarks and three well-known application-level benchmarks: Stream, Stress-ng, and ParFullCMS. Moreover, to make the analysis as realistic as possible, we leverage a production-level power measurement dataset from Taito, a supercomputing cluster of the Finnish Center of Scientific Computing (CSC). Our results illustrate different aspects of RAPL, and we document the findings through a comprehensive analysis. Our observations reveal that RAPL readings are highly correlated with plug power, are sufficiently accurate, and incur negligible performance overhead. Experimental results suggest that RAPL can be a very useful tool for measuring and monitoring the energy consumption of servers without deploying any complex power meters. We also show that there are still some open issues, such as driver support, non-atomicity of register updates, and unpredictable timings, that might weaken the usability of RAPL in certain scenarios. For such scenarios, we pinpoint solutions and workarounds.
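As a minimal sketch of how RAPL-based power measurement typically works in practice: on Linux, cumulative energy in microjoules is exposed through the powercap sysfs (e.g. `/sys/class/powercap/intel-rapl:0/energy_uj`), and the counter wraps around at `max_energy_range_uj`, so a sampling tool must compute deltas with wraparound handling. The function below illustrates only that arithmetic; the file paths and the wraparound value used in the example are assumptions, not values from the paper.

```python
# Hedged sketch: average power from two RAPL energy_uj samples.
# On Linux the samples would be read from, e.g.,
#   /sys/class/powercap/intel-rapl:0/energy_uj
# and the wraparound bound from max_energy_range_uj in the same directory.

def rapl_avg_power(e0_uj, e1_uj, interval_s, max_range_uj):
    """Average power in watts between two cumulative energy readings."""
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:                    # counter wrapped between samples
        delta_uj += max_range_uj
    return delta_uj / 1e6 / interval_s  # microjoules -> joules -> watts

# Example: 5 J consumed over 0.5 s -> 10 W (max_range value is illustrative)
print(rapl_avg_power(1_000_000, 6_000_000, 0.5, 262_143_328_850))
```

Handling the wraparound case matters because a naive subtraction produces a large negative (or bogus) power reading whenever the counter overflows mid-interval.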
Most video streaming traffic is delivered over HTTP using standard web servers. While traditional web server workloads consist primarily of requests for small files that can be serviced from the file system cache, HTTP video streaming workloads often service a long tail of large, infrequently requested videos. As a result, optimizing disk accesses is critical to obtaining good server throughput. In this paper we explore serialized, aggressive disk prefetching, a technique that can be used to improve the throughput of HTTP streaming video web servers. We identify how serialization and aggressive prefetching affect performance and, based on our findings, we construct and evaluate Libception, an application-level shim library that implements both techniques. By dynamically linking against Libception at runtime, applications transparently obtain the benefits of serialization and aggressive prefetching without changing their source code. In contrast to other approaches that modify applications, make kernel changes, or attempt to optimize kernel tuning, Libception provides a portable and relatively simple system in which techniques for optimizing I/O in HTTP video streaming servers can be implemented and evaluated. We empirically evaluate the efficacy of serialization and aggressive prefetching both with and without Libception, using three web servers (Apache, nginx, and the userver) running on two operating systems (FreeBSD and Linux). We find that, by using Libception, we can improve streaming throughput for all three web servers by at least a factor of 2 on FreeBSD and a factor of 2.5 on Linux. Additionally, we find that with significant tuning of Linux kernel parameters, we can achieve performance similar to Libception's by globally modifying Linux's disk prefetch behaviour.
Finally, we demonstrate Libception's potential utility for improving the performance of other workloads by using it to reduce the completion time of a microbenchmark involving two applications competing for disk resources.
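The two techniques Libception combines can be sketched in a few lines: a lock serializes concurrent disk reads so the disk services one large sequential stream at a time, and `posix_fadvise(WILLNEED)` asks the kernel to prefetch a large window ahead of each read. The function name and the 4 MiB window size below are illustrative assumptions, not Libception's actual API or configuration.

```python
# Hedged sketch of serialization + aggressive prefetching (illustrative,
# not Libception's real interface).
import os
import threading

_disk_lock = threading.Lock()
PREFETCH_WINDOW = 4 * 1024 * 1024   # assumed 4 MiB aggressive prefetch

def serialized_prefetching_read(fd, offset, length):
    with _disk_lock:                           # one disk reader at a time
        if hasattr(os, "posix_fadvise"):       # available on Linux/FreeBSD
            os.posix_fadvise(fd, offset, PREFETCH_WINDOW,
                             os.POSIX_FADV_WILLNEED)  # hint: prefetch ahead
        return os.pread(fd, length, offset)
```

A real shim would interpose on `read`/`sendfile` via `LD_PRELOAD`-style dynamic linking, as the abstract describes, rather than requiring applications to call a new function.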
Elasticity is one of the main features of cloud computing, allowing customers to scale their resources based on the workload. Many autoscalers have been proposed in the past decade to decide, on behalf of cloud customers, when and how to provision resources to a cloud application based on the workload, utilizing cloud elasticity features. However, in prior work, when a new policy is proposed it is seldom compared to the state of the art, and is often compared only to static provisioning using a predefined QoS target. This reduces the ability of cloud customers and cloud operators to choose and deploy an autoscaling policy, as there is seldom enough analysis of the performance of autoscalers under different operating conditions and with different applications. In our work, we conduct an experimental performance evaluation of autoscaling policies, using workflows as the application model, a commonly used formalism for automating resource management for applications with well-defined yet complex structures. We present a detailed comparative study of general state-of-the-art autoscaling policies, along with two new workflow-specific policies. To understand the performance differences between the seven policies, we conduct various forms of pairwise and group comparisons. We report both individual and aggregated metrics. As many workflows have deadline requirements on their tasks, we study the effect of autoscaling on workflow deadlines. Additionally, we look into the effect of autoscaling on the accounted and hourly charged costs, and evaluate the performance variability caused by autoscaler selection for each group of workflow sizes. Our results highlight the trade-offs between the studied policies, how they can affect meeting deadlines, and how they perform under different operating conditions, thus enabling a better understanding of the current state of the art.
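For intuition about what such policies decide, a minimal threshold-based autoscaler (a common baseline, and not one of the seven policies the study evaluates; all names and thresholds below are illustrative assumptions) maps observed demand to a VM count each provisioning interval:

```python
# Hedged sketch: a basic threshold autoscaling policy (illustrative baseline,
# not a policy from the study).

def scale_decision(pending_tasks, running_vms, vm_capacity,
                   upper=0.8, lower=0.3, min_vms=1):
    """Return the VM count to provision for the next interval."""
    utilization = pending_tasks / max(running_vms * vm_capacity, 1)
    if utilization > upper:                      # overloaded: scale out
        return running_vms + 1
    if utilization < lower and running_vms > min_vms:
        return running_vms - 1                   # underloaded: scale in
    return running_vms                           # within band: hold steady
```

Even this simple form exposes the trade-offs the study measures: aggressive scale-out helps meet deadlines but raises hourly charged costs, while slow scale-in saves reconfiguration churn at the price of paying for idle VMs.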
Many cost-conscious public cloud workloads (tenants) are turning to Amazon EC2's spot instances because, on average, these instances offer significantly lower prices (up to 10 times lower) than on-demand and reserved instances of comparable advertised resource capacities. To use spot instances effectively, a tenant must carefully weigh the lower costs of these instances against their poorer availability. Towards this, we empirically study four features of EC2 spot instance operation that a cost-conscious tenant may find useful to model. Using extensive evaluation based on both historical and current spot instance data, we show shortcomings in the state-of-the-art modeling of these features, which we overcome. Our analysis reveals many novel properties of spot instance operation, some of which offer predictive value while others do not. Using these insights, we design predictors for our features that strike a balance between computational efficiency (allowing for online resource procurement) and cost-efficacy. We explore case studies in which we implement prototypes of dynamic spot instance procurement advised by our predictors for two types of workloads. Compared to the state of the art, our approach achieves (i) comparable cost but much better performance (fewer bid failures) for a latency-sensitive in-memory Memcached cache, and (ii) an additional 18% cost savings with comparable (if not better) performance for a delay-tolerant batch workload.
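The core cost-versus-availability trade-off can be illustrated with a toy empirical predictor (purely illustrative, not the paper's predictors): a bid "survives" while the spot price stays at or below it, so a price history yields an empirical survival probability per candidate bid, and a tenant can pick the cheapest bid meeting an availability target. Function names and the target value are assumptions.

```python
# Hedged sketch: empirical spot-bid survival estimation (illustrative, not
# the paper's method).

def bid_survival_probability(price_history, bid):
    """Empirical fraction of observed spot prices at or below the bid."""
    if not price_history:
        return 0.0
    return sum(p <= bid for p in price_history) / len(price_history)

def cheapest_safe_bid(price_history, target=0.95):
    """Smallest observed price whose empirical survival meets the target."""
    for p in sorted(set(price_history)):       # candidate bids, low to high
        if bid_survival_probability(price_history, p) >= target:
            return p
    return max(price_history)                  # fall back to the peak price
```

A real predictor must also stay cheap to evaluate, since (as the abstract notes) procurement decisions are made online; a single pass over a bounded history, as above, is one way to keep that cost low.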