ACM DL

ACM Transactions on

Modeling and Performance Evaluation of Computing Systems (TOMPECS)

Menu
Latest Articles

Efficient Straggler Replication in Large-Scale Parallel Computing

In a cloud computing job with many parallel tasks, the tasks on the slowest machines (straggling tasks) become the bottleneck in the job completion.... (more)

Production Application Performance Data Streaming for System Monitoring

In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and... (more)

NEWS

About TOMPECS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems (ToMPECS) is a new ACM journal that publishes refereed articles on all aspects of the modeling, analysis, and performance evaluation of computing and communication systems.

The target areas for the application of these performance evaluation methodologies are broad, and include traditional areas such as computer networks, computer systems, storage systems, telecommunication networks, and Web-based systems, as well as new areas such as data centers, green computing/communications, energy grid networks, and on-line social networks.

Issues of the journal will be published on a quarterly basis, appearing both in print form and in the ACM Digital Library. 

Forthcoming Articles

Optimizing N-Tier Application Scalability in the Cloud: A Study of Soft Resource Allocation

Performance of a fixed reward incentive scheme for two-hop DTNs with competing relays

We analyse the performance of an incentive scheme for two-hop DTNs in which a backlogged source proposes a fixed reward to the relays to deliver a message. Only one message at a time is proposed by the source. For a given message, only the first relay to deliver it gets the reward corresponding to this message thereby inducing a competition between the relays. The relays seek to maximize the expected reward for each message whereas the objective of the source is to satisfy a given constraint on the probability of message delivery. We show that the optimal policy of a relay is of threshold type: it accepts a message until a first threshold and then keeps the message until it either meets the destination or reaches the second threshold. Formulas for computing the thresholds as well as probability of message delivery are derived for a backlogged source.

A New Framework for Evaluating Straggler Detection Mechanisms in MapReduce

Big Data systems (e.g., MapReduce, Hadoop, Spark) rely increasingly on speculative execution to mask slow tasks, also known as stragglers, because a job's execution time is dominated by the slowest task instance. Big Data systems typically identify stragglers and speculatively run copies of those tasks with the expectation a copy may complete faster to shorten job execution times. There is a rich body of recent results on straggler mitigation in MapReduce. However, the majority of these do not consider the problem of accurately detecting stragglers. Instead, they adopt a particular straggler detection approach and then study its effectiveness in terms of performance, e.g., reduction in job completion time, or efficiency, e.g., high resource utilization. In this paper, we consider a complete framework for straggler detection and mitigation. We start with a set of metrics that can be used to characterize and detect stragglers including Precision, Recall, Detection Latency, Undetected Time and Fake Positive. We then develop an architectural model by which these metrics can be linked to measures of performance including execution time and system energy overheads. We further conduct a series of experiments to demonstrate which metrics and approaches are more effective in detecting stragglers and are also predictive of effectiveness in terms performance and energy efficiencies. For example, our results indicate that the default Hadoop straggler detector could be made more effective. In certain cases, Precision is low and only 55% of those detected are actual stragglers and the Recall, i.e., percent of actual detected stragglers, is also relatively low at 56%. For the same case, the hierarchical approach (i.e., a green-driven detector based on the default one) achieves a Precision of 99% and a Recall of 29%. Further, these increases in Precision can be translated to achieve lower execution time and energy consumption, and thus higher performance and energy efficiency; compared to the default Hadoop mechanism, the energy consumption is reduced by almost 31%. These results demonstrate how our framework can offer useful insights and be applied in practical settings to characterize and design new straggler detection mechanisms for MapReduce systems.

On the Endurance of the d-Choices Garbage Collection Algorithm for Flash-based SSDs

Garbage collection (GC) algorithms for flash-based solid state drives (SSDs) have a profound impact on its performance and many studies have focused on assessing the so-called write amplification of various GC algorithms. In this paper we consider the family of d-choices GC algorithms and study (a) the extent in which these algorithms induce unequal wear and (b) the manner in which they affect the life time of the drive. For this purpose we introduce two performance measures: the PE fairness and SSD endurance. We study the impact of the d-choices GC algorithm on both these measures under different workloads (uniform, synthetic and trace-based) when combined with two different write modes. Numerical results show that the less complex of the two write modes, which does not require any form of hot/cold data identification, may give rise to a better SSD endurance. Further the d-choices GC algorithm is often shown to strike a good balance between garbage collection and wear leveling for small d values (e.g., d=10), yielding high endurance.

Performance of Redundancy(d) with Identical/Independent Replicas

Queueing systems with redundancy have received considerable attention recently. The idea of redundancy is to reduce latency by replicating each incoming job a number of times and to assign these replicas to a set of randomly selected servers. As soon as one replica completes service the remaining replicas are cancelled. Most prior work on queueing systems with redundancy assumes that the job durations of the different replicas are i.i.d., which yields insights that can be misleading for computer system design. In this paper we develop a differential equation, using the cavity method, to assess the workload and response time distribution in a large homogeneous system with redundancy without the need to rely on this independence assumption. More specifically, we assume that the duration of each replica of a single job is identical across the servers and follows a general service time distribution. Simulation results suggest that the differential equation yields exact results as the system size tends to infinity and can be used to study the stability of the system. We also compare our system to the one with i.i.d. replicas and show the similarity in the analysis used for independent resp. identical replicas.

SORT: Semi Online Reliable Task Mapping for Embedded Multi-core Systems

All ACM Journals | See Full Journal Index

Search TOMPECS
enter search term and/or author name