
ACM Transactions on

Modeling and Performance Evaluation of Computing Systems (TOMPECS)

Latest Articles

Efficient Traffic Load-Balancing via Incremental Expansion of Routing Choices

Routing policies play a major role in the performance of communication networks. Backpressure-based adaptive routing algorithms where traffic is load... (more)

EnergyQARE: QoS-Aware Data Center Participation in Smart Grid Regulation Service Reserve Provision

Power market operators have recently introduced smart grid demand response (DR), in which electricity consumers regulate their power usage following market requirements. DR helps stabilize the grid and enables integrating a larger amount of intermittent renewable power generation. Data centers provide unique opportunities for DR participation due... (more)

A Quantitative and Comparative Study of Network-Level Efficiency for Cloud Storage Services

Cloud storage services such as Dropbox and OneDrive provide users with a convenient and reliable way to store and share data from anywhere, on any... (more)

Effect of Recommendations on Serving Content with Unknown Demand

We consider the task of content replication in distributed content delivery systems used by Video-on-Demand (VoD) services with large content... (more)

Managing Response Time Tails by Sharding

Matrix analytic methods are developed to compute the probability distribution of response times (i.e., data access times) in distributed storage systems protected by erasure coding, which is implemented by sharding a data object into N fragments, only K < N of which are required to reconstruct the object. This leads to a partial-fork-join model... (more)
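The K-of-N sharding idea above can be illustrated with a small simulation (a hypothetical sketch, not code from the article): if the N fragment fetches have random durations, a read completes at the K-th smallest fetch time, i.e., the K-th order statistic. The function names and the exponential fetch-time assumption below are illustrative choices.

```python
import random

def sharded_read_time(n, k, rate=1.0):
    """Simulate one read of an object sharded into n fragments, any k of
    which suffice to reconstruct it. The read finishes at the k-th
    smallest of n i.i.d. exponential fragment fetch times."""
    fetch_times = sorted(random.expovariate(rate) for _ in range(n))
    return fetch_times[k - 1]

def mean_read_time(n, k, trials=100_000, seed=0):
    """Monte Carlo estimate of the mean read latency for an (n, k) code."""
    random.seed(seed)
    return sum(sharded_read_time(n, k) for _ in range(trials)) / trials
```

For exponential fetch times the k-th of n order statistic has mean sum_{i=n-k+1}^{n} 1/i, so for example a (4, 2) code should show a lower mean read time than fetching a single unsharded copy, consistent with the latency benefit of sharding that the abstract analyzes.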

Provisioning and Performance Evaluation of Parallel Systems with Output Synchronization

Parallel server frameworks are widely deployed in modern large-data processing applications. Intuitively, splitting and parallel processing of the... (more)


About TOMPECS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS) is a new ACM journal that publishes refereed articles on all aspects of the modeling, analysis, and performance evaluation of computing and communication systems.

The target areas for the application of these performance evaluation methodologies are broad, and include traditional areas such as computer networks, computer systems, storage systems, telecommunication networks, and Web-based systems, as well as new areas such as data centers, green computing/communications, energy grid networks, and on-line social networks.

The journal is published quarterly, appearing both in print and in the ACM Digital Library.

Forthcoming Articles
Production Application Performance Data Streaming for System Monitoring

In this paper, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production High Performance Computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as MPI. Several profiling and tracing tools exist that collect heavy run-time data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a system-wide and low overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the software and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication (IPC) method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low overhead manner. We demonstrate our approach using applications implemented with Message Passing Interface (MPI), as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open-source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. 
We leverage LDMS to collect system-level performance data and explore the correlation between system and application events. We also demonstrate how our tool-set can help detect anomalies with low latency. We run tests on two different architectures: a system with Intel Xeon Phi processors and another with Intel Xeon processors. Our overhead study shows that our method imposes at most 0.5% overhead on the application in realistic deployment scenarios.

Efficient Straggler Replication in Large-scale Parallel Computing

In a cloud computing job with many parallel tasks, the tasks on the slowest machines (straggling tasks) become the bottleneck in job completion. Computing frameworks such as MapReduce and Spark tackle this by replicating the straggling tasks and waiting for any one copy to finish. Despite being adopted in practice, there is little analysis of how replication affects the latency and the cost of additional computing resources. In this paper, we provide a framework to analyze this latency-cost trade-off and find the best replication strategy by answering design questions such as: 1) when to replicate straggling tasks, 2) how many replicas to launch, and 3) whether to kill the original copy. Our analysis reveals that for certain execution time distributions, a small amount of task replication can drastically reduce both latency and the cost of computing resources. We also propose an algorithm to estimate the latency and cost based on the empirical distribution of task execution time. Evaluations using samples in the Google Cluster Trace suggest further latency and cost reduction compared to the existing replication strategy used in MapReduce.
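The latency-cost trade-off described above can be sketched with a toy Monte Carlo experiment (a hypothetical illustration, not the article's algorithm): tasks draw heavy-tailed run times, the slowest fraction of tasks each get one extra replica, and a replicated task finishes when its faster copy does. The Pareto distribution, the single-replica policy, and the function names are all illustrative assumptions.

```python
import random

def job_with_replication(num_tasks, repl_frac, seed):
    """One simulated job: tasks have heavy-tailed (Pareto) run times.
    The slowest repl_frac of tasks each get one fresh replica drawn from
    the same distribution; both copies stop when the first finishes.
    Returns (job latency, total compute cost in machine-time)."""
    rng = random.Random(seed)
    times = [rng.paretovariate(1.5) for _ in range(num_tasks)]
    if repl_frac > 0:
        cutoff = sorted(times)[int(num_tasks * (1 - repl_frac))]
    else:
        cutoff = float("inf")
    latency_terms, cost = [], 0.0
    for t in times:
        if t >= cutoff:
            r = rng.paretovariate(1.5)      # fresh replica of the straggler
            latency_terms.append(min(t, r))  # task done when faster copy is
            cost += 2 * min(t, r)            # both copies run until then
        else:
            latency_terms.append(t)
            cost += t
    return max(latency_terms), cost          # job waits for its last task
```

Averaged over many seeds, replicating even the slowest 10% of tasks sharply reduces job latency under this heavy-tailed model, mirroring the abstract's observation that a small amount of replication can help both latency and cost.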

Performance of Redundancy(d) with Identical/Independent Replicas

Queueing systems with redundancy have received considerable attention recently. The idea of redundancy is to reduce latency by replicating each incoming job a number of times and assigning these replicas to a set of randomly selected servers. As soon as one replica completes service, the remaining replicas are cancelled. Most prior work on queueing systems with redundancy assumes that the job durations of the different replicas are i.i.d., which yields insights that can be misleading for computer system design. In this paper, we develop a differential equation, using the cavity method, to assess the workload and response time distribution in a large homogeneous system with redundancy without the need to rely on this independence assumption. More specifically, we assume that the duration of each replica of a single job is identical across the servers and follows a general service time distribution. Simulation results suggest that the differential equation yields exact results as the system size tends to infinity and can be used to study the stability of the system. We also compare our system to the one with i.i.d. replicas and show the similarity between the analyses used for independent and identical replicas.
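The gap between i.i.d. and identical replicas can be illustrated with a minimal sketch (a hypothetical model, not the article's cavity-method analysis): each of d replicas sees an independent queueing delay, but under the identical-replica assumption all copies share one service-time draw, so replication can only exploit queueing variability, not the service-time lottery. The exponential delays and function names are illustrative assumptions.

```python
import random

def response_time(d, identical, rng):
    """One job replicated to d servers, each with an independent
    exponential queueing delay. Under 'identical', all replicas share a
    single service-time draw; under i.i.d., each replica draws its own."""
    waits = [rng.expovariate(1.0) for _ in range(d)]
    if identical:
        s = rng.expovariate(1.0)                    # one shared service time
        return min(w + s for w in waits)            # = min(waits) + s
    return min(w + rng.expovariate(1.0) for w in waits)

def mean_response(d, identical, trials=50_000, seed=1):
    """Monte Carlo estimate of the mean response time."""
    rng = random.Random(seed)
    return sum(response_time(d, identical, rng) for _ in range(trials)) / trials
```

With d = 2 and unit-rate exponentials, the identical-replica mean is E[min(W1, W2)] + E[S] = 0.5 + 1 = 1.5, while the i.i.d. mean is lower (1.25): assuming independence overstates the benefit of redundancy, which is exactly why the abstract cautions that i.i.d. insights can mislead system design.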
