Concurrency II
In the last five years, multicore processors have become ubiquitous in all sectors of the computer market. While there have been many studies of how best to schedule applications to take advantage of more cores, few have focused on multi-threaded managed language applications, which are prevalent everywhere from the embedded to the server domain. Managed languages complicate performance studies because they have additional virtual machine threads that collect garbage and dynamically compile, closely interacting with application threads. Further complexity is introduced as modern multicore machines have multiple sockets and dynamic voltage scaling options, broadening goals to reduce both power and running time.
In this paper, we explore the performance of Java applications, studying application and virtual machine (JVM) threads and how best to map them to a multicore, multi-socket environment. We vary the number of threads, and explore both the cost of separating JVM threads from application threads, and the opportunity to speed up or slow down the clock frequency of isolated threads. We perform experiments with the multi-threaded DaCapo benchmarks and pseudojbb2005 running on the Jikes Research Virtual Machine, on a dual-socket, 8-core Intel Nehalem machine to reveal several novel, and sometimes counter-intuitive, findings. For example, if power-constrained, scaling down the frequency of JVM threads costs a fraction of the performance in comparison with scaling down application threads. However, the cost of isolating certain JVM threads, such as collector threads, in order to scale down frequency often leads to worse performance than running on only one socket.
As the interaction between application, runtime environment, and multicore multi-socket machine grows more complex, our analysis is one of the first to explore the non-trivial experimental space to reveal new, valuable insights on how to get the most out of modern hardware.
Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. The programmer explicitly identifies potential parallelism and the runtime schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. Prior work has demonstrated that work-stealing is very effective in practice. However, work-stealing comes with a substantial overhead: as much as 2× to 12× slowdown over orthodox sequential code. In this paper we identify the key sources of overhead in work-stealing schedulers and present two significant refinements to their implementation. We evaluate our work-stealing designs using a range of benchmarks, four different work-stealing implementations, including the popular fork-join framework, and a range of architectures. On these benchmarks, compared to orthodox sequential Java, our fastest design has an overhead of just 15%. By contrast, fork-join has a 2.3× overhead and the previous implementation of the system we use has an overhead of 4.1×. These results and our insight into the sources of overhead for work-stealing implementations give further hope to an already promising technique for exploiting increasingly available hardware parallelism.
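To make the comparison in this abstract concrete, the following is a minimal sketch of the same computation written as orthodox sequential Java and as a task for the standard fork-join framework, whose pool schedules tasks by work-stealing. It is not the paper's design. The class name, the Fibonacci workload, and the SEQUENTIAL_CUTOFF value are assumptions chosen only for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class WorkStealingSketch {
    // Hypothetical threshold: below this size a task runs sequentially
    // instead of forking further. The value is illustrative, not tuned.
    static final int SEQUENTIAL_CUTOFF = 15;

    // Orthodox sequential baseline.
    static long sequentialFib(int n) {
        return n < 2 ? n : sequentialFib(n - 1) + sequentialFib(n - 2);
    }

    // The programmer marks potential parallelism with fork(); the pool's
    // worker threads steal forked tasks when they would otherwise be idle.
    static final class FibTask extends RecursiveTask<Long> {
        private final int n;

        FibTask(int n) {
            this.n = n;
        }

        @Override
        protected Long compute() {
            if (n <= SEQUENTIAL_CUTOFF) {
                return sequentialFib(n);
            }
            FibTask left = new FibTask(n - 1);
            left.fork();                               // make one half available to other workers
            long right = new FibTask(n - 2).compute(); // run the other half in this thread
            return left.join() + right;                // wait for (or run) the forked half
        }
    }

    static long parallelFib(int n) {
        return ForkJoinPool.commonPool().invoke(new FibTask(n));
    }

    public static void main(String[] args) {
        int n = 30;
        System.out.println("sequential: " + sequentialFib(n));
        System.out.println("fork-join:  " + parallelFib(n));
    }
}
```

Both methods compute the same value. The fork-join version additionally creates a task object and performs a fork and a join at every level above the cutoff, work that the sequential version never does. Lowering the cutoff exposes more parallelism at the price of more such per-task work.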
Molecule is a domain-specific language library embedded in Scala for easing the creation of scalable and modular interactive applications on the JVM. Interactive applications are modeled as parallel process networks that exchange information over mobile communication channel interfaces.
In this paper, we present a concurrent programming environment that combines functional and imperative programming. Using a monad, we structure the sequential or parallel coordination of user-level threads, without JVM modifications or compiler support. Our mobile channel interfaces expose reusable and parallelizable higher-order functions, as if they were streams in a lazily evaluated functional programming language. The support for graceful termination of entire process networks is simplified by integrating channel poisoning with monadic exceptions and resource control. Our runtime and system-level interfaces leverage message batching and a novel flow parallel scheduler to limit expensive context switches in multicore environments. We illustrate the expressiveness and performance benefits on a 24-core AMD Opteron machine with three classical examples: a thread ring, a genuine prime sieve and a chameneos-redux.
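The idea of terminating a process network by poisoning its channels can be sketched in a few lines. The example below does not use Molecule's API, which is written in Scala and built on monadic user-level threads. It is a plain Java approximation using ordinary threads and blocking queues, and every name in it (ChannelPoisonSketch, POISON, evenFilter, run) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ChannelPoisonSketch {
    // Sentinel that marks a channel as poisoned; compared by identity.
    static final Object POISON = new Object();

    // A process that forwards even numbers downstream. When its input is
    // poisoned it poisons its output and terminates, so shutdown propagates
    // through the network without any global coordination.
    static Thread evenFilter(BlockingQueue<Object> in, BlockingQueue<Object> out) {
        return new Thread(() -> {
            try {
                while (true) {
                    Object msg = in.take();
                    if (msg == POISON) {
                        out.put(POISON);
                        return;
                    }
                    if (((Integer) msg) % 2 == 0) {
                        out.put(msg);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Sends 1..count through the filter, then poisons the source channel
    // and collects everything that arrives before the poison.
    static List<Integer> run(int count) throws InterruptedException {
        BlockingQueue<Object> source = new LinkedBlockingQueue<>();
        BlockingQueue<Object> sink = new LinkedBlockingQueue<>();
        Thread filter = evenFilter(source, sink);
        filter.start();

        for (int i = 1; i <= count; i++) {
            source.put(i);
        }
        source.put(POISON);

        List<Integer> result = new ArrayList<>();
        for (Object msg = sink.take(); msg != POISON; msg = sink.take()) {
            result.add((Integer) msg);
        }
        filter.join();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(10)); // prints [2, 4, 6, 8, 10]
    }
}
```

Each stage here occupies a full JVM thread and blocks on every message. The abstract describes user-level threads and message batching as ways to limit the context switches that this kind of design incurs.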
Increasing levels of hardware parallelism are one of the main challenges for programmers and implementers of managed runtimes. Any concurrency or scalability improvements must be evaluated experimentally. However, application benchmarks available today may not reflect the highly concurrent applications we anticipate in the future. They may also behave in ways that VM developers do not expect. We provide a set of platform-independent concurrency-related metrics and an in-depth observational study of current state-of-the-art benchmarks, discovering how concurrent they really are, how they scale the work, and how they synchronize and communicate via shared memory.







