The 3-day Principles and Practice of Parallel Programming (PPoPP 2009) conference ended last week with a session on parallel compilers and tools, another on high end computing software, and a keynote by ACM fellow Jack Dennis. Although the last day was a short one, it was not light on content. Here's a summary.
Parallel compilers and tools
The first session focused on automatic transformations that could make parallel programs run faster. Papers included:
- Techniques for Efficient Placement of Synchronization Primitives, by Alex Nicolau, Arun Kejariwal, and Guangqiang Li. Simply by moving synchronization points around and hoisting expressions, the team achieved speedups of more than 60% on some kernels on a dual-core x86 chip.
- A Compiler-Directed Data Prefetching Scheme for Chip Multiprocessors, by Seung Woo Son, Mahmut Kandemir, Mustafa Karakoy, and Dhruva Chakrabarti. Reducing harmful prefetches produced substantial improvements on an 8-core SIMICS-simulated machine. Michael Wolfe of PGI suggested, however, that the cache benefits may have come less from the prefetching itself than from synchronizing the threads to work on the same areas of memory rather than on 8 completely different ones.
- Compiler-Assisted Dynamic Scheduling for Effective Parallelization of Loop Nests on Multicore Processors, by Muthu Manikandan Baskaran, Nagavijayalakshmi Vydyanathan, Uday Bondhugula, J Ramanujam, Atanas Rountev, and P Sadayappan. Using a polyhedral model to schedule tiles, the team achieved a 2x speedup.
- Effective Performance Measurement and Analysis of Multithreaded Applications, by Nathan Tallent and John Mellor-Crummey. A great paper on measuring parallel idleness and then relating it back to the cause (the threads and source lines that are working during the periods of high idleness).
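To give a flavor of what expression hoisting (mentioned in the Nicolau et al. paper) does, here is a minimal sketch. This is illustrative only, not the paper's actual algorithm: the transformation moves a loop-invariant computation out of the loop body so it is evaluated once rather than on every iteration.

```python
# Illustrative sketch of expression hoisting (not the paper's technique).

def scale_naive(xs, a, b, c):
    # (a * b + c) is loop-invariant but recomputed on every iteration
    return [x * (a * b + c) for x in xs]

def scale_hoisted(xs, a, b, c):
    k = a * b + c          # hoisted: computed once, before the loop
    return [x * k for x in xs]

# Both produce identical results; the hoisted form does less redundant work.
```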
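Loop tiling is the transformation at the heart of the polyhedral approach in the Baskaran et al. paper (whose scheduler is far more general than this). A rough sketch of the idea: the tiled loop nest visits exactly the same iterations, but in cache-friendly blocks, and independent tiles can then be handed to different cores.

```python
# Illustrative loop-tiling sketch; matrix transpose is just a convenient
# example of a loop nest, not taken from the paper.

def transpose_naive(a, n):
    t = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            t[j][i] = a[i][j]
    return t

def transpose_tiled(a, n, tile=2):
    t = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):                      # outer loops walk tiles
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):    # inner loops walk one tile
                for j in range(jj, min(jj + tile, n)):
                    t[j][i] = a[i][j]
    return t
```

The two functions compute the same result; only the iteration order changes, which is what makes the transformation safe for a compiler to apply automatically.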
High end computing software
The second session talked about applications using all the computing power enabled by hardware and software advances:
- Petascale Computing with Accelerators, by Michael Kistler, John Gunnels, Daniel Brokenshire, and Brad Benton. The Los Alamos National Lab (LANL) Roadrunner computer is one of the fastest in the world, thanks to its unique tri-blade design, which pairs one blade holding two dual-core Opterons with two blades holding IBM PowerXCell processors. Depending on how you count them, the computer has over 130K cores. Adding the Cell processors boosted performance from 44 TFlops to over 1300 TFlops, roughly a 30x difference.
- MPIWiz: Subgroup Reproducible Replay of MPI Applications, by Ruini Xue, Xuezheng Liu, Ming Wu, Zhenyu Guo, Wenguang Chen, Weimin Zheng, Zheng Zhang, and Geoffrey Voelker. Their system could make bugs happen again with the same time stamps, addresses, and ranks, while minimizing the amount of log data that had to be captured.
- Formal Verification of Practical MPI Programs, by Anh Vo, Sarvani Vakkalanka, Michael Delisi, Ganesh Gopalakrishnan, Mike Kirby, and Rajeev Thakur. A dynamic debugging system that finds MPI problems like deadlocks and MPI object leaks with no false alarms and no omissions.
- Efficient, Portable Implementation of Asynchronous Multi-place Programs, by Ganesh Bikshandi, Jose Castanos, Sreedhar Kodali, Krishna Nandivada, Igor Peshansky, Vijay Saraswat, Sayantan Sur, Pradeep Varma, and Tong Wen. In this paper, the team proposed an Asynchronous Partitioned Global Address Space (APGAS) system, which was then used to implement a subset of the X10 language.
The final keynote was called "How to Build Programmable Multi-Core Chips" by Jack Dennis. Professor Dennis' main theme was the importance of composability and modularity in parallel programming.
Composability is the ability to use any parallel program as a component (or module) in a larger, more sophisticated program. In order to be composable, programs need to follow time-honored principles such as information hiding and keeping local changes from disrupting the operation of other modules. Dennis suggested integrating the file system with memory, guaranteeing determinacy except in a few special cases, and finding a way to reallocate processing resources as easily and readily as we reallocate memory today.
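A small sketch of determinacy in the sense Dennis advocates: a parallel computation whose result does not depend on thread scheduling. The function name here is my own invention, not Dennis's; Python's `concurrent.futures.Executor.map` happens to preserve input order regardless of which worker finishes first, so the output is the same on every run.

```python
# Illustrative only: a deterministic parallel map. No matter how the
# scheduler interleaves the workers, the result matches the input order.
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def parallel_squares(xs, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, xs))  # map preserves input order
```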
I probably wouldn't have gone to this conference if it hadn't been local. Overall it was oriented far more towards academics than practitioners. However, I did pick up a few ideas and meet some interesting people, so I don't regret attending. Next year's conference will be in Bangalore, India.