Multi-core architectures are becoming more and more popular in the HEC (High-End Computing) era. Recent trends toward high-productivity computing, in conjunction with advanced multi-core and network architectures, have increased interest in Partitioned Global Address Space (PGAS) languages because of their high productivity and broad applicability. Unified Parallel C (UPC) is an emerging PGAS language. In this paper, we compare different design alternatives for a high-performance and scalable UPC runtime on multi-core nodes along several dimensions: performance, portability, interoperability, and support for irregular parallelism. Based on our analysis, we present a novel design of a multi-threaded UPC runtime that supports multiple endpoints. Our runtime dramatically decreases network access contention, resulting in 80% lower latency for fine-grained memget/memput operations and almost double the bandwidth for medium-size messages compared with the multi-threaded Berkeley UPC runtime. Furthermore, the multi-endpoint design opens new doors for runtime optimizations, such as support for irregular parallelism. We utilize true network helper threads and load balancing via work stealing in the runtime. Our evaluation with novel benchmarks shows that our runtime can achieve 90% of peak efficiency, a factor of 1.3 better than the existing Berkeley UPC runtime. To the best of our knowledge, this is the first work to propose a multi-network-endpoint-capable UPC runtime design for modern multi-core systems.

Hybrid MPI+threads programming is gaining prominence as an alternative to the traditional "MPI everywhere" model to better handle the disproportionate increase in the number of cores compared with other on-node resources. In this paper, we explore the tradeoff space between performance and communication resource usage in MPI+threads environments. In the MPI-everywhere model, each MPI process has a dedicated set of communication resources (also known as endpoints), which is ideal for performance but wasteful of resources. With MPI+threads, current MPI implementations share a single communication endpoint among all threads, which is ideal for resource usage but hurtful to performance. These two models thus represent the two extreme cases of communication resource sharing in modern MPI implementations. We first demonstrate both extremes, one where all threads share a single communication endpoint and one where each thread gets its own dedicated endpoint (similar to the MPI-everywhere model), and showcase the inefficiencies of each. Next, we perform a thorough analysis of the different levels of resource sharing in the context of Mellanox InfiniBand. Using the lessons learned from this analysis, we design an improved resource-sharing model that can achieve the same performance as dedicated communication resources per thread while using just a third of the resources.
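The shared-versus-dedicated endpoint tradeoff can be pictured with a small model. The following Python sketch is purely illustrative (it is not MPI or UPC code, and the `Endpoint` and `post_send` names are hypothetical): an endpoint is modeled as a lock-protected resource, so threads sharing one endpoint serialize on its lock, while per-thread endpoints avoid that contention at the cost of one resource per thread.

```python
import threading

# Hypothetical model, not real MPI/UPC API: an "endpoint" stands in
# for a network communication context guarded by a lock.
class Endpoint:
    def __init__(self):
        self.lock = threading.Lock()
        self.messages = 0

    def post_send(self):
        # Threads assigned to the same endpoint serialize here; this
        # models the contention of the single-shared-endpoint case.
        with self.lock:
            self.messages += 1

def run(num_threads, endpoints):
    # Each thread posts 1000 sends on its assigned endpoint.  With
    # len(endpoints) == 1 every post contends for one lock; with one
    # endpoint per thread there is no contention, but resource usage
    # grows linearly with the thread count.
    def worker(tid):
        ep = endpoints[tid % len(endpoints)]
        for _ in range(1000):
            ep.post_send()
    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(ep.messages for ep in endpoints)

shared = [Endpoint()]                       # extreme 1: all threads share one endpoint
dedicated = [Endpoint() for _ in range(4)]  # extreme 2: one endpoint per thread
print(run(4, shared), run(4, dedicated))
```

Both configurations deliver the same total message count; they differ only in how much the threads serialize and how many endpoint resources are consumed, which is exactly the tradeoff space the analysis above explores.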
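The load balancing via work stealing mentioned for the UPC runtime can be sketched schematically. This Python sketch is an assumption-laden illustration of the general technique, not the runtime's actual implementation: each worker owns a deque, pops its own tasks from the tail, and idle workers steal from the head of a victim's deque.

```python
import collections
import random
import threading

# Illustrative work-stealing sketch (not the UPC runtime's code).
class Worker:
    def __init__(self):
        self.deque = collections.deque()
        self.lock = threading.Lock()

    def push(self, task):
        with self.lock:
            self.deque.append(task)

    def pop(self):
        # Owner takes from the tail (LIFO, good locality).
        with self.lock:
            return self.deque.pop() if self.deque else None

    def steal(self):
        # Thieves take from the head (FIFO, older/larger tasks).
        with self.lock:
            return self.deque.popleft() if self.deque else None

def drain(workers):
    # Run until every deque is empty; idle workers steal from a
    # random victim, which rebalances an uneven initial load.
    done = []
    while True:
        progress = False
        for w in workers:
            task = w.pop()
            if task is None:
                victims = [v for v in workers if v is not w]
                task = random.choice(victims).steal() if victims else None
            if task is not None:
                done.append(task())
                progress = True
        if not progress:
            return done

workers = [Worker() for _ in range(4)]
# All tasks start on one worker; stealing spreads the work out.
for n in range(8):
    workers[0].push(lambda n=n: n * n)
results = drain(workers)
```

The head/tail split is the standard design choice in work-stealing deques: the owner keeps cache-warm recent tasks while thieves remove work the owner would reach last.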