Efficient Memory Coherence and Consistency Support for Enabling Data Sharing in GPUs.
Record type:
Bibliographic, electronic resource : Monograph/item
Title / Author:
Efficient Memory Coherence and Consistency Support for Enabling Data Sharing in GPUs. /
Author:
Tabbakh, Abdulaziz.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2018
Extent:
161 p.
Note:
Source: Dissertations Abstracts International, Volume: 80-06, Section: B.
Contained by:
Dissertations Abstracts International, 80-06B.
Subject:
Electrical engineering.
Electronic resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=11016209
Thesis (Ph.D.)--University of Southern California, 2018.
This item must not be sold to any third party vendors.
Graphics Processing Units (GPUs) are designed primarily to execute multimedia and game-rendering applications. These applications are characterized by streaming data with little to no data sharing between threads. Because of their high power efficiency, massive parallel computational capability, and high off-chip memory bandwidth, GPUs are now making inroads into executing general-purpose applications that have significant, but somewhat irregular, parallelism. Improvements in programming interfaces such as CUDA and OpenCL have accelerated the adoption of GPUs for general-purpose applications. However, these new application usages do not align well with the underlying GPU architecture. In particular, some of the irregular applications do share data between threads, and they also exhibit inter-thread communication patterns that are not well supported in current GPU hardware. Unlike traditional graphics applications, which mostly deal with streaming data, the new class of applications also shows some temporal and spatial locality between threads executing in the same kernel or thread block. But GPUs have limited cache capacity and do not support efficient inter-thread communication through memory. As such, the programmer or compiler must find ad-hoc solutions to tackle these challenges. This thesis presents a set of unifying GPU memory-system improvements that enable efficient data sharing between threads, along with comprehensive coherence and consistency models that enable efficient inter-thread communication. The first part of this thesis shows that there is significant data sharing across threads in a GPU while executing general-purpose applications. However, due to poor thread scheduling, data sharing leads to the replication of data in multiple private caches across the many streaming multiprocessor cores (SMs) in a GPU, which in turn reduces the effective cache size.
To tackle this challenge, the thesis presents an efficient data-sharing mechanism that reduces redundant data copies in the memory system. It includes a sharing-aware thread block (also called Cooperative Thread Array, or CTA) scheduler that attempts to assign CTAs that share data to the same SM, reducing redundant storage of data in private L1 caches across SMs. The design is further enhanced with a sharing-aware cache allocation and replacement policy, which dynamically classifies data as private or shared: private blocks are given higher priority to stay longer in the L1 cache, and shared blocks are given higher priority to stay longer in the L2 cache. Evaluation experiments show that the proposed design reduces off-chip traffic by 19%, which translates to an average DRAM power reduction of 10% and a performance improvement of 7%. The second part of the thesis focuses on supporting the intuitive memory coherence and consistency models that programmers are familiar with from the CPU domain. The thesis presents GPU-centric Time Stamp Coherence (G-TSC), a novel cache coherence protocol for GPUs based on timestamp ordering. G-TSC conducts its coherence transactions in logical time rather than physical time and uses timestamp-based self-invalidation of cached data, which reduces coherence traffic dramatically. The thesis demonstrates the challenges of adopting timestamp coherence for GPUs, which support massive thread parallelism and have unique microarchitectural features, and then presents a number of solutions that tackle these GPU-centric challenges. Evaluation of G-TSC shows that it outperforms time-based coherence by 38% under release consistency. The third part of the thesis explores efficient approaches to enforcing sequential consistency in GPUs.
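The sharing-aware CTA scheduling idea described above can be illustrated with a small sketch. This is a hypothetical greedy simplification, not the thesis design: it assumes some predictor already maps each CTA to the set of data blocks it will touch, and it steers CTAs with overlapping block sets to the same SM so shared lines are cached once in that SM's private L1 instead of being replicated across SMs.

```python
def schedule_ctas(ctas, num_sms):
    """Greedy sharing-aware CTA scheduler (illustrative sketch).

    `ctas` maps a CTA id to the set of data-block addresses it is
    predicted to access; the sharing predictor itself is assumed.
    Returns a mapping from CTA id to the SM it is assigned to.
    """
    assignment = {}                          # cta_id -> sm_id
    sm_blocks = [set() for _ in range(num_sms)]  # blocks resident per SM
    sm_load = [0] * num_sms                  # CTAs assigned per SM
    for cta_id, blocks in ctas.items():
        # Prefer the SM whose resident CTAs already share the most
        # blocks with this CTA; break ties toward the least-loaded SM.
        best = max(range(num_sms),
                   key=lambda s: (len(sm_blocks[s] & blocks), -sm_load[s]))
        assignment[cta_id] = best
        sm_blocks[best] |= blocks
        sm_load[best] += 1
    return assignment

# CTAs 0 and 1 share blocks {1, 2} and land on the same SM;
# CTA 2 shares nothing and goes to the idle SM.
print(schedule_ctas({0: {1, 2}, 1: {1, 2}, 2: {9}}, num_sms=2))
# → {0: 0, 1: 0, 2: 1}
```

A real scheduler would also have to respect per-SM register and shared-memory limits, which this sketch ignores.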
The main intuition behind this work is that a significant fraction of the coherence traffic can be curtailed by simply delaying the propagation of updated data values across SMs until the end of an epoch, where an epoch is broadly defined as the time between two data-race occurrences. A data race occurs when two threads concurrently access the same data and at least one access is a write. The thesis presents a simple Bloom-filter-based signature generation mechanism that tracks each SM's write set in a signature and uses the signatures to dynamically detect races. Data updates are propagated when a race is detected from the signatures, which in turn provides sequentially consistent execution. Evaluation of the proposed scheme shows that it can achieve sequential consistency with performance overhead as low as 5% and energy overhead as low as 2.7%. Although GPUs are equipped with multi-level caches, general-purpose applications on GPUs experience significant memory-access bottlenecks.
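The write-set signature mechanism can be sketched in software as follows. This is an illustrative model, not the thesis hardware: each SM inserts the addresses it writes during an epoch into a Bloom-filter signature; when another SM's access hits the signature, a (possible) race is flagged, updates are propagated, and the signature is cleared to start a new epoch. The hash choice and sizes here are arbitrary assumptions.

```python
import hashlib

class WriteSignature:
    """Bloom-filter write-set signature for one SM (illustrative sketch)."""

    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.sig = 0  # the signature, held as a bit vector

    def _positions(self, addr):
        # Derive `hashes` bit positions from the address.
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{addr}:{i}".encode(), digest_size=4)
            yield int.from_bytes(h.digest(), "big") % self.bits

    def add_write(self, addr):
        """Record a write performed by this SM in the current epoch."""
        for p in self._positions(addr):
            self.sig |= 1 << p

    def maybe_conflicts(self, addr):
        """Check a remote access against this SM's write set.

        May report false positives (which only cause extra, still-correct
        propagation), but never false negatives.
        """
        return all(self.sig >> p & 1 for p in self._positions(addr))

    def clear(self):
        """End of epoch: updates have been propagated."""
        self.sig = 0
```

For example, after `add_write(0x100)` on one SM, a read of `0x100` issued by another SM tests positive against the signature, triggering propagation; after `clear()`, the same test is negative.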
LDR    05791nmm a2200301 4500
001    2207850
005    20190923114241.5
008    201008s2018 ||||||||||||||||| ||eng d
035    $a (MiAaPQ)AAI11016209
035    $a (MiAaPQ)US_Calif_478336
035    $a AAI11016209
040    $a MiAaPQ $c MiAaPQ
100 1  $a Tabbakh, Abdulaziz. $3 3434851
245 10 $a Efficient Memory Coherence and Consistency Support for Enabling Data Sharing in GPUs.
260 1  $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2018
300    $a 161 p.
500    $a Source: Dissertations Abstracts International, Volume: 80-06, Section: B.
500    $a Publisher info.: Dissertation/Thesis.
500    $a Advisor: Annavaram, Murali.
502    $a Thesis (Ph.D.)--University of Southern California, 2018.
506    $a This item must not be sold to any third party vendors.
520    $a [Abstract, as given above.]
590    $a School code: 0208.
650  4 $a Electrical engineering. $3 649834
690    $a 0544
710 2  $a University of Southern California. $3 700129
773 0  $t Dissertations Abstracts International $g 80-06B.
790    $a 0208
791    $a Ph.D.
792    $a 2018
793    $a English
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=11016209
Holdings (1 item):
Barcode: W9384399
Location: Electronic resources
Circulation category: 11. Online reading_V
Material type: E-book
Call number: EB
Use type: Normal
Loan status: On shelf
Holds: 0
1
多媒體
評論
新增評論
分享你的心得
Export
取書館
處理中
...
變更密碼
登入