Resource-Efficient Execution of Deep Learning Computations.
Record type: Bibliographic - electronic resource : Monograph/item
Title/Author: Resource-Efficient Execution of Deep Learning Computations.
Author: Narayanan, Deepak.
Publisher: Ann Arbor : ProQuest Dissertations & Theses, 2021
Extent: 187 p.
Notes: Source: Dissertations Abstracts International, Volume: 83-05, Section: B.
Contained by: Dissertations Abstracts International, 83-05B.
Subject: Scheduling.
Electronic resource: http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=28812877
ISBN: 9798494452825
Narayanan, Deepak. Resource-Efficient Execution of Deep Learning Computations. Ann Arbor : ProQuest Dissertations & Theses, 2021. 187 p.
Source: Dissertations Abstracts International, Volume: 83-05, Section: B.
Thesis (Ph.D.)--Stanford University, 2021.
This item must not be sold to any third party vendors.
Deep Learning models have enabled state-of-the-art results across a broad range of applications. Training these models, however, is extremely time- and resource-intensive, taking weeks on clusters with thousands of expensive accelerators in the extreme case. As Moore's Law slows down, numerous parallel accelerators have been introduced to meet this new computational demand. This dissertation shows how model- and hardware-aware optimizations in software systems can help intelligently navigate this heterogeneity. In particular, it demonstrates how careful automated scheduling of computation across levels of the software stack can be used to perform distributed training and resource allocation more efficiently.

In the first part of this dissertation, we study pipelining, a technique commonly used as a performance optimization in various systems, as a way to perform more efficient distributed model training for both models with small training footprints and those with training footprints larger than the memory capacity of a single GPU. For certain types of models, pipeline parallelism can facilitate model training with lower communication overhead than previous methods. We introduce new strategies for pipeline parallelism, with different tradeoffs between training throughput, memory footprint, and weight update semantics; these outperform existing methods in certain settings. Pipeline parallelism can also be used in conjunction with other forms of parallelism, helping create a richer search space of parallelization strategies. By partitioning the training graph across accelerators in a model-aware way, pipeline parallelism combined with data parallelism can be up to 5x faster than data parallelism in isolation. We also use a principled combination of pipeline parallelism, tensor model parallelism, and data parallelism to efficiently scale training to language models with a trillion parameters on 3072 A100 GPUs (aggregate throughput of 502 petaFLOP/s, which is 52% of peak device throughput).

In the second part of this dissertation, we show how heterogeneous compute resources (e.g., different GPU generations like NVIDIA K80 and V100 GPUs) in a shared cluster (either in a private deployment or in the public cloud) should be partitioned among multiple users to optimize objectives specified over one or more training jobs. By formulating existing policies as optimization problems over the allocation, and then using a concept we call effective throughput, policies can be extended to be heterogeneity-aware. A policy-agnostic scheduling mechanism then helps realize the heterogeneity-aware allocations returned by these policies in practice. We can improve various scheduling objectives, such as average completion time, makespan, or cloud computing resource cost, by up to 3.5x, using these heterogeneity-aware policies. Towards the end of this dissertation, we also touch on how the dynamic pricing information of spot instances can be plugged into this heterogeneity-aware policy framework to optimize cost objectives in the public cloud. This can help reduce cost compared to using more expensive on-demand instances alone.
ISBN: 9798494452825
Subjects--Topical Terms: Scheduling.
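The second part of the abstract above hinges on the notion of effective throughput for making allocation policies heterogeneity-aware. The record does not spell out the definition, so the following Python sketch is only a minimal illustration under the assumption that a job's effective throughput is its allocation-weighted throughput across accelerator types; the job names, throughput numbers, and the max_min_fairness grid search are hypothetical placeholders, not the dissertation's actual policies or measurements.

# Minimal sketch of a heterogeneity-aware allocation policy built on
# "effective throughput". Assumption: effective throughput of a job is
# sum_k allocation[job][k] * throughput[job][k], where allocation[job][k]
# is the fraction of time the job runs on accelerator type k.
# All job names and throughput figures are invented for illustration.
from itertools import product

# Measured throughput (samples/sec) of each job on each GPU generation.
throughput = {
    "job_a": {"K80": 10.0, "V100": 80.0},  # scales very well on the V100
    "job_b": {"K80": 15.0, "V100": 30.0},  # gains less from the newer GPU
}

def effective_throughput(alloc, job):
    """Allocation-weighted throughput of one job."""
    return sum(alloc[job][gpu] * tput for gpu, tput in throughput[job].items())

def max_min_fairness(step=5):
    """Toy heterogeneity-aware policy: grid-search time-share allocations
    (one K80 and one V100 shared by two jobs) that maximize the minimum
    normalized effective throughput across jobs."""
    jobs = list(throughput)
    shares = range(0, 101, step)  # time shares in percent, kept as integers
    best, best_alloc = -1.0, None
    for a_k80, a_v100 in product(shares, repeat=2):
        # A job cannot be scheduled for more than 100% of wall-clock time.
        if a_k80 + a_v100 > 100 or (100 - a_k80) + (100 - a_v100) > 100:
            continue
        alloc = {
            "job_a": {"K80": a_k80 / 100, "V100": a_v100 / 100},
            "job_b": {"K80": (100 - a_k80) / 100, "V100": (100 - a_v100) / 100},
        }
        # Normalize by each job's best single-GPU throughput so jobs are comparable.
        objective = min(
            effective_throughput(alloc, j) / max(throughput[j].values()) for j in jobs
        )
        if objective > best:
            best, best_alloc = objective, alloc
    return best_alloc, best

if __name__ == "__main__":
    alloc, value = max_min_fairness()
    print("allocation:", alloc)
    print("min normalized effective throughput:", round(value, 3))

Running this grid search hands most of the V100 time to the job that benefits most from it, which is the intuition behind the heterogeneity-aware policies the abstract describes. As a separate sanity check on the abstract's scale figures: taking a commonly cited ~312 teraFLOP/s mixed-precision peak per A100, 3072 GPUs give roughly 958 petaFLOP/s of aggregate peak, and 502/958 ≈ 52%, consistent with the quoted "52% of peak device throughput".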
LDR  04216nmm a2200325 4500
001  2344913
005  20220531062213.5
008  241004s2021 ||||||||||||||||| ||eng d
020    $a 9798494452825
035    $a (MiAaPQ)AAI28812877
035    $a (MiAaPQ)STANFORDqx792hd7022
035    $a AAI28812877
040    $a MiAaPQ $c MiAaPQ
100 1  $a Narayanan, Deepak. $3 3683748
245 1 0 $a Resource-Efficient Execution of Deep Learning Computations.
260 1  $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2021
300    $a 187 p.
500    $a Source: Dissertations Abstracts International, Volume: 83-05, Section: B.
500    $a Advisor: Zaharia, Matei; Fatahalian, Kayvon; Re, Christopher.
502    $a Thesis (Ph.D.)--Stanford University, 2021.
506    $a This item must not be sold to any third party vendors.
520    $a
Deep Learning models have enabled state-of-the-art results across a broad range of applications. Training these models, however, is extremely time- and resource-intensive, taking weeks on clusters with thousands of expensive accelerators in the extreme case. As Moore's Law slows down, numerous parallel accelerators have been introduced to meet this new computational demand. This dissertation shows how model- and hardware-aware optimizations in software systems can help intelligently navigate this heterogeneity. In particular, it demonstrates how careful automated scheduling of computation across levels of the software stack can be used to perform distributed training and resource allocation more efficiently. In the first part of this dissertation, we study pipelining, a technique commonly used as a performance optimization in various systems, as a way to perform more efficient distributed model training for both models with small training footprints and those with training footprints larger than the memory capacity of a single GPU. For certain types of models, pipeline parallelism can facilitate model training with lower communication overhead than previous methods. We introduce new strategies for pipeline parallelism, with different tradeoffs between training throughput, memory footprint, and weight update semantics; these outperform existing methods in certain settings. Pipeline parallelism can also be used in conjunction with other forms of parallelism, helping create a richer search space of parallelization strategies. By partitioning the training graph across accelerators in a model-aware way, pipeline parallelism combined with data parallelism can be up to 5x faster than data parallelism in isolation. We also use a principled combination of pipeline parallelism, tensor model parallelism, and data parallelism to efficiently scale training to language models with a trillion parameters on 3072 A100 GPUs (aggregate throughput of 502 petaFLOP/s, which is 52% of peak device throughput). In the second part of this dissertation, we show how heterogeneous compute resources (e.g., different GPU generations like NVIDIA K80 and V100 GPUs) in a shared cluster (either in a private deployment or in the public cloud) should be partitioned among multiple users to optimize objectives specified over one or more training jobs. By formulating existing policies as optimization problems over the allocation, and then using a concept we call effective throughput, policies can be extended to be heterogeneity-aware. A policy-agnostic scheduling mechanism then helps realize the heterogeneity-aware allocations returned by these policies in practice. We can improve various scheduling objectives, such as average completion time, makespan, or cloud computing resource cost, by up to 3.5x, using these heterogeneity-aware policies. Towards the end of this dissertation, we also touch on how the dynamic pricing information of spot instances can be plugged into this heterogeneity-aware policy framework to optimize cost objectives in the public cloud. This can help reduce cost compared to using more expensive on-demand instances alone.
590    $a School code: 0212.
650 4  $a Scheduling. $3 750729
650 4  $a Deep learning. $3 3554982
650 4  $a Communication. $3 524709
650 4  $a Systems design. $3 3433840
650 4  $a Optimization. $3 891104
650 4  $a COVID-19. $3 3554449
650 4  $a Artificial intelligence. $3 516317
650 4  $a Design. $3 518875
690    $a 0459
690    $a 0800
690    $a 0389
710 2  $a Stanford University. $3 754827
773 0  $t Dissertations Abstracts International $g 83-05B.
790    $a 0212
791    $a Ph.D.
792    $a 2021
793    $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=28812877
Holdings (1 record):
Barcode: W9467351
Location: Electronic resources (電子資源)
Circulation category: 11.線上閱覽_V (online viewing)
Material type: E-book
Call number: EB
Use type: Normal (一般使用)
Loan status: On shelf
Holds: 0