Data Reduction for Communication-Efficient Machine Learning.
Record Type:
Electronic resources : Monograph/item
Title/Author:
Data Reduction for Communication-Efficient Machine Learning.
Author:
Lu, Hanlin.
Published:
Ann Arbor : ProQuest Dissertations & Theses, 2021.
Description:
126 p.
Notes:
Source: Dissertations Abstracts International, Volume: 83-03, Section: B.
Contained By:
Dissertations Abstracts International, 83-03B.
Subject:
Construction.
Online resource:
https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=28841713
ISBN:
9798460447800
Lu, Hanlin. Data Reduction for Communication-Efficient Machine Learning. - Ann Arbor : ProQuest Dissertations & Theses, 2021. - 126 p.
Source: Dissertations Abstracts International, Volume: 83-03, Section: B.
Thesis (Ph.D.)--The Pennsylvania State University, 2021.
This item must not be sold to any third party vendors.
In recent years, we have observed dramatic growth in data generation for edge-based machine learning applications. Motivated by the need to solve machine learning problems over distributed datasets, we seek to reduce the size of the datasets while minimizing the degradation in machine learning performance. A given dataset P can be represented as a data cube with three dimensions: its cardinality n, its number of features d, and its number of precision bits b. In this dissertation, we explore data reduction techniques along these three dimensions, taking three steps toward reducing the total size of the dataset.

In our first step, we consider using coresets to reduce the cardinality of the collected dataset. A coreset is a small weighted dataset that functions as a proxy for the original dataset. However, existing coreset construction algorithms are each tailor-made for a specific machine learning problem; that is, different coresets must be constructed to support different machine learning models. We resolve this dilemma by developing robust coreset construction algorithms based on k-clustering, and we prove that our solution gives a guaranteed approximation for a broad range of machine learning problems with sufficiently continuous cost functions.

In our second step, we propose the first framework that incorporates quantization techniques into the process of coreset construction. Specifically, we theoretically analyze the ML error caused by combining coreset construction techniques with quantization techniques. Based on this analysis, we formulate an optimization problem that minimizes the ML error under a fixed communication budget. To improve scalability for large datasets, we identify two proxies of the original objective function, for which efficient algorithms are developed. For the case of data residing on multiple nodes, we further design a novel algorithm that allocates the communication budget across nodes while minimizing the overall ML error.

In our third step, we consider the problem of solving edge-based k-means on a large dataset in a high-dimensional space. In this scenario, data sources offload machine learning computation to nearby edge servers under a limited communication budget and limited computation power. To solve this problem, we propose constructing small data summaries with fewer data samples (via Cardinality Reduction (CR)), fewer features (via Dimensionality Reduction (DR)), and fewer precision bits (via Quantization (QT)). By analyzing the complexity, communication cost, and approximation error of k-means algorithms based on state-of-the-art data reduction methods, we show that: (i) it is possible to achieve a near-optimal approximation with near-linear complexity and constant communication cost; (ii) the order in which DR and CR are applied leads to a tradeoff between complexity and communication cost; and (iii) combining DR/CR methods with a properly selected quantizer can further reduce the communication cost without compromising the other performance metrics.

Finally, in each step, the effectiveness of our analysis is verified through extensive experiments on multiple real datasets and different machine learning problems.
ISBN: 9798460447800
Subjects--Topical Terms:
Construction.
Subjects--Index Terms:
Machine learning
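The coreset idea summarized in the abstract above can be illustrated with a short, hypothetical sketch. The snippet below is not the dissertation's robust coreset construction algorithm; it is a minimal stand-in that uses k-means clustering and simplified importance sampling (assuming NumPy and scikit-learn, with illustrative parameter choices such as k=10 and m=500) to produce a small weighted subset that acts as a proxy for the full dataset.

```python
# Illustrative only: a k-means-based coreset sketch (simplified importance
# sampling), not the dissertation's algorithm. Assumes NumPy + scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_coreset(P, k=10, m=500, seed=0):
    """Return (points, weights): a weighted subset of size m acting as a proxy for P."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(P)
    # Squared distance of each point to its nearest cluster center,
    # used here as a crude "sensitivity" score.
    d2 = km.transform(P).min(axis=1) ** 2
    prob = d2 / d2.sum()                 # sample hard-to-represent points more often
    idx = rng.choice(len(P), size=m, replace=True, p=prob)
    weights = 1.0 / (m * prob[idx])      # inverse-probability weights keep weighted sums unbiased
    return P[idx], weights

if __name__ == "__main__":
    P = np.random.randn(100_000, 20)     # synthetic stand-in for a large dataset
    S, w = kmeans_coreset(P, k=10, m=500)
    print(S.shape, round(w.sum()))       # w.sum() is close to len(P) in expectation
```

Under this kind of construction, a model fit on the weighted pair (S, w) approximates a model fit on all of P; the dissertation's contribution, per the abstract, is making such constructions robust across a broad range of machine learning objectives rather than tailored to a single one.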
LDR   04365nmm a2200325 4500
001   2283887
005   20211115071711.5
008   220723s2021 ||||||||||||||||| ||eng d
020   $a 9798460447800
035   $a (MiAaPQ)AAI28841713
035   $a (MiAaPQ)PennState_23883hzl263
035   $a AAI28841713
040   $a MiAaPQ $c MiAaPQ
100 1 $a Lu, Hanlin. $3 3562962
245 10 $a Data Reduction for Communication-Efficient Machine Learning.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2021
300   $a 126 p.
500   $a Source: Dissertations Abstracts International, Volume: 83-03, Section: B.
500   $a Advisor: He, Ting.
502   $a Thesis (Ph.D.)--The Pennsylvania State University, 2021.
506   $a This item must not be sold to any third party vendors.
520   $a In recent years, we have observed a dramatic growth of data generation in edge-based machine learning applications. Motivated by the need of solving machine learning problem over distributed datasets, we would like to reduce the size of datasets as well as minimizing the machine learning performance degradation. Suppose we are given a dataset P, it could be represented by a data cube with three dimensions: cardinality n, number of features d and number of precision bits b. In this dissertation, we will explore different data reduction techniques to reduce these three dimensions and make three steps toward reducing the total size of the dataset. In our first step, we consider using coreset to reduce the cardinality of the collected dataset. Coreset is a small weighted dataset, functioning as a proxy of the original dataset. However, existing coreset construction algorithms are each tailor-made for a specific machine learning problem. That is, we are required to construct different coresets to support different machine learning models. In our first step, we resolve this dilemma by developing robust coreset construction algorithms based on k-clustering algorithms. Our solution is proved to give a guaranteed approximation for a broad range of machine learning problems with sufficiently continuous cost functions. In our second step, we propose the first framework to incorporate quantization techniques into the process of coreset construction. Specifically, we theoretically analyze the ML error caused by a combination of coreset construction techniques and quantization techniques. Based on that, we formulate an optimization problem to minimize the ML error under a fixed budget of communication cost. To improve the scalability for large datasets, we identify two proxies of the original objective function, for which efficient algorithms are developed. For the case of data on multiple nodes, we further design a novel algorithm to allocate the communication budgets to different nodes while minimizing the overall ML error. As our third step, we consider the problem of solving edge-based k-means on a large dataset in high dimensional space. In this application scenario, data sources offload machine learning computation to nearby edge servers under limited communication budget and computation power. To solve this problem, we propose to construct small data summaries with fewer data samples (by techniques for Cardinality Reduction (CR)), fewer features (by techniques for Dimensionality Reduction (DR)) and fewer precision bits (by techniques for Quantization (QT)). By analyzing the complexity, the communication cost, and the approximation error of k-means algorithms based on state-of-the-art data reduction methods, we show that: (i) it is possible to achieve a near-optimal approximation at a near-linear complexity and a constant communication cost, (ii) the order of applying DR and CR leads to a tradeoff between the complexity and the communication cost, (iii) combining DR/CR methods with a properly selected quantizer can further reduce the communication cost without compromising the other performance metrics. At last, in each step, the effectiveness of our analysis is verified through extensive experiments on multiple real datasets and different machine learning problems.
590   $a School code: 0176.
650  4 $a Construction. $3 3561054
650  4 $a Cameras. $3 524039
650  4 $a Deep learning. $3 3554982
650  4 $a Datasets. $3 3541416
650  4 $a Communication. $3 524709
650  4 $a Power. $3 518736
650  4 $a Bandwidths. $3 3560998
650  4 $a Autonomous vehicles. $3 2179092
650  4 $a Optimization. $3 891104
650  4 $a Neural networks. $3 677449
650  4 $a Sensors. $3 3549539
650  4 $a Internet of Things. $3 3538511
650  4 $a Algorithms. $3 536374
650  4 $a Surveillance. $3 3559358
650  4 $a Computer science. $3 523869
653   $a Machine learning
690   $a 0459
690   $a 0984
710 2 $a The Pennsylvania State University. $3 699896
773 0 $t Dissertations Abstracts International $g 83-03B.
790   $a 0176
791   $a Ph.D.
792   $a 2021
793   $a English
856 40 $u https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=28841713
Items (1 record):
Inventory Number: W9435620
Location Name: Electronic resources
Item Class: 11. Online reading_V
Material type: E-book
Call number: EB
Usage Class: General use (Normal)
Loan Status: On shelf
No. of reservations: 0