National Dong Hwa University Library



Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud.

Record Type:	Electronic resources : Monograph/item
Title/Author:	Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud.
Author:	Lin, Hao.
Published:	Ann Arbor : ProQuest Dissertations & Theses, 2018
Description:	109 p.
Notes:	Source: Dissertation Abstracts International, Volume: 80-01(E), Section: B.
Contained By:	Dissertation Abstracts International 80-01B(E).
Subject:	Computer science.
Online resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10829520
ISBN:	9780438328501

Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud.
Lin, Hao.

Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud. - Ann Arbor : ProQuest Dissertations & Theses, 2018 - 109 p.

Source: Dissertation Abstracts International, Volume: 80-01(E), Section: B.

Thesis (Ph.D.)--Purdue University, 2018.

Large-scale data management and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling and have large user bases. R is among the most widely used of these languages, but it is limited by a single-threaded execution model and to problem sizes that fit on a single node. We propose a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like Spark framework, and achieves high performance and scaling across clusters. RABID preserves the R programming model by introducing R-compatible distributed data structures with overloaded functions. Optimizations such as reducing the memory footprint, data pipelining and serialization, and operation merging are used to improve runtime performance. We compare RABID to several other frameworks.

ISBN: 9780438328501

Subjects--Topical Terms:

523869
Computer science.

Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud.
LDR:02711nmm a2200313 4500
001 2202161
005 20190513114558.5
008 201008s2018 ||||||||||||||||| ||eng d
020 $a 9780438328501
035 $a (MiAaPQ)AAI10829520
035 $a (MiAaPQ)purdue:22893
035 $a AAI10829520
040 $a MiAaPQ $c MiAaPQ
100 1 $a Lin, Hao. $3 3428908
245 1 0 $a Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2018
300 $a 109 p.
500 $a Source: Dissertation Abstracts International, Volume: 80-01(E), Section: B.
500 $a Adviser: Samuel P. Midkiff.
502 $a Thesis (Ph.D.)--Purdue University, 2018.
520 $a Large-scale data management and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling and have large user bases. R is among the most widely used of these languages, but it is limited by a single-threaded execution model and to problem sizes that fit on a single node. We propose a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like Spark framework, and achieves high performance and scaling across clusters. RABID preserves the R programming model by introducing R-compatible distributed data structures with overloaded functions. Optimizations such as reducing the memory footprint, data pipelining and serialization, and operation merging are used to improve runtime performance. We compare RABID to several other frameworks.
520 $a In the era of cloud computing, batch data-processing workloads such as RABID applications are targeted to run in VMs or containers in a cloud-based data center. Efficient scheduling of data center VMs can reduce the number of physical servers needed and, in turn, reduce the energy and other capital costs of maintaining the virtualized data center. We propose an innovative data-driven approach to achieve efficient proactive VM scheduling. Our approach uses a multi-capacity bin-packing technique that efficiently places VMs onto physical servers. We use time-series analysis to extract not only low-frequency information about future VM workloads but also high-frequency information about VM workload correlations. This approach can also be implemented in RABID and leverage its high performance.
590 $a School code: 0183.
650 4 $a Computer science. $3 523869
650 4 $a Computer engineering. $3 621879
690 $a 0984
690 $a 0464
710 2 $a Purdue University. $b Electrical and Computer Engineering. $3 1018497
773 0 $t Dissertation Abstracts International $g 80-01B(E).
790 $a 0183
791 $a Ph.D.
792 $a 2018
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10829520