Distributed Feature Selection in Large n and Large p Regression Problems.
Record Type:
Electronic resources : Monograph/item
Title/Author:
Distributed Feature Selection in Large n and Large p Regression Problems.
Author:
Wang, Xiangyu.
Published:
Ann Arbor : ProQuest Dissertations & Theses, 2016.
Description:
129 p.
Notes:
Source: Dissertation Abstracts International, Volume: 78-02(E), Section: B.
Contained By:
Dissertation Abstracts International, 78-02B(E).
Subject:
Statistics.
Online resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10146973
ISBN:
9781369022711
Distributed Feature Selection in Large n and Large p Regression Problems.
Wang, Xiangyu.
Distributed Feature Selection in Large n and Large p Regression Problems.
- Ann Arbor : ProQuest Dissertations & Theses, 2016 - 129 p.
Source: Dissertation Abstracts International, Volume: 78-02(E), Section: B.
Thesis (Ph.D.)--Duke University, 2016.
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arises in defining an algorithm with low communication, theoretical guarantees, and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
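The four steps named in the abstract (select per subset, take the median inclusion index, refit per subset, average) can be sketched as follows. This is an illustrative sketch only, not the thesis implementation: the coordinate-descent lasso used as the selector, the penalty value `lam`, the reading of the median as a strict majority vote, the least-squares refit, and the function names `lasso_cd` and `message_sketch` are all assumptions made for the example. The subsets are processed in a loop here, although each one could run on a separate worker.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    Stands in for "regularized regression"; any selector could be used.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # residual y - X @ beta, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * (new - beta[j])
            beta[j] = new
    return beta


def message_sketch(X, y, m, lam):
    """Hypothetical helper: row-partitioned selection and averaging."""
    n, p = X.shape
    row_blocks = np.array_split(np.arange(n), m)  # sample-space partition

    # Step 1: feature selection on each subset independently.
    inclusion = np.array(
        [lasso_cd(X[idx], y[idx], lam) != 0 for idx in row_blocks]
    )

    # Step 2: median inclusion index (assumed: ties count as excluded).
    votes = np.median(inclusion.astype(float), axis=0)
    selected = np.flatnonzero(votes > 0.5)
    beta = np.zeros(p)
    if selected.size == 0:
        return selected, beta

    # Step 3: least-squares refit on the selected features, per subset.
    coefs = [
        np.linalg.lstsq(X[idx][:, selected], y[idx], rcond=None)[0]
        for idx in row_blocks
    ]

    # Step 4: average the per-subset estimates.
    beta[selected] = np.mean(coefs, axis=0)
    return selected, beta


# Synthetic data: three active features out of ten (made up for the demo).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.standard_normal(1000)

selected, beta_hat = message_sketch(X, y, m=5, lam=0.2)
print("selected features:", selected)
print("averaged estimate:", np.round(beta_hat, 3))
```

In this sketch the only quantities that would cross workers are one inclusion vector and one short coefficient vector per subset, which is consistent with the abstract's description of minimal communication.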
ISBN: 9781369022711
Subjects--Topical Terms:
517247
Statistics.
Distributed Feature Selection in Large n and Large p Regression Problems.
LDR
:03939nmm a2200313 4500
001
2116760
005
20170508081324.5
008
180830s2016 ||||||||||||||||| ||eng d
020
$a
9781369022711
035
$a
(MiAaPQ)AAI10146973
035
$a
AAI10146973
040
$a
MiAaPQ
$c
MiAaPQ
100
1
$a
Wang, Xiangyu.
$3
966797
245
1 0
$a
Distributed Feature Selection in Large n and Large p Regression Problems.
260
1
$a
Ann Arbor :
$b
ProQuest Dissertations & Theses,
$c
2016
300
$a
129 p.
500
$a
Source: Dissertation Abstracts International, Volume: 78-02(E), Section: B.
500
$a
Adviser: David B. Dunson.
502
$a
Thesis (Ph.D.)--Duke University, 2016.
520
$a
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arises in defining an algorithm with low communication, theoretical guarantees, and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
520
$a
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In this thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
520
$a
For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework DEME (DECO-message) by leveraging both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted on a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable.
590
$a
School code: 0066.
650
4
$a
Statistics.
$3
517247
650
4
$a
Computer science.
$3
523869
690
$a
0463
690
$a
0984
710
2
$a
Duke University.
$b
Statistical Science.
$3
1023903
773
0
$t
Dissertation Abstracts International
$g
78-02B(E).
790
$a
0066
791
$a
Ph.D.
792
$a
2016
793
$a
English
856
4 0
$u
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10146973
Inventory Number:
W9327379
Location Name:
Electronic resources
Item Class:
01. Loan (book)_YB
Material type:
E-book
Call number:
EB
Usage Class:
Normal
Loan Status:
On shelf
No. of reservations:
0