Statistical Methods for Analyzing Large-Scale Biological Data.
Record type:
Bibliographic - Electronic resource : Monograph/item
Title/Author:
Statistical Methods for Analyzing Large-Scale Biological Data.
Author:
Dey, Rounak.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2018
Extent:
212 p.
Notes:
Source: Dissertations Abstracts International, Volume: 80-07, Section: B.
Contained By:
Dissertations Abstracts International, 80-07B.
Subject:
Biostatistics.
Electronic resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=11006859
ISBN:
9780438594548
Thesis (Ph.D.)--University of Michigan, 2018.
This item must not be added to any third party search indexes.
With the development of high-throughput biomedical technologies in recent years, the size of a typical biological dataset is increasing rapidly, especially in genomics, proteomics, and metabolomics. Typically, these large datasets contain a huge amount of information on each subject, while the number of subjects can range from small to extremely large. The challenges of analyzing these datasets are twofold: high dimensionality and the heavy computational burden of the analysis. The goal of this dissertation is to develop statistical and computational methods that address some of these challenges, providing researchers with analytical tools that scale to these large datasets and resolve the issues arising from high dimensionality. In Chapter II, we study the asymptotic behavior of principal component analysis (PCA) in high-dimensional data under the generalized spiked population model. We propose a series of methods for consistently estimating the population eigenvalues, the angles between the sample and population eigenvectors, and the correlation coefficients between the sample and population principal component (PC) scores, and for adjusting the shrinkage bias in the predicted PC scores. In Chapter III, we investigate the over-fitting problem of partial least squares (PLS) regression with high-dimensional predictors, which can make the predicted and observed outcomes almost identical even when the outcome is independent of the predictors. We further discuss a shrinkage-bias problem similar to the one in high-dimensional PCA, and propose a two-stage PLS (TPLS) method that addresses both problems. In Chapter IV, we focus on large-scale genome-wide or phenome-wide association studies (GWASs or PheWASs) of binary phenotypes derived from electronic health records (EHR) or biobanks. Because most EHR- or biobank-based binary phenotypes have severe case-control imbalance, existing methods cannot analyze them in a way that is both scalable and accurate. We develop a computationally efficient single-variant test that is approximately 100 times faster than the state-of-the-art Firth's test and provides well-calibrated p values even for phenotypes with extremely unbalanced case-control ratios. Our test can also adjust for non-genetic covariates and retains power similar to that of Firth's test. In Chapter V, we show that, because of the severe case-control imbalance in most biobank-based binary phenotypes, applying the traditional Z-score-based method to meta-analyze association results across multiple biobank-based association studies can produce conservative or anti-conservative p values. We propose two alternative meta-analysis methods that provide well-calibrated meta-analysis p values even when the individual studies have extremely unbalanced case-control ratios. The first method shares an approximation of the distribution of the score test statistic from each study, built with cubic Hermite splines; the second shares the overall genotype counts from each study. In summary, this dissertation develops statistical and computational methods that make efficient use of ever-growing modern biological datasets, helping researchers by addressing some of the problems associated with high dimensionality and by reducing the heavy computational burden of analyzing these large datasets.
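For context, the traditional Z-score-based meta-analysis mentioned in the abstract is commonly implemented as a sample-size-weighted (Stouffer-type) combination of per-study Z-scores. The sketch below illustrates that common form only; it is not code from the dissertation, and the function name `stouffer_meta` and the square-root sample-size weights are assumptions made here. The formula treats each study's statistic as approximately standard normal; the abstract reports that under severe case-control imbalance this approach can give conservative or anti-conservative p values.

```python
import math


def stouffer_meta(z_scores, sample_sizes):
    """Combine per-study Z-scores using sqrt(sample size) weights.

    Hypothetical helper for illustration: returns the combined Z-score
    and its two-sided p value under a standard normal reference.
    """
    if not z_scores or len(z_scores) != len(sample_sizes):
        raise ValueError("z_scores and sample_sizes must be non-empty and equal in length")

    # Weight each study by the square root of its sample size (an assumed,
    # commonly used choice; other weights are possible).
    weights = [math.sqrt(n) for n in sample_sizes]

    # Weighted sum of Z-scores, rescaled so the result has unit variance
    # when every input Z-score is standard normal.
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    z_meta = numerator / denominator

    # Two-sided p value: P(|N(0,1)| > |z_meta|) = erfc(|z_meta| / sqrt(2)).
    p_value = math.erfc(abs(z_meta) / math.sqrt(2.0))
    return z_meta, p_value


if __name__ == "__main__":
    # Two made-up studies of equal size, each with Z = 2.
    z, p = stouffer_meta([2.0, 2.0], [100, 100])
    print(f"combined Z = {z:.4f}, two-sided p = {p:.6f}")
```

The inputs in the usage lines are invented for illustration and do not come from the dissertation.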
LDR :04722nmm a2200325 4500
001 2208001
005 20190929184026.5
008 201008s2018 ||||||||||||||||| ||eng d
020 $a 9780438594548
035 $a (MiAaPQ)AAI11006859
035 $a (MiAaPQ)umichrackham:001999
035 $a AAI11006859
040 $a MiAaPQ $c MiAaPQ
100 1 $a Dey, Rounak. $3 3435008
245 1 0 $a Statistical Methods for Analyzing Large-Scale Biological Data.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2018
300 $a 212 p.
500 $a Source: Dissertations Abstracts International, Volume: 80-07, Section: B.
500 $a Publisher info.: Dissertation/Thesis.
500 $a Advisor: Lee, Seunggeun Shawn.
502 $a Thesis (Ph.D.)--University of Michigan, 2018.
506 $a This item must not be added to any third party search indexes.
506 $a This item must not be sold to any third party vendors.
590 $a School code: 0127.
650 4 $a Biostatistics. $3 1002712
690 $a 0308
710 2 $a University of Michigan. $b Biostatistics. $3 3352160
773 0 $t Dissertations Abstracts International $g 80-07B.
790 $a 0127
791 $a Ph.D.
792 $a 2018
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=11006859
Holdings (1 item)
Barcode: W9384550
Location: Electronic resources
Circulation category: 11. Online reading_V
Material type: E-book
Call number: EB
Usage type: General use (Normal)
Loan status: On shelf
Hold status: 0