Robust Learning and Evaluation in Sequential Decision Making.
Record type:
Bibliographic record - Electronic resource : Monograph/item
Title/Author:
Robust Learning and Evaluation in Sequential Decision Making. / Keramati, Ramtin.
Author:
Keramati, Ramtin.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2021
Description:
135 p.
Notes:
Source: Dissertations Abstracts International, Volume: 83-05, Section: B.
Contained By:
Dissertations Abstracts International, 83-05B.
Subject:
Diabetes.
Electronic resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=28688332
ISBN:
9798544203834
Robust Learning and Evaluation in Sequential Decision Making.
Keramati, Ramtin.
Robust Learning and Evaluation in Sequential Decision Making. - Ann Arbor : ProQuest Dissertations & Theses, 2021 - 135 p.
Source: Dissertations Abstracts International, Volume: 83-05, Section: B.
Thesis (Ph.D.)--Stanford University, 2021.
This item must not be sold to any third party vendors.
Reinforcement learning (RL), as a branch of artificial intelligence, is concerned with making a good sequence of decisions given experience and rewards in a stochastic environment. RL algorithms, propelled by the rise of deep learning and neural networks, have achieved impressive, human-level performance in games like Go, Chess, and Atari. However, these results have not yet been matched in high-stakes real-world applications. This dissertation tackles several important challenges around robustness that hinder our ability to unleash the potential of RL in real-world applications. We look at the robustness of RL algorithms in both online and offline settings and introduce new algorithms that may be of particular interest when applying RL to real-world domains such as health care and education.

In the first line of work, we consider an online setting where the agent can interact with the environment and collect experience and rewards to learn the optimal sequence of decisions. In many real-world applications, online interactions are limited, which restricts our ability to collect data and raises the need for sample-efficient algorithms. In addition, safety concerns highlight the importance of learning risk-sensitive policies in these applications. We therefore combine recent advances in distributional reinforcement learning with the principle of optimism in the face of uncertainty to develop a scalable algorithm that learns a CVaR (conditional value at risk) optimal policy in a sample-efficient manner, minimizing the number of interactions needed with the environment.

In high-stakes real-world applications, any online interaction is often undesirable, and we must be able to perform off-policy policy evaluation (OPE). OPE methods evaluate a new policy (the evaluation policy) given experiences collected under another policy (the behavior policy). For example, an agent that aims to learn an adaptive treatment plan for patients in a hospital may not be able to collect any experience by interacting with patients, but can use data containing past decisions made by clinicians and their outcomes. OPE is a counterfactual and challenging task that is often solved by making a crucial assumption, sequential ignorability: the evaluation policy has access to all the information used by the behavior policy to make decisions; in other words, there are no unobserved confounders. This assumption is often violated in observational data, and failing to acknowledge that results in an arbitrarily biased estimate of the evaluation policy's value. In this dissertation, we consider a bounded effect of unobserved confounders and develop a scalable algorithm to provide bounds on OPE. Our work can be used to raise concerns about, or certify, the superior performance of an evaluation policy in the presence of unobserved confounders, and helps prevent undesirable outcomes when deploying a new decision policy.

One shortcoming of existing OPE methods for sequential decision making is that they typically evaluate expected performance over a distribution of individuals. However, in most real-world applications we would like to assess whether subgroups of the population benefit from a newly suggested policy. In this dissertation, we take a step toward quantifying heterogeneity in OPE for sequential decision making and identifying subgroups that experience similar benefit or harm from the evaluation policy.
ISBN: 9798544203834
Subjects--Topical Terms:
Diabetes.
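The abstract above refers to CVaR-optimal policies and to off-policy evaluation (OPE) under sequential ignorability. As a rough illustration of those two quantities only, here is a minimal, self-contained Python sketch; the function names and toy data are illustrative assumptions, and this is not the dissertation's method (which combines distributional RL with optimism and derives confounding-robust bounds rather than assuming ignorability).

```python
# Minimal sketch (not from the dissertation): CVaR of a return sample, and an
# ordinary importance-sampling (IS) estimator for off-policy policy evaluation.
import numpy as np

def cvar(returns, alpha=0.1):
    """CVaR_alpha of a return sample: mean of the worst alpha-fraction of returns."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

def is_ope(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary IS estimate of the value of pi_e from data collected under pi_b.

    trajectories: list of [(state, action, reward), ...]
    pi_e, pi_b:   callables (state, action) -> probability of taking `action`
    Unbiased only under sequential ignorability (no unobserved confounders).
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(s, a) / pi_b(s, a)   # cumulative importance weight
            ret += (gamma ** t) * r             # discounted return
        estimates.append(weight * ret)
    return float(np.mean(estimates))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy one-step problem: behavior policy is uniform over two actions,
    # evaluation policy prefers action 1, which has higher mean reward.
    pi_b = lambda s, a: 0.5
    pi_e = lambda s, a: 0.8 if a == 1 else 0.2
    data = []
    for _ in range(5000):
        a = int(rng.integers(2))
        r = float(rng.normal(1.0 if a == 1 else 0.0, 1.0))
        data.append([(0, a, r)])
    print("IS estimate of V(pi_e):", round(is_ope(data, pi_e, pi_b), 3))
    print("CVaR_0.1 of observed returns:",
          round(cvar([traj[0][2] for traj in data], alpha=0.1), 3))
```

The IS estimator shown here requires known behavior-policy probabilities and no unobserved confounders; relaxing that second requirement, and bounding the resulting bias, is exactly the problem the dissertation addresses.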
Robust Learning and Evaluation in Sequential Decision Making.
LDR  04572nmm a2200361 4500
001  2349840
005  20221010063635.5
008  241004s2021 ||||||||||||||||| ||eng d
020  $a 9798544203834
035  $a (MiAaPQ)AAI28688332
035  $a (MiAaPQ)STANFORDdd732zb2339
035  $a AAI28688332
040  $a MiAaPQ $c MiAaPQ
100  1  $a Keramati, Ramtin. $3 3689262
245  1 0  $a Robust Learning and Evaluation in Sequential Decision Making.
260  1  $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2021
300  $a 135 p.
500  $a Source: Dissertations Abstracts International, Volume: 83-05, Section: B.
500  $a Advisor: Brunskill, Emma; Pavone, Marco; Van Roy, Benjamin.
502  $a Thesis (Ph.D.)--Stanford University, 2021.
506  $a This item must not be sold to any third party vendors.
520  $a Reinforcement learning (RL), as a branch of artificial intelligence, is concerned with making a good sequence of decisions given experience and rewards in a stochastic environment. RL algorithms, propelled by the rise of deep learning and neural networks, have achieved impressive, human-level performance in games like Go, Chess, and Atari. However, these results have not yet been matched in high-stakes real-world applications. This dissertation tackles several important challenges around robustness that hinder our ability to unleash the potential of RL in real-world applications. We look at the robustness of RL algorithms in both online and offline settings and introduce new algorithms that may be of particular interest when applying RL to real-world domains such as health care and education. In the first line of work, we consider an online setting where the agent can interact with the environment and collect experience and rewards to learn the optimal sequence of decisions. In many real-world applications, online interactions are limited, which restricts our ability to collect data and raises the need for sample-efficient algorithms. In addition, safety concerns highlight the importance of learning risk-sensitive policies in these applications. We therefore combine recent advances in distributional reinforcement learning with the principle of optimism in the face of uncertainty to develop a scalable algorithm that learns a CVaR (conditional value at risk) optimal policy in a sample-efficient manner, minimizing the number of interactions needed with the environment. In high-stakes real-world applications, any online interaction is often undesirable, and we must be able to perform off-policy policy evaluation (OPE). OPE methods evaluate a new policy (the evaluation policy) given experiences collected under another policy (the behavior policy). For example, an agent that aims to learn an adaptive treatment plan for patients in a hospital may not be able to collect any experience by interacting with patients, but can use data containing past decisions made by clinicians and their outcomes. OPE is a counterfactual and challenging task that is often solved by making a crucial assumption, sequential ignorability: the evaluation policy has access to all the information used by the behavior policy to make decisions; in other words, there are no unobserved confounders. This assumption is often violated in observational data, and failing to acknowledge that results in an arbitrarily biased estimate of the evaluation policy's value. In this dissertation, we consider a bounded effect of unobserved confounders and develop a scalable algorithm to provide bounds on OPE. Our work can be used to raise concerns about, or certify, the superior performance of an evaluation policy in the presence of unobserved confounders, and helps prevent undesirable outcomes when deploying a new decision policy. One shortcoming of existing OPE methods for sequential decision making is that they typically evaluate expected performance over a distribution of individuals. However, in most real-world applications we would like to assess whether subgroups of the population benefit from a newly suggested policy. In this dissertation, we take a step toward quantifying heterogeneity in OPE for sequential decision making and identifying subgroups that experience similar benefit or harm from the evaluation policy.
590  $a School code: 0212.
650  4  $a Diabetes. $3 544344
650  4  $a Human performance. $3 3562051
650  4  $a Artificial intelligence. $3 516317
650  4  $a Sepsis. $3 3560733
650  4  $a Optimization. $3 891104
650  4  $a Decision making. $3 517204
650  4  $a Neural networks. $3 677449
650  4  $a Autism. $3 526650
650  4  $a Algorithms. $3 536374
650  4  $a Confidence intervals. $3 566017
650  4  $a Ablation. $3 3562462
650  4  $a Feedback. $3 677181
650  4  $a Games. $3 525308
650  4  $a Education. $3 516579
650  4  $a Psychology. $3 519075
650  4  $a Computer science. $3 523869
650  4  $a Disability studies. $3 543687
650  4  $a Recreation. $3 535376
690  $a 0800
690  $a 0515
690  $a 0621
690  $a 0984
690  $a 0201
690  $a 0814
710  2  $a Stanford University. $3 754827
773  0  $t Dissertations Abstracts International $g 83-05B.
790  $a 0212
791  $a Ph.D.
792  $a 2021
793  $a English
856  4 0  $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=28688332
Holdings:
Barcode: W9472278
Location: 電子資源 (electronic resources)
Circulation category: 11.線上閱覽_V (online reading)
Material type: 電子書 (e-book)
Call number: EB
Use type: 一般使用 (Normal)
Loan status: 在架 (on shelf)
Hold status: 0