Automatic Failure Diagnosis for Distributed Systems.
Record type:
Bibliographic - electronic resource : Monograph/item
Title proper / author:
Automatic Failure Diagnosis for Distributed Systems./
Author:
Zhang, Yongle.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2021
Extent:
110 p.
Notes:
Source: Dissertations Abstracts International, Volume: 82-10, Section: B.
Contained By:
Dissertations Abstracts International 82-10B.
Subject:
Computer engineering.
Electronic resource:
https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=28263396
ISBN:
9798597081953
Thesis (Ph.D.)--University of Toronto (Canada), 2021.
This item must not be sold to any third party vendors.
Distributed software systems have become the backbone of Internet services. Failures in production distributed systems have severe consequences. A 63-minute outage of Amazon in 2018 caused a $100-million loss in revenue. Therefore, diagnosing such failures in distributed systems is particularly critical because it can reduce the service downtime and associated cost. However, failure diagnosis at data center scale is notoriously difficult because these systems are complex: there are numerous threads, processes, and nodes communicating concurrently.
Despite decades of effort dedicated to automated failure diagnosis, existing diagnosis techniques are either intrusive and incur non-negligible performance overhead in a production environment, or face scalability challenges when applied to complex software systems.
This dissertation aims to automate the human diagnosis procedure for distributed system failures. It makes two main contributions towards improving automated failure diagnosis techniques. The first contribution of this dissertation is a technique that can automatically locate the root cause in a failed distributed system execution. Identifying the root cause in a failed execution of a distributed system with billions of executed instructions is like finding a needle in a haystack. This dissertation designs and evaluates a tool, called Kairux, capable of pinpointing the root cause of a failure in a distributed system, in a fully automated way. The second contribution is a technique that can automatically reproduce failures from production distributed systems. Given a failure report, the first step of developers' diagnosis is typically to reproduce the failure. To automate this step, this dissertation designs and evaluates a technique, called Pensieve, that mimics developers' analysis of a chain of causally dependent events that lead to the failure using log analysis and program analysis. This dissertation provides the implementation of a practical tool capable of reconstructing near-minimal failure reproduction steps from log files and system bytecode, without human involvement.
By evaluating on some of the most complex, real-world failures from widely deployed distributed systems such as HBase, HDFS, and ZooKeeper, this dissertation shows that Pensieve is capable of formulating a minimal set of operations necessary to reproduce the failure, and Kairux can further pinpoint each failure's respective root cause.
ISBN: 9798597081953
Subjects--Topical Terms:
621879
Computer engineering.
Subjects--Index Terms:
Debugging
Automatic Failure Diagnosis for Distributed Systems.
LDR 03560nmm a2200361 4500
001 2283266
005 20211029084533.5
008 220723s2021 ||||||||||||||||| ||eng d
020 $a 9798597081953
035 $a (MiAaPQ)AAI28263396
035 $a AAI28263396
040 $a MiAaPQ $c MiAaPQ
100 1 $a Zhang, Yongle. $3 3515169
245 1 0 $a Automatic Failure Diagnosis for Distributed Systems.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2021
300 $a 110 p.
500 $a Source: Dissertations Abstracts International, Volume: 82-10, Section: B.
500 $a Advisor: Yuan, Ding.
502 $a Thesis (Ph.D.)--University of Toronto (Canada), 2021.
506 $a This item must not be sold to any third party vendors.
590 $a School code: 0779.
650 4 $a Computer engineering. $3 621879
650 4 $a Computer science. $3 523869
653 $a Debugging
653 $a Diagnosis
653 $a Distributed systems
653 $a Failure reproduction
653 $a Root cause
690 $a 0464
690 $a 0984
710 2 $a University of Toronto (Canada). $b Electrical and Computer Engineering. $3 2096349
773 0 $t Dissertations Abstracts International $g 82-10B.
790 $a 0779
791 $a Ph.D.
792 $a 2021
793 $a English
856 4 0 $u https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=28263396
Holdings:
Barcode: W9434999
Location: Electronic resources
Circulation category: 11. Online viewing_V
Material type: E-book
Call number: EB
Use type: Normal
Loan status: On shelf
Hold status: 0