Advances in Audiovisual Speech Processing for Robust Voice Activity Detection and Automatic Speech Recognition.
Record type:
Bibliographic - Electronic resource : Monograph/item
Title/Author:
Advances in Audiovisual Speech Processing for Robust Voice Activity Detection and Automatic Speech Recognition.
Author:
Tao, Fei.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2018
Pagination:
177 p.
Notes:
Source: Dissertations Abstracts International, Volume: 80-09, Section: B.
Contained By:
Dissertations Abstracts International, 80-09B.
Subject:
Electrical engineering.
Electronic resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=13849175
ISBN:
9780438962880
Thesis (Ph.D.)--The University of Texas at Dallas, 2018.
This item must not be sold to any third party vendors.
Speech processing systems are widely used in commercial applications, including virtual assistants in smartphones and home assistant devices. Speech-based commands provide convenient hands-free functionality for users. Two key speech processing systems in practical applications are voice activity detection (VAD), which aims to detect when a user is speaking to a system, and automatic speech recognition (ASR), which aims to recognize what the user is saying. A limitation of these speech tasks is the drop in performance observed in noisy environments or when the speech mode differs from neutral speech (e.g., whisper speech). Emerging audiovisual solutions provide principled frameworks for increasing the robustness of these systems by incorporating features describing lip motion. This study proposes novel audiovisual solutions for VAD and ASR tasks.
The dissertation introduces unsupervised and supervised audiovisual voice activity detection (AV-VAD). The unsupervised approach combines visual features that are characteristic of the semi-periodic nature of articulatory production around the orofacial area. The visual features are combined using principal component analysis (PCA) to obtain a single feature. The threshold between speech and non-speech activity is automatically estimated with the expectation-maximization (EM) algorithm. The decision boundary is improved with the Bayesian information criterion (BIC) algorithm, resolving temporal ambiguities caused by different sampling rates and anticipatory movements. The supervised framework corresponds to the bimodal recurrent neural network (BRNN), which captures the task-related characteristics in the audio and visual inputs and models the temporal information within and across modalities. The approach relies on three subnetworks implemented with long short-term memory (LSTM) networks. This framework is implemented with either hand-crafted features or feature representations derived directly from the data (i.e., an end-to-end system). The study also extends this framework by strengthening the temporal modeling with advanced LSTMs (A-LSTMs).
For audiovisual automatic speech recognition (AV-ASR), the study explores the use of visual features to compensate for the mismatch observed when the system is evaluated with whisper speech. We propose supervised adaptation schemes which significantly reduce the mismatch between normal and whisper speech across speakers. The study also introduces the gating neural network (GNN), which aims to attenuate the effect of unreliable features, creating AV-ASR systems that improve, or at least maintain, the performance of an ASR system implemented with speech alone.
Finally, the dissertation introduces the front-end alignment neural network (AliNN) to address the temporal alignment problem between audio and visual features. This front-end system is important because lip motion often precedes speech (e.g., anticipatory movements). The framework relies on an RNN with an attention model. The resulting aligned features are concatenated and fed to conventional back-end ASR systems, obtaining performance improvements.
The proposed approaches for AV-VAD and AV-ASR are evaluated on large audiovisual corpora, achieving competitive performance in real-world scenarios and outperforming conventional audio-based VAD and ASR systems as well as alternative audiovisual systems proposed in previous studies.
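The unsupervised AV-VAD pipeline summarized above lends itself to a compact illustration. The following is a minimal sketch, not the dissertation's implementation: it projects per-frame visual descriptors onto a single principal component and estimates the speech/non-speech threshold with EM via a two-component Gaussian mixture, omitting the BIC-based boundary refinement. All names, feature choices, and dimensions are illustrative assumptions.

```python
# Minimal sketch of the unsupervised AV-VAD idea: collapse several
# per-frame visual descriptors into one activity score with PCA, then
# let EM (a two-component Gaussian mixture) separate speech from
# non-speech frames automatically. Illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def unsupervised_av_vad(visual_features: np.ndarray) -> np.ndarray:
    """visual_features: (num_frames, num_features) matrix of orofacial
    descriptors (e.g., lip opening, optical-flow energy; hypothetical)."""
    # Combine the feature set into a single 1-D score per frame.
    score = PCA(n_components=1).fit_transform(visual_features)
    # EM fits two Gaussians: one for speech, one for non-speech frames.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(score)
    speech = int(np.argmax(gmm.means_))  # component with the larger mean
    return gmm.predict(score) == speech  # True where speech is active

# Example: 1000 frames, 5 visual descriptors per frame (synthetic data).
labels = unsupervised_av_vad(np.random.rand(1000, 5))
```

The component with the larger mean is taken as the speech class; in the dissertation's formulation, the BIC step described above would further refine this decision boundary.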
Taken collectively, this dissertation makes algorithmic advances in audiovisual systems, representing novel contributions to the field of multimodal processing.
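To make the gating idea from the abstract concrete, here is a minimal PyTorch sketch of audiovisual fusion with a learned gate, in the spirit of the GNN: a sigmoid gate computed from both modalities scales the visual features, so unreliable visual input can be attenuated before fusion. The module name, layer sizes, and dimensions are illustrative assumptions, not the dissertation's architecture.

```python
# Sketch of gated audiovisual fusion: a gate in [0, 1], conditioned on
# both modalities, rescales the visual features before they are fused
# with the audio features. Illustrative placeholder architecture.
import torch
import torch.nn as nn

class GatedAVFusion(nn.Module):
    def __init__(self, audio_dim: int, visual_dim: int, out_dim: int):
        super().__init__()
        # Gate conditioned on the concatenated audio and visual inputs.
        self.gate = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, visual_dim), nn.Sigmoid())
        self.proj = nn.Linear(audio_dim + visual_dim, out_dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        g = self.gate(torch.cat([audio, visual], dim=-1))
        gated_visual = g * visual  # attenuate unreliable visual features
        return self.proj(torch.cat([audio, gated_visual], dim=-1))

# Example: batch of 8 frames, 40-D audio and 20-D visual features.
fused = GatedAVFusion(40, 20, 64)(torch.randn(8, 40), torch.randn(8, 20))
```

When the visual stream is uninformative (occlusion, poor lighting), the gate can drive its contribution toward zero, which is how such systems can improve on, or at least match, an audio-only ASR baseline.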
MARC:
LDR 04808nmm a2200337 4500
001 2207877
005 20190923114246.5
008 201008s2018 ||||||||||||||||| ||eng d
020 $a 9780438962880
035 $a (MiAaPQ)AAI13849175
035 $a (MiAaPQ)0382vireo:739Tao
035 $a AAI13849175
040 $a MiAaPQ $c MiAaPQ
100 1 $a Tao, Fei. $3 2132692
245 10 $a Advances in Audiovisual Speech Processing for Robust Voice Activity Detection and Automatic Speech Recognition.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2018
300 $a 177 p.
500 $a Source: Dissertations Abstracts International, Volume: 80-09, Section: B.
500 $a Publisher info.: Dissertation/Thesis.
500 $a Advisor: Busso, Carlos.
502 $a Thesis (Ph.D.)--The University of Texas at Dallas, 2018.
506 $a This item must not be sold to any third party vendors.
506 $a This item must not be added to any third party search indexes.
520 $a Speech processing systems are widely used in commercial applications, including virtual assistants in smartphones and home assistant devices. Speech-based commands provide convenient hands-free functionality for users. Two key speech processing systems in practical applications are voice activity detection (VAD), which aims to detect when a user is speaking to a system, and automatic speech recognition (ASR), which aims to recognize what the user is saying. A limitation of these speech tasks is the drop in performance observed in noisy environments or when the speech mode differs from neutral speech (e.g., whisper speech). Emerging audiovisual solutions provide principled frameworks for increasing the robustness of these systems by incorporating features describing lip motion. This study proposes novel audiovisual solutions for VAD and ASR tasks. The dissertation introduces unsupervised and supervised audiovisual voice activity detection (AV-VAD). The unsupervised approach combines visual features that are characteristic of the semi-periodic nature of articulatory production around the orofacial area. The visual features are combined using principal component analysis (PCA) to obtain a single feature. The threshold between speech and non-speech activity is automatically estimated with the expectation-maximization (EM) algorithm. The decision boundary is improved with the Bayesian information criterion (BIC) algorithm, resolving temporal ambiguities caused by different sampling rates and anticipatory movements. The supervised framework corresponds to the bimodal recurrent neural network (BRNN), which captures the task-related characteristics in the audio and visual inputs and models the temporal information within and across modalities. The approach relies on three subnetworks implemented with long short-term memory (LSTM) networks. This framework is implemented with either hand-crafted features or feature representations derived directly from the data (i.e., an end-to-end system). The study also extends this framework by strengthening the temporal modeling with advanced LSTMs (A-LSTMs). For audiovisual automatic speech recognition (AV-ASR), the study explores the use of visual features to compensate for the mismatch observed when the system is evaluated with whisper speech. We propose supervised adaptation schemes which significantly reduce the mismatch between normal and whisper speech across speakers. The study also introduces the gating neural network (GNN), which aims to attenuate the effect of unreliable features, creating AV-ASR systems that improve, or at least maintain, the performance of an ASR system implemented with speech alone. Finally, the dissertation introduces the front-end alignment neural network (AliNN) to address the temporal alignment problem between audio and visual features. This front-end system is important because lip motion often precedes speech (e.g., anticipatory movements). The framework relies on an RNN with an attention model. The resulting aligned features are concatenated and fed to conventional back-end ASR systems, obtaining performance improvements. The proposed approaches for AV-VAD and AV-ASR are evaluated on large audiovisual corpora, achieving competitive performance in real-world scenarios and outperforming conventional audio-based VAD and ASR systems as well as alternative audiovisual systems proposed in previous studies. Taken collectively, this dissertation makes algorithmic advances in audiovisual systems, representing novel contributions to the field of multimodal processing.
590 $a School code: 0382.
650 4 $a Electrical engineering. $3 649834
650 4 $a Computer science. $3 523869
690 $a 0544
690 $a 0984
710 2 $a The University of Texas at Dallas. $b Electrical Engineering. $3 1679269
773 0 $t Dissertations Abstracts International $g 80-09B.
790 $a 0382
791 $a Ph.D.
792 $a 2018
793 $a English
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=13849175
Holdings:
Barcode: W9384426
Location: Electronic Resources
Circulation category: 11. Online reading_V
Material type: E-book
Call number: EB
Use type: Normal
Loan status: On shelf
Attachments: 0