Dong Hwa University Library


Internationalization of Task-Oriented Dialogue Systems.

Record type:	Bibliographic - electronic resource : Monograph/item
Title/Author:	Internationalization of Task-Oriented Dialogue Systems.
Author:	Moradshahi, Mehrad.
Publisher:	Ann Arbor : ProQuest Dissertations & Theses, 2023
Extent:	115 p.
Note:	Source: Dissertations Abstracts International, Volume: 85-06, Section: B.
Contained By:	Dissertations Abstracts International, 85-06B.
Subject:	Multilingualism.
Electronic resource:	https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30726801
ISBN:	9798381019070


Internationalization of Task-Oriented Dialogue Systems. - Ann Arbor : ProQuest Dissertations & Theses, 2023 - 115 p.

Source: Dissertations Abstracts International, Volume: 85-06, Section: B.

Thesis (Ph.D.)--Stanford University, 2023.

This item must not be sold to any third party vendors.

Virtual assistants and task-oriented dialogue (ToD) agents are increasingly prevalent due to their utility in daily tasks. Despite the linguistic diversity worldwide, these digital assistants support only a few dominant languages. This restriction is due to the high cost and manual effort required to produce large, hand-annotated datasets to train these agents. Existing low-cost approaches that rely on cross-lingual embeddings or naive machine translation sacrifice substantial accuracy for data efficiency, and largely fail to create a usable dialogue agent.

This thesis introduces a novel solution to automatically create ToD agents in new languages by leveraging dialogue data in the source language and neural machine translation. The approach is based on automatic entity-aware training data translation, a concise dialogue data representation enabling effective zero-shot training, and a scalable and robust approach for creating end-to-end high-quality few-shot, validation, and test data, minimizing the manual effort needed.

To address data scarcity, we use neural machine translation to translate the training dataset from the source to the target language. We show that naive application of this approach would not yield good performance, as entities in the input can be mistranslated, transliterated, or omitted and then no longer match those in the annotation. We propose a series of techniques to improve the quality of the dataset by (1) leveraging word alignments from the neural translation model's cross-attention weights to preserve entities and (2) applying automatic data filtering based on textual semantic similarity to exclude poor translations. Using this approach, we create multilingual versions of Schema2QA, a single-turn question-answering dataset, in 10 different languages. Agents trained on our automatically translated data improve upon the previous state of the art by 30-40% and come within 5-8% of the original English agent.

Translation is inherently noisy and poses a special challenge in the end-to-end dialogue setting, where the amount of natural language encoded grows with each turn. The accumulation of errors can prevent a correct parse for the rest of the dialogue. To address this, we introduce a new distilled dialogue data representation which significantly reduces the amount of natural language encoded and decoded by the model. On the BiToD dataset, using our representation, we found a 14% improvement in Dialogue Success Rate (DSR) in the few-shot setting.

The lack of a high-quality realistic testbed for multilingual ToD evaluation has impeded accurate measurement of research progress on the topic. Prior work deployed human translators to either translate or post-edit an automatically translated dataset. However, this was done only for one or two subtasks of a dialogue agent, and training an intractable end-to-end agent was not possible. To address this, we initiated a global effort to extend a large-scale multi-domain dataset, RiSAWOZ (initially in Chinese), to several new languages: English, Korean, French, Hindi, and code-mixed English-Hindi. To ensure the best quality and fluency, we used human post-editing only for the few-shot, validation, and test data. The challenges encountered in creating this dataset at scale led us to create a toolset that makes post-editing for a new language much faster and cheaper. Experiments show that few-shot training achieves 63-88% of the original full-shot performance. The remaining gap motivates further research on multilingual ToD.
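The entity-preservation step that the abstract describes, using word alignments derived from cross-attention weights, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the toy attention matrix, and the choice to copy the source entity back into the translation (rather than substituting a localized value) are all assumptions made for illustration.

```python
from typing import List, Tuple


def align_entity(attention: List[List[float]], src_span: Tuple[int, int]) -> Tuple[int, int]:
    """Map a source-token span to a target-token span (hypothetical helper).

    attention[t][s] is assumed to be the cross-attention weight from
    target token t to source token s. src_span is (start, end), end exclusive.
    """
    start, end = src_span
    # Keep every target token whose most-attended source token lies in the span.
    aligned = [
        t
        for t, row in enumerate(attention)
        if start <= max(range(len(row)), key=row.__getitem__) < end
    ]
    if not aligned:
        raise ValueError("no target token aligns to the entity span")
    return min(aligned), max(aligned) + 1


def translate_preserving_entity(
    src_tokens: List[str],
    tgt_tokens: List[str],
    attention: List[List[float]],
    src_span: Tuple[int, int],
) -> List[str]:
    """Overwrite the aligned target span with the original source entity."""
    t_start, t_end = align_entity(attention, src_span)
    entity = src_tokens[src_span[0]:src_span[1]]
    return tgt_tokens[:t_start] + entity + tgt_tokens[t_end:]


# Toy example: the translator has mangled the entity "Palo Alto".
src_tokens = ["find", "restaurants", "near", "Palo", "Alto"]
tgt_tokens = ["trouve", "des", "restaurants", "près", "de", "Poteau", "Haut"]
attention = [
    [0.90, 0.05, 0.02, 0.02, 0.01],  # trouve
    [0.10, 0.70, 0.10, 0.05, 0.05],  # des
    [0.05, 0.85, 0.05, 0.03, 0.02],  # restaurants
    [0.02, 0.05, 0.80, 0.08, 0.05],  # près
    [0.02, 0.03, 0.60, 0.20, 0.15],  # de
    [0.01, 0.02, 0.05, 0.80, 0.12],  # Poteau
    [0.01, 0.02, 0.05, 0.12, 0.80],  # Haut
]
print(" ".join(translate_preserving_entity(src_tokens, tgt_tokens, attention, (3, 5))))
```

Taking the arg-max source token per target token is the simplest possible alignment rule. A real system would read the weights from the translation model and would likely need to handle diffuse or non-contiguous alignments, which this sketch ignores.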

ISBN: 9798381019070

Subjects--Topical Terms:

598147
Multilingualism.

LDR:04685nmm a2200373 4500
001 2394027
005 20240414211951.5
006 m o d
007 cr#unu||||||||
008 251215s2023 ||||||||||||||||| ||eng d
020 $a 9798381019070
035 $a (MiAaPQ)AAI30726801
035 $a (MiAaPQ)STANFORDkg582jk7231
035 $a AAI30726801
040 $a MiAaPQ $c MiAaPQ
100 1 $a Moradshahi, Mehrad. $3 3763516
245 1 0 $a Internationalization of Task-Oriented Dialogue Systems.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2023
300 $a 115 p.
500 $a Source: Dissertations Abstracts International, Volume: 85-06, Section: B.
500 $a Advisor: Lam, Monica; Boneh, Dan; Sadigh, Dorsa.
502 $a Thesis (Ph.D.)--Stanford University, 2023.
506 $a This item must not be sold to any third party vendors.
590 $a School code: 0212.
650 4 $a Multilingualism. $3 598147
650 4 $a Error analysis. $3 3562845
650 4 $a Semantics. $3 520060
650 4 $a Bilingual education. $3 2122778
650 4 $a Education. $3 516579
650 4 $a Language. $3 643551
650 4 $a Logic. $3 529544
650 4 $a Mathematics. $3 515831
690 $a 0282
690 $a 0515
690 $a 0679
690 $a 0395
690 $a 0405
710 2 $a Stanford University. $3 754827
773 0 $t Dissertations Abstracts International $g 85-06B.
790 $a 0212
791 $a Ph.D.
792 $a 2023
793 $a English
856 4 0 $u https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30726801