Joint Vision and Language Modeling for Semantic Understanding.
Record Type:
Bibliographic - Electronic Resource : Monograph/item
Title/Author:
Joint Vision and Language Modeling for Semantic Understanding.
Author:
Liu, Sheng.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2022
Description:
163 p.
Notes:
Source: Dissertations Abstracts International, Volume: 83-12, Section: B.
Contained By:
Dissertations Abstracts International, 83-12B.
Subject:
Computer science.
Electronic Resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=29209240
ISBN:
9798819394786
LDR    02852nmm a2200373 4500
001    2349021
005    20220920134647.5
008    241004s2022 ||||||||||||||||| ||eng d
020    $a 9798819394786
035    $a (MiAaPQ)AAI29209240
035    $a AAI29209240
040    $a MiAaPQ $c MiAaPQ
100 1  $a Liu, Sheng. $3 1281028
245 10 $a Joint Vision and Language Modeling for Semantic Understanding.
260 1  $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2022
300    $a 163 p.
500    $a Source: Dissertations Abstracts International, Volume: 83-12, Section: B.
500    $a Advisor: Yuan, Junsong.
502    $a Thesis (Ph.D.)--State University of New York at Buffalo, 2022.
506    $a This item must not be sold to any third party vendors.
506    $a This item must not be added to any third party search indexes.
520    $a Vision-and-language is the intersection of computer vision and natural language processing. It is a broad field of study that aims to jointly understand and interpret both visual content, e.g., images or videos, and text. Vision-and-language is critical for building intelligent agents that can interact with humans and has a variety of applications, e.g., human-computer interaction and socially assistive robots. While deep learning-based vision-and-language models can perform many vision-and-language tasks, including but not limited to image captioning, visual question answering, and text-to-image retrieval, their limited scalability and interpretability and their inability to perform semantic video understanding hinder further advances in vision-and-language methods. To address the challenge of scalability in visual question answering, we propose to formulate visual question answering as a fill-in-the-blank problem and answer the question with a novel query-dependent prompt generation method. In addition, we introduce a novel task called open-vocabulary visual instance search, which aims to search for arbitrary kinds of visual instances using textual queries. To help improve the interpretability of vision-and-language models, we propose to perform visual relationship grounding with a novel And-Or graph-based compositional model. To understand the semantics of videos so that more accurate video captions can be generated, we introduce a sibling Transformer encoder (SibTEn), which encodes videos with a dual-branch architecture. Extensive experiments on different benchmarks validate the effectiveness of the proposed methods.
590    $a School code: 0656.
650  4 $a Computer science. $3 523869
650  4 $a Computer engineering. $3 621879
650  4 $a Information technology. $3 532993
650  4 $a Information science. $3 554358
653    $a Vision-and-language
653    $a Vision-and-language pre-training
653    $a Visual instance search
690    $a 0984
690    $a 0489
690    $a 0464
690    $a 0723
710 2  $a State University of New York at Buffalo. $b Computer Science and Engineering. $3 1035503
773 0  $t Dissertations Abstracts International $g 83-12B.
790    $a 0656
791    $a Ph.D.
792    $a 2022
793    $a English
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=29209240
Holdings (1 item):
Barcode: W9471459
Location: Electronic Resources
Circulation Category: 11.線上閱覽_V (Online Reading)
Material Type: E-book
Call Number: EB
Use Type: Normal
Loan Status: On shelf
Holds: 0