Joint Vision and Language Modeling for Semantic Understanding.
Record Type:
Bibliographic - Electronic Resource : Monograph/item
Title/Author:
Joint Vision and Language Modeling for Semantic Understanding.
Author:
Liu, Sheng.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2022
Description:
163 p.
Notes:
Source: Dissertations Abstracts International, Volume: 83-12, Section: B.
Contained By:
Dissertations Abstracts International, 83-12B.
Subject:
Computer science.
Electronic Resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=29209240
ISBN:
9798819394786
LDR    02852nmm a2200373 4500
001    2349021
005    20220920134647.5
008    241004s2022 ||||||||||||||||| ||eng d
020    $a 9798819394786
035    $a (MiAaPQ)AAI29209240
035    $a AAI29209240
040    $a MiAaPQ $c MiAaPQ
100 1  $a Liu, Sheng. $3 1281028
245 10 $a Joint Vision and Language Modeling for Semantic Understanding.
260 1  $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2022
300    $a 163 p.
500    $a Source: Dissertations Abstracts International, Volume: 83-12, Section: B.
500    $a Advisor: Yuan, Junsong.
502    $a Thesis (Ph.D.)--State University of New York at Buffalo, 2022.
506    $a This item must not be sold to any third party vendors.
506    $a This item must not be added to any third party search indexes.
520    $a Vision-and-language is the intersection of computer vision and natural language processing. It is a broad field of study that aims to jointly understand and interpret both visual content, e.g., images or videos, and text. Vision-and-language is critical for building intelligent agents that can interact with humans and has a variety of applications, e.g., human-computer interaction and socially assistive robots. While deep learning-based vision-and-language models can perform many vision-and-language tasks, including but not limited to image captioning, visual question answering, and text-to-image retrieval, their limited scalability and interpretability and their inability to perform semantic video understanding hinder further advances in vision-and-language methods. To address the challenge of scalability in visual question answering, we propose to formulate visual question answering as a fill-in-the-blank problem and answer the question with a novel query-dependent prompt generation method. In addition, we introduce a novel task called open-vocabulary visual instance search, which aims to search for arbitrary kinds of visual instances using textual queries. To help improve the interpretability of vision-and-language models, we propose to perform visual relationship grounding with a novel And-Or graph-based compositional model. To understand the semantics of videos so that more accurate video captions can be generated, we introduce a sibling Transformer encoder (SibTEn), which encodes videos with a dual-branch architecture. Extensive experiments on different benchmarks validate the effectiveness of the proposed methods.
590    $a School code: 0656.
650  4 $a Computer science. $3 523869
650  4 $a Computer engineering. $3 621879
650  4 $a Information technology. $3 532993
650  4 $a Information science. $3 554358
653    $a Vision-and-language
653    $a Vision-and-language pre-training
653    $a Visual instance search
690    $a 0984
690    $a 0489
690    $a 0464
690    $a 0723
710 2  $a State University of New York at Buffalo. $b Computer Science and Engineering. $3 1035503
773 0  $t Dissertations Abstracts International $g 83-12B.
790    $a 0656
791    $a Ph.D.
792    $a 2022
793    $a English
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=29209240
Holdings (1 item):
Barcode: W9471459
Location: Electronic Resources
Circulation Category: 11.線上閱覽_V (Online Reading)
Material Type: E-book
Call Number: EB
Use Type: Normal
Loan Status: On shelf
Holds: 0