Temporal Learning for Video-Language Understanding and Generation.
Record type: Bibliographic, electronic resource : Monograph/item
Title/Author: Temporal Learning for Video-Language Understanding and Generation. / Zhang, Songyang.
Author: Zhang, Songyang.
Publisher: Ann Arbor : ProQuest Dissertations & Theses, 2023
Pagination: 249 p.
Note: Source: Dissertations Abstracts International, Volume: 85-03, Section: A.
Contained by: Dissertations Abstracts International, 85-03A.
Subject: Computer science.
Electronic resource: https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30572454
ISBN: 9798380318648
LDR 03728nmm a2200385 4500
001 2399400
005 20240916065429.5
006 m o d
007 cr#unu||||||||
008 251215s2023 ||||||||||||||||| ||eng d
020 $a 9798380318648
035 $a (MiAaPQ)AAI30572454
035 $a AAI30572454
040 $a MiAaPQ $c MiAaPQ
100 1 $a Zhang, Songyang. $3 3681695
245 1 0 $a Temporal Learning for Video-Language Understanding and Generation.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2023
300 $a 249 p.
500 $a Source: Dissertations Abstracts International, Volume: 85-03, Section: A.
500 $a Advisor: Luo, Jiebo.
502 $a Thesis (Ph.D.)--University of Rochester, 2023.
520 $a Vision-language studies jointly perceive, understand, and generate over the vision and language modalities to perform various tasks, such as retrieving an image or video from a given sentence, or generating an image or video from a given sentence. Great success has been achieved in image-language studies; however, video-language studies still lag behind. Unlike image studies, which mainly focus on static objects and scenes, the core challenge in video studies is learning dynamic changes. Time is an intrinsic attribute of both video and language. How to encode time has been studied in both the CV and NLP communities; however, the alignment and interaction between the two modalities had rarely been studied until recently. In this thesis, we study temporal learning for video-language tasks from both the video and the language side. To enable interaction between different modalities, it is natural to ask how to learn an alignment between them. In the first part, we answer this question by studying a specific video-language alignment task, moment localization with natural language, which aims to retrieve a specific moment from an untrimmed video given a query sentence. We study this problem from two aspects: video context modeling and temporal language modeling. Once a good video-language alignment has been learned, a natural follow-up question is whether such knowledge can benefit conventional NLP tasks, or whether CV tasks can be learned with language as guidance. In the second part, we first investigate a conventional NLP problem, grammar induction, which aims to recover hierarchical syntactic structures from plain sentences. We find that leveraging the regularities between video and text can improve a parser's performance. We further investigate the dataset limitations of this approach and propose a solution that leverages instructional videos without any human effort. In the third part, we study video generation with language. We investigate this problem from the perspectives of dataset collection, spatio-temporal modeling, and efficiency. We develop a dataset to enable focused advances on some of the core challenges of multimodal video research. We also leverage text-to-image models to learn the correspondence between text and the visual world, and use unsupervised learning on unlabeled (unpaired) video data to learn realistic motion. We further propose a novel temporal shift module that leverages a text-to-image (T2I) model as-is for text-to-video (T2V) generation without adding any new parameters. Building on these works, we present several promising directions for future research.
590 $a School code: 0188.
650 4 $a Computer science. $3 523869
650 4 $a Information technology. $3 532993
650 4 $a Multimedia communications. $3 590562
653 $a Temporal learning
653 $a Vision-language studies
653 $a Video-language alignment
653 $a Video generation
653 $a Natural language processing
690 $a 0984
690 $a 0489
690 $a 0558
710 2 $a University of Rochester. $b Hajim School of Engineering and Applied Sciences. $3 2099687
773 0 $t Dissertations Abstracts International $g 85-03A.
790 $a 0188
791 $a Ph.D.
792 $a 2023
793 $a English
856 4 0 $u https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30572454
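
Aside: the 520 abstract mentions a parameter-free temporal shift module for reusing a text-to-image (T2I) model as-is for text-to-video (T2V) generation. The sketch below is a minimal, hypothetical illustration of how a TSM-style shift can exchange information between neighboring frames without adding any learned parameters; the function name, tensor shapes, and shift fraction are assumptions for illustration, not the dissertation's actual code.

# Hypothetical sketch of a parameter-free temporal shift (TSM-style).
# All names and shapes are illustrative assumptions.
import torch

def temporal_shift(x: torch.Tensor, shift_frac: float = 0.25) -> torch.Tensor:
    """Shift a fraction of channels forward/backward along the frame axis.

    x: features shaped (batch, frames, channels, height, width).
    No new parameters are introduced: neighboring frames exchange
    information purely by moving existing channels in time.
    """
    b, t, c, h, w = x.shape
    n = int(c * shift_frac)                    # channels shifted per direction
    out = torch.zeros_like(x)
    out[:, 1:, :n] = x[:, :-1, :n]             # first n channels: shift forward in time
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]   # next n channels: shift backward in time
    out[:, :, 2 * n:] = x[:, :, 2 * n:]        # remaining channels: unchanged
    return out

# Example: 2 videos, 8 frames, 64-channel feature maps.
feats = torch.randn(2, 8, 64, 32, 32)
mixed = temporal_shift(feats)
assert mixed.shape == feats.shape              # shape and parameter count unchanged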
Holdings (1 record • Page 1)
Barcode: W9507720
Location: Electronic resources
Circulation category: 11. Online reading_V
Material type: E-book
Call number: EB
Use type: Normal
Loan status: On shelf
Holds: 0