Talking Human Synthesis: Learning Photorealistic Co-Speech Motions and Visual Appearances From Videos.
Record type:
Bibliographic record - Electronic resource : Monograph/item
Title/Author:
Talking Human Synthesis: Learning Photorealistic Co-Speech Motions and Visual Appearances From Videos.
Author:
Zhang, Chenxu.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2023.
Pagination:
115 p.
Notes:
Source: Dissertations Abstracts International, Volume: 85-03, Section: A.
Contained By:
Dissertations Abstracts International, 85-03A.
Subject:
Computer science.
Electronic resource:
https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30740491
ISBN:
9798380342445
Zhang, Chenxu.
Talking Human Synthesis: Learning Photorealistic Co-Speech Motions and Visual Appearances From Videos.
- Ann Arbor : ProQuest Dissertations & Theses, 2023 - 115 p.
Source: Dissertations Abstracts International, Volume: 85-03, Section: A.
Thesis (Ph.D.)--The University of Texas at Dallas, 2023.
This item must not be sold to any third party vendors.
Talking video synthesis is a cutting-edge technology that enables the creation of highly realistic video sequences of individuals speaking. This technology has a wide range of applications in fields such as film-making, advertising, gaming, entertainment, and social media, and it is likely to remain an active area of research in the coming years. However, there are still many open questions and challenges in the field of talking video synthesis. In 3D talking face generation, most existing methods can only generate 3D faces with a static head pose, which is inconsistent with how humans perceive faces. Only a few works focus on head pose generation, and even these ignore person-specific characteristics. In realistic talking face generation, it is still very challenging to generate photo-realistic talking faces that are indistinguishable from real captured videos, which not only contain synchronized lip motions but also have personalized, natural head movements, eye blinks, etc. In full-body speech video synthesis, although substantial progress has been made in audio-driven talking video synthesis, two major difficulties remain: existing works 1) need a long training sequence (>1 h) to synthesize co-speech gestures, which significantly limits their applicability; and 2) usually fail to generate long sequences, or can only generate long sequences without enough diversity.
To address those limitations, my research is developed in a progressive manner, focusing on three main aspects. Firstly, we delve into the generation of personalized head poses for 3D talking faces. Secondly, for realistic 2D talking faces, we propose a generation method that takes an audio signal as input and a short target video clip as a reference to synthesize a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are synchronized with the input audio signal. Lastly, we propose a data-efficient ReMix learning method, which can be trained on monocular "in-the-wild" short videos to synthesize photo-realistic talking videos with full-body gestures.
To generate personalized head poses for 3D talking faces, we propose a unified audio-driven approach to endow 3D talking faces with personalized pose dynamics. To achieve this goal, we establish an original person-specific dataset, providing corresponding head poses and face shapes for each video. To model implicit face attributes with input audio, we propose a FACe Implicit Attribute Learning Generative Adversarial Network (FACIAL-GAN), which integrates phonetics-aware, context-aware, and identity-aware information to synthesize 3D face animation with realistic motions of lips, head poses, and eye blinks. Finally, we make an audio-pose remixed latent space assumption to encourage unpaired audio and pose combinations, which results in diverse "one-to-many" mappings in pose generation. We also develop a dual-function inference scheme that regularizes both the start pose and the general appearance of the next sequence, enhancing long-term video generation with full continuity and diversity.
Experimental results indicate that our methods can generate 1) person-specific head pose sequences that are in sync with the input audio and that best match the human experience of talking heads, 2) realistic talking face videos with not only synchronized lip motions but also natural head movements and eye blinks, and 3) realistic, synchronized full-body talking videos with high training-data efficiency and better quality than state-of-the-art methods.
ISBN: 9798380342445
Subjects--Topical Terms: Computer science.
Subjects--Index Terms: Audio-driven generation
LDR
:04884nmm a2200385 4500
001
2394029
005
20240414211952.5
006
m o d
007
cr#unu||||||||
008
251215s2023 ||||||||||||||||| ||eng d
020
$a
9798380342445
035
$a
(MiAaPQ)AAI30740491
035
$a
(MiAaPQ)0382vireo2198Zhang
035
$a
AAI30740491
040
$a
MiAaPQ
$c
MiAaPQ
100
1
$a
Zhang, Chenxu.
$3
3697265
245
1 0
$a
Talking Human Synthesis: Learning Photorealistic Co-Speech Motions and Visual Appearances From Videos.
260
1
$a
Ann Arbor :
$b
ProQuest Dissertations & Theses,
$c
2023
300
$a
115 p.
500
$a
Source: Dissertations Abstracts International, Volume: 85-03, Section: A.
500
$a
Advisor: Guo, Xiaohu; Bhatia, Dinesh.
502
$a
Thesis (Ph.D.)--The University of Texas at Dallas, 2023.
506
$a
This item must not be sold to any third party vendors.
520
$a
Talking video synthesis is a cutting-edge technology that enables the creation of highly realistic video sequences of individuals speaking. This technology has a wide range of applications in fields such as film-making, advertising, gaming, entertainment, and social media, and it is likely to remain an active area of research in the coming years. However, there are still many open questions and challenges in the field of talking video synthesis. In 3D talking face generation, most existing methods can only generate 3D faces with a static head pose, which is inconsistent with how humans perceive faces. Only a few works focus on head pose generation, and even these ignore person-specific characteristics. In realistic talking face generation, it is still very challenging to generate photo-realistic talking faces that are indistinguishable from real captured videos, which not only contain synchronized lip motions but also have personalized, natural head movements, eye blinks, etc. In full-body speech video synthesis, although substantial progress has been made in audio-driven talking video synthesis, two major difficulties remain: existing works 1) need a long training sequence (>1 h) to synthesize co-speech gestures, which significantly limits their applicability; and 2) usually fail to generate long sequences, or can only generate long sequences without enough diversity.
To address those limitations, my research is developed in a progressive manner, focusing on three main aspects. Firstly, we delve into the generation of personalized head poses for 3D talking faces. Secondly, for realistic 2D talking faces, we propose a generation method that takes an audio signal as input and a short target video clip as a reference to synthesize a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are synchronized with the input audio signal. Lastly, we propose a data-efficient ReMix learning method, which can be trained on monocular "in-the-wild" short videos to synthesize photo-realistic talking videos with full-body gestures.
To generate personalized head poses for 3D talking faces, we propose a unified audio-driven approach to endow 3D talking faces with personalized pose dynamics. To achieve this goal, we establish an original person-specific dataset, providing corresponding head poses and face shapes for each video. To model implicit face attributes with input audio, we propose a FACe Implicit Attribute Learning Generative Adversarial Network (FACIAL-GAN), which integrates phonetics-aware, context-aware, and identity-aware information to synthesize 3D face animation with realistic motions of lips, head poses, and eye blinks. Finally, we make an audio-pose remixed latent space assumption to encourage unpaired audio and pose combinations, which results in diverse "one-to-many" mappings in pose generation. We also develop a dual-function inference scheme that regularizes both the start pose and the general appearance of the next sequence, enhancing long-term video generation with full continuity and diversity.
Experimental results indicate that our methods can generate 1) person-specific head pose sequences that are in sync with the input audio and that best match the human experience of talking heads, 2) realistic talking face videos with not only synchronized lip motions but also natural head movements and eye blinks, and 3) realistic, synchronized full-body talking videos with high training-data efficiency and better quality than state-of-the-art methods.
590
$a
School code: 0382.
650
4
$a
Computer science.
$3
523869
650
4
$a
Mass communications.
$3
3422380
650
4
$a
Information technology.
$3
532993
653
$a
Audio-driven generation
653
$a
Cutting-edge technology
653
$a
Visual appearances
690
$a
0984
690
$a
0489
690
$a
0708
710
2
$a
The University of Texas at Dallas.
$b
Computer Science.
$3
1682289
773
0
$t
Dissertations Abstracts International
$g
85-03A.
790
$a
0382
791
$a
Ph.D.
792
$a
2023
793
$a
English
856
4 0
$u
https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30740491
Holdings (1 record)
Barcode: W9502349
Location: Electronic Resources
Circulation category: 11. Online reading
Material type: E-book
Call number: EB
Use type: Normal
Loan status: On shelf
Holds: 0