Matching and Segmentation for Multimedia Data.
Record Type:
Electronic resources : Monograph/item
Title/Author:
Matching and Segmentation for Multimedia Data. / Li, Hui.
Author:
Li, Hui.
Description:
1 online resource (131 pages)
Notes:
Source: Dissertations Abstracts International, Volume: 84-08, Section: B.
Contained By:
Dissertations Abstracts International, 84-08B.
Subject:
Annotations.
Online resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30227359 (click for full text, PQDT)
ISBN:
9798371947512
Matching and Segmentation for Multimedia Data.
LDR :05144nmm a2200337K 4500
001 2362652
005 20231102122749.5
006 m o d
007 cr mn ---uuuuu
008 241011s2022 xx obm 000 0 eng d
020 $a 9798371947512
035 $a (MiAaPQ)AAI30227359
035 $a (MiAaPQ)Liverpool_3165729
035 $a AAI30227359
040 $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1 $a Li, Hui. $3 1005551
245 10 $a Matching and Segmentation for Multimedia Data.
264 0 $c 2022
300 $a 1 online resource (131 pages)
336 $a text $b txt $2 rdacontent
337 $a computer $b c $2 rdamedia
338 $a online resource $b cr $2 rdacarrier
500 $a Source: Dissertations Abstracts International, Volume: 84-08, Section: B.
500 $a Advisor: Xiao, Jimin; Lim, Eng Gee.
502 $a Thesis (Ph.D.)--The University of Liverpool (United Kingdom), 2022.
504 $a Includes bibliographical references.
520 $a With the development of society, multimedia systems, which handle image/video, audio, and text data comprehensively and simultaneously, draw increasing attention from both industry and academia. This thesis focuses on multi-modality data understanding, combining the two subjects of Computer Vision (CV) and Natural Language Processing (NLP). Such systems serve many real-world scenarios, including criminal searches driven by witness descriptions, robotic navigation by language instruction in smart industry, terrorist tracking, and missing person identification. However, multi-modality systems still face challenges that limit their performance in real-life situations, including the domain gap between the vision and language modalities and the need for high-quality datasets. To analyze and address these challenges, this thesis focuses on two fundamental tasks: matching and segmentation.

Image-Text Matching (ITM) aims to retrieve the texts (images) that describe the most relevant content for a given image (text) query. Because of the semantic gap between the linguistic and visual domains, aligning and comparing feature representations for language and images remains challenging. To overcome this limitation, we propose a new framework for the image-text matching task that uses an auxiliary captioning step to enhance the image feature: the image feature is fused with the text feature of the captioning output. As a downstream application of ITM, language-person search, where language descriptions are used to retrieve person images, also suffers from the domain gap between linguistic and visual data. To handle this problem, we propose a transformer-based language-person search framework in which matching is conducted between words and image regions for better image-text interaction. However, collecting large amounts of training data through human annotation is neither cheap nor reliable. We therefore further study the one-shot person re-identification (Re-ID) task, which matches people given a single labeled reference image per person, whereas previous methods require large numbers of ground-truth labels. We propose progressive sample mining and representation learning to better exploit the limited labels in the one-shot Re-ID task.

Referring Expression Segmentation (RES) aims to localize and segment the target according to a given language expression. Existing methods treat the localization and segmentation steps jointly, relying on fused visual and linguistic features for both. We argue that the conflict between finding the object and generating the mask limits RES performance. To solve this problem, we propose a parallel position-kernel-segmentation pipeline that first isolates and then lets the localization and segmentation steps interact, so that linguistic information does not directly contaminate the visual features used for segmentation. Specifically, the localization step localizes the target object in the image based on the referring expression, and the visual kernel obtained from that step then guides the segmentation step. This pipeline also enables us to train RES in a weakly supervised way, where pixel-level segmentation labels are replaced by click annotations on center and corner points: the position head is trained with full supervision from the click annotations, and the segmentation head is trained with weakly supervised segmentation losses.

This thesis thus targets key limitations of multimedia systems, and the experiments show that the proposed frameworks are effective for the respective tasks. The experiments are straightforward to reproduce from the reported details, and source code is provided for future work on these tasks.
533 $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2023
538 $a Mode of access: World Wide Web
650 4 $a Annotations. $3 3561780
650 4 $a Multimedia. $2 gtt $3 1000510
650 4 $a Visualization. $3 586179
650 4 $a Computer science. $3 523869
650 4 $a Multimedia communications. $3 590562
655 7 $a Electronic books. $2 lcsh $3 542853
690 $a 0984
690 $a 0558
710 2 $a ProQuest Information and Learning Co. $3 783688
710 2 $a The University of Liverpool (United Kingdom). $3 1684840
773 0 $t Dissertations Abstracts International $g 84-08B.
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30227359 $z click for full text (PQDT)
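The auxiliary-captioning idea in the 520 abstract above, fusing the image feature with the text feature of the generated caption before matching, can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the thesis implementation: the module names, feature dimensions, and the simple concatenate-and-project fusion are assumptions for exposition.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionFusedITM(nn.Module):
    """Image-text matching with a caption-enhanced image feature (illustrative)."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)    # image encoder output -> joint space
        self.txt_proj = nn.Linear(txt_dim, embed_dim)    # text encoder output -> joint space
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)  # fuse image and caption features

    def forward(self, img_feat, caption_feat, query_feat):
        # img_feat:     (B, img_dim)  global image feature
        # caption_feat: (B, txt_dim)  encoding of the auxiliary caption generated from the image
        # query_feat:   (B, txt_dim)  encoding of the query text
        v = self.img_proj(img_feat)
        c = self.txt_proj(caption_feat)
        t = self.txt_proj(query_feat)
        v = self.fuse(torch.cat([v, c], dim=-1))         # caption-enhanced image feature
        v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
        return v @ t.t()                                 # (B, B) cosine similarity matrix

Training would then apply a standard contrastive or triplet ranking loss to this similarity matrix, with matched image-text pairs on the diagonal.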
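The parallel position-kernel-segmentation pipeline described in the abstract keeps linguistic information out of the segmentation features and injects it only through a kernel predicted by the localization step. Below is a minimal sketch of such kernel-guided segmentation, again with assumed shapes and a 1x1 dynamic kernel; it illustrates the mechanism rather than reproducing the thesis design.

import torch
import torch.nn.functional as F

def kernel_guided_segmentation(vis_feat, loc_feat, centers):
    # vis_feat: (B, C, H, W) purely visual features used by the segmentation step
    # loc_feat: (B, C, H, W) language-conditioned features from the localization step
    # centers:  (B, 2) predicted (y, x) position of the referred object
    B, C, H, W = vis_feat.shape
    masks = []
    for b in range(B):
        y, x = centers[b]
        kernel = loc_feat[b, :, y, x].view(1, C, 1, 1)     # dynamic kernel at the target position
        masks.append(F.conv2d(vis_feat[b:b + 1], kernel))  # (1, 1, H, W) mask logits
    return torch.cat(masks, dim=0).sigmoid()

# Example with random tensors:
# mask = kernel_guided_segmentation(torch.randn(2, 64, 32, 32),
#                                   torch.randn(2, 64, 32, 32),
#                                   torch.tensor([[5, 7], [10, 3]]))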
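The abstract also states that the position head is fully supervised by click annotations while the segmentation head is trained with weakly supervised losses. The abstract does not spell those losses out, so the sketch below is a hypothetical combination: an L1 loss on the clicked points, plus weak terms that suppress the mask outside the corner-defined box and encourage it at the center click.

import torch
import torch.nn.functional as F

def click_supervised_losses(pred_points, gt_clicks, mask_logits, boxes):
    # pred_points: (B, 3, 2) predicted center and two corner points
    # gt_clicks:   (B, 3, 2) annotated clicks (center, top-left, bottom-right)
    # mask_logits: (B, 1, H, W) segmentation head output
    # boxes:       (B, 4) x0, y0, x1, y1 box derived from the corner clicks
    pos_loss = F.l1_loss(pred_points, gt_clicks)        # fully supervised position head
    B, _, H, W = mask_logits.shape
    prob = mask_logits.sigmoid()
    weak_loss = mask_logits.new_zeros(())
    for b in range(B):
        x0, y0, x1, y1 = boxes[b].long()
        inside = prob[b, :, y0:y1, x0:x1].sum()
        weak_loss = weak_loss + (prob[b].sum() - inside) / (H * W)    # mask off outside the box
        cy, cx = (y0 + y1) // 2, (x0 + x1) // 2
        weak_loss = weak_loss - torch.log(prob[b, 0, cy, cx] + 1e-6)  # mask on at the center
    return pos_loss + weak_loss / B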
Items: 1 record
Inventory Number: W9485008
Location Name: Electronic Resources (電子資源)
Item Class: 11. Online Reading_V
Material Type: E-book
Call Number: EB
Usage Class: General Use (Normal)
Loan Status: On shelf
No. of Reservations: 0