Multimodal Object Representation Learning in Haptic, Auditory, and Visual Domains.
Record type:
Bibliographic record (electronic resource) : Monograph/item
Title/Author:
Multimodal Object Representation Learning in Haptic, Auditory, and Visual Domains.
Author:
Heravi, Negin.
Extent:
1 online resource (117 pages)
Notes:
Source: Dissertations Abstracts International, Volume: 84-05, Section: A.
Contained By:
Dissertations Abstracts International, 84-05A.
Subject:
Augmented reality.
Electronic resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=29755715 (click for full text, PQDT)
ISBN:
9798357500373
Heravi, Negin.
Multimodal Object Representation Learning in Haptic, Auditory, and Visual Domains. - 1 online resource (117 pages)
Source: Dissertations Abstracts International, Volume: 84-05, Section: A.
Thesis (Ph.D.)--Stanford University, 2022.
Includes bibliographical references.
Humans frequently use all their senses to understand and interact with their environments. Our multi-modal mental priors of how objects and materials respond to physical interactions enable us to succeed in many of our everyday tasks. For example, to find a glass in the back of a dark, cluttered cabinet, we rely heavily on our senses of touch and hearing as well as our prior knowledge of how a glass feels and sounds. This observation about human behavior motivates us to develop effective ways of modeling the multi-modal signals of vision, haptics, and audio. Such models have applications in robotics as well as in Augmented and Virtual Reality (AR/VR).

Similar to humans, robots can benefit from the capability to infer and use multi-modal signals of vision, haptics, and audio in their manual tasks. For example, they too can take advantage of haptic and auditory signals where their visual perception fails in cluttered, dark environments, such as inside a kitchen cabinet, or during contact-rich manipulation tasks such as key insertion.

Given that our real-life experiences are multi-modal, effective AR/VR environments should be multi-modal as well. With the commercialization of several AR/VR devices over the past few decades, a variety of applications in areas such as e-commerce, gaming, education, and medicine have emerged. However, current AR/VR environments lack rich multi-modal sensory responses, which reduces their realism.

For a model to efficiently render appropriate multi-modal signals in response to user interactions, or for a robot to use high-dimensional sensory observations in a meaningful way, this data needs to be encoded in low-dimensional representations. This motivates us to develop effective ways of learning representations of these different modalities, which is a challenging goal. From a modeling standpoint, the visual cues of an object and its haptic and auditory feedback are heterogeneous, requiring domain-specific knowledge to design the appropriate perceptual modules for each. Furthermore, these representations should ideally be either task-agnostic or easily generalizable to new tasks and scenarios, since collecting a new dataset per task or object is expensive and impossible to scale. This motivates us to explore physically interpretable and object-aware representations. In this dissertation, we demonstrate how object-aware, learning-based representations can be used to learn appropriate representations in different modalities.

In the first part, we focus on the modality of touch and use deep-learning-based methods for haptic texture rendering. We present a learned action-conditional model for haptic textures that takes data from a vision-based tactile sensor (GelSight) and a user's action as input. This model predicts an induced acceleration that is used to provide haptic vibration feedback to a user, inducing the sensation of a virtual texture. We show that our model outperforms previous state-of-the-art methods through a quantitative comparison between the predicted and ground-truth signals. We further demonstrate the performance of our model for real-time haptic texture rendering, as well as its generalization to unseen textures, through human user studies.

In the second part of this thesis, we explore processing audio signals. We develop a fully differentiable model for rendering and identification of impact sounds, called DiffImpact.
Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2023.
Mode of access: World Wide Web.
ISBN: 9798357500373
Subjects--Topical Terms: Augmented reality.
Index Terms--Genre/Form: Electronic books.
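The abstract above describes an action-conditional haptic texture model: a tactile image from a vision-based sensor (GelSight) plus a user's action is mapped to a predicted acceleration signal that drives vibrotactile feedback. The short PyTorch sketch below illustrates only that input/output interface under stated assumptions; the class name HapticTextureModel, the choice of scan speed and normal force as the action encoding, and all layer sizes are hypothetical and are not the dissertation's actual architecture.

import torch
import torch.nn as nn

class HapticTextureModel(nn.Module):
    """Toy action-conditional texture model: a GelSight image plus a user
    action (assumed here to be [scan speed, normal force]) is mapped to a
    short window of predicted acceleration for vibrotactile rendering."""

    def __init__(self, action_dim: int = 2, out_samples: int = 100):
        super().__init__()
        # Small CNN encoder for the tactile image (sizes are illustrative only).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (batch, 32)
        )
        # MLP head conditioned on the user's action.
        self.head = nn.Sequential(
            nn.Linear(32 + action_dim, 128), nn.ReLU(),
            nn.Linear(128, out_samples),                    # predicted acceleration window
        )

    def forward(self, gelsight_image, action):
        texture_code = self.image_encoder(gelsight_image)   # low-dimensional texture representation
        return self.head(torch.cat([texture_code, action], dim=-1))

# Example call with dummy data: one 64x64 RGB GelSight frame and an
# action vector of [speed in m/s, normal force in N].
model = HapticTextureModel()
accel = model(torch.randn(1, 3, 64, 64), torch.tensor([[0.05, 1.5]]))
print(accel.shape)  # torch.Size([1, 100])

According to the abstract, a model of this kind is evaluated against ground-truth acceleration signals and through human user studies; the sketch fixes only the interface, not the training procedure or the dissertation's actual network design.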
Multimodal Object Representation Learning in Haptic, Auditory, and Visual Domains.
LDR  04786nmm a2200397K 4500
001  2362352
005  20231027104011.5
006  m o d
007  cr mn ---uuuuu
008  241011s2022 xx obm 000 0 eng d
020    $a 9798357500373
035    $a (MiAaPQ)AAI29755715
035    $a (MiAaPQ)STANFORDsj589ft0971
035    $a AAI29755715
040    $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1  $a Heravi, Negin. $3 3703069
245 10 $a Multimodal Object Representation Learning in Haptic, Auditory, and Visual Domains.
264  0 $c 2022
300    $a 1 online resource (117 pages)
336    $a text $b txt $2 rdacontent
337    $a computer $b c $2 rdamedia
338    $a online resource $b cr $2 rdacarrier
500    $a Source: Dissertations Abstracts International, Volume: 84-05, Section: A.
500    $a Advisor: Bohg, Jeannette; Okamura, Allison.
502    $a Thesis (Ph.D.)--Stanford University, 2022.
504    $a Includes bibliographical references.
520    $a
Humans frequently use all their senses to understand and interact with their environments. Our multi-modal mental priors of how objects and materials respond to physical interactions enable us to succeed in many of our everyday tasks. For example, to find a glass in the back of a dark, cluttered cabinet, we rely heavily on our senses of touch and hearing as well as our prior knowledge of how a glass feels and sounds. This observation about human behavior motivates us to develop effective ways of modeling the multi-modal signals of vision, haptics, and audio. Such models have applications in robotics as well as in Augmented and Virtual Reality (AR/VR). Similar to humans, robots can benefit from the capability to infer and use multi-modal signals of vision, haptics, and audio in their manual tasks. For example, they too can take advantage of haptic and auditory signals where their visual perception fails in cluttered, dark environments, such as inside a kitchen cabinet, or during contact-rich manipulation tasks such as key insertion. Given that our real-life experiences are multi-modal, effective AR/VR environments should be multi-modal as well. With the commercialization of several AR/VR devices over the past few decades, a variety of applications in areas such as e-commerce, gaming, education, and medicine have emerged. However, current AR/VR environments lack rich multi-modal sensory responses, which reduces their realism. For a model to efficiently render appropriate multi-modal signals in response to user interactions, or for a robot to use high-dimensional sensory observations in a meaningful way, this data needs to be encoded in low-dimensional representations. This motivates us to develop effective ways of learning representations of these different modalities, which is a challenging goal. From a modeling standpoint, the visual cues of an object and its haptic and auditory feedback are heterogeneous, requiring domain-specific knowledge to design the appropriate perceptual modules for each. Furthermore, these representations should ideally be either task-agnostic or easily generalizable to new tasks and scenarios, since collecting a new dataset per task or object is expensive and impossible to scale. This motivates us to explore physically interpretable and object-aware representations. In this dissertation, we demonstrate how object-aware, learning-based representations can be used to learn appropriate representations in different modalities. In the first part, we focus on the modality of touch and use deep-learning-based methods for haptic texture rendering. We present a learned action-conditional model for haptic textures that takes data from a vision-based tactile sensor (GelSight) and a user's action as input. This model predicts an induced acceleration that is used to provide haptic vibration feedback to a user, inducing the sensation of a virtual texture. We show that our model outperforms previous state-of-the-art methods through a quantitative comparison between the predicted and ground-truth signals. We further demonstrate the performance of our model for real-time haptic texture rendering, as well as its generalization to unseen textures, through human user studies. In the second part of this thesis, we explore processing audio signals. We develop a fully differentiable model for rendering and identification of impact sounds, called DiffImpact.
533    $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2023
538    $a Mode of access: World Wide Web
650  4 $a Augmented reality. $3 1620831
650  4 $a Physics. $3 516296
650  4 $a Real time. $3 3562675
650  4 $a Neural networks. $3 677449
650  4 $a Sensors. $3 3549539
650  4 $a Robots. $3 529507
650  4 $a Storytelling. $3 535033
650  4 $a Localization. $3 3560711
650  4 $a Research & development--R&D. $3 3554335
650  4 $a Performance evaluation. $3 3562292
650  4 $a Feedback. $3 677181
650  4 $a Virtual reality. $3 527460
650  4 $a Sound. $3 542298
650  4 $a Robotics. $3 519753
650  4 $a Acoustics. $3 879105
650  4 $a Information technology. $3 532993
655  7 $a Electronic books. $2 lcsh $3 542853
690    $a 0771
690    $a 0605
690    $a 0986
690    $a 0800
690    $a 0505
690    $a 0489
690    $a 0338
710 2  $a ProQuest Information and Learning Co. $3 783688
710 2  $a Stanford University. $3 754827
773 0  $t Dissertations Abstracts International $g 84-05A.
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=29755715 $z click for full text (PQDT)
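For readers less familiar with MARC, the following plain-Python sketch shows how the tag / indicator / subfield layout of the record above can be read programmatically. The field values are copied from the record; the parse_field helper, its fixed-column input format, and the literal "$" subfield markers (as rendered in this display, rather than real ISO 2709 delimiters) are assumptions made for illustration only.

from dataclasses import dataclass, field

@dataclass
class MarcField:
    tag: str                                        # e.g. "245"
    indicators: str = "  "                          # two indicator characters
    subfields: dict = field(default_factory=dict)   # subfield code -> value

def parse_field(line: str) -> MarcField:
    """Parse one line of the reconstructed display, e.g.
    '245 10 $a Multimodal Object Representation Learning ...'."""
    tag, indicators, rest = line[:3], line[4:6], line[7:]
    subs = {}
    for chunk in rest.split("$")[1:]:               # each chunk starts with its subfield code
        code, _, value = chunk.partition(" ")
        subs[code] = value.strip()
    return MarcField(tag, indicators, subs)

record = [
    parse_field("100 1  $a Heravi, Negin. $3 3703069"),
    parse_field("245 10 $a Multimodal Object Representation Learning in Haptic, Auditory, and Visual Domains."),
    parse_field("856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=29755715 $z click for full text (PQDT)"),
]

# Pull the title proper (245 $a) and the full-text link (856 $u).
title = next(f.subfields["a"] for f in record if f.tag == "245")
link = next(f.subfields["u"] for f in record if f.tag == "856")
print(title)
print(link)

A production system would work from the record's binary or MARCXML form with an established MARC library rather than from this textual display.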
Holdings (1 item):
Barcode: W9484708
Location: Electronic resources
Circulation category: 11. Online reading_V
Material type: E-book
Call number: EB
Use type: General use (Normal)
Loan status: On shelf
Holds: 0