Dong Hwa University Library

Robust Learning Architectures for Perceiving Object Semantics and Geometry.

Record type:	Bibliographic - Electronic resource : Monograph/item
Title proper/Author:	Robust Learning Architectures for Perceiving Object Semantics and Geometry.
Author:	Li, Chi.
Publisher:	Ann Arbor : ProQuest Dissertations & Theses, 2018.
Extent:	255 p.
Note:	Source: Dissertations Abstracts International, Volume: 80-10, Section: B.
Contained By:	Dissertations Abstracts International 80-10B.
Subject:	Artificial intelligence.
Electronic resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=13890201
ISBN:	9781392069028

Robust Learning Architectures for Perceiving Object Semantics and Geometry. - Ann Arbor : ProQuest Dissertations & Theses, 2018 - 255 p.

Thesis (Ph.D.)--The Johns Hopkins University, 2018.

This item must not be sold to any third party vendors.

Parsing object semantics and geometry in a scene is a core task in visual understanding. This includes classifying object identity and category, localizing and segmenting an object from a cluttered background, estimating object orientation, and parsing 3D shape structures. With the emergence of deep convolutional architectures in recent years, substantial progress has been made towards learning scalable image representations for large-scale vision problems such as image classification. However, there still remain some fundamental challenges in learning robust object representations. First, creating object representations that are robust to changes in viewpoint while capturing local visual details continues to be a problem. In particular, recent convolutional architectures employ spatial pooling to achieve scale and shift invariances, but they are still sensitive to out-of-plane rotations. Second, deep Convolutional Neural Networks (CNNs) are purely driven by data and predominantly pose the scene interpretation problem as an end-to-end black-box mapping. However, decades of work on perceptual organization in both human and machine vision suggest that there are often intermediate representations that are intrinsic to an inference task, and which provide essential structure to improve generalization. In this dissertation, we present two methodologies to surmount these two issues. We first introduce a multi-domain pooling framework which groups local visual signals within generic feature spaces that are invariant to 3D object transformations, thereby reducing the sensitivity of the output features to spatial deformations. We formulate a probabilistic analysis of pooling which further supports the multi-domain pooling principle. In addition, this principle guides us in designing convolutional architectures which achieve state-of-the-art performance on instance classification and semantic segmentation.
We also present a multi-view fusion algorithm which efficiently computes multi-domain pooling features on incrementally reconstructed scenes and aggregates semantic confidence to boost long-term performance for semantic segmentation. Next, we explore an approach for injecting prior domain structure into neural network training, which leads a CNN to recover a sequence of intermediate milestones towards the final goal. Our approach supervises hidden layers of a CNN with intermediate concepts that normally are not observed in practice. We formulate a probabilistic framework which formalizes these notions and predicts improved generalization via this deep supervision method. One advantage of this approach is that we are able to generalize the model trained from synthetic CAD renderings of cluttered scenes, where concept values can be extracted, to the real image domain. We implement this deep supervision framework with a novel CNN architecture which is trained on synthetic images only and achieves state-of-the-art performance on 2D/3D keypoint localization on real image benchmarks. Finally, the proposed deep supervision scheme also motivates an approach for accurately inferring six Degree-of-Freedom (6-DoF) pose for a large number of object classes from single or multiple views. To learn discriminative pose features, we integrate three new capabilities into a deep CNN: an inference scheme that combines both classification and pose regression based on a uniform tessellation of SE(3), fusion of a class prior into the training process via a tiled class map, and an additional regularization using deep supervision with an object mask. Further, an efficient multi-view framework is formulated to address single-view ambiguity. We show that the proposed multi-view scheme consistently improves the performance of the single-view network. Our approach achieves competitive or superior performance compared with current state-of-the-art methods on three large-scale benchmarks.

ISBN: 9781392069028

Subjects--Topical Terms:

516317
Artificial intelligence.

Robust Learning Architectures for Perceiving Object Semantics and Geometry.
LDR:05099nmm a2200337 4500
001 2209582
005 20191105130513.5
008 201008s2018 ||||||||||||||||| ||eng d
020 $a 9781392069028
035 $a (MiAaPQ)AAI13890201
035 $a (MiAaPQ)0098vireo:3725Li
035 $a AAI13890201
040 $a MiAaPQ $c MiAaPQ
100 1 $a Li, Chi. $3 3436675
245 1 0 $a Robust Learning Architectures for Perceiving Object Semantics and Geometry.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2018
300 $a 255 p.
500 $a Source: Dissertations Abstracts International, Volume: 80-10, Section: B.
500 $a Publisher info.: Dissertation/Thesis.
500 $a Advisor: Hager, Gregory Donald.
502 $a Thesis (Ph.D.)--The Johns Hopkins University, 2018.
506 $a This item must not be sold to any third party vendors.
506 $a This item must not be added to any third party search indexes.
520 $a Parsing object semantics and geometry in a scene is a core task in visual understanding. This includes classifying object identity and category, localizing and segmenting an object from a cluttered background, estimating object orientation, and parsing 3D shape structures. With the emergence of deep convolutional architectures in recent years, substantial progress has been made towards learning scalable image representations for large-scale vision problems such as image classification. However, there still remain some fundamental challenges in learning robust object representations. First, creating object representations that are robust to changes in viewpoint while capturing local visual details continues to be a problem. In particular, recent convolutional architectures employ spatial pooling to achieve scale and shift invariances, but they are still sensitive to out-of-plane rotations. Second, deep Convolutional Neural Networks (CNNs) are purely driven by data and predominantly pose the scene interpretation problem as an end-to-end black-box mapping. However, decades of work on perceptual organization in both human and machine vision suggest that there are often intermediate representations that are intrinsic to an inference task, and which provide essential structure to improve generalization. In this dissertation, we present two methodologies to surmount these two issues. We first introduce a multi-domain pooling framework which groups local visual signals within generic feature spaces that are invariant to 3D object transformations, thereby reducing the sensitivity of the output features to spatial deformations. We formulate a probabilistic analysis of pooling which further supports the multi-domain pooling principle. In addition, this principle guides us in designing convolutional architectures which achieve state-of-the-art performance on instance classification and semantic segmentation.
We also present a multi-view fusion algorithm which efficiently computes multi-domain pooling features on incrementally reconstructed scenes and aggregates semantic confidence to boost long-term performance for semantic segmentation. Next, we explore an approach for injecting prior domain structure into neural network training, which leads a CNN to recover a sequence of intermediate milestones towards the final goal. Our approach supervises hidden layers of a CNN with intermediate concepts that normally are not observed in practice. We formulate a probabilistic framework which formalizes these notions and predicts improved generalization via this deep supervision method. One advantage of this approach is that we are able to generalize the model trained from synthetic CAD renderings of cluttered scenes, where concept values can be extracted, to the real image domain. We implement this deep supervision framework with a novel CNN architecture which is trained on synthetic images only and achieves state-of-the-art performance on 2D/3D keypoint localization on real image benchmarks. Finally, the proposed deep supervision scheme also motivates an approach for accurately inferring six Degree-of-Freedom (6-DoF) pose for a large number of object classes from single or multiple views. To learn discriminative pose features, we integrate three new capabilities into a deep CNN: an inference scheme that combines both classification and pose regression based on a uniform tessellation of SE(3), fusion of a class prior into the training process via a tiled class map, and an additional regularization using deep supervision with an object mask. Further, an efficient multi-view framework is formulated to address single-view ambiguity. We show that the proposed multi-view scheme consistently improves the performance of the single-view network. Our approach achieves competitive or superior performance compared with current state-of-the-art methods on three large-scale benchmarks.
590 $a School code: 0098.
650 4 $a Artificial intelligence. $3 516317
650 4 $a Computer science. $3 523869
690 $a 0800
690 $a 0984
710 2 $a The Johns Hopkins University. $b Computer Science. $3 2094966
773 0 $t Dissertations Abstracts International $g 80-10B.
790 $a 0098
791 $a Ph.D.
792 $a 2018
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=13890201