National Dong Hwa University Library


A tale of two paradigms: Disambiguating extracted entities with applications to a digital library and the Web.

Record type:	Bibliographic - language material, printed : Monograph/item
Title/Author:	A tale of two paradigms: Disambiguating extracted entities with applications to a digital library and the Web./
Author:	Huang, Jian.
Extent:	133 p.
Note:	Source: Dissertation Abstracts International, Volume: 71-09, Section: A, page: 3082.
Contained By:	Dissertation Abstracts International 71-09A.
Subject:	Library Science.
Electronic resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3420152
ISBN:	9781124166872

Thesis (Ph.D.)--The Pennsylvania State University, 2010.

With the increasing wealth of information on the Web, information integration is ubiquitous as the same real-world entity may appear in a variety of forms extracted from different sources. This dissertation proposes supervised and unsupervised algorithms that are naturally integrated in a scalable framework to solve the entity resolution problem, which lies at the heart of the information integration process.

ISBN: 9781124166872

Subjects--Topical Terms:
881164 Library Science.

A tale of two paradigms: Disambiguating extracted entities with applications to a digital library and the Web.
LDR:03976nam 2200325 4500
001 1400471
005 20111010080559.5
008 130515s2010 ||||||||||||||||| ||eng d
020 $a 9781124166872
035 $a (UMI)AAI3420152
035 $a AAI3420152
040 $a UMI $c UMI
100 1 $a Huang, Jian. $3 1271330
245 1 2 $a A tale of two paradigms: Disambiguating extracted entities with applications to a digital library and the Web.
300 $a 133 p.
500 $a Source: Dissertation Abstracts International, Volume: 71-09, Section: A, page: 3082.
500 $a Adviser: C. Lee Giles.
502 $a Thesis (Ph.D.)--The Pennsylvania State University, 2010.
520 $a With the increasing wealth of information on the Web, information integration is ubiquitous as the same real-world entity may appear in a variety of forms extracted from different sources. This dissertation proposes supervised and unsupervised algorithms that are naturally integrated in a scalable framework to solve the entity resolution problem, which lies at the heart of the information integration process.
520 $a This dissertation focuses on two incarnations of the entity resolution problem that arise in the data mining and natural language processing areas. First, name disambiguation is needed when one seeks the list of publications of an author in a digital library, where the author has used different name variations and multiple other authors share the same name. We present an efficient integrative framework that disambiguates the author metadata extracted from paper headers in a divide-and-conquer fashion: a blocking method retrieves candidate classes of authors with similar names, and a density-based clustering method, DBSCAN, clusters the records by author. The distance metric between papers used for clustering is calculated by LASVM, an online active selection Support Vector Machine algorithm. We prove that by recasting transitivity as density connectivity in DBSCAN, transitivity is guaranteed for core points. The method achieves high accuracy on a manually labeled dataset and readily disambiguates about a million author metadata records in CiteSeer, which paves the way for the fielded search by author name feature in CiteSeerX. Second, as a key step towards document understanding in natural language processing, we investigate the problem of cross document coreference (CDC), which aims to decipher the true reference of a named entity across the boundary of documents. This dissertation presents a novel cross document coreference approach that leverages the profiles of entities, which are constructed by information extraction tools and reconciled using a within-document coreference module. We propose to match the profiles by using a learned ensemble distance function composed of a suite of similarity specialists. We develop a kernelized soft relational clustering algorithm that makes use of the learned distance function to partition the entities into fuzzy sets of identities. Evaluation on a large benchmark collection shows that the proposed methods achieve competitive coreference results. We further discuss the details of the implementation of the CDC and web person search system.
520 $a This dissertation surveys the literature on author name disambiguation in citations and paper headers, citation matching and cross document coreference. Additionally, we explore the social networks of the disambiguated authors, performing a comprehensive study of the network and community level characteristics and proposing a stochastic model to predict collaborations of individuals.
590 $a School code: 0176.
650 4 $a Library Science. $3 881164
650 4 $a Web Studies. $3 1026830
650 4 $a Information Science. $3 1017528
650 4 $a Computer Science. $3 626642
690 $a 0399
690 $a 0646
690 $a 0723
690 $a 0984
710 2 $a The Pennsylvania State University. $3 699896
773 0 $t Dissertation Abstracts International $g 71-09A.
790 1 0 $a Giles, C. Lee, $e advisor
790 $a 0176
791 $a Ph.D.
792 $a 2010
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3420152
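
Illustrative note (not part of the catalog record): the abstract describes a divide-and-conquer pipeline in which a blocking step groups records with similar author names and DBSCAN then clusters each block by author. The following is only a minimal sketch of that general idea, not the dissertation's implementation. The blocking key (`block_key`), the record fields, and the toy records are hypothetical, and a plain Jaccard distance over coauthor sets stands in for the distance that the dissertation learns with LASVM.

```python
from collections import defaultdict


def block_key(name):
    """Hypothetical blocking key: (last name, first initial), lowercased."""
    parts = name.replace(".", " ").split()
    return (parts[-1].lower(), parts[0][0].lower())


def jaccard_distance(a, b):
    """Stand-in distance between two coauthor sets (NOT the learned LASVM metric)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def dbscan(n, dist, eps, min_pts):
    """Plain DBSCAN over items 0..n-1; returns cluster labels, -1 marks noise."""
    labels = [None] * n

    def neighbors(i):
        # The point itself is included, since dist(i, i) is 0.
        return [j for j in range(n) if dist(i, j) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(j_nbrs)
    return labels


def disambiguate(records, eps=0.5, min_pts=2):
    """Block records by name key, then cluster each block independently."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec["author"])].append(idx)
    result = {}
    for key, members in blocks.items():
        feats = [set(records[i]["coauthors"]) for i in members]
        labels = dbscan(
            len(members),
            lambda i, j: jaccard_distance(feats[i], feats[j]),
            eps,
            min_pts,
        )
        for i, label in zip(members, labels):
            result[i] = (key, label)
    return result


# Made-up toy records, for illustration only.
records = [
    {"author": "J. Huang", "coauthors": ["Giles", "Ertekin"]},
    {"author": "Jian Huang", "coauthors": ["Giles", "Ertekin", "Song"]},
    {"author": "J. Huang", "coauthors": ["Smith", "Lee"]},
    {"author": "Jing Huang", "coauthors": ["Smith", "Lee", "Kim"]},
    {"author": "C. Lee Giles", "coauthors": ["Huang"]},
]
result = disambiguate(records, eps=0.5, min_pts=2)
```

In this toy run all four "Huang" records fall into one block, and the clustering separates them into two groups by shared coauthors; the single "Giles" record has no neighbors and is labeled as noise. Blocking keeps the pairwise distance computations confined to each block, which is what makes the divide-and-conquer approach attractive at scale.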