National Dong Hwa University Library (東華大學圖書館)


Efficient representation and matching of texts and images in scanned book collections.

Record Type:	Electronic resources : Monograph/item
Title/Author:	Efficient representation and matching of texts and images in scanned book collections / Yalniz, Ismet Zeki.
Author:	Yalniz, Ismet Zeki.
Description:	210 p.
Notes:	Source: Dissertation Abstracts International, Volume: 75-07(E), Section: B.
Contained By:	Dissertation Abstracts International 75-07B(E).
Subject:	Computer Science.
Online resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3615463
ISBN:	9781303814631

Thesis (Ph.D.)--University of Massachusetts Amherst, 2014.

Millions of books from public libraries and private collections have been scanned by various organizations in the last decade. The motivation is to preserve the written human heritage in electronic format for durable storage and efficient access. The information buried in these large book collections has always been of major interest for scholars from various disciplines. Several interesting research problems can be defined over large collections of scanned books given their corresponding optical character recognition (OCR) outputs. At the highest level, one can view the entire collection as a whole and discover interesting contextual relationships or linkages between the books. A more traditional approach is to consider each scanned book separately and perform information search and mining at the book level. Here we also show that one can view each book as a whole composed of chapters, sections, paragraphs, sentences, words or even characters positioned in a particular sequential order sharing the same global context. The information inherent in the entire context of the book is referred to as "global information" and it is demonstrated by addressing a number of research questions defined for scanned book collections.

ISBN: 9781303814631

Subjects--Topical Terms:

626642
Computer Science.

Efficient representation and matching of texts and images in scanned book collections.
LDR:05395nmm a2200337 4500
001 2055196
005 20141112080403.5
008 170521s2014 ||||||||||||||||| ||eng d
020 $a 9781303814631
035 $a (MiAaPQ)AAI3615463
035 $a AAI3615463
040 $a MiAaPQ $c MiAaPQ
100 1 $a Yalniz, Ismet Zeki. $3 3168827
245 1 0 $a Efficient representation and matching of texts and images in scanned book collections.
300 $a 210 p.
500 $a Source: Dissertation Abstracts International, Volume: 75-07(E), Section: B.
500 $a Adviser: R. Manmatha.
502 $a Thesis (Ph.D.)--University of Massachusetts Amherst, 2014.
520 $a Millions of books from public libraries and private collections have been scanned by various organizations in the last decade. The motivation is to preserve the written human heritage in electronic format for durable storage and efficient access. The information buried in these large book collections has always been of major interest for scholars from various disciplines. Several interesting research problems can be defined over large collections of scanned books given their corresponding optical character recognition (OCR) outputs. At the highest level, one can view the entire collection as a whole and discover interesting contextual relationships or linkages between the books. A more traditional approach is to consider each scanned book separately and perform information search and mining at the book level. Here we also show that one can view each book as a whole composed of chapters, sections, paragraphs, sentences, words or even characters positioned in a particular sequential order sharing the same global context. The information inherent in the entire context of the book is referred to as "global information" and it is demonstrated by addressing a number of research questions defined for scanned book collections.
520 $a The global sequence information is one of the different types of global information available in textual documents. It is useful for discovering content overlap and similarity across books. Each book has a specific flow of ideas and events which distinguishes it from other books. If this global order is changed, then the flow of events and consequently the story changes completely. This argument is true across document translations as well. Although the local order of words in a sentence might not be preserved after translation, sentences, paragraphs, sections and chapters are likely to follow the same global order. Otherwise the two texts are not considered to be translations of each other.
520 $a A global sequence alignment approach is therefore proposed to discover the contextual similarity between books. The problem is that conventional sequence alignment algorithms are slow and not robust for book-length documents, especially in the presence of OCR errors and additional or missing content. Here we propose a general framework which can be used to efficiently align and compare the textual content of books at various coarseness levels and even across languages. In a nutshell, the framework uses the sequence of words which appear only once in the entire book (referred to as "the sequence of unique words") to represent the text. This representation is compact, and it is highly descriptive of the content while retaining the global word sequence information. It is shown to be more accurate than the state of the art at efficiently i) detecting which books are partial duplicates in large scanned book collections (DUPNIQ), and ii) finding which books are translations of each other without explicitly translating the entire texts using statistical machine translation approaches (TRANSNIQ).
520 $a Using the global order of unique words and their corresponding positions in the text, one can also generate the complete text alignment efficiently using a recursive approach. The Recursive Text Alignment Scheme (RETAS) is several orders of magnitude faster than the conventional sequence alignment approaches for long texts and it is later used for iii) the automatic evaluation of OCR accuracy of books given the OCR outputs and the corresponding electronic versions, iv) mapping the corresponding portions of the two books which are known to be partial duplicates, and finally it is generalized for v) aligning long noisy texts across languages (Recursive Translation Alignment - RTA).
520 $a Another example of global information is that books are mostly printed in a single global font type. Here we demonstrate that the global font feature, along with the letter sequence information, can be used for facilitating and/or improving text search in noisy page images. There are two contributions in this area: vi) an efficient word spotting framework for searching text in noisy document images, and vii) a state-of-the-art dependence model approach to resolve arbitrary text queries using visual features. The effectiveness of these approaches is demonstrated for books printed in different scripts for which there is no OCR engine available or the recognition accuracy is low.
590 $a School code: 0118.
650 4 $a Computer Science. $3 626642
650 4 $a Information Science. $3 1017528
650 4 $a Library Science. $3 881164
690 $a 0984
690 $a 0723
690 $a 0399
710 2 $a University of Massachusetts Amherst. $b Computer Science. $3 1023848
773 0 $t Dissertation Abstracts International $g 75-07B(E).
790 $a 0118
791 $a Ph.D.
792 $a 2014
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3615463
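
Illustrative sketch (not part of the catalog record): the abstract describes representing a book by the sequence of words that occur exactly once in it, and then aligning two texts recursively around those words. The Python below is a minimal sketch of that idea under stated assumptions, not the dissertation's implementation. The function names `unique_word_sequence`, `overlap_score`, and `recursive_align` are hypothetical, tokenization is plain lowercase whitespace splitting, the score normalization is an assumption, and the standard library's `difflib.SequenceMatcher` stands in for whatever sequence alignment the author actually used.

```python
from collections import Counter
from difflib import SequenceMatcher


def unique_word_sequence(text):
    """Return the words that occur exactly once in `text`, in reading order."""
    words = text.lower().split()  # assumption: whitespace tokenization
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def overlap_score(text_a, text_b):
    """Share of unique words matched in the same order (hypothetical score)."""
    seq_a = unique_word_sequence(text_a)
    seq_b = unique_word_sequence(text_b)
    if not seq_a or not seq_b:
        return 0.0
    # SequenceMatcher is a stand-in for the alignment step, not the source's method.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(seq_a), len(seq_b))


def recursive_align(words_a, words_b, off_a=0, off_b=0):
    """Return (index_a, index_b) pairs of aligned words.

    Words occurring exactly once in both segments act as anchors; the gaps
    between consecutive anchors are then aligned the same way, so a word that
    repeats in the whole text can still become an anchor inside a gap.
    """
    count_a, count_b = Counter(words_a), Counter(words_b)
    pos_a = [i for i, w in enumerate(words_a)
             if count_a[w] == 1 and count_b[w] == 1]
    pos_b = [j for j, w in enumerate(words_b)
             if count_b[w] == 1 and count_a[w] == 1]
    matcher = SequenceMatcher(None,
                              [words_a[i] for i in pos_a],
                              [words_b[j] for j in pos_b],
                              autojunk=False)
    anchors = [(pos_a[blk.a + k], pos_b[blk.b + k])
               for blk in matcher.get_matching_blocks()
               for k in range(blk.size)]
    if not anchors:
        return []  # nothing to anchor on; leave this segment unaligned

    pairs = []
    prev_a, prev_b = -1, -1
    # A sentinel past the end lets the loop also handle the trailing gap.
    for i, j in anchors + [(len(words_a), len(words_b))]:
        pairs += recursive_align(words_a[prev_a + 1:i], words_b[prev_b + 1:j],
                                 off_a + prev_a + 1, off_b + prev_b + 1)
        if i < len(words_a):  # skip the sentinel itself
            pairs.append((off_a + i, off_b + j))
        prev_a, prev_b = i, j
    return pairs
```

For example, `overlap_score(ocr_text, reference_text)` gives a rough order-aware similarity between two texts, and `recursive_align(ocr_text.split(), reference_text.split())` returns index pairs that could feed an OCR accuracy estimate; `ocr_text` and `reference_text` are placeholder variables. Each gap between anchors is strictly shorter than the segment containing it, so the recursion terminates.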