Geometric methods for mining large and possibly private datasets.

Record type:	Bibliographic - electronic resource : Monograph/item
Title/Author:	Geometric methods for mining large and possibly private datasets./
Author:	Chen, Keke.
Physical description:	189 p.
Note:	Source: Dissertation Abstracts International, Volume: 67-09, Section: B, page: 5184.
Contained By:	Dissertation Abstracts International 67-09B.
Subject:	Computer Science.
Electronic resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3233503
ISBN:	9780542861154

Thesis (Ph.D.)--Georgia Institute of Technology, 2006.

With the wide deployment of data-intensive Internet applications and continued advances in sensing technology and biotechnology, large multidimensional datasets, possibly containing privacy-sensitive information, have been emerging. Mining such datasets has become increasingly common in business integration, large-scale scientific data analysis, and national security. This thesis research addresses three important problems in mining large and possibly private datasets. The first problem is to prove the hypothesis that we can use interactive visualization techniques to develop an effective yet flexible framework for clustering very large datasets, especially datasets with irregularly shaped clusters. The second problem is to prove the hypothesis that there is an effective method for determining the critical clustering structure of categorical data, i.e., finding the best number of clusters K in categorical data. The third problem is to prove the hypothesis that we can develop multidimensional data perturbation techniques that provide a high privacy guarantee with little sacrifice in the accuracy of some data mining models. The proposed research aims to explore the geometric properties of the multidimensional datasets used in statistical learning and data mining, and to provide novel techniques and frameworks for mining very large datasets while protecting the desired data privacy.

ISBN: 9780542861154
Subjects--Topical Terms:

626642
Computer Science.

Geometric methods for mining large and possibly private datasets.
LDR:07575nmm 2200313 4500
001 1830531
005 20070430071715.5
008 130610s2006 eng d
020 $a 9780542861154
035 $a (UnM)AAI3233503
035 $a AAI3233503
040 $a UnM $c UnM
100 1 $a Chen, Keke. $3 1919355
245 1 0 $a Geometric methods for mining large and possibly private datasets.
300 $a 189 p.
500 $a Source: Dissertation Abstracts International, Volume: 67-09, Section: B, page: 5184.
500 $a Adviser: Ling Liu.
502 $a Thesis (Ph.D.)--Georgia Institute of Technology, 2006.
520 $a With the wide deployment of data-intensive Internet applications and continued advances in sensing technology and biotechnology, large multidimensional datasets, possibly containing privacy-sensitive information, have been emerging. Mining such datasets has become increasingly common in business integration, large-scale scientific data analysis, and national security. This thesis research addresses three important problems in mining large and possibly private datasets. The first problem is to prove the hypothesis that we can use interactive visualization techniques to develop an effective yet flexible framework for clustering very large datasets, especially datasets with irregularly shaped clusters. The second problem is to prove the hypothesis that there is an effective method for determining the critical clustering structure of categorical data, i.e., finding the best number of clusters K in categorical data. The third problem is to prove the hypothesis that we can develop multidimensional data perturbation techniques that provide a high privacy guarantee with little sacrifice in the accuracy of some data mining models. The proposed research aims to explore the geometric properties of the multidimensional datasets used in statistical learning and data mining, and to provide novel techniques and frameworks for mining very large datasets while protecting the desired data privacy.
520 $a Some of the most challenging problems in numerical data clustering include identifying irregularly shaped clusters, incorporating domain knowledge into clustering, and labeling clusters for large amounts of disk data. These problems are aggravated when the dataset is huge and the clustering phase is performed on a subset of sampled data. Existing automatic approaches are not effective in dealing with the first two problems, while existing visualization approaches do not address the challenges of clustering large datasets. The first main contribution of this research is the development of iVIBRATE, an interactive visualization-based approach for clustering very large datasets. The iVIBRATE approach addresses these problems with a visualization-based three-phase framework: "Sampling - Visual Cluster Rendering - Visualization-based Disk Labeling". The distinct characteristics of the iVIBRATE approach are twofold. (1) We design and develop the VISTA visual cluster rendering subsystem, which brings humans into the large-scale iterative clustering process through interactive visualization. VISTA can effectively resolve most visual cluster overlapping through interactive visual cluster rendering. (2) We also develop an Adaptive ClusterMap Labeling subsystem, which offers a visualization-guided disk-labeling solution that is effective in dealing with outliers, irregular clusters, and cluster boundary extension for large datasets.
520 $a Many categorical data clustering algorithms have been proposed. However, the important problem of identifying the best number of clusters K has not yet been well addressed. The second main contribution is the development of the "Best K Plot" (BKPlot) method for determining the critical clustering structures in multidimensional categorical data. The BKPlot method addresses two challenges in clustering categorical data: how to determine the number of clusters (the best K) and how to identify the existence of significant clustering structures. The method makes a few unique contributions. (1) The basic method is based on the entropy difference between optimal clustering results with varying Ks. BKPlot can suggest a few candidates for the best K, each of which identifies a different layer of critical clustering structure. (2) We also developed the sample BKPlot theory for characterizing the critical clustering structures in very large categorical datasets. (3) The basic BKPlot method and the sample BKPlot method are extended to characterize the feature of no-cluster datasets, which is then used to identify the existence of significant clustering structures in a given dataset.
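As a rough illustration of the entropy difference mentioned in this note (not the BKPlot method itself), the following Python sketch computes a size-weighted categorical entropy for partitions with K = 1, 2, 3 and the drop in entropy between consecutive Ks. The toy records, the hand-picked partitions, and the function names `column_entropy` and `expected_entropy` are assumptions made for illustration and do not come from the thesis; a real analysis would take its partitions from a clustering algorithm.

```python
import math
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of one categorical column."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def expected_entropy(records, labels):
    """Size-weighted sum of per-column entropies over all clusters."""
    n = len(records)
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    total = 0.0
    for members in clusters.values():
        cols = zip(*members)  # transpose: one tuple of values per column
        total += (len(members) / n) * sum(column_entropy(list(col)) for col in cols)
    return total


# Toy categorical data (hypothetical): two groups that differ in the first attribute.
records = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("b", "w")]

# Hypothetical hand-picked partitions for K = 1, 2, 3 (assumption, not computed).
partitions = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}

entropies = {k: expected_entropy(records, labs) for k, labs in partitions.items()}
# Entropy reduction gained by moving from K to K + 1 clusters.
drops = {k: entropies[k] - entropies[k + 1] for k in (1, 2)}
print(entropies, drops)
```

On this toy data the drop from K = 1 to K = 2 is much larger than the drop from K = 2 to K = 3, which is the kind of signal a plot of entropy differences against K would make visible.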
520 $a Data perturbation has become popular as a means of privacy-preserving data mining. Most data perturbation research has focused on the randomization approach, which tries to perturb some columns of multidimensional data individually in order to achieve privacy while preserving the dimensional statistics of these columns. However, little effort has been made to develop multidimensional perturbation techniques. The third main contribution of this research is the development of the theory of geometric data perturbation and its application to privacy-preserving data classification involving a single party or multiparty collaboration. Concretely, the basic theory of geometric perturbation is developed for privacy-preserving data classification. It is known that many popular data classification algorithms describe the classification decision boundary as a hyperplane or a curved hypersurface in Euclidean space. We can employ random geometric rotation and translation to preserve the classification boundary well while perturbing the multidimensional dataset with a high privacy guarantee. The key to geometric data perturbation is to find a good randomly generated rotation matrix and an appropriate noise component that provide a satisfactory balance between privacy guarantee and data quality. We analyze three kinds of inference attacks on geometric perturbation: the naive-inference attack, the ICA (Independent Component Analysis)-based attack, and the distance-based attack. Then, we develop a randomized optimization algorithm to find a good geometric perturbation that is resilient to these three kinds of inference attacks.
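The idea of rotating, translating, and adding noise to a dataset can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's algorithm: the rotation is built as a product of random Givens rotations (a convenience choice that is not uniformly distributed over all rotations), and the names `random_rotation`, `perturb`, and `distance`, the toy data, and the noise parameter are all hypothetical. The sketch does not attempt the randomized optimization or the attack analysis described in the note.

```python
import math
import random


def random_rotation(dim, rng):
    """Build a dim x dim rotation matrix as a product of random Givens rotations."""
    # Start from the identity matrix.
    rot = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for i in range(dim):
        for j in range(i + 1, dim):
            theta = rng.uniform(0.0, 2.0 * math.pi)
            c, s = math.cos(theta), math.sin(theta)
            # Left-multiply rot by a rotation in the (i, j) coordinate plane.
            row_i = [c * rot[i][k] - s * rot[j][k] for k in range(dim)]
            row_j = [s * rot[i][k] + c * rot[j][k] for k in range(dim)]
            rot[i], rot[j] = row_i, row_j
    return rot


def perturb(points, rot, shift, noise_sd, rng):
    """Return rot * x + shift + Gaussian noise for every point x."""
    dim = len(shift)
    return [
        [sum(rot[r][k] * x[k] for k in range(dim)) + shift[r] + rng.gauss(0.0, noise_sd)
         for r in range(dim)]
        for x in points
    ]


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [[rng.random() for _ in range(3)] for _ in range(5)]  # toy data (assumption)
rot = random_rotation(3, rng)
shift = [rng.random() for _ in range(3)]

# With zero noise, rotation plus translation leaves pairwise distances unchanged.
exact = perturb(points, rot, shift, 0.0, rng)
print(distance(points[0], points[1]), distance(exact[0], exact[1]))
```

With the noise standard deviation set to zero, the two printed distances agree up to floating-point error, which is why distance-based and hyperplane-based classifiers can be largely unaffected by such a transformation. A nonzero noise level trades some of that exactness for additional protection.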
520 $a The basic theory of geometric perturbation addresses the challenges in perturbing single-party data. When geometric perturbation is applied to collaborative multiparty data classification, there is an additional challenge: the unification of geometric perturbations. Different parties may prefer different geometric perturbations, each with a high privacy guarantee for its own part of the training data. In order to mine a cohesive classification model, all geometric perturbations need to be unified into one geometric perturbation. We study two approaches under the data-mining-service-based framework: (1) the ranking approach, in which the parties agree on one perturbation, and (2) the space adaptation approach, which transforms the perturbations into one randomly generated secret perturbation. For each approach, two protocols are developed to address different tradeoffs among three factors: the privacy guarantee, the data quality, and the efficiency of unifying perturbations.
590 $a School code: 0078.
650 4 $a Computer Science. $3 626642
690 $a 0984
710 2 0 $a Georgia Institute of Technology. $3 696730
773 0 $t Dissertation Abstracts International $g 67-09B.
790 1 0 $a Liu, Ling, $e advisor
790 $a 0078
791 $a Ph.D.
792 $a 2006
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3233503