National Dong Hwa University Library

Applications and Advances in Similarity-Based Machine Learning.

Record type:	Bibliographic - electronic resource : Monograph/item
Title/Author:	Applications and Advances in Similarity-Based Machine Learning.
Author:	Spaen, Quico Pepijn.
Publisher:	Ann Arbor : ProQuest Dissertations & Theses, 2019
Extent:	91 p.
Notes:	Source: Dissertations Abstracts International, Volume: 81-10, Section: B.
Contained By:	Dissertations Abstracts International 81-10B.
Subject:	Operations research.
Electronic resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=27540380
ISBN:	9781658464758

Thesis (Ph.D.Eng.)--University of California, Berkeley, 2019.

This item must not be sold to any third party vendors.

Similarity-based machine learning methods differ from traditional machine learning methods in that they also use pairwise similarity relations between objects to infer the labels of unlabeled objects. A recent comparative study for classification problems by Baumann et al. [2019] demonstrated that similarity-based techniques have superior performance and robustness when compared to well-established machine learning techniques. Similarity-based machine learning methods benefit from two advantages that could explain their superior performance: they can make use of the pairwise relations between unlabeled objects, and they are robust due to the transitive property of pairwise similarities. A challenge for similarity-based machine learning methods on large datasets is that the number of pairwise similarities grows quadratically in the size of the dataset. For large datasets, it thus becomes practically impossible to compute all possible pairwise similarities. In 2016, Hochbaum and Baumann proposed the technique of sparse computation to address this growth by computing only those pairwise similarities that are relevant. Their proposed implementation of sparse computation is still difficult to scale to millions of objects. This dissertation focuses on advancing the practical implementations of sparse computation to larger datasets and on two applications for which similarity-based machine learning was particularly effective. The applications that are studied here are cell identification in calcium-imaging movies and detecting aberrant linking behavior in directed networks. For sparse computation we present faster, geometric algorithms and a technique, named sparse-reduced computation, that combines sparse computation with compression.
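To make the quadratic growth concrete: n objects have n(n-1)/2 pairs, so one million objects already give roughly 5 × 10^11 pairs. The sketch below illustrates the general idea of restricting attention to "relevant" pairs, under the assumption that relevance means falling in the same or an adjacent grid block, and that objects have already been mapped to coordinates in [0, 1). The function name `sparse_pairs` and the `resolution` parameter are hypothetical; this is not the dissertation's implementation.

```python
from collections import defaultdict
from itertools import product


def sparse_pairs(points, resolution):
    """Return index pairs (i, j), i < j, whose points lie in the same or adjacent blocks.

    Assumption: every point is a tuple of coordinates in [0, 1), and the space
    is cut into `resolution` equal intervals along each dimension.
    """
    # Group object indices by the grid block that contains them.
    blocks = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(min(int(x * resolution), resolution - 1) for x in point)
        blocks[key].append(idx)

    dim = len(points[0]) if points else 0
    offsets = list(product((-1, 0, 1), repeat=dim))  # the block itself plus its neighbors

    pairs = set()
    for key, members in blocks.items():
        for offset in offsets:
            neighbor = tuple(k + o for k, o in zip(key, offset))
            # .get avoids creating empty blocks while iterating over the dict.
            for i in members:
                for j in blocks.get(neighbor, ()):
                    if i < j:
                        pairs.add((i, j))
    return pairs


points = [(0.05, 0.05), (0.15, 0.05), (0.95, 0.95)]
print(len(sparse_pairs(points, 1)))  # a single block keeps all 3 pairs
print(sparse_pairs(points, 4))       # only the two nearby objects remain paired
```

A finer grid keeps fewer pairs, so `resolution` trades the number of similarities computed against the risk of dropping a pair that matters.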
The geometric algorithms compute the exact same output as the original implementation of sparse computation, but identify the relevant pairwise similarities faster by using the concept of data shifting for identifying objects in the same or neighboring blocks. Empirical results on datasets with up to 10 million objects show a significant reduction in running time. Sparse-reduced computation combines sparse computation with a technique for compressing highly similar or identical objects, enabling the use of similarity-based machine learning on massively large datasets. The computational results demonstrate that sparse-reduced computation provides a significant reduction in running time with a minute loss in accuracy.
A major problem facing neuroscientists today is cell identification in calcium-imaging movies. These movies are in-vivo recordings of thousands of neurons at cellular resolution. There is a great need for automated approaches to extract the activity of single neurons from these movies since manual post-processing takes tens of hours per dataset. We present the HNCcorr algorithm for cell identification in calcium-imaging movies. The name HNCcorr is derived from its use of the similarity-based Hochbaum's Normalized Cut (HNC) model with pairwise similarities derived from correlation. In HNCcorr, the task of cell detection is approached as a clustering problem. HNCcorr utilizes HNC to detect cells in these movies as coherent clusters of pixels that are highly distinct from the remaining pixels. HNCcorr guarantees, unlike existing methodologies for cell identification, a globally optimal solution to the underlying optimization problem. Of independent interest is a novel method, named similarity-squared, that we devised for measuring similarity between pixels.
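As a rough illustration of what "pairwise similarities derived from correlation" can look like, the sketch below computes the plain Pearson correlation between the intensity traces of pixels across frames. It is not the similarity-squared method, whose details the abstract does not give. The names `pearson` and `pixel_similarities`, and the choice to treat a constant trace as uncorrelated, are assumptions made for this example.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long intensity traces."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant trace is treated as uncorrelated
    return cov / sqrt(var_a * var_b)


def pixel_similarities(traces):
    """Map each unordered pixel pair to the correlation of its two traces.

    `traces` maps a pixel coordinate (row, col) to its intensity per frame.
    """
    pixels = sorted(traces)
    return {
        (p, q): pearson(traces[p], traces[q])
        for i, p in enumerate(pixels)
        for q in pixels[i + 1:]
    }


traces = {
    (0, 0): [0, 1, 0, 2, 1],  # reference signal
    (0, 1): [1, 3, 1, 5, 3],  # same signal, scaled and offset
    (1, 1): [2, 1, 2, 0, 1],  # inverted signal
}
for pair, value in pixel_similarities(traces).items():
    print(pair, round(value, 3))
```

Pixels that belong to the same cell would be expected to have strongly correlated traces, which is what lets a clustering model group them.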
We provide an experimental study and demonstrate that HNCcorr is a top performer on the Neurofinder cell identification benchmark and that it improves over algorithms based on matrix factorization.
The second application is detecting aberrant agents, such as fake news sources or spam websites, based on their link behavior in networks. Across contexts, a distinguishing characteristic between normal and aberrant agents is that normal agents rarely link to aberrant ones. We refer to this phenomenon as aberrant linking behavior. We present a Markov Random Fields (MRF) formulation, with links as the pairwise similarities, that detects aberrant agents based on aberrant linking behavior and any prior information (if given). This MRF formulation is solved optimally and in polynomial time. We compare the optimal solution for the MRF formulation to well-known algorithms based on random walks. In our empirical experiment with twenty-three different datasets, the MRF method outperforms the other detection algorithms. This work represents the first use of optimization methods for detecting aberrant agents as well as the first time that MRF is applied to directed graphs.
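Binary labeling problems of this shape can be solved exactly with a minimum s-t cut, which is one standard way to obtain an optimal solution in polynomial time. The sketch below assumes a simple objective: each agent pays a prior cost for the label it receives, and each link from a normal agent to an aberrant one pays `link_penalty`. The objective, the function name `detect_aberrant`, and all parameter names are assumptions for illustration; the abstract does not state the dissertation's exact formulation or solver.

```python
from collections import defaultdict, deque


def detect_aberrant(nodes, links, cost_normal, cost_aberrant, link_penalty=1.0):
    """Return the set of nodes labeled aberrant in a minimum-cost labeling.

    cost_normal[v]   : cost of labeling v normal (prior says v looks aberrant)
    cost_aberrant[v] : cost of labeling v aberrant (prior says v looks normal)
    Each link (u, v) costs `link_penalty` when u is normal and v is aberrant.
    """
    source, sink = object(), object()  # sentinels that cannot clash with node names
    cap = defaultdict(lambda: defaultdict(float))
    for v in nodes:
        cap[source][v] += cost_aberrant.get(v, 0.0)  # cut when v lands on the sink side
        cap[v][sink] += cost_normal.get(v, 0.0)      # cut when v lands on the source side
    for u, v in links:
        cap[u][v] += link_penalty                    # cut when u is normal, v aberrant

    def reachable_from_source():
        """Breadth-first search in the residual graph; returns a parent map."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, residual in cap[u].items():
                if residual > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    # Edmonds-Karp: augment along shortest paths until the sink is unreachable.
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow

    # Nodes still reachable from the source are normal; the rest are aberrant.
    return {v for v in nodes if v not in parent}


nodes = ["A", "B", "C", "D"]
links = [("A", "B"), ("C", "A"), ("C", "D"), ("D", "C")]
print(sorted(detect_aberrant(nodes, links, {"D": 10}, {"A": 10})))  # ['C', 'D']
```

In the usage example, A is known to be normal and D is known to be aberrant. B is linked to by a normal agent, so it stays normal, while C links to D and is therefore labeled aberrant.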

ISBN: 9781658464758
Subjects--Topical Terms:

547123
Operations research.
Subjects--Index Terms:

Combinatorial optimization

LDR:05917nmm a2200361 4500
001 2274094
005 20201120111326.5
008 220629s2019 ||||||||||||||||| ||eng d
020 $a 9781658464758
035 $a (MiAaPQ)AAI27540380
035 $a AAI27540380
040 $a MiAaPQ $c MiAaPQ
100 1 $a Spaen, Quico Pepijn. $3 3551557
245 1 0 $a Applications and Advances in Similarity-Based Machine Learning.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2019
300 $a 91 p.
500 $a Source: Dissertations Abstracts International, Volume: 81-10, Section: B.
500 $a Advisor: Hochbaum, Dorit S.
502 $a Thesis (Ph.D.Eng.)--University of California, Berkeley, 2019.
506 $a This item must not be sold to any third party vendors.
520 $a Similarity-based machine learning methods differ from traditional machine learning methods in that they also use pairwise similarity relations between objects to infer the labels of unlabeled objects. A recent comparative study for classification problems by Baumann et al. [2019] demonstrated that similarity-based techniques have superior performance and robustness when compared to well-established machine learning techniques. Similarity-based machine learning methods benefit from two advantages that could explain their superior performance: they can make use of the pairwise relations between unlabeled objects, and they are robust due to the transitive property of pairwise similarities. A challenge for similarity-based machine learning methods on large datasets is that the number of pairwise similarities grows quadratically in the size of the dataset. For large datasets, it thus becomes practically impossible to compute all possible pairwise similarities. In 2016, Hochbaum and Baumann proposed the technique of sparse computation to address this growth by computing only those pairwise similarities that are relevant. Their proposed implementation of sparse computation is still difficult to scale to millions of objects. This dissertation focuses on advancing the practical implementations of sparse computation to larger datasets and on two applications for which similarity-based machine learning was particularly effective. The applications that are studied here are cell identification in calcium-imaging movies and detecting aberrant linking behavior in directed networks. For sparse computation we present faster, geometric algorithms and a technique, named sparse-reduced computation, that combines sparse computation with compression.
The geometric algorithms compute the exact same output as the original implementation of sparse computation, but identify the relevant pairwise similarities faster by using the concept of data shifting for identifying objects in the same or neighboring blocks. Empirical results on datasets with up to 10 million objects show a significant reduction in running time. Sparse-reduced computation combines sparse computation with a technique for compressing highly similar or identical objects, enabling the use of similarity-based machine learning on massively large datasets. The computational results demonstrate that sparse-reduced computation provides a significant reduction in running time with a minute loss in accuracy. A major problem facing neuroscientists today is cell identification in calcium-imaging movies. These movies are in-vivo recordings of thousands of neurons at cellular resolution. There is a great need for automated approaches to extract the activity of single neurons from these movies since manual post-processing takes tens of hours per dataset. We present the HNCcorr algorithm for cell identification in calcium-imaging movies. The name HNCcorr is derived from its use of the similarity-based Hochbaum's Normalized Cut (HNC) model with pairwise similarities derived from correlation. In HNCcorr, the task of cell detection is approached as a clustering problem. HNCcorr utilizes HNC to detect cells in these movies as coherent clusters of pixels that are highly distinct from the remaining pixels. HNCcorr guarantees, unlike existing methodologies for cell identification, a globally optimal solution to the underlying optimization problem. Of independent interest is a novel method, named similarity-squared, that we devised for measuring similarity between pixels.
We provide an experimental study and demonstrate that HNCcorr is a top performer on the Neurofinder cell identification benchmark and that it improves over algorithms based on matrix factorization. The second application is detecting aberrant agents, such as fake news sources or spam websites, based on their link behavior in networks. Across contexts, a distinguishing characteristic between normal and aberrant agents is that normal agents rarely link to aberrant ones. We refer to this phenomenon as aberrant linking behavior. We present a Markov Random Fields (MRF) formulation, with links as the pairwise similarities, that detects aberrant agents based on aberrant linking behavior and any prior information (if given). This MRF formulation is solved optimally and in polynomial time. We compare the optimal solution for the MRF formulation to well-known algorithms based on random walks. In our empirical experiment with twenty-three different datasets, the MRF method outperforms the other detection algorithms. This work represents the first use of optimization methods for detecting aberrant agents as well as the first time that MRF is applied to directed graphs.
590 $a School code: 0028.
650 4 $a Operations research. $3 547123
650 4 $a Applied mathematics. $3 2122814
653 $a Combinatorial optimization
653 $a Hochbaum's Normalized Cut
653 $a Markov Random Fields
653 $a Similarity-based machine learning
653 $a Sparse computation
690 $a 0796
690 $a 0364
710 2 $a University of California, Berkeley. $b Industrial Engineering & Operations Research. $3 1672475
773 0 $t Dissertations Abstracts International $g 81-10B.
790 $a 0028
791 $a Ph.D.Eng.
792 $a 2019
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=27540380