Publications in Machine Learning and Data Mining

Clustering of Adolescent Criminal Offenders using Psychological and Criminological Profiles

Markus Breitenbach, Tim Brennan, William Dieterich, Greg Grudic

book chapter in Data Mining for Business Applications by C. Soares and R. Ghani (Eds.), IOS Press, 2010

Attacking WiFi Localization with Directional Antennas

Kevin Bauer, Damon McCoy, Eric Anderson, Markus Breitenbach, Greg Grudic, Dirk Grunwald and Douglas Sicker

Abstract: 802.11 localization algorithms provide the ability to accurately position and track wireless clients thereby enabling location-based services and applications. However, we show that these localization techniques are vulnerable to attacks where an adversary uses a low-cost directional antenna to appear (from the localization algorithm's perspective) to be in another arbitrary location of their choosing. The attacker's ability to actively influence where they are positioned is a key distinguishing feature of the directional attack relative to prior localization attacks that use transmit power control to introduce localization errors. We implement a representative set of received signal strength-based localization algorithms and evaluate the attack in a real office building environment. To mitigate the attacks effectiveness, we develop and evaluate an attack detection scheme that offers a high detection rate with few false positives.

Accepted to IEEE Globecom 2009. To appear in the Proceedings of IEEE Globecom 2009 and IEEE eXplore. [Presentation]

Creating Risk-Scores in Very Imbalanced Datasets - Predicting Extremely Violent Crime among Criminal Offenders Following Release from Prison

Markus Breitenbach, William Dieterich, Tim Brennan, and Adrian Fan

Book chapter in "Rare Association Rule Mining and Knowledge Discovery: Technologies for Infrequent and Critical Event Detection" Yun Sing Koh & Nathan Rountree(Ed.), ISBN 978-1605667546

Mixture of Watson Distributions: A Generative Model for Hyperspherical Embeddings

Avleen Bijral, Markus Breitenbach and Gregory Z. Grudic

Abstract: Machine learning applications often involve data that can be analyzed as unit vectors on a d-dimensional hypersphere, or equivalently are directional in nature. Spectral clustering techniques generate embeddings that constitute an example of directional data and can result in different shapes on a hypersphere (depending on the original structure). Other examples of directional data include text and some sub-domains of bio-informatics. The Watson distribution for directional data presents a tractable form and has more modeling capability than the simple von Mises-Fisher distribution. In this paper, we present a generative model of mixtures of Watson distributions on a hypersphere and derive numerical approximations of the parameters in an Expectation Maximization (EM) setting. This model also allows us to present an explanation for choosing the right embedding dimension for spectral clustering. We analyze the algorithm on a generated example and demonstrate its superiority over the existing algorithms through results on real datasets.

Accepted to Artificial Intelligence and Statistics (AISTATS), 2007, San Juan, Puerto Rico

Download Paper

Clustering of Psychological Personality Tests of Criminal Offenders

Markus Breitenbach, Tim Brennan, William Dieterich and Gregory Z. Grudic

Abstract: In Criminology research the question arises if certain types of delinquents can be identified from data, and while there are many cases that can not be clearly labeled, overlapping taxonomies have been proposed in Moffitt 1993, Lykken 1995 and Mealey 1995. In a recent study Juvenile offenders (N = 1572) from three state systems were assessed on a battery of criminogenic risk and needs factors and their official criminal histories. Cluster analysis methods were applied. One problem we encountered is the large number of hybrid cases that have to belong to two or more classes. To eliminate these cases we propose a method that combines the results of Bagged K-Means and the consistency method, a semi-supervised learning technique. A manual interpretation of the results showed very interpretable patterns that were linked to existing criminologic research.

Accepted for presentation at ECML PKDD 2006 Workshop on Practical Data Mining, 15-22 September 2006, Berlin, Germany.

Clustering of Psychological Personality Tests of Criminal Offenders [PDF] Download Paper
[Slides]

Clustering Through Ranking On Manifolds

Markus Breitenbach and Gregory Z. Grudic

Abstract: Clustering aims to find useful hidden structures in data. In this paper we present a new clustering algorithm that builds upon the consistency method (Zhou, et.al., 2003), a semi-supervised learning technique with the property of learning very smooth functions with respect to the intrinsic structure revealed by the data. Other methods, e.g. Spectral Clustering, obtain good results on data that reveals such a structure. However, unlike Spectral Clustering, our algorithm effectively detects both global and within-class outliers, and the most representative examples in each class. Furthermore, we specify an optimization framework that estimates all learning parameters, including the number of clusters, directly from data. Finally, we show that the learned cluster-models can be used to add previously unseen points to clusters without re-learning the original cluster model. Encouraging experimental results are obtained on a number of real world problems.

Accepted for presentation at ICML 2005, 7-11 August 2005, Bonn, Germany.

Clustering Through Ranking On Manifolds [PDF] Download Paper
See also:Topological mapping with multiple visual manifolds [PDF]

Clustering with Local and Global Consistency

Markus Breitenbach and Gregory Z. Grudic

Abstract: Clustering aims at finding hidden structure in data. In this paper we present a new clustering algorithm that builds upon the local and global consistency method (Zhou, et.al., 2003), a semi-supervised learning technique with the property of learning very smooth functions with respect to the intrinsic structure revealed by the data. Starting from this algorithm, we derive an optimization framework that discovers structure in data without requiring labeled data. This framework is capable of simultaneously optimizing all learning parameters, as well as picking the optimal number of clusters. It also allows easy detection of both global outliers and outliers within clusters. Finally, we show that the learned cluster models can be used to add previously unseen points to the clusters, without re-learning the original cluster model. Encouraging experimental results are obtained on a number of toy and real world problems.

Available as Technical Report CU-CS-973-04.pdf

Clustering with Local and Global Consistency [PDF] Download Paper

Linear Time Nonparametric Classification and Feature Selection with Polynomial MPMC Cascades for large datasets

Markus Breitenbach, Sander M. Bohte and G. Z. Grudic

Abstract: The recently proposed Polynomial MPMC Cascade (PMC) algorithm is a nonparametric classifier for high-dimensional non-linear binary classification with performance competitive with state-of-the-art classifiers like SVMs. Importantly, the algorithm has linear-time complexity with respect to both training-set size and dimensionality of the problem. In this paper, we show how we can exploit this computational efficiency to build classifiers from very large datasets typical in datamining problems. Furthermore, we demonstrate that the PMC algorithm efficiently does feature selection in such very large large problem domains. Experimental results are given on datasets ranging in sizes between 170,000 and 4.8 million examples. We empirically verify the linear time dependence of the algorithm, and explore how the stability of the classifier is influenced by sample size. The techniques discussed in this paper should allow nonlinear binary classifiers to be efficiently learned from tens of millions of training data, with feature selection being a added byproduct.

Available as Technical Report CU-CS-977-04.

Linear Time Nonparametric Classification and Feature Selection with Polynomial MPMC Cascades for large datasets [PDF]Download paper.

Nonparametric Classification with Polynomial MPMC Cascades

Sander M. Bohte, Markus Breitenbach and Gregory Z. Grudic

Abstract:A new class of nonparametric algorithms for high-dimensional binary classification is proposed using cascades of low dimensional polynomial structures. Construction of polynomial cascades is based on Minimax Probability Machine Classification (MPMC), which results in direct estimates of classification accuracy, and provides a simple stopping criteria that does not require expensive cross-validation measures. This Polynomial MPMC Cascade (PMC) algorithm is constructed in linear time with respect to the input space dimensionality, and linear time in the number of examples, making it a potentially attractive alternative to algorithms like support vector machines and standard MPMC. Experimental evidence is given showing that, compared to state-of-the-art classifiers, PMCs are competitive; inherently fast to compute; not prone to overfitting; and generally yield accurate estimates of the maximum error rate on unseen data.

Accepted for presentation at ICML 2004, July 4-8 2004, Banff, Canada.

Nonparametric Classification with Polynomial MPMC Cascades [PDF] Download Paper

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities

Markus Breitenbach, Rodney Nielsen and Gregory Z. Grudic

Abstract:Recently proposed classification algorithms give estimates or worst-case bounds for the probability of misclassification. These accuracy estimates are for all future predictions, even though some predictions are more likely to be correct than others. This paper introduces Probabilistic Random Forests (PRF), which allow data point dependent estimates of misclassification probabilities for binary classification. A PRF model outputs both a classification and a misclassification probability estimate for the data point. PRF makes it possible to assess the risk of misclassification, one prediction at a time, without distribution assumptions or density estimation. Experiments show that PRFs give good estimates of the error probability for each classification.

Available as Technical Report CU-CS-954-03

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities [PDF]Download Paper

Harvesta - A new Approach in Information Management Suited for Corporate Intelligence

Frank Hutter and Markus Breitenbach

This Whitepaper on a software system developed by our company BIIT (Business Intelligence Information Technology) was presented at the 6th annual SCIP European Conference, Munich, 2001. It is available here. This came out of a Software Engineering project and we also got an article written about our project here.

Blog

Consider checking out my blog

Email me