
Released

Thesis

A PAC-Bayesian Approach to Structure Learning

MPS-Authors

Seldin,  Y
Department Empirical Inference, Max Planck Institute for Biological Cybernetics, Max Planck Society;

Citation

Seldin, Y. (2009). A PAC-Bayesian Approach to Structure Learning. PhD Thesis, Hebrew University of Jerusalem, Jerusalem, Israel.


Cite as: http://hdl.handle.net/11858/00-001M-0000-0013-C341-E
Abstract
Structure learning is an important sub-domain of machine learning. Its goal is a high-level understanding of the data. For example, given an image as a collection of pixels, the goal is to identify the objects present in the image, such as people, trees, and birds, and to infer their actions (e.g., standing, flying) and interactions (e.g., a man is feeding a bird). Structure learning has multiple applications in a variety of domains, including the analysis of cellular processes in biology, computer vision, and natural language processing, to name only a few. In recent decades enormous progress has been made in data analysis methods that are not based on explicit structure, such as Support Vector Machines (SVMs) and other kernel-based methods. Nevertheless, learning explicit structures remains crucial. There are multiple benefits to structure-based learning as opposed to methods where structure is present only implicitly, e.g. in the form of a kernel. The primary reason is that it is easier for us as humans to manipulate objects like cars and people than raw pixels or kernel matrices. For example, a natural query for a human would be: “find the photos of me picking mushrooms last summer”. Thus we need a learning algorithm to go all the way up from pixels to persons, mushrooms, trees, etc., and also to actions like picking, standing, and so on. Another important reason to learn structure in data is to understand the world around us: for example, to understand biological processes or even how our own brain works. It also becomes easier to control or influence different processes once we learn their structure and extract simple rules governing their dynamics. Often, the amount of supervision available to a learning algorithm in a structure learning task is limited or even non-existent. Even when present, supervision is often at a high level, whereas the data are represented at a low level.
For example, we may infer from an image label that it is an image of a cow, but the algorithm still has to infer the cow’s location and shape. However, many studies have shown that even completely unsupervised learning methods are able to identify meaningful structures present in data and can facilitate high-level decisions. Yet despite their remarkable success in practice, our conceptual understanding of structure learning approaches is highly limited. The issue is so basic that even given two reasonable solutions to some problem (for example, two possible segmentations of an image), we are unable to make a well-founded judgment as to which one is better. Typical forms of evaluation are quite subjective, such as “this segmentation looks more natural” or “this has a higher level of correlation with human annotation”. This form of evaluation is hard to apply in domains where our own intuition is limited, such as bioinformatics or neuroscience. Model order selection and a proper choice of evaluation procedures have remained open questions in structure learning for over half a century. The lack of solid theoretical foundations has given rise to multiple heuristic approaches which are hard to compare and which, in the end, slow down the development of the whole field. The picture is completely different in supervised learning. The first advantage of supervised learning is that it has a well-defined learning objective: the prediction of a label. From this point it becomes clear how to conduct a formal analysis of different learning approaches (usually in the form of a derivation of generalization or sample complexity bounds) and how to evaluate them. Nowadays, most successful classification algorithms are accompanied by generalization guarantees, and many were derived as algorithms for optimization of generalization guarantees.
The existence of a clear objective and generalization guarantees for most algorithms makes it possible to compare solutions to the same problem obtained by different approaches (e.g., SVMs and decision trees), both theoretically and practically. The ability to make a formal analysis and compare different approaches accounts for the rapid rise of supervised learning in recent decades. In this thesis it is claimed that the ill-posed nature of unsupervised learning approaches, and in particular of unsupervised learning approaches to structure learning, stems from the fact that unsupervised learning problems are usually taken out of context. Here we argue that one does not learn structure just for the sake of learning structure, but rather in order to facilitate solving some higher-level task. By identifying that task and looking at structure learning from the point of view of its utility in solving the higher-level task, we return the structure learning problem to its context. This enables a clear and objective comparison of different approaches to structure learning, similar to the way it is done in supervised learning. We can also examine in which situations knowing the structure is beneficial to solving a task, or whether unstructured methods such as SVMs or Principal Component Analysis (PCA) would perform better. The problem of model order selection can be answered naturally in this context. It also improves our understanding of which questions the learned structure can answer and which it cannot. Thus, it is not only desirable but necessary to consider structure learning within the wider context of its possible applications. We demonstrate our approach to the formulation of structure learning within the context of a higher-level task using the example of co-clustering.
Co-clustering is a widely used approach to the analysis of data matrices by simultaneous grouping (clustering) of “similar” rows and columns of a data matrix; for instance, clustering of viewers and movies in collaborative filtering, genes and conditions in gene expression data analysis, or words and documents in text analysis. Co-clustering was traditionally considered an unsupervised approach to data analysis. Various solutions were designed by approximating different properties of the data at hand, but in most cases there was no way to compare different solutions or to perform model order selection. This dissertation examines co-clustering solutions in the context of their ability to predict new events generated by the same distribution that generated the data matrix. Within this context it becomes possible to carry out a generalization analysis of co-clustering, to perform model order selection, and to compare co-clustering with other approaches to solving this problem. We find it convenient to use the PAC-Bayesian framework to carry out a formal analysis of the generalization properties of co-clustering. This work also makes several contributions to the domain of PAC-Bayesian generalization bounds, which up to now have been used solely in the context of generalization analysis of classification approaches. Here we derive a PAC-Bayesian generalization bound for density estimation. We also show that the obtained bounds for co-clustering are actually generalization bounds for a particular form of graphical model, where each hidden variable is connected to a single observable variable. The bounds suggest that the generalization power of such graphical models is optimized by a tradeoff between empirical performance and the mutual information that the hidden variables preserve on the observed variables. The regularization of the models by the mutual information comes as a result of the use of combinatorial priors in PAC-Bayesian bounds and is yet another contribution of our work.
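The tradeoff described above can be illustrated with a small numerical sketch. This is not the thesis’s actual bound or algorithm, only a toy objective under assumed conventions: soft row and column cluster assignments, uniform marginals over rows and columns, and a hypothetical tradeoff weight `beta`.

```python
import numpy as np


def mutual_information(p_joint):
    """Mutual information I(X; C) in nats from a joint distribution matrix
    whose entries sum to 1 (rows index X, columns index the cluster C)."""
    px = p_joint.sum(axis=1, keepdims=True)
    pc = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (px @ pc)[mask])))


def coclustering_objective(M, row_q, col_q, beta=1.0):
    """Illustrative tradeoff: empirical fit of the event distribution vs.
    the information the hidden cluster variables keep about rows/columns.
    `beta` is a hypothetical weight, not a value derived from any bound.

    M      : (n_rows, n_cols) non-negative data matrix of event counts
    row_q  : (n_rows, k1) soft row-cluster assignments, rows sum to 1
    col_q  : (n_cols, k2) soft column-cluster assignments, rows sum to 1
    """
    n_rows, n_cols = M.shape
    # joint distributions (row, row-cluster) and (col, col-cluster),
    # assuming uniform marginals over rows and columns
    info = mutual_information(row_q / n_rows) + mutual_information(col_q / n_cols)
    # empirical event distribution and its cluster-level smoothing
    P = M / M.sum()
    block = row_q.T @ P @ col_q          # probability mass per cluster pair
    recon = row_q @ block @ col_q.T      # smoothed event probabilities
    recon /= recon.sum()
    empirical_loss = -np.sum(P * np.log(np.clip(recon, 1e-12, None)))
    return empirical_loss + beta * info
```

Lowering `beta` lets the clusters retain more information about individual rows and columns (better fit, weaker regularization); raising it pushes the assignments toward uninformative, heavily smoothed predictions.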
Our combinatorial priors can be compared to the Gaussian and Laplacian priors that were used in other works on PAC-Bayesian bounds and that result in L2-norm and L1-norm regularization, respectively. In the applications chapter we show that the tradeoff between empirical performance and the mutual information preserved on the observed variables that was derived from the bounds enables complete control over the regularization of the co-clustering model. It further achieves state-of-the-art results on the MovieLens collaborative filtering dataset. In an excursus, we show that the bounds developed for co-clustering can be applied to classification by a single parameter (e.g., prediction of the probability of a disease based on the country visited). In the applications chapter we demonstrate that the bounds are especially tight in this task and are less than 15% away from the test error. We suggest that the bounds can be applied to feature ranking. We further show that such an approach to feature ranking, where features are ranked by their generalization potential, is much more successful than feature ranking based on empirical mutual information or normalized correlation with the label. As a continuation of the main thrust of the thesis, we show that it is possible to extend the analysis of co-clustering to more complex domains. Specifically, we show that the same kind of tradeoff between empirical performance and the mutual information that the hidden variables preserve on the observed variables also holds in graphical models where the hidden variables are connected in a tree and the observed variables are at the leaves of the tree. We also demonstrate that PAC-Bayesian bounds are able to exploit the factor form of graphical models and make it possible to derive bounds that depend on the sizes of the cliques of the graph rather than on the size of the parameter space.
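The empirical-mutual-information baseline mentioned above (the one that bound-based ranking is reported to outperform) is straightforward to sketch. The function names and data layout here are illustrative, not taken from the thesis; the point is that the baseline ranks features by a plug-in MI estimate on the training counts alone, with no correction for how that estimate generalizes.

```python
import math
from collections import Counter


def empirical_mi(feature, labels):
    """Empirical mutual information I(feature; label) in nats, estimated
    from co-occurrence counts of discrete values (the plug-in estimator)."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    fx, fy = Counter(feature), Counter(labels)
    return sum((c / n) * math.log((c * n) / (fx[f] * fy[y]))
               for (f, y), c in joint.items())


def rank_features_by_mi(X, y):
    """Rank feature columns by empirical MI with the label, best first.
    X is a list of rows of discrete feature values.  A bound-based ranking
    would replace this raw score with a generalization guarantee."""
    scores = [empirical_mi(col, tuple(y)) for col in zip(*X)]
    return sorted(range(len(scores)), key=lambda j: -scores[j])
```

Because the plug-in estimate is computed on training counts, it systematically favors high-cardinality features that fit the sample by chance, which is one reason a generalization-aware ranking can do better.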
We further suggest viewing the problem of learning graphical models from the point of view of the ability of the graphical model to predict new events generated by the same distribution, rather than the common practice of fitting the training set. Extensions of the results to more complex graphical models, as well as the development of efficient algorithms for optimizing the structures of graphical models with respect to the bounds, are subjects for future research.