# Item

ITEM ACTIONSEXPORT

Released

Thesis

#### A PAC-Bayesian Approach to Structure Learning

##### Fulltext (public)

There are no public fulltexts stored in PuRe

##### Supplementary Material (public)

There is no public supplementary material available

##### Citation

Seldin, Y. (2009). A PAC-Bayesian Approach to Structure Learning. PhD Thesis, Hebrew University of Jerusalem, Jerusalem, Israel.

Cite as: http://hdl.handle.net/11858/00-001M-0000-0013-C341-E

##### Abstract

Structure learning is an important sub-domain of machine learning. Its goal
is a high level understanding of the data. For example, given an image
as a collection of pixels, the goal is to identify the objects present in the
image, such as people, trees, birds, to infer their actions (e.g., standing, fly-
ing) and interactions (e.g. a man is feeding a bird). Structure learning has
multiple applications in a variety of domains, including the analysis of cellu-
lar processes in biology, computer vision, and natural language processing,
to name only a few. In the recent couple of decades enormous progress has
been made in data analysis methods that are not based on explicit structure,
such as Support Vector Machines (SVMs) or other kernel-based methods.
Nevertheless, the importance of learning explicit structures remains crucial.
There are multiple benefits to structure-based learning as opposed to meth-
ods, where structure is present implicitly, e.g. in the form of a kernel. The
primary reason is that it is easier for us as humans to manipulate objects
like cars and people rather than raw pixels or kernel matrices. For exam-
ple, a natural query for a human would be: “find the photos of me picking
mushrooms last summer”. Thus we need a learning algorithm to go all the
way up from pixels to persons, mushrooms, trees, etc., and also actions like
picking, standing, and so on. Another important reason to learn structure
in data is “to understand the world around us” for example, to understand
biological processes or even how our own brain works. It also becomes easier
to control or influence different processes once we learn their structure and
extract simple rules governing their dynamics.
Often, the amount of supervision available for a learning algorithm in a structure learning task is limited or even non-existent. Even when present,
supervision is often at a high level, whereas the data are represented at a
low level. For example, we may infer from an image label that it is an image
of a cow, but the algorithm still has to infer the cow’s location and shape.
However, many studies have shown that even completely unsupervised learn-
ing methods are able to identify meaningful structures present in data and
can facilitate high level decisions. But despite their remarkable success in
practice, our conceptual understanding of structure learning approaches is
highly limited. The issue is so basic that even if we are given two reasonable
solutions to some problem (for example, two possible segmentations of an
image) we are unable to make a well-founded judgment as to which one is
better. Typical forms of evaluation are quite subjective, such as “this seg-
mentation looks more natural” or “this has a higher level of correlation with
human annotation”. However, this form of evaluation is hard to apply in
domains where our own intuition is limited, such as bioinformatics or neuro-
science. Model order selection and a proper choice of evaluation procedures
have remained open questions in structure learning for over half a century.
The lack of solid theoretical foundations has given rise to multiple heuristic
approaches which are hard to compare and in the end are slowing down the
development of the whole field.
The picture is completely different in supervised learning. The first ad-
vantage of supervised learning is that it has a well-defined learning objective
- the prediction of a label. From this point it becomes clear how to con-
duct a formal analysis of different learning approaches (usually in the form
of a derivation of a generalization or sample complexity bounds) and how
to evaluate them. Nowadays, most successful classification algorithms are
accompanied by generalization guarantees and many were derived as algo-
rithms for optimization of generalization guarantees. The existence of a
clear objective and generalization guarantees for most algorithms makes it
possible to compare solutions to the same problem obtained by different ap-
proaches (e.g., SVMs and decision trees) both theoretically and practically.
The ability to make a formal analysis and compare different approaches
accounts for the rapid rise in supervised learning in recent decades.
n this thesis it is claimed that the ill-posed nature of unsupervised
learning approaches and in particular unsupervised learning approaches to
structure learning is caused by the fact that unsupervised learning problems
are usually taken out of context. Here we argue that one does not learn
structure just for the sake of learning structure, but rather in order to facil-
itate solving some higher level task. By identifying that task and looking at
structure learning from the point of view of its utility in solving the higher
level task we return the structure learning problem to its context. This en-
ables a clear and objective comparison of different approaches to structure
learning, similar to the way it is done in the supervised learning. We can
also examine in which situations knowing structure is beneficial to solving
a task, or whether unstructured methods such as SVMs or Principal Com-
ponent Analysis (PCA) would perform better. The problem of model order
selection can be answered naturally in this context. It also improves our
understanding of which questions the structure we have learned can answer
and which it cannot. Thus, it is not only desirable, but necessary to consider
structure learning within the wider context of its possible applications.
We demonstrate our approach to the formulation of structure learning
within the context of a higher level task using the example of co-clustering.
Co-clustering is a widely used approach to the analysis of data matrices
by simultaneous grouping (clustering) of “similar” rows and columns of a
data matrix; for instance, clustering of viewers and movies in collaborative
filtering, genes and conditions in gene expression data analysis, or words
and documents in text analysis. Co-clustering was traditionally considered
an unsupervised approach to data analysis. Various solutions were designed
by approximating different properties of the data at hand, but in most cases
there was no way to compare different solutions and perform model order
selection. This dissertation examines co-clustering solutions in the context
of their ability to predict new events generated by the same distribution
that generated the data matrix. Within this context it becomes possible to
carry out generalization analysis of co-clustering, model order selection, and
to compare co-clustering with other approaches to solving this problem.
We find it convenient to use a PAC-Bayesian framework to carry out a formal analysis of generalization properties of co-clustering. This work also
makes several contributions to the domain of PAC-Bayesian generalization
bounds, which up to now have solely been used in the context of general-
ization analysis of classification approaches. Here we derive PAC-Bayesian
generalization bound for density estimation. We also show that the obtained
bounds for co-clustering are actually generalization bounds for a particular
form of graphical models, where each hidden variable is connected to a sin-
gle observable variable. The bounds suggest that the generalization power
of such graphical models is optimized by a tradeoff between empirical per-
formance and mutual information preserved by the hidden variables on the
observed variables. The regularization of the models by the mutual infor-
mation comes as a result of the use of combinatorial priors in PAC-Bayesian
bounds and is yet another contribution of our work. Our combinatorial pri-
ors can be compared to Gaussian and Laplacian priors that were used in
other works on PAC-Bayesian bounds and result in
L
2
-norm and
L
1
-norm
regularization, respectively.
In the applications chapter we show that the tradeoff between empirical
performance and mutual information preserved on the observed variables
that was derived from the bounds enables a complete control over regular-
ization of the co-clustering model. It further achieves state-of-the-art results
on the MovieLens collaborative filtering dataset.
In an excursus, we show that the bounds developed for co-clustering
can be applied in classifications by a single parameter (e.g., prediction of
a probability of a disease based on country visited). In the applications
chapter we demonstrate that the bounds are especially tight in this task
and are less than 15% away from the test error. We suggest that the bounds
can be applied to feature ranking. We futher show that such an approach to
feature ranking, where features are ranked by their generalization potential,
is much more successful than feature ranking based on empirical mutual
information or normalized correlation with the label.
As a continuation of the main thrust of the thesis, we show that it is
possible to extend the analysis of co-clustering to more complex domains.
Specifically, we show that the same kind of a tradeoff between empirical
performance and mutual information that the hidden variables preserve on
the observed variables also holds in graphical models, where hidden variables
are connected in a tree shape and the observed variables are at the leaves of
the tree. We also demonstrate that PAC-Bayesian bounds are able to exploit
the factor form of graphical models and make it possible to derive bounds
that depend on the sizes of the cliques of the graph rather than the size of
the parameter space. We further suggest viewing the problem of learning
graphical models from the point of view of the ability of the graphical model
to predict new events generated by the same distribution rather than the
frequently applied practice of fitting the train set. Extensions of the results
to more complex graphical models as well as the development of efficient
algorithms for optimization of structures of graphical models with respect
to the bounds are subjects for future research.