# Item


Thesis

#### A PAC-Bayesian Approach to Structure Learning


##### Citation

Seldin, Y. (2009). A PAC-Bayesian Approach to Structure Learning. PhD Thesis, Hebrew University of Jerusalem, Jerusalem, Israel.

Cite as: https://hdl.handle.net/11858/00-001M-0000-0013-C341-E

##### Abstract

Structure learning is an important sub-domain of machine learning. Its goal is a high-level understanding of the data. For example, given an image as a collection of pixels, the goal is to identify the objects present in the image, such as people, trees, and birds, and to infer their actions (e.g., standing, flying) and interactions (e.g., a man is feeding a bird). Structure learning has multiple applications in a variety of domains, including the analysis of cellular processes in biology, computer vision, and natural language processing, to name only a few. In the last couple of decades enormous progress has been made in data analysis methods that are not based on explicit structure, such as Support Vector Machines (SVMs) and other kernel-based methods. Nevertheless, learning explicit structures remains crucial. There are multiple benefits to structure-based learning as opposed to methods where structure is present only implicitly, e.g., in the form of a kernel. The primary reason is that it is easier for us as humans to manipulate objects like cars and people than raw pixels or kernel matrices. For example, a natural query for a human would be: “find the photos of me picking mushrooms last summer”. Thus we need a learning algorithm to go all the way up from pixels to persons, mushrooms, trees, etc., and also to actions like picking, standing, and so on. Another important reason to learn structure in data is to understand the world around us: for example, to understand biological processes or even how our own brain works. It also becomes easier to control or influence different processes once we learn their structure and extract simple rules governing their dynamics.

Often, the amount of supervision available to a learning algorithm in a structure learning task is limited or even non-existent. Even when present, supervision is often at a high level, whereas the data are represented at a low level. For example, we may infer from an image label that it is an image of a cow, but the algorithm still has to infer the cow’s location and shape. Many studies have shown that even completely unsupervised learning methods are able to identify meaningful structures present in data and can facilitate high-level decisions. But despite their remarkable success in practice, our conceptual understanding of structure learning approaches is highly limited. The issue is so basic that even if we are given two reasonable solutions to some problem (for example, two possible segmentations of an image), we are unable to make a well-founded judgment as to which one is better. Typical forms of evaluation are quite subjective, such as “this segmentation looks more natural” or “this has a higher level of correlation with human annotation”. Such evaluation is hard to apply in domains where our own intuition is limited, such as bioinformatics or neuroscience. Model order selection and a proper choice of evaluation procedures have remained open questions in structure learning for over half a century. The lack of solid theoretical foundations has given rise to multiple heuristic approaches which are hard to compare and which ultimately slow down the development of the whole field.

The picture is completely different in supervised learning. The first advantage of supervised learning is that it has a well-defined learning objective: the prediction of a label. From this point it becomes clear how to conduct a formal analysis of different learning approaches (usually in the form of a derivation of generalization or sample complexity bounds) and how to evaluate them. Nowadays, most successful classification algorithms are accompanied by generalization guarantees, and many were derived as algorithms for the optimization of generalization guarantees. The existence of a clear objective and of generalization guarantees for most algorithms makes it possible to compare solutions to the same problem obtained by different approaches (e.g., SVMs and decision trees), both theoretically and practically. The ability to conduct a formal analysis and to compare different approaches accounts for the rapid rise of supervised learning in recent decades.

In this thesis it is claimed that the ill-posed nature of unsupervised learning approaches, and in particular of unsupervised approaches to structure learning, stems from the fact that unsupervised learning problems are usually taken out of context. Here we argue that one does not learn structure just for the sake of learning structure, but rather in order to facilitate the solution of some higher-level task. By identifying that task and looking at structure learning from the point of view of its utility in solving the higher-level task, we return the structure learning problem to its context. This enables a clear and objective comparison of different approaches to structure learning, similar to the way it is done in supervised learning. We can also examine in which situations knowing the structure is beneficial to solving a task, and whether unstructured methods such as SVMs or Principal Component Analysis (PCA) would perform better. The question of model order selection can be answered naturally in this context. It also improves our understanding of which questions the structure we have learned can answer and which it cannot. Thus, it is not only desirable but necessary to consider structure learning within the wider context of its possible applications.

We demonstrate our approach to the formulation of structure learning within the context of a higher-level task using the example of co-clustering. Co-clustering is a widely used approach to the analysis of data matrices by simultaneous grouping (clustering) of “similar” rows and columns of a data matrix; for instance, clustering of viewers and movies in collaborative filtering, genes and conditions in gene expression data analysis, or words and documents in text analysis. Co-clustering was traditionally considered an unsupervised approach to data analysis. Various solutions were designed by approximating different properties of the data at hand, but in most cases there was no way to compare different solutions or to perform model order selection. This dissertation examines co-clustering solutions in the context of their ability to predict new events generated by the same distribution that generated the data matrix. Within this context it becomes possible to carry out a generalization analysis of co-clustering, to perform model order selection, and to compare co-clustering with other approaches to solving the same problem.
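The grid structure that co-clustering imposes on a data matrix can be sketched in a few lines. The following code is illustrative only (the ratings matrix and the partitions are invented, and the block-mean predictor is a simplification, not the thesis’s method): missing entries are predicted from the mean of the (row-cluster, column-cluster) block they fall into.

```python
import numpy as np

# Toy viewers-by-movies ratings matrix; 0 marks an unobserved entry.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 2, 4, 5],
], dtype=float)

row_clusters = np.array([0, 0, 1, 1])   # assumed partition of viewers
col_clusters = np.array([0, 0, 1, 1])   # assumed partition of movies

def block_means(R, rc, cc, n_rc=2, n_cc=2):
    """Mean observed rating in each (row-cluster, column-cluster) block."""
    M = np.zeros((n_rc, n_cc))
    for a in range(n_rc):
        for b in range(n_cc):
            block = R[np.ix_(rc == a, cc == b)]
            observed = block[block > 0]
            M[a, b] = observed.mean() if observed.size else 0.0
    return M

M = block_means(R, row_clusters, col_clusters)

def predict(i, j):
    """Predict an unobserved entry from its block mean."""
    return M[row_clusters[i], col_clusters[j]]
```

For instance, `predict(0, 2)` falls back on the mean of the upper-right block. The thesis goes further by asking how well such a grid of predictions generalizes to new (viewer, movie) events, which is what allows partitions of different granularity to be compared.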

We find it convenient to use the PAC-Bayesian framework to carry out a formal analysis of the generalization properties of co-clustering. This work also makes several contributions to the domain of PAC-Bayesian generalization bounds, which up to now have been used solely in the context of generalization analysis of classification approaches. Here we derive a PAC-Bayesian generalization bound for density estimation. We also show that the bounds obtained for co-clustering are actually generalization bounds for a particular form of graphical model, where each hidden variable is connected to a single observed variable. The bounds suggest that the generalization power of such graphical models is optimized by a tradeoff between empirical performance and the mutual information preserved by the hidden variables on the observed variables. The regularization of the models by the mutual information is a consequence of the use of combinatorial priors in the PAC-Bayesian bounds, and is yet another contribution of our work. Our combinatorial priors can be compared to the Gaussian and Laplacian priors used in other works on PAC-Bayesian bounds, which result in L2-norm and L1-norm regularization, respectively.
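The mutual information term in this tradeoff can be computed directly from the joint distribution of a variable and its cluster assignment. A minimal sketch (illustrative, not code from the thesis): a hard partition that collapses four equiprobable values into two balanced clusters preserves log 2 nats of information about the variable, while the trivial one-cluster partition preserves none, so finer partitions pay a larger regularization cost.

```python
import numpy as np

def mutual_information(joint):
    """I(X;C) in nats from a joint probability table p(x, c)."""
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    pc = joint.sum(axis=0, keepdims=True)   # marginal p(c), row vector
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ pc)[mask])).sum())

# Hard clustering: each value of X maps to exactly one cluster.
n = 4
hard = np.zeros((n, 2))
hard[:2, 0] = 0.25      # values 0, 1 -> cluster 0
hard[2:, 1] = 0.25      # values 2, 3 -> cluster 1
trivial = np.full((n, 1), 0.25)   # everything in one cluster
```

Here `mutual_information(hard)` equals log 2 and `mutual_information(trivial)` equals 0, matching the intuition that a cluster variable can preserve at most log(number of clusters) nats about the observed variable.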

In the applications chapter we show that the tradeoff between empirical performance and the mutual information preserved on the observed variables, as derived from the bounds, enables complete control over the regularization of the co-clustering model. It further achieves state-of-the-art results on the MovieLens collaborative filtering dataset.

In an excursus, we show that the bounds developed for co-clustering can also be applied to classification by a single parameter (e.g., prediction of the probability of a disease based on the country visited). In the applications chapter we demonstrate that the bounds are especially tight in this task, falling less than 15% away from the test error. We suggest that the bounds can be applied to feature ranking, and we further show that such an approach to feature ranking, where features are ranked by their generalization potential, is much more successful than feature ranking based on empirical mutual information or on normalized correlation with the label.
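The empirical-mutual-information baseline mentioned above can be sketched as follows (an illustrative plug-in estimator, not the thesis’s implementation): each discrete feature is scored by the empirical estimate of its mutual information with the label, and features are sorted by that score.

```python
from collections import Counter
import math

def empirical_mi(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px = Counter(xs)             # marginal counts of X
    py = Counter(ys)             # marginal counts of Y
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def rank_features(feature_columns, labels):
    """Indices of feature columns sorted by empirical MI with the label."""
    scored = [(empirical_mi(col, labels), i)
              for i, col in enumerate(feature_columns)]
    return [i for _, i in sorted(scored, reverse=True)]
```

A feature identical to the label scores log 2 nats on a balanced binary problem, while an independent feature scores 0. The thesis’s point is that this empirical score overstates a feature’s value; ranking by the generalization potential given by the bounds corrects for that.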

As a continuation of the main thrust of the thesis, we show that it is possible to extend the analysis of co-clustering to more complex domains. Specifically, we show that the same kind of tradeoff between empirical performance and the mutual information that the hidden variables preserve on the observed variables also holds in graphical models where the hidden variables are connected in a tree shape and the observed variables are at the leaves of the tree. We also demonstrate that PAC-Bayesian bounds are able to exploit the factor form of graphical models, making it possible to derive bounds that depend on the sizes of the cliques of the graph rather than on the size of the parameter space. We further suggest viewing the problem of learning graphical models from the point of view of the model’s ability to predict new events generated by the same distribution, rather than from the frequently applied practice of fitting the training set. Extensions of the results to more complex graphical models, as well as the development of efficient algorithms for optimizing the structures of graphical models with respect to the bounds, are subjects for future research.
