Inference with the Universum

In this paper we study a new framework introduced by Vapnik (1998) and Vapnik (2006) that is an alternative capacity concept to the large margin approach. In the particular case of binary classification, we are given a set of labeled examples, and a collection of "non-examples" that do not belong to either class of interest. This collection, called the Universum, allows one to encode prior knowledge by representing meaningful concepts in the same domain as the problem at hand. We describe an algorithm to leverage the Universum by maximizing the number of observed contradictions, and show experimentally that this approach delivers accuracy improvements over using labeled data alone.


Introduction and Motivation
In this article we study the following task: construct a function $y = f(x)$ given a set of labeled examples $L = \{(x_i, y_i)\}_{i=1,\dots,m} \subset \mathbb{R}^d \times \{\pm 1\}$ drawn from $P(x, y)$, in order to minimize the risk functional, that is, the expected classification error under $P(x, y)$. Along with the labeled set we are given a set $U = \{x^*_1, \dots, x^*_{|U|}\}$ of unlabeled examples that are known not to belong to either class. The set U is called the Universum. It contains data that belongs to the same domain as the problem of interest and is expected to represent meaningful information related to the pattern recognition task at hand.
In the absence of a Universum, the method suggested by Statistical Learning Theory (Vapnik, 2006) is to minimize the training error whilst controlling the capacity of the set of functions F in use. The Structural Risk Minimization principle (SRM) suggests constructing a structure $S_1 \subset \dots \subset S_n$ on the set of admissible functions, such that elements with smaller indices are lower-capacity sets of functions. One then chooses the appropriate $S_k$ by minimizing a probabilistic upper bound on the test error of a classification model, e.g. using the VC dimension. Such a scheme justifies popular algorithms such as Support Vector Machines (SVMs) (Boser et al., 1992). SVMs perform regularization which is somewhat agnostic to the data distribution, as the VC dimension is a measure of capacity that holds for all possible distributions. The structure that one constructs contains no prior knowledge of the problem.
In this article we analyse a new method for encoding prior knowledge, following Vapnik (1998) and Vapnik (2006). It works by constructing a data-dependent structure $S_1 \subset \dots \subset S_n$ on the set of admissible functions by using the Universum examples. These examples implicitly specify a prior distribution on the set of functions F, relating our approach to Bayesian approaches (Bernardo & Smith, 1994). Supplying a set of Universum examples, rather than defining such a distribution explicitly, can be a far easier task.
Universum examples should be collected to reflect information about the domain of our problem of interest. For example, if we are solving the problem of digit recognition, the Universum could be objects written approximately in the same style as digits (but not necessarily digits), e.g. letters or mathematical symbols (see Figure 2). In this case, the distribution of the Universum in the feature space characterizes a domain where the dominant concepts are not pixels, but pens and strokes. Hence, the structure we construct is related to the problem at hand, and not to the arbitrary choice of a feature space.
To motivate our approach, let us first return to capacity control in the absence of a Universum, by way of the maximal margin principle.

Maximal Margin Principle
The simplest description of SLT is obtained when the set of admissible functions contains N elements $f(x, \alpha_1), \dots, f(x, \alpha_N)$. In this situation, with probability $1 - \eta$, the following bound holds simultaneously for all N functions:

$$R(\alpha_i) \le \nu(\alpha_i | L) + \sqrt{\frac{\ln N - \ln \eta}{2m}}, \tag{2}$$

where $R(\alpha_i)$ is the risk of the function $f(x, \alpha_i)$ and $\nu(\alpha_i | L)$ is the fraction of examples in L incorrectly classified by it.
The Structural Risk Minimization principle suggests constructing a structure $S_1 \subset \dots \subset S_n$ on the set of admissible functions, where the element $S_r$ contains $N_r$ functions. One then chooses the element $S_r$ and the function $f \in S_r$ (from the $N_r$ possible functions) that minimize the right-hand side of equation (2).
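The selection rule for a finite structure can be illustrated numerically. The sketch below assumes the bound has the standard Hoeffding plus union-bound form, training error plus sqrt((ln N - ln η) / (2m)); the function names and the example structure (sizes and training errors) are hypothetical and serve only as an illustration.

```python
import math


def finite_class_bound(train_error, n_functions, m, eta=0.05):
    """Training error plus the capacity term sqrt((ln N - ln eta) / (2 m))."""
    capacity = math.sqrt((math.log(n_functions) - math.log(eta)) / (2 * m))
    return train_error + capacity


def select_structure_element(elements, m, eta=0.05):
    """Pick the structure element whose best function has the smallest bound.

    `elements` is a list of (N_r, lowest training error achieved in S_r).
    Returns (index of the chosen element, list of bound values).
    """
    bounds = [finite_class_bound(err, n_r, m, eta) for n_r, err in elements]
    best = min(range(len(bounds)), key=lambda r: bounds[r])
    return best, bounds


# Hypothetical structure: richer elements fit the data better
# but pay a larger capacity term.
elements = [(10, 0.30), (1000, 0.10), (10**9, 0.09)]
best, bounds = select_structure_element(elements, m=200)
```

With these made-up numbers the middle element is selected: the smallest element underfits, while the largest one gains too little in training error to offset its capacity term.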
The generalization of this scheme for an infinite number of functions is more technical than conceptual. It replaces the capacity concepts N i with more advanced ones such as the VC dimension. Using the VC dimension capacity concept in the infinite case one obtains the same type of bounds where these concepts of capacity just replace the number of functions.
SRM is a very general scheme. The only requirement is to construct the structure before the training data appear. Using more sophisticated mathematics, one can obtain similar bounds without this requirement (Shawe-Taylor et al., 1998). Alternatively, using the transductive setup (Vapnik, 1998) this requirement is no longer needed.
Ignoring this technical point, the justification of the maximum margin principle can be achieved as follows. Suppose we choose F as the set of hyperplanes. Given our training set $x_1, \dots, x_m$ we can factorise the infinite set of hyperplanes into a finite number of equivalence classes $\Gamma_1, \dots, \Gamma_N$. Two functions belong to the same equivalence class if they give the same labeling on the training data.
Let us associate with each equivalence class $\Gamma_i$ the margin $\rho_i$ of the decision boundary $f_i \in \Gamma_i$ that separates the patterns $x_1, \dots, x_m$ with the largest margin.
Therefore we obtain N pairs $(f_1, \rho_1), \dots, (f_N, \rho_N)$. Now let us create the following structure: the element $S_r$ of the structure contains all functions $f_i$ for which $\rho_i \ge a_r$, where $a_1 > a_2 > \dots > a_N > 0$. The motivation for maximal margin is that the set of hyperplanes separating data in a sphere with margin larger than a has VC dimension $h < \varphi(a)$. That is, by maximizing the margin we are minimizing the VC dimension, effectively the second term in equation (2).
Practically, this motivates the following algorithm, the Support Vector Machine (SVM). Support Vector Machines utilize linear discriminant functions $f_{w,b}(x) = (w \cdot x) + b$, and penalize poorly recognized patterns with the Hinge loss function (Figure 1, left), which is a convex approximation to the step function. They thus minimize:

$$\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{m} H_1[y_i f_{w,b}(x_i)], \tag{3}$$

where $H_\theta[t] = \max\{0, \theta - t\}$ is the Hinge loss. The regularization term $\|w\|_2^2$ causes the maximization of the margin between the two classes (Vapnik, 1998).
SVMs perform regularization which (despite being from a data-dependent class) is still somewhat agnostic to the particular distribution that generates the training data, as the VC dimension is a measure of capacity that holds for all possible distributions. Indeed, the SVM regularizer $\|w\|_2^2$ bounds the norm of the gradient of the discriminant function, and hence favors "smooth" discriminant functions.
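As a concrete reading of the SVM criterion, the sketch below evaluates the Hinge loss $H_\theta[t] = \max\{0, \theta - t\}$ and an objective of the standard soft-margin form, half the squared norm of w plus C times the summed Hinge loss $H_1$ on the labeled points. The function names and the two-point example are illustrative assumptions, not part of the original text.

```python
def hinge(t, theta=1.0):
    """Hinge loss H_theta[t] = max{0, theta - t}."""
    return max(0.0, theta - t)


def svm_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i H_1[y_i * f(x_i)] for f(x) = w . x + b."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    regularizer = 0.5 * sum(wi * wi for wi in w)
    loss = sum(hinge(yi * f(xi)) for xi, yi in zip(X, y))
    return regularizer + C * loss


# Two labeled points: the first lies outside the margin (no loss),
# the second is correctly classified but inside the margin (loss 0.5).
X = [[2.0, 0.0], [-0.5, 0.0]]
y = [1, -1]
value = svm_objective([1.0, 0.0], 0.0, X, y, C=1.0)
```

The second point shows why the Hinge loss is described as penalizing "poorly recognized" patterns: it is on the correct side of the boundary, yet it still contributes to the objective because its margin is below 1.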
In the next section we consider another idea of assigning value to the equivalence class Γ r using instead the Universum set. The motivation for constructing such a structure is the ability to encode prior knowledge into the capacity control mechanism of our resulting algorithm.

Maximal Contradiction on Universum Principle
Suppose that along with training data we are given another set of data, called the Universum, $U = \{x^*_1, \dots, x^*_{|U|}\}$. The novelty in our procedure is how we construct the pairs $(f_1, \rho_1), \dots, (f_N, \rho_N)$. Let us consider the set of equivalence classes $\Gamma_r$ as before. We say that an element $x^*_t$ from the Universum makes a contradiction on the equivalence class $\Gamma_r$ if in $\Gamma_r$ there are two functions $f(x, \alpha_1)$ and $f(x, \alpha_2)$ such that $f(x^*_t, \alpha_1) < 0$ and $f(x^*_t, \alpha_2) > 0$. We will count the total number of contradictions on the Universum and use this value to replace the value $\rho_k$ from before. By defining a relevant Universum set, this will give a measure of complexity of a structure that can be related to the problem at hand, and not to the arbitrary choice of a feature space, as with the margin-based principle.
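The notion of a contradiction can be made concrete with a small sketch. It assumes an equivalence class is represented by a finite sample of hyperplanes (w, b) that all label the training points identically, so the count it returns is only a lower bound on the number of contradictions over the whole class. All names and data are hypothetical.

```python
def decision(w, b, x):
    """Real-valued output of the hyperplane f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def same_equivalence_class(hyperplanes, X_train):
    """True if all hyperplanes give the same labeling of the training points."""
    labelings = {
        tuple(decision(w, b, x) > 0 for x in X_train) for w, b in hyperplanes
    }
    return len(labelings) == 1


def count_contradictions(hyperplanes, universum):
    """Count Universum points on which two sampled functions disagree in sign."""
    count = 0
    for x in universum:
        values = [decision(w, b, x) for w, b in hyperplanes]
        if min(values) < 0 < max(values):
            count += 1
    return count


X_train = [[2.0, 1.0], [-2.0, -1.0]]
hyperplanes = [([1.0, 0.0], 0.0), ([1.0, 1.0], 0.0)]  # same labeling on X_train
universum = [[0.5, -1.0], [2.0, 0.0], [-1.0, 0.5]]
n_contradictions = count_contradictions(hyperplanes, universum)
```

Only the first Universum point lies between the two sampled hyperplanes, so it is the only one counted.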
The motivation of this idea is that the number of contradictions connects the SRM principle with the use of Bayesian priors (Bernardo & Smith, 1994).
A natural way to encode prior knowledge into an algorithm is to define a prior distribution on the functions in F. Suppose we know a prior distribution P (w) on the set of hyperplanes.
We could use this to build a structure on our set of functions as follows. Let us factorise our set of functions into equivalence classes as before, and define $w \in \Omega_r$ as the coefficients of hyperplanes in the equivalence class $\Gamma_r$. We can now measure the quality of an equivalence class using our prior knowledge:

$$\rho(\Gamma_r) = \int_{\Omega_r} dP(w). \tag{4}$$

The problem with this approach is that defining the distribution P(w) is very hard. Using the Universum solves this problem by replacing it with an easier one: it allows the user to encode prior knowledge via a set of examples, rather than a distribution on parameters. However, defining a Universum set is approximately equivalent to choosing a distribution P(w).
Taking into account the duality of x space and w space, for any measure P(w) there is a measure ν(x) such that the fraction of contradictive points in x approximates (4). The points in the Universum are samples from this distribution.

Universum Algorithm
We can now describe the algorithm for learning with a Universum.
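We will use the ε-insensitive loss (Figure 1, middle):

$$U[t] = H_{-\varepsilon}[t] + H_{-\varepsilon}[-t].$$

For ε = 0 we have the L1 loss. Other loss functions are possible, e.g. the L2 loss as in Figure 1, right. This loss measures the real-valued output of our classifier f(x) on the Universum points $x^*_1, \dots, x^*_{|U|}$ and penalizes outputs that are far from zero. We then wish to minimize the total loss:

$$\sum_{j=1}^{|U|} U[f(x^*_j)].$$

This approximates our goal of finding an equivalence class with a large number of contradictions on the Universum: if $f(x^*_j)$ is close to zero, then only a small change in f will cause a contradiction on $x^*_j$. There are many implementations possible, but we choose to add this term to the standard SVM objective function. That is, we minimize:

$$\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{m} H_1[y_i f_{w,b}(x_i)] + C_U \sum_{j=1}^{|U|} U[f_{w,b}(x^*_j)].$$

We call this algorithm U-SVM. The loss on the Universum points enters the SVM-type optimization problem via convex constraints $|f_{w,b}(x^*_j)| \le \varepsilon + \eta_j$ with slack variables $\eta_j \ge 0$. This optimization problem is convex, and just like SVMs the solution can also be computed in dual variables. The only difference is that the Universum loss corresponds to adding the Universum points twice with opposite labels and changing the linear part of the objective function, because the Universum cost function in Figure 1 (middle) is a linear combination of two Hinge losses.

Figure 1. From left to right, the Hinge loss and the ε-insensitive and L2 losses. The ε-insensitive loss is a linear combination of two Hinge losses. Here it is shown with ε = 0.25. The L2 loss is a simple quadratic function.

For i = 1, ..., |U|, let us define:

$$y_{m+i} = 1, \quad x_{m+i} = x^*_i, \qquad y_{m+|U|+i} = -1, \quad x_{m+|U|+i} = x^*_i.$$

After some algebra, the problem becomes:

$$\max_{\alpha} \; \sum_{i=1}^{m+2|U|} \rho_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m+2|U|} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to $0 \le \alpha_i \le C$ and $\rho_i = 1$ for $1 \le i \le m$, $0 \le \alpha_i \le C_U$ and $\rho_i = -\varepsilon$ for $m < i \le m + 2|U|$, and $\sum_{i=1}^{m+2|U|} y_i \alpha_i = 0$, where K is the kernel (the inner product in the linear case).

We note that a similar optimization problem is considered in Zhong and Fukushima (2006), but with quite different motivations.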
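To make the U-SVM criterion tangible, the sketch below implements a linear version under stated assumptions: the objective is taken to be half the squared norm of w, plus C times the Hinge loss $H_1$ on labeled points, plus $C_U$ times the ε-insensitive loss on Universum points, with the ε-insensitive loss written as the two Hinge losses $H_{-\varepsilon}[t] + H_{-\varepsilon}[-t]$. The solver is plain subgradient descent on the primal, chosen for brevity; it is not the dual formulation described in the text, and the data and step sizes are made up.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def hinge(t, theta):
    """H_theta[t] = max{0, theta - t}."""
    return max(0.0, theta - t)


def universum_loss(t, eps):
    """eps-insensitive loss as two hinge losses: H_{-eps}[t] + H_{-eps}[-t]."""
    return hinge(t, -eps) + hinge(-t, -eps)


def usvm_objective(w, b, X, y, U, C, C_U, eps):
    reg = 0.5 * dot(w, w)
    labeled = sum(hinge(yi * (dot(w, xi) + b), 1.0) for xi, yi in zip(X, y))
    universum = sum(universum_loss(dot(w, u) + b, eps) for u in U)
    return reg + C * labeled + C_U * universum


def train_usvm(X, y, U, C=1.0, C_U=1.0, eps=0.1, lr=0.01, epochs=500):
    """Subgradient descent on the primal (an illustrative solver only)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of 0.5 * ||w||^2
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:  # hinge loss is active
                grad_w = [g - C * yi * v for g, v in zip(grad_w, xi)]
                grad_b -= C * yi
        for u in U:
            t = dot(w, u) + b
            # Subgradient is nonzero only outside the eps-tube around zero.
            s = 1.0 if t > eps else (-1.0 if t < -eps else 0.0)
            grad_w = [g + C_U * s * v for g, v in zip(grad_w, u)]
            grad_b += C_U * s
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data: two classes along the diagonal, Universum points across it.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
U = [[1.0, -1.0], [-1.0, 1.0]]
w, b = train_usvm(X, y, U)
```

On this toy problem the Universum points sit on the anti-diagonal, so the Universum term pushes the solution toward hyperplanes whose output is near zero there, which is consistent with the intended effect of the extra loss term.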
Collecting Universum examples. We believe it should be easy to collect or construct Universum data for many different types of problems. We already gave the example of optical character recognition. Some other examples include 3D object detection (any set of objects could be used to learn about stereo vision), speech recognition (languages other than the one of interest could be Universum examples) and so on.
In cases where Universum data is not easily available in abundance, one can instead use a priori domain knowledge to construct purely artificial examples. One could potentially construct fake handwritten symbols by simulating pens and strokes, fake 3D objects by simulating the stereo mapping, or synthesize fake sounds. We explore both real and synthesized Universum examples in our experiments.

Regularization with the Universum
The term in U-SVM that takes into account the Universum points can be seen as a regularizer defined by the Universum data.
This section explores how we can recover a wide range of known regularizers by defining special sets of Universum points. We also describe some novel regularization strategies.
Isotropic L2 regularization. Let us consider a linear classifier without threshold, $f_w(x) = w \cdot x$, and apply the L2 Universum loss $U[t] = t^2$ (Figure 1, right) to n Universum examples $x^*_k$ whose coefficients are all zero apart from the k-th, which is 1. The Universum term then becomes

$$\sum_{k=1}^{n} U[f_w(x^*_k)] = \sum_{k=1}^{n} w_k^2 = \|w\|_2^2.$$
One recognizes the standard L 2 regularizer.
Anisotropic L2 regularization. More generally, take a linear classifier $f_{w,b}(x) = w \cdot x + b$ and apply the L2 Universum loss to Universum examples with mean 0 and covariance matrix M. The Universum term becomes

$$\sum_{k=1}^{|U|} (w \cdot x^*_k + b)^2 = |U| \left( w^\top M w + b^2 \right).$$

This regularizer uses the L2 metric weighted by the covariance matrix M of the Universum examples. When the covariance matrix is diagonal, this amounts to whitening this covariance matrix by rescaling the features. This is reminiscent of the TF/IDF normalization in text processing, where common features are downweighted because rare features prove more useful for discrimination.
This regularizer also penalises the threshold b. Intuitively, the center of mass of the Universum points must be located at a specific position in feature space. The Universum cannot implement a translation invariant regularizer. This is connected to the fact that one cannot define a uniform distribution on the whole affine space. This relates to the use of "improper priors" in Bayesian setups.
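The anisotropic case, including the penalty on the threshold b, can be checked numerically. The sketch below assumes the L2 Universum loss is $U[t] = t^2$ and that M is the uncentered second-moment matrix of a zero-mean Universum; under those assumptions the summed loss should equal $|U| (w^\top M w + b^2)$. The data and helper names are illustrative.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def universum_l2_penalty(w, b, U):
    """Sum of squared classifier outputs on the Universum points."""
    return sum((dot(w, u) + b) ** 2 for u in U)


def second_moment(U):
    """M = (1/|U|) * sum_k x_k x_k^T; equals the covariance when the mean is 0."""
    d = len(U[0])
    return [[sum(u[i] * u[j] for u in U) / len(U) for j in range(d)]
            for i in range(d)]


def quadratic_form(w, M):
    return sum(w[i] * M[i][j] * w[j]
               for i in range(len(w)) for j in range(len(w)))


# A zero-mean toy Universum (hypothetical data).
U = [[1.0, 2.0], [-1.0, -2.0], [2.0, -1.0], [-2.0, 1.0]]
w, b = [0.5, -1.0], 0.3

direct = universum_l2_penalty(w, b, U)
closed_form = len(U) * (quadratic_form(w, second_moment(U)) + b * b)
```

Both quantities agree up to rounding, and the $b^2$ term makes explicit why this regularizer cannot be translation invariant.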
L1 regularization, linear case. It is also possible to implement the L1 regularization that is commonly used for feature selection (Mangasarian, 1965). Consider a linear classifier without threshold, $f_w(x) = w \cdot x$, and apply the L1 Universum loss $U[t] = H_0[t] + H_0[-t]$ (Figure 1, center) to n Universum points $x^*_k$ whose coefficients are all zero apart from the k-th, which is 1. The Universum term then becomes

$$\sum_{k=1}^{n} U[f_w(x^*_k)] = \sum_{k=1}^{n} |w_k| = \|w\|_1.$$
One recognizes the standard L 1 regularizer.
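Both basis-vector constructions can be verified in a few lines. The sketch below builds the Universum of unit vectors and evaluates the summed Universum loss for a linear classifier without threshold, taking $U[t] = H_0[t] + H_0[-t]$ for the L1 case and, as an assumption, $U[t] = t^2$ for the L2 case. Function names are hypothetical.

```python
def basis_universum(n):
    """n Universum points; the k-th has a 1 in coordinate k and 0 elsewhere."""
    return [[1.0 if i == k else 0.0 for i in range(n)] for k in range(n)]


def l1_universum_loss(t):
    """H_0[t] + H_0[-t] = max(0, -t) + max(0, t) = |t|."""
    return max(0.0, -t) + max(0.0, t)


def l2_universum_loss(t):
    return t * t


def universum_penalty(w, U, loss):
    """Sum of loss(w . x) over the Universum points x (no threshold)."""
    return sum(loss(sum(wi * xi for wi, xi in zip(w, x))) for x in U)


w = [3.0, -4.0, 0.5]
U = basis_universum(len(w))
l1_penalty = universum_penalty(w, U, l1_universum_loss)  # 1-norm of w
l2_penalty = universum_penalty(w, U, l2_universum_loss)  # squared 2-norm of w
```

Because each Universum point picks out a single coordinate of w, the choice of loss alone determines which norm is recovered.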
L1 regularization, kernel case. Usually one cannot implement the L1 regularizer with nonlinear kernel classifiers because the dual formulation does not apply and because the high dimension of the kernel-induced feature space makes the primal formulation too costly to compute.
Nevertheless the Universum formulation suggests a practical way to implement a form of input selection in the nonlinear case: simply take the Universum described in the section above. This will still perform input selection even for nonlinear kernels. Consider for instance a polynomial kernel of degree d defined on binary input variables. In this case this corresponds to an L 1 regularizer that applies only to the coefficients corresponding to the linear part of the decision function.

Experimental Analysis
In this section we test experimentally whether inference using Universum points is beneficial compared to standard supervised learning. In all cases, we compare a standard SVM to U-SVM, that is, to an SVM that also leverages Universum data. Some of our experiments also try to explore the question: what kind of Universum is useful, and when? Unless described otherwise, we employ an RBF kernel, with the width and soft-margin hyperparameter C tuned using a validation set. For U-SVM, we also tune the regularization parameters ε and C_U.

MNIST
We first took the MNIST digits 5 vs 8 as a two-class classification problem, to see the performance of U-SVM on a standard dataset. For this problem we considered four kinds of Universum:
(i) U_Noise: images of "random noise", generated by drawing uniformly distributed pixel features;
(ii) U_Rest: the other digits 0-9, excluding 5 and 8;
(iii) U_Gen: artificial images created by generating each pixel according to its discrete empirical distribution on the training set;
(iv) U_Mean: artificial images created by first selecting a random 5 and a random 8 from the training set, and then constructing the mean of these two digits.
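The three synthetic Universum sets can be sketched as follows. This is an illustration under assumptions: images are flat lists of pixel values, U_Noise draws pixels uniformly from [0, 1) (the pixel range is not stated in the text), and U_Gen samples each pixel independently from the values that pixel takes in the training set. U_Rest needs no code, since it is simply the other digit classes. All function names are hypothetical.

```python
import random


def u_noise(n_points, n_pixels, rng):
    """Random-noise images with uniformly distributed pixel values."""
    return [[rng.random() for _ in range(n_pixels)] for _ in range(n_points)]


def u_gen(train_images, n_points, rng):
    """Draw each pixel from its empirical distribution over the training set."""
    n_pixels = len(train_images[0])
    return [
        [rng.choice(train_images)[j] for j in range(n_pixels)]
        for _ in range(n_points)
    ]


def u_mean(fives, eights, n_points, rng):
    """Mean image of a random '5' and a random '8' from the training set."""
    points = []
    for _ in range(n_points):
        a, b = rng.choice(fives), rng.choice(eights)
        points.append([(p + q) / 2 for p, q in zip(a, b)])
    return points


# Tiny stand-in "images" with three pixels each (made-up values).
rng = random.Random(0)
fives = [[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]]
eights = [[1.0, 1.0, 0.5], [0.8, 0.6, 0.5]]
noise = u_noise(4, 3, rng)
generated = u_gen(fives + eights, 4, rng)
means = u_mean(fives, eights, 4, rng)
```

Sampling pixels independently in `u_gen` preserves per-pixel statistics but destroys stroke structure, whereas `u_mean` keeps the structure of both digits superimposed.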
U_Noise was included as a kind of "null" hypothesis to show that not just any Universum helps: it has to be related to the problem of interest. The results for different training set subset sizes are reported in Figure  1 2 . They show an improvement of U-SVM over SVM for every Universum apart from U_Noise.
Which part of the Universum is useful? Next, we tried to ascertain which digits from the Universum U_Rest were the most useful in improving the classification accuracy. The initial intuition is that the digits that are close to 5 and 8 should help most. We report the best test error on a test set of 1865 digits for algorithms trained on the whole training set of 11271 digits, and averaged over ten training sets of size 1000 and 200 that we sampled from the original training set. We always used the whole set of digits from one class as the Universum. (This setup is slightly different from before.) The sizes of these sets are around 6000 examples for each digit. The results are given in Table 3. They indicate that digits "3" and "6" are the most useful. This seems to match our intuition, as these digits seem somehow "in between" the digits "5" and "8" (see Table 3). Roughly speaking, one can say that the Universum loss penalizes features that have high values on the Universum points. The digit "3" covers most of the parts that appear in the digit "5" as well as in "8", which can therefore be considered less discriminative. Taking class 3 as the Universum is therefore a better choice than any of the other classes: it improves accuracy by assigning less relevance to those less discriminative features. Beyond that, the digits in a class are not perfectly aligned but rather subject to transformations like rotation or translation. Ideally a classifier should be invariant to those. But since the Universum points are subject to those transformations as well, the features that are affected are also assigned lower importance. This could also explain why an SVM using other digits as Universum also improves the performance over that of a plain SVM in most cases.

Reuters RCV1-V2
The Reuters dataset consists of over 800,000 news articles in English written by Reuters journalists between August 20, 1996 and August 19, 1997. We used the freely accessible preprocessed version of Lewis et al. (2004).
The task was to separate the category C15 from the remaining categories at the same level of the hierarchy, category CCAT (CORPORATE/INDUSTRIAL). The data was represented as a bag of words weighted by a TF/IDF scheme and normalized to Euclidean length one (see Lewis et al. (2004) for details).
We split the set of 13310 examples into training, validation and test sets. We generated training subsets of sizes 50, 100, 200, 500 and 1000. Ten sets for each size were randomly selected from the 6000 training points.
We chose two kinds of Universum, a real and an artificial one. We chose the category M14 (COMMODITY MARKETS) with 2540 examples as the real Universum. For the artificial Universum, we selected N = 10 examples from each class of the training set and added the mean of the closest examples from two different classes to the Universum. Altogether, we generated 1000 points in this manner. In the following text we call that Universum the MoC Universum (mean of closest). It might be worth noting that we generated a MoC Universum for each single split in order not to use additional information from other training examples.
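The text leaves some details of the MoC construction open, so the sketch below fixes one possible reading as an assumption: for each Universum point, draw N examples from each class, find the closest cross-class pair under Euclidean distance, and add the mean of that pair. The function names and the tiny dataset are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def moc_universum(pos, neg, n_points, n_select=10, rng=None):
    """Mean-of-closest Universum (one assumed reading of the procedure).

    Each point is the mean of the closest pair between a random draw of
    `n_select` positive and `n_select` negative training examples.
    """
    rng = rng or random.Random(0)
    universum = []
    for _ in range(n_points):
        a = rng.sample(pos, min(n_select, len(pos)))
        b = rng.sample(neg, min(n_select, len(neg)))
        p, q = min(
            ((p, q) for p in a for q in b),
            key=lambda pair: squared_distance(pair[0], pair[1]),
        )
        universum.append([(pi + qi) / 2 for pi, qi in zip(p, q)])
    return universum


# Tiny example: the closest cross-class pair is (0, 0) and (1, 0).
pos = [[0.0, 0.0], [10.0, 10.0]]
neg = [[1.0, 0.0], [20.0, 20.0]]
points = moc_universum(pos, neg, n_points=3, n_select=2)
```

Points built this way lie near the region between the two classes, where a classifier output close to zero is plausible.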
We used an RBF kernel, since preliminary tests showed that it performs slightly better than the linear kernel. The tuning of the regularization constants C, C_U and the kernel parameter was done on the first of the ten sets, with model selection using the validation set. For the best set of parameters we trained an SVM with and without Universum on each of the ten subsets for the different training set sizes and tested each on the test set. Table 4 shows the averaged results. Both U-SVMs perform better than a plain SVM. For a dataset size of 50 the improvement is on the order of 1% for the MoC Universum and on the order of 5% for the M14 Universum. With increasing dataset size the effect of the Universum vanishes. These results suggest that the prior knowledge from the M14 Universum really is important for the classifier. Especially for small dataset sizes, the M14 Universum can exhibit features that are not discriminative for the classification problem since they seem to occur throughout the dataset. As soon as the dataset has an adequate size, it provides enough information itself and the effect of the Universum disappears.
Similar results can be found on other text datasets as well, such as the WinMac dataset, a collection of newsgroup articles in two categories. Here, we took 10 random splits of training subset sizes 10, 25, 50, 75, 100, using a linear kernel. The results are given in Table 5. For larger training set sizes, the improvement again becomes negligible.

The ABCDETC Dataset
Most standard machine learning datasets of course do not come equipped with Universum data. For example, MNIST only contains digits of interest, forcing us to run somewhat artificial experiments on that dataset. Despite this, Universum data is in fact quite easy to collect. We therefore decided to collect our own handwritten symbol-based dataset comprising digits, upper and lower case letters, and a selection of symbols: , . ! ? ; : = − + / ( ) $ % " @ Thus we collected 78 classes in all. Subjects wrote in pen 5 versions of each symbol on a single gridded sheet. The sheets were scanned at 300 dpi, and the symbols were stored as 100 x 100 patches, which were automatically extracted and then centered using the center of mass of the pixels. In the following experiments, there are 51 subjects, resulting in a dataset of 19,646 examples after outlier removal. Figure 2 shows part of a typical sheet entered by a subject. The current dataset, and updates as we plan to expand it, will be available online before the ICML conference.
We performed experiments on predicting whether a letter is a lower case "a" or "b", using training sets of various sizes (20, 50, 100, 150 and 200), a validation set of size 200, and the remaining data as the test set (between 100 and 300 examples, depending on the size of the training set). We report results averaged over 10 random splits. We normalized the examples to have length 1, and chose to use polynomial kernels, $K(x, y) = (x \cdot y + 1)^d$. We compare standard SVMs to U-SVMs using four different Universum sets: (i) the set of remaining lower case letters, (ii) the set of upper case letters C-Z, (iii) the set of digits, and (iv) the set of symbols. In all cases we randomly sampled 1500 points so all the Universum sets are the same size. The results are reported in Table 6. They show that as the Universum set becomes intuitively "less relevant" to the problem at hand, the gain one gets from using it decreases.

Feature Selection Toy Datasets
We next tested the feature selection regularization of taking a Universum set $x^*_i = (0, \dots, 0, 1, 0, \dots, 0)$, where there is a 1 in the i-th dimension. In the linear case this is equivalent to adding a 1-norm regularizer; in the nonlinear case it also penalizes using many input features. We constructed two toy problems: a linear one with 2 relevant features in an AND problem and 18 noise inputs, and a nonlinear one with 2 relevant features in an XOR problem with 4 noise inputs. All input features are generated from a uniform distribution. We generated 50 training points, a validation set of size 200, and a test set of size 1000, for 10 separate splits. We report error rates for linear, polynomial and RBF kernels for both SVMs and U-SVMs, where we tuned kernel hyperparameters, C and C_U on the validation set. The results given in Table 7 show a considerable improvement of U-SVMs over SVMs. However, we note that many other feature selection algorithms exist. We do not claim that this is the best one, but show it as another illustrative example of how constructing a Universum set can realize many different types of regularization.
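The two toy problems can be generated along the following lines. The text does not say how labels are derived from the uniformly distributed inputs, so this sketch assumes features uniform on [0, 1) and binarizes the two relevant features at 0.5 before applying AND or XOR; both choices, like the function name, are assumptions made for illustration.

```python
import random


def make_toy_problem(n_points, n_noise, rule, rng):
    """Two relevant features followed by `n_noise` irrelevant ones.

    `rule` maps the two binarized relevant features to True (label +1)
    or False (label -1). Thresholding at 0.5 is an assumption.
    """
    X, y = [], []
    for _ in range(n_points):
        x = [rng.random() for _ in range(2 + n_noise)]
        X.append(x)
        y.append(1 if rule(x[0] > 0.5, x[1] > 0.5) else -1)
    return X, y


rng = random.Random(0)
# Linear problem: AND of two relevant features, 18 noise inputs.
X_and, y_and = make_toy_problem(50, 18, lambda a, b: a and b, rng)
# Nonlinear problem: XOR of two relevant features, 4 noise inputs.
X_xor, y_xor = make_toy_problem(50, 4, lambda a, b: a != b, rng)
```

The AND labels are linearly separable in the two relevant features, while the XOR labels are not, which matches the linear versus nonlinear distinction drawn in the text.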

Discussion
The idea of adding new data to an existing training set in order to get better performance has been used in several different settings of the pattern recognition problem. In transductive and semi-supervised learning one leverages unlabeled data from the same distribution.
In the virtual examples methods (Baird, 1990;Leen, 1995;Schölkopf et al., 1996;Niyogi et al., 1998) and noise injection (Grandvalet et al., 1997), on the other hand, one creates labeled synthetic data that may not come from the same distribution.
The idea of using a Universum is also about the use of additional data. However, here we do not require either the same distribution or labelling.
The Universum idea is close to the Bayesian idea: the attempt to use prior knowledge. However, there is a conceptual difference between the two approaches. Our experiments show that the obtained performance depends on the quality of the Universum. The methodology of choosing the appropriate Universum is the subject of research. However, our results confirm that the Universum can be an important instrument for boosting performance, especially in the small sample size regime.