These are the notes for CS5487 Machine Learning (Antoni Chan).

Textbook: Pattern Recognition and Machine Learning, Bishop, Springer, 2006

1. Introduction

The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories.

A training set is used to tune the parameters of an adaptive model.

Each training example has a target vector t, which represents the identity of the corresponding digit.

The result of running the machine learning algorithm can be expressed as a function y(x), which takes a new digit image x as input and generates an output vector y, encoded in the same way as the target vectors.

The precise form of the function y(x) is determined during the training phase, also known as the learning phase, on the basis of the training data.

Once the model is trained, it can then determine the identity of new digit images, which are said to comprise a test set.

The ability to correctly categorize new examples that differ from those used for training is known as generalization. Generalization is a central goal in pattern recognition.

For most practical applications, the original input variables are typically pre-processed to transform them into some new space of variables where, it is hoped, the pattern recognition problem will be easier to solve. This pre-processing stage is sometimes also called feature extraction. Note that new test data must be pre-processed using the same steps as the training data.

Pre-processing might also be performed in order to speed up computation. The aim is to find useful features that are fast to compute, and yet that also preserve useful discriminatory information. Because the number of such features is smaller than the number of pixels, this kind of pre-processing represents a form of dimensionality reduction. Care must be taken during pre-processing because often information is discarded, and if this information is important to the solution of the problem then the overall accuracy of the system can suffer.
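As a rough illustration (my own sketch, not from the notes), the code below performs one simple form of such dimensionality-reducing pre-processing: a hypothetical 28x28 digit image is reduced to a 7x7 grid of block averages, so a classifier would work with 49 features instead of 784 pixels.

```python
import numpy as np

def block_average_features(image, block=4):
    """Downsample an image by averaging non-overlapping blocks.

    A simple form of dimensionality reduction: a 28x28 image with block=4
    becomes a 7x7 grid, i.e. 49 features instead of 784 pixels.
    """
    h, w = image.shape
    assert h % block == 0 and w % block == 0
    return image.reshape(h // block, block, w // block, block).mean(axis=(1, 3)).ravel()

# Hypothetical 28x28 "digit image" filled with random pixel intensities.
rng = np.random.default_rng(0)
image = rng.random((28, 28))
features = block_average_features(image)
print(features.shape)  # (49,) -- the same steps must be applied to the test images
```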

Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems. Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.

In other pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, which is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.

Finally, the technique of reinforcement learning (Sutton and Barto, 1998) is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward. Here the learning algorithm is not given examples of optimal outputs, in contrast to supervised learning, but must instead discover them by a process of trial and error. A major challenge of reinforcement learning is the credit assignment problem: a reward may arrive only after a long sequence of actions, and it must be attributed appropriately to the actions that led to it. A general feature of reinforcement learning is the trade-off between exploration and exploitation.

Three important tools will be used throughout the book, namely probability theory, decision theory, and information theory.

1.1 Example: Polynomial Curve Fitting

Individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay, but more typically is due to there being sources of variability that are themselves unobserved.

Probability theory, discussed in Section 1.2, provides a framework for expressing such uncertainty in a precise and quantitative manner, and decision theory, discussed in Section 1.5, allows us to exploit this probabilistic representation in order to make predictions that are optimal according to appropriate criteria.

Model comparison or model selection

The cause of over-fitting is that the fitted function becomes increasingly tuned to the random noise on the target values.

The over-fitting problem becomes less severe as the size of the data set increases. Another way to say this is that the larger the data set, the more complex (in other words, more flexible) the model that we can afford to fit to the data.

One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model. However, as we shall see in Chapter 3, the number of parameters is not necessarily the most appropriate measure of model complexity.

Also, there is something rather unsatisfying about having to limit the number of parameters in a model according to the size of the available training set. It would seem more reasonable to choose the complexity of the model according to the complexity of the problem being solved.

We shall see that the least squares approach to finding the model parameters represents a specific case of maximum likelihood (discussed in Section 1.2.5), and that the over-fitting problem can be understood as a general property of maximum likelihood. By adopting a Bayesian approach, the over-fitting problem can be avoided. We shall see that there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.

One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression. In the context of neural networks, this approach is known as weight decay.

Note that often the coefficient w_0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable, or it may be included but with its own regularization coefficient.
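As a minimal sketch (my own example, not from the notes), the code below fits a polynomial by regularized least squares, adding a quadratic penalty lambda*||w||^2 to the sum-of-squares error; following the note above, the bias coefficient w_0 is left unregularized. The data, polynomial order M, and value of lambda are hypothetical.

```python
import numpy as np

def fit_ridge_poly(x, t, M, lam):
    """Regularized least squares (ridge regression) for a degree-M polynomial.

    Minimizes  0.5 * ||Phi w - t||^2 + 0.5 * lam * ||w||^2,
    whose solution is  w = (lam*I + Phi^T Phi)^(-1) Phi^T t.
    """
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix: 1, x, x^2, ..., x^M
    reg = lam * np.eye(M + 1)
    reg[0, 0] = 0.0                              # leave the bias w_0 unregularized (see the note above)
    return np.linalg.solve(reg + Phi.T @ Phi, Phi.T @ t)

# Hypothetical noisy data from sin(2*pi*x), as in the curve-fitting example.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
w = fit_ridge_poly(x, t, M=9, lam=1e-3)
print(w)
```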

Here we simply note that, if we were trying to solve a practical application using this approach of minimizing an error function, we would have to find a way to determine a suitable value for the model complexity. The results above suggest a simple way of achieving this, namely by taking the available data and partitioning it into a training set, used to determine the coefficients w, and a separate validation set, also called a hold-out set, used to optimize the model complexity (either M or lambda). In many cases, however, this will prove to be too wasteful of valuable training data, and we have to seek more sophisticated approaches.
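A minimal sketch of this hold-out procedure, under the same hypothetical setup as the ridge example above: candidate values of lambda are fitted on the training set and compared by their error on a separate validation set.

```python
import numpy as np

def poly_design(x, M):
    return np.vander(x, M + 1, increasing=True)

def fit_ridge(Phi, t, lam):
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

def rms_error(Phi, t, w):
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

# Hypothetical data, split into a training set and a hold-out (validation) set.
rng = np.random.default_rng(2)
x = rng.random(30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
x_train, t_train = x[:20], t[:20]
x_val, t_val = x[20:], t[20:]

M = 9
Phi_train, Phi_val = poly_design(x_train, M), poly_design(x_val, M)
# Choose the regularization coefficient by validation-set error.
lams = [1e-8, 1e-6, 1e-3, 1e-1, 1.0]
errors = {lam: rms_error(Phi_val, t_val, fit_ridge(Phi_train, t_train, lam)) for lam in lams}
best_lam = min(errors, key=errors.get)
print(best_lam, errors[best_lam])
```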

1.2 Probability Theory

A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the finite size of data sets.

Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition.

When combined with decision theory, it allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous.

Random variable: a variable that can take any one of a set of possible values.

Two elementary rules of probability, known as the sum rule and the product rule.
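For reference, the two rules written out for random variables X and Y, together with Bayes' theorem, which follows from them:

```latex
% Sum rule (marginalization) and product rule
p(X) = \sum_{Y} p(X, Y), \qquad p(X, Y) = p(Y \mid X)\, p(X)

% Combining the two gives Bayes' theorem, with the denominator obtained by the sum rule:
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad
p(X) = \sum_{Y} p(X \mid Y)\, p(Y)
```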

conditional probability

Bayes’ theorem

joint distribution

marginal distribution and conditional distribution

We can provide an important interpretation of Bayes’ theorem as follows. If we had been asked which box had been chosen before being told the identity of the selected item of fruit, then the most complete information we have available is provided by the probability p(B). We call this the prior probability because it is the probability available before we observe the identity of the fruit. Once we are told that the fruit is an orange, we can then use Bayes’ theorem to compute the probability p(B|F), which we shall call the posterior probability because it is the probability obtained after we have observed F.
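A small numerical version of the box-and-fruit calculation (treat the numbers as illustrative): suppose the red box holds 6 oranges out of 8 fruit, the blue box 1 orange out of 4, and the prior is p(B=r) = 0.4, p(B=b) = 0.6.

```python
# Bayes' theorem: posterior over the box after observing an orange.
p_box = {"r": 0.4, "b": 0.6}                   # prior p(B)
p_orange_given_box = {"r": 6 / 8, "b": 1 / 4}  # likelihood p(F=orange | B)

# Evidence p(F=orange) via the sum rule, then posterior p(B | F=orange).
p_orange = sum(p_box[b] * p_orange_given_box[b] for b in p_box)
posterior = {b: p_box[b] * p_orange_given_box[b] / p_orange for b in p_box}
print(posterior)  # {'r': 0.666..., 'b': 0.333...} -- observing an orange raises p(red box)
```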

1.2.1 Probability densities

measure theory

1.2.2 Expectations and covariances

One of the most important operations involving probabilities is that of finding weighted averages of functions. The average value of some function f(x) under a probability distribution p(x) is called the expectation of f(x) and will be denoted by E[f].

continuous variables

probability density

conditional expectation

variance

covariance
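For reference, the definitions behind these fragments (written for a discrete variable; for a probability density, replace the sums by integrals):

```latex
\mathbb{E}[f] = \sum_{x} p(x)\, f(x), \qquad
\mathbb{E}_{x}[f \mid y] = \sum_{x} p(x \mid y)\, f(x)

\operatorname{var}[f] = \mathbb{E}\!\left[(f(x) - \mathbb{E}[f(x)])^{2}\right]
                      = \mathbb{E}[f(x)^{2}] - \mathbb{E}[f(x)]^{2}

\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\!\left[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\right]
                         = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]
```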

1.2.3 Bayesian probabilities

So far in this chapter, we have viewed probabilities in terms of the frequencies of random, repeatable events. We shall refer to this as the classical or frequentist interpretation of probability. Now we turn to the more general Bayesian view, in which probabilities provide a quantification of uncertainty.

Bayes’ theorem was used to convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data.

We would like to be able to quantify our expression of uncertainty and make precise revisions of uncertainty in the light of new evidence, as well as subsequently to be able to take optimal actions or decisions as a consequence. This can all be achieved through the elegant, and very general, Bayesian interpretation of probability.

We would like to address and quantify the uncertainty that surrounds the appropriate choice for the model parameters w. We shall see that, from a Bayesian perspective, we can use the machinery of probability theory to describe the uncertainty in model parameters such as w, or indeed in the choice of model itself.

We can adopt a similar approach when making inferences about quantities such as the parameters w in the polynomial curve fitting example. We capture our assumptions about w, before observing the data, in the form of a prior probability distribution p(w). The effect of the observed data D = {t_1, …, t_N} is expressed through the conditional probability p(D|w), and we shall see later, in Section 1.2.5, how this can be represented explicitly. Bayes' theorem, which takes the form p(w|D) = p(D|w)p(w)/p(D), then allows us to evaluate the uncertainty in w after we have observed D, in the form of the posterior probability p(w|D).

The quantity p(D|w) on the right-hand side of Bayes’ theorem is evaluated for the observed data set D and can be viewed as a function of the parameter vector w, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector w. Note that the likelihood is not a probability distribution over w, and its integral with respect to w does not (necessarily) equal one.

Given this definition of likelihood, we can state Bayes' theorem in words: posterior ∝ likelihood × prior, where all of these quantities are viewed as functions of w.

In both the Bayesian and frequentist paradigms, the likelihood function p(D|w) plays a central role. However, the manner in which it is used is fundamentally different in the two approaches. In a frequentist setting, w is considered to be a fixed parameter, whose value is determined by some form of ‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets D. By contrast, from the Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.

A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(D|w). This corresponds to choosing the value of w for which the probability of the observed data set is maximized. In the machine learning literature, the negative log of the likelihood function is called an error function. Because the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing the error.

One approach to determining frequentist error bars is the bootstrap (Efron, 1979; Hastie et al., 2001), in which multiple data sets are created as follows. Suppose our original data set consists of N data points X = {x_1, …, x_N}. We can create a new data set X_B by drawing N points at random from X, with replacement, so that some points in X may be replicated in X_B, whereas other points in X may be absent from X_B. This process can be repeated L times to generate L data sets each of size N and each obtained by sampling from the original data set X. The statistical accuracy of parameter estimates can then be evaluated by looking at the variability of predictions between the different bootstrap data sets.
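A minimal sketch of the bootstrap procedure described above, using the sample mean as a (hypothetical) stand-in for the estimator whose variability we want to assess:

```python
import numpy as np

def bootstrap_estimates(data, estimator, L=1000, seed=0):
    """Recompute `estimator` on L bootstrap resamples of `data`.

    Each resample draws N points from the original N with replacement,
    so some points are repeated and others are left out.
    """
    rng = np.random.default_rng(seed)
    N = len(data)
    return np.array([estimator(data[rng.integers(0, N, size=N)]) for _ in range(L)])

# Hypothetical data set; the spread of the bootstrap estimates provides an
# error bar on the estimate computed from the original data.
data = np.random.default_rng(3).standard_normal(50)
estimates = bootstrap_estimates(data, np.mean)
print(np.mean(data), estimates.std())
```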

One advantage of the Bayesian viewpoint is that the inclusion of prior knowledge arises naturally. Suppose, for instance, that a fair-looking coin is tossed three times and lands heads each time. A classical maximum likelihood estimate of the probability of landing heads would give 1, implying that all future tosses will land heads! By contrast, a Bayesian approach with any reasonable prior will lead to a much less extreme conclusion.
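A short worked version of the coin example (the Beta(2, 2) prior is my own choice, standing in for "any reasonable prior"):

```python
# Three tosses, all heads.
heads, tails = 3, 0

# Maximum likelihood: heads / (heads + tails) = 1.0, i.e. "all future tosses are heads".
p_ml = heads / (heads + tails)

# Bayesian: with a Beta(a, b) prior on the head probability, the posterior is
# Beta(a + heads, b + tails), whose mean is (a + heads) / (a + b + heads + tails).
a, b = 2, 2          # a mildly "fair-looking coin" prior (hypothetical choice)
p_bayes = (a + heads) / (a + b + heads + tails)

print(p_ml, p_bayes)  # 1.0 vs roughly 0.71 -- a much less extreme conclusion
```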

There has been much controversy and debate associated with the relative merits of the frequentist and Bayesian paradigms, which have not been helped by the fact that there is no unique frequentist, or even Bayesian, viewpoint. For instance, one common criticism of the Bayesian approach is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs. The subjective nature of the conclusions, through their dependence on the prior, is one motivation for so-called noninformative priors. However, these lead to difficulties when comparing different models, and indeed Bayesian methods based on poor choices of prior can give poor results with high confidence. Frequentist evaluation methods offer some protection from such problems, and techniques such as cross-validation remain useful in areas such as model comparison.

This book places a strong emphasis on the Bayesian viewpoint, reflecting the huge growth in the practical importance of Bayesian methods in the past few years, while also discussing useful frequentist concepts as required.

Although the Bayesian framework has its origins in the 18th century, the practical application of Bayesian methods was for a long time severely limited by the difficulties in carrying through the full Bayesian procedure, particularly the need to marginalize (sum or integrate) over the whole of parameter space, which, as we shall see, is required in order to make predictions or to compare different models. The development of sampling methods, such as Markov chain Monte Carlo (discussed in Chapter 11), along with dramatic improvements in the speed and memory capacity of computers, opened the door to the practical use of Bayesian techniques in an impressive range of problem domains. Monte Carlo methods are very flexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for small-scale problems.

More recently, highly efficient deterministic approximation schemes such as variational Bayes and expectation propagation (discussed in Chapter 10) have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale applications.

1.2.4 The Gaussian distribution (Normal distribution)

Maximizing the likelihood involves adjusting the mean and variance of the Gaussian so as to maximize this product.

One common criterion for determining the parameters in a probability distribution using an observed data set is to find the parameter values that maximize the likelihood function. This might seem like a strange criterion because, from our fore-going discussion of probability theory, it would seem more natural to maximize the probability of the parameters given the data, not the probability of the data given the parameters. In fact, these two criteria are related, as we shall discuss in the context of curve fitting.

Note that the bias of the maximum likelihood solution becomes less significant as the number N of data points increases.

The issue of bias in maximum likelihood lies at the root of the over-fitting problem that we encountered earlier in the context of polynomial curve fitting.
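A small simulation (my own illustration) of this bias: the maximum likelihood variance of a Gaussian, which divides by N, underestimates the true variance by a factor of (N-1)/N on average, and the effect shrinks as N grows.

```python
import numpy as np

rng = np.random.default_rng(4)
true_var = 4.0

for N in (3, 10, 100):
    # Many data sets of size N drawn from a Gaussian with variance true_var.
    samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, N))
    mu_ml = samples.mean(axis=1, keepdims=True)
    var_ml = ((samples - mu_ml) ** 2).mean(axis=1)   # ML estimate: divide by N
    # The average ML variance is close to (N-1)/N * true_var, i.e. biased low.
    print(N, var_ml.mean(), (N - 1) / N * true_var)
```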

1.2.5 Curve fitting re-visited

The goal in the curve fitting problem is to be able to make predictions for the target variable t given some new value of the input variable x on the basis of a set of training data comprising N input values X = (x_1, …, x_N)^T and their corresponding target values T = (t_1, …, t_N)^T. We can express our uncertainty over the value of the target variable using a probability distribution. For this purpose, we shall assume that, given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x, w) of the polynomial curve. Thus we have

p(t | x, w, beta) = N(t | y(x, w), beta^-1)

where, for consistency with the notation in later chapters, we have defined a precision parameter beta corresponding to the inverse variance of the distribution.

We now use the training data {x, t} to determine the values of the unknown parameters w and beta by maximum likelihood.

It is convenient to maximize the logarithm of the likelihood function.
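Writing out the log likelihood under the Gaussian noise model above makes the connection to least squares explicit: the last two terms do not depend on w, so maximizing with respect to w is equivalent to minimizing the sum-of-squares error, and maximizing with respect to beta afterwards gives the inverse noise variance as the average squared residual.

```latex
% Log likelihood of the training targets under the Gaussian noise model:
\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)
  = -\frac{\beta}{2} \sum_{n=1}^{N} \bigl\{ y(x_n, \mathbf{w}) - t_n \bigr\}^{2}
    + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)

% Maximizing over beta at w = w_ML gives
\frac{1}{\beta_{\mathrm{ML}}}
  = \frac{1}{N} \sum_{n=1}^{N} \bigl\{ y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n \bigr\}^{2}
```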

1.2.6 Bayesian curve fitting

1.3 Model Selection

In our example of polynomial curve fitting using least squares, we saw that there was an optimal order of polynomial that gave the best generalization. The order of the polynomial controls the number of free parameters in the model and thereby governs the model complexity.

With regularized least squares, the regularization coefficient lambda also controls the effective complexity of the model, whereas for more complex models, such as mixture distributions or neural networks, there may be multiple parameters governing complexity.

If data is plentiful, then one approach is simply to use some of the available data to train a range of models, or a given model with a range of values for its complexity parameters, and then to compare them on independent data, sometimes called a validation set, and select the one having the best predictive performance. If the model design is iterated many times using a limited size data set, then some over-fitting to the validation data can occur and so it may be necessary to keep aside a third test set on which the performance of the selected model is finally evaluated.

If the validation set is small, it will give a relatively noisy estimate of predictive performance. One solution to this dilemma is to use cross-validation. This allows a proportion (S-1)/S of the available data to be used for training while making use of all of the data to assess performance. When data is particularly scarce, it may be appropriate to consider the case S = N, where N is the total number of data points, which gives the leave-one-out technique.
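A minimal sketch of S-fold cross-validation (the model, a degree-3 polynomial fitted by least squares on hypothetical data, is my own choice): the data are split into S groups, each group is held out once for evaluation while training uses the remaining S-1 groups, and the S scores are averaged. Setting S = N gives leave-one-out.

```python
import numpy as np

def cross_val_rms(x, t, fit, predict, S=5):
    """S-fold cross-validation: train on S-1 folds, evaluate on the held-out fold."""
    folds = np.array_split(np.random.default_rng(5).permutation(len(x)), S)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        model = fit(x[train], t[train])
        errors.append(np.sqrt(np.mean((predict(model, x[held_out]) - t[held_out]) ** 2)))
    return np.mean(errors)

# Hypothetical model: degree-3 polynomial fitted by (unregularized) least squares.
fit = lambda x, t: np.polyfit(x, t, deg=3)
predict = lambda w, x: np.polyval(w, x)

rng = np.random.default_rng(6)
x = rng.random(40)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(40)
print(cross_val_rms(x, t, fit, predict, S=5))        # 5-fold cross-validation
print(cross_val_rms(x, t, fit, predict, S=len(x)))   # leave-one-out
```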

One major drawback of cross-validation is that the number of training runs that must be performed is increased by a factor of S, and this can prove problematic for models in which the training is itself computationally expensive. A further problem with techniques such as cross-validation that use separate data to assess performance is that we might have multiple complexity parameters for a single model (for instance, there might be several regularization parameters), and exploring combinations of settings for these parameters could, in the worst case, require a number of training runs that is exponential in the number of parameters.

Historically various “information criteria” have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models. For example, the Akaike information criterion, or AIC (Akaike, 1974). Such criteria do not take account of the uncertainty in the model parameters, however, and in practice they tend to favour overly simple models.

1.4 The Curse of Dimensionality

For practical applications of pattern recognition, however, we will have to deal with spaces of high dimensionality comprising many input variables.

The problem with an exponentially large number of cells is that we would need an exponentially large quantity of training data in order to ensure that the cells are not empty.

If we have D input variables, for a polynomial of order M, the growth in the number of coefficients is like D^M.

In spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface!
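A quick check of this statement (my own illustration): volumes in D dimensions scale as r^D, so the fraction of a unit-radius sphere's volume lying in a shell of thickness epsilon at the surface is 1 - (1 - epsilon)^D, which approaches 1 as D grows.

```python
# Fraction of a unit sphere's volume within a shell of thickness eps at the surface.
eps = 0.01
for D in (1, 2, 10, 100, 1000):
    print(D, 1 - (1 - eps) ** D)
# Even for a shell that is only 1% of the radius, the fraction approaches 1 as D grows.
```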

The severe difficulty that can arise in spaces of many dimensions is sometimes called the curse of dimensionality (Bellman, 1961).

The curse of dimensionality does not prevent us from finding effective techniques applicable to high-dimensional spaces. The reasons for this are twofold. First, real data will often be confined to a region of the space having lower effective dimensionality, and in particular the directions over which important variations in the target variables occur may be so confined. Second, real data will typically exhibit some smoothness properties (at least locally), so that for the most part small changes in the input variables will produce small changes in the target variables, and so we can exploit local interpolation-like techniques to allow us to make predictions of the target variables for new values of the input variables.

1.5 Decision Theory

The joint probability distribution p(x, t) provides a complete summary of the uncertainty associated with these variables. Determination of p(x,t) from a set of training data is an example of inference and is typically a very difficult problem whose solution forms the subject of much of this book.

In a practical application, however, we must often make a specific prediction for the value of t, or more generally take a specific action based on our understanding of the value t is likely to take, and this aspect is the subject of decision theory.

It is the subject of decision theory to tell us how to make optimal decisions given the appropriate probabilities.

Note that any of the quantities appearing in Bayes’ theorem can be obtained from the joint distribution p(x, t) by either marginalizing or conditioning with respect to the appropriate variables.

1.5.1 Minimizing the misclassification rate

Suppose that our goal is simply to make as few misclassifications as possible. We need a rule that assigns each value of x to one of the available classes. Such a rule will divide the input space into regions R_k called decision regions, one for each class, such that all points in R_k are assigned to class C_k. The boundaries between decision regions are called decision boundaries or decision surfaces.

1.5.2 Minimizing the expected loss

A loss function, also called a cost function, is a single, overall measure of the loss incurred in taking any of the available decisions or actions. Note that some authors consider instead a utility function, whose value they aim to maximize. These are equivalent concepts if we take the utility to be simply the negative of the loss, and throughout this text we shall use the loss function convention.

1.5.3 The reject option

In some applications, it will be appropriate to avoid making decisions on the difficult cases in anticipation of a lower error rate on those examples for which a classification decision is made. This is known as the reject option.

We can achieve this by introducing a threshold theta and rejecting those inputs x for which the largest of the posterior probabilities p(C_k|x) is less than or equal to theta.
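A minimal sketch of this decision rule (the posteriors and threshold are hypothetical): assign x to the class with the largest posterior probability p(C_k|x), unless that largest posterior is less than or equal to theta, in which case the input is rejected.

```python
import numpy as np

def decide(posteriors, theta):
    """Assign to the class with the largest posterior, or reject if it is <= theta.

    `posteriors` is an array of p(C_k | x) for one input x; returns a class
    index, or None for a rejected input.
    """
    k = int(np.argmax(posteriors))
    return k if posteriors[k] > theta else None

# Hypothetical posteriors for three inputs, with a threshold of 0.8.
print(decide(np.array([0.95, 0.03, 0.02]), theta=0.8))  # confident -> class 0
print(decide(np.array([0.40, 0.35, 0.25]), theta=0.8))  # ambiguous -> None (reject)
print(decide(np.array([0.10, 0.85, 0.05]), theta=0.8))  # confident -> class 1
```

Note that setting theta = 1 will cause every example to be rejected, whereas with K classes a threshold theta smaller than 1/K ensures that no examples are rejected.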

1.5.4 Inference and decision

We have broken the classification problem down into two separate stages: the inference stage, in which we use training data to learn a model for p(C_k|x), and the subsequent decision stage, in which we use these posterior probabilities to make optimal class assignments.

An alternative possibility would be to solve both problems together and simply learn a function that maps inputs x directly into decisions. Such a function is called a discriminant function.

1.5.5 Loss functions for regression

1.6 Information Theory

We begin by considering a discrete random variable x and we ask how much information is received when we observe a specific value for this variable. The amount of information can be viewed as the ‘degree of surprise’ on learning the value of x.

If we are told that a highly improbable event has just occurred, we will have received more information than if we were told that some very likely event has just occurred, and if we knew that the event was certain to happen, we would receive no information. Our measure of information content will therefore depend on the probability distribution p(x), and we therefore look for a quantity h(x) that is a monotonic function of the probability p(x) and that expresses the information content.

The entropy of the random variable x is the expectation of the information content h(x) = -log2 p(x), namely H[x] = -sum_x p(x) log2 p(x).

The nonuniform distribution has a smaller entropy than the uniform one.
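A small illustration (the nonuniform distribution is my own choice of example): a uniform distribution over 8 states has entropy 3 bits, while a nonuniform distribution over the same 8 states has lower entropy.

```python
import numpy as np

def entropy_bits(p):
    """Entropy H[x] = -sum_i p_i * log2(p_i), ignoring zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

uniform = np.full(8, 1 / 8)
nonuniform = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])

print(entropy_bits(uniform))     # 3.0 bits
print(entropy_bits(nonuniform))  # 2.0 bits -- the nonuniform distribution has lower entropy
```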

The noiseless coding theorem (Shannon, 1948) states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.