2. Probability Distributions

For parameter estimation, in a frequentist treatment, we choose specific values for the parameters by optimizing some criterion, such as the likelihood function. By contrast, in a Bayesian treatment we introduce prior distributions over the parameters and then use Byes’ theorem to compute the corresponding posterior distribution given the observed data.

One limitation of the parametric approach is that it assumes a specific functional form for the distribution, which may turn out to be inappropriate for a particular application. An alternative approach is given by nonparametric density estimation methods in which the form of the distribution typically depends on the size of the data set. Such models still contain parameters, but these control the model complexity rather than the form of the distribution. We end this chapter by considering three nonparametric methods based respectively on histograms, nearest-neighbours, and kernels.

2.1 Binary Variables

It is worth noting that the log likelihood function depends on the N observations x_n only through their sum. This sum provides an example of a sufficient statistic for the data under this distribution.

2.1.1 The beta distribution

As we have already noted, this can give severely over-fitted results for small data sets. In order to develop a Bayesian treatment for this problem, we need to introduce a prior distribution p(μ) over the parameter μ. We therefore choose a prior, called the beta distribution.

The parameters a and b are often called hyperparameters because they control the distribution of the parameter μ.

2.2 Multinomial Variables

Often, we encounter discrete variables that can take on one of K possible mutually exclusive states.

2.2.1 The Dirichlet distribution

We now introduce a family of prior distributions for the parameters {μk} of the multinomial distribution.

2.3 The Gaussian Distribution

The Gaussian, also known as the normal distribution.

Multivariate Gaussian distribution

The central limit theorem (due to Laplace) tells us that, subject to certain mild conditions, the sum of a set of random variables, which is of course itself a random variable, has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases.

2.3.1 Conditional Gaussian distributions

An important property of the multivariate Gaussian distribution is that if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similaryly, the marginal distribution of either set is also Guassian.

2.3.2 Marginal Gaussian distributions

2.3.3 Bayes’ theorem for Gaussian variables

2.3.4 Maximum likelihood for the Gaussian