Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
by Michael Gutmann and Aapo Hyvärinen
tl;dr The paper introduces a new principle for estimating the parameters of a statistical model: it learns by comparing the given data samples with samples drawn from a known (noise) distribution.
The paper can be found here.
My sample github implementation is here.
Introduction
In unsupervised learning, one can take a probabilistic approach: given a set of data samples, we assume that they are drawn from an unknown distribution, and the goal is to recover this distribution from the samples we have.
Specifically, we are given a set of samples $X := \{ x_i ;\ i = 1, 2, \dots, N \}$, drawn from an unknown distribution $P_d(x)$. We do not know the functional form of this distribution, but we assume one, denoted by $P_m(x; \alpha)$, where $\alpha$ is the parameter of this model. For example, for a Gaussian distribution, $\alpha := \{\mu, \sigma \}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.
Hence, as $\alpha$ changes, we get different distributions $P_m(x; \alpha)$. We further assume that there exists a value $\alpha^*$ for which $P_d(x) = P_m(x; \alpha^*)$ (equality here meaning the two densities agree as functions, i.e., for (almost) all $x$).
With this setup, the problem of finding $P_d(x)$ becomes that of estimating the parameter $\alpha$ of $P_m(x; \alpha)$.
This is the problem that the paper is looking to address: statistical estimation of parameters of a model.
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a common, and possibly the simplest, paradigm for estimating the parameters of a statistical model. In the words of my professor, it is applied when no other assumptions on the data and no priors on the parameters can be made. In simple terms, we choose the parameter under which the observed data is most probable, i.e., we maximize the (log) probability of the observed data under the model. The parameter $\alpha = \alpha_{MLE}$ that achieves this is the Maximum Likelihood Estimate.
In technical terms, the objective to be maximized is: \begin{equation} J(\alpha) := \log P_m(x_1, x_2, \dots, x_N; \alpha) = \log \prod_{i} P_m(x_i; \alpha) = \sum_i \log P_m(x_i; \alpha) \end{equation}
Here, we make the commonly used i.i.d. assumption on the given data.
This estimation is not easy. In particular, note that the solution must also satisfy $\int P_m(u;\alpha)\, du = 1$. A closed-form solution is available only for simple distributions. Even for a mixture of Gaussians (Gaussian Mixture Models), we need the iterative Expectation-Maximization algorithm, which may only converge to a local maximum. For complex distributions, the problem becomes intractable.
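As a quick illustration of the simple, closed-form case, here is a minimal sketch (my own, not from the paper) of the Gaussian MLE: maximizing $J(\alpha)$ gives the sample mean and the (biased) sample variance.

```python
# Closed-form MLE for a 1-D Gaussian: a minimal illustrative sketch.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=10_000)   # samples from the "unknown" P_d

# Maximizing sum_i log N(x_i; mu, sigma^2) has a closed-form solution:
mu_mle = x.mean()                       # sample mean
var_mle = ((x - mu_mle) ** 2).mean()    # (biased) sample variance

print(mu_mle, np.sqrt(var_mle))         # close to the true (2.0, 0.5)
```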
In this (and related) work, the density constraint is dealt with using the following re-formulation: \begin{equation} P_m(x; \alpha) := \frac{P^0_m(x;\alpha)}{Z(\alpha)} \end{equation}
where $Z(\alpha) := \int P^0_m(u;\alpha)\, du$ is the partition function. Different works maximize $P^0_m(x;\alpha)$ without any constraints and find ways to account for $Z(\alpha)$.
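To make the re-formulation concrete, here is a tiny 1-D sketch (my own, with an illustrative unnormalized Gaussian shape): computing $Z(\alpha)$ already requires an integral over the whole space, which is exactly the step that becomes intractable in high dimensions.

```python
# Normalizing an unnormalized 1-D model by numerical integration (illustrative only).
import numpy as np
from scipy.integrate import quad

def p0_m(u, mu=2.0, s=0.5):
    """Unnormalized model P0_m(u; alpha): a Gaussian shape without its constant."""
    return np.exp(-0.5 * ((u - mu) / s) ** 2)

Z, _ = quad(p0_m, -np.inf, np.inf)    # partition function Z(alpha)

def p_m(u):
    """Properly normalized density P_m(u; alpha) = P0_m(u; alpha) / Z(alpha)."""
    return p0_m(u) / Z

print(Z, np.sqrt(2 * np.pi) * 0.5)    # matches the known Gaussian normalizer
```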
Noise Contrastive Estimation
The paper first takes the re-formulation above and treats the (negative log) partition function as an additional parameter to be estimated; estimating it jointly with the model parameters is one of the paper's contributions. So we have: \begin{equation} \log P_m(x; \theta) := \log P^0_m(x;\alpha) + c \end{equation}
where $\theta := \{ \alpha, c \}$ and $c$ stands in for $-\log Z(\alpha)$ but is estimated as a free parameter. Note that the “integrate to 1” constraint for $P_m(x;\theta)$ will only be satisfied for one specific value of $c$.
Now, we already have $N$ samples of interest from the unknown distribution. We further draw a set $Y := \{y_i;\ i = 1, 2, \dots, N \}$ from a known noise distribution $P_n(y)$. Then, we maximize the following objective function: \begin{equation} J(\theta) := \frac{1}{2N} \sum_i \Big( \log h(x_i; \theta) + \log [1 - h(y_i; \theta)] \Big) \end{equation}
where $h(u;\theta) := \sigma\!\left(\log \frac{P_m(u;\theta)}{P_n(u)}\right)$ and $\sigma(\cdot)$ is the sigmoid (logistic) function.
Hence, at each iteration we take one sample from each set, i.e., a pair $(x_i, y_i) \in X \times Y$. Since both $\sigma(\cdot)$ and $\log(\cdot)$ are increasing functions of their arguments, we ask (jointly) that $\theta$ should be such that:
- At location $x_i$, the probability $P_m(x_i; \theta)$ must be higher than $P_n(x_i)$.
- At location $y_i$, the opposite should hold: $P_n(y_i)$ should be higher than $P_m(y_i; \theta)$.
Hence, at $2N$ points in our data space, we require the above conditions to be met to the maximum extent possible; a minimal end-to-end sketch follows below.
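Putting the pieces together, here is a minimal self-contained sketch of NCE for a 1-D unnormalized Gaussian-shaped model (my own illustration, not the paper's code or the implementation linked above); the data distribution, noise distribution, parameterization $\theta = (\mu, \log s, c)$, and optimizer are all assumed for the example.

```python
# A minimal NCE sketch: fit an unnormalized 1-D Gaussian-shaped model, with the
# log-partition term c estimated as a free parameter. Illustrative choices throughout.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 10_000

x = rng.normal(loc=2.0, scale=0.5, size=N)   # data samples from the "unknown" P_d
y = rng.normal(loc=0.0, scale=2.0, size=N)   # noise samples from the known P_n

def log_pn(u):
    """log P_n(u): a Gaussian noise density we can both sample and evaluate."""
    return norm.logpdf(u, loc=0.0, scale=2.0)

def log_pm(u, theta):
    """log P_m(u; theta) = log P0_m(u; alpha) + c, with alpha = (mu, log_s)."""
    mu, log_s, c = theta
    return -0.5 * ((u - mu) / np.exp(log_s)) ** 2 + c

def neg_J(theta):
    """Negative NCE objective -J(theta), written with stable log-sigmoids."""
    G_x = log_pm(x, theta) - log_pn(x)     # log ratio at data points
    G_y = log_pm(y, theta) - log_pn(y)     # log ratio at noise points
    log_h_x = -np.logaddexp(0.0, -G_x)     # log h(x_i; theta)
    log_1mh_y = -np.logaddexp(0.0, G_y)    # log (1 - h(y_i; theta))
    return -(log_h_x.sum() + log_1mh_y.sum()) / (2 * N)

res = minimize(neg_J, x0=np.zeros(3), method="L-BFGS-B")
mu_hat, log_s_hat, c_hat = res.x
print("mu ~", mu_hat, " sigma ~", np.exp(log_s_hat))
print("c  ~", c_hat, " true -log Z(alpha*):", -np.log(np.sqrt(2 * np.pi) * 0.5))
```

Note that the recovered $c$ comes out close to $-\log Z(\alpha^*)$ even though the “integrate to 1” constraint is never imposed explicitly.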
NCE and Supervised Learning
Note that, since $\sigma(t) = \frac{1}{1 + e^{-t}}$, \begin{equation} h(u;\theta) = \sigma\!\left(\log \frac{P_m(u;\theta)}{P_n(u)}\right) = \frac{1}{1 + \frac{P_n(u)}{P_m(u;\theta)}} = \frac{P_m(u;\theta)}{P_m(u;\theta) + P_n(u)} \end{equation}
If we consider $X$ and $Y$ to come from two classes with class conditional densities $P(u | C=1, \theta) := P_m(u;\theta)$ and $P(u | C=0, \theta) := P_n(u)$, then the posterior probability of each class is given by (assuming priors to be equal): \begin{equation} P(C = 1 | u, \theta) = h(u;\theta) \end{equation} \begin{equation} P(C = 0 | u, \theta) = 1 - h(u;\theta) \end{equation}
With this, one sees that the overall NCE objective is essentially a binary classification problem (logistic regression).
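A quick numerical way to see this (a self-contained toy sketch; the particular model, noise, and parameter values are assumptions for illustration): $J(\theta)$ is, up to sign, the average binary cross-entropy of the "classifier" $h(u;\theta)$ on the pooled sample, labelled 1 for data and 0 for noise.

```python
# Sketch: J(theta) equals minus the average binary cross-entropy of h(u; theta)
# with labels 1 for data and 0 for noise. Toy numbers, illustrative only.
import numpy as np
from scipy.special import expit   # sigmoid
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=1000)   # "data" samples
y = rng.normal(0.0, 2.0, size=1000)   # noise samples

def log_ratio(u, mu=2.0, s=0.5, c=-0.23):
    """log P_m(u; theta) - log P_n(u) for an unnormalized Gaussian-shaped model."""
    return -0.5 * ((u - mu) / s) ** 2 + c - norm.logpdf(u, 0.0, 2.0)

# NCE objective J(theta), written with stable log-sigmoids
J = 0.5 * (np.mean(-np.logaddexp(0, -log_ratio(x)))
           + np.mean(-np.logaddexp(0, log_ratio(y))))

# Binary cross-entropy of h on the pooled, labelled sample
u = np.concatenate([x, y])
labels = np.concatenate([np.ones_like(x), np.zeros_like(y)])
h = np.clip(expit(log_ratio(u)), 1e-12, 1 - 1e-12)   # clip to avoid log(0)
bce = -np.mean(labels * np.log(h) + (1 - labels) * np.log(1 - h))

print(J, -bce)   # the two values agree
```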
Neural Networks?
So where is the neural network? It is $h(u;\theta)$, as is clear from the objective function. If we threshold the output $h(u;\theta^*)$ (assuming equal costs and no other performance trade-offs), we get a binary classifier. If we take off the classification head from the learnt network, we are left with a representation space that separates samples from $X$ and $Y$ well.
However, there is a catch. The parameter $\theta$ that we intend to estimate is not explicit when we compute $h(u; \theta)$ using a neural net. The net directly models the ratio defining $h(u;\theta)$ above, and the model parameters are implicit in the neural-net weights. This is exactly like the supervised-learning case, where such a neural net is called discriminative rather than generative.
Salient Points
First, the paper comments on the properties of the estimator $\hat\theta_N$ (is it an estimator, or an estimate?) computed from $N$ samples each of data and noise. It says that in the large-sample limit, $\hat\theta_N$ converges to $\theta^*$, where $\theta^*$ is such that $P_d(x) = P_m(x; \theta^*)$. For this we require $P_n(x)$ to be non-zero wherever $P_d(x)$ is non-zero.
This is intuitive. In regions of the space where there are enough noise samples adjacent to a data sample, the optimisation has to work harder to satisfy the objective. The paper also makes this point while discussing the choice of the noise distribution. In fact, if the noise distribution is “far removed” from the (possible) data distribution, the objective is already satisfied without learning anything (in binary-classification parlance, the classes are separable).
However, in high dimensions, the challenge is that the data may actually lie in a very low-dimensional subspace (see the Manifold Hypothesis). In such a case, we may not have a noise distribution that generates sufficiently many points near the observed data samples. Still, simple distributions such as Gaussians, which are non-zero at all points and easy to sample from, can be used.
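As a minimal illustration of such a choice (my own sketch with synthetic data, not code from the paper), one can use a Gaussian noise distribution matched to the empirical mean and covariance of the data: it is non-zero everywhere, easy to sample from and to evaluate, and places most of its mass near the observed samples.

```python
# Sketch: a Gaussian noise distribution matched to the data's mean and covariance.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5)) @ rng.normal(size=(5, 5))   # synthetic data, d = 5

mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
P_n = multivariate_normal(mean=mean, cov=cov)   # easy to sample and to evaluate

Y = P_n.rvs(size=10_000)     # noise samples y_i
log_pn_x = P_n.logpdf(X)     # finite (non-zero density) at every data point
```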
Conclusion
The first time I read this paper, I found it interesting because it (along with some other works) presents the idea of “learning by comparison”. In fact, the novelty in this paper is simple and makes intuitive sense. Moreover, if we take the distribution of some other set of data as the noise distribution, we get the ideas of contrastive learning (I think; I will try to make that connection later on).