Word2Vec
June 10, 2020
This post introduces some widely used methods for creating vector embeddings of words: Skip-Gram, CBOW, and GloVe.
When solving any real-world problem with computer algorithms, one crucial question to consider is how to represent real-world data with bits and bytes. In the field of language processing, our major concern is words. How do we represent words efficiently?
This post is based on the lectures of Stanford cs224n (2019 Winter).
One-hot vector
Perhaps the most intuitive way is to simply label each word with a unique number. Say our vocabulary is:
['I', 'want', 'to', 'represent', 'words']
We can define a look-up table pairing up each word with a unique index:
Word | Representation |
---|---|
I | 1 |
want | 2 |
to | 3 |
represent | 4 |
words | 5 |
We assign each word $w_i$ an index $i$. But this embedding scheme has several disadvantages. (1) Each index has a different magnitude, which wrongly suggests that some words are more "important" than others. (2) The indices of different words carry no correlation with their meanings. Ideally, we want the embeddings of correlated words to also be correlated.
The first problem can be solved by representing words with one-hot vectors. We represent each word $w_i$ with a vector $x_i \in \mathbb{R}^{|V|}$, where $|V|$ is the vocabulary size. Every entry of $x_i$ is zero except the $i$-th entry, which is one. So now for our mini vocabulary we have:
Word | Representation |
---|---|
I | $[1, 0, 0, 0, 0]^\top$ |
want | $[0, 1, 0, 0, 0]^\top$ |
to | $[0, 0, 1, 0, 0]^\top$ |
represent | $[0, 0, 0, 1, 0]^\top$ |
words | $[0, 0, 0, 0, 1]^\top$ |
This solves problem (1) and gives us vectors to work with, which is pretty convenient. But problem (2) remains. Moreover, we now have a problem (3): the curse of dimensionality. With a small $|V|$ this is still viable, but imagine trying to incorporate all possible words with this representation…
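To make this concrete, here is a minimal NumPy sketch of the one-hot scheme for our toy vocabulary (the `one_hot` helper and the 0-based indexing are just for illustration):

```python
import numpy as np

# Toy vocabulary from the table above; indices are 0-based here.
vocab = ['I', 'want', 'to', 'represent', 'words']
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return the |V|-dimensional one-hot vector for `word`."""
    x = np.zeros(vocab_size)
    x[word_to_index[word]] = 1.0
    return x

print(one_hot('to'))  # [0. 0. 1. 0. 0.]
```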
Word2Vec
This motivates a vector representation of words with smaller dimensionality and some kind of correlation with their meanings.
Suppose each word $w$ is represented by some vector $v_w \in \mathbb{R}^d$, where $d$ is the embedding size, a "small" number. We can use the dot product to measure how close two vectors are. Ideally, if $w_1$ and $w_2$ are correlated, $v_{w_1} \cdot v_{w_2}$ will be big (close to 1 if the vectors are normalized). Conversely, if $w_1$ and $w_2$ are not correlated, $v_{w_1} \cdot v_{w_2}$ will be small (close to -1). We will take this as a basis of intuition when we measure correlation in the following derivations.
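As a quick illustration of this intuition, here is a small sketch (the vectors and the `cosine_similarity` helper are made up for demonstration; real embeddings are learned):

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Dot product of the normalized vectors, always in [-1, 1]."""
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Hypothetical 4-dimensional embeddings, just to show the idea.
v_cat = np.array([0.9, 0.1, 0.3, 0.5])
v_dog = np.array([0.8, 0.2, 0.4, 0.4])
v_car = np.array([-0.6, 0.9, -0.2, 0.1])

print(cosine_similarity(v_cat, v_dog))  # close to 1: correlated
print(cosine_similarity(v_cat, v_car))  # much smaller: not correlated
```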
Before that, you may notice that "correlated" is ill-defined here. It can mean that two words are close in meaning, or it can mean that two words often appear together. The two interpretations sometimes align, but we need to treat them differently. The two algorithms I will introduce, Skip-Gram and CBOW, use the second definition.
The idea behind both algorithms is simple. We want our vector embeddings of two correlated words $w$ and $w'$ to produce a high conditional probability $P(w \mid w')$.
Question 1: how do we know that two words are actually correlated?
Answer: we find many real sentences and see which words appear together. In reality, a sentence can be arbitrarily long, so we take a sliding window approach: for a sliding window of size $m$, we only look at the $m$ words coming before or after a word.
Question 2: how do we measure $P(o \mid c)$, the probability of seeing a context word $o$ given a center word $c$?
Answer: we model it with a softmax over dot products:
$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{|V|} \exp(u_w^\top v_c)}$$
Given the intuition about the dot product above, this should make some sense: the dot product measures correlation, taking exponentials picks out the bigger values, and normalizing everything to sum to one gives a probability distribution. (Here $u_o$ is the vector of $o$ as a context word and $v_c$ the vector of $c$ as a center word; more on this two-vector convention shortly.) This formula also comes from a cross-entropy loss and a softmax function (more on this later).
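Here is a small NumPy sketch of this softmax model (the matrix shape and the `p_context_given_center` helper are illustrative, not from the original papers):

```python
import numpy as np

def p_context_given_center(U, v_c):
    """P(o | c) for every word o: a softmax over the dot products u_o . v_c.

    U   : (d, vocab_size) matrix whose columns are the context vectors u_w
    v_c : (d,) embedding of the center word
    """
    scores = U.T @ v_c                    # one dot product per vocabulary word
    scores -= scores.max()                # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize into a probability distribution

# Hypothetical tiny example: 5 words, 3-dimensional embeddings.
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))
v_c = rng.normal(size=3)
probs = p_context_given_center(U, v_c)
print(probs, probs.sum())                 # the probabilities sum to 1
```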
CBOW (Continuous Bag of Words)
Given a center word $c$, we refer to the words around $c$ as its context words. CBOW seeks to maximize the probability of a center word given its context words within a particular window size $m$, so there are $2m$ context words ($m$ words on the left and $m$ on the right) for each center word. (Center words near the beginning or end of a sentence have fewer context words.) To make the algorithm easier in practice, we often assign two vectors to each word $w$: a vector $u_w$ for when the word is a context word and a vector $v_w$ for when it is a center word. In the end, we average the two vectors. You will see how this simplifies implementations.
In short, for a single center word $w_t$ within a window of size $m$, we want to maximize:
$$P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m})$$
We haven’t talked about how to model the probability of a word given many words. CBOW takes a rather straightforward approach: it averages the word vectors of the context words and conditions on that averaged vector instead.
Equivalently, we want to minimize the negative log probability:
$$J = -\log P(w_t \mid \hat{u}) = -v_{w_t}^\top \hat{u} + \log \sum_{w=1}^{|V|} \exp(v_w^\top \hat{u})$$
where
$$\hat{u} = \frac{1}{2m} \sum_{\substack{-m \le j \le m \\ j \ne 0}} u_{w_{t+j}}$$
is the average of the context word vectors.
If we concatenate all the $u_w$ vectors together, we get a matrix $U \in \mathbb{R}^{d \times |V|}$, where $|V|$ is the vocabulary size and $d$ is the embedding size, i.e., the dimension of each word vector. The $i$-th column of $U$ is $u_{w_i}$ for word $w_i$. Similarly, we have a matrix $V \in \mathbb{R}^{d \times |V|}$ for all the $v_w$. Now we again find one-hot vectors extremely useful: for a word $w_i$ with one-hot vector $x_i$, we get $u_{w_i} = U x_i$ and $v_{w_i} = V x_i$. This is non-trivial, as it enables a vectorized implementation (without nasty for-loops). More importantly, it gives another interpretation of $U$ and $V$ beyond being concatenations of word vectors: they can be understood as two weight matrices to be learned.
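Here is a rough NumPy sketch of these two ideas, the one-hot column lookup and the CBOW loss (the tiny sizes and the `cbow_loss` helper are made up for illustration):

```python
import numpy as np

vocab_size, d = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(d, vocab_size))  # context vectors, one per column
V = rng.normal(size=(d, vocab_size))  # center vectors, one per column

def one_hot(i, n=vocab_size):
    x = np.zeros(n)
    x[i] = 1.0
    return x

# Multiplying by a one-hot vector just picks out the corresponding column.
assert np.allclose(U @ one_hot(2), U[:, 2])

def cbow_loss(context_ids, center_id):
    """Negative log probability of the center word given the averaged context vectors."""
    u_hat = U[:, context_ids].mean(axis=1)   # average of the context vectors
    scores = V.T @ u_hat                     # one score per vocabulary word
    log_Z = scores.max() + np.log(np.exp(scores - scores.max()).sum())
    return -(scores[center_id] - log_Z)      # cross-entropy with the one-hot target

print(cbow_loss(context_ids=[0, 1, 3, 4], center_id=2))
```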
Now we have some vector representations, a probability model, a loss function, and some weights to be learned. It should be clear by now that this is a standard machine learning problem.
Previously we obtained the loss from the negative log probability, but we didn’t fully explain where that probability comes from (we only used the dot-product intuition to infer it). It can equivalently be derived by applying a softmax to the scores $V^\top \hat{u}$ and taking the cross-entropy loss against the one-hot vector of the center word; you may need to write out the equations to see how the two formulations are equal. Another thing to mention is that, in practice, we take a gradient step for a batch of several center words rather than a single word, as the derivation above may suggest.
Skip-Gram
Skip-Gram is very similar to CBOW. Instead of modeling the probability of a center word given its context words, Skip-Gram seeks to maximize the probability of each context word given the center word,
$$\prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t),$$
where each factor is modeled with the softmax formula from before. For a single (center, context) pair this again gives the negative log probability $J = -\log P(o \mid c)$ as the loss.
To do gradient descent, we need to calculate the partial derivative of $J$ with respect to $v_c$. For a single context word $o$:
$$\frac{\partial J}{\partial v_c} = -u_o + \sum_{w=1}^{|V|} P(w \mid c)\, u_w = \sum_{w=1}^{|V|} \big(P(w \mid c) - \mathbb{1}[w = o]\big)\, u_w$$
where $\mathbb{1}[\cdot]$ is the indicator function. This is actually intuitive, since with each update
$$v_c \leftarrow v_c + \alpha \Big(u_o - \sum_{w=1}^{|V|} P(w \mid c)\, u_w\Big)$$
we are essentially pushing $v_c$ towards $u_o$, discounted by the context vectors $u_w$ that already have a large $P(w \mid c)$, with some learning rate $\alpha$.
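A small sketch of this gradient and the corresponding SGD update, assuming the same $d \times |V|$ matrix $U$ of context vectors as above (`skipgram_grad_vc` is an illustrative name, not from any library):

```python
import numpy as np

def skipgram_grad_vc(U, v_c, o):
    """Gradient of J = -log P(o | c) with respect to the center vector v_c.

    U : (d, vocab_size) context vectors (columns)
    o : index of the observed context word
    Returns sum_w (P(w|c) - 1[w == o]) * u_w.
    """
    scores = U.T @ v_c
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                         # P(w | c) for every w
    probs[o] -= 1.0                              # subtract the indicator 1[w == o]
    return U @ probs                             # weighted sum of the context vectors

# One SGD step with a hypothetical learning rate alpha:
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))
v_c = rng.normal(size=3)
alpha = 0.1
v_c -= alpha * skipgram_grad_vc(U, v_c, o=2)     # pushes v_c towards u_o
```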
Negative Sampling: The Problem with Softmax and Solutions
To get a probability distribution over every word (given a center word or some context words), we need to calculate a softmax. This can be problematic, as we need to compute an entry for every word in the vocabulary. To avoid this cost, there are two widely used approaches: Negative Sampling and Hierarchical Softmax ([2]). This post will briefly introduce negative sampling; you may look up hierarchical softmax for further exploration.
Rather than directly optimizing a conditional probability $P(w \mid c)$ where $(w, c)$ is a pair of words that appear together, negative sampling seeks to train a classifier that predicts whether $(w, c)$ is a valid pair. For example, ("play", "game") appears to be valid and we want the classifier to predict true, while ("eat", "game") appears nonsensical and we want the classifier to predict false.
During training, the validity of a pair is defined by whether it appears in the corpus data $D$. We will denote by $P(D = 1 \mid w, c; \theta)$ the probability that $(w, c)$ came from the corpus data and by $P(D = 0 \mid w, c; \theta)$ the probability that it did not, where $\theta$ denotes the parameters of the model, which in this case are the word vectors $u_w$ and $v_c$.
We model this probability with the sigmoid function of the dot product between the word vectors:
$$P(D = 1 \mid w, c; \theta) = \sigma(u_w^\top v_c) = \frac{1}{1 + e^{-u_w^\top v_c}}$$
We train our model by taking a maximum likelihood approach:
$$\theta = \arg\max_\theta \prod_{(w, c) \in D} P(D = 1 \mid w, c; \theta) \prod_{(w, c) \in \tilde{D}} P(D = 0 \mid w, c; \theta)$$
where $\tilde{D}$ is assumed to be some “false” corpus containing nonsensical word pairs. But we don’t really have such a corpus, so we model it by randomly sampling $K$ words from a noise distribution $P_n(w)$ for each center word $c$.
Thus, given a true word pair $(c, o)$, a center word and a context word seen in the corpus data, we can define a loss function by the negative log likelihood:
$$J = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_{w_k}^\top v_c)$$
where $w_1, \ldots, w_K \sim P_n(w)$ are the sampled negative words and $\sigma(-x) = 1 - \sigma(x)$.
The researchers found that the unigram distribution $U(w)$ (the distribution based on each word’s frequency) raised to the 3/4 power has the best performance:
$$P_n(w) = \frac{U(w)^{3/4}}{Z}$$
where $Z$ is a normalization factor.
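Putting the pieces together, here is a rough sketch of the negative sampling loss and the $U(w)^{3/4}$ noise distribution (the helper names and the toy counts are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_c, u_o, U_neg):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c).

    v_c   : center word vector
    u_o   : vector of the true context word
    U_neg : (K, d) vectors of the K sampled negative words
    """
    return -np.log(sigmoid(u_o @ v_c)) - np.log(sigmoid(-U_neg @ v_c)).sum()

def noise_distribution(word_counts):
    """Unigram distribution raised to the 3/4 power, then renormalized."""
    p = np.asarray(word_counts, dtype=float) ** 0.75
    return p / p.sum()        # dividing by Z, the normalization factor

# Hypothetical counts for a 5-word vocabulary:
P_n = noise_distribution([100, 50, 10, 5, 1])
rng = np.random.default_rng(0)
negative_ids = rng.choice(len(P_n), size=3, p=P_n)   # draw K = 3 negative samples

d = 3
U = rng.normal(size=(len(P_n), d))   # context vectors as rows here, for convenience
v_c, u_o = rng.normal(size=d), rng.normal(size=d)
print(negative_sampling_loss(v_c, u_o, U[negative_ids]))
```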
GloVe (Global Vectors for Word Representation)
Both CBOW and Skip-Gram have a problem of locality: we only attend to co-occurring words within a limited window size. What if we want to make use of global co-occurrence? Suppose we have a global co-occurrence matrix $X$, where $X_{ij}$ is the number of times words $i$ and $j$ co-occur. How can we make use of it?
Another idea is that we don’t just want some conditional probability $P(j \mid i)$. Instead, we want some comparative measure, such as the ratio $\frac{P(k \mid i)}{P(k \mid j)}$, which tells us whether word $k$ is more related to word $i$ or to word $j$.
GloVe is motivated by the idea of using global co-occurrence and the need to represent such relative co-occurrence. For words $i, j$ and their corresponding vectors $v_i, u_j$ (every word again has two vectors), we model the conditional probability as:
$$P(j \mid i) = \exp(u_j^\top v_i), \quad\text{i.e.,}\quad u_j^\top v_i = \log P(j \mid i)$$
Now we have a nice property:
$$u_k^\top (v_i - v_j) = \log \frac{P(k \mid i)}{P(k \mid j)}$$
so differences of word vectors capture exactly the ratio we wanted.
We define the loss function as a weighted least-squares objective over the co-occurrence matrix:
$$J = \sum_{i, j = 1}^{|V|} f(X_{ij}) \left( u_j^\top v_i + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where $f$ is a weighting function that clips the influence of very large $X_{ij}$, and $b_i$, $\tilde{b}_j$ are biases (which, among other things, absorb the marginal count $\log X_i$ that turns $\log X_{ij}$ into $\log P(j \mid i)$).
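Here is a sketch of this objective in NumPy, using the weighting function from the GloVe paper, $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ and $1$ otherwise, with $x_{\max} = 100$ and $\alpha = 3/4$ (the `glove_loss` helper and the toy sizes are illustrative):

```python
import numpy as np

def glove_loss(X, U, V, b_u, b_v, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe loss over a cooccurrence matrix X.

    X        : (vocab_size, vocab_size) cooccurrence counts
    U, V     : (d, vocab_size) the two sets of word vectors (columns)
    b_u, b_v : (vocab_size,) bias terms
    """
    # Weighting function f caps the influence of very frequent pairs.
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0)
    with np.errstate(divide='ignore'):
        log_X = np.where(X > 0, np.log(X), 0.0)         # only X_ij > 0 contributes
    residual = U.T @ V + b_u[:, None] + b_v[None, :] - log_X
    return np.sum(f * (X > 0) * residual ** 2)

# Toy usage with random counts and vectors:
rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.integers(0, 10, size=(n, n)).astype(float)
U, V = rng.normal(size=(d, n)), rng.normal(size=(d, n))
print(glove_loss(X, U, V, b_u=np.zeros(n), b_v=np.zeros(n)))
```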
References
[1] Mikolov et al. Efficient Estimation of Word Representations in Vector Space. (CBOW and Skip-Gram)
[2] Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. (negative sampling and hierarchical softmax)
[3] Pennington et al. GloVe: Global Vectors for Word Representation. (GloVe)