
Word2Vec

June 10, 2020

This post introduces several widely used methods for creating vector embeddings of words: Skip-Gram, CBOW, and GloVe.

When solving any real-world problem with computer algorithms, one crucial question to consider is how to represent real-world data with bits and bytes. In the field of language processing, our major concern is words. How do we represent words efficiently?

This post is based on the lectures of Stanford cs224n (2019 Winter).

One-hot vector

Perhaps the most intuitive way is to simply label each word with a unique number. Say our vocabulary is:

['I', 'want', 'to', 'represent', 'words']

We can define a look-up table pairing each word with a unique index:

| Word | Representation |
|------|----------------|
| I | 1 |
| want | 2 |
| to | 3 |
| represent | 4 |
| words | 5 |

We assign each word $w_i$ an index $c_i$. But this embedding scheme has several disadvantages. (1) Each $c_i$ has a different magnitude, which wrongly suggests that some words are more important than others. (2) The indices $c_i$ of different words carry no information about correlations between their meanings. Ideally, we want the embeddings of correlated words to be correlated as well.

The first problem can be solved by representing words with one-hot vectors. We represent each word $w_i$ with a vector $o_i \in \mathbb{R}^{n}$, where $n$ is the vocabulary size. Every entry of $o_i$ is zero except that the $i$-th entry is one. So now for our mini vocabulary we have:

| Word | Representation |
|------|----------------|
| I | $[1,0,0,0,0]^T$ |
| want | $[0,1,0,0,0]^T$ |
| to | $[0,0,1,0,0]^T$ |
| represent | $[0,0,0,1,0]^T$ |
| words | $[0,0,0,0,1]^T$ |
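
As a quick illustration, here is a minimal NumPy sketch that builds this one-hot representation; the vocabulary and helper names (`word_to_index`, `one_hot`) are just for this example.

```python
import numpy as np

vocab = ['I', 'want', 'to', 'represent', 'words']
word_to_index = {word: i for i, word in enumerate(vocab)}  # the look-up table above

def one_hot(word, vocab_size=len(vocab)):
    """Return the one-hot vector o_i for a word."""
    o = np.zeros(vocab_size)
    o[word_to_index[word]] = 1.0
    return o

print(one_hot('to'))  # [0. 0. 1. 0. 0.]
```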

This solves problem (1) and gives us vectors to work with, which is quite convenient. But problem (2) remains. Moreover, we now have problem (3): the curse of dimensionality. With a small $n$ this is still viable, but imagine trying to incorporate every possible word using this representation…

Word2Vec

This motivates representing words with lower-dimensional vectors that also capture some correlation with their meanings.

Suppose each word $w_i$ is represented by some vector $v_i \in \mathbb{R}^k$, $\|v_i\|_2 = 1$, where $k$ is the embedding size, a 'small' number. We can make use of the dot product to measure how close two vectors are. So ideally, if $w_i$ and $w_j$ are correlated, $v_i \cdot v_j$ will be big (close to 1). Conversely, if $w_i$ and $w_j$ are not correlated, $v_i \cdot v_j$ will be small (close to -1). We will take this as the basis of our intuition when we measure correlation in the following derivations.

Before that, you may notice that 'correlated' is ill-defined here. It can mean that two words are close in meaning, or it can mean that two words often appear together. The two interpretations sometimes align, but we need to treat them differently. The two algorithms I will introduce, Skip-Gram and CBOW, use the second definition.

The idea behind both algorithms is simple. We want our vector embeddings of correlated words $w_i$ and $w_j$ to produce a high conditional probability $\mathbb{P}(w_i|w_j)$.

  • Question 1: how do we know $w_i$ and $w_j$ are actually correlated?

    Answer: we find many real sentences and see which words appear together. In reality, a sentence can be arbitrarily long, so we take a sliding window approach. For a sliding window of size $d$, we only look at the $d$ words before or after a given word.

  • Question 2: how do we measure $\mathbb{P}(w_i|w_j)$?

    $$\mathbb{P}(w_i|w_j) = \frac{\exp(v_i^T v_j)}{\sum_{k=1}^n \exp(v_k^T v_j)}$$

    Given the intuition about the dot product above, this should make sense. The dot product measures correlation; taking exponentials picks out the larger values, and normalizing all the values to sum to one yields a probability distribution. This comes from a softmax function and a cross-entropy loss (more on this later); a minimal sketch of the computation follows below.
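
To make the formula concrete, here is a minimal NumPy sketch of $\mathbb{P}(w_i|w_j)$ as a softmax over dot products; the matrix `V` (one unit-norm row per word vector) and the `softmax` helper are purely illustrative.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    scores = scores - np.max(scores)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

n, k = 5, 3                                       # vocabulary size, embedding size
rng = np.random.default_rng(0)
V = rng.normal(size=(n, k))                       # row i holds the vector v_i for word w_i
V /= np.linalg.norm(V, axis=1, keepdims=True)     # unit-norm vectors, as assumed above

def conditional_prob(i, j, V):
    """P(w_i | w_j): softmax over all dot products v_k . v_j, evaluated at index i."""
    scores = V @ V[j]                             # v_k^T v_j for every k
    return softmax(scores)[i]

print(conditional_prob(0, 2, V))
```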

CBOW (Continuous Bag of Words)

Given a center word $w_c$, we refer to the words around $w_c$ as the context words for $w_c$. CBOW seeks to maximize the probability of a center word $w_c$ given its context words within a particular window size $m$, so there are $2m$ context words ($m$ words on the left and $m$ on the right) for a center word. (The center words at the beginning or end of a sentence have fewer context words.) To make the algorithm easier in practice, we often assign two vectors to each word $w_i$: one vector $v_i$ for the word when it is a context word, and another $u_i$ when it is a center word. In the end, we average the two vectors. You will see how this simplifies implementations.

In short, for a single center word, we want to maximize:

$$\mathbb{P}(w_c|w_{c-m}, \dots, w_{c-1}, w_{c+1}, \dots, w_{c+m})$$

We haven't talked about how to model the probability of a word given many words. CBOW takes a rather straightforward approach: we average the word vectors of the context words and condition on that averaged vector instead.

Equivalently, we want to minimize the negative log probability:

$$\begin{aligned} \text{minimize } J &= - \log \mathbb{P}(w_c|w_{c-m}, \dots, w_{c-1}, w_{c+1}, \dots, w_{c+m}) \\ &= - \log \frac{\exp(u_c^T\hat{v})}{\sum_{k=1}^n \exp(u_k^T\hat{v})} \\ &= - u_c^T\hat{v} + \log \sum_{k=1}^n \exp(u_k^T\hat{v}) \end{aligned}$$

where

$$\hat{v} = \frac{1}{2m}(v_{c-m}+ \dots+ v_{c-1}+ v_{c+1}+ \dots+ v_{c+m})$$

If we concatenate all vectors $v_i$ together, we get a matrix $V \in \mathbb{R}^{k \times n}$, where $n$ is the vocabulary size and $k$ is the embedding size, i.e. the dimension of each word vector. The $i$-th column of $V$ is $v_i$, the vector for word $w_i$. Similarly, we have a matrix $U \in \mathbb{R}^{k \times n}$ for all $u_i$. Now we again find one-hot vectors extremely useful: for a word $w_i$ with one-hot vector $o_i$, we get $v_i = V o_i$ and $u_i = U o_i$. This is non-trivial, as it enables a vectorized implementation (without nasty for-loops). More importantly, it gives another interpretation of $V, U$ beyond a concatenation of vectors: they can be understood as two weight matrices to be learned.
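
As a quick sanity check of this equivalence, here is a tiny NumPy sketch; the shapes follow the $k \times n$ convention above, and the names are illustrative.

```python
import numpy as np

n, k = 5, 3                        # vocabulary size, embedding size
rng = np.random.default_rng(0)
V = rng.normal(size=(k, n))        # column i is the context vector v_i

o_2 = np.zeros(n)
o_2[2] = 1.0                       # one-hot vector for word w_2

# Multiplying by a one-hot vector simply selects column 2 of V.
assert np.allclose(V @ o_2, V[:, 2])
```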

Now we have vector representations, a probability model, a loss function, and some weights to be learned. It should be clear by now that this is a standard machine learning problem.

Algorithm CBOW

Previously we obtained the loss from the negative log probability, but we didn't explain how we arrive at that probability (beyond the dot-product intuition used to infer it). Steps 3 and 4 of the algorithm show how the loss comes from a softmax function and a cross-entropy loss; you may need to write out the equations to see that the two formulations agree. Another thing to mention: in practice, we take a gradient step for several center words at a time rather than a single word, as the description above may imply.
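
Here is a minimal NumPy sketch of a single CBOW gradient step under the setup above; the toy sizes, names, and learning rate are illustrative, not the lecture's reference implementation.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

n, k, lr = 5, 3, 0.1                     # vocab size, embedding size, learning rate
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(k, n))   # context vectors (columns)
U = rng.normal(scale=0.1, size=(k, n))   # center-word vectors (columns)

def cbow_step(context_ids, center_id):
    global V, U
    v_hat = V[:, context_ids].mean(axis=1)    # average of the 2m context vectors
    scores = U.T @ v_hat                      # u_k^T v_hat for every word k
    probs = softmax(scores)                   # softmax -> P(w | context)
    loss = -np.log(probs[center_id])          # cross-entropy against the true center word

    d_scores = probs.copy()
    d_scores[center_id] -= 1.0                # gradient of softmax + cross-entropy
    grad_U = np.outer(v_hat, d_scores)        # dJ/dU
    grad_v_hat = U @ d_scores                 # dJ/dv_hat
    U -= lr * grad_U
    V[:, context_ids] -= lr * grad_v_hat[:, None] / len(context_ids)
    return loss

print(cbow_step(context_ids=[0, 1, 3, 4], center_id=2))
```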

Skip-Gram

Skip-Gram is very similar to CBOW. Instead of considering the probability of a center word given some context words, Skip-Gram seeks to maximize the probability of context words given a center word.

$$\begin{aligned} \text{minimize } J &= - \log \mathbb{P}(w_{c-m}, \dots, w_{c-1}, w_{c+1}, \dots, w_{c+m}\mid w_c) \\ &= - \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log \frac{\exp(v_{c+j}^T u_c)}{\sum_{k=1}^n \exp(v_k^T u_c)} \end{aligned}$$

(The second line assumes the context words are conditionally independent given the center word, so the joint probability factorizes and the log turns the product into a sum. As before, we minimize the negative log probability.)
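
A minimal NumPy sketch of this loss for one center word and its window (toy sizes and names are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

n, k = 5, 3
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(k, n))   # context ("outside") vectors
U = rng.normal(scale=0.1, size=(k, n))   # center vectors

def skipgram_loss(center_id, context_ids):
    """Negative log likelihood of the context words given the center word."""
    probs = softmax(V.T @ U[:, center_id])   # P(w | w_c) for every word w
    return -np.sum(np.log(probs[context_ids]))

print(skipgram_loss(center_id=2, context_ids=[0, 1, 3, 4]))
```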

To do gradient descent, we need to calculate the partial derivatives of $J$ w.r.t. $U, V$. For a single context word $w_o$:

$$J = -\log \mathbb{P}(w_o|w_c) = - v_o^T u_c + \log \sum_{k=1}^n \exp(v_k^T u_c)$$

$$\begin{aligned} \nabla_{u_c} J &= - v_o + \frac{1}{\sum_{k=1}^n \exp(v_k^T u_c)}\sum_{k=1}^n v_k \exp(v_k^T u_c) \\ &= - v_o + \sum_{k=1}^n v_k \, \mathbb{P}(w_k | w_c) \end{aligned}$$

This is actually intuitive, since with each update,

$$u_c := u_c - \alpha \nabla_{u_c} J = u_c + \alpha \left( v_o - \sum_{k=1}^n v_k \, \mathbb{P}(w_k | w_c) \right)$$

we are essentially pushing $u_c$ towards $v_o$ (with learning rate $\alpha$), discounted by the vectors $v_k$ that already have a large $\mathbb{P}(w_k | w_c)$.

$$\nabla_{v_k} J = - u_c \, \mathbb{I}[k=o] + u_c \, \mathbb{P}(w_k | w_c)$$

where $\mathbb{I}$ is the indicator function.
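
As a sanity check on the derivation, here is a small NumPy sketch comparing the analytic gradient $\nabla_{u_c} J$ against a numerical (central-difference) gradient; everything here is illustrative scaffolding.

```python
import numpy as np

n, k = 5, 3
rng = np.random.default_rng(1)
V = rng.normal(size=(k, n))          # context vectors (columns v_k)
U = rng.normal(size=(k, n))          # center vectors (columns u_k)
c, o = 2, 4                          # center word index, observed context word index

def loss(U):
    """J = -log P(w_o | w_c) as a function of U (V held fixed)."""
    scores = V.T @ U[:, c]
    return -(scores[o] - np.log(np.exp(scores).sum()))

# Analytic gradient w.r.t. u_c from the derivation above.
scores = V.T @ U[:, c]
probs = np.exp(scores) / np.exp(scores).sum()
grad_u_c = -V[:, o] + V @ probs

# Numerical gradient via central differences.
eps, num_grad = 1e-5, np.zeros(k)
for i in range(k):
    Up, Um = U.copy(), U.copy()
    Up[i, c] += eps
    Um[i, c] -= eps
    num_grad[i] = (loss(Up) - loss(Um)) / (2 * eps)

print(np.allclose(grad_u_c, num_grad, atol=1e-5))   # expect True
```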

Negative Sampling: The Problem with Softmax and Solutions

To get a probability distribution over every word (given a center word or some context words), we need to compute a softmax. But this can be problematic, as again we need to output an entry for every word in the vocabulary. To avoid this expensive normalization over the whole vocabulary, there are two widely used approaches: negative sampling and hierarchical softmax ([2]). This post will briefly introduce negative sampling; you may look up hierarchical softmax for further exploration.

Rather than directly optimizing a conditional probability $\mathbb{P}(w|c)$, where $(w, c)$ is a pair of words that appear together, negative sampling seeks to train a binary classifier that predicts whether $(w, c)$ is a valid pair. For example, ("play", "game") appears to be valid and we want the classifier to predict true, while ("eat", "game") appears nonsensical and we want the classifier to predict false.

During training, the validity of a pair is defined by whether the pair appears in the corpus data $D$. We denote by $\mathbb{P}(D=1 | w, c, \theta)$ the probability that $(w, c)$ came from $D$, and by $\mathbb{P}(D=0 | w, c, \theta) = 1 - \mathbb{P}(D=1 | w, c, \theta)$ the probability that it did not, where $\theta$ denotes the parameters of the model, which in this case are the word vectors $U$ and $V$.

We model this probability with the sigmoid function of the dot product between the word vectors:

$$\mathbb{P}(D=1 | w, c, \theta) = \sigma(u_w^T v_c) = \frac{1}{1+e^{-u_w^T v_c}}$$

We train our model by taking a maximum likelihood approach:

$$\begin{aligned} \theta = (U,V) &= \text{argmax}_{\theta} \prod_{(w,c) \in D} \mathbb{P}(D=1 | w, c, \theta) \prod_{(w,c) \in \tilde D} \mathbb{P}(D=0 | w, c, \theta) \\ &= \text{argmax}_{\theta} \sum_{(w,c) \in D} \log \sigma(u_w^T v_c) + \sum_{(w,c) \in \tilde D} \log \sigma(-u_w^T v_c) \end{aligned}$$

where $\tilde D$ is assumed to be some "false" corpus containing nonsensical word pairs. (The second line uses the identity $1 - \sigma(x) = \sigma(-x)$.) But we don't actually have such a corpus, so we approximate it by randomly sampling, for each word $c$, $k$ negative words from a probability distribution $P(w)$.

Thus, given a true word pair $(w, c)$, a center word and a context word seen in the corpus data, we can define a loss function by the negative log likelihood:

$$J = - \log \sigma(u_w^T v_c) - \sum_{i=1}^k \log \sigma(-u_i^T v_c)$$

where $u_i \sim P(w)$.
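
A minimal NumPy sketch of this loss for one positive pair and its $k$ negative samples (the toy setup and names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, emb, k = 5, 3, 2
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(emb, n))   # context vectors v
U = rng.normal(scale=0.1, size=(emb, n))   # center-word vectors u

def neg_sampling_loss(w, c, neg_ids):
    """-log sigma(u_w . v_c) - sum_i log sigma(-u_i . v_c)."""
    pos = np.log(sigmoid(U[:, w] @ V[:, c]))
    neg = np.sum(np.log(sigmoid(-U[:, neg_ids].T @ V[:, c])))
    return -(pos + neg)

neg_ids = rng.integers(0, n, size=k)       # stand-in for sampling from P(w)
print(neg_sampling_loss(w=1, c=2, neg_ids=neg_ids))
```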

The researchers found that the unigram distribution $U(w)$ (the distribution based on each word's frequency) raised to the $3/4$ power performs best:

$$P(w) = \frac{U(w)^{3/4}}{Z}$$

where $Z$ is a normalization factor.
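
For illustration, here is a small sketch of building this sampling distribution from toy word counts (the counts are made up):

```python
import numpy as np

# Toy word counts standing in for corpus frequencies.
counts = np.array([120, 60, 30, 8, 2], dtype=float)

unigram = counts / counts.sum()          # U(w)
weights = unigram ** 0.75                # raise to the 3/4 power
P = weights / weights.sum()              # normalize by Z

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=P)   # draw negative samples
print(P, negatives)
```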

GloVe (Global Vectors for Word Representation)

Both CBOW and Skip-Gram have a problem of locality: we only attend to co-occurring words within a limited window. What if we want to make use of global co-occurrence statistics? Suppose we have a global co-occurrence matrix $X \in \mathbb{R}^{n\times n}$, where $X_{ij}$ is the number of times words $w_i$ and $w_j$ co-occur. How can we make use of it?

Another idea is that we don't just want a conditional probability $\mathbb{P}(w_x|w_a)$. Instead, we want some comparative measure:

$$\frac{\mathbb{P}(w_x|w_a)}{\mathbb{P}(w_x|w_b)}$$

GloVe is motivated by the idea of using global co-occurrence counts and the need to represent relative co-occurrence. For words $w_i, w_j$ and their corresponding vectors $v_i, v_j$ and $u_i, u_j$ (every word again has two vectors), we model the conditional probability as:

$$\log \mathbb{P}(w_i|w_j) = v_i \cdot u_j$$

Now we have a nice property:

$$v_x \cdot (u_a-u_b) = \log \frac{\mathbb{P}(w_x|w_a)}{\mathbb{P}(w_x|w_b)}$$
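
This follows in one line from the log-bilinear model above:

$$\begin{aligned} v_x \cdot (u_a - u_b) &= v_x \cdot u_a - v_x \cdot u_b \\ &= \log \mathbb{P}(w_x|w_a) - \log \mathbb{P}(w_x|w_b) \\ &= \log \frac{\mathbb{P}(w_x|w_a)}{\mathbb{P}(w_x|w_b)} \end{aligned}$$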

We define the loss function $J$:

$$J = \sum_{i, j=1}^{n} f(X_{ij}) \left(v_i \cdot u_j + b_i + b_j - \log X_{ij} \right)^2$$

where $f$ is a weighting function that caps the contribution of very large $X_{ij}$, and $b_i, b_j$ are bias terms.
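
A minimal NumPy sketch of this loss, using the weighting function $f(x) = (x/x_{\max})^{3/4}$ for $x < x_{\max}$ and $1$ otherwise, as proposed in [3]; the toy co-occurrence matrix and parameter values here are illustrative.

```python
import numpy as np

n, k = 5, 3
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(n, k))               # vectors v_i (rows)
U = rng.normal(scale=0.1, size=(n, k))               # vectors u_j (rows)
b_v = np.zeros(n)                                    # biases b_i
b_u = np.zeros(n)                                    # biases b_j
X = rng.integers(0, 50, size=(n, n)).astype(float)   # toy co-occurrence counts

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f that caps the influence of very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(V, U, b_v, b_u, X):
    mask = X > 0                                     # only observed pairs contribute
    diff = V @ U.T + b_v[:, None] + b_u[None, :] - np.log(np.where(mask, X, 1.0))
    return np.sum(weight(X) * mask * diff ** 2)

print(glove_loss(V, U, b_v, b_u, X))
```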

References

[1] Mikolov et al. Efficient Estimation of Word Representations in Vector Space. (CBOW and Skip-Gram)

[2] Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. (negative sampling and hierarchical softmax)

[3] Pennington et al. GloVe: Global Vectors for Word Representation. (GloVe)

