Word2Vec

This chapter goes over different methods of converting words into vectors that we can manipulate and do calculations with.

One Hot

The first method is the simplest and oldest way of representing words as vectors: one-hot encoding.

An easy way to understand this: say you have a vocabulary of 10,000 different words. You could encode each word as a 10,000-dimensional vector where each dimension represents one of your 10,000 words. We set every value to 0 except for the dimension that represents the selected word, which we set to 1.

A small example: if we have 4 words {'car', 'what', 'be', 'at'} we could represent them as 4-dimensional vectors in the following way:

[1,0,0,0] = 'car'

[0,1,0,0] = 'what'

[0,0,1,0] = 'be'

[0,0,0,1] = 'at'
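As a quick sketch of how this looks in code (using NumPy and this made-up 4-word vocabulary), and of why these vectors carry no notion of similarity:

```python
import numpy as np

# Hypothetical 4-word vocabulary; each word gets an index.
vocab = ['car', 'what', 'be', 'at']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector for `word` over our small vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot('car'))   # [1. 0. 0. 0.]
print(one_hot('what'))  # [0. 1. 0. 0.]

# The dot product between any two different one-hot vectors is always 0,
# so these vectors tell us nothing about how similar two words are.
print(np.dot(one_hot('car'), one_hot('at')))  # 0.0
```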

The one-hot method of representing words as vectors is not a great idea, as it does not give us any notion of the relationships between different words, which is something we often need. For example, we can't say how similar 'car' is to 'at'.

As a result it is really only useful as a way of storing our raw data.

word2vec

We want a better way of creating our vectors, so that if we take the dot product between two different word vectors the result tells us how similar the two words are.
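For example, with dense vectors (the numbers below are invented, purely for illustration), the dot product, or its length-normalised version, cosine similarity, gives a single score that is larger for more similar vectors:

```python
import numpy as np

# Hypothetical dense vectors for three words (values invented for illustration).
v_car   = np.array([0.9, 0.1, 0.4])
v_truck = np.array([0.8, 0.2, 0.5])
v_at    = np.array([-0.3, 0.7, 0.0])

def cosine_similarity(a, b):
    """Dot product of the two vectors after normalising their lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(v_car, v_truck))  # high: the words should be similar
print(cosine_similarity(v_car, v_at))     # lower: the words are less related
```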

To do this we are going to look at the context of words. When we say the context of a word it simply means the words that are found next to it.

We are going to come up with some way to represent each word as a vector so that it is good at correctly predicting the other words that appear in its context (remember, context just means the words that appear next to it).

We are going to do this by building a model that learns the best way to convert our words into vectors. This is done by minimising a loss function. The loss function in this case tells us how well we can predict context words for a given input word.

We are going to do this with a piece of software called word2vec. Within word2vec there are several algorithms that do what we have described above.

One is called skip-gram and the other is called continuous bag of words (CBOW). We will mainly look at the skip-gram algorithm.

As an example, we take a centre word (in this example: 'banking') and try to predict, within some window size, the words that occur in the centre word's window (in this example: 'turning', 'into', 'crises', 'as').

Skip-gram is going to choose word vectors such that we maximise the likelihood of predicting the correct context words given some centre word.
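To make the training data concrete, here is a small sketch of how (centre word, context word) pairs can be extracted with a window size of 2. The sentence is a made-up example built around the words mentioned above:

```python
# Hypothetical training sentence (invented to match the example words above).
sentence = ['problems', 'turning', 'into', 'banking', 'crises', 'as', 'expected']
window_size = 2

# For every centre word, pair it with each word inside its window.
pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
        if j != i:
            pairs.append((centre, sentence[j]))

# For the centre word 'banking' this produces the pairs:
# ('banking', 'turning'), ('banking', 'into'), ('banking', 'crises'), ('banking', 'as')
print([p for p in pairs if p[0] == 'banking'])
```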

The skip-gram model overview:
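The original overview figure is not reproduced here, but the standard textbook form of the skip-gram objective it describes can be written out as follows (T is the number of words in the corpus, m the window size, and $v_c$, $u_o$ the centre and context vectors of a word). We minimise the negative log likelihood of the context words:

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p\!\left(w_{t+j} \mid w_t\right)$$

where the probability of a context word $o$ given a centre word $c$ is a softmax over the whole vocabulary:

$$p(o \mid c) = \frac{\exp\!\left(u_o^\top v_c\right)}{\sum_{w=1}^{W} \exp\!\left(u_w^\top v_c\right)}$$

So how do we minimise this loss function?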

Answer: our good old tools, gradient descent + backpropagation!

As a result of using these word2vec embeddings we are able to group similar words together, and these groupings even carry over to 2D space if we apply some dimensionality reduction to them.
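A rough sketch of what this looks like in practice, assuming the gensim and scikit-learn libraries are available and using a tiny made-up corpus (a real corpus would need to be far larger for meaningful groupings):

```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# Tiny made-up corpus: a list of tokenised sentences.
sentences = [
    ['i', 'like', 'deep', 'learning'],
    ['i', 'like', 'nlp'],
    ['i', 'enjoy', 'flying'],
]

# sg=1 selects the skip-gram algorithm (sg=0 would select CBOW).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Words that appear in similar contexts should end up with similar vectors.
print(model.wv.most_similar('like'))

# Reduce the 50-dimensional vectors to 2D so the groupings could be plotted.
vectors = model.wv[model.wv.index_to_key]
points_2d = PCA(n_components=2).fit_transform(vectors)
print(points_2d[:3])
```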

Summary:

Go through each word in your corpus (body of text)

Predict the surrounding words of each word

This will capture co-occurrences of words one at a time

Continuous Bag of Words

Continuous Bag of Words (CBOW) works in a very similar way to skip-gram; the main difference is that we try to predict the centre word from the vector sum of the surrounding words, kind of like an inverse of skip-gram.
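A minimal sketch of the CBOW prediction step, assuming we already have context-vector and centre-vector matrices for a small vocabulary (all numbers random here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4

# Hypothetical embedding matrices: one row per word in the vocabulary.
V = rng.normal(size=(vocab_size, dim))  # context ("input") vectors
U = rng.normal(size=(vocab_size, dim))  # centre ("output") vectors

# Indices of the context words surrounding the centre word we want to predict.
context_ids = [0, 2, 3, 5]

# CBOW: sum the context vectors...
context_sum = V[context_ids].sum(axis=0)

# ...then score every word in the vocabulary and softmax to get probabilities.
scores = U @ context_sum
probs = np.exp(scores) / np.exp(scores).sum()

print('Predicted centre word id:', probs.argmax())
```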

Co-occurrence matrices (count-based method)

The skip-gram model captures co-occurrences one word at a time. For example, we go through our corpus one word at a time and see that 'deep' and 'learning' occur together, so we update these vectors and carry on; later we see them occur together again, so we do another update.

Doing this doesn't seem very efficient though. Why not go through the corpus just once and count how many times each pair of words occurs together, rather than going one occurrence at a time?

Well, we can do this using something called a co-occurrence matrix! A sketch of how one can be built is shown below:
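The original example matrix is not reproduced here, but here is a sketch of how a window-based co-occurrence matrix can be built from a tiny made-up corpus (chosen to match the example words used in this section):

```python
import numpy as np

# Tiny made-up corpus, tokenised into sentences.
corpus = [
    ['i', 'like', 'deep', 'learning'],
    ['i', 'like', 'nlp'],
    ['i', 'enjoy', 'flying'],
]
window_size = 1

vocab = sorted({word for sentence in corpus for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# counts[i, j] = how many times word j appears in the window around word i.
counts = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
            if j != i:
                counts[index[word], index[sentence[j]]] += 1

print(vocab)
print(counts)
# Each row is now a (sparse, vocabulary-sized) vector for one word.
```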

Going across each row we then have a vector for each of our words, and we see for example that 'like' and 'enjoy' have some overlap, so they are probably similar to each other, which is nice. However, not only do these vectors change size as more words are added to our corpus, but for a very large corpus the vectors become huge, which is not nice, as you will run into sparsity issues when training anything with them.

The answer is to reduce the dimensionality down to a fixed number of dimensions, keeping only the important ones.

One method of reducing the dimensionality is SVD (singular value decomposition). We then use the result of the SVD as our word vectors.
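As a purely illustrative sketch, we can run an SVD on a small co-occurrence matrix with NumPy and keep only the first few singular vectors as our word vectors (the counts and vocabulary below are made up):

```python
import numpy as np

# A small co-occurrence matrix like the one built in the previous sketch
# (rows/columns: made-up counts for a handful of words).
counts = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
words = ['i', 'like', 'enjoy', 'deep']  # hypothetical vocabulary order

# Singular value decomposition of the count matrix.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)

k = 2  # keep only the k most important dimensions
word_vectors = U[:, :k] * S[:k]  # scale the left singular vectors by the singular values

for word, vec in zip(words, word_vectors):
    print(f'{word}: {vec}')
```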

This method is actually older than skip-gram, dating back to 2005, but it still produces some nice results.

Global Vectors for word representation - GloVe model

So there are good and bad points to both word2vec-style prediction methods and count-based methods: count-based methods train quickly and make efficient use of corpus statistics, but they primarily capture word similarity; prediction methods can capture more complex patterns and tend to perform better on downstream tasks, but they scale with corpus size and use the statistics less efficiently.

The next step is to combine the best of both of these methods. The result is the GloVe model.

In the GloVe model we take our big co-occurrence matrix as a starting point. Now, rather than reducing this using SVD like above, we instead try to create word vectors similarly to skip-gram, by minimising a new loss function:
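The original equation is not reproduced here; a common textbook form of the GloVe objective (with the bias terms of the original paper omitted) is:

$$J(\theta) = \frac{1}{2}\sum_{i,j=1}^{W} f\!\left(P_{ij}\right)\left(u_i^\top v_j - \log P_{ij}\right)^2$$

Here $f$ is a weighting function that stops very frequent co-occurrences from dominating the loss.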

In this loss function P is the co-occurrence matrix, while u and v are the word vectors.

The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.
