Comprehensive Guide on Text Vectorization

Last updated: Aug 12, 2023
Tags: Machine Learning

Colab Notebook

You can run all the code snippets throughout this guide in my Colab Notebook.

What is text vectorization?

Text vectorization involves transforming non-numeric data such as text into numerical vectors. There are two reasons for doing so:

  • machine learning models require numerical input and therefore we need to transform non-numeric data (e.g. text and categories) into vectors first.

  • we can visualize words (refer to the section on Visualizing words in our Comprehensive Guide on PCA).

There are many ways to vectorize text:

  • One-hot encoding

  • Dummy encoding

  • Bag-of-words

  • TF-IDF (Term Frequency - Inverse Document Frequency)

  • Word embedding

In this guide, we will go through the theory as well as the Python implementation of these techniques. As always, feel free to hop onto our Discord to ask questions or leave feedback - much appreciated!

One-hot encoding

One-hot encoding is perhaps the easiest way to vectorize text. We begin with a quick example.

Consider the following example:

You say goodbye and I say hello

There are 6 unique words in our tiny corpus. We assign a unique incremental ID to each token:

ID   Token
1    You
2    say
3    goodbye
4    and
5    I
6    hello

Here, the vocabulary size is 6, that is, there are 6 unique terms in our corpus. This allows us to form a one-hot vector representation of each word that appears in the corpus. For instance, the one-hot vector for the word You is:

$$\begin{pmatrix} 1&0&0&0&0&0 \end{pmatrix}$$

The one-hot vector for the word goodbye is:

$$\begin{pmatrix} 0&0&1&0&0&0 \end{pmatrix}$$
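
To make this concrete, here is a minimal sketch that builds these one-hot vectors for our tiny corpus, assuming simple whitespace tokenization and case-sensitive tokens:

import numpy as np

corpus = "You say goodbye and I say hello"
tokens = corpus.split()

# Assign an incremental ID (0-based position) to each unique token, in order of appearance
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

def one_hot(token):
    # A vector of zeros with a single 1 at the token's position
    vector = np.zeros(len(vocab), dtype=int)
    vector[vocab[token]] = 1
    return vector

print(one_hot('You'))       # [1 0 0 0 0 0]
print(one_hot('goodbye'))   # [0 0 1 0 0 0]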

The approach of one-hot vectorization is naive in the sense that the semantics of the input are completely discarded, that is, the "meaning" of the tokens is not encoded. This is because, mathematically, each unique token is simply mapped to its own orthogonal dimension.

WARNING

The application of one-hot vectors is not unique to text - categorical data (e.g. economy class, business class, first class) is often converted to one-hot vectors as well. Moreover, since one-hot encoding creates a new dimension for every unique token, the dimensionality of the one-hot vectors becomes extremely large for a large corpus - thereby resulting in the curse of dimensionality.

Implementing one-hot vectors using Python

We can easily perform one-hot encoding using the method get_dummies(~) in Python's Pandas library. Please refer to our in-depth documentation about this method here.

As an example, consider the data:

import pandas as pd

df = pd.DataFrame({
    'name': ['alex', 'bob', 'cathy', 'doge'],
    'nationality': ['korean', 'canadian', 'french', 'canadian']})
df
name nationality
0 alex korean
1 bob canadian
2 cathy french
3 doge canadian

To encode the nationality feature as one-hot vectors:

pd.get_dummies(df, columns=['nationality'])
name nationality_canadian nationality_french nationality_korean
0 alex 0 0 1
1 bob 1 0 0
2 cathy 0 1 0
3 doge 1 0 0

Here, notice how we now have 3 new columns - one for each category.

Dummy encoding

Dummy encoding is just like one-hot encoding except we use one less column to perform the encoding.

Consider the same dataset again:

import pandas as pd

df = pd.DataFrame({
    'name': ['alex', 'bob', 'cathy', 'doge'],
    'nationality': ['korean', 'canadian', 'french', 'canadian']})
df
name nationality
0 alex korean
1 bob canadian
2 cathy french
3 doge canadian

To perform dummy encoding, use Pandas' get_dummies(~) again but with the argument drop_first=True:

pd.get_dummies(df, columns=['nationality'], drop_first=True)
name nationality_french nationality_korean
0 alex 0 1
1 bob 0 0
2 cathy 1 0
3 doge 0 0

Notice how we only have 2 columns instead of 3 as in the one-hot encoding case. Bob's nationality is filled with 0s, which means that he is neither French nor Korean - he is Canadian by default. We say that Canadian is the reference category. The key here is that even with one less column, we are not losing any information at all!

The advantage of dummy encoding is that we can reduce the number of features by one without information loss!
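
To see why no information is lost, note that the one-hot columns of every row sum to 1, so any one column is fully determined by the others. A quick sketch using the same toy DataFrame:

import pandas as pd

df = pd.DataFrame({
    'name': ['alex', 'bob', 'cathy', 'doge'],
    'nationality': ['korean', 'canadian', 'french', 'canadian']})
one_hot = pd.get_dummies(df, columns=['nationality'])

# The three one-hot columns always sum to 1 for every row,
# so any single column can be recovered from the other two
dummy_cols = [c for c in one_hot.columns if c.startswith('nationality_')]
print(one_hot[dummy_cols].sum(axis=1).tolist())   # [1, 1, 1, 1]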

Bag-of-words

Bag-of-words, or BoW, is another numerical representation of textual data that captures the number of times a word occurs within a document.

Just like for one-hot vectors, a unique token is represented by its position within the vector. Consider the following corpus, this time with a second document added:

You say goodbye and I say hello
hello world

Once again, here is the table of unique words and their ID:

ID   Token
0    You
1    say
2    goodbye
3    and
4    I
5    hello
6    world

Since the vocabulary size is 7, each document will be represented by a vector of size 7.

The bag-of-words representation of each document is as follows:

You say goodbye and I say hello: [1, 2, 1, 1, 1, 1, 0]
hello world: [0, 0, 0, 0, 0, 1, 1]

Here, we have 2 for the second element of the first data item because the word say occurs twice in this data item.
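
Here is a minimal sketch that computes these bag-of-words vectors by hand, assuming whitespace tokenization and the 0-based IDs from the table above:

import numpy as np

vocab = {'You': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'I': 4, 'hello': 5, 'world': 6}
corpus = ["You say goodbye and I say hello", "hello world"]

def bag_of_words(text):
    # Count how many times each vocabulary token appears in the text
    vector = np.zeros(len(vocab), dtype=int)
    for token in text.split():
        vector[vocab[token]] += 1
    return vector

for text in corpus:
    print(bag_of_words(text))
# [1 2 1 1 1 1 0]
# [0 0 0 0 0 1 1]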

NOTE

The term "bag" in "bag-of-words" means that the order of words are discarded. For instance, consider the following two text:

hello world
world hello

The two vector representations are equivalent, that is, the information about the ordering of the words is not encoded:

hello world: [0, 0, 0, 0, 0, 1, 1]
world hello: [0, 0, 0, 0, 0, 1, 1]

Implementing bag-of-words using Python's Scikit-learn

In Python, we can obtain the bag-of-words representation using scikit-learn's CountVectorizer class:

from sklearn.feature_extraction.text import CountVectorizer

text = ["You say goodbye and I say hello", "hello world"]
vectorizer = CountVectorizer()
vectorizer.fit(text)
vector = vectorizer.transform(text)

The vocabulary is represented by a simple map where the key is the token, and the value is the index of the token in the vector:

print(vectorizer.vocabulary_)
{'you': 5, 'say': 3, 'goodbye': 1, 'and': 0, 'hello': 2, 'world': 4}

Since we have two data items, and the vocabulary size is 7, we might expect each data item to be represented by a vector of size 7:

print(vector.shape)
(2, 6)
WARNING

The reason why the vector size is 6 instead of 7 is that CountVectorizer, by default, only treats words containing at least 2 alphanumeric characters as tokens. Therefore, the word "I" is not counted as a token.
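
If you do want single-character words such as "I" to be counted, you can override CountVectorizer's token_pattern argument. Here is a quick sketch - the default pattern requires two or more word characters, and relaxing it keeps one-character tokens as well:

from sklearn.feature_extraction.text import CountVectorizer

# The default token_pattern is r"(?u)\b\w\w+\b" (two or more word characters);
# r"(?u)\b\w+\b" also keeps single-character tokens such as "i"
vectorizer_all = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vector_all = vectorizer_all.fit_transform(["You say goodbye and I say hello", "hello world"])
print(vector_all.shape)   # (2, 7)

For the rest of this section, we stick with the default tokenization.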

We can view the actual bag-of-words representation like so:

print(vector.toarray())
[[1 1 1 2 0 1]
[0 0 1 0 1 0]]

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF, or term frequency-inverse document frequency, is a statistical measure of how relevant a word is to a document in a collection of documents. As the name suggests, the TF-IDF of a word is composed of two components:

  • Term frequency (TF) - how frequently the word occurs in the document (its count divided by the total number of terms in the document)

  • Inverse Document Frequency (IDF) - a weight that makes rare words more prominent and downweights commonly occurring terms (e.g. stop words such as "a" and "is").

Mathematically, TF-IDF is the product of TF and IDF:

$$\text{TF-IDF}(t,d,D)=\mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)$$

Where $\mathrm{tf}(t,d)$ computes the term frequency:

$$\mathrm{tf}(t,d)=\frac{\text{(Number of times the term t appears in document d)}} {\text{(Total number of terms in the document d)}}$$

And $\mathrm{idf}(t,D)$ computes the inverse document frequency:

$$\mathrm{idf}(t,D)=\log\Big(\frac{\text{Total number of documents}}{\text{Number of documents containing the term t in collection of documents D}}\Big)$$

Let's now go through a simple example together to intuitively understand TF-IDF at a deeper level.

Simple example of computing TF-IDF

Consider the following collection of documents $D$:

1: "I say hello and you say goodbye"
2: "hello world hello"
3: "hello goodbye"

I'll show the calculation of TF, IDF and TF-IDF using the table below:

Term      TF (Doc 1)   TF (Doc 2)   TF (Doc 3)   IDF              TF-IDF (Doc 1)   TF-IDF (Doc 2)   TF-IDF (Doc 3)
I         1/7          0            0            log(3/1)=0.477   0.068            0                0
say       2/7          0            0            log(3/1)=0.477   0.136            0                0
hello     1/7          2/3          1/2          log(3/3)=0       0                0                0
and       1/7          0            0            log(3/1)=0.477   0.068            0                0
you       1/7          0            0            log(3/1)=0.477   0.068            0                0
goodbye   1/7          0            1/2          log(3/2)=0.176   0.025            0                0.088
world     0            1/3          0            log(3/1)=0.477   0                0.159            0

Let's understand how the numbers in the table are computed, and in particular how we can interpret them. To compute the term frequency (TF) of the term say for document one:

$$\begin{align*} \mathrm{tf}(\text{say},\text{Doc 1})&=\frac{\text{(Number of times the term "say" appears in Doc 1)}} {\text{(Total number of terms in Doc 1)}}\\ &=\frac{2}{7} \end{align*}$$

From the formula of $\mathrm{tf}$, we can see that a high TF value of a term is obtained if:

  • the term occurs more frequently in the document

  • the number of terms in the document is small

For example, suppose an article contains the word cars many times throughout the article. If the length of the article is short (high TF value), then it is reasonable to assume that the article is about cars. However, if the length of the article is long (low TF value), then perhaps the article simply mentioned the term cars but the article's topic or theme is actually about something else entirely. A high TF value of a term therefore suggests that the theme of the document revolves around this term. Obviously, words that do not appear at all in the document (e.g. the term world in document one) are definitely not related to the theme of the document, and hence receive a TF value of 0.

Now, let's move on to IDF. To compute the IDF of the term say:

$$\begin{align*} \mathrm{idf}(\text{say},D)&= \log\Big(\frac{\text{Total number of documents}}{\text{Number of documents containing the term "say" in collection of documents D}}\Big)\\ &=\log\Big(\frac{3}{1}\Big)\\ &\approx0.477 \end{align*}$$

From the formula, we can see that a term will receive a high IDF value if:

  • the number of documents is large

  • the number of documents containing the term is low

IDF is extremely useful in penalizing stop-words (e.g. a, is). For example, suppose we had a collection of articles. You would expect the term frequency of commonly occurring words such as a and is to be extremely high. Even if an article is about cars, the term frequency of these stop-words will most likely be much higher than that of keywords such as cars. Does this mean that a is more important and relevant than cars in guessing what the article is about? Obviously not - this is where IDF comes into play.

You would expect stop-words such as a and is to appear in just about any article regardless of its theme. This means that the denominator of IDF (i.e. the number of documents containing the term a and is in our list of articles) is extremely high. In fact, you would expect these stop-words to appear in every single article, which essentially means that:

$$\begin{align*} \mathrm{idf}(a,D)&=\log(1)\\ &=0 \end{align*}$$

Since TF-IDF is a product of TF and IDF, the TF-IDF would also end up being 0 for the stop-words! However, if an article is exclusively about cars, then the term car would most likely only appear in this article and not other articles. In such cases, the number of documents (articles) containing the term car is low, and therefore the IDF value of car would be relatively high.

The key take-away from this simple example is that a term's TF-IDF in a document measures how relevant or important that term is in capturing the theme of the document.

Finally, with the TF-IDF computed for each term of each document, we can now convert each term into a vector. For instance, the term goodbye can be represented by the following 3-dimensional vector:

$$\text{goodbye}=\begin{pmatrix} 0.025\\ 0\\ 0.088 \end{pmatrix}$$
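
Before moving on to scikit-learn, here is a minimal sketch that reproduces the goodbye column of the table above directly from the formulas, assuming whitespace tokenization, lower-casing and base-10 logarithms (as used in the table):

import math

docs = [
    "I say hello and you say goodbye".lower().split(),
    "hello world hello".lower().split(),
    "hello goodbye".lower().split(),
]

def tf(term, doc):
    # (number of times the term appears in the document) / (total number of terms in the document)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (total number of documents) / (number of documents containing the term)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / n_containing)

for doc in docs:
    print(round(tf("goodbye", doc) * idf("goodbye", docs), 3))
# 0.025
# 0.0
# 0.088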

Implementing TF-IDF using Python's Scikit-learn

Scikit-learn's implementation of TF-IDF makes two slight modifications to the textbook formula (source). Firstly, IDF is calculated like so:

$$\mathrm{idf}(t,D)=\log\Big(\frac{1+\text{Total number of documents}} {1+\text{Number of documents containing the term t in collection of documents D}}\Big)$$

Here, you can see that we're simply adding 1 to the numerator and the denominator inside the logarithm. Adding 1 to the denominator is a common technique to avoid division by zero (a term that appears in no document), and 1 is added to the numerator to balance out this change.

The second modification made is as follows:

$$\text{TF-IDF}=\mathrm{TF}\cdot{(\mathrm{IDF}+1)}$$

One is added to the IDF so that a term whose IDF is zero (a term that appears in every document) does not have its TF-IDF suppressed completely.

The good news is that, despite these minor adjustments, the interpretation of TF-IDF remains exactly the same.

Consider the following corpus (collection of documents $D$):

documents = [
    'You say hello and I say goodbye',
    'I think you are right',
    'I love cars but you dont right'
]

We can vectorize each term via TF-IDF with the TfidfVectorizer class in sklearn:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(documents)
tf_idf
<3x12 sparse matrix of type '<class 'numpy.float64'>'
with 15 stored elements in Compressed Sparse Row format>

The tf_idf here is a SciPy sparse matrix. This is because tf_idf usually contains many zeros, so using a sparse matrix is memory-efficient. To pretty-print our results, we can convert this sparse matrix into a Pandas DataFrame like so:

import pandas as pd

pd.set_option('display.precision', 2)   # show up to 2 decimal places
df = pd.DataFrame(tf_idf.toarray(), columns=vectorizer.get_feature_names_out())
df.head()
and are but cars dont goodbye hello love right say think you
0 0.37 0.00 0.00 0.00 0.00 0.37 0.37 0.00 0.00 0.74 0.00 0.22
1 0.00 0.58 0.00 0.00 0.00 0.00 0.00 0.00 0.44 0.00 0.58 0.35
2 0.00 0.00 0.45 0.45 0.45 0.00 0.00 0.45 0.34 0.00 0.00 0.27

Here, notice the following:

  • some terms such as I are missing. This is because, by default, TfidfVectorizer ignores tokens that consist of a single character (e.g. a).

  • all the tokens have been lower-cased by default.

The TfidfVectorizer has over 10 parameters that you can tinker with - please check out the official docs here.

WARNING

Sklearn also has a related module called TfidfTransformer (as opposed to TfidfVectorizer), but the transformer version requires you to feed in the fitted output of CountVectorizer explained above. In almost all cases, you can directly use TfidfVectorizer.
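
As a sanity check, the numbers in the DataFrame above can be reproduced by hand. Below is a minimal sketch assuming scikit-learn's defaults: raw term counts are used as TF, IDF follows the smoothed formula above (with a natural logarithm) plus one, and each document vector is additionally scaled to unit length (L2 normalization) - a step TfidfVectorizer performs by default even though it is not covered above:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    'You say hello and I say goodbye',
    'I think you are right',
    'I love cars but you dont right'
]

# Raw term counts (TF) - same tokenization and vocabulary as TfidfVectorizer
counts = CountVectorizer().fit_transform(documents).toarray()

# Smoothed IDF: log((1 + number of documents) / (1 + document frequency)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply TF by IDF, then scale each row to unit length (L2 normalization)
tf_idf_manual = counts * idf
tf_idf_manual /= np.linalg.norm(tf_idf_manual, axis=1, keepdims=True)
print(np.round(tf_idf_manual, 2))   # matches the DataFrame above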

Word embedding

The problem with one-hot vectorization is that the semantics of the token are not encoded in its vector representation. For instance, the words "hello" and "hi" share a similar meaning, but one-hot encoding does not capture this relationship at all.

In contrast, word embeddings vectorize tokens such that tokens with similar meanings end up close to each other in the vector space. For instance, the tokens "mobile" and "smartphone" would share a similar vector representation. In other words, word embeddings encode the meaning of a token in its vector.

Implementing word embedding using Python's Gensim library

In order to obtain word embeddings, we must first train a model on a large corpus such as Wikipedia. Fortunately, Python's Gensim library provides a number of pre-trained word embeddings that you can use directly without any training. For the list of available word embeddings, please visit here.

For our demonstration, we will be using the GloVe word embedding trained on Wikipedia (as of 2014) and Gigaword, a large corpus of English news documents. The training corpus contains around 6 billion tokens, and the resulting embedding provides vector representations for a vocabulary of 400,000 words.

Firstly, let's load our word embedding:

import gensim.downloader as api

model = api.load('glove-wiki-gigaword-50')
model

Here, note the following:

  • even though the corpus on which the embedding was trained is huge, the word embedding itself is only 64MB

  • running this code for the first time will download the word embedding onto your local machine. Running this code again from the second time onwards would use the downloaded word embedding instead.

  • the -50 suffix means that each word is represented by a vector of length 50. Gensim also offers -100, -200 and -300 variants.

We can obtain the vector representation of a word like so:

model['hello']
array([-0.38497 , 0.80092 , 0.064106, -0.28355 , -0.026759, -0.34532 ,
-0.64253 , -0.11729 , -0.33257 , 0.55243 , -0.087813, 0.9035 ,
0.47102 , 0.56657 , 0.6985 , -0.35229 , -0.86542 , 0.90573 ,
0.03576 , -0.071705, -0.12327 , 0.54923 , 0.47005 , 0.35572 ,
1.2611 , -0.67581 , -0.94983 , 0.68666 , 0.3871 , -1.3492 ,
0.63512 , 0.46416 , -0.48814 , 0.83827 , -0.9246 , -0.33722 ,
0.53741 , -1.0616 , -0.081403, -0.67111 , 0.30923 , -0.3923 ,
-0.55002 , -0.68827 , 0.58049 , -0.11626 , 0.013139, -0.57654 ,
0.048833, 0.67204 ], dtype=float32)

Note the following:

  • the word 'hello' is represented by a vector of size 50.

  • this numerical vector captures the semantics of the word 'hello'

With these word embeddings, we can perform some interesting NLP tasks. For instance, to find the top 5 similar words to the word 'hello':

model.most_similar('hello', topn=5)
[('goodbye', 0.8537959456443787),
('hey', 0.8074296116828918),
('!', 0.7951388359069824),
('kiss', 0.7892292737960815),
('wow', 0.7641353011131287)]

Note the following:

  • the numbers represent the cosine similarity score (ranging from -1 to 1, where a higher value means the words are more similar)

  • 'goodbye' is the antonym of 'hello', but it is still considered the most similar word. This is because similarity in word embeddings such as GloVe and Word2Vec is measured by the context in which a word appears. The reason 'goodbye' is similar to 'hello' is that the words surrounding 'goodbye' and 'hello' tend to be alike.

To obtain the similarity score of two words:

model.similarity(w1='hello', w2='hey')
0.8074296

If two words rarely appear in the same context, they would have a low similarity score:

model.similarity(w1='hello', w2='car')
0.22838566

To obtain the word that is least similar to the other words in a list:

model.doesnt_match(['hi','hello','car','goodbye'])
'car'
NOTE

Word embeddings capture the semantics of words such that the vector representation of similar words (e.g. 'hi' and 'hello') are close to each other, while unrelated words (e.g. 'hi' and 'car') are far away from each other. For a visual demonstration, please refer to the section on Visualizing words in our Comprehensive Guide on PCA.

Arithmetic with word embeddings

Consider the following classic example:

(king - man) + woman = queen

We can intuitively understand why this arithmetic makes sense, and what's remarkable is that it also holds true when performed on the word embeddings themselves!

To demonstrate this, we will be using the word embedding trained on the Google News dataset. The embedding can be downloaded from this official Google Drive. The file that contains the embeddings is 1.5GB in size, so you might want to just follow along here without trying this on your own machine.

After downloading this data, we can initialize our model with Gensim again like so:

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
model['hello'].shape
(300,)

Here, we can see that each word is represented by a vector of size 300.

To perform the classic (king - man) + woman:

model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.7118192911148071),
('monarch', 0.6189674735069275),
('princess', 0.5902431011199951),
('crown_prince', 0.5499460697174072),
('prince', 0.5377321243286133),
('kings', 0.5236844420433044),
('Queen_Consort', 0.5235945582389832),
('queens', 0.5181134343147278),
('sultan', 0.5098593235015869),
('monarchy', 0.5087411403656006)]

We can see that the best match for this arithmetic is indeed queen! This result is astonishing because the vector representations of the words look completely random to us humans, yet they somehow manage to capture the essence and the semantics of the words!

Connection with one-hot vector

Word embedding as a matrix

Word embeddings are typically structured in the form of a matrix like so:

$$\boldsymbol{W}=\begin{pmatrix} w_{1,1}&w_{1,2}&\cdots&w_{1,50}\\ w_{2,1}&w_{2,2}&\cdots&w_{2,50}\\ \vdots&\vdots&\ddots&\vdots\\ w_{1000,1}&w_{1000,2}&\cdots&w_{1000,50}\\ \end{pmatrix}$$

Here, note the following:

  • each row is a vector representation of a word

  • in this case, each word is represented by a 50-dimensional vector - just like in our above example

  • the number of rows is equal to the number of tokens in the vocabulary. In our Gensim example, the vocabulary contained 400,000 words, so the matrix would have 400,000 rows. I've only shown 1000 here for brevity.

Product of one-hot vector and word embedding

Consider the following text once again:

You say goodbye and I say hello

Recall that the one-hot vector representation involves assigning an incremental unique ID to each unique token:

ID   Token
1    You
2    say
3    goodbye
4    and
5    I
6    hello

For instance, the one-hot vector for the word say is:

$$\begin{pmatrix} 0&1&0&0&0&0 \end{pmatrix}$$

The word embedding matrix for this example might look like the following:

$$\begin{pmatrix} w_{11}&w_{12}&w_{13}\\ w_{21}&w_{22}&w_{23}\\ w_{31}&w_{32}&w_{33}\\ w_{41}&w_{42}&w_{43}\\ w_{51}&w_{52}&w_{53}\\ w_{61}&w_{62}&w_{63}\\ \end{pmatrix}$$

Here, we are assuming that each word is represented by a vector in $\mathbb{R}^3$. The first row contains the word embedding for the first word 'You', the second row contains the word embedding for the second word 'say', and so on.

Taking the product between a one-hot vector and the word embedding matrix simply extracts one row of the embedding matrix. For instance, consider the product between the one-hot vector of 'say' (the second word) and the word embedding matrix:

$$\begin{pmatrix} 0&1&0&0&0&0 \end{pmatrix}\cdot \begin{pmatrix} w_{11}&w_{12}&w_{13}\\ \color{blue}w_{21}&\color{blue}w_{22}&\color{blue}w_{23}\\ w_{31}&w_{32}&w_{33}\\ w_{41}&w_{42}&w_{43}\\ w_{51}&w_{52}&w_{53}\\ w_{61}&w_{62}&w_{63}\\ \end{pmatrix} =\begin{pmatrix} \color{blue}w_{21}&\color{blue}w_{22}&\color{blue}w_{23} \end{pmatrix}$$
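
A small NumPy sketch illustrating this row selection, using a stand-in 6x3 matrix of integers as the embedding matrix:

import numpy as np

# Stand-in embedding matrix: 6 tokens, each represented by a 3-dimensional vector
W = np.arange(18).reshape(6, 3)

# One-hot vector for the second token ('say')
one_hot_say = np.array([0, 1, 0, 0, 0, 0])

# The product simply selects the second row of the embedding matrix
print(one_hot_say @ W)                         # [3 4 5]
print(np.array_equal(one_hot_say @ W, W[1]))   # True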

You may be wondering how we can obtain these word embeddings in the first place. They are actually obtained by training a neural network with one hidden layer with the objective of either:

  • using surrounding words - the context - to predict the target word (CBOW approach)

  • using the target word to predict the context (Skip-Gram approach)

I will write another comprehensive guide on these approaches, and how exactly we can obtain these word embeddings that somehow magically capture the semantics of the words. To be notified when I publish this guide, please either register an account or join our Discord.

Published by Isshin Inada