Foundations of NLU – Word Representations

Jayanth Srinivasa

Tuesday, May 17th, 2022

Natural Language Understanding (NLU) is a branch of Machine Learning (ML) that deals with a machine's ability to understand human language. Human language is made up of words, whereas machines and ML algorithms require words to be represented as numbers or vectors. This blog explores how words are 'represented' in ML algorithms.

One-hot encoded vector

One method is to represent words using one-hot encoded vectors, where the length of the vector is the number of words in the dictionary. Each word in the dictionary maps to one vector component, and a word's one-hot vector has a 'one' in that component and zeros everywhere else. A sentence can then be represented by marking as 'one' the component of every word it contains. Summing the one-hot vectors of its words gives almost the same result, except that repeated words are counted.

For example:
Here is our dictionary: [cat, bat, rat, sat, mat, on, the]. The sentence 'the cat sat on the mat' can be represented as [1, 0, 0, 1, 1, 1, 1]. Summing the one-hot vectors would instead give [1, 0, 0, 1, 1, 1, 2], because 'the' occurs twice.
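The example can be reproduced with a few lines of Python. This is only a minimal sketch: the function names (one_hot, encode_presence, encode_counts) are illustrative and not part of any library, and the sentence is split on whitespace with no further preprocessing.

```python
# Dictionary from the example; the index of each word is arbitrary.
vocabulary = ["cat", "bat", "rat", "sat", "mat", "on", "the"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return the one-hot vector for a single dictionary word."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

def encode_presence(sentence):
    """Mark a component as 1 if the word occurs at least once."""
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        vector[word_to_index[word]] = 1
    return vector

def encode_counts(sentence):
    """Sum the one-hot vectors of the words, so repeats are counted."""
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        for i, value in enumerate(one_hot(word)):
            vector[i] += value
    return vector

sentence = "the cat sat on the mat"
print(encode_presence(sentence))  # [1, 0, 0, 1, 1, 1, 1]
print(encode_counts(sentence))    # [1, 0, 0, 1, 1, 1, 2]
```

Both encodings discard word order: 'the mat sat on the cat' produces exactly the same vectors.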

The drawback of this method is that the position of a word in the dictionary is arbitrary and carries no meaning, so the vectors say nothing about how words relate to each other. Further, because any one sentence uses only a small fraction of a large dictionary, the resulting vector is sparse, i.e., most of its components are zero. Moreover, any addition to the vocabulary changes the length of the vectors.

Word vector

"You shall know a word by the company it keeps" – John R Firth.

The meaning of a word can be effectively captured by observing the words around it. This insight, combined with deep learning, led to the concept of word vectors. With word vectors, each word is represented by a vector of fixed length. The values of the individual components are learned by a deep learning (DL) network during the training phase.

The DL network is trained on one of the following tasks:
a) learn to predict a word, given the neighbors of the word (called 'continuous bag of words').
b) learn to predict the neighbors, given a word (called 'skip-gram').
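The two tasks differ only in which side of each training example is the input and which is the target. The sketch below builds those examples from a tokenized sentence. It covers data preparation only, not the network itself, and the function names and the window parameter are assumptions made for illustration.

```python
def cbow_examples(tokens, window=2):
    """Return (neighbors, word) examples: predict the word from its neighbors."""
    examples = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, center))
    return examples

def skip_gram_pairs(tokens, window=2):
    """Return (word, neighbor) pairs: predict each neighbor from the word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(cbow_examples(tokens, window=1)[1])
# (['the', 'sat'], 'cat')
print(skip_gram_pairs(tokens, window=1)[:3])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

A larger window gives each word more neighbors, and therefore more training examples per sentence.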

Word vectors are randomly initialized and become semantically meaningful during the training phase. After training, words that are similar in meaning tend to have word vectors that are close to each other. These word vectors are dense, which makes them useful as inputs to downstream tasks such as classification and word generation. Well-known sets of word vectors such as Word2Vec and GloVe were trained on different text corpora and are openly available.
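One common way to measure how 'close' two word vectors are is cosine similarity. The sketch below uses three hand-picked, three-dimensional toy vectors. They are not taken from Word2Vec, GloVe, or any trained model, and they exist only to show the computation. The helper names are illustrative.

```python
import math

# Hand-picked toy vectors for illustration, NOT from a trained model.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "rat": [0.8, 0.9, 0.2],
    "bank": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word):
    """Return the other word whose vector is closest to the given word's."""
    others = [w for w in toy_vectors if w != word]
    return max(
        others,
        key=lambda w: cosine_similarity(toy_vectors[word], toy_vectors[w]),
    )

print(most_similar("cat"))  # rat
```

With real pretrained vectors the same lookup would run over a vocabulary of many thousands of words, usually with a vectorized library rather than plain lists.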

Context-aware word embeddings

Word vectors remove the dependence of the representation on the size and ordering of the dictionary, but they are still limited when it comes to context. A word can have different meanings depending on the context in which it appears. For example, the word 'bank' can be used as a noun to refer to a place where we transact money or to the land by a river, or as a verb to indicate a turn in three-dimensional space. Humans resolve the meaning of such words from the context. A fixed vector representation, however, forces a single vector to stand for all the meanings of a word, which can create confusion in NLU tasks like question answering.

To address this, researchers have developed context-aware embeddings using attention and Transformer architectures. In these models, the vector for each word is adjusted by the vectors of the surrounding words, so the same word receives a different embedding in each context. Using these context-aware embeddings in downstream NLU tasks has significantly improved the state-of-the-art results for these tasks.
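The idea of one word's vector being influenced by its neighbors can be sketched with a heavily simplified form of self-attention. The code below is an assumption-laden toy, not how a real Transformer is implemented: it has no learned query, key, or value projections, no multiple heads, and no stacked layers, and the two-dimensional vectors are hand-picked rather than trained.

```python
import math

# Hand-picked toy static vectors, NOT from a trained model.
static_vectors = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens):
    """Simplified self-attention with no learned projections."""
    vectors = [static_vectors[t] for t in tokens]
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot-product score of this token against every token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New vector: attention-weighted average of all token vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
    return outputs

bank_near_river = contextualize(["river", "bank"])[1]
bank_near_money = contextualize(["money", "bank"])[1]
print(bank_near_river)
print(bank_near_money)
```

The static vector for 'bank' is identical in both calls, yet the output for 'bank' leans toward the first component next to 'river' and toward the second next to 'money'.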

Beyond word vectors

Word vectors and context-aware vectors are used in a variety of NLU tasks, such as those in the General Language Understanding Evaluation (GLUE) benchmark (e.g., question answering). But sometimes it is necessary to classify a sentence or a document as a whole. One approach is to create a new embedding vector that represents the sentence. The easiest way to generate a sentence vector is to average all the word vectors in the sentence. Another approach is to place a special tag at the beginning of the sentence (for example, the [CLS] token used by BERT); when that tag is passed through a context-aware Transformer network along with the sentence, its output vector can capture the meaning of the entire sentence rather than of any individual word.
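The averaging approach can be sketched as follows. The word vectors here are hand-picked toy values, not the output of a trained model, and sentence_vector is an illustrative name.

```python
# Hand-picked toy word vectors, NOT from a trained model.
word_vectors = {
    "the": [0.0, 0.0, 1.0],
    "cat": [1.0, 0.0, 0.0],
    "sat": [0.0, 1.0, 0.0],
}

def sentence_vector(tokens):
    """Average the word vectors of the tokens, component by component."""
    vectors = [word_vectors[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

print(sentence_vector(["the", "cat"]))  # [0.5, 0.0, 0.5]
```

Averaging is simple, but it ignores word order: ["cat", "the"] yields the same sentence vector as ["the", "cat"].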

Similarly, we can create vectors that represent the whole document. The easiest way of generating a document vector is by averaging the individual sentence vectors of sentences present in the document. Document vectors can help capture the average meaning of a document, and vector metrics like distance/similarity can be used on these vectors to get the 'semantic distance' between two documents.
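As a rough illustration, the sketch below averages sentence vectors into document vectors and compares documents with cosine distance, which is one of several possible metrics. The sentence vectors are hypothetical two-dimensional values chosen by hand, and the document names are labels made up for the example.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, larger means further apart."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sentence vectors for three short documents.
doc_pets = mean_vector([[0.9, 0.1], [0.7, 0.3]])
doc_animals = mean_vector([[0.8, 0.2], [0.6, 0.2]])
doc_finance = mean_vector([[0.1, 0.9], [0.2, 0.8]])

closer = cosine_distance(doc_pets, doc_animals) < cosine_distance(doc_pets, doc_finance)
print(closer)  # True
```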


The vector representations of words, sentences, and documents form the basis of many current applications of NLU, but they do have some shortcomings. Knowledge graphs can help address some of these shortcomings. In upcoming blog posts, we will explore knowledge graphs and how they are used in NLU, and look at applications and downstream tasks that use word representations such as word vectors or context-aware word vectors as their building blocks.

Visit the Cisco Research site to learn about other initiatives.