From Words to Vectors: Understanding OHE, BoW, and TF-IDF

Introduction

Imagine trying to teach a computer to understand a book, one page at a time. Sounds challenging, right? That's because machines excel at crunching numbers, not at interpreting complex human language. So how do we bridge this gap? The answer lies in transforming text into a language that computers understand: the numerical language of vectors.

In this blog, we'll explore the science of text representation in machine learning. Just as vectors in high school math represented magnitudes and directions, here, they'll represent the essence of language. From the simplicity of One-Hot Encoding (OHE) to the nuanced insights of Term Frequency-Inverse Document Frequency (TF-IDF), we'll unpack these concepts and showcase how they turn sentences into numerical features that machines can work with.

One-Hot Encoding (OHE)

One-Hot Encoding (OHE) is a fundamental technique used to convert categorical data into a numerical form that machine learning models can digest. At its core, OHE creates a 'presence' vector for each unique word: a '1' marks that word's position in the vocabulary, and '0's fill every other position.

Consider the sentence: “Karthik is a very nice guy”. The OHE process starts by creating a dictionary of all unique words in our corpus (the collection of all sentences we want to analyze). Let's say our dictionary for a simple corpus that includes this sentence looks like this: ["Karthik", "is", "a", "very", "nice", "guy"].

For the word "Karthik", which is the first in our dictionary, the one-hot encoded vector would be:

**Karthik: [1, 0, 0, 0, 0, 0]**

This vector represents the word "Karthik" in the context of our dictionary: a '1' at its position and '0's for every other word. The full sentence is then represented by one such vector per word. Despite its simplicity, OHE comes with several notable limitations:

  • High Dimensionality: Leads to very large vectors if the vocabulary is big, which can be computationally inefficient.
  • Sparsity: The vectors are mostly made up of 0s, which is not data-efficient and increases memory usage.
  • Inflexibility with New Words: Cannot handle words that are not included in the initial dictionary (known as out-of-vocabulary words).
  • Lack of Semantic Understanding: OHE does not capture the semantic relationships between words, failing to convey the true meanings and nuances of sentences.
  • Computational Expense: Due to sparsity and high dimensionality, OHE can be computationally expensive for models to process.
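To make this concrete, here is a minimal sketch in plain Python (the variable names are just for illustration) that builds the dictionary from the example sentence and produces one one-hot vector per word:

```python
# Minimal one-hot encoding sketch for the example sentence.
sentence = "Karthik is a very nice guy"

# Build the dictionary (vocabulary) of unique words, preserving first-appearance order.
vocabulary = list(dict.fromkeys(sentence.split()))

# Map each word to a vector with a single 1 at its index in the vocabulary.
one_hot = {
    word: [1 if i == index else 0 for i in range(len(vocabulary))]
    for index, word in enumerate(vocabulary)
}

print(vocabulary)          # ['Karthik', 'is', 'a', 'very', 'nice', 'guy']
print(one_hot["Karthik"])  # [1, 0, 0, 0, 0, 0]
print(one_hot["nice"])     # [0, 0, 0, 0, 1, 0]
```

Notice that each vector's length equals the vocabulary size, which is exactly why high dimensionality and sparsity become problems as the vocabulary grows.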

Bag of Words (BoW)

Concept Explanation: The Bag of Words (BoW) model is a text representation technique used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

Example: Consider two sentences:

  1. "The cat sat on the mat."
  2. "The dog sat on the log."

The BoW representation of these sentences, after removing common stop words like 'the' and 'on', would be:

Vocabulary: {cat, sat, mat, dog, log}

Sentence 1 Vector: [1, 1, 1, 0, 0]
Sentence 2 Vector: [0, 1, 0, 1, 1]

Each vector records how often every vocabulary word occurs in the sentence. If the word "cat" appeared twice, its entry in the vector would be 2 instead of 1.
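Here is a quick sketch of how this might look with scikit-learn's CountVectorizer, passing an explicit stop-word list to mirror the example (the column order differs from the hand-built vocabulary because scikit-learn sorts terms alphabetically):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

# Build the vocabulary and count occurrences, dropping 'the' and 'on'
# to match the stop-word removal in the example above.
vectorizer = CountVectorizer(stop_words=["the", "on"])
bow_matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'log' 'mat' 'sat']
print(bow_matrix.toarray())
# [[1 0 0 1 1]
#  [0 1 1 0 1]]
```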

Pros and Cons:

Pros:

  • Simplicity: The BoW model is simple to understand and implement.
  • Efficiency: It allows for efficient comparison of documents by converting text data into numerical features.

Cons:

  • Loss of Context: BoW does not account for the context or order of words, which can alter the meaning.
  • Sparsity: Similar to OHE, BoW can also lead to sparse matrices with large but mostly empty vectors.
  • Frequency Bias: Common words can dominate in the vector representation, overshadowing the rarer but potentially more informative words.

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a corpus. The importance increases proportionally with the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

The score is computed as TF-IDF(t, d) = TF(t, d) × IDF(t), where TF(t, d) is the number of times term t appears in document d divided by the total number of words in d, and IDF(t) = log(N / n_t), with N the total number of documents in the corpus and n_t the number of documents containing t. Let's work through an example.

Corpus:

  1. "The quick brown fox jumps over the lazy dog."
  2. "The quick brown dog."
  3. "The lazy dog sleeps."

We want to calculate the TF-IDF score for the word "lazy" across the documents.

Term Frequency (TF) Calculation:

  • Document 1: "lazy" appears once in a document of nine words: TF=1/9
  • Document 2: "lazy" does not appear: TF=0
  • Document 3: "lazy" appears once in a document of four words: TF=1/4

Inverse Document Frequency (IDF) Calculation:

  • "lazy" appears in 2 out of the 3 documents, so: IDF=log(3/2)

TF-IDF Score Calculation:

  • For Document 1: TF(lazy) × IDF(lazy) = 1/9 × log(3/2)
  • For Document 2: TF(lazy) × IDF(lazy) = 0 × log(3/2) = 0
  • For Document 3: TF(lazy) × IDF(lazy) = 1/4 × log(3/2)

Assuming the base of the logarithm is 10, the calculations would be:

  • IDF(lazy) = log10(3/2) ≈ 0.176

Thus, the TF-IDF scores are:

  • For Document 1: 1/9 × 0.176 ≈ 0.0196
  • For Document 2: 0 (since the term "lazy" does not appear in this document)
  • For Document 3: 1/4 × 0.176 ≈ 0.044

These scores indicate that "lazy" is relatively more important in Document 3 than in Document 1, and not at all in Document 2, which is what TF-IDF is intended to reflect.
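Here is a small, self-contained Python sketch that reproduces these numbers using the same log base 10 formula (the simple lowercase-and-strip-punctuation tokenizer is an assumption made just for this illustration):

```python
import math
import string

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown dog.",
    "The lazy dog sleeps.",
]

def tokenize(text):
    # Lowercase and strip punctuation before splitting into words.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

documents = [tokenize(doc) for doc in corpus]
term = "lazy"

# Term frequency: occurrences of the term divided by the document length.
tf = [doc.count(term) / len(doc) for doc in documents]

# Inverse document frequency: log10(total documents / documents containing the term).
doc_freq = sum(1 for doc in documents if term in doc)
idf = math.log10(len(corpus) / doc_freq)

tfidf_scores = [round(t * idf, 4) for t in tf]
print(round(idf, 3))   # 0.176
print(tfidf_scores)    # [0.0196, 0.0, 0.044]
```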

Here's a quick summary of when to use, and when to avoid, each technique:

| Technique | When to Use | When to Avoid |
| --- | --- | --- |
| One-Hot Encoding (OHE) | For nominal categorical data with no order; when the dataset size is small; when the model requires binary features | With datasets that have many categories (high cardinality); when semantic meaning is important |
| Bag of Words (BoW) | For quick and simple text-to-numeric conversion; when word order is not important; for small to medium-sized corpora | With large text data (leads to high-dimensional vectors); when context and semantics are important; if computational resources are limited |
| TF-IDF | To emphasize the importance of words within documents in a corpus; for tasks like search engine ranking and document classification | When preserving word order and context is necessary |
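In practice, you would rarely compute TF-IDF by hand. scikit-learn's TfidfVectorizer does the heavy lifting; just note that its default settings use a smoothed, natural-log IDF and L2-normalize each row, so the exact numbers will differ from the hand-worked log10 values above. A minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown dog.",
    "The lazy dog sleeps.",
]

# Fit on the corpus and transform it into a document-term TF-IDF matrix
# (rows are documents, columns are vocabulary terms).
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```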