TF-IDF Vectorizer

TF-IDF (Term Frequency-Inverse Document Frequency) is a common vectorization technique that transforms text into numerical vectors by weighting each term according to how often it occurs in a document (TF) and how rare it is across the corpus (IDF).

TF-IDF calculation:

  • TF: The frequency of a term in a document.
  • IDF: The logarithm of the inverse of the proportion of documents containing the term, i.e. log(N / df), where N is the number of documents and df is the number of documents containing the term.
  • TF-IDF: The product of TF and IDF, which gives higher weights to terms that are frequent in a document but rare across the corpus.
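
The three definitions above can be sketched by hand in plain Python. This is a minimal illustration, not scikit-learn's implementation: the toy corpus and the function names tf, idf and tf_idf are assumptions made for this example, TF is taken as a raw count, and the natural logarithm is used. By default, scikit-learn's TfidfVectorizer uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector, so its numbers will differ from the ones here.

```python
import math

# Hypothetical toy corpus, already tokenized (an assumption for illustration)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf(term, doc):
    # Raw count of the term in a single document
    return doc.count(term)

def idf(term, docs):
    # log(N / df): N documents in total, df of them contain the term.
    # Assumes the term occurs in at least one document (df > 0).
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # Product of the two factors
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))  # 0.0, since "the" occurs in every document
print(tf_idf("cat", docs[0], docs))  # log(3/2), roughly 0.405
print(tf_idf("sat", docs[0], docs))  # log(3/1), roughly 1.099
```

A word that appears in every document gets a weight of zero under this textbook formula, while the word unique to one document gets the highest weight.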

Example:

TF-IDF Vectorization Example
from sklearn.feature_extraction.text import TfidfVectorizer
# A list of quotes
quotes = ["May the Force be with you.", "There's no place like home.",
          "I'm the king of the world!", "Carpe diem.",
          "Elementary, my dear Watson.", "It's alive!",
          "My mama always said life was like a box of chocolates.",
          "I'll be back."]
# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2, max_df=0.8, sublinear_tf=True, use_idf=True)

# Fit the vectorizer to the documents and transform them
tfidf_matrix = vectorizer.fit_transform(quotes)

# Print the resulting TF-IDF matrix
print(tfidf_matrix)

Output:

  (0, 0)	0.789331229788491
  (0, 5)	0.6139675966047301
  (1, 1)	0.7071067811865476
  (1, 3)	0.7071067811865476
  (2, 4)	0.5533384527040075
  (2, 5)	0.8329565155271521
  (4, 2)	1.0
  (6, 2)	0.6013393030827452
  (6, 4)	0.5261008317194686
  (6, 1)	0.6013393030827452
  (7, 0)	1.0
  (8, 5)	1.0
  (9, 4)	0.759964339252989
  (9, 3)	0.5130374917227942
  (9, 5)	0.39905730810317397

The TfidfVectorizer class has several parameters (refer to the documentation) that determine its functionality. In the above example, the code:
vectorizer = TfidfVectorizer(min_df=2, max_df=0.8, sublinear_tf=True, use_idf=True)
sets the following parameters:

  • min_df=2: Ignores words that appear in fewer than 2 documents.
  • max_df=0.8: Ignores words that appear in more than 80% of the documents, as they are likely too common and not informative for classification.
  • sublinear_tf=True: Applies sublinear term frequency scaling, replacing raw term frequency with 1 + log(tf).
  • use_idf=True: Utilizes inverse document frequency (IDF) to reduce the impact of frequently occurring words in the corpus.

tfidf_matrix = vectorizer.fit_transform(quotes): This command fits the vectorizer to the data (quotes) and transforms it into a TF-IDF matrix. The output is a sparse matrix in which each row represents a document (a quote) and each column represents a feature (a word). Only non-zero entries are stored, and each printed entry has the form (row, column) weight, where the weight is the TF-IDF value of that word in that document. Because TfidfVectorizer L2-normalizes each row by default, a document with a single retained word gets a weight of 1.0.

