Effective Spam Email Detection Using TF-IDF and Logistic Regression in Python
In today’s digital communication landscape, email remains a vital tool. However, alongside essential messages, inboxes are often flooded with unsolicited spam emails. Filtering wanted emails (“ham”) from unwanted ones (“spam”) is a significant challenge. Fortunately, machine learning techniques like Term Frequency-Inverse Document Frequency (TF-IDF) combined with Logistic Regression offer a powerful solution for automating this classification task.
This guide explores how to build a spam detection system using TF-IDF and Logistic Regression in Python.
Understanding the Core Concepts
Before diving into the implementation, let’s briefly understand the key techniques involved:
TF-IDF (Term Frequency – Inverse Document Frequency)
TF-IDF is a numerical statistic used in natural language processing and information retrieval to reflect how important a word is to a document within a collection or corpus. It assigns a weight to each word based on two factors:
- Term Frequency (TF): How often a specific word appears within a single document. Calculated as: (Number of times term appears in a document) / (Total number of terms in the document).
- Inverse Document Frequency (IDF): How common or rare a word is across all documents in the corpus. It diminishes the weight of terms that occur very frequently across documents (like “the”, “is”, “a”) and increases the weight of terms that are rarer. Calculated as: log(Total number of documents / Number of documents containing the term).
The TF-IDF score is the product of these two values: TF-IDF = TF × IDF. A high TF-IDF score indicates that a word appears frequently in a specific document but is relatively rare across the entire set of documents, suggesting it’s a significant term for that particular document.
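To make the formulas concrete, the following sketch computes TF, IDF, and their product by hand for a small, made-up corpus. The messages and the helper names (tf, idf, tf_idf) are illustrative assumptions, not part of any library. Note that scikit-learn's TfidfVectorizer, used later in this guide, applies a smoothed IDF and L2 normalization by default, so its numbers will differ from this textbook version.

```python
import math

# Toy corpus (hypothetical messages, tokenized by whitespace)
docs = [
    "win a free prize now".split(),
    "meeting at noon today".split(),
    "free lunch at the meeting".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing the term).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    """TF-IDF score of a term for one document in the corpus."""
    return tf(term, doc) * idf(term, corpus)

# "prize" occurs in 1 of 3 documents, "free" in 2 of 3
prize_score = tf_idf("prize", docs[0], docs)
free_score = tf_idf("free", docs[0], docs)
print(f"prize: {prize_score:.4f}, free: {free_score:.4f}")
```

Both words appear once in the first message, so their TF is identical; the rarer word "prize" ends up with the higher score purely because of its larger IDF.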
Logistic Regression
Logistic Regression is a widely used supervised learning algorithm for binary classification: tasks where the outcome falls into one of two categories (e.g., Spam vs. Ham, Yes vs. No). Despite its name, it’s a classification algorithm, not a regression one. It estimates the probability that a given input belongs to the positive class, and a threshold (commonly 0.5) turns that probability into a final label. Which category counts as positive depends on how the labels are encoded.
- If the predicted probability is above the threshold, the input is assigned to the positive class (e.g., Spam).
- Otherwise, it is assigned to the other class (e.g., Ham).
It’s a robust and interpretable algorithm often effective for text classification tasks when combined with feature extraction methods like TF-IDF.
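The probability-and-threshold idea can be sketched in a few lines of plain Python. The weights, bias, and feature values below are hypothetical numbers chosen for illustration; a real model learns them from training data, and how an exact tie at the threshold is resolved is a convention that varies between implementations.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, weights, bias, threshold=0.5):
    """Return (probability, label); label is 1 when probability >= threshold."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical learned parameters for two features
weights = [2.0, -1.5]
bias = -0.5

p_a, label_a = classify([1.0, 0.0], weights, bias)  # linear score z = 1.5
p_b, label_b = classify([0.0, 1.0], weights, bias)  # linear score z = -2.0
print(f"{p_a:.3f} -> {label_a}, {p_b:.3f} -> {label_b}")
```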
Building the Spam Detector: Step-by-Step Guide
Let’s walk through the process of creating a spam filter using Python and the scikit-learn library. We’ll assume a dataset (mail_data.csv) exists containing two columns: ‘Category’ (labeled ‘spam’ or ‘ham’) and ‘Message’ (the email text).
Step 1: Import Necessary Libraries
First, import the required Python libraries:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Step 2: Load and Prepare the Data
Load the dataset into a pandas DataFrame and perform initial preprocessing.
# Load the dataset
df = pd.read_csv("mail_data.csv")
# Handle potential missing values by replacing them with empty strings
data = df.where((pd.notnull(df)),'')
# Convert categorical labels to numbers: 1 for 'ham', 0 for anything else ('spam')
# The prediction step later relies on this same encoding
data['category'] = data['Category'].apply(lambda x: 1 if x == 'ham' else 0)
# Define features (X) and target (y)
X = data['Message']
y = data['category']
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
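When one class is much rarer than the other, a purely random split can leave the test set with very few examples of it. One option, shown here as a sketch on hypothetical toy data rather than the actual mail_data.csv, is the stratify argument of train_test_split, which keeps the ham/spam ratio the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same column names as the guide's dataset
toy = pd.DataFrame({
    "Category": ["ham"] * 8 + ["spam"] * 2,
    "Message": [f"message {i}" for i in range(10)],
})
toy_labels = (toy["Category"] == "ham").astype(int)  # 1 = ham, 0 = spam

# stratify=toy_labels preserves the 80/20 ham/spam ratio in each split
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Message"], toy_labels, test_size=0.5, random_state=42, stratify=toy_labels
)
print("train:", y_tr.value_counts().to_dict())
print("test: ", y_te.value_counts().to_dict())
```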
Step 3: Feature Extraction with TF-IDF
Convert the raw text messages into numerical feature vectors using TF-IDF.
# Initialize the TfidfVectorizer
# - min_df=1: Include words that appear in at least one document
# - stop_words='english': Remove common English stop words (like 'the', 'is', 'in')
# - lowercase=True: Convert all text to lowercase
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
# Fit the vectorizer to the training data and transform it into TF-IDF features
X_train_feature = feature_extraction.fit_transform(X_train)
# Transform the test data using the *same* fitted vectorizer
X_test_feature = feature_extraction.transform(X_test)
# Ensure the target labels are integers
y_train = y_train.astype('int')
y_test = y_test.astype('int')
X_train_feature and X_test_feature now contain numerical representations of the email messages, ready for the machine learning model.
Step 4: Train the Logistic Regression Model
Create an instance of the Logistic Regression model and train it using the TF-IDF features and corresponding labels from the training set.
# Initialize the Logistic Regression model
model = LogisticRegression()
# Train the model
model.fit(X_train_feature, y_train)
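As an alternative way to organize Steps 3 and 4, the vectorizer and the classifier can be chained with scikit-learn's make_pipeline, so that a single object handles both fitting and transforming. This is a sketch on hypothetical toy data, not a replacement for the code above; the variable name spam_clf is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = ham, 0 = spam (same encoding as the guide)
messages = [
    "win free cash prize",
    "free prize claim cash",
    "cash prize win free",
    "project meeting tomorrow morning",
    "lunch meeting project team",
    "team project report tomorrow",
]
labels = [0, 0, 0, 1, 1, 1]

# fit() runs fit_transform on the vectorizer, then fits the classifier
spam_clf = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_clf.fit(messages, labels)

# Raw strings go in; the fitted vectorizer is applied automatically
predictions = spam_clf.predict(["free cash prize", "project team meeting"])
print(predictions)
```

Bundling the two steps reduces the risk of accidentally refitting the vectorizer on test or production data.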
Step 5: Evaluate Model Performance
Assess the model’s accuracy on both the training and testing data. Evaluating on the test set provides a measure of how well the model generalizes to unseen data.
# Predict on the training data
pred_on_training_data = model.predict(X_train_feature)
acc_on_training_data = accuracy_score(y_train, pred_on_training_data)
print(f"Accuracy on training data: {acc_on_training_data}")
# Predict on the test data
pred_on_test_data = model.predict(X_test_feature)
acc_on_test_data = accuracy_score(y_test, pred_on_test_data)
print(f"Accuracy on test data: {acc_on_test_data}")
High accuracy on both sets, and especially on the test set, indicates a well-performing model; a training accuracy far above the test accuracy is a sign of overfitting. Accuracy above 95% is commonly reported for this task when enough labeled data is available.
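Accuracy on its own can hide poor spam detection when ham heavily outnumbers spam. The sketch below uses hand-written, hypothetical labels (no trained model involved) to show how a confusion matrix and per-class recall expose a classifier that never flags spam yet still scores well on accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 1 = ham, 0 = spam (same encoding as the guide)
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10  # a useless model that labels every message as ham

acc = accuracy_score(y_true, y_pred)
# Recall for the spam class: share of actual spam that was caught
spam_recall = recall_score(y_true, y_pred, pos_label=0)
# Rows are true classes, columns are predicted classes, ordered [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(f"accuracy: {acc:.2f}, spam recall: {spam_recall:.2f}")
print(cm)
```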
Step 6: Make Predictions on New Emails
Use the trained model and the TF-IDF vectorizer to classify new, unseen email messages.
# Example new email message
input_your_mail = ["Congratulations! You've won a free cruise. Click here to claim your prize now!"]
# input_your_mail = ["Hi team, let's schedule a meeting for next week to discuss project updates."]
# Transform the new email using the *same* fitted vectorizer
input_data_features = feature_extraction.transform(input_your_mail)
# Make a prediction
prediction = model.predict(input_data_features)
# Interpret the prediction (assuming 1 = Ham, 0 = Spam)
if prediction[0] == 1:
    print("Prediction: Ham mail (Not Spam)")
else:
    print("Prediction: Spam mail")
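When a hard label is not enough, predict_proba exposes the estimated class probabilities, which makes it possible to choose a threshold other than 0.5. The sketch below trains on hypothetical toy data; the helper names spam_probability and is_spam are assumptions introduced for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 1 = ham, 0 = spam (same encoding as the guide)
messages = [
    "win free cash prize",
    "free prize claim cash",
    "cash prize win free",
    "project meeting tomorrow morning",
    "lunch meeting project team",
    "team project report tomorrow",
]
labels = [0, 0, 0, 1, 1, 1]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
model = LogisticRegression().fit(vectorizer.fit_transform(messages), labels)

def spam_probability(text):
    """Return the model's estimated probability that text is spam (class 0)."""
    features = vectorizer.transform([text])
    spam_column = list(model.classes_).index(0)  # columns follow model.classes_
    return float(model.predict_proba(features)[0, spam_column])

def is_spam(text, threshold=0.5):
    """Flag a message as spam when its spam probability reaches the threshold."""
    return spam_probability(text) >= threshold

p = spam_probability("claim your free cash prize")
print(f"Spam probability: {p:.2f}")
```

Raising the threshold makes the filter more conservative: fewer legitimate emails are flagged, at the cost of letting more spam through.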
Adapting for Other Languages
The core methodology of using TF-IDF and Logistic Regression can be applied to spam detection in languages other than English, such as Thai. The process remains largely the same:
- Data Collection: Gather a dataset of emails labeled as spam or ham in the target language.
- Preprocessing: Load the data, handle encoding issues (e.g., using UTF-8), clean the text, and convert labels to numerical format. Language-specific text cleaning might be needed.
- TF-IDF Vectorization: Use TfidfVectorizer. While English stop words won’t apply, TF-IDF itself is language-agnostic regarding its core calculation. Consider if language-specific stop words are necessary or if min_df and max_df parameters are sufficient. Word tokenization might require language-specific libraries if simple whitespace splitting is inadequate.
- Model Training & Evaluation: Train the Logistic Regression model and evaluate its performance using accuracy or other relevant metrics.
- Prediction: Use the trained model and vectorizer to classify new emails in the target language.
Experiments with Thai language datasets have shown that this approach can yield high accuracy, successfully identifying spam messages written in Thai.
Conclusion
Combining TF-IDF for feature extraction and Logistic Regression for classification provides an effective and relatively straightforward approach to building a spam email detection system. This method successfully quantifies word importance and uses it to train a model capable of distinguishing between legitimate emails and unwanted spam, significantly helping to manage inbox clutter. The adaptability of this technique to various languages further highlights its versatility.
How Innovative Software Technology Can Help
At Innovative Software Technology, we harness the power of sophisticated machine learning algorithms and natural language processing techniques, including TF-IDF and classification models like Logistic Regression, to deliver intelligent custom software solutions. Our expertise enables businesses to automate complex text analysis tasks, develop robust spam filtering systems tailored to specific needs, and unlock valuable data-driven insights from unstructured text sources. By partnering with Innovative Software Technology, you gain access to cutting-edge machine learning development, enhancing operational efficiency, improving data accuracy, and driving informed decision-making through powerful, bespoke software.