Machine Learning: Solving Problems Through Data Analysis
In modern computing, problems can often be solved using either traditional functional programming or machine learning (ML). While functional programming follows explicit rules and logic, ML takes a data-driven approach, learning patterns from existing data rather than relying on manually written rules.
In this article, we will explore these two approaches with a simple example: validating whether an email address is correctly formatted.
1. Functional Approach: Using Explicit Rules
A functional approach involves defining a set of rules to check if an email is valid. The basic structure of an email consists of:
A local part (before @)
A domain part (after @)
A top-level domain (e.g., .com, .org)
A simple function-based validation in Python might look like this:
```python
import re

def is_valid_email(email):
    # Basic structural check: a local part, '@', a domain, and a dot-separated TLD
    return re.match(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$", email) is not None

# Testing the function
emails = ["user@example.com", "invalid-email", "name@domain", "hello@.com"]
for email in emails:
    print(f"{email}: {'Valid' if is_valid_email(email) else 'Invalid'}")
```
✔️ Deterministic – Always produces the same result for the same input.
✔️ Efficient – Fast execution since no learning is required.
✔️ Interpretable – Easy to understand and debug.
❌ Not Adaptive – New email formats may require modifying the rules.
❌ Hard to Scale – Complex validation rules (e.g., checking domain validity) need additional logic.
2. Machine Learning Approach: Learning from Data
Instead of defining rules, ML models learn from data. We can frame this as a binary classification problem:
Input (X): A set of email addresses
Output (Y): A label (1 for valid, 0 for invalid)
Collect Data: Gather examples of valid and invalid email addresses.
Extract Features: Convert email strings into numerical representations.
Train a Model: Use a classification algorithm like Logistic Regression or Random Forest.
Evaluate & Test: Check accuracy on unseen email data.
A real dataset should contain thousands of email samples, covering different formats, domains, and errors like the ones below (a small labeled sample is sketched after this list):
Typos (user@exmaple.com)
Missing '@' (userexample.com)
Missing domain (user@.com)
Uncommon but valid formats (first.last+alias@gmail.com)
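Framed as data, such a collection is simply a set of email/label pairs. A minimal sketch of what that labeled sample might look like, using a few of the examples above (a real dataset would hold thousands of such pairs):

```python
# A tiny labeled sample (1 = valid, 0 = invalid)
dataset = [
    ("user@example.com", 1),            # well-formed address
    ("first.last+alias@gmail.com", 1),  # uncommon but valid format
    ("userexample.com", 0),             # missing '@'
    ("user@.com", 0),                   # missing domain
]
X = [email for email, _ in dataset]     # Input (X): the raw address strings
y = [label for _, label in dataset]     # Output (Y): the validity labels
```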
Machine learning learns patterns from labeled data rather than relying on fixed rules. Here’s how the ML pipeline works:
Gather a diverse dataset of emails, ensuring both valid and invalid examples are included.
Emails are text-based, so we convert them into numerical features. Common feature extraction techniques include (a short sketch follows this list):
Character-level n-grams (e.g., frequency of '@', '.', domain length)
Bag-of-Words or TF-IDF (counts of character sequences)
Regex-based Features (e.g., whether an email contains '@' and a valid TLD)
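As a minimal sketch of two of these techniques, character-level n-grams can be produced with scikit-learn's CountVectorizer, and a regex-based feature can be computed directly (the n-gram range and the TLD pattern below are illustrative choices, not fixed recommendations):

```python
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

emails = ["user@example.com", "invalid-email", "name@domain", "hello@.com"]

# Character-level n-grams: counts of 2- and 3-character sequences per email
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
ngram_features = vectorizer.fit_transform(emails)
print(ngram_features.shape)  # (4, number of distinct character n-grams seen)

# Regex-based feature: does the address end with a common TLD?
has_common_tld = np.array([1 if re.search(r"\.(com|org|net)$", e) else 0 for e in emails])
print(has_common_tld)        # [1 0 0 1]
```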
Since this is a binary classification problem, common models include:
Logistic Regression (good for simple pattern learning)
Random Forest (handles a mix of simple and complex cases)
Deep Learning with LSTMs (useful for more flexible email structures)
The dataset is split into training (80%) and testing (20%).
The model learns from the training data by adjusting weights based on patterns.
The model is tested on unseen emails.
Performance is measured using accuracy, precision, recall, and F1-score.
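Putting the model choice, split, and evaluation steps together, a minimal training sketch with scikit-learn might look like this (the feature matrix here is random placeholder data standing in for real extracted email features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Placeholder data: X would normally be the numerical features extracted from emails,
# and y the labels (1 = valid, 0 = invalid).
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = rng.integers(0, 2, 200)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The model adjusts its weights to fit patterns in the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the unseen test split
y_pred = model.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall   :", recall_score(y_test, y_pred, zero_division=0))
print("F1-score :", f1_score(y_test, y_pred, zero_division=0))
```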
Factor | Functional Rules | Machine Learning |
---|---|---|
Scalability | Hard to maintain for evolving formats | Learns new formats from data |
Adaptability | Needs manual updates | Improves as more data is added |
Edge Cases | Struggles with uncommon emails | Learns from similar patterns |
Performance | Fast but rigid | Slightly slower but more flexible |
Big email services like Gmail, Outlook, and Yahoo use ML-powered spam detection to filter invalid or fraudulent emails. Instead of using just fixed rules, ML models adapt to new tricks used by spammers.
Here’s an enhanced dataset with additional features extracted from each email to help a machine learning model differentiate between valid and invalid emails.
Email Address | Valid (Label) | Contains '@' | Contains '.' | Domain Length | Local Part Length | Special Characters Count | Ends with Common TLD (.com, .org, .net, etc.) |
---|---|---|---|---|---|---|---|
user@example.com | Yes | 1 | 1 | 11 | 4 | 0 | 1 |
test.email@domain.com | Yes | 1 | 1 | 10 | 10 | 1 | 1 |
hello.world@company.org | Yes | 1 | 1 | 11 | 11 | 1 | 1 |
invalid-email | No | 0 | 0 | 0 | 13 | 1 | 0 |
justwords | No | 0 | 0 | 0 | 9 | 0 | 0 |
missing@domain | No | 1 | 0 | 6 | 7 | 0 | 0 |
user@sub.example.com | Yes | 1 | 1 | 15 | 4 | 0 | 1 |
user@.com | No | 1 | 1 | 4 | 4 | 0 | 1 |
user@domain | No | 1 | 0 | 6 | 4 | 0 | 0 |
username@valid.co.uk | Yes | 1 | 1 | 11 | 8 | 0 | 1 |
Feature Name | Description |
---|---|
Contains '@' | Whether the email contains @ (1 = Yes, 0 = No). Essential for valid emails. |
Contains '.' | Checks if the email has at least one . (dot), usually needed for domains. |
Domain Length | The length of the domain part (e.g., example.com has 11 characters). |
Local Part Length | Length of the part before @ (e.g., user@example.com → user has 4 chars). |
Special Characters Count | Number of special characters (., +, _, -) in the local part. |
Ends with Common TLD | Whether the domain ends in .com , .org , .net , .co.uk , etc. |
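A minimal sketch of how these features could be computed for a single address (the extract_features helper and the list of "common" TLDs are illustrative assumptions):

```python
COMMON_TLDS = (".com", ".org", ".net", ".co.uk")

def extract_features(email):
    # Everything before the first '@' is the local part, everything after is the domain
    local, _, domain = email.partition("@")
    return {
        "contains_at": 1 if "@" in email else 0,
        "contains_dot": 1 if "." in email else 0,
        "domain_length": len(domain),
        "local_part_length": len(local),
        "special_char_count": sum(local.count(c) for c in ".+_-"),
        "ends_with_common_tld": 1 if email.endswith(COMMON_TLDS) else 0,
    }

print(extract_features("user@example.com"))
# {'contains_at': 1, 'contains_dot': 1, 'domain_length': 11,
#  'local_part_length': 4, 'special_char_count': 0, 'ends_with_common_tld': 1}
```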
Convert the email text into numerical features using the table above.
Each email is now represented as a structured row in a dataset.
Use a classification algorithm like Logistic Regression, Random Forest, or Neural Networks.
Train the model on 80% of the dataset and test on the remaining 20%.
Given a new email, the model will analyze its features and assign a probability of being valid or invalid.
Example:
newuser@website.com → Model predicts: Valid
invalid-email → Model predicts: Invalid
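A compact end-to-end sketch of that prediction step, reusing a feature helper like the one above and the small feature table as training data (this dataset is far too small for a reliable model and is only meant to show the mechanics):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def to_features(email):
    # Fixed-order feature vector mirroring the feature table above
    local, _, domain = email.partition("@")
    return [
        1 if "@" in email else 0,
        1 if "." in email else 0,
        len(domain),
        len(local),
        sum(local.count(c) for c in ".+_-"),
        1 if email.endswith((".com", ".org", ".net", ".co.uk")) else 0,
    ]

train_emails = [
    "user@example.com", "test.email@domain.com", "hello.world@company.org",
    "user@sub.example.com", "username@valid.co.uk",                               # valid
    "invalid-email", "justwords", "missing@domain", "user@.com", "user@domain",   # invalid
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X = np.array([to_features(e) for e in train_emails])
model = LogisticRegression(max_iter=1000).fit(X, labels)

for email in ["newuser@website.com", "invalid-email"]:
    p_valid = model.predict_proba([to_features(email)])[0][1]
    print(f"{email}: {'Valid' if p_valid >= 0.5 else 'Invalid'} (p={p_valid:.2f})")
```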
Factor | Rules-Based Approach | Machine Learning Approach |
---|---|---|
Hardcoded Rules | Manually define email structure rules | Learns from data and adapts |
New Patterns | Needs updates for new email formats | Automatically recognizes new valid email structures |
Scalability | Hard to manage for large datasets | Easily scales with more data |
Accuracy | Can fail on complex edge cases | Improves with more training data |
A combination of rule-based checks + ML can work best (a short sketch follows these steps):
First, apply basic regex rules to filter obviously invalid emails.
Then, use ML for edge cases and advanced structures (e.g., unusual but valid emails).
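A minimal sketch of that hybrid flow, with a cheap regex pre-filter in front of a model call (the ml_predict function below is a stand-in for whatever trained classifier you use):

```python
import re

def passes_basic_rules(email):
    # Cheap pre-filter: one '@', no whitespace, and a dot somewhere in the domain part
    return re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email) is not None

def ml_predict(email):
    # Stand-in for a trained classifier (e.g., the logistic regression model above);
    # a real implementation would featurize the email and return the predicted label.
    return 1

def validate(email):
    if not passes_basic_rules(email):
        return "Invalid"  # rejected cheaply, no model call needed
    return "Valid" if ml_predict(email) == 1 else "Invalid"

for email in ["user@example.com", "userexample.com", "first.last+alias@gmail.com"]:
    print(email, "->", validate(email))
```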
Machine learning makes email validation more flexible and intelligent by learning from real-world data. It reduces manual rule updates and helps in scenarios where traditional regex-based validation fails.
✔️ Adaptive & Scalable – The model improves with more data.
✔️ Pattern Learning – Learns patterns that may be difficult to define explicitly.
✔️ Handles Variability – Works with diverse email formats and potential typos.
❌ Requires Training Data – Needs a dataset of labeled emails.
❌ Computational Overhead – Requires training and inference time.
❌ Lower Explainability – Predictions might not be as transparent as rule-based checks.
Factor | Functional Approach | Machine Learning Approach |
---|---|---|
Accuracy | High if rules are well-defined | Improves with more data |
Adaptability | Requires manual updates | Learns automatically from data |
Scalability | Limited | Can handle large datasets |
Interpretability | Easy to understand | Harder to explain predictions |
Computation Time | Fast | Slower due to training |
Decision Making:
Use Functional Programming when rules are simple and deterministic.
Use Machine Learning when email validation requires handling complex, evolving patterns.
Hybrid Approach: Combine both! Use rules for quick checks and ML for edge cases.
Machine learning offers a powerful alternative to rule-based programming by learning from data rather than relying on manually defined rules. However, its success depends on the availability of quality training data and the problem complexity.
For simple cases like email validation, a functional approach may suffice, but for adaptive validation (e.g., detecting spam emails, phishing attempts), ML is the better choice.