Data Driven - ML Approach

Machine Learning: Solving Problems Through Data Analysis

Many computing problems can be solved either with traditional rule-based code, called the functional approach in this article (not to be confused with the functional programming paradigm), or with machine learning (ML). The functional approach follows explicit rules and logic, whereas ML takes a data-driven approach, learning patterns from existing data rather than relying on manually written rules.

In this article, we will explore these two approaches with a simple example: validating whether an email address is correctly formatted.

1. Functional Approach: Using Explicit Rules

A functional approach involves defining a set of rules to check if an email is valid. The basic structure of an email consists of:

  • A local part (before @)

  • A domain part (after @)

  • A top-level domain (e.g., .com, .org)

A simple function-based validation in Python might look like this:

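import re

def is_valid_email(email):
    # Local part, '@', one domain label, a dot, then the rest of the domain (TLD included)
    pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
    # fullmatch requires the whole string to match, so a trailing newline is rejected
    return bool(re.fullmatch(pattern, email))

# Testing the function
emails = ["user@example.com", "invalid-email", "name@domain", "hello@.com"]
for email in emails:
    print(f"{email}: {'Valid' if is_valid_email(email) else 'Invalid'}")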

Pros of Functional Approach:

✔️ Deterministic – Always produces the same result for the same input.
✔️ Efficient – Fast execution since no learning is required.
✔️ Interpretable – Easy to understand and debug.

Cons of Functional Approach:

Not Adaptive – New email formats may require modifying the rules.
Hard to Scale – Complex validation rules (e.g., checking domain validity) need additional logic.
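
To make the scaling concern concrete, here is a minimal sketch of what happens when one more requirement is layered on top of the regex, in this case rejecting top-level domains outside a known set. The KNOWN_TLDS set and the function name is_valid_email_strict are assumptions made for this illustration; a real TLD list is far longer.

```python
import re

# Hypothetical allow-list for illustration only.
KNOWN_TLDS = {"com", "org", "net", "edu", "uk"}

PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")


def is_valid_email_strict(email):
    """Regex check plus one extra hand-written rule about the TLD."""
    if not PATTERN.fullmatch(email):
        return False
    # Extra rule: the last dot-separated label must be a known TLD.
    tld = email.rsplit(".", 1)[1].lower()
    return tld in KNOWN_TLDS


print(is_valid_email_strict("user@example.com"))  # True
print(is_valid_email_strict("user@example.zzz"))  # False
```

Every new requirement of this kind means another branch to write, test, and keep up to date by hand.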

2. Machine Learning Approach: Supervised Learning Classification

Instead of defining rules, ML models learn from data. We can frame this as a binary classification problem:

  • Input (X): A set of email addresses

  • Output (Y): A label (1 for valid, 0 for invalid)

Steps to Build the ML Model:

  1. Collect Data: Gather examples of valid and invalid email addresses.

  2. Extract Features: Convert email strings into numerical representations.

  3. Train a Model: Use a classification algorithm like Logistic Regression or Random Forest.

  4. Evaluate & Test: Check accuracy on unseen email data.

For a supervised learning approach to email validation, we need a labeled dataset where each email is marked as valid (Yes) or invalid (No).

A real dataset should contain thousands of email samples, covering different formats, domains, and errors like:

  • Typos (user@exmaple.com), which are still well-formed and can only be flagged with knowledge of real domains

  • Missing '@' (userexample.com)

  • Missing domain (user@.com)

  • Uncommon but valid formats (first.last+alias@gmail.com)

How Machine Learning Approaches This Problem

Machine learning learns patterns from labeled data rather than relying on fixed rules. Here’s how the ML pipeline works:

Step 1: Data Collection

Gather a diverse dataset of emails, ensuring both valid and invalid examples are included.

Step 2: Feature Extraction

Emails are text-based, so we convert them into numerical features. Common feature extraction techniques include:

  • Character-level n-grams (e.g., how often '@', '.', or short sequences such as '.co' occur)

  • Bag-of-Words or TF-IDF (raw or weighted counts of tokens or character sequences)

  • Regex-based and structural features (e.g., whether an email contains '@' and a valid TLD, or the length of the domain)
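
As a rough illustration of the first technique, character n-grams can be counted with the standard library alone. The function name char_ngrams and the default lengths of 2 to 3 are choices made for this sketch, not part of any particular library.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


features = char_ngrams("user@example.com")
print(features["@e"])   # 1
print(features[".co"])  # 1
print(features["zz"])   # 0 (a Counter returns 0 for unseen n-grams)
```

Each distinct n-gram becomes one column of the feature matrix, and the counts become that email's row.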

Step 3: Model Selection

Since this is a binary classification problem, common models include:

  • Logistic Regression (good for simple pattern learning)

  • Random Forest (handles a mix of simple and complex cases)

  • Deep Learning, such as LSTMs (useful for more flexible email structures)

Step 4: Training the Model

  • The dataset is split into training (80%) and testing (20%).

  • The model learns from the training data by adjusting its parameters (weights in logistic regression, split rules in a random forest) to reduce errors on the labeled examples.
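
The 80/20 split can be sketched with the standard library as follows. The helper name split_dataset, the seed, and the ten sample emails are assumptions for this illustration; libraries such as scikit-learn provide an equivalent helper.

```python
import random


def split_dataset(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Illustrative labeled emails (1 = valid, 0 = invalid).
data = [
    ("user@example.com", 1), ("test.email@domain.com", 1),
    ("hello.world@company.org", 1), ("invalid-email", 0),
    ("justwords", 0), ("missing@domain", 0),
    ("user@sub.example.com", 1), ("user@.com", 0),
    ("user@domain", 0), ("username@valid.co.uk", 1),
]

train, test = split_dataset(data)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters because datasets are often stored sorted by label, and fixing the seed keeps the split reproducible.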

Step 5: Prediction & Evaluation

  • The model is tested on unseen emails.

  • Performance is measured using accuracy, precision, recall, and F1-score.
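
The four metrics can be computed directly from the counts of true and false positives and negatives. The sketch below uses made-up labels and predictions purely to show the arithmetic; the function name classification_metrics is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = valid)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up example: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Precision answers "of the emails predicted valid, how many really are?", while recall answers "of the truly valid emails, how many did the model accept?".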


3. Why ML Instead of Functional Rules?

| Factor | Functional Rules | Machine Learning |
|---|---|---|
| Scalability | Hard to maintain for evolving formats | Learns new formats from data |
| Adaptability | Needs manual updates | Improves as more data is added |
| Edge Cases | Struggles with uncommon emails | Learns from similar patterns |
| Performance | Fast but rigid | Slightly slower but more flexible |

Real-World Use Case

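Big email services like Gmail, Outlook, and Yahoo use ML-powered spam detection to filter unwanted or fraudulent messages. Spam filtering is a different task from address validation, but it shows the same principle: instead of relying only on fixed rules, ML models adapt to new tricks used by spammers.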

Here’s an enhanced dataset with additional features extracted from each email to help a machine learning model differentiate between valid and invalid emails.


1. Supervised Learning Dataset for Email Validation

| Email Address | Valid (Label) | Contains '@' | Contains '.' | Domain Length | Local Part Length | Special Characters Count | Ends with Common TLD (.com, .org, .net, etc.) |
|---|---|---|---|---|---|---|---|
| user@example.com | Yes | 1 | 1 | 11 | 4 | 0 | 1 |
| test.email@domain.com | Yes | 1 | 1 | 10 | 10 | 1 | 1 |
| hello.world@company.org | Yes | 1 | 1 | 11 | 11 | 1 | 1 |
| invalid-email | No | 0 | 0 | 0 | 13 | 1 | 0 |
| justwords | No | 0 | 0 | 0 | 9 | 0 | 0 |
| missing@domain | No | 1 | 0 | 6 | 7 | 0 | 0 |
| user@sub.example.com | Yes | 1 | 1 | 15 | 4 | 0 | 1 |
| user@.com | No | 1 | 1 | 4 | 4 | 0 | 1 |
| user@domain | No | 1 | 0 | 6 | 4 | 0 | 0 |
| username@valid.co.uk | Yes | 1 | 1 | 11 | 8 | 0 | 1 |

2. Explanation of Features for Machine Learning Model

| Feature Name | Description |
|---|---|
| Contains '@' | Whether the email contains @ (1 = Yes, 0 = No). Essential for valid emails. |
| Contains '.' | Whether the email has at least one . (dot), usually needed for domains (1 = Yes, 0 = No). |
| Domain Length | The length of the domain part after @ (e.g., example.com has 11 characters). |
| Local Part Length | Length of the part before @ (e.g., in user@example.com, user has 4 characters). |
| Special Characters Count | Number of special characters (., +, _, -) in the local part. |
| Ends with Common TLD | Whether the domain ends in .com, .org, .net, .co.uk, etc. (1 = Yes, 0 = No). |
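
A minimal sketch of how these features might be computed from a raw string is shown below. The function name extract_features, the dictionary keys, and the short COMMON_TLDS tuple are assumptions for illustration. Domain length is taken as the full length of the part after '@' (so example.com gives 11), and counting '.' as a special character alongside +, _ and - is also an assumption of this sketch.

```python
COMMON_TLDS = (".com", ".org", ".net", ".co.uk")  # illustrative, not exhaustive
SPECIAL_CHARS = set(".+_-")


def extract_features(email):
    """Turn one email string into the numeric features described above."""
    # partition returns ("whole string", "", "") when there is no '@'
    local, at, domain = email.partition("@")
    return {
        "contains_at": int(bool(at)),
        "contains_dot": int("." in email),
        "domain_length": len(domain),
        "local_length": len(local),
        "special_count": sum(ch in SPECIAL_CHARS for ch in local),
        "common_tld": int(bool(at) and domain.endswith(COMMON_TLDS)),
    }


print(extract_features("user@sub.example.com"))
# {'contains_at': 1, 'contains_dot': 1, 'domain_length': 15, 'local_length': 4, 'special_count': 0, 'common_tld': 1}
```

Note that user@.com still scores 1 on both "contains dot" and "common TLD", which is why the length features are needed to tell it apart from a valid address.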

3. How ML Uses These Features to Classify Emails

Step 1: Feature Engineering

  • Convert the email text into numerical features using the table above.

  • Each email is now represented as a structured row in a dataset.

Step 2: Model Training

  • Use a classification algorithm like Logistic Regression, Random Forest, or Neural Networks.

  • Train the model on 80% of the dataset and test on the remaining 20%.

Step 3: Prediction

  • Given a new email, the model will analyze its features and assign a probability of being valid or invalid.

  • Example:

    • newuser@website.com → Model predicts: Valid

    • invalid-email → Model predicts: Invalid
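
To show where such a probability comes from, here is a small logistic regression trained from scratch on the ten emails in the dataset table. It is a sketch under stated assumptions rather than a production model: the helper names (featurize, train, predict_proba), the choice of four features, the scaling of domain length by 10, and the learning rate and epoch count are all choices made for this illustration.

```python
import math

COMMON_TLDS = (".com", ".org", ".net", ".co.uk")  # illustrative, not exhaustive

# The ten labeled emails from the dataset table (1 = valid, 0 = invalid).
DATA = [
    ("user@example.com", 1), ("test.email@domain.com", 1),
    ("hello.world@company.org", 1), ("invalid-email", 0),
    ("justwords", 0), ("missing@domain", 0),
    ("user@sub.example.com", 1), ("user@.com", 0),
    ("user@domain", 0), ("username@valid.co.uk", 1),
]


def featurize(email):
    """Four of the table's features; domain length is divided by 10 to keep scales similar."""
    _, at, domain = email.partition("@")
    return [
        float(bool(at)),
        float("." in email),
        len(domain) / 10.0,
        float(bool(at) and domain.endswith(COMMON_TLDS)),
    ]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the 'valid' class."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(rows, labels, lr=0.1, epochs=2000):
    """Plain stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


weights, bias = train([featurize(e) for e, _ in DATA], [y for _, y in DATA])

for email in ["newuser@website.com", "invalid-email"]:
    p = predict_proba(weights, bias, featurize(email))
    print(f"{email}: P(valid) = {p:.2f} -> {'Valid' if p >= 0.5 else 'Invalid'}")
```

The 0.5 cut-off is a convention, not a requirement: raising it makes the validator stricter, lowering it makes it more permissive.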


4. Why Machine Learning Instead of Rules?

| Factor | Rules-Based Approach | Machine Learning Approach |
|---|---|---|
| Hardcoded Rules | Manually define email structure rules | Learns from data and adapts |
| New Patterns | Needs updates for new email formats | Can learn new valid email structures when retrained on fresh examples |
| Scalability | Rule sets become hard to manage as they grow | Scales with more data |
| Accuracy | Can fail on complex edge cases | Improves with more training data |

Hybrid Approach

A combination of rule-based checks + ML can work best:

  • First, apply basic regex rules to filter obviously invalid emails.

  • Then, use ML for edge cases and advanced structures (e.g., unusual but valid emails).
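
The two-stage idea can be sketched as below. The model is passed in as a plain function returning P(valid); toy_score is a hypothetical stand-in for a trained model, and the names hybrid_is_valid and BASIC_PATTERN are likewise assumptions for this illustration.

```python
import re

# Stage 1: cheap structural rule (exactly one '@', something on both sides, no spaces).
BASIC_PATTERN = re.compile(r"[^@\s]+@[^@\s]+")


def hybrid_is_valid(email, model_score, threshold=0.5):
    """Reject obviously malformed input by rule, then defer to the model."""
    if not BASIC_PATTERN.fullmatch(email):
        return False  # the model is never called for these
    return model_score(email) >= threshold


def toy_score(email):
    """Hypothetical stand-in for a trained model's P(valid)."""
    domain = email.rsplit("@", 1)[1]
    return 0.9 if "." in domain.strip(".") else 0.1


print(hybrid_is_valid("user@example.com", toy_score))  # True
print(hybrid_is_valid("user@domain", toy_score))       # False
print(hybrid_is_valid("justwords", toy_score))         # False
```

Running the rule first keeps the common, clear-cut cases fast and deterministic, and reserves the model for inputs the rule cannot decide.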


Conclusion

Machine learning can make email validation more flexible by learning from real-world data. It reduces manual rule updates and helps in scenarios where traditional regex-based validation fails.

Example code: Using a Supervised ML Model in Python

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Sample labeled data (far too small for a real model; it only shows the workflow)
emails = ["user@example.com", "test@domain.com", "invalid-email", "hello@com", "name@domain"]
labels = [1, 1, 0, 0, 0]  # 1 = Valid, 0 = Invalid ("name@domain" has no TLD, so it is invalid)

# Convert text into numeric features: counts of character n-grams of length 2 to 4
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 4))
X = vectorizer.fit_transform(emails)

# Split data into training and test sets (with five samples, one email is held out)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train a classification model (fixed seed so results are reproducible)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Classify a new email, reusing the fitted vectorizer so the feature columns line up
test_email = ["new@example.com"]
test_vectorized = vectorizer.transform(test_email)
prediction = model.predict(test_vectorized)

print(f"{test_email[0]}: {'Valid' if prediction[0] == 1 else 'Invalid'}")

 


Pros of Machine Learning Approach:

✔️ Adaptive & Scalable – The model improves with more data.
✔️ Pattern Learning – Learns patterns that may be difficult to define explicitly.
✔️ Handles Variability – Works with diverse email formats and potential typos.

Cons of Machine Learning Approach:

Requires Training Data – Needs a dataset of labeled emails.
Computational Overhead – Requires training and inference time.
Lower Explainability – Predictions might not be as transparent as rule-based checks.


Which Approach is Better?

| Factor | Functional Approach | Machine Learning Approach |
|---|---|---|
| Accuracy | High if rules are well-defined | Improves with more data |
| Adaptability | Requires manual updates | Learns automatically from data |
| Scalability | Limited | Can handle large datasets |
| Interpretability | Easy to understand | Harder to explain predictions |
| Computation Time | Fast | Slower due to training |

Decision Making: 

  • Use the functional (rule-based) approach when rules are simple and deterministic.

  • Use Machine Learning when email validation requires handling complex, evolving patterns.

  • Hybrid Approach: Combine both! Use rules for quick checks and ML for edge cases.


Conclusion

Machine learning offers a powerful alternative to rule-based programming by learning from data rather than relying on manually defined rules. However, its success depends on the availability of quality training data and the problem complexity.

For simple cases like email validation, a functional approach may suffice, but for adaptive validation (e.g., detecting spam emails, phishing attempts), ML is the better choice.
