Machine Learning: Solving Problems Through Data Analysis
In modern computing, problems can often be solved using either traditional functional programming or machine learning (ML). While functional programming follows explicit rules and logic, ML takes a data-driven approach, learning patterns from existing data rather than relying on manually written rules.
In this article, we will explore these two approaches with a simple example: validating whether an email address is correctly formatted.
1. Functional Approach: Using Explicit Rules
A functional approach involves defining a set of rules to check if an email is valid. The basic structure of an email consists of:
A local part (before @)
A domain part (after @)
A top-level domain (e.g., .com, .org)
A simple function-based validation in Python might look like this:
```python
import re

def is_valid_email(email):
    # Basic structural check: a local part, '@', a domain, and a dot-separated TLD
    return re.match(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$", email) is not None

# Testing the function
emails = ["user@example.com", "invalid-email", "name@domain", "hello@.com"]
for email in emails:
    print(f"{email}: {'Valid' if is_valid_email(email) else 'Invalid'}")
```
✔️ Deterministic – Always produces the same result for the same input.
✔️ Efficient – Fast execution since no learning is required.
✔️ Interpretable – Easy to understand and debug.
❌ Not Adaptive – New email formats may require modifying the rules.
❌ Hard to Scale – Complex validation rules (e.g., checking domain validity) need additional logic.
2. Machine Learning Approach: Learning from Data
Instead of defining rules, ML models learn from data. We can frame this as a binary classification problem:
Input (X): A set of email addresses
Output (Y): A label (1 for valid, 0 for invalid)
Collect Data: Gather examples of valid and invalid email addresses.
Extract Features: Convert email strings into numerical representations.
Train a Model: Use a classification algorithm like Logistic Regression or Random Forest.
Evaluate & Test: Check accuracy on unseen email data.
A real dataset should contain thousands of email samples, covering different formats, domains, and errors like the ones below (a small labeled sample is sketched after this list):
Typos (user@exmaple.com)
Missing '@' (userexample.com)
Missing domain (user@.com)
Uncommon but valid formats (first.last+alias@gmail.com)
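Framed as data, such a collection is simply a set of email/label pairs. A minimal sketch of what that labeled sample might look like, using a few of the examples above (a real dataset would hold thousands of such pairs):

```python
# A tiny labeled sample (1 = valid, 0 = invalid)
dataset = [
    ("user@example.com", 1),            # well-formed address
    ("first.last+alias@gmail.com", 1),  # uncommon but valid format
    ("userexample.com", 0),             # missing '@'
    ("user@.com", 0),                   # missing domain
]
X = [email for email, _ in dataset]     # Input (X): the raw address strings
y = [label for _, label in dataset]     # Output (Y): the validity labels
```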
Machine learning learns patterns from labeled data rather than relying on fixed rules. Here’s how the ML pipeline works:
Gather a diverse dataset of emails, ensuring both valid and invalid examples are included.
Emails are text-based, so we convert them into numerical features. Common feature extraction techniques include (a short sketch follows this list):
Character-level n-grams (e.g., frequency of '@', '.', domain length)
Bag-of-Words or TF-IDF (counts of character sequences)
Regex-based Features (e.g., whether an email contains '@' and a valid TLD)
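As a minimal sketch of two of these techniques, character-level n-grams can be produced with scikit-learn's CountVectorizer, and a regex-based feature can be computed directly (the n-gram range and the TLD pattern below are illustrative choices, not fixed recommendations):

```python
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

emails = ["user@example.com", "invalid-email", "name@domain", "hello@.com"]

# Character-level n-grams: counts of 2- and 3-character sequences per email
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
ngram_features = vectorizer.fit_transform(emails)
print(ngram_features.shape)  # (4, number of distinct character n-grams seen)

# Regex-based feature: does the address end with a common TLD?
has_common_tld = np.array([1 if re.search(r"\.(com|org|net)$", e) else 0 for e in emails])
print(has_common_tld)        # [1 0 0 1]
```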
Since this is a binary classification problem, common models include:
Logistic Regression (good for simple pattern learning)
Random Forest (handles a mix of simple and complex cases)
Deep Learning with LSTMs (useful for more flexible email structures)
The dataset is split into training (80%) and testing (20%).
The model learns from the training data by adjusting weights based on patterns.
The model is tested on unseen emails.
Performance is measured using accuracy, precision, recall, and F1-score.
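Putting the model choice, split, and evaluation steps together, a minimal training sketch with scikit-learn might look like this (the feature matrix here is random placeholder data standing in for real extracted email features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Placeholder data: X would normally be the numerical features extracted from emails,
# and y the labels (1 = valid, 0 = invalid).
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = rng.integers(0, 2, 200)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The model adjusts its weights to fit patterns in the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the unseen test split
y_pred = model.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall   :", recall_score(y_test, y_pred, zero_division=0))
print("F1-score :", f1_score(y_test, y_pred, zero_division=0))
```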
Factor | Functional Rules | Machine Learning |
---|---|---|
Scalability | Hard to maintain for evolving formats | Learns new formats from data |
Adaptability | Needs manual updates | Improves as more data is added |
Edge Cases | Struggles with uncommon emails | Learns from similar patterns |
Performance | Fast but rigid | Slightly slower but more flexible |
Big email services like Gmail, Outlook, and Yahoo use ML-powered spam detection to filter invalid or fraudulent emails. Instead of using just fixed rules, ML models adapt to new tricks used by spammers.
Here’s an enhanced dataset with additional features extracted from each email to help a machine learning model differentiate between valid and invalid emails.
Email Address | Valid (Label) | Contains '@' | Contains '.' | Domain Length | Local Part Length | Special Characters Count | Ends with Common TLD (.com, .org, .net, etc.) |
---|---|---|---|---|---|---|---|
user@example.com | Yes | 1 | 1 | 11 | 4 | 0 | 1 |
test.email@domain.com | Yes | 1 | 1 | 10 | 10 | 1 | 1 |
hello.world@company.org | Yes | 1 | 1 | 11 | 11 | 1 | 1 |
invalid-email | No | 0 | 0 | 0 | 13 | 1 | 0 |
justwords | No | 0 | 0 | 0 | 9 | 0 | 0 |
missing@domain | No | 1 | 0 | 6 | 7 | 0 | 0 |
user@sub.example.com | Yes | 1 | 1 | 15 | 4 | 0 | 1 |
user@.com | No | 1 | 1 | 4 | 4 | 0 | 1 |
user@domain | No | 1 | 0 | 6 | 4 | 0 | 0 |
username@valid.co.uk | Yes | 1 | 1 | 11 | 8 | 0 | 1 |
Feature Name | Description |
---|---|
Contains '@' | Whether the email contains @ (1 = Yes, 0 = No). Essential for valid emails. |
Contains '.' | Checks if the email has at least one . (dot), usually needed for domains. |
Domain Length | The length of the domain part (e.g., example.com has 11 characters). |
Local Part Length | Length of the part before @ (e.g., user@example.com → user has 4 chars). |
Special Characters Count | Number of special characters (., +, _, -) in the local part. |
Ends with Common TLD | Whether the domain ends in .com , .org , .net , .co.uk , etc. |
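A minimal sketch of how these features could be computed for a single address (the extract_features helper and the list of "common" TLDs are illustrative assumptions):

```python
COMMON_TLDS = (".com", ".org", ".net", ".co.uk")

def extract_features(email):
    # Everything before the first '@' is the local part, everything after is the domain
    local, _, domain = email.partition("@")
    return {
        "contains_at": 1 if "@" in email else 0,
        "contains_dot": 1 if "." in email else 0,
        "domain_length": len(domain),
        "local_part_length": len(local),
        "special_char_count": sum(local.count(c) for c in ".+_-"),
        "ends_with_common_tld": 1 if email.endswith(COMMON_TLDS) else 0,
    }

print(extract_features("user@example.com"))
# {'contains_at': 1, 'contains_dot': 1, 'domain_length': 11,
#  'local_part_length': 4, 'special_char_count': 0, 'ends_with_common_tld': 1}
```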
Convert the email text into numerical features using the table above.
Each email is now represented as a structured row in a dataset.
Use a classification algorithm like Logistic Regression, Random Forest, or Neural Networks.
Train the model on 80% of the dataset and test on the remaining 20%.
Given a new email, the model will analyze its features and assign a probability of being valid or invalid.
Example:
newuser@website.com → Model predicts: Valid
invalid-email → Model predicts: Invalid
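A compact end-to-end sketch of that prediction step, reusing a feature helper like the one above and the small feature table as training data (this dataset is far too small for a reliable model and is only meant to show the mechanics):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def to_features(email):
    # Fixed-order feature vector mirroring the feature table above
    local, _, domain = email.partition("@")
    return [
        1 if "@" in email else 0,
        1 if "." in email else 0,
        len(domain),
        len(local),
        sum(local.count(c) for c in ".+_-"),
        1 if email.endswith((".com", ".org", ".net", ".co.uk")) else 0,
    ]

train_emails = [
    "user@example.com", "test.email@domain.com", "hello.world@company.org",
    "user@sub.example.com", "username@valid.co.uk",                               # valid
    "invalid-email", "justwords", "missing@domain", "user@.com", "user@domain",   # invalid
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X = np.array([to_features(e) for e in train_emails])
model = LogisticRegression(max_iter=1000).fit(X, labels)

for email in ["newuser@website.com", "invalid-email"]:
    p_valid = model.predict_proba([to_features(email)])[0][1]
    print(f"{email}: {'Valid' if p_valid >= 0.5 else 'Invalid'} (p={p_valid:.2f})")
```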
Factor | Rules-Based Approach | Machine Learning Approach |
---|---|---|
Hardcoded Rules | Manually define email structure rules | Learns from data and adapts |
New Patterns | Needs updates for new email formats | Automatically recognizes new valid email structures |
Scalability | Hard to manage for large datasets | Easily scales with more data |
Accuracy | Can fail on complex edge cases | Improves with more training data |
A combination of rule-based checks + ML can work best (a short sketch follows these steps):
First, apply basic regex rules to filter obviously invalid emails.
Then, use ML for edge cases and advanced structures (e.g., unusual but valid emails).
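A minimal sketch of that hybrid flow, with a cheap regex pre-filter in front of a model call (the ml_predict function below is a stand-in for whatever trained classifier you use):

```python
import re

def passes_basic_rules(email):
    # Cheap pre-filter: one '@', no whitespace, and a dot somewhere in the domain part
    return re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email) is not None

def ml_predict(email):
    # Stand-in for a trained classifier (e.g., the logistic regression model above);
    # a real implementation would featurize the email and return the predicted label.
    return 1

def validate(email):
    if not passes_basic_rules(email):
        return "Invalid"  # rejected cheaply, no model call needed
    return "Valid" if ml_predict(email) == 1 else "Invalid"

for email in ["user@example.com", "userexample.com", "first.last+alias@gmail.com"]:
    print(email, "->", validate(email))
```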
Machine learning makes email validation more flexible and intelligent by learning from real-world data. It reduces manual rule updates and helps in scenarios where traditional regex-based validation fails.
✔️ Adaptive & Scalable – The model improves with more data.
✔️ Pattern Learning – Learns patterns that may be difficult to define explicitly.
✔️ Handles Variability – Works with diverse email formats and potential typos.
❌ Requires Training Data – Needs a dataset of labeled emails.
❌ Computational Overhead – Requires training and inference time.
❌ Lower Explainability – Predictions might not be as transparent as rule-based checks.
Factor | Functional Approach | Machine Learning Approach |
---|---|---|
Accuracy | High if rules are well-defined | Improves with more data |
Adaptability | Requires manual updates | Learns automatically from data |
Scalability | Limited | Can handle large datasets |
Interpretability | Easy to understand | Harder to explain predictions |
Computation Time | Fast | Slower due to training |
Decision Making:
Use Functional Programming when rules are simple and deterministic.
Use Machine Learning when email validation requires handling complex, evolving patterns.
Hybrid Approach: Combine both! Use rules for quick checks and ML for edge cases.
Machine learning offers a powerful alternative to rule-based programming by learning from data rather than relying on manually defined rules. However, its success depends on the availability of quality training data and the problem complexity.
For simple cases like email validation, a functional approach may suffice, but for adaptive validation (e.g., detecting spam emails, phishing attempts), ML is the better choice.