What are L1 and L2 regularization?
L1 and L2 regularization are two popular techniques used to prevent overfitting in machine learning models.
What is Overfitting?
Overfitting occurs when a model is too complex and learns the noise in the training data, resulting in poor performance on new, unseen data.
L1 Regularization (Lasso Regression)
L1 regularization, used in Lasso regression, adds a penalty term to the loss function that discourages large weights. The penalty term is proportional to the sum of the absolute values of the weights.
Mathematically:
Loss function = (Sum of squared errors) + α * (Sum of absolute values of weights)
where α is the regularization strength.
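As a concrete illustration, here is a minimal NumPy sketch that computes this L1-regularized loss for a toy linear model. The data, weights, and α are made-up values, and real libraries may scale the error term differently (for example by averaging over the data points).

```python
import numpy as np

# Toy data, weights, and regularization strength (all values illustrative)
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])  # features
y = np.array([3.0, 2.5, 4.0])                        # targets
w = np.array([0.8, 0.4])                             # model weights
alpha = 0.1                                          # regularization strength α

y_pred = X @ w                                       # model predictions
sse = np.sum((y - y_pred) ** 2)                      # sum of squared errors
l1_penalty = alpha * np.sum(np.abs(w))               # α * (sum of |weights|)
print("L1-regularized loss:", sse + l1_penalty)      # 3.89 + 0.12 = 4.01
```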
L2 Regularization (Ridge Regression)
L2 regularization, used in Ridge regression, adds a penalty term to the loss function that discourages large weights. The penalty term is proportional to the sum of the squares of the weights.
Mathematically:
Loss function = (Sum of squared errors) + α * (Sum of squares of weights)
where α is the regularization strength.
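Continuing the same toy setup, only the penalty term changes for L2. As before, the values are illustrative rather than how any particular library defines its objective.

```python
import numpy as np

# Same toy setup as the L1 sketch above; only the penalty term differs
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([3.0, 2.5, 4.0])
w = np.array([0.8, 0.4])
alpha = 0.1

sse = np.sum((y - X @ w) ** 2)                       # sum of squared errors
l2_penalty = alpha * np.sum(w ** 2)                  # α * (sum of weights²)
print("L2-regularized loss:", sse + l2_penalty)      # 3.89 + 0.08 = 3.97
```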
Key Differences
- Penalty term: L1 regularization uses the absolute value of weights, while L2 regularization uses the square of weights.
- Sparsity: L1 regularization can produce sparse models (models with some weights set exactly to zero), while L2 regularization typically does not.
- Effect on weights: L1 regularization can shrink weights all the way to zero, while L2 regularization only shrinks them toward zero; the sketch after this list illustrates the difference.
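One way to see the sparsity difference is to fit both models on synthetic data where only a few features matter. This is a minimal scikit-learn sketch; the dataset parameters and α = 1.0 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 5 of the 20 features are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso typically drives many coefficients exactly to zero;
# Ridge shrinks them but usually leaves them non-zero.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```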
When to Use Each
- L1 Regularization: Use when you want a sparse model, or when you have many features and expect only a subset of them to be relevant.
- L2 Regularization: Use when you want to reduce the magnitude of all weights without forcing any of them exactly to zero.
Hyperparameter Tuning
Both L1 and L2 regularization require tuning the regularization strength (α) to achieve optimal results. This can be done using techniques like cross-validation.
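As a sketch of how this might look, scikit-learn's LassoCV and RidgeCV can search a grid of candidate α values with cross-validation; the synthetic data and the grid below are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
alphas = np.logspace(-3, 2, 30)   # candidate regularization strengths

# 5-fold cross-validation over the candidate α values
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10_000).fit(X, y)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("Best α for Lasso:", lasso_cv.alpha_)
print("Best α for Ridge:", ridge_cv.alpha_)
```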
Why do we add a penalty in Lasso regression? Explain the loss function concept with the penalty.
Let's break down the concept of loss functions and penalties in Lasso regression.
What is a Loss Function?
A loss function measures the difference between the model's predictions and the actual true values. The goal of training a model is to minimize the loss function.
Mean Squared Error (MSE) Loss Function
One common loss function is Mean Squared Error (MSE), which calculates the average squared difference between predicted and actual values:
MSE Loss Function = (1/n) * Σ(y_true - y_pred)^2
where:
- y_true: actual true values
- y_pred: model's predictions
- n: number of data points
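For example, here is a small NumPy sketch of the MSE for three made-up data points:

```python
import numpy as np

# Three made-up data points
y_true = np.array([3.0, 2.5, 4.0])   # actual values
y_pred = np.array([2.8, 2.9, 3.5])   # model predictions

n = len(y_true)
mse = np.sum((y_true - y_pred) ** 2) / n   # (1/n) * Σ(y_true - y_pred)²
print(mse)   # (0.2² + 0.4² + 0.5²) / 3 = 0.15
```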
What is the Penalty in Lasso Regression?
In Lasso regression, we add a penalty term to the loss function to discourage large weights. The penalty term is proportional to the absolute value of the weights:
Lasso Loss Function = MSE Loss Function + α * Σ|weights|
where:
- α: regularization strength (hyperparameter)
- Σ|weights|: sum of absolute values of model weights
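Building on the MSE sketch above, the Lasso loss simply adds the weighted L1 penalty; the weights and α below are made-up values for illustration.

```python
import numpy as np

y_true = np.array([3.0, 2.5, 4.0])       # actual values
y_pred = np.array([2.8, 2.9, 3.5])       # model predictions
weights = np.array([0.8, -0.4, 0.0])     # hypothetical model weights
alpha = 0.1                              # regularization strength α

mse = np.mean((y_true - y_pred) ** 2)             # MSE loss
l1_penalty = alpha * np.sum(np.abs(weights))      # α * Σ|weights|
print(mse + l1_penalty)   # 0.15 + 0.1 * 1.2 = 0.27
```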
Why Add a Penalty?
We add a penalty to the loss function for several reasons:
- Prevent Overfitting: Large weights can lead to overfitting, where the model memorizes the training data rather than learning generalizable patterns. The penalty discourages large weights, reducing the risk of overfitting.
- Encourage Sparsity: The penalty term can shrink some weights to zero, resulting in a sparse model. This can be beneficial when working with high-dimensional data, as it can reduce the impact of irrelevant features.
- Improve Interpretability: By shrinking some weights to zero, the penalty term can make the model more interpretable, as only the most important features are retained.
How Does the Penalty Affect the Model?
The penalty term affects the model in several ways:
- Weight Shrinkage: The penalty term reduces the magnitude of the model weights.
- Feature Selection: The penalty term can shrink some weights to zero, effectively selecting only the most important features.
- Reduced Overfitting: The penalty term reduces the risk of overfitting by discouraging large weights.
By adding a penalty term to the loss function, Lasso regression encourages sparse models, reduces overfitting, and improves interpretability. The sketch below shows how these effects become stronger as α increases.
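This is an illustrative sweep with scikit-learn; the synthetic dataset and the α grid are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data where only 5 of the 20 features are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# As α grows, coefficients shrink and more of them are driven exactly to zero
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    print(f"α={alpha:>6}: non-zero weights={np.sum(coef != 0):2d}, "
          f"mean |weight|={np.mean(np.abs(coef)):.3f}")
```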