Lasso regression is a notable machine learning algorithm that performs linear regression while also reducing the number of features used by the model.
Also known as L1-norm regularization, Lasso regression adds a penalty term to the cost function that is proportional to the sum of the absolute values of the coefficients. This pushes the model to keep only the most important features and to shrink the coefficients of less important ones to exactly zero. Lasso regression is an extension of linear regression: a regularization parameter multiplied by the sum of the absolute weight values is added to the loss function of the ordinary least squares technique.
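To make the penalty concrete, here is a rough sketch of the penalized objective written with NumPy. The function name and exact scaling are our own choices for illustration; scikit-learn's Lasso actually divides the squared-error term by 2 * n_samples, but the structure is the same: a squared-error term plus alpha times the sum of the absolute coefficients.
# Illustrative sketch of the Lasso objective (names and scaling are ours, not scikit-learn's)
import numpy as np

def lasso_objective(X, y, w, alpha):
    residuals = y - X @ w  # prediction errors of a linear model with weights w
    return np.sum(residuals ** 2) + alpha * np.sum(np.abs(w))  # least squares + L1 penalty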
Compared with alternative regularization approaches such as Ridge regression, which uses L2 regularization, Lasso regression has the advantage of producing sparse solutions, in which only a subset of the features is used by the model. This trait makes Lasso regression a popular choice for feature selection and for analyzing high-dimensional data.
However, Lasso regression has a drawback when the number of features exceeds the number of samples: in that setting it can select at most as many features as there are samples, so its habit of zeroing out coefficients can discard useful information when many features are relevant.
What is Lasso?
Lasso stands for least absolute shrinkage and selection operator. Pay attention to the words "least absolute shrinkage" and "selection"; we will come back to them shortly.
Lasso regression is used in machine learning to prevent overfitting. It is also used to select features by setting coefficients to zero.
What is Regression?
Regression, when it comes to statistics and machine learning, is a way to figure out how things are connected. You take some things that might affect something else, and you try to find out how much they actually do. The main point of this kind of math is to see how changes in one thing are connected to changes in another. It's like trying to predict what will happen based on certain factors.
The thing you're trying to figure out or predict is called the "outcome." And the factors that might be influencing it are called "independent variables." This math helps you put numbers on how these things are linked.
There are different methods to do this math, but two big ones are:
- Linear Regression: This is like drawing a straight line that fits the data. The idea is to find the best line that gets really close to the real points. A problem with linear regression is that the estimated coefficients of the model can become large, making the model sensitive to its inputs and possibly unstable (the short sketch after this list makes this concrete).
- Logistic Regression: This sounds complicated, but it's just used to tell whether something is one thing or another. Like, if you have data about whether it's sunny or rainy and you want to predict the weather for tomorrow.
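Here is a minimal sketch of our own (synthetic data, names chosen by us) showing that instability: two nearly identical input columns are enough to make plain linear regression produce large, offsetting coefficients, while Lasso keeps them small.
# Sketch: collinear features can inflate plain linear regression coefficients; Lasso shrinks them
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
X = np.hstack([x, x + rng.normal(scale=1e-3, size=(100, 1))])  # two almost identical columns
y = 3 * x.ravel() + rng.normal(scale=0.1, size=100)
print(LinearRegression().fit(X, y).coef_)  # coefficients are likely large and offsetting
print(Lasso(alpha=0.1).fit(X, y).coef_)    # small coefficients; one may be exactly zero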
Other ways to do this math include using curved lines (polynomial regression), adding penalties that keep the coefficients from growing too large (ridge and lasso regression), and even fancier methods like support vector regression and random forest regression.
In simple terms, regression is a basic tool to understand how things are linked, make guesses about the future, and get some smart insights from numbers.
Lasso Regression Python Example
In Python, Lasso regression can be implemented using the Lasso class from the sklearn.linear_model module.
Lasso Regression in Python Using Sklearn Library
# Import the necessary modules from scikit-learn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
# Load the diabetes dataset
diabetes_data = datasets.load_diabetes()
# Split the data into training and test sets
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(diabetes_data.data, diabetes_data.target, test_size=0.3, random_state=42)
# Scale the data using StandardScaler
data_scaler = StandardScaler()
X_train_scaled = data_scaler.fit_transform(X_train_orig)
X_test_scaled = data_scaler.transform(X_test_orig)
# Fit Lasso regression model
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train_scaled, y_train_orig)
# Evaluate model performance on the test set
y_pred = lasso_reg.predict(X_test_scaled)
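# Mean squared error between the predictions and the true test targets
# (this uses the mean_squared_error function imported above)
mse = mean_squared_error(y_test_orig, y_pred)
print("Mean Squared Error: ", mse)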
# Model Score
model_score = lasso_reg.score(X_test_scaled, y_test_orig)
print("Model Score: ", model_score)
# Lasso Coefficients
lasso_coefficients = lasso_reg.coef_
Here, the code imports several modules from scikit-learn: datasets for loading datasets, train_test_split for splitting data into training and test sets, Lasso for creating a Lasso regression model, mean_squared_error for calculating the mean squared error, and StandardScaler for feature scaling. The diabetes dataset is loaded with scikit-learn's built-in load_diabetes() function and split into training and test sets. A StandardScaler instance is then used to standardize the feature data: the scaler is fitted on the training features (X_train_orig) to compute their mean and standard deviation, and both the training and test features are scaled using these statistics. A Lasso model with alpha=0.1 is fitted on the scaled training data. The code then predicts the target values for the scaled test features (X_test_scaled), computes the mean squared error of those predictions, and evaluates the model with the .score() method, which returns the coefficient of determination (R^2) between predicted and true values. The score is printed to the console.
The code prints the R-squared model score to assess the performance. The Lasso coefficients (regression coefficients) are stored in the lasso_coefficients variable.
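Because Lasso drives the coefficients of less useful features to exactly zero, one way to see which features survived is to inspect lasso_coefficients directly. A minimal sketch reusing the variables defined above (with this alpha, which coefficients end up exactly zero depends on the data split):
# Sketch: list which diabetes features Lasso kept (non-zero) and which it zeroed out
import numpy as np
feature_names = np.array(diabetes_data.feature_names)
print("Kept features:   ", feature_names[lasso_coefficients != 0])
print("Dropped features:", feature_names[lasso_coefficients == 0])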
So here we showed how to load a dataset, split it into training and test sets, scale the features, train a Lasso regression model, evaluate its performance, and extract the model's coefficients using scikit-learn.
Lasso Regression Using NumPy and CSV Files
Let’s introduce the housing dataset. The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.
Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.
Mean Absolute Error (MAE) is a common metric used to measure the accuracy of a predictive model, particularly in regression tasks.
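Concretely, MAE is simply the average of the absolute differences between the predicted and true values. A tiny sketch with made-up numbers of our own:
# Sketch: MAE is the mean of the absolute prediction errors
import numpy as np
y_true = np.array([24.0, 21.6, 34.7])
y_pred = np.array([25.0, 20.0, 33.0])
print(np.mean(np.abs(y_true - y_pred)))  # (1.0 + 1.6 + 1.7) / 3 = 1.433...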
The dataset involves predicting the house price given details of the house suburb in the American city of Boston.
Here is an example:
# Import necessary libraries
import pandas as pd
# Load the housing dataset
example_data = pd.read_csv("example.csv", header=None)
# Display the shape of the dataset
print(example_data.shape)
# Display the first few rows of the dataset
print(example_data.head())
#Output:
#(475, 14)
# 0 1 2 3 4 5 ... 8 9 10 11 12 13
#0 0.01 18.0 2.31 0 0.54 6.58 ... 1 296.0 15.3 396.90 4.98 24.0
#1 0.03 0.0 7.07 0 0.47 6.42 ... 2 242.0 17.8 396.90 9.14 21.6
#2 0.03 0.0 7.07 0 0.47 7.18 ... 2 242.0 17.8 392.83 4.03 34.7
#3 0.03 0.0 2.18 0 0.46 7.00 ... 3 222.0 18.7 394.63 2.94 33.4
#4 0.07 0.0 2.18 0 0.46 7.15 ... 3 222.0 18.7 396.90 5.33 36.2
The example loads the dataset from the CSV file into a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.
Next, we evaluate the Lasso penalized regression algorithm on this dataset using the Lasso class from the scikit-learn Python machine learning library.
We can evaluate the Lasso Regression model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.linear_model import Lasso
# Load the housing dataset
data_df = pd.read_csv("example.csv", header=None)
data = data_df.values
X_features, y_target = data[:, :-1], data[:, -1]
# Define the Lasso regression model
lasso_model = Lasso(alpha=1.0)
# Define the cross-validation strategy
cv_strategy = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# Evaluate the model using cross-validation
neg_mean_absolute_errors = cross_val_score(lasso_model, X_features, y_target, scoring='neg_mean_absolute_error', cv=cv_strategy, n_jobs=-1)
# Convert negative errors to positive
pos_mean_absolute_errors = np.absolute(neg_mean_absolute_errors)
# Calculate and print mean and standard deviation of positive MAE scores
mean_mae = np.mean(pos_mean_absolute_errors)
std_mae = np.std(pos_mean_absolute_errors)
print('Mean Absolute Error (MAE): %.3f (%.3f)' % (mean_mae, std_mae))
#Output:
#Mean Absolute Error (MAE): 3.711 (0.549)
Confusingly, the regularization term usually written as lambda is configured via the “alpha” argument when defining the class. The default value is 1.0, i.e., a full penalty.
Running the example evaluates the Lasso Regression algorithm on the dataset and reports the average MAE across the three repeats of 10-fold cross-validation.
Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.
In this case, we can see that the model achieved a MAE of about 3.711.
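For comparison with the naive-baseline figure quoted earlier, a model that always predicts the mean of the training targets can be scored under the same cross-validation setup. A minimal sketch reusing the variables defined above (DummyRegressor is scikit-learn's built-in baseline regressor and is not used elsewhere in this article; the exact value printed depends on the data file):
# Sketch: score a naive mean-predicting baseline with the same cross-validation strategy
from sklearn.dummy import DummyRegressor
baseline_model = DummyRegressor(strategy='mean')
baseline_scores = cross_val_score(baseline_model, X_features, y_target, scoring='neg_mean_absolute_error', cv=cv_strategy, n_jobs=-1)
print('Baseline MAE: %.3f' % np.mean(np.absolute(baseline_scores)))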
Lasso Regression Prediction in Python
We may decide to use the Lasso Regression as our final model and make predictions on new data.
This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.
We can demonstrate this with a complete example, listed below.
# Import necessary libraries
from pandas import read_csv as load_csv
from sklearn.linear_model import Lasso as LassoRegression
# Load the dataset
data_table = load_csv("example.csv", header=None)
dataset = data_table.values
input_data, target = dataset[:, :-1], dataset[:, -1]
# Define the model
regressor = LassoRegression(alpha=1.0)
# Fit the model on all available data
regressor.fit(input_data, target)
# Define a new row of data
new_sample = [0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20, 4.0900, 1, 296.0, 15.30, 396.90, 4.98]
# Make a prediction
prediction = regressor.predict([new_sample])
# Report the prediction
print('Oracle Predicts: %.3f' % prediction)
#Output:
#Oracle Predicts: 30.998
Running the example fits the model and makes a prediction for the new row of data.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.
Next, we can look at configuring the model hyperparameters.
Changing Lasso Hyperparameters in Python
We know that the default value of the alpha hyperparameter is 1.0. However, it is good practice to test a range of different configurations and discover which one works best for our dataset.
Changing the Configuration with GridSearchCV in Python
One approach would be to grid search alpha values from perhaps 1e-5 to 100 on a log-10 scale and discover what works best for a dataset. Another approach would be to test values between 0.0 and 1.0 with a grid separation of 0.01. The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.
# Grid search for optimal Lasso Regression hyperparameters
from numpy import arange as create_range
from pandas import read_csv as acquire_data
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lasso as TheLasso
# Load the dataset
data_scroll = acquire_data("example.csv", header=None)
data_treasures = data_scroll.values
X_marks_the_features, y_guards_the_target = data_treasures[:, :-1], data_treasures[:, -1]
# Define the model
model_of_choice = TheLasso()
# Define the model evaluation method
folded_kingdoms = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# Define the grid of alpha values to search
hyperparam_grid = dict()
hyperparam_grid['alpha'] = create_range(0, 1, 0.01)
# Define the grid search
hyperparam_hunt = GridSearchCV(model_of_choice, hyperparam_grid, scoring='neg_mean_absolute_error', cv=folded_kingdoms, n_jobs=-1)
# Perform the search
results_of_quest = hyperparam_hunt.fit(X_marks_the_features, y_guards_the_target)
# Summarize the results
print('Mystical MAE: %.3f' % results_of_quest.best_score_)
print('Optimal Configurations: %s' % results_of_quest.best_params_)
#Output:
#Mystical MAE: -3.379
#Optimal Configurations: {'alpha': 0.01}
In this case, we can see that we achieved slightly better results than with the default configuration: an MAE of 3.379 vs. 3.711. Ignore the sign; the library makes the MAE negative for optimization purposes.
We can see that the search selected an alpha value of 0.01 for the penalty.
Changing Alpha Using LassoCV Class in Python
The scikit-learn library also provides a built-in version of the algorithm that automatically searches for good hyperparameters via the LassoCV class.
To use this class, the model is fit on the training dataset in the usual way, and the best hyperparameter is found automatically as part of training. The fit model can then be used to make a prediction.
By default, the LassoCV class evaluates the model's performance across a collection of 100 alpha values. We can change this to a grid of values between 0 and 1 with a separation of 0.01, as we did in the previous example, by setting the “alphas” argument.
The example below demonstrates this.
# Utilize the Lasso Regression algorithm with automatic configuration
from numpy import arange as create_sequence
from pandas import read_csv as load_data
from sklearn.linear_model import LassoCV as AutoLasso
from sklearn.model_selection import RepeatedKFold as CyclicFolds
# Load the dataset
data_table = load_data("example.csv", header=None)
data_store = data_table.values
input_data, target_values = data_store[:, :-1], data_store[:, -1]
# Determine the model evaluation approach
iterating_folds = CyclicFolds(n_splits=10, n_repeats=3, random_state=1)
auto_reg_model = AutoLasso(alphas=create_sequence(0, 1, 0.01), cv=iterating_folds, n_jobs=-1)
auto_reg_model.fit(input_data, target_values)
print('Optimal alpha: %f' % auto_reg_model.alpha_)
# Output:
# Optimal alpha: 0.000000
Running the example fits the model and discovers the hyperparameter that gives the best results via cross-validation.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.
In this case, we can see that the model chose the hyperparameter of alpha=0.0, which effectively switches the penalty off and makes the model equivalent to ordinary least squares. This is different from what we found via our manual grid search, perhaps due to the systematic way in which configurations were searched or selected.
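If we want to keep the automatic search but avoid alpha=0, one option is to start the grid just above zero. A sketch under the same setup as the previous example, reusing its variable names:
# Sketch: reuse the LassoCV setup above but exclude alpha = 0 from the search grid
auto_reg_model = AutoLasso(alphas=create_sequence(0.01, 1, 0.01), cv=iterating_folds, n_jobs=-1)
auto_reg_model.fit(input_data, target_values)
print('Optimal alpha: %f' % auto_reg_model.alpha_)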