Table of Contents
URL: https://www.progressiverobot.com/logistic-regression-with-scikit-learn/
Introduction
Machine learning heavily relies on logistic regression as one of its essential classification techniques. The term “regression” appears in its name because of its historical background, yet logistic regression is mainly used for classification purposes. This Scikit-learn logistic regression tutorial thoroughly covers logistic regression theory and its implementation in Python while detailing Scikit-learn parameters and hyperparameter tuning methods.
It demonstrates how logistic regression makes binary classification and multiclass problems straightforward. At the end of this guide, you will have developed a strong knowledge base to use Python logistic regression code with a dataset. You will also learn how to interpret results and enhance model performance.
Prerequisites
- Understanding the fundamental concepts of classification, supervised learning, and model evaluation metrics (accuracy, precision, recall).
- The ability to use Python for data manipulation and model training through libraries such as NumPy, Pandas, and Scikit-learn.
- Understanding linear algebra concepts, basic probability theory, and statistics will provide the foundation to grasp the mathematical formulation of logistic regression.
- A basic grasp of gradient descent and loss functions, as logistic regression minimizes a cost function to optimize model performance.
Understanding Scikit-learn and Its Role in Machine Learning
Scikit-learn is a widely open-source Python library and an essential tool for machine learning tasks. It offers straightforward and powerful data analysis and mining tools based on NumPy, SciPy, and Matplotlib. Its API documentation and algorithms make it an indispensable resource for machine learning engineers and data scientists.
Scikit-learn can be described as a complete package for building machine learning models with minimal coding. These models include linear regression, decision trees, support vector machines, logistic regression, etc.. The library provides tools for data preprocessing, feature engineering, model selection, and hyperparameter tuning. This Python Scikit-learn Tutorial provides an introduction to Scikit-learn.
Mathematical Foundation of Logistic Regression
Understanding the math behind logistic regression will help us understand how it extends a simple linear model into a powerful tool for handling binary classification tasks. The coming sections explore concepts such as the sigmoid function, odds, log-odd interpretations, and the cost function that regulates the logistic regression learning process.
The Sigmoid Function
The sigmoid function is the core of logistic regression. This function takes any real number and maps it to a value between 0 and 1. It can be expressed mathematically as:
!image.png)
where,
!image.png)
Since σ(z) always returns a value between 0 and 1(no matter the input z), it effectively converts a linear combination of input features into a probability. This allows logistic regression to classify inputs into one of two classes.
Model Interpretation: Odds and Log-Odds
Logistic regression looks at the output probability (let's call it p) through the lens of odds and log odds:
- Odds are simply the ratio of the probability that an event occurs compared to the probability that it doesn't:
!image.png)
- If you take the natural logarithm of the odds, you will get the log-odds (or logit), allowing you to form a linear equation:
!image.png)
- (beta)o(The Intercept) represents the log odds when all predictors (xi) are set to zero.
- (beta)n(The coefficient of feature n) represents how much the log odds change when you increase the predictor xn by one unit while keeping the other variables constant.
You can think of odds as the exponential transformation of log odds:
- When the odds are greater than 1, the event is more likely to occur.
- Odds less than 1 indicate that the event is less likely to occur.
Interpreting Logistic Regression Coefficients: Mean Radius and Odds Ratio in Breast Cancer Prediction
In the following code, we have trained a logistic regression model on Scikit-learn’s breast cancer dataset and interpreted the coefficient for the mean radius feature. Next, we computed the odds ratio to measure the effect of each unit increase in mean radius on the probability that a tumor would be classified as malignant.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import numpy as np
dataset = load_breast_cancer()
X, y = dataset.data, dataset.target
model = LogisticRegression().fit(X, y)
# Let's consider the first feature: mean radius
coef = model.coef_[0][0]
oddsratio = np.exp(coef)
print(f"Coeff for mean radius: {coef:.2f}")
print(f"Odds ratio for mean radius: {oddsratio:.2f}")
#Coeff for mean radius: 1.33
#Odds ratio for mean radius: 3.77
The results show that the mean radius coefficient is 1.33. This means that for every unit increase in the mean radius, the odds of being malignant increase by 1.33. An odds ratio of 3.77(The exponential of the coefficient) indicates that as the mean radius increases by one unit, the odds of malignancy nearly triple, specific to about 3.77 times.
These positions mean radius as a key predictive variable in the model. Analyzing these values can assist healthcare professionals in making informed medical decisions while analyzing feature importance.
Key Insights of Logistic Regression
- It keeps the outputs between 0 and 1, ideal for probability classification tasks.
- It captures the relationship between predictors and the log odds of an event instead of looking at the probabilities directly.
- By exponentiating the coefficients, we can better understand how different features affect the probability of an event occurring.
Cost Function (Log Loss)
Unlike linear regression, which focuses on minimizing the mean squared error, logistic regression has its own training method. It aims to minimize a cost function(log loss or binary cross-entropy). This function evaluates how accurately the model's predicted probabilities match the class labels. It rewards accurate predictions with high confidence and penalizes incorrect ones. The log loss is defined as:
where:
- *m* stands for the number of training samples we’re dealing with.
- yi represents the true label for the ith sample, which can be 0 or 1.
- *yi* refers to the predicted probability of the positive class for that same sample.
This loss function penalizes confident but wrong predictions, encouraging the model to provide well-calibrated probability estimates. Using optimization techniques like gradient descent to minimize the log loss, we end up with the parameters β that best fit the data.
Logistic Regression vs. Linear Regression
At first glance, logistic regression might seem pretty similar to linear regression, but they serve different purposes:
- Linear regression predicts continuous values, like home prices or stock market trends. It achieves this by using a linear combination of input features to output a real number.
- Logistic regression predicts discrete outcomes, such as whether an email is spam. Instead of just outputting a real number, it first computes log odds and then uses the logistic function to turn that result into a probability between 0 and 1.
Check out our guide on multiple linear regression in Python to learn more about regression techniques. This tutorial focuses on implementing multiple linear regression in Python and covers important topics like data preprocessing, evaluation metrics, and optimizing performance.
Logistic Regression Scikit-learn Example (Binary Classification)
The following Python logistic regression example uses the Breast Cancer Wisconsin dataset, a standard resource built within Scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# dataset loading
dataset = load_breast_cancer()
X, y = dataset.data, dataset.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=40
)
# Logistic Regression model initialization
model = LogisticRegression(
penalty='l2',
C=2.0,
solver='liblinear',
max_iter=1000
)
# model training
model.fit(X_train, y_train)
# prédictions
y_pred = model.predict(X_test)
# model evaluation
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
The script above displays a straightforward machine-learning pipeline using Scikit-learn:
- Importing the libraries and loading the Breast Cancer dataset.
- Splitting the data into features and labels.
- We separate it into training and testing sets, where we set a Logistic regression model with a few settings (like *L2* penalty, *C=2.0*, using '*liblinear*' as the solver, and capping the iterations at 1000).
- After training the model, we make some predictions on the test data.
- Finally, we check how well the model performs by computing the accuracy score to see how effectively it classifies unseen data.
When dealing with imbalanced datasets, you should consider using advanced evaluation metrics, including precision, recall, and F1-score. To explore these evaluation metrics, refer to our guide on deep learning metrics. Although these metrics are described for deep learning purposes, their explanations can be applied to logistic regression.
In real-world projects, you’ll often encounter tasks such as handling missing values and scaling. To understand how to normalize data in Python, look at our article on Normalizing Data in Python Using scikit-learn.
Scikit-learn Logistic Regression Parameters and Solvers
If working with LogisticRegression in Scikit-learn, knowing the right parameters can make a difference in the model performance. The table below displays some of the most important scikit-learn logistic regression parameters and the various solvers you can use:
| Parameter | Description |
|---|---|
| penalty | Defines the type of norm used for regularization. Options include L1, L2, elasticnet, and none. L1 promotes sparsity, while L2 stabilizes coefficients. |
| C | Represents the inverse of regularization strength. Smaller values increase regularization (simpler models), while larger values reduce it (more complex models). The default is 1.0. |
| solver | Algorithm used for optimization. Common solvers: liblinear: Supports L1/L2 penalties. lbfgs: Default solver for Scikit-learn version. sag(Stochastic Average Gradient) & saga(Stochastic Average Gradient Augmented): Variants of stochastic gradient descent. |
| max_iter | Sets the maximum number of iterations for convergence. Increasing it helps when models struggle to converge. |
| fit_intercept | This determines whether the model calculates the intercept. Setting it to False forces the intercept to 0, but It is generally recommended to be set to True. |
Understanding these parameters will help you customize the logistic regression model to fit the dataset and specific needs.
Penalty Types: Selecting the Right Regularization Approach
Scikit-learn provides three regularization techniques: *L1* (Lasso), *L2* (Ridge), and *ElasticNet:*
- *L1* regularization creates sparse models by setting some coefficients to zero. This makes it well-suited for feature selection tasks in datasets with high-dimensional data.
- *L2* regularization shrinks the coefficient towards zero without eliminating them, which leads to better stability and effective multicollinearity management.
- Elastic Net integrates *L1* and *L2* regularization through the *l1_ratio* parameter to achieve feature selection and coefficient stability, particularly in the presence of correlated features.
The proper penalty selection should align with your objectives: using *L1* for better interpretability and feature selection capabilities, *L2* for more stable predictions, and elastic net when both features are required.
Solver Selection: Optimization Algorithms for Different Scenarios
The solver parameter determines which optimization algorithm computes the maximum likelihood estimates for logistic regression. Various solvers display distinct computational properties and compatibility with different penalty types while demonstrating unique performance profiles when handling different dataset sizes.
liblinear Solver *liblinear* was the default solver in older versions of scikit-learn and continues to perform efficiently with smaller datasets. This solver allows for *L1* and *L2* regularization. It works with binary classification and can use the one-vs-rest strategy for multiclass problems.
Usage example:
from sklearn.linear_model import LogisticRegression
liblinear_m = LogisticRegression(solver='liblinear', penalty='l1', C=1.0)
lbfgs Solver Scikit-learn uses the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm as its default solver. The L-BFGS optimization algorithm belongs to the quasi-Newton family. It works by estimating the inverse Hessian matrix to find optimal parameters efficiently. Usage example:
lbfgs_model = LogisticRegression(solver='lbfgs', penalty='l2', C=1.0)
Multiclass problems with multinomial loss functions benefit from the *lbfgs* solver when combined with *L2* regularization. While this solver can complete its process in fewer iterations than other algorithms, it can also be memory-intensive for very large datasets.
saga Solver The SAGA algorithm delivers exceptional performance on large-scale data, particularly when elastic net regularization is used. The solver performs efficiently with L1 and L2 regularization. However, the computational resources required can vary depending on the problem's complexity. Usage example:
saga_model = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, C=1.0)
Summary
The following table displays the solver comparison:
| Solver | Regularization Supported | Best Use Case | Limitations |
|---|---|---|---|
| liblinear | L1, L2 | Sparse data; suitable for small datasets. | Inefficient for dense data or unscaled datasets; may struggle with large C values. |
| lbfgs | L2 | Medium to large datasets, especially dense ones. | Memory-intensive for very large data |
| saga | L1, L2, Elastic Net | Large-scale or high-dimensional problem | Performance depends on proper scaling; resource-intensive for some cases |
Tips for Solver Selection
- Choose *liblinear* for sparse data or binary classification tasks, especially when working with small datasets.
- Opt for *lbfgs* as your default solver for medium-sized datasets, particularly dense ones.
- Use *saga* when facing large-scale problems or when elastic net regularization is required.
Other Solvers in Scikit-learn
- sag: Large datasets with similarly scaled features work well with this solver. It supports only *L2* regularization. For the best results, it is recommended that feature scaling be applied for optimal convergence.
- newton-cg(Newton conjugate gradient): It can be used for multiclass classification problems since it supports multinomial loss functions. It may be slower than *lbfgs*.
- newton-cholesky: It represents an optimized variant of *newton-cg* designed for highly structured problems.
You can achieve efficient and accurate logistic regression model training by choosing a solver that matches the dataset size and regularization requirements.
Hyperparameter Tuning for Logistic Regression
Tuning hyperparameters like *C* (which controls regularization strength) and choosing the right *penalty* and *solver* can drastically influence performance.
Techniques for Hyperparameter Tuning
Let’s consider some techniques for hyperparameter tuning, such as grid search, randomized search, and grid search:
- Grid Search: Grid Search can help set up various parameters and thoroughly explore to find the best combination. It's pretty straightforward to implement, but if your grid or dataset is large, it can take more computational power.
- Randomized Search: Randomized search randomly samples parameter combinations within selected ranges. This often finds good solutions faster than a grid search. It can be implemented using RandomizedSearchCV.
- Bayesian Optimization: This method takes a more advanced approach, building a probabilistic model of the objective function and selecting new hyperparameter values to evaluate based on previous results.
Implementing Grid Search in Scikit-learn
For example, we will consider the following code:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
# dataset loading
dataset = load_breast_cancer()
X, y = dataset.data, dataset.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=40
)
# Parameter grid to explore
param_grid = {
'C': [0.01, 0.1, 1, 10, 100],
'penalty': ['l2'],
'solver': ['lbfgs', 'saga']
}
log_r = LogisticRegression(max_iter=400)
grid_s = GridSearchCV(
estimator=log_r,
param_grid=param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_s.fit(X_train, y_train)
print("Best Parameters:", grid_s.best_params_)
print("Best Score:", grid_s.best_score_)
The script above imports the Breast Cancer dataset and splits it into training and testing sets before tuning the logistic regression model's hyperparameters using GridSearchCV. This process tests various regularization parameter *C* values using *lbfgs* and *saga* solvers with an *L2* penalty. This configuration allows the model to execute for a maximum of 400 iterations.
You may see the warning: "ConvergenceWarning: lbfgs failed to converge." This indicates that the *lbfgs* solver fails to converge within the allocated iteration limit using some combinations of parameters. To fix this issue, increase *max_iter* or adjust the solver and *C* values.
Additionally, you must understand how different parameters work when building the parameter grid. Not all solvers support all penalty types, and some combinations, such as *penalty='elasticnet*' with *solver='lbfgs,'* will result in errors.
The C Parameter: Fine-Tuning Regularization Strength
When *C* is small (like 0.001), the model prioritizes simplicity instead of trying to fit the training data perfectly. This can reduce overfitting, but it might also lead to underfitting. On the other hand, when *C* is quite large (like 100), the model aims to reduce training errors, which might lead to overfitting but can capture more complex patterns in the data.
A systematic approach to tuning C involves:
- Start by exploring on a logarithmic scale (0.001, 0.01, 0.1, 1, 10, 100, 1000.)
- Finds the range where the performance hits its peak.
- Perform a more granular search within the optimal range identified.
Dataset characteristics such as feature count, sample size, and noise level critically influence the optimal *C* value. Datasets with substantial noise require stronger regularization, which can be achieved using lower *C* values.
Practical Tips for Hyperparameter Tuning
This table provides straightforward tips for applying *GridSearchCV* with Logistic Regression. They will help to improve the hyperparameter tuning results and boost the model's performance.
| Tip | Description |
|---|---|
| Use a Small C Range and Simple Penalties | Start with a small set of values for *C* (like 0.01, 0.1, 1, 10) and use penalties like '*l1*' or '*l2*' to keep your first tests simple. |
| Choose the Right Solver | Make sure the solver you choose fits your dataset and penalty. For instance, '*liblinear*' works with *L1* and *L2* but can be slow on larger datasets. On the other hand, *'lbfgs,'* '*saga*,' and '*newton-cg*' are better suited for handling larger data. |
| Handle Convergence Warnings | If you get warnings about the solver not converging, increase the *max_iter* or adjust your solver and *C* values. |
| Standardize Features | Logistic Regression is sensitive to feature magnitude, so apply standardization (e.g., StandardScaler) in a pipeline to help the optimizer converge efficiently. |
| Choose Suitable CV Folds | Depending on your dataset size, use 5- or 10-fold cross-validation. More folds generally provide better hyperparameter estimates and reduce overfitting risk. |
| Handle Imbalanced Data | If the data is imbalanced, consider setting *class_weight='balanced*' or defining custom weights to improve the minority class performance. |
| Use Multiple Metrics | Avoid relying solely on accuracy. Use GridSearchCV's *scoring* feature to track other metrics, such as F1, precision, or recall, especially when working with imbalanced datasets. |
| Inspect Learning Curves | After finding optimal parameters, check out learning or validation curves to ensure the model generalizes and isn’t too simplistic or complex. |
Multiclass Logistic Regression in Scikit-Learn
Although logistic regression is mainly designed for binary outcomes, Scikit-learn provides a way to apply it to multiclass scenarios through two main approaches:
- One-vs-Rest (OvR)
- Multinomial Logistic Regression
One-vs-Rest (OvR) Strategy
The One-vs-Rest (OvR) technique transforms an n-class scenario into n individual binary classification problems.
How It Works:
- A distinct classifier is trained for each class.
- Each classifier learns to separate one class (the positive) from the others combined (the negative).
- When making predictions, all classifiers evaluate the sample, and the one with the highest confidence score is chosen.
To set up the OvR strategy using Scikit-learn, use the following configuration:
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=200)
Advantages of OvR:
- Easy to implement with straightforward binary classifiers.
- Efficient in terms of computation.
- Works smoothly with all Scikit-learn solvers.
Limitations of OvR:
- The OvR method can lead to situations where multiple classifiers predict the same instance as positive, which can make determining the final class assignments confusing.
- Doesn’t take into account the potential correlation between classes.
- Might not perform well when classes are unbalanced.
Multinomial Strategy
Multinomial logistic regression (Softmax regression) extends binary logistic regression to handle all classes simultaneously.
How It Works:
- Instead of training separate models for each binary classification, we focus on training just one model.
- The softmax function comes into play, turning raw scores into probabilities across all classes.
- Probabilities sum to 1, ensuring a valid probability distribution.
- Finally, we select the class that has the highest probability.
To implement this approach in Scikit-learn, use the following configuration:
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=400)
Advantages of Multinomial Logistic Regression:
- Models class relationships jointly rather than independently.
- It provides better-calibrated probability estimates overall.
- Tends to perform better when classes overlap.
Limitations of Multinomial Logistic Regression:
- It requires compatible solvers like '*lbfgs,*' '*newton-cg,*' or *'saga.*'
- It's more computationally intensive.
Choosing Between OvR and Multinomial
Choosing the right multiclass strategy depends on a few key factors:
- Dataset size: If you have a smaller dataset, One-vs-Rest might be a better choice due to its simpler model.
- Class relationships: The multinomial method often performs better when classes have strong relationships or tend to overlap.
- Computational resources: For many classes, OvR can be a more resource-efficient option.
- Probability calibration: multinomial can provide better-calibrated probabilities across different classes.
Comparison with Other Classification Models
There are many classification models besides logistic regression, each with strengths and weaknesses. In the following table, we will consider some of them:
| Classification Model | Pros | Cons |
|---|---|---|
| Decision Trees | Simple to interpret, handles non-linear relationships, no need to normalize data | Prone to overfitting if not pruned or regularized |
| Support Vector Machines (SVMs) | Handles complex, high-dimensional data and supports different kernel functions for higher-dimensional mapping | Complex parameter tuning (C, kernel settings), slow on large datasets |
| Random Forests | Reduces overfitting via multiple decision trees, high predictive performance | Less interpretable than logistic regression or single decision trees, slow on large datasets |
| Logistic Regression | Interpretable, suitable for small to medium datasets with linear decision boundaries, provides well-calibrated probabilities | Limited to linear decision boundaries, not effective for highly non-linear problems |
When to Use Logistic Regression?
- When interpretability remains the main priority(For example, in sectors like healthcare and finance).
- For datasets of small to medium size with linear decision boundaries.
- If you need well-calibrated probabilities.
Random forests or neural networks can be a better option when your data exhibits high non-linearity or requires high accuracy without concern for model interpretability.
FAQ SECTION
How Does Logistic Regression Work in Scikit-Learn? Scikit-learn uses algorithms such as *lbfgs* or *liblinear* to determine coefficients that minimize the logistic loss function. To train a logistic regression model, use the *.fit(X, y)* function and let scikit-learn handle the remaining processes automatically.
When Should I Use Logistic Regression Instead of Other Classification Models? Logistic regression is a good starting point when:
- You need a quick and simple classifier.
- You need to interpret the coefficients (log odds).
- Your data tends to follow linear boundaries in the feature space.
- You have a moderate dataset size.
However, if the data is large and highly non-linear, or if you aim for top-tier accuracy instead of simplicity, you can consider advanced models like SVMs or neural networks.
What Is the Best Solver for Logistic Regression?
The right method depends on your data's dimensions and nature. *liblinear* and *lbfgs* perform well with datasets from small to medium sizes. When working with large or sparse datasets, *saga* stands out because it effectively handles *L1* and *L2* regularization.
How Do I Interpret Coefficients in Logistic Regression?
The coefficients in logistic regression models show how a unit increase in a feature affects the log-odds outcome. A positive coefficient means that increasing this feature increases the likelihood of a positive outcome, while a negative one suggests the opposite.
Can Logistic Regression Handle Non-Linear Relationships?
Not inherently. Logistic regression requires the relationship between features and log odds to be linear. However, you can generate polynomial features or interaction terms before model input, which allows the model to detect non-linear patterns. Advanced models such as neural networks, random forests, or SVM with kernels enable direct modeling of non-linear relationships.
Conclusion
The classification model logistic regression stands out as a foundational tool because it provides straightforward implementation and interpretability. It also performs well with linearly separable data. Logistic regression is useful in fields such as healthcare and finance because of the well-calibrated probabilities that deliver the necessary transparency for decision-making. Despite the higher accuracy of random forests and neural networks for certain tasks, logistic regression remains a baseline model for understanding feature importance and decision boundaries.
Machine learning techniques enable researchers to optimize logistic regression through hyperparameter tuning, regularization, and feature engineering techniques. The versatile design of Scikit-learn enables users to test various solvers and multiclass strategies to achieve optimal results. Logistic regression provides essential value in machine learning applications, whether applied to binary classification tasks or adapted for multiclass problems. This bridges the gap between simplicity and effectiveness.
References and resources
- How To Train A Logistic Regression Using Scikit-Learn (Python)
- Multiclass Logistic Regression
- LogisticRegression
- Tuning the hyper-parameters of an estimator
- Decision Boundaries of Multinomial and One-vs-Rest Logistic Regression
- Sklearn Logistic Regression hyperparameter optimization
- Linear Models
- What is the difference between "newton-cg" and "newton-cholesky" solvers in sklearn LogisticRegression?
- How To Solve Logistic Regression Not Converging in Scikit-Learn