Understanding the ins and outs of logistic regression is non-trivial. Many sources cover either only the theoretical side or only the implementation side. In this post, I would like to create a one-stop shop for the theoretical basis and the practical implementation of logistic regression.
Imports Link to heading
from functools import partial
from IPython import get_ipython
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas.io.formats.style
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import warnings
%matplotlib inline
pd.set_option("display.precision", 2)
warnings.filterwarnings('ignore')
Some Formatting Link to heading
def add_raw_tag(df):
    # wrap the generated table HTML in raw-HTML shortcode markers so the blog engine renders it verbatim
    return "{{< rawhtml >}}\n" + df.to_html(classes="nb-table") + "\n{{< /rawhtml >}}\n"
html_formatter = get_ipython().display_formatter.formatters['text/html']
html_formatter.for_type(
pd.DataFrame,
lambda df: add_raw_tag(df)
)
html_formatter.for_type(
pd.io.formats.style.Styler,
lambda df: add_raw_tag(df)
)
Our Dataset Link to heading
We will use the well-known iris dataset as the basis for our discussion. It is simple enough that we can still apply some intuition while trying to understand the logistic regression. When loading the dataset from sklearn, we get 150 observations of measurements of some of the plants’ features. Botanists have classified the irises into three subtypes with fancy Latin names, but since I am no botanist, we will just pretend that - based on those measurements - we can determine whether the flower blooms red, green or blue. The measurements end up in the set of observations $X$, with each observation consisting of 4 measurements, and the classification ends up in the vector $y$, with entries being 0 = red, 1 = green, 2 = blue.
X, y = load_iris(return_X_y=True)
columns = ('Measurement 1', 'Measurement 2', 'Measurement 3', 'Measurement 4')
dataset = pd.DataFrame(X, columns=columns)
dataset["True Classification"] = y
dataset # Abbreviated output
| | Measurement 1 | Measurement 2 | Measurement 3 | Measurement 4 | True Classification |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
colors = ['red', 'green', 'blue']
for X_el, y_el in zip(X, y):
    plt.plot(X_el, colors[y_el], alpha=0.1)
plt.xticks(np.arange(4), columns)
plt.show()
We can clearly see that measurement 3 carries a lot of information. It separates neatly between the red flowers and the rest, and it even seems to be a good guide to separate the greens from the blues. Same goes for measurement 4, while 1 and 2 seem to give no (or very little) hint as to which class a certain specimen belongs to.
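To back up this visual impression with numbers, we can, for example, look at the per-class means of the measurements (a small optional check on the dataset DataFrame from above, not part of the original analysis):
# mean of each measurement per true class; measurements 3 and 4 separate the
# classes much more clearly than measurements 1 and 2
dataset.groupby("True Classification").mean()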
⚠️ Starting from version 0.22, scikit-learn uses a different default algorithm here. The following computations are performed using the ‘ovr’ value for the multi_class parameter. However, the ‘multinomial’ variant produces better results.
clf = LogisticRegression(random_state=42, multi_class="ovr").fit(X, y)
After having trained the model, the “predict” function allows us to get a prediction for each input tuple. The classes are numbered, not named, but can of course be converted to names using the appropriate mapping.
# 0 = red, 1 = green, 2 = blue
clf.predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
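The classes here are just numbers; if we prefer the color names, a simple lookup into the colors list defined for the plot above does the trick (a small convenience snippet added for illustration):
# map the numeric class labels of the first few predictions to their color names
[colors[label] for label in clf.predict(X)[:5]]  # -> ['red', 'red', 'red', 'red', 'red']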
Besides a simple class prediction, we can also get the probability for each class for each input data sample.
clf.predict_proba(X)[0:5] # abbreviated output
array([[8.96807569e-01, 1.03191359e-01, 1.07219602e-06],
[7.78979389e-01, 2.21019299e-01, 1.31168933e-06],
[8.34864184e-01, 1.65134802e-01, 1.01485082e-06],
[7.90001986e-01, 2.09996107e-01, 1.90723705e-06],
[9.12050403e-01, 8.79485212e-02, 1.07537143e-06]])
For the first example above, we assign a roughly 90% chance of it belonging to class ‘0’, a 10% chance of it belonging to class ‘1’ and an (almost) 0% chance of it belonging to class ‘2’. Using the argmax function, we could then map the probabilities to the exact outcome of the predict function above.
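To verify that claim, we can compare the argmax of the probabilities with the output of predict directly (a quick sanity check, not part of the original post):
# the argmax over the class probabilities reproduces clf.predict for every sample
probs = clf.predict_proba(X)
np.array_equal(np.argmax(probs, axis=1), clf.predict(X))  # expected: True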
So, for this first part, we have examined the dataset and gotten an idea of how well it will allow us to separate the classes, based on the given features. We have also generated the output that the sklearn logistic regression provides after training on the complete dataset, and seen how to interpret it.
Evaluating the Model Link to heading
A common metric to evaluate the quality of predictions is the ‘score’, which - according to the sklearn help - is:
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
clf.score(X, y)
0.9533333333333334
Basically, that means we count the correctly classified labels as a fraction of all labels, like so:
correct = []
for y_true, y_pred in zip(y, clf.predict(X)):
    if y_true == y_pred:
        correct.append(1)
    else:
        correct.append(0)
sum(correct) / len(correct)
0.9533333333333334
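The same number can be obtained more concisely, e.g. with numpy or with sklearn’s accuracy_score (equivalent alternatives, added here for reference):
from sklearn.metrics import accuracy_score
# both lines are equivalent to the manual loop above
print(np.mean(clf.predict(X) == y))
print(accuracy_score(y, clf.predict(X)))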
Given that we used the complete dataset for training, that is just okay. A more detailed approach to evaluating the quality of our model is based on the classification report.
from sklearn.metrics import classification_report
print(classification_report(y_pred=clf.predict(X), y_true=y))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.96      0.90      0.93        50
           2       0.91      0.96      0.93        50

    accuracy                           0.95       150
   macro avg       0.95      0.95      0.95       150
weighted avg       0.95      0.95      0.95       150
While the class “0 = red” has perfect precision and recall, classes “1 = green” and “2 = blue” somewhat overlap in their features, so the algorithm mixes them up. This is shown by the recall of 0.90 for class “1”, which means only 90% of all green flowers were classified as green, and by the precision of 0.91 for class “2”, which means only 91% of all flowers classified as blue were actually blue.
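To see exactly which classes get mixed up, we can also compute a confusion matrix (an additional check; sklearn provides this out of the box):
from sklearn.metrics import confusion_matrix
# rows = true class, columns = predicted class (0 = red, 1 = green, 2 = blue);
# the report above suggests 5 green flowers predicted as blue and 2 blue ones predicted as green
confusion_matrix(y, clf.predict(X))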
y_true_with_jitter = y + np.random.rand(y.shape[0]) * 0.25
y_classified_with_jitter = clf.predict(X) + np.random.rand(y.shape[0]) * 0.25
plt.xticks(np.arange(3), ('true red', 'true green', 'true blue'))
plt.yticks(np.arange(3), ('class. red', 'class. green', 'class. blue'))
plt.scatter(y_true_with_jitter,
y_classified_with_jitter,
color=[colors[y_el] for y_el in y])
plt.show()
Is it a good model? On the one hand, the bare values of our statistics are really quite good; on the other, we have a tiny dataset and we trained on the complete dataset, as we did not keep a holdout dataset for testing. Since we do not pay too much attention to the actual result of the flowers’ colors being predicted correctly and only want to understand how to arrive at the predictions, we will give it a pass. Just make a mental note never to consider this a good procedure in a real-world application.
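For completeness, a proper evaluation would keep a holdout set, e.g. with train_test_split; a minimal sketch (the split ratio and the clf_holdout name are arbitrary choices of mine, and the resulting score will differ from the numbers above):
from sklearn.model_selection import train_test_split
# hold out 30% of the data and score only on samples the model has never seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
clf_holdout = LogisticRegression(random_state=42, multi_class="ovr").fit(X_train, y_train)
clf_holdout.score(X_test, y_test)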
A Look at the Inner Workings of the sklearn Logistic Regression Link to heading
When looking at the variables of the logistic regression classifier after training, we find three sets of coefficients and three different intercepts. That is because logistic regression is essentially binary, i.e. it only does a yes/no or 1/0 classification. If we have $n > 2$ classes, we need to split this problem into $n$ separate “1 vs. rest” classification problems. Each set of coefficients and each intercept belongs to one of these sub-classifications.
clf.__dict__["coef_"] # just running clf.__dict__ spits out all the info about the trained model
array([[-0.44501376,  0.89999242, -2.32353827, -0.97345836],
       [-0.1792787 , -2.12866718,  0.69665417, -1.27480129],
       [-0.39444787, -0.5133412 ,  2.93087523,  2.41709879]])
clf.__dict__["intercept_"]
array([ 6.69040651, 5.58615272, -14.43121671])
If we now feed a set of features $x^i$ into the trained classifier, we can calculate the probabilities of $x^i$ belonging to a class vs. not belonging to that class via:
$$p(x^i)=\frac{1}{1+e^{-(\beta_{0} + \beta_{1}x^{i}_{1} + \beta_{2}x^{i}_{2} + \beta_{3}x^{i}_{3} + \beta_{4}x^{i}_{4})}}$$
where $\beta_{0}$ is an intercept and $\beta_{1}..\beta_{4}$ are the coefficients for each entry in the feature vector $x^i = (x^{i}_1,x^{i}_2,x^{i}_3,x^{i}_4)$. We will later explore why this term is the correct one. Let us calculate the above for our very first observation in the dataset:
# @ is shorthand for matrix multiplication
p_0 = 1 / (1 + np.exp(-(clf.intercept_[0] + X[0] @ clf.coef_[0])))
p_1 = 1 / (1 + np.exp(-(clf.intercept_[1] + X[0] @ clf.coef_[1])))
p_2 = 1 / (1 + np.exp(-(clf.intercept_[2] + X[0] @ clf.coef_[2])))
print('p_0 =', p_0)
print('p_1 =', p_1)
print('p_2 =', p_2)
p_0 = 0.9840648521799207
p_1 = 0.11323163760159996
p_2 = 1.1765181958016876e-06
With this calculation, we have now determined that our $x^0$ has a $0.98$ chance of belonging to class “0 = red” vs. any of the other classes - either green or blue. As we can see though, these three probabilities do not add up to 100% and why should they? These probabilities belong to three mathematically independent problems:
- Does $x^i$ belong to class 0 vs. not to class 0?
- Does $x^i$ belong to class 1 vs. not to class 1?
- Does $x^i$ belong to class 2 vs. not to class 2?
What happens, if we linearly scale those probabilities though so that they add up to 1?
p_sum = p_0 + p_1 + p_2
print("p_0_scaled =", p_0 / p_sum)
print("p_1_scaled =", p_1 / p_sum)
print("p_2_scaled =", p_2 / p_sum)
p_0_scaled = 0.896807568632095
p_1_scaled = 0.10319135917188017
p_2_scaled = 1.0721960247752252e-06
We have seen these exact numbers before and can make our choice for a prediction using argmax:
clf.predict_proba(X)[0]
array([8.96807569e-01, 1.03191359e-01, 1.07219602e-06])
np.argmax(clf.predict_proba(X)[0]) # 0 = red, 1 = green, 2 = blue
0
Now we have an understanding of how to interpret the data generated by sklearn’s training process, and we have looked beyond the clf.predict function to understand how the predictions are picked in the trained model.
The Mathematical Background Link to heading
We used the formula
$$p(x^i)=\frac{1}{1+e^{-(\beta_{0} + \beta_{1}x_{1}^{i} + \beta_{2}x_{2}^{i} + \beta_{3}x_{3}^{i} + \beta_{4}x_{4}^{i})}}$$
above, which is technically a choice, not a mathematical necessity (there are, in fact, other link functions that work as well), but why does this one make sense?
In order to understand that, we first need to understand odds and their relationship with probabilities. If an event has a 50% chance of occurring, the odds of it happening are 1:1 (1 time it happens, 1 time it does not). If an event has a 33.3% chance of happening, the odds are 1:2 (1 time it happens, 2 times it does not), 25% corresponds to odds of 1:3, and 20% to odds of 1:4. Or as a general formula:
$$ \text{Odds} = \frac{p}{1-p} $$
e.g. for $p = 0.25$ that evaluates to $ \text{Odds} = \frac{0.25}{1-0.25} = 0.333… $, i.e. for every 3 times the event under scrutiny does not happen, there will be 1 time where it happens or 1 success : 3 failures.
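In code, the conversion in both directions is a one-liner each (a small illustration of the formula; the helper names are mine):
def prob_to_odds(p):
    # odds = p / (1 - p), e.g. 0.25 -> 0.333...
    return p / (1 - p)

def odds_to_prob(odds):
    # the inverse transformation: p = odds / (1 + odds)
    return odds / (1 + odds)

print(prob_to_odds(0.25))   # 0.333...
print(odds_to_prob(1 / 3))  # 0.25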
Substituting the $p$ in the odds formula with the $p(x^i)$ from above, we get: $$ \text{Odds} = e^{\beta_{0} + \beta_{1} x_{1}^{i} + \beta_{2} x_{2}^{i} + \beta_{3} x_{3}^{i} + \beta_{4} x_{4}^{i}} $$
or
$$ log(Odds) = \beta_{0} + \beta_{1}x_{1}^{i} + \beta_{2}x_{2}^{i} + \beta_{3}x_{3}^{i} + \beta_{4}x_{4}^{i} $$
Thereby we have created a link to our feature space $X$ and we can map any observation of features to a probability $ p \in (0,1)$. All we need then is a somewhat arbitrary cutoff rule, usually $p>.5$.
Furthermore, we get interpretability for free: the coefficient $\beta_i$ describes the change in the log odds when we increase $x_i$ by one unit; equivalently, a one-unit increase multiplies the odds by $e^{\beta_i}$. Look again at the spaghetti diagram with the colors above. The greater the value of measurement 3, the smaller the chance we have a red specimen at hand. In fact, the $\beta$ coefficient of measurement 3 in our “red vs. rest” problem is roughly $-2.32$, which means that a unit increase of measurement 3 multiplies the odds of the specimen being red by $\exp(-2.32) \approx 0.098$, very roughly 1 in 10.
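We can read these odds multipliers directly off the trained classifier; a small illustrative snippet (the values are just the exponentiated coefficients shown earlier):
# odds multipliers per unit increase of each measurement for the "red vs. rest" problem;
# the third entry comes out at roughly exp(-2.32), i.e. about 0.1
np.exp(clf.coef_[0])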
Let us draw this function $p(x)$ to see what it looks like:
import matplotlib.pyplot as plt
def map_to_p(log_odds):
    odds = np.exp(log_odds)
    p = odds / (1 + odds)
    return p
lots_of_log_odds_for_drawing = np.linspace(-10, 10, num=1000)
mapped_to_p = list(map(map_to_p, lots_of_log_odds_for_drawing))
plt.xlabel("log(Odds)")
plt.ylabel("p")
plt.plot(lots_of_log_odds_for_drawing, mapped_to_p);
This function is called the sigmoid function. In terms of our logistic regression model, it is the inverse of the so-called link function (the logit): it maps the linear predictor, the $\log(Odds)$ linear combination, to a response in $p \in (0,1)$.
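The logit itself takes us back from a probability to the log odds; a quick round-trip check (added here purely for illustration, with a helper name of my choosing):
def p_to_log_odds(p):
    # the logit, i.e. the inverse of map_to_p
    return np.log(p / (1 - p))

p_to_log_odds(map_to_p(1.5))  # recovers 1.5 up to floating point error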
Fitting the Parameters Link to heading
Fitting the parameters is a bit tricky, as we cannot employ least squares as in the linear case. Instead, we use maximum likelihood estimation (MLE), which here has to be solved numerically. Intuitively, we want a $\beta$ such that the linear combinations with the feature vectors $x$ land as far away from the middle of the sigmoid as possible: far to one side for “successes” and far to the other side for “failures”. “Success” means being a member of the class $i$ under consideration, and “failure” means not being a member of that class.
Once we have found this $\beta$, we will call it $\hat\beta$.
Let’s have a look at the sigmoid again and use the parameters for the class “0 = red” from the sklearn logistic regression to distribute our $x$s:
colors = ['red', 'green', 'blue']
# this is the result of b_0 + b_1 * x_1 + b_2 * x_2 + ...
# clf is still our trained classifier
log_odds_from_our_dataset = [clf.intercept_[0] + x @ clf.coef_[0] for x in X]
plt.xlabel("log(Odds)")
plt.ylabel("p")
plt.plot(lots_of_log_odds_for_drawing, mapped_to_p)
for x, y_cl in zip(log_odds_from_our_dataset, y):
    # plt.scatter(x, 0, s=10, c=colors[y_cl])
    plt.scatter(x, map_to_p(x), s=10, c=colors[y_cl])
plt.show()
We see now that all of the red dots, which represent the members of class $0$, fall on the right side of 0 (and therefore have $p > .5$), while the other two classes fall on the left side. So, basically, we need to choose $\hat\beta$ in such a way that the sigmoid function is as close to 0 for some $x$ and as close to 1 for some other $x$. This is a very awkward problem to solve. Luckily, we are working with the interval $p \in (0,1)$, which means we know that the maximum is 1. So, we can flip those $x$ that are supposed to result in a value of $p$ close to 0 by calculating $(1-p)$ instead. Now we have a maximization problem for all $x$ in our domain.
Also, a word on the intercept: while $\beta_1 .. \beta_4$ essentially “stretch” our $x$ out in such a way that there is as little overlap between the groups as possible, $\beta_0$ shifts the whole set of dots so that they can be nicely centered around 0. We actually do not need to give the intercept special treatment, as we can just augment our feature vector $x$ with a static 1, like so: $x^i = (1, x^{i}_{1}, x^{i}_{2}, x^{i}_{3}, x^{i}_{4})$. Using these augmented vectors in the following steps will simplify things a lot.
Furthermore, as we want to maximize each individual $p$ and $1-p$ respectively, we can equivalently maximize the product of all of them: if every factor is as close to 1 as possible, the whole product will be as close to 1 as possible.
In our input data, the class a certain specimen belongs to is denoted by $y_i \in \{0,1,2\}$, with these numbers representing red, green and blue respectively. If we want to translate that into a binary problem, we need a $y_{binary, i} \in \{0,1\}$, where e.g. for the first classification “red vs. non-red” we denote a success with “1” (flower is red) vs. “0” (flower is not red). I will omit the “binary” for brevity’s sake, but please make a big mental note that the $y_i$s we are working with from now on are not the same ones as above any more.
In mathematical terms, we want to find our estimator $\hat\beta$ that maximizes the likelihood function:
$$ l(\beta) = \prod_{x_i; y_i = 1} p(x_i) \times \prod_{x_i; y_i = 0} (1-p(x_i)) $$
or:
$$ \hat\beta = \arg\max_{\beta} l(\beta) $$
Which can then be simplified as follows:
$$ l(\beta) = \prod_{i} p(x_i)^{y_i} (1-p(x_i))^{1-y_i}$$
We need to introduce the next concept now: in order to tackle this maximization problem, we use the fact that $\log(a)$ is monotonically increasing in $a$, so maximizing $\log(a)$ is equivalent to maximizing $a$, therefore:
$$ \hat\beta = \arg\max_{\beta} l(\beta) \iff \hat\beta = \arg\max_{\beta} \log l(\beta) $$
and using this property, we can transform the multiplication in $l(\beta)$ to a summation in $\log l(\beta)$ or $ll(\beta)$ for short.
$$ ll(\beta) = \sum_{i} y_i \log(p(x^i)) + (1-y_i)\log(1-p(x^i)) $$
We will now simplify this equation further and we will start with the log probability in the first term (blue highlighting will become clear further down the line):
$$ y_i\log(p(x^i)) = y_i\log\frac{1}{1+e^{-\beta x^i}} = \color{blue}{-y_i\log(1+e^{-\beta x^i})} $$
While that was fairly simple, the second term is a bit more challenging. For now, we will omit the term $(1-y_i)$ and focus on the log inverse probability in the second term:
$$ \log(1-p(x^i)) = \log(1-\frac{1}{1+e^{-\beta x^i}}) = \log(\frac{e^{-\beta x^i}}{1+e^{-\beta x^i}})$$
To proceed, we have to make a choice about what to do with the term in the brackets: we can either take the $e^{-\beta x^i}$ in the numerator and bring it down into the denominator, or use the log rule for fractions to separate the fraction into a difference of two fraction-less logarithms. It turns out we need to do both in order to get to the simplest possible form of the whole combined equation, and after distributing the terms from the $(1-y_i)$, we need to treat each with a different strategy. For the $1$ term (note how the minus sign in the exponent of the denominator disappears along the way):
$$ 1 \log(\frac{e^{-\beta x^i}}{1+e^{-\beta x^i}}) = \log(\frac{1}{e^{\beta x^i}(1+e^{-\beta x^i})}) = \log(\frac{1}{e^{\beta x^i}+1}) = \color{green}{ -\log(e^{\beta x^i}+1)} $$
and for the $-y_i$ term:
$$ -y_i \log(\frac{e^{-\beta x^i}}{1+e^{-\beta x^i}}) = -y_i (\log(e^{-\beta x^i}) - \log(1+e^{-\beta x^i})) $$
$$ = \color{green}{ -y_i(-\beta x^i)} \color{blue}{ + y_i \log(1+e^{-\beta x^i})} $$
When we now reassemble the puzzle pieces, the blue terms cancel each other out and the green terms are left
$$ ll(\beta) = \sum_{i} \color{blue}{-y_i\log(1+e^{-\beta x^i})} \color{green}{ -\log(e^{\beta x^i}+1)} \color{green}{ -y_i(-\beta x^i)} \color{blue}{ + y_i \log(1+e^{-\beta x^i})} $$
$$ = \sum_{i} y_i\beta x^i-\log(e^{\beta x^i}+1) $$
To jog our memory:
- $y_i \in \{1,0\}$, representing that an $x^i$ belongs to a certain class with 1 or not with 0
- $x^i \in \mathbb{R}^5$ with the first element set fixed to $1$, the feature vector of an observation
- $\beta \in \mathbb{R}^5$ the vector of coefficients, with the first entry representing the intercept
We have now managed to state our optimization problem in comparatively simple terms, as the only thing that is missing now is the $\beta$ that will maximize the last expression above, but all the other variables are clearly defined. We cannot compute the optimal $\beta$ algebraically though and have to rely on numerical methods.
A Naive Example Link to heading
In order to prepare for the next steps of actually fitting the $\beta$ coefficients, we need to translate the theoretical maths into Python code. We also need to translate the three-class problem of red, green and blue flowers into multiple binary problems like “flower is red vs. flower is not red”. First, the translation of the log-likelihood function:
def log_likelihood(x, y, beta):
    ll = 0
    for x_el, y_el in zip(x, y):
        # ll(beta) = sum_i y_i * beta.x_i - log(exp(beta.x_i) + 1)
        ll += y_el * (beta @ x_el) - np.log(np.exp(beta @ x_el) + 1)
    return ll
We split the three class problem into 3 binary sub-problems, so we need to modify the class vector in such a way that $y$ only has “1” entries for that single class we are testing for and “0” entries for the other two classes:
y_binary_red = [int(y_el == 0) for y_el in y]
y_binary_green = [int(y_el == 1) for y_el in y]
y_binary_blue = [int(y_el == 2) for y_el in y]
Furthermore, we need to add a “1” in the beginning of each feature vector $x$, in order to account for the intercept.
# make a vector with just "1"s and glue it to the left side of X
X_with_intercept = np.hstack((np.ones((X.shape[0], 1)), X))
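Before we start searching for a good $\beta$, a quick sanity check that the simplified log-likelihood really matches the direct form $\sum_i y_i \log(p(x^i)) + (1-y_i)\log(1-p(x^i))$; the test values for $\beta$ below are made up purely for this check:
# direct form of the log-likelihood, for comparison with the simplified log_likelihood above
def log_likelihood_direct(x, y, beta):
    ll = 0
    for x_el, y_el in zip(x, y):
        p = map_to_p(beta @ x_el)
        ll += y_el * np.log(p) + (1 - y_el) * np.log(1 - p)
    return ll

beta_test = np.array([1.0, -0.5, 0.25, -1.0, 0.5])  # arbitrary test values
print(log_likelihood(X_with_intercept, y_binary_red, beta_test))
print(log_likelihood_direct(X_with_intercept, y_binary_red, beta_test))  # should agree up to floating point error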
Now we leave the cosy realm of algebraic certainty and need to employ numerical methods to get our estimate $\hat\beta$. But before we do that, let us see what we want to accomplish in principle by using a brute-force algorithm. We start with a random $\beta$ - I cheated here as I already roughly know in which range to look for the values. Given this semi-random $\beta$, we calculate the log-likelihood function, which can assume values in $(-\infty, 0)$, as it is the $\log$ of a probability $p \in (0,1)$. If the random $\beta$ increases our likelihood, we keep it; otherwise we throw it out and choose another random $\beta$. In order to visualize the results, we program a little helper function first.
# define a plot function to visualize the result
def plot_separation(x, beta, y, color='red'):
    color = ['grey', color]
    for x_el, y_el in zip(x, y):
        log_odds = beta @ x_el
        plt.scatter(log_odds, map_to_p(log_odds), s=10, c=color[y_el])
    plt.plot(lots_of_log_odds_for_drawing, mapped_to_p)
    plt.show()
# start with an arbitrarily small log-likelihood as the baseline
ll_hat = -1000000
for step in range(10001):
    # choose a "random" beta vector 10,000 times
    # each entry will be between -3 and 3
    beta_random = 6 * np.random.random(5) - 3
    ll = log_likelihood(X_with_intercept, y_binary_red, beta_random)
    # if our log-likelihood has improved, overwrite the old beta and save the likelihood for further iterations
    if ll > ll_hat:
        beta_hat = beta_random
        ll_hat = ll
    # draw the result every 5000 steps
    if step % 5000 == 0:
        print("Step:", step, ", beta_hat:", beta_hat, ", ll_hat:", ll_hat)
        plot_separation(X_with_intercept, beta_hat, y_binary_red)
        print()
Step: 0 , beta_hat: [-0.06781811 0.82358851 -0.58925447 2.8207669 -1.35588074] , ll_hat: -1496.509430016339
Step: 5000 , beta_hat: [-2.85756301 0.74906527 2.64267461 -2.84641646 -2.83370022] , ll_hat: -0.6742826828096209
Step: 10000 , beta_hat: [-2.85756301 0.74906527 2.64267461 -2.84641646 -2.83370022] , ll_hat: -0.6742826828096209
As we can see, this neatly separates the red from the non-red dots onto the “1” and “0” sides of the sigmoid curve. To be honest, that only seems to work because the red specimens are somewhat neatly separated from the rest from the get-go. If you try the other colors, the results will not be that good.
But we have seen the general principle. By wildly choosing random $\beta$s and keeping the ones that increase the likelihood, we push the log odds of our “success” class as far to the right as possible, while we keep the “failures” on the left. Now the only step left is transitioning from random guessing to a more sophisticated process.
Fitting the Parameters with Gradient Descent Link to heading
As a next step, we will replace the brute-force method with a proper numerical optimization method like Gradient Descent or Newton-Raphson. Today, we are going to use the Gradient Descent method. In simple terms, we will move $\beta$ in small steps in the direction that increases the log-likelihood (equivalently, the direction that decreases the negative log-likelihood, hence “descent”). This direction is the gradient, which, as we will derive below, depends on the error: the true $y$ minus our calculated result for $y$ under the current guess for $\beta$. The true $y$ is simply our input data for $y$.
def gradient_descent(X, y, steps, learning_rate):
    beta = np.zeros(X.shape[1])
    for _ in range(1, steps):
        # calculate log odds for all x in X at once
        log_odds = np.dot(X, beta)
        # calculate result based on current beta
        tentative_y = list(map(map_to_p, log_odds))
        # calculate difference between current estimate and truth
        error = np.subtract(y, tentative_y)
        # see below for explanation
        gradient = np.dot(error, X)
        # move beta a small step along the gradient (ascent on the log-likelihood)
        beta += learning_rate * gradient
    return beta
So what is the gradient of $ll(\beta) $ with regards to $\beta$?
$$ \nabla_{\beta} ll(\beta) = \sum_{i} \nabla_{\beta} y_i\beta x_i-\nabla_{\beta}\log(e^{\beta x_i}+1) $$
$$ = \sum_{i} y_i x_i-\nabla_{\beta}\log(e^{\beta x_i}+1) $$
$$ = \sum_{i} y_i x_i - x_i e^{\beta x_i} \frac{1}{e^{\beta x_i} +1} $$
$$ = \sum_{i} y_i x_i - x_i \frac{1}{1 + e^{-\beta x_i}} $$
$$ = \sum_{i} y_i x_i - x_i p(x_i) = \sum_{i} (y_i -p(x_i)) x_i $$
This is nothing other than the true $y_i$ minus its calculated approximation $p(x_i)$, multiplied by the feature vector of each observation, or in matrix form:
$$ \nabla_{\beta} ll(\beta) = (y_{true} - y_{estimate}(\beta))X $$
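Since it is easy to get a sign wrong in such derivations, a quick finite-difference check against our log_likelihood function is reassuring (an added sanity check; beta_probe and the helper function are mine, everything else is reused from above):
def numerical_gradient(X, y, beta, eps=1e-6):
    # approximate the gradient of the log-likelihood by central differences
    grad = np.zeros_like(beta)
    for j in range(len(beta)):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (log_likelihood(X, y, beta + step) - log_likelihood(X, y, beta - step)) / (2 * eps)
    return grad

beta_probe = np.zeros(X_with_intercept.shape[1])
p_probe = np.array([map_to_p(x @ beta_probe) for x in X_with_intercept])
print((np.array(y_binary_red) - p_probe) @ X_with_intercept)           # analytic gradient
print(numerical_gradient(X_with_intercept, y_binary_red, beta_probe))  # numerical approximation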
To mix things up a bit, let us try using the gradient descent method to identify the blue specimen instead of the red ones:
beta_hat = gradient_descent(X_with_intercept,
y_binary_blue,
steps=100001,
learning_rate=10e-5)
plot_separation(X_with_intercept, beta_hat, y_binary_blue, color='blue')
And there we have a separation of the blue dots towards the positive real numbers and the rest towards the negative ones, with their respective probabilities going to 1 and 0. As we can see though, the separation does not work as well as in the “red vs. rest” problem.
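To put a number on that impression, we can apply the usual $p > 0.5$ cutoff to our hand-fitted “blue vs. rest” classifier and compute its accuracy (a quick check added here; the exact value is not shown in the original post):
# accuracy of the hand-fitted "blue vs. rest" classifier at a 0.5 cutoff
p_blue = np.array([map_to_p(x @ beta_hat) for x in X_with_intercept])
np.mean((p_blue > 0.5).astype(int) == np.array(y_binary_blue))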
However, if we try the same with the green specimens, it does not work very well at all. But that was somewhat expected, as we saw at the very beginning. We can take solace in the fact that the sklearn implementation does not fare very well here either, which can be seen in the confusion matrix; and if we plot the result for the sub-problem “green vs. rest”, we can barely differentiate between the green and the grey dots.
# parameters for each "class vs. rest" problem from the sklearn log-reg
beta_0 = np.hstack((clf.__dict__["intercept_"][0], clf.__dict__["coef_"][0])) # red
beta_1 = np.hstack((clf.__dict__["intercept_"][1], clf.__dict__["coef_"][1])) # green
beta_2 = np.hstack((clf.__dict__["intercept_"][2], clf.__dict__["coef_"][2])) # blue
plot_separation(X_with_intercept, beta_0, y_binary_red, color='red')
plot_separation(X_with_intercept, beta_1, y_binary_green, color='green')
plot_separation(X_with_intercept, beta_2, y_binary_blue, color='blue')
Finally, we can observe that while our solution is not as good as the version implemented in sklearn, it already provides results that are quite close. One difference between the two algorithms is that sklearn applies L2 regularization by default, i.e. its optimizer penalizes solutions with large coefficients.
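If we want sklearn’s solution to resemble our unpenalized maximum likelihood fit more closely, we can weaken that penalty by choosing a large value for the C parameter (a quick sketch; results not shown here, and the clf_weak_penalty name is mine):
# a very large C makes the default L2 penalty negligible, approximating an unpenalized fit
clf_weak_penalty = LogisticRegression(random_state=42, multi_class="ovr", C=1e6, max_iter=1000).fit(X, y)
clf_weak_penalty.coef_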
Sources: