
Logistic regression



Data Science in a Box


Predicting categorical data


Spam filters

  • Data from 3921 emails and 21 variables on them
  • Outcome: whether the email is spam or not
  • Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc.
library(openintro)
glimpse(email)
## Rows: 3,921
## Columns: 21
## $ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ to_multiple  <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, …
## $ from         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, …
## $ sent_email   <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …
## $ time         <dttm> 2012-01-01 01:16:41, 2012-01-01 02:03:59,…
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ dollar       <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ winner       <fct> no, no, no, no, no, no, no, no, no, no, no…
## $ inherit      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ password     <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_char     <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.09…
## $ line_breaks  <int> 202, 202, 192, 255, 29, 25, 193, 237, 69, …
## $ format       <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, …
## $ re_subj      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ urgent_subj  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 1…
## $ number       <fct> big, small, small, small, none, none, big,…

Would you expect longer or shorter emails to be spam?
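
The summary below can be produced with a quick grouped mean; a minimal sketch, assuming the tidyverse is loaded alongside openintro:

email %>%
  group_by(spam) %>%
  summarise(mean_num_char = mean(num_char))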

## # A tibble: 2 × 2
##   spam  mean_num_char
##   <fct>         <dbl>
## 1 0             11.3
## 2 1              5.44

Would you expect emails that have subjects starting with "Re:", "RE:", "re:", or "rE:" to be spam or not?

Modelling spam

  • Both number of characters and whether the message has "re:" in the subject might be related to whether the email is spam. How do we come up with a model that will let us explore this relationship?
  • For simplicity, we'll focus on the number of characters (num_char) as the predictor, but the model we describe can be expanded to take multiple predictors as well.

Modelling spam

This isn't something we can reasonably fit a linear model to -- we need something different!
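
One way to see the problem is to plot the 0/1 outcome against num_char and overlay a linear fit; a minimal sketch, assuming ggplot2 is loaded (the jittering and labels are guesses, the original slide's plot may differ):

ggplot(email, aes(x = num_char, y = as.numeric(spam) - 1)) +
  geom_jitter(height = 0.05, alpha = 0.2) +  # spam is a 0/1 factor; shift levels 1/2 down to 0/1
  geom_smooth(method = "lm") +
  labs(x = "Number of characters (in thousands)", y = "Spam (1) or not (0)")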


Framing the problem

  • We can treat each outcome (spam and not) as successes and failures arising from separate Bernoulli trials
    • Bernoulli trial: a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted
  • Each Bernoulli trial can have a separate probability of success

$$ y_i \sim \text{Bern}(p_i) $$

  • We can then use the predictor variables to model that probability of success, \(p_i\)
  • We can't just use a linear model for \(p_i\) (since \(p_i\) must be between 0 and 1) but we can transform the linear model to have the appropriate range

Generalized linear models

  • This is a very general way of addressing many problems in regression and the resulting models are called generalized linear models (GLMs)
  • Logistic regression is just one example

Three characteristics of GLMs

All GLMs have the following three characteristics:

  1. A probability distribution describing a generative model for the outcome variable
  2. A linear model: $$\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$$
  3. A link function that relates the linear model to the parameter of the outcome distribution
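
In R, a family object bundles the outcome distribution with its default link, which makes the third ingredient concrete; for example (base R, not from the slides):

binomial()$link
## [1] "logit"
poisson()$link
## [1] "log"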

Logistic regression


Logistic regression

  • Logistic regression is a GLM used to model a binary categorical outcome using numerical and categorical predictors
  • To finish specifying the Logistic model we just need to define a reasonable link function that connects \(\eta_i\) to \(p_i\): logit function
  • Logit function: For \(0 < p < 1\)

$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$$


Logit function, visualised
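
A minimal sketch to reproduce the plot on this slide, assuming ggplot2 (axis labels are guesses):

library(ggplot2)

# logit(p) = log(p / (1 - p)), evaluated over the open interval (0, 1)
p <- seq(0.001, 0.999, length.out = 500)
ggplot(data.frame(p = p, logit_p = log(p / (1 - p))), aes(x = p, y = logit_p)) +
  geom_line() +
  labs(x = "p", y = "logit(p)")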


Properties of the logit

  • The logit function takes a value between 0 and 1 and maps it to a value between \(-\infty\) and \(\infty\)
  • Inverse logit (logistic) function: $$g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}$$
  • The inverse logit function takes a value between \(-\infty\) and \(\infty\) and maps it to a value between 0 and 1
  • This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success -- more on this later
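
Both functions ship with base R: qlogis() is the logit and plogis() the inverse logit, so the round trip is easy to check:

qlogis(0.9)          # log(0.9 / 0.1), about 2.197
plogis(qlogis(0.9))  # inverse logit recovers 0.9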

The logistic regression model

  • Based on the three GLM criteria we have
    • \(y_i \sim \text{Bern}(p_i)\)
    • \(\eta_i = \beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}\)
    • \(\text{logit}(p_i) = \eta_i\)
  • From which we get

$$p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}$$


Modelling spam

In R we fit a GLM in the same way as a linear model except we

  • specify the model with logistic_reg()
  • use "glm" instead of "lm" as the engine
  • define family = "binomial" for the link function to be used in the model
spam_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(spam ~ num_char, data = email, family = "binomial")
tidy(spam_fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80      0.0716    -25.1  2.04e-139
## 2 num_char     -0.0621    0.00801    -7.75 9.50e- 15

Spam model

tidy(spam_fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80      0.0716    -25.1  2.04e-139
## 2 num_char     -0.0621    0.00801    -7.75 9.50e- 15

$$\log\left(\frac{p}{1-p}\right) = -1.80-0.0621\times \text{num\_char}$$
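
For reference, the same model can be fit with base R's glm() directly (an equivalent sketch, not shown on the slides; spam_glm is our name):

spam_glm <- glm(spam ~ num_char, data = email, family = binomial)
coef(spam_glm)  # matches the estimates above: about -1.80 and -0.0621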


P(spam) for an email with 2000 characters

Note that num_char is recorded in thousands of characters, so an email with 2000 characters has num_char = 2.

$$\log\left(\frac{p}{1-p}\right) = -1.80-0.0621\times 2$$

$$\frac{p}{1-p} = \exp(-1.9242) = 0.15 \rightarrow p = 0.15 \times (1 - p)$$

$$p = 0.15 - 0.15p \rightarrow 1.15p = 0.15$$

$$p = 0.15 / 1.15 = 0.13$$
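The algebra can be skipped with base R's built-in inverse logit; a quick check:

plogis(-1.80 - 0.0621 * 2)  # 1 / (1 + exp(1.9242)), about 0.13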

What is the probability that an email with 15000 characters is spam? What about an email with 40000 characters?

  • 2K chars: P(spam) = 0.13
  • 15K chars: P(spam) = 0.06
  • 40K chars: P(spam) = 0.01
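
These can be reproduced from the fitted model; a sketch assuming spam_fit from earlier:

predict(spam_fit, new_data = data.frame(num_char = c(2, 15, 40)), type = "prob")
# the .pred_1 column gives P(spam): about 0.13, 0.06, and 0.01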

Would you prefer an email with 2000 characters to be labelled as spam or not? How about 40,000 characters?


Sensitivity and specificity


False positive and negative

|                         | Email is spam                 | Email is not spam             |
|-------------------------|-------------------------------|-------------------------------|
| Email labelled spam     | True positive                 | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative                 |
  • False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)

  • False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)


Sensitivity and specificity

|                         | Email is spam                 | Email is not spam             |
|-------------------------|-------------------------------|-------------------------------|
| Email labelled spam     | True positive                 | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative                 |
  • Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)

    • Sensitivity = 1 − False negative rate
  • Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)

    • Specificity = 1 − False positive rate
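
A sketch of estimating both from the fitted model at a 0.5 cutoff (assumes spam_fit from earlier; the cutoff and the object names are ours, not the slides'):

pred_prob <- predict(spam_fit, new_data = email, type = "prob")
labelled  <- ifelse(pred_prob$.pred_1 > 0.5, "spam", "not spam")
table(labelled, actual = email$spam)
# sensitivity = TP / (TP + FN), specificity = TN / (FP + TN)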

If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?

