Logistic regression



Data Science in a Box

1 / 24

Predicting categorical data

2 / 24

Spam filters

  • Data from 3921 emails and 21 variables on them
  • Outcome: whether the email is spam or not
  • Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc.
library(openintro)
glimpse(email)
## Rows: 3,921
## Columns: 21
## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, …
## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ cc <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, …
## $ sent_email <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …
## $ time <dttm> 2012-01-01 01:16:41, 2012-01-01 02:03:59,…
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ dollar <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no…
## $ inherit <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ password <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_char <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.09…
## $ line_breaks <int> 202, 202, 192, 255, 29, 25, 193, 237, 69, …
## $ format <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, …
## $ re_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 1…
## $ number <fct> big, small, small, small, none, none, big,…
3 / 24

Would you expect longer or shorter emails to be spam?

## # A tibble: 2 × 2
##   spam  mean_num_char
##   <fct>         <dbl>
## 1 0              11.3
## 2 1              5.44
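
A sketch of how this summary could be computed with dplyr (assuming the tidyverse is loaded alongside openintro; the code is not shown on the original slide):

email %>%
  group_by(spam) %>%
  summarise(mean_num_char = mean(num_char))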
4 / 24

Would you expect emails that have subjects starting with "Re:", "RE:", "re:", or "rE:" to be spam or not?
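
One way to check this in the data, as a sketch (re_subj records whether the subject included "Re:"; assumes the tidyverse is loaded):

email %>%
  count(re_subj, spam) %>%
  group_by(re_subj) %>%
  mutate(prop = n / sum(n))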

5 / 24

Modelling spam

  • Both number of characters and whether the message has "re:" in the subject might be related to whether the email is spam. How do we come up with a model that will let us explore this relationship?
  • For simplicity, we'll focus on the number of characters (num_char) as the predictor, but the model we describe can be expanded to take multiple predictors as well.
6 / 24

Modelling spam

This isn't something we can reasonably fit a linear model to -- we need something different!
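
The original slide illustrates this with a plot; a sketch of one way to visualise why a straight line fits poorly (the exact figure may differ; assumes ggplot2 is loaded):

# Binary outcome vs. num_char with a linear fit overlaid
ggplot(email, aes(x = num_char, y = as.numeric(as.character(spam)))) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Number of characters (in thousands)", y = "Spam (0/1)")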

7 / 24

Framing the problem

  • We can treat each outcome (spam and not) as successes and failures arising from separate Bernoulli trials
    • Bernoulli trial: a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted
  • Each Bernoulli trial can have a separate probability of success (see the sketch below)

$$y_i \sim \text{Bern}(p)$$

  • We can then use the predictor variables to model that probability of success, $p_i$
  • We can't just use a linear model for $p_i$ (since $p_i$ must be between 0 and 1) but we can transform the linear model to have the appropriate range
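
As referenced above, a minimal simulation sketch of Bernoulli trials with trial-specific success probabilities (the probabilities here are hypothetical, purely for illustration):

# Five Bernoulli trials, each with its own (hypothetical) probability of success
set.seed(1234)
p <- c(0.9, 0.7, 0.5, 0.3, 0.1)
rbinom(n = 5, size = 1, prob = p)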
8 / 24

Generalized linear models

  • This is a very general way of addressing many problems in regression and the resulting models are called generalized linear models (GLMs)
  • Logistic regression is just one example
9 / 24

Three characteristics of GLMs

All GLMs have the following three characteristics:

  1. A probability distribution describing a generative model for the outcome variable
  2. A linear model: $\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$
  3. A link function that relates the linear model to the parameter of the outcome distribution
10 / 24

Logistic regression

11 / 24

Logistic regression

  • Logistic regression is a GLM used to model a binary categorical outcome using numerical and categorical predictors
  • To finish specifying the logistic model we just need to define a reasonable link function that connects $\eta_i$ to $p_i$: the logit function
  • Logit function: For $0 \le p \le 1$,

$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$$
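
A quick numerical check, as a sketch: the logit written out by hand agrees with R's built-in qlogis() (the quantile function of the logistic distribution, which is exactly the logit):

p <- c(0.1, 0.5, 0.9)
log(p / (1 - p))   # logit by hand
qlogis(p)          # same values via the built-in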

12 / 24

Logit function, visualised
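
The original slide presumably shows the logit curve as a plot; a ggplot2 sketch that reproduces a similar figure (assumes ggplot2 is loaded):

curve_data <- data.frame(p = seq(0.001, 0.999, length.out = 500))
ggplot(curve_data, aes(x = p, y = log(p / (1 - p)))) +
  geom_line() +
  labs(x = "p", y = "logit(p)")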

13 / 24

Properties of the logit

  • The logit function takes a value between 0 and 1 and maps it to a value between $-\infty$ and $\infty$
  • Inverse logit (logistic) function: $g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}$ (see the sketch below)
  • The inverse logit function takes a value between $-\infty$ and $\infty$ and maps it to a value between 0 and 1
  • This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success -- more on this later
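
As noted above, a sketch checking the inverse logit numerically with R's built-in plogis(), including a round trip through qlogis():

x <- c(-2, 0, 2)
exp(x) / (1 + exp(x))              # inverse logit by hand
plogis(x)                          # same values via the built-in
plogis(qlogis(c(0.1, 0.5, 0.9)))   # round trip recovers 0.1, 0.5, 0.9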
14 / 24

The logistic regression model

  • Based on the three GLM criteria we have
    • $y_i \sim \text{Bern}(p_i)$
    • $\eta_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}$
    • $\text{logit}(p_i) = \eta_i$
  • From which we get

$$p_i = \frac{\exp(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1 + \exp(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}$$

15 / 24

Modelling spam

In R we fit a GLM in the same way as a linear model except we

  • specify the model with logistic_reg()
  • use "glm" instead of "lm" as the engine
  • define family = "binomial" for the link function to be used in the model
spam_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(spam ~ num_char, data = email, family = "binomial")
tidy(spam_fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80      0.0716    -25.1  2.04e-139
## 2 num_char     -0.0621    0.00801    -7.75 9.50e- 15
16 / 24

Spam model

tidy(spam_fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80      0.0716    -25.1  2.04e-139
## 2 num_char     -0.0621    0.00801    -7.75 9.50e- 15

$$\log\left(\frac{p}{1-p}\right) = -1.80 - 0.0621 \times \text{num\_char}$$

17 / 24

P(spam) for an email with 2000 characters

Since num_char is measured in thousands of characters, an email with 2000 characters has num_char = 2.

$$\log\left(\frac{p}{1-p}\right) = -1.80 - 0.0621 \times 2 = -1.9242$$
$$\frac{p}{1-p} = \exp(-1.9242) = 0.15$$
$$p = 0.15 \times (1 - p)$$
$$p = 0.15 - 0.15p$$
$$1.15p = 0.15$$
$$p = 0.15 / 1.15 = 0.13$$
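
The same answer can be checked, as a sketch, by applying the inverse logit (plogis()) directly to the linear predictor:

plogis(-1.80 - 0.0621 * 2)   # approximately 0.13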

18 / 24

What is the probability that an email with 15000 characters is spam? What about an email with 40000 characters?

  • 2K chars: P(spam) = 0.13
  • 15K chars: P(spam) = 0.06
  • 40K chars: P(spam) = 0.01 (see the predict() sketch below)
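
A sketch of computing these with parsnip's predict(), assuming spam_fit from the earlier slide is available and the tidyverse is loaded (num_char is in thousands of characters):

new_emails <- tibble(num_char = c(2, 15, 40))
predict(spam_fit, new_data = new_emails, type = "prob")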
19 / 24

Would you prefer an email with 2000 characters to be labelled as spam or not? How about 40,000 characters?

20 / 24

Sensitivity and specificity

21 / 24

False positive and negative

                          Email is spam                   Email is not spam
Email labelled spam       True positive                   False positive (Type 1 error)
Email labelled not spam   False negative (Type 2 error)   True negative
  • False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)

  • False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN) (see the sketch below)
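
A sketch of building this table from the model, assuming spam_fit from earlier and a 0.5 probability cutoff for labelling an email as spam (the cutoff is an illustrative assumption):

email %>%
  mutate(
    p_spam = predict(spam_fit, new_data = email, type = "prob")$.pred_1,
    label  = if_else(p_spam > 0.5, "Labelled spam", "Labelled not spam")
  ) %>%
  count(spam, label)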

22 / 24

Sensitivity and specificity

                          Email is spam                   Email is not spam
Email labelled spam       True positive                   False positive (Type 1 error)
Email labelled not spam   False negative (Type 2 error)   True negative
  • Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)

    • Sensitivity = 1 − False negative rate
  • Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)

    • Specificity = 1 − False positive rate
23 / 24

If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?

24 / 24
