```r
library(openintro)
glimpse(email)
```

```
## Rows: 3,921
## Columns: 21
## $ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ to_multiple  <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, …
## $ from         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, …
## $ sent_email   <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …
## $ time         <dttm> 2012-01-01 01:16:41, 2012-01-01 02:03:59, …
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ dollar       <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ winner       <fct> no, no, no, no, no, no, no, no, no, no, no, …
## $ inherit      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ password     <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_char     <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.09…
## $ line_breaks  <int> 202, 202, 192, 255, 29, 25, 193, 237, 69, …
## $ format       <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, …
## $ re_subj      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ urgent_subj  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 1…
## $ number       <fct> big, small, small, small, none, none, big, …
```
Would you expect longer or shorter emails to be spam?
```
## # A tibble: 2 × 2
##   spam  mean_num_char
##   <fct>         <dbl>
## 1 0             11.3
## 2 1              5.44
```
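The comparison above can be reproduced with dplyr (a sketch; the slide's own code is not shown):

```r
library(dplyr)

# Mean number of characters (in thousands) by spam status
email %>%
  group_by(spam) %>%
  summarise(mean_num_char = mean(num_char))
```

Spam emails are, on average, roughly half the length of non-spam emails in this sample.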
Would you expect emails that have subjects starting with "Re:", "RE:", "re:", or "rE:" to be spam or not?
We'll use a single variable (`num_char`) as the predictor, but the model we describe can be expanded to take multiple predictors as well. This isn't something we can reasonably fit a linear model to -- we need something different!
$$y_i \sim \text{Bern}(p)$$
All GLMs have the following three characteristics:

1. A probability distribution describing a generative model for the outcome variable
2. A linear model: $\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$
3. A link function that relates the linear model to the parameter of the outcome distribution
$$logit(p) = \log\left(\frac{p}{1-p}\right)$$
$$p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}$$
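Both directions of this transformation are available in base R: `qlogis()` is the logit and `plogis()` is its inverse (the logistic function above). A quick sketch:

```r
# The logit maps probabilities (0, 1) onto the whole real line
qlogis(0.5)   # logit(0.5) = log(0.5 / 0.5) = 0

# The inverse logit maps any real number back into (0, 1)
plogis(0)     # exp(0) / (1 + exp(0)) = 0.5
```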
In R we fit a GLM in the same way as a linear model, except we:

- use `logistic_reg()`
- use `"glm"` instead of `"lm"` as the engine
- define `family = "binomial"` for the link function to be used in the model

```r
spam_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(spam ~ num_char, data = email, family = "binomial")

tidy(spam_fit)
```
```
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.80     0.0716     -25.1  2.04e-139
## 2 num_char     -0.0621   0.00801     -7.75 9.50e- 15
```
$$\log\left(\frac{p}{1-p}\right) = -1.80 - 0.0621 \times \text{num\_char}$$
$$\log\left(\frac{p}{1-p}\right) = -1.80 - 0.0621 \times 2 = -1.9242$$

$$\frac{p}{1-p} = \exp(-1.9242) = 0.15 \rightarrow p = 0.15 \times (1 - p)$$

$$p = 0.15 - 0.15p \rightarrow 1.15p = 0.15$$

$$p = 0.15 / 1.15 = 0.13$$
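The algebra above can be checked directly in R, since `plogis()` is exactly the inverse-logit:

```r
# Probability that an email with num_char = 2 (2,000 characters) is spam
plogis(-1.80 - 0.0621 * 2)   # approximately 0.13
```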
What is the probability that an email with 15000 characters is spam? What about an email with 40000 characters?
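Note that `num_char` is recorded in thousands of characters (the `glimpse()` output shows values like 11.370), so an email with 15,000 characters corresponds to `num_char = 15`. A sketch of the calculation:

```r
# Predicted spam probabilities for 15,000- and 40,000-character emails
plogis(-1.80 - 0.0621 * c(15, 40))
```

The probability of being labelled spam decreases as the email gets longer.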
Would you prefer an email with 2000 characters to be labelled as spam or not? How about 40,000 characters?
| | Email is spam | Email is not spam |
|---|---|---|
| Email labelled spam | True positive | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative |
False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)
False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)
Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)
Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)
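These rates can be computed from a confusion matrix. The sketch below uses base R's `glm()`; the 0.1 probability cutoff for labelling an email as spam is an arbitrary assumption (this simple one-predictor model never reaches a predicted probability of 0.5):

```r
# Refit with base R's glm() for illustration
fit <- glm(spam ~ num_char, data = email, family = "binomial")

# Label an email spam if its predicted probability exceeds 0.1 (assumed cutoff)
labelled <- factor(ifelse(fitted(fit) > 0.1, "1", "0"), levels = c("0", "1"))

# Rows: label; columns: truth
conf <- table(labelled = labelled, truth = email$spam)

sensitivity <- conf["1", "1"] / sum(conf[, "1"])  # TP / (TP + FN)
specificity <- conf["0", "0"] / sum(conf[, "0"])  # TN / (FP + TN)
```

Raising the cutoff increases specificity but lowers sensitivity, and vice versa.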
If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?