Fitting and interpreting models

Data Science in a Box

datasciencebox.org

Models with numerical explanatory variables

Data: Paris Paintings

pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA"))

Number of observations: 3393
Number of variables: 61

Goal: Predict height from width

$\widehat{height}_{i} = \beta_{0} + \beta_{1} \times width_{i}$

Step 1: Specify model

linear_reg()

## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Step 2: Set model fitting engine

linear_reg() %>%
  set_engine("lm") # lm: linear model

## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Step 3: Fit model & estimate parameters

... using formula syntax

linear_reg() %>%
  set_engine("lm") %>%
  fit(Height_in ~ Width_in, data = pp)

## parsnip model object
## 
## 
## Call:
## stats::lm(formula = Height_in ~ Width_in, data = data)
## 
## Coefficients:
## (Intercept)     Width_in  
##      3.6214       0.7808

A closer look at model output

## parsnip model object
## 
## 
## Call:
## stats::lm(formula = Height_in ~ Width_in, data = data)
## 
## Coefficients:
## (Intercept)     Width_in  
##      3.6214       0.7808

$\widehat{height}_{i} = 3.6214 + 0.7808 \times width_{i}$

A tidy look at model output

linear_reg() %>%
  set_engine("lm") %>%
  fit(Height_in ~ Width_in, data = pp) %>%
  tidy()

## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    3.62    0.254        14.3 8.82e-45
## 2 Width_in       0.781   0.00950      82.1 0

$\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}$

Slope and intercept

$\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}$

Slope: For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.781 inches.
Intercept: Paintings that are 0 inches wide are expected to be 3.62 inches high, on average. (Does this make sense?)
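
Both interpretations can be read straight off the fitted equation. The sketch below uses the rounded estimates from the tidy output; the helper name `predict_height` is hypothetical and not part of any package.

```r
# Rounded estimates taken from the tidy output (intercept and slope)
b0 <- 3.62
b1 <- 0.781

# Hypothetical helper: predicted height (in) for a given width (in)
predict_height <- function(width_in) {
  b0 + b1 * width_in
}

# Intercept: the prediction at a width of 0 inches
predict_height(0)

# Slope: predictions for widths one inch apart differ by b1
predict_height(21) - predict_height(20)
```

The difference between the two predictions is the same for any pair of widths one inch apart, which is what "for each additional inch" means in the slope interpretation.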

Correlation does not imply causation

Remember this when interpreting model coefficients

Source: XKCD, Cell phones

Parameter estimation

Linear model with a single predictor

We're interested in $\beta_{0}$ (population parameter for the intercept) and $\beta_{1}$ (population parameter for the slope) in the following model:

$\hat{y}_{i} = \beta_{0} + \beta_{1} x_{i}$

Tough luck, you can't have them...
So we use sample statistics to estimate them:

$\hat{y}_{i} = b_{0} + b_{1} x_{i}$
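
To make the parameter vs. statistic distinction concrete, here is a minimal simulation sketch in base R. The "population" values `beta0 = 2` and `beta1 = 0.5`, the sample size, and the noise level are all made-up assumptions for illustration; they have nothing to do with the Paris Paintings data.

```r
set.seed(1)

# Hypothetical "population" parameters, chosen only for illustration
beta0 <- 2
beta1 <- 0.5

# Draw a sample of 500 observations from y = beta0 + beta1 * x + noise
n <- 500
x <- runif(n, min = 0, max = 10)
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = 1)

# Sample statistics b0 and b1 estimate beta0 and beta1
fit <- lm(y ~ x)
b <- coef(fit)
b
```

The estimates land close to, but not exactly on, the values used to generate the data. With real data we only ever see the estimates.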

Least squares regression

The regression line minimizes the sum of squared residuals.
If $e_{i} = y_{i} - \hat{y}_{i}$, then the regression line minimizes $\sum_{i = 1}^{n} e_{i}^{2}$.
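
A quick way to see the "minimizes" claim is to compare the sum of squared residuals for the fitted line against lines that are nudged away from it. The data and the helper name `ssr` below are hypothetical, made up for this sketch.

```r
# Hypothetical toy data, made up for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.8)

# Sum of squared residuals for a candidate line y = b0 + b1 * x
ssr <- function(b0, b1) {
  e <- y - (b0 + b1 * x)
  sum(e^2)
}

# Least squares estimates from lm()
b <- coef(lm(y ~ x))
ssr_fit <- ssr(b[[1]], b[[2]])

# Nudging either coefficient away from the estimates increases the sum
ssr_shift_intercept <- ssr(b[[1]] + 0.5, b[[2]])
ssr_shift_slope <- ssr(b[[1]], b[[2]] + 0.1)

c(fit = ssr_fit,
  shifted_intercept = ssr_shift_intercept,
  shifted_slope = ssr_shift_slope)
```

Any other shift of the intercept or slope gives the same result: the fitted line has the smallest sum.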

Visualizing residuals

Properties of least squares regression

The regression line goes through the center of mass point, the coordinates corresponding to average $x$ and average $y$, $(\bar{x}, \bar{y})$:

$\bar{y} = b_{0} + b_{1} \bar{x} \to b_{0} = \bar{y} - b_{1} \bar{x}$

The slope has the same sign as the correlation coefficient, because the standard deviations $s_{x}$ and $s_{y}$ are positive: $b_{1} = r \frac{s_{y}}{s_{x}}$
The sum of the residuals is zero: $\sum_{i = 1}^{n} e_{i} = 0$
The residuals and $x$ values are uncorrelated
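
These four properties can be checked numerically. The sketch below does so on hypothetical toy data with base R's `lm()`; the comparisons allow for floating-point error rather than testing exact equality.

```r
# Hypothetical toy data, made up for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.8)

fit <- lm(y ~ x)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["x"]]
e <- resid(fit)

# 1. The line passes through (mean(x), mean(y))
passes_through_means <- isTRUE(all.equal(mean(y), b0 + b1 * mean(x)))

# 2. The slope equals r * s_y / s_x
slope_matches <- isTRUE(all.equal(b1, cor(x, y) * sd(y) / sd(x)))

# 3. The residuals sum to zero (up to floating-point error)
residuals_sum_zero <- abs(sum(e)) < 1e-10

# 4. The residuals are uncorrelated with x
residuals_uncorrelated <- abs(cor(e, x)) < 1e-10

c(passes_through_means, slope_matches,
  residuals_sum_zero, residuals_uncorrelated)
```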

Models with categorical explanatory variables

Categorical predictor with 2 levels

## # A tibble: 3,393 × 3
##    name      Height_in landsALL
##    <chr>         <dbl>    <dbl>
##  1 L1764-2          37        0
##  2 L1764-3          18        0
##  3 L1764-4          13        1
##  4 L1764-5a         14        1
##  5 L1764-5b         14        1
##  6 L1764-6           7        0
##  7 L1764-7a          6        0
##  8 L1764-7b          6        0
##  9 L1764-8          15        0
## 10 L1764-9a          9        0
## 11 L1764-9b          9        0
## 12 L1764-10a        16        1
## 13 L1764-10b        16        1
## 14 L1764-10c        16        1
## 15 L1764-11         20        0
## 16 L1764-12a        14        1
## 17 L1764-12b        14        1
## 18 L1764-13a        15        1
## 19 L1764-13b        15        1
## 20 L1764-14         37        0
## # … with 3,373 more rows

landsALL = 0: No landscape features
landsALL = 1: Some landscape features

Height & landscape features

linear_reg() %>%
  set_engine("lm") %>%
  fit(Height_in ~ factor(landsALL), data = pp) %>%
  tidy()

## # A tibble: 2 × 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          22.7      0.328      69.1 0       
## 2 factor(landsALL)1    -5.65     0.532     -10.6 7.97e-26

Height & landscape features

$\widehat{Height_{in}} = 22.7 - 5.645 \times landsALL$

Slope: Paintings with landscape features are expected, on average, to be 5.645 inches shorter than paintings without landscape features.
- Compares the baseline level (landsALL = 0) to the other level (landsALL = 1).
Intercept: Paintings that don't have landscape features are expected, on average, to be 22.7 inches tall.
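
With a two-level predictor, the intercept is the mean of the baseline group and the "slope" is the difference between the two group means. The sketch below illustrates this on hypothetical toy heights (not the Paris Paintings data), using base R's `lm()`.

```r
# Hypothetical toy data: heights (in) and a 0/1 landscape indicator
height <- c(30, 24, 20, 26, 18, 16, 14, 20)
lands  <- c(0, 0, 0, 0, 1, 1, 1, 1)

fit <- lm(height ~ factor(lands))
b <- coef(fit)

# Group means computed directly
mean_0 <- mean(height[lands == 0])  # baseline level
mean_1 <- mean(height[lands == 1])  # other level

# Intercept is the baseline mean; the other coefficient is the difference in means
c(intercept = b[[1]], baseline_mean = mean_0)
c(slope = b[[2]], difference = mean_1 - mean_0)
```

In this toy sample the baseline mean is 25 and the other group's mean is 17, so the fitted coefficients are 25 and -8.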

Relationship between height and school

linear_reg() %>%
  set_engine("lm") %>%
  fit(Height_in ~ school_pntg, data = pp) %>%
  tidy()

## # A tibble: 7 × 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.0       10.0     1.40  0.162  
## 2 school_pntgD/FL     2.33      10.0     0.232 0.816  
## 3 school_pntgF       10.2       10.0     1.02  0.309  
## 4 school_pntgG        1.65      11.9     0.139 0.889  
## 5 school_pntgI       10.3       10.0     1.02  0.306  
## 6 school_pntgS       30.4       11.4     2.68  0.00744
## 7 school_pntgX        2.87      10.3     0.279 0.780

Dummy variables

## # A tibble: 7 × 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.0       10.0     1.40  0.162  
## 2 school_pntgD/FL     2.33      10.0     0.232 0.816  
## 3 school_pntgF       10.2       10.0     1.02  0.309  
## 4 school_pntgG        1.65      11.9     0.139 0.889  
## 5 school_pntgI       10.3       10.0     1.02  0.306  
## 6 school_pntgS       30.4       11.4     2.68  0.00744
## 7 school_pntgX        2.87      10.3     0.279 0.780

When the categorical explanatory variable has many levels, the levels are encoded as dummy variables.
Each coefficient describes the expected difference in height between that particular school and the baseline level.

Categorical predictor with 3+ levels

school_pntg	D_FL	F	G	I	S	X
A	0	0	0	0	0	0
D/FL	1	0	0	0	0	0
F	0	1	0	0	0	0
G	0	0	1	0	0	0
I	0	0	0	1	0	0
S	0	0	0	0	1	0
X	0	0	0	0	0	1

## # A tibble: 3,393 × 3
##    name      Height_in school_pntg
##    <chr>         <dbl> <chr>      
##  1 L1764-2          37 F          
##  2 L1764-3          18 I          
##  3 L1764-4          13 D/FL       
##  4 L1764-5a         14 F          
##  5 L1764-5b         14 F          
##  6 L1764-6           7 I          
##  7 L1764-7a          6 F          
##  8 L1764-7b          6 F          
##  9 L1764-8          15 I          
## 10 L1764-9a          9 D/FL       
## 11 L1764-9b          9 D/FL       
## 12 L1764-10a        16 X          
## 13 L1764-10b        16 X          
## 14 L1764-10c        16 X          
## 15 L1764-11         20 D/FL       
## 16 L1764-12a        14 D/FL       
## 17 L1764-12b        14 D/FL       
## 18 L1764-13a        15 D/FL       
## 19 L1764-13b        15 D/FL       
## 20 L1764-14         37 F          
## # … with 3,373 more rows
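
The dummy encoding can be inspected directly with base R's `model.matrix()`, which builds the design matrix `lm()` works with. The six school codes below are a hypothetical mini-sample chosen for illustration, so only four of the seven levels appear.

```r
# A few school codes, used here as a hypothetical mini-sample
school_pntg <- factor(c("A", "D/FL", "F", "F", "I", "D/FL"))

# model.matrix() shows the design matrix lm() builds from a factor:
# an intercept column plus one 0/1 column per non-baseline level
X <- model.matrix(~ school_pntg)
X

# The baseline is the first factor level (alphabetical by default)
levels(school_pntg)[1]
```

The baseline level gets no column of its own: a row for school A has 0 in every dummy column, which is why its expected height is carried by the intercept.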

Relationship between height and school

## # A tibble: 7 × 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.0       10.0     1.40  0.162  
## 2 school_pntgD/FL     2.33      10.0     0.232 0.816  
## 3 school_pntgF       10.2       10.0     1.02  0.309  
## 4 school_pntgG        1.65      11.9     0.139 0.889  
## 5 school_pntgI       10.3       10.0     1.02  0.306  
## 6 school_pntgS       30.4       11.4     2.68  0.00744
## 7 school_pntgX        2.87      10.3     0.279 0.780

Austrian school (A) paintings are expected, on average, to be 14 inches tall.
Dutch/Flemish school (D/FL) paintings are expected, on average, to be 2.33 inches taller than Austrian school paintings.
French school (F) paintings are expected, on average, to be 10.2 inches taller than Austrian school paintings.
German school (G) paintings are expected, on average, to be 1.65 inches taller than Austrian school paintings.
Italian school (I) paintings are expected, on average, to be 10.3 inches taller than Austrian school paintings.
Spanish school (S) paintings are expected, on average, to be 30.4 inches taller than Austrian school paintings.
Paintings whose school is unknown (X) are expected, on average, to be 2.87 inches taller than Austrian school paintings.
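
Since each coefficient is a difference from the baseline, the expected height for a school is the intercept plus that school's coefficient. A small sketch using the rounded estimates from the output above (so the results inherit that rounding); the variable names are hypothetical.

```r
# Rounded estimates copied from the tidy output
intercept <- 14.0  # expected height for the baseline, Austrian school (A)
diffs <- c("D/FL" = 2.33, F = 10.2, G = 1.65, I = 10.3, S = 30.4, X = 2.87)

# Expected height for each school = intercept + that school's coefficient
expected_height <- c(A = intercept, intercept + diffs)
expected_height
```

For example, Spanish school paintings are expected, on average, to be 14.0 + 30.4 = 44.4 inches tall.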
