infer::generate(reps, type = "bootstrap")
Describe the simulation process for estimating the parameter assigned to your team.
What happens to the width of the confidence interval as the confidence level increases? Why? Should we always prefer a confidence interval with a higher confidence level?
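One way to explore this question empirically is to compute percentile intervals at several confidence levels from the same bootstrap distribution. The sketch below uses base R only and simulated data; `x`, `conf_levels`, and `percentile_ci()` are hypothetical names introduced for illustration, not part of the slides.

```r
# Hypothetical data: 200 simulated values (not the ncbirths data).
set.seed(1)
x <- rnorm(200, mean = 38, sd = 3)

# Bootstrap distribution of the sample mean.
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# Percentile interval: the middle `level` share of the bootstrap distribution.
percentile_ci <- function(stats, level) {
  alpha <- 1 - level
  unname(quantile(stats, c(alpha / 2, 1 - alpha / 2)))
}

# Interval width at each confidence level.
conf_levels <- c(0.90, 0.95, 0.99)
widths <- sapply(conf_levels, function(l) diff(percentile_ci(boot_means, l)))
data.frame(level = conf_levels, width = widths)
```

Comparing the `width` column across rows shows the trade-off between confidence and precision for this one simulated sample.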
Which of the following is more informative:
What does your answer tell you about interpretation of confidence intervals for differences between two population parameters?
infer::hypothesize(null = "point") and infer::generate(reps, type = "simulate") or infer::generate(reps, type = "bootstrap")
infer::hypothesize(null = "independence") and infer::generate(reps, type = "permute")
Describe the simulation process for testing for the parameter assigned to your team.
In a court of law
Which is worse: Type 1 or Type 2 error?
Fill in the blanks in terms of correctly/incorrectly rejecting/failing to reject the null hypothesis:
The dataset is in the openintro package.
glimpse(ncbirths)
## Rows: 1,000
## Columns: 13
## $ fage           <int> NA, NA, 19, 21, NA, NA, 18, 17, NA, 20, …
## $ mage           <int> 13, 14, 15, 15, 15, 15, 15, 15, 16, 16, …
## $ mature         <fct> younger mom, younger mom, younger mom, y…
## $ weeks          <int> 39, 42, 37, 41, 39, 38, 37, 35, 38, 37, …
## $ premie         <fct> full term, full term, full term, full te…
## $ visits         <int> 10, 15, 11, 6, 9, 19, 12, 5, 9, 13, 9, 8…
## $ marital        <fct> not married, not married, not married, n…
## $ gained         <int> 38, 20, 38, 34, 27, 22, 76, 15, NA, 52, …
## $ weight         <dbl> 7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44…
## $ lowbirthweight <fct> not low, not low, not low, not low, not …
## $ gender         <fct> male, male, female, male, female, male, …
## $ habit          <fct> nonsmoker, nonsmoker, nonsmoker, nonsmok…
## $ whitemom       <fct> not white, not white, white, white, not …
## # A tibble: 1 × 7
##     min  xbar   med     s    q1    q3   max
##   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1    20  38.3    39  2.93    37    40    45
Assuming that this sample is representative of all births in NC, we are 95% confident that the average length of gestation for babies in NC is between ---- and ---- weeks.
(1) How many variables?
1 variable: length of gestation, weeks
(2) What type(s) of variable(s)?
Numerical
(3) What is the research question?
Estimate the average length of gestation → confidence interval
Goal: Use bootstrapping to estimate the sampling variability of the mean, i.e. the variability of means taken from the same population with the same sample size.
(1) Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.
(2) Calculate the mean of the bootstrap sample.
(3) Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap means.
(4) Calculate the bounds of the 95% confidence interval as the middle 95% of the bootstrap distribution.
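The steps above can be sketched by hand in base R, without infer, to make each step visible. This is a minimal illustration under stated assumptions: `weeks` here is a simulated stand-in for the non-missing gestation lengths (not the actual ncbirths values), and `n_reps`, `boot_sample`, and `ci` are hypothetical names.

```r
set.seed(20180326)

# Hypothetical stand-in for the observed sample (NOT the ncbirths data).
weeks <- round(rnorm(998, mean = 38.3, sd = 2.9))

n_reps <- 1000
boot_means <- numeric(n_reps)

for (i in seq_len(n_reps)) {
  # Resample with replacement, same size as the original sample.
  boot_sample <- sample(weeks, size = length(weeks), replace = TRUE)
  # Record the mean of this bootstrap sample.
  boot_means[i] <- mean(boot_sample)
}

# The middle 95% of the bootstrap distribution gives the interval bounds.
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```

The loop is written out for readability; the infer pipeline in these slides performs the same resample-and-summarise cycle through generate() and calculate().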
From the documentation of set.seed:
set.seed uses a single integer argument to set as many seeds as are required. There is no guarantee that different values of seed will seed the RNG differently, although any exceptions would be extremely rare.
set.seed(20180326)
boot_means <- ncbirths %>%
  filter(!is.na(weeks)) %>% # remove NAs
  specify(response = weeks) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

ggplot(data = boot_means, aes(x = stat)) +
  geom_histogram(binwidth = 0.03)
boot_means %>%
  summarise(
    lower = quantile(stat, 0.025),
    upper = quantile(stat, 0.975)
  )
## # A tibble: 1 × 2
##   lower upper
##   <dbl> <dbl>
## 1  38.2  38.5
Assuming that this sample is representative of all births in NC, we are 95% confident that the average length of gestation for babies in NC is between 38.2 and 38.5 weeks.
The average length of human gestation is 280 days, or 40 weeks, from the first day of the woman's last menstrual period. Do these data provide convincing evidence that average length of gestation for women in NC is different than 40 weeks? Use a significance level of 5%.
H0: μ = 40
HA: μ ≠ 40
We just said, "we are 95% confident that the average length of gestation for babies in NC is between 38.2 and 38.5 weeks".
Since the null value is outside the CI, we would reject the null hypothesis in favor of the alternative.
But an alternative, more direct, way of answering this question is using a hypothesis test.
Goal: Use bootstrapping to generate a sampling distribution under the assumption of the null hypothesis being true. Then, calculate the p-value to make a decision on the hypotheses.
(1) Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.
(2) Calculate the mean of the bootstrap sample.
(3) Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap means.
(4) Shift the bootstrap distribution to be centered at the null value by subtracting/adding the difference between the center of the bootstrap distribution and the null value to each bootstrap mean.
(5) Calculate the p-value as the proportion of simulations that yield a sample mean at least as extreme as the observed sample mean.
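The shifting step is the one that is easiest to lose track of, so here is a hand-rolled base R sketch of the whole procedure. It rests on assumptions: `weeks` is a simulated stand-in for the observed data (not the ncbirths values), and `null_value`, `obs_mean`, and `shifted` are hypothetical names. It also counts "at least as extreme" by distance from the null value in either direction, which is a slightly different way of getting a two-sided p-value than doubling one tail.

```r
set.seed(20180326)

# Hypothetical stand-in for the observed sample (NOT the ncbirths data).
weeks <- round(rnorm(998, mean = 38.3, sd = 2.9))

null_value <- 40
obs_mean <- mean(weeks)

# Bootstrap distribution of the sample mean.
boot_means <- replicate(1000, mean(sample(weeks, replace = TRUE)))

# Shift every bootstrap mean so the distribution is centered at the null value.
shifted <- boot_means + (null_value - mean(boot_means))

# Two-sided p-value: share of shifted means at least as far from the
# null value as the observed sample mean is.
p_value <- mean(abs(shifted - null_value) >= abs(obs_mean - null_value))
p_value
```

After the shift, the spread of the distribution is unchanged; only its center moves, which is what lets it stand in for the sampling distribution under the null hypothesis.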
boot_means_shifted <- ncbirths %>%
  filter(!is.na(weeks)) %>% # remove NAs
  specify(response = weeks) %>%
  hypothesize(null = "point", mu = 40) %>% # hypothesize step
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

ggplot(data = boot_means_shifted, aes(x = stat)) +
  geom_histogram(binwidth = 0.03) +
  geom_vline(xintercept = 38.33, color = "red") +
  geom_vline(xintercept = 40 + (40 - 38.33), color = "red")
boot_means_shifted %>%
  filter(stat <= 38.33) %>%
  summarise(p_value = 2 * (n() / 1000))
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0
Since the p-value is less than the significance level, we reject the null hypothesis. The data provide convincing evidence that the average length of gestation of births in NC is different from 40 weeks.
df %>%
  specify(response, explanatory) %>% # explanatory optional
  generate(reps, type) %>% # type: bootstrap, simulate, or permute
  calculate(stat)
The result has a column named stat.
See the documentation for calculate to see which statistics can be calculated.
For a hypothesis test, add a hypothesize() step between specify() and generate():
null = "point", and then specify the null value
null = "independence"