Confidence Intervals - SUGGESTED ANSWERS

Packages

library(tidyverse)
library(tidymodels)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
✔ broom        1.0.5     ✔ rsample      1.1.1
✔ dials        1.2.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
✔ parsnip      1.1.0     ✔ yardstick    1.2.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages

Context

The Iris Dataset contains four features (length and width of sepals and petals) for 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). The sepals are the outer parts of the flower (often green and leaf-like) that enclose a developing bud. The petals are the parts of the flower that surround its reproductive, pollen-producing parts and are often conspicuously colored. The difference between sepals and petals can be seen below.

The data were collected on the Gaspé Peninsula in Canada and published in 1936. The data set is prepackaged in R, and is called iris.

data(iris)

Last time, we conducted a hypothesis test for a difference between the mean sepal length of setosa and versicolor. Now, we want to estimate what the difference actually is.

First, we want to filter the data set to only contain our two Species. Please create a new data set that achieves this below. Let’s also calculate our statistic.

iris_filter <- iris |>
  filter(Species != "virginica")

iris_filter |>
  group_by(Species) |>
  summarize(mean_sep = mean(Sepal.Length))
# A tibble: 2 × 2
  Species    mean_sep
  <fct>         <dbl>
1 setosa         5.01
2 versicolor     5.94

What is your point estimate? Using proper notation, report it below (setosa - versicolor).

\(\bar{x}_s - \bar{x}_v = -0.93\)
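
As a quick check, the point estimate can be reproduced directly from the group means. The following is a minimal sketch in base R; the variable names (mean_setosa, mean_versicolor, point_estimate) are illustrative and not part of the original worksheet.

```r
# Sketch: compute the point estimate (setosa - versicolor) with base R.
# Variable names are illustrative assumptions, not from the worksheet.
data(iris)

mean_setosa <- mean(iris$Sepal.Length[iris$Species == "setosa"])
mean_versicolor <- mean(iris$Sepal.Length[iris$Species == "versicolor"])

# Difference in sample means, in the order setosa - versicolor
point_estimate <- mean_setosa - mean_versicolor
round(point_estimate, 2)
# -0.93
```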

Do we have a null and alternative hypothesis to set up?

No. We are estimating the difference, not testing a claim about it.

Bootstrap Distribution

The term bootstrapping comes from the phrase “pulling oneself up by one’s bootstraps”, which is a metaphor for accomplishing an impossible task without any outside help.

Impossible task: estimating a population parameter using data from only the given sample.

Note: This notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference; it is not limited to bootstrapping.

How do we create a bootstrap distribution? How is this different than a null distribution?

First, let’s calculate our sample size for each group:

iris_filter |>
  group_by(Species) |>
  summarize(n = n())
# A tibble: 2 × 2
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50

To make a bootstrap distribution, we take a random sample of 50 observations, with replacement, from the setosa group, and a separate random sample of 50 observations, with replacement, from the versicolor group. Next, we calculate the mean of each resample and subtract the means (setosa - versicolor). This difference is one resampled statistic on our bootstrap distribution. We repeat this process a large number of times (1000) to create the bootstrap distribution that will be used to calculate our confidence interval. Unlike a null distribution, which is simulated under the assumption that the null hypothesis is true and is therefore centered at the null value (here, 0), a bootstrap distribution is centered near the observed sample statistic.
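
The resampling procedure described above can be sketched in base R as follows. This is one possible implementation under stated assumptions, not necessarily the approach used in class: the seed value, the helper name one_boot_diff, and the choice of a 95% percentile interval are all illustrative.

```r
# Sketch: bootstrap distribution of the difference in mean sepal length
# (setosa - versicolor), resampling within each group.
# The seed, helper name, and 95% level are illustrative assumptions.
data(iris)
set.seed(123)  # arbitrary seed so the resamples are reproducible

setosa <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# One bootstrap replicate: resample each group with replacement
# (same size as the original group), then take the difference in means.
one_boot_diff <- function() {
  boot_setosa <- sample(setosa, size = length(setosa), replace = TRUE)
  boot_versicolor <- sample(versicolor, size = length(versicolor), replace = TRUE)
  mean(boot_setosa) - mean(boot_versicolor)
}

# Repeat 1000 times to build the bootstrap distribution
boot_diffs <- replicate(1000, one_boot_diff())

# Percentile interval: middle 95% of the bootstrap distribution
quantile(boot_diffs, probs = c(0.025, 0.975))
```

Resampling within each group keeps both resamples at 50 observations, which matches the description above. The bootstrap differences should cluster around the point estimate of -0.93 rather than around 0.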