Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Why multiple variables matter
We are going to investigate the relationship between a person’s salary and their level of neuroticism. Neuroticism is a trait that reflects a person’s level of emotional stability. It is often defined as a negative personality trait involving negative emotions, poor self-regulation (an inability to manage urges), trouble dealing with stress, a strong reaction to perceived threats, and the tendency to complain.
Rows: 1000 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Education
dbl (2): Salary, Neuroticism
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Assume we are going to investigate this using a linear model. Make an appropriate visualization to assess the relationship between salary and neuroticism. Comment on the relationship. Fit a line using method = "lm".
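A minimal sketch of one way to make this plot, assuming the data frame is named data (as in the code used later for the education version):
data |> ggplot(aes(y = Salary, x = Neuroticism)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", se = F)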
As you may have noticed, we also have level of education as a categorical variable with levels Low, Medium, and High. A fellow researcher suggests including this information in the model to “account for it” when looking at the relationship between salary and neuroticism. Make an updated visualization that includes level of education as the z variable, and comment on the relationship between salary and neuroticism below. Fit lines using method = "lm" for each of the three levels of education.
data |> ggplot(aes(y = Salary, x = Neuroticism, color = Education)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'
What is going on? What is this called?
The relationship flipped! This is Simpson’s paradox.
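One way to check this numerically (a sketch, not part of the original output; it assumes dplyr is loaded, as the masking message above suggests) is to compare the overall correlation with the correlation within each education level:
## Overall correlation between salary and neuroticism
data |> summarize(r = cor(Salary, Neuroticism))
## Correlation within each education level
data |> group_by(Education) |> summarize(r = cor(Salary, Neuroticism))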
Model Selection: Penguins
Last class (and in the warm up), we went over why \(R^2\) is not a good tool for model selection. Let’s go over some more appropriate model selection tools to pick the “best” model from a set of candidates.
Model Selection Notes:
Before we show how to do this in R…
Note! Please read!
It is a bad idea to simply let R alone choose your best-fitting model for you. Good practices include:
– Talking with the subject-matter expert about variables that must or should not be in your model
– Aligning variables with your research question. This includes:
— keeping the variables you are interested in and the variables you want to account for
— We use this as evidence, along with other sources of evidence, to justify our model selection
Adjusted R-squared
Fit three models that we have gone over in class thus far, and use adjusted R-squared to justify which model is most appropriate. Hint: In the same pipe used to fit the model, use glance() |> pull(adj.r.squared) to get the appropriate information. You can also use glance(model.name)$adj.r.squared.
Model 1: body mass vs flipper length
slr <- linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ flipper_length_mm, data = penguins)
glance(slr)$AIC
[1] 5062.855
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ flipper_length_mm, data = penguins) |> glance() |> pull(adj.r.squared)
[1] 0.7582837
Model 2: body mass vs flipper length + island
mrl_add <- linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ island + flipper_length_mm, data = penguins)
glance(mrl_add)$AIC
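The line above pulls the AIC, which is discussed next. To compare Model 2 with Model 1 on the adjusted R-squared criterion asked for in this section, a short sketch reusing the mrl_add fit (output not shown):
glance(mrl_add)$adj.r.squared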
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data.
You can calculate this in R much like you do adjusted R-squared: use glance() |> pull(AIC) or glance(model.name)$AIC. Try it above!
Backwards Selection
Backward elimination starts with the model that includes all potential predictor variables. Variables are eliminated one at a time from the model until we cannot improve the model any further.
Procedure:
– Start with a full model that has all the predictors we are considering and compute its AIC or adjusted R-squared.
– Next, fit every possible model with 1 less predictor.
– Compare the AIC or adjusted R-squared scores and select the best model with 1 less predictor, provided it improves on the full model.
– Repeat steps 2 and 3 until you have scored the model with no predictors.
– Compare AIC or adjusted R-squared among all tested models to select the best model.
Example
To better understand the idea of backwards selection, we are going to manually walk through one iteration of the procedure above. Our “full” model will be defined as an additive model with flipper length, bill length, and island as predictors of our response, body mass. With this information, perform backwards selection and report which variable will not be in the final backward elimination model (if any). Use AIC as your criterion.
Hint: In the same pipe used to fit the model, instead of using tidy(), use glance() |> pull(AIC) to get the appropriate information.
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ flipper_length_mm + bill_length_mm + island, data = penguins) |> glance() |> pull(AIC)
[1] 5035.615
## Choose the model above
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ bill_length_mm + island, data = penguins) |> glance() |> pull(AIC)
[1] 5216.441
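To finish this elimination step, we would also fit the remaining reduced models, each dropping one other predictor from the full model, and compare their AICs against the full model’s 5035.615. A sketch (output not shown):
## Drop bill_length_mm from the full model
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ flipper_length_mm + island, data = penguins) |> glance() |> pull(AIC)
## Drop island from the full model
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins) |> glance() |> pull(AIC)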
Forward Selection
Procedure:
– Start with a model that has no predictors, i.e. just model the response with a constant term.
– Next, fit every possible model with 1 additional predictor and score each model.
– Compare AIC scores to select the best model with 1 additional predictor versus the previous model.
– Repeat steps 2 and 3 until you can no longer improve the model.
Perform 1 step of forward selection using the same variables mentioned above. Which variable will be in the final forward selection model?
For this example, let’s use flipper length, bill length, and island as our candidate predictors for body mass.
Hint: We can fit a model with no predictors using ~ 1 in place of any explanatory variable.
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ 1, data = penguins) |> glance() |> pull(AIC)
[1] 5547.496
### island, flipper length, bill length
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ flipper_length_mm, data = penguins) |> glance() |> pull(AIC)
[1] 5062.855
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ flipper_length_mm + island, data = penguins) |> glance() |> pull(AIC)
[1] 5044.513
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ flipper_length_mm * island * bill_length_mm, data = penguins) |> glance() |> pull(AIC)
[1] 4997.089
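Note that the last fit above includes all interaction terms between the three predictors. A plain forward-selection step, as described in the procedure, would instead add bill_length_mm additively; a sketch of that fit (output not shown):
linear_reg() |> set_engine("lm") |> fit(body_mass_g ~ flipper_length_mm + island + bill_length_mm, data = penguins) |> glance() |> pull(AIC)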
Notes about model selection
– What is a “meaningful” difference? You should set that threshold yourself.
– Backward elimination and forward selection sometimes arrive at different final models.
So which do I choose?
Forward stepwise selection is used when the number of variables under consideration is very large.
Starting with the full model (backward selection) has the advantage of considering the effects of all variables simultaneously.
Optional
Stepwise in R
People use stepwise selection when their goal is to pick the overall “best” model.
There are many functions that will allow you to perform stepwise selection in R. Today, we are going to introduce ourselves to stepAIC from the MASS package. Currently, performing stepwise selection is very difficult with tidymodels, so we are going to write our models in base R using the lm function. Let’s do this below.
In stepAIC, the arguments are:
object: your model name
scope: your “full model”
direction: "forward" or "backward".
lm_fwd <- lm(body_mass_g ~ island, data = penguins)
stepAIC(lm_fwd, scope = ~ bill_length_mm + island + flipper_length_mm, direction = "forward")
Start: AIC=4407.88
body_mass_g ~ island
Df Sum of Sq RSS AIC
+ flipper_length_mm 1 83480840 49512346 4072.0
+ bill_length_mm 1 51139275 81853911 4243.9
<none> 132993186 4407.9
Step: AIC=4071.96
body_mass_g ~ island + flipper_length_mm
Df Sum of Sq RSS AIC
+ bill_length_mm 1 1552910 47959436 4063.1
<none> 49512346 4072.0
Step: AIC=4063.06
body_mass_g ~ island + flipper_length_mm + bill_length_mm
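For comparison, a sketch of backward elimination with stepAIC, starting from a full model with the same three predictors (lm_full is just an illustrative name; output not shown):
## Fit the full model first, then eliminate predictors one at a time
lm_full <- lm(body_mass_g ~ bill_length_mm + island + flipper_length_mm, data = penguins)
stepAIC(lm_full, direction = "backward")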