HW 5 - Modelling
This homework is due Tuesday, Oct 31 at 11:59pm.
Homeworks are to be turned in individually as usual.
Getting Started
Go to the sta199-f23-1 organization on GitHub. Click on the repo with the prefix hw5. It contains the starter documents you need to complete the homework assignment. Clone the repo and start a new project in RStudio.
Note: This homework assignment has 1 extra credit problem worth 3 points. No partial credit will be given for this problem, and you cannot earn more than 100% on this assignment.
Packages
We’ll use the tidyverse package for much of the data wrangling and visualization, though you’re welcome to load other packages as needed.
Data
For this homework, we are again interested in the impact of smoking during pregnancy. In Homework 4, we explored the relation between mothers’ smoking habits and the weight of babies at birth by creating and discussing histograms. This is part of so-called Exploratory Data Analysis, which is useful for summarizing and describing the data and for identifying preliminary relations between the variables in the dataset. Fitting models lets us draw conclusions about the relations among variables, and this is what we’ll be doing in this homework.
Exercise 1 - Smoking during pregnancy - Part 1
We will use the dataset births14, included in the openintro package loaded at the beginning of the assignment. This is a random sample of 1,000 mothers from a data set released in 2014 by the state of North Carolina about the relation between habits and practices of expectant mothers and the birth of their children. Please look at help(births14) for the variables contained in the dataset.
a. Create a version of the births14 data set dropping observations where there are NAs for habit. Save this new dataset as births14_habitgiven.
b. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions. Create an appropriate plot displaying the relationship between weight and habit. In 2-3 sentences, discuss the relationships observed. As always, include appropriate labels in your plot.
c. Now, fit a linear model that investigates the relationship between weight and habit. Provide the tidy summary output below.
d. Write the estimated least squares regression line below using proper notation. Hint: If you need to type an equation using proper notation, type your answers in-between $$ and $$. You may use \widehat{example} to put a hat on a character.
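As a generic illustration of this notation (a sketch with placeholder symbols, not the answer for these data), a fitted line with response y and predictor x could be typed as shown here. The symbols y, x, b_0, and b_1 are placeholders introduced for this example only.

```latex
$$
\widehat{y} = b_0 + b_1 \times x
$$
```

In your own answer, replace the placeholders with the variable names and the estimates from your model output.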
Exercise 2 - Smoking during pregnancy - Part 2
a. Another researcher is interested in assessing the relationship between babies’ weights and mothers’ ages. Fit another linear model to investigate this relationship. Provide the summary output below.
b. In 2-3 sentences, explain how the regression line to model these data is fit, i.e., based on what criteria R determines the regression line.
c. Interpret the intercept in the context of the problem. Is the intercept meaningful in this context? Why or why not?
d. Interpret the slope in the context of the problem.
Data
For the next two exercises, we will be using the lego_sample dataset, which includes data about Lego sets on sale on Amazon.
lego_sample <- read_csv("data/lego_sample.csv")
We are interested in the following three variables:
- amazon_price: the price for the Lego set listed on Amazon;
- pieces: the number of Lego pieces in the set;
- theme: the category the set belongs to.
Exercise 3 - LEGO
Let’s examine the relation between the Amazon Price and the number of pieces in a Lego set.
a. Compute and report the correlation r of the two variables amazon_price and pieces. Interpret the correlation coefficient in the context of the problem. Note, save this calculation as r. Hint: use the definition of a correlation coefficient.
b. Compute and report the standard deviations of the variables amazon_price and pieces, which we will denote respectively as \(s_{price}\) and \(s_{pieces}\). Note, save these calculations as s_price and s_pieces.
c. Now fit a linear model where the number of pieces is a predictor of the Amazon price. Provide the summary output and write the estimated model with proper notation.
d. Compute the coefficient of determination for this linear model and show that it is equal to \(r^2\). Then, interpret the coefficient of determination in the context of the problem.
e. If we denote by \(\widehat{\beta}_{pieces}\) the slope you estimated at point c, show that this relationship holds: \[ \widehat{\beta}_{pieces} = \frac{s_{price}}{s_{pieces}} r. \]
(Hint: you have to compute the number on the right and show that it equals the estimated slope!)
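For intuition on why this identity holds, here is a short sketch starting from the standard least squares slope formula. It writes x for pieces and y for price. The symbols \(x_i\), \(y_i\), \(\bar{x}\), \(\bar{y}\), and \(n\) are notation introduced here, not part of the assignment. The sketch uses \(r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{(n-1)\, s_{pieces}\, s_{price}}\) and \(s_{pieces}^2 = \frac{\sum_i (x_i-\bar{x})^2}{n-1}\).

```latex
\[
\widehat{\beta}_{pieces}
= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
= \frac{(n-1)\, r\, s_{pieces}\, s_{price}}{(n-1)\, s_{pieces}^2}
= \frac{s_{price}}{s_{pieces}}\, r
\]
```

The exercise still asks for the numerical check: compute the right-hand side from your saved values and compare it with the slope in your model output.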
Exercise 4
Let us look at how the relation between the Amazon Price and the number of pieces in a Lego set varies according to the theme of the Lego set.
a. Make a scatter plot of the Amazon price and number of pieces, coloring the points according to the theme of the Lego set. Add a linear trend line for each theme. Is there evidence that the relationship between price and number of pieces is different by theme? Justify your answer.
b. Regardless of your answer to part a, let’s fit an additive model named price_fit_add. Interpret the coefficient associated with pieces in the context of the problem.
c. Using this model, use R code to predict the mean Amazon price of Friends Lego sets that have 50 pieces.
Exercise 5 - Interaction
a. Fit an interaction model with theme and number of pieces as your explanatory variables to investigate Amazon price. Name this model price_fit_int. Report your model output below.
b. Write out the population model using proper notation below.
c. Write out the estimated model for the DUPLO theme.
d. Write out the estimated model for the City theme.
e. In 1-2 sentences, compare the relationship between price and number of pieces across the DUPLO and City theme.
f. Without calculating the coefficient of determination for the price_fit_add model, explain whether the coefficient of determination for the price_fit_int model will be larger or smaller than the coefficient of determination for the price_fit_add model.
Extra Credit - Functionalizing Halloween (3 points)
In this exercise, we will attempt to write a custom function in R.
Suppose you have two types of candy to give out on Halloween: Hershey’s bars and Starbursts. If trick-or-treaters knock at your door, you will give them a Hershey’s bar half of the time and a Starburst the rest of the time. This means that a random trick-or-treater who knocks at your door will get a Hershey’s with probability 0.5 and a Starburst with probability 0.5.
We want to create a function that chooses for you which type of candy you will give a trick-or-treater that knocks on your door. Your function should:
- take as input: a numerical value indicating the number of trick-or-treaters knocking at your door;
- give as output: how many Hershey’s bars and how many Starbursts you are giving in total to the trick-or-treaters. The output should be a tibble with columns candy and n, with one row per type of candy given out (so usually 2x2).
For example, if you want to select candies for 3 trick or treaters, it might look something like this:
trick_or_treat(3)
And the result might look something like this:
# A tibble: 1 × 2
candy n
<chr> <int>
1 Hershey's 3
Your function should be able to display a table of counts for Hershey’s bars and Starbursts for any number of trick-or-treaters you expect to see on Halloween.
Write this function and test it with 15, 100, and 200 as inputs.
You will note that every time you render your document the results will change. This is expected as you’re randomly sampling. (And later on in the course we’ll talk about how to make these numbers not change every time you render, for reproducibility!)
To create and test your function, please fill in the ___ below.
Look at Chapter 26 of R for Data Science for help, and at the help page for the sample function (?sample) to learn more about it.
Hint: Think about what varies in your function as you define your input.
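To get a feel for how sample behaves before filling in the template, here is a minimal sketch on a made-up coin-flip example. The names outcomes and flips are hypothetical and are not part of the assignment.

```r
# Illustration only: a made-up coin-flip example, not the candy function.
outcomes <- c("heads", "tails")  # hypothetical labels

# Draw 10 values from `outcomes`; each label is equally likely by default.
# replace = TRUE lets a label be drawn more than once, which is required
# whenever size is larger than the number of labels.
flips <- sample(outcomes, size = 10, replace = TRUE)

# Count how many times each label was drawn.
table(flips)
```

The counts will differ from run to run because the draws are random.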
trick_or_treat <- function(___){ # Input
  candy_types <- c("___", "___") # Names of the types of candy
  tibble(candy = sample(___, size = ___, replace = ___)) |>
    ___(___)
}
# Below you should call the function for 15, 100, 200 trick or treaters
trick_or_treat(___)
trick_or_treat(___)
trick_or_treat(___)
Submission
- Go to Gradescope and click Log in in the top right corner.
- Click School Credentials, then Duke Net ID, and log in using your Net ID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
- Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.
Grading
Exercise 1: 10 points
Exercise 2: 8 points
Exercise 3: 10 points
Exercise 4: 10 points
Exercise 5: 10 points
Extra Credit: 3 bonus points
Workflow + formatting: 2 points
Total: 50