HW 1 - Data visualization
This homework is due Tuesday, Sep 12 at 11:59pm.
Getting Started
Go to the sta199-f23-1 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the homework assignment.
Clone the repo and start a new project in RStudio. Reminder instructions for doing so are given here:
Click on the green CODE button, select Use SSH (this might already be selected by default, and if it is, you’ll see the text Clone with SSH). Click on the clipboard icon to copy the repo URL.
In RStudio, go to File ➛ New Project ➛ Version Control ➛ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click hw-1.qmd to open the template Quarto file. This is where you will write up your code and narrative for the homework.
Packages
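The package list below is a minimal sketch, not the official setup chunk. It is inferred from what the rest of the assignment uses: the tidyverse for mutate(), read_csv(), fct_relevel(), and ggplot2, and openintro for the duke_forest data. Both packages are assumed to be installed already.

```r
# Assumed package list, inferred from the functions and data used in this homework
library(tidyverse)  # mutate(), if_else(), str_detect(), read_csv(), ggplot2, forcats
library(openintro)  # provides the duke_forest dataset
```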
Guidelines + tips
As we’ve discussed in lecture (or will on Tuesday), your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices. For examples and resources on plot aesthetics, see Chapter 2 of the textbook, “R for Data Science”, H. Wickham (2022).
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders in this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Note: Do not let R output answer the question for you unless the question specifically asks for just a plot. For example, if the question asks for the number of columns in the data set, please type out the number of columns. You may lose points if you do not.
Exercises
Data 1: Duke Forest houses
Use the duke_forest dataset for Exercises 1 and 2.
For the following two exercises you will work with data on houses that were sold in the Duke Forest neighborhood of Durham, NC in November 2020. The duke_forest dataset comes from the openintro package. You can see a list of the variables on the package website or by running ?duke_forest in your console.
Exercise 1
Suppose you’re helping some family friends who are looking to buy a house in Duke Forest. As they browse Zillow listings, they realize some houses have garages and others don’t, and they wonder: Does having a garage make a difference for the price?
Luckily, you can help them answer this question with data visualization!
- Make histograms of the prices of houses in Duke Forest based on whether they have a garage.
  - In order to do this, you will first need to create a new variable called garage (with levels "Garage" and "No garage").
  - Below is the code for creating this new variable. Here, we mutate() the duke_forest data frame to add a new variable called garage which takes the value "Garage" if the text string "Garage" is detected in the parking variable and takes the value "No garage" if not.
  - You will add to the existing pipeline (code) to make the histogram.
duke_forest |>
  mutate(garage = if_else(str_detect(parking, "Garage"), "Garage", "No garage"))
- Then, facet by garage and use different colors for the two facets.
- Choose an appropriate binwidth and decide whether a legend is needed, and turn it off if not.
- Include an informative title and axis labels.
- Finally, include a brief (2-3 sentence) narrative comparing the distributions of prices of Duke Forest houses that do and don’t have garages (for example comparing the center, spread and outliers of the distributions). Your narrative should touch on whether having a garage “makes a difference” in terms of the price of the house.
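The idea of adding to an existing pipeline can be sketched on a different dataset. The example below uses ggplot2's built-in mpg data rather than duke_forest, and drive_type is a hypothetical variable created only for this illustration. It shows the general pattern, not the answer to this exercise: the pipe |> passes the mutated data frame into ggplot(), after which layers are added with +.

```r
library(tidyverse)

# Illustration on ggplot2's built-in mpg data, not the homework data.
# drive_type is a hypothetical variable created just for this sketch.
p <- mpg |>
  mutate(drive_type = if_else(str_detect(drv, "4"), "4WD", "2WD")) |>
  ggplot(aes(x = hwy)) +        # data arrives from the pipe; layers use +
  geom_histogram(binwidth = 2)  # binwidth is in the units of the x variable

p
```

The binwidth of 2 here is an arbitrary choice for highway mileage. For house prices, a sensible binwidth will be on a very different scale.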
Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 2
Usually, we expect that within any market, larger houses will have higher prices. We can also expect that there exists a relation between the age of a house and its price. However, in some markets newer houses might be more expensive, while in other markets antique houses will have ‘more character’ than newer ones and have higher prices. In this question, we will explore the relations among age, size and price of houses.
Your family friends ask: “In Duke Forest, do houses that are bigger and more expensive tend to be newer than smaller and cheaper ones?”
Once again, data visualization skills to the rescue!
- Create a scatter plot to explore the relationship between price and area, and also display information about year_built (that is, conditioning on year_built).
- Use geom_smooth() with the argument se = FALSE to add a smooth curve fit to the data and color the points by year_built.
- Include informative title, axis, and legend labels.
- Discuss each of the following claims (1-2 sentences per claim). Use elements you observe in your plot as evidence for or against each claim.
- Claim 1: Larger houses are priced higher.
- Claim 2: Newer houses are priced higher.
- Claim 3: Bigger and more expensive houses tend to be newer than smaller and cheaper ones.
Now is a good time to render, commit, and push.
Make sure that you commit and push ALL changed documents and your Git pane is completely empty before proceeding.
Data 2: BRFSS
Use this dataset for Exercises 3 through 5.
The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
Source: cdc.gov/brfss
In the following exercises we will work with data from the 2020 BRFSS survey. The data originally come from here, though we will work with a random sample of responses and a small number of variables from the data provided. These have already been sampled for you and the dataset you’ll use can be found in the data folder of your repo. It’s called brfss.csv.
brfss <- read_csv("data/brfss.csv")
Exercise 3
- How many rows are in the brfss dataset? What does each row represent?
- How many columns are in the brfss dataset? Indicate the type of each variable in the dataset (e.g., numerical, categorical, etc.).
- Include the code and resulting output used to support your answer.
Now is a good time to render, commit, and push.
Exercise 4
Do people who smoke more tend to have worse health conditions?
- Use a segmented bar chart to visualize the relationship between smoking (smoke_freq) and general health (general_health). Decide on which variable to represent with bars and which variable to fill the color of the bars by.
- Pay attention to the order of the bars and, if need be, use the fct_relevel function to reorder the levels of the variables.
  - Below is sample code for releveling general_health. Here we first convert general_health to a factor (how R stores categorical data) and then order the levels from Excellent to Poor.
  - You will add to the existing pipeline (code) to make the segmented bar chart.
brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good", "Good", "Fair", "Poor")
  )
- Include informative title, axis, and legend labels.
- Comment on the motivating question based on evidence from the visualization: Do people who smoke more tend to have worse health conditions?
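To see why releveling matters, here is a small sketch on made-up data. The toy tibble and its size variable are hypothetical and not part of brfss. By default as.factor() orders the levels of a character variable alphabetically, which is rarely the order you want on a plot; fct_relevel() moves the named levels to the front in the order given.

```r
library(tidyverse)

# Hypothetical toy data, made up for illustration only
toy <- tibble(size = c("Large", "Small", "Medium", "Small"))

# as.factor() orders levels alphabetically by default
levels(as.factor(toy$size))
# "Large" "Medium" "Small"

# Convert to a factor, then put the levels in a meaningful order
toy_releveled <- toy |>
  mutate(
    size = as.factor(size),
    size = fct_relevel(size, "Small", "Medium", "Large")
  )

levels(toy_releveled$size)
# "Small" "Medium" "Large"
```

Bars and legend entries in ggplot2 follow the factor's level order, so the releveled version would plot Small, Medium, Large rather than the alphabetical order.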
Now is a good time to render, commit, and push.
Exercise 5
How are sleep and general health associated?
- Create a visualization displaying the relationship between sleep and general_health.
- Include informative title and axis labels.
- Modify your plot to use a different theme than the default.
- Comment on the motivating question based on evidence from the visualization: How are sleep and general health associated?
Now is a good time to render, commit, and push.
Exercise 6
- Fill in the blanks:
- The gg in the name of the package ggplot2 stands for ___.
- If you map the same continuous variable to both x and y aesthetics in a scatterplot, you get a straight ___ line. (Choose between “vertical”, “horizontal”, or “diagonal”.)
- Code style: Fix up the code style by adding spaces and line breaks where needed. Briefly describe your fixes. (Hint: Refer to the Tidyverse style guide.)
ggplot(data=mpg,mapping=aes(x=drv,fill=class))+geom_bar() +scale_fill_viridis_d()
- Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments (have a look at ?facet_grid)?
Render, commit, and push one last time.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials, then Duke Net ID, and log in using your Net ID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
- Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 7 points
- Exercise 2: 9 points
- Exercise 3: 5 points
- Exercise 4: 9 points
- Exercise 5: 7 points
- Exercise 6: 8 points
- Workflow + formatting: 5 points
- Total: 50 points
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:
- linking all pages appropriately on Gradescope
- putting your name in the YAML at the top of the document
- committing the submitted version of your .qmd to GitHub
- Are you under the 80 character code limit? (You shouldn’t have to scroll to see all your code). Pipes %>%, |> and ggplot layers + should be followed by a new line
- You should be consistent with stylistic choices, e.g. only use 1 of = vs <- and %>% vs |>
- All binary operators should be surrounded by space. For example x + y is appropriate. x+y is not.