HW 4 - Bias + Ethics

Important

This homework is due Friday, Oct 13 at 11:59 pm ET.

Important

Homework is to be completed and turned in individually, as usual.

Important

Exercise 6 requires you to be on the Duke network or connected to the Duke VPN.

Getting Started

Go to the sta199-f23-1 organization on GitHub. Click on the repo with the prefix hw4. It contains the starter documents you need to complete the homework assignment.
Clone the repo and start a new project in RStudio.

Packages

We’ll use the tidyverse package for much of the data wrangling and visualization and the openintro package for the babies data in Exercise 2, though you’re welcome to load other packages as needed.

library(tidyverse)
library(openintro)

Exercise 1

The following chart was shared by @GraphCrimes on Twitter on September 3, 2022.

a. What is misleading about this graph?
b. Suppose you wanted to recreate this plot with improvements that avoid the misleading features you identified in part (a). You would need the data from the survey to do that.

How many observations would these data have?
How many variables (at least) should they have, and what should those variables be?

c. Load the data for this survey from data/survation.csv. Confirm that the data match the percentages from the visualization. That is, calculate the percentages of public sector, private sector, and don’t know responses for each of the services and check that they match the percentages in the plot.
d. Recreate the visualization, and improve it. You only need to submit the improved version, not an exact recreation of the misleading graph.

Does the improved visualization look different from the original?
Does it send a different message at first glance?
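The percentage check described in this exercise can be sketched on made-up data. This is a minimal illustration, not the solution: the tibble `toy_survey`, its column names `service` and `answer`, and its one-row-per-response layout are all assumptions, and the real data/survation.csv may be organized differently.

```r
library(dplyr)

# Hypothetical toy data (NOT the survey file); column names are assumptions.
toy_survey <- tibble(
  service = c("Rail", "Rail", "Rail", "Rail",
              "Water", "Water", "Water", "Water"),
  answer  = c("Public", "Public", "Private", "Don't know",
              "Public", "Public", "Public", "Private")
)

toy_percentages <- toy_survey |>
  count(service, answer) |>               # responses per service/answer pair
  group_by(service) |>                    # percentages are computed within a service
  mutate(percent = 100 * n / sum(n)) |>
  ungroup()

toy_percentages
```

Grouping by `service` before dividing is the design choice that matters here: it makes the percentages sum to 100 within each service rather than across the whole table.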

Exercise 2

Let’s now look at the babies dataset from the openintro package. The data contain observations about pregnancies in the 1960s among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. Run help(babies) for information about the variables in the dataset. (Note that the variable smoke takes the value 1 if the mother is a smoker, while the variable parity takes the value 0 to indicate a first pregnancy.)

Your fellow researchers are interested in exploring how a baby’s weight at birth varies according to different characteristics of the mother and of the pregnancy. We define a baby as underweight if their body weight at birth is less than or equal to 100 ounces.

a. Make a tibble that shows the number of underweight babies for every possible combination of the first-pregnancy indicator and the mother’s smoking status. Your tibble should not display rows containing NA values.
b. The other researchers made the barplot below, which shows the count of babies and the share of underweight babies by whether the mother was a smoker and whether the baby was from a first pregnancy. Do you spot any inconsistencies across the two visualizations? If you do spot any, state what they are and discuss why they can be problematic.

c. Your fellow researchers come up with different ways to change the plot above:

Researcher 1 suggests fixing the y-axis across the two facets (keeping Count) to make the plot less misleading.
Researcher 2 suggests that the vertical axis should be in terms of percentage across the two facets.

Reproduce the plot above twice, implementing each solution separately (we are asking for two plots, one with the solution by Researcher 1 and one with the solution by Researcher 2).

d. You are interested in assessing the relationship between smoking and birth status. Which of the two researchers’ plots is more appropriate (less misleading)? Justify your answer.
e. Now, using the plot you justified in part (d), let’s answer the question: Is there a relationship between smoking and birth status? Be sure to comment on and compare underweight babies across all subgroups.
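The two axis choices discussed in this exercise (a shared count axis versus a percentage axis) can be sketched on made-up data. This is only an illustration of the general technique under stated assumptions: `toy_births` and its columns `smoker`, `first`, and `status` are hypothetical and are not the babies data, and the plots are not meant to match the researchers’ figure.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (NOT the babies dataset); column names are assumptions.
toy_births <- tibble(
  smoker = c("Yes", "Yes", "Yes", "No", "No", "No", "No", NA),
  first  = c("First", "Later", "Later", "First", "First", "Later", NA, "Later"),
  status = c("Under", "Normal", "Under", "Normal", "Normal", "Under",
             "Normal", "Normal")
)

# Drop rows with a missing grouping value before counting or plotting.
toy_complete <- toy_births |>
  filter(!is.na(smoker), !is.na(first))

# One row per observed combination of the three variables.
toy_counts <- toy_complete |>
  count(smoker, first, status)

# Counts with a shared y-axis: facet_wrap() uses scales = "fixed" by default.
p_counts <- ggplot(toy_complete, aes(x = smoker, fill = status)) +
  geom_bar() +
  facet_wrap(~first) +
  labs(y = "Count")

# Shares within each bar: position = "fill" rescales every bar to height 1.
p_share <- ggplot(toy_complete, aes(x = smoker, fill = status)) +
  geom_bar(position = "fill") +
  facet_wrap(~first) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(y = "Percentage")
```

Printing `p_counts` or `p_share` draws the corresponding plot. The count version keeps group sizes visible, while the filled version hides them and shows only the composition of each bar.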

Remember to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 3

A company wants to administer a survey to a sample of its current and past employees to learn their consumption habits in relation to periods of burnout. The study will be based on a survey administered via the company website. Participants will be asked questions about burnout experiences while working for the company and in previous jobs and questions about their consumption habits.

In this context, provide examples that may result in evidence of two of the following three: sampling bias, response bias, and non-response bias. Think about which kinds of questions are asked, how they are asked, and so on, and how these features of the company survey could give rise to such biases. Justify your answer.

Exercise 4

A data scientist compiled data from several public sources (voter registration, political contributions, tax records) that were used to predict sexual orientation of individuals in a community. What ethical considerations arise that should guide use of such data sets?¹

Once again, render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 5

A data analyst received permission to post a data set that was scraped from a social media site. The full data set included name, screen name, email address, geographic location, IP (internet protocol) address, demographic profiles, and preferences for relationships. Why might it be problematic to post a deidentified form of this data set where name and email address were removed?²

Exercise 6

To complete this exercise you will first need to watch the documentary Coded Bias. To do so, you either need to be on the Duke network or connected to the Duke VPN. Then go to https://find.library.duke.edu/catalog/DUKE009834953 and click on “View Online”. Once you have watched the documentary, write a one-paragraph reflection highlighting at least one thing that you already knew about (from the course prep materials) and at least one thing you learned from the movie, as well as any other aspects of the documentary that you found interesting or enlightening.

Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

Go to Gradescope and click Log in in the top right corner.
Click School Credentials, then Duke NetID, and log in using your NetID credentials.
Click on your STA 199 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.

Grading

Exercise 1: 10 points
Exercise 2: 10 points
Exercise 3: 10 points
Exercise 4: 3 points
Exercise 5: 3 points
Exercise 6: 10 points
Workflow + formatting: 4 points
Total: 50 points

Footnotes

1. This exercise is from MDSR, Chp 8.
2. This exercise is from MDSR, Chp 8.