Data Ethics and Bias

Lecture 11

Dr. Elijah Meyer

Duke University
STA 199 - Fall 2023

2023-10-03

Checklist

– Congrats! Exam-1 is done!

– Clone ae-010

– hw-4 - released Thursday

– lab-4 will be in groups

Group Work

Without teamwork, real-world data science problems would be impossible to solve

– We are not experts in everything

– Learn from different perspectives

– Efficiency

– etc

Lab-4 and Beyond

Practice working in groups

– Technical skills

– Communication skills

Group Formation

Fill out the survey on Sakai (due Wednesday at 11:59 PM)

– Groups of 4-5

– Preferences will be taken into account (group members need to be in the same lab)

– Randomly assigned

Questions

Goals

  • Think

  • Data Ethics

  • Data privacy

  • Bias

Bias

Warm Up: Types of Bias

– Response Bias

– Non-Response Bias

– Sampling Bias

Response Bias

– A systematic pattern of inaccurate or false responses driven by external factors (e.g., social pressure, a wish to avoid scrutiny)

Response Bias (Example)

  • In North Carolina, you need to be 18 to legally get a tattoo. Do you think the legal age of drinking should be lowered to 18?

  • Do you think the legal age of drinking should be lowered to 18?

Response Bias (Example)

  • Do you think it’s appropriate to drink alcohol every single day?

  • The respondent may want to answer honestly, but may instead give a more socially acceptable answer

Response Bias (Example)

  • What is your political party affiliation?

  • Do you approve of the president?

  • Answering the first question could influence how you answer the next one, in order to avoid social pressure or judgment

Response Bias: Your Turn

– Suppose you wanted to collect data on students’ GPAs. What is an example of a question (or a way of asking it) that may elicit response bias?

What can we do?

– Think about leading questions

– Think about how multiple questions relate to each other

– Think about what your question is asking, and if you are asking multiple questions

– Think about sensitive questions

– Think about the information you are collecting

– Think about “setting the stage”

Non-response Bias

  • When participants do not answer question(s) due to some systematic process / pattern

Non-response Bias (Example)

  • Did you cheat on the STA199 Exam?

  • Students who did may be more likely not to respond

Non-response Bias (Example)

  • Assessing the link between smoking and heart disease

  • Studies generally show that respondents report better health outcomes and more positive health-related behaviors than nonrespondents. People with poorer health tend to avoid participating in health surveys
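A minimal sketch of how non-response can distort an estimate, using the exam-cheating question from the earlier example. All numbers (class size, true cheating rate, response rates) are hypothetical and chosen only for illustration, and `observed_rate` is a helper name introduced here.

```python
def observed_rate(n_yes, n_no, resp_yes, resp_no):
    """Share of 'yes' answers among the people who actually respond.

    n_yes, n_no       : hypothetical counts of true 'yes' and true 'no' people
    resp_yes, resp_no : hypothetical probability that each group responds
    """
    yes_responses = n_yes * resp_yes
    no_responses = n_no * resp_no
    return yes_responses / (yes_responses + no_responses)


# Hypothetical class of 200 students in which 40 (20%) cheated.
n_cheated, n_honest = 40, 160
true_rate = n_cheated / (n_cheated + n_honest)

# Assumption: students who cheated skip the question far more often.
biased = observed_rate(n_cheated, n_honest, resp_yes=0.25, resp_no=0.90)

# If both groups responded at the same rate, the estimate would match the truth.
unbiased = observed_rate(n_cheated, n_honest, resp_yes=0.90, resp_no=0.90)

print(f"true rate:                     {true_rate:.3f}")
print(f"survey rate, unequal response: {biased:.3f}")
print(f"survey rate, equal response:   {unbiased:.3f}")
```

The survey understates the true rate only when the chance of responding depends on the answer itself; missingness that is unrelated to the answer leaves the estimate unchanged in this sketch.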

Non-response Bias: Your Turn

Suppose you were interested in alcohol consumption by residents in Durham. What is an example of a question (or how it is asked) that may elicit non-response bias?

What can we do?

– Think critically about your target audience

– Think about potential sensitive questions

– Think about language barriers

– Think about how participants are being contacted

Summary Example

  • What is your name?

  • Who is your least favorite instructor at Duke?

  • Response bias if there is a pattern of inaccurate responses

  • Non-response bias if there is a pattern of missingness

Sampling Bias

– Bias that arises when a sample is collected in such a way that some members of the intended population have a lower or higher probability of being sampled than others

Sampling Bias (Example)

– I’m interested in the heights of all Duke students! I go out and collect data on:

  • The basketball team

  • This 199 class

  • Students walking by the chapel

  • Put a survey out on Facebook

What can we do?

– Random Sample

  • Try really hard?
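A rough sketch comparing a convenience sample with a simple random sample for the height example. The population is simulated: the 6,000 students, the mean of 170 cm, and the standard deviation of 9 cm are made-up values, and "the 15 tallest students" is only a crude stand-in for measuring the basketball team.

```python
import random
from statistics import mean

rng = random.Random(199)  # fixed seed so the sketch is reproducible

# Hypothetical population: 6,000 student heights in cm (made-up parameters).
population = [rng.gauss(170, 9) for _ in range(6000)]

# Convenience sample: the 15 tallest students, a crude stand-in for
# "only measure the basketball team".
convenience_sample = sorted(population)[-15:]

# Simple random sample: every student has the same chance of being selected.
random_sample = rng.sample(population, 100)

pop_mean = mean(population)
convenience_mean = mean(convenience_sample)
random_mean = mean(random_sample)

print(f"population mean:         {pop_mean:.1f} cm")
print(f"convenience sample mean: {convenience_mean:.1f} cm")
print(f"random sample mean:      {random_mean:.1f} cm")
```

The random sample is not guaranteed to hit the population mean exactly, but its error comes from chance rather than from how the students were chosen.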

ae-10

Gettysburg Activity: Picking words that are representative of word length
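The activity itself is not reproduced here. As a sketch of the idea, assume the goal is to estimate average word length, and assume that words picked "by eye" are chosen with probability proportional to their length (longer words catch the eye). Only the first sentence of the Gettysburg Address is used as the population, and the variable names are introduced here for illustration.

```python
import re

# First sentence of the Gettysburg Address, used here as a tiny "population".
text = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)

words = re.findall(r"[A-Za-z]+", text)
lengths = [len(w) for w in words]

# Population mean: what a simple random sample of words estimates on average.
population_mean = sum(lengths) / len(lengths)

# Assumption: a word's chance of being picked "by eye" is proportional to its
# length. The expected length of one word picked this way is sum(L^2) / sum(L).
size_biased_mean = sum(n * n for n in lengths) / sum(lengths)

print(f"number of words:                  {len(words)}")
print(f"mean word length (population):    {population_mean:.2f}")
print(f"expected length, eye-picked word: {size_biased_mean:.2f}")
```

Under this assumption the eye-picked estimate is larger than the population mean whenever word lengths vary, which is the pattern a random sample is meant to avoid.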

Data Privacy + Intent

Every time we use apps, websites, and devices, our data is being collected and then used or sold to others. More importantly, law enforcement, financial institutions, and governments make decisions based on these data, and those decisions directly affect people’s lives.

What are you okay with?

  • Name
  • Age
  • Email
  • Phone Number
  • List of every video you watch
  • List of every video you comment on
  • How you type: speed, accuracy
  • How long you spend on different content
  • List of all your private messages (date, time, person sent to)
  • Info about your photos (how it was taken, where it was taken (GPS), when it was taken)
  • Browsing history

Intended use

– Name

– Age

– Email

– Phone Number

– How long you spend on different content

– List of all your private messages (date, time, person sent to)

– Info about your photos (how it was taken, where it was taken (GPS), when it was taken)

– Browsing history

  • Targeted ads
  • Screening candidates for a job (Amazon has done this)
  • Predicting your race to map votes (this is being done)

ae-10-part 2

Privacy

Case study: OkCupid

OkCupid data breach

– In 2016, researchers published data on 70,000 OkCupid users, including usernames, political leanings, drug usage, and intimate sexual details

– The researchers didn’t release the real names or pictures of OkCupid users, but identities could easily be uncovered from the details provided, e.g., usernames

“Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.”

– Researchers Emil Kirkegaard and Julius Daugbjerg Bjerrekær
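To see why keeping usernames undermines the removal of real names, here is a small sketch of re-identification by linking two tables on a shared username. Every record, username, and name below is invented for illustration, and `reidentify` is a hypothetical helper, not anything taken from the actual dataset.

```python
# Hypothetical "de-identified" release: no real names, but usernames are kept.
released = [
    {"username": "bluedevil_42", "politics": "liberal"},
    {"username": "nc_hiker", "politics": "conservative"},
    {"username": "quiet_owl", "politics": "moderate"},
]

# Hypothetical public profiles from an unrelated site, where people reuse
# the same username next to their real name.
public_profiles = {
    "bluedevil_42": "Alex Example",
    "nc_hiker": "Sam Sample",
}


def reidentify(records, profiles):
    """Attach a real name to every record whose username appears in profiles."""
    return [
        {**record, "name": profiles[record["username"]]}
        for record in records
        if record["username"] in profiles
    ]


matches = reidentify(released, public_profiles)
share_identified = len(matches) / len(released)

for match in matches:
    print(match["name"], "->", match["politics"])
print(f"share of released records re-identified: {share_identified:.2f}")
```

No single table contains both a real name and a sensitive answer, yet the join produces exactly that pairing. This is one reason "already publicly available" does not settle the privacy question.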

Questions for you

  • Should you scrape these data?

  • How do you not violate reasonable expectations of privacy?

Summary

Bias

Bias is a disproportionate weight in favor of or against an idea or thing

  • We all have bias

  • Bias can be a part of science and research

Examples

Facial Recognition

Parting Thoughts

  • Ask questions

  • Slow down

  • Think critically