AE 10: Types of Bias and Intent - Suggested Answers

Application exercise

Packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Part 1 - Gettysburg Address

During the United States Civil War, President Abraham Lincoln delivered the Gettysburg Address. We are going to use this speech to talk about sampling. Suppose we are interested in understanding the average word length of the Gettysburg Address.

Quickly sample 10 words from the speech. For each word you sample, record its number of characters below to create a vector of your 10 sampled word lengths. Then, use the code to take the mean. Please change eval: false to eval: true in the chunk options after you have picked your 10 words.

my_sample <- c(5, , , , , , , , ,) # fill in the remaining nine word lengths

mean(my_sample)

Now, we are going to see if there is evidence of sampling bias.

If there were no sampling bias, we would expect the mean of everyone’s samples to be the population mean of the Gettysburg Address (or close to it).

What is the population mean word length of the Gettysburg Address?

4.22
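
For contrast, here is a minimal sketch of what truly random sampling would look like. It assumes gettysburg_words is a character vector containing every word of the speech, which is not provided in this exercise:

set.seed(10)

# Word lengths for the full speech (gettysburg_words is assumed, not provided)
word_lengths <- nchar(gettysburg_words)

# 1,000 truly random samples of 10 words each
random_means <- tibble(
  mean_length = replicate(1000, mean(sample(word_lengths, 10)))
)

# Random sample means should pile up around the population mean (red line)
random_means |>
  ggplot(aes(x = mean_length)) +
  geom_histogram(bins = 20) +
  geom_vline(xintercept = mean(word_lengths), colour = "red", linewidth = 2)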

Below, we are going to add the class data inside c(). Please feel free to add your own sample mean if we did not record your response.

Then, set the number of bins, and write the population mean word length after xintercept =.

data <- tibble(x = c(5.4, 5.6, 5.2, 5.5, 5, 4.8, 6, 4.1, 4.7, 5.6, 5.2, 4.7, 7.2, 4.5, 4.4, 3.5, 5, 5.7, 6.2, 5.1, 5, 4.9, 5.4, 5.4)) # class data inserted here

data |>
  summarize(mean_x = mean(x))

data |>
  ggplot() +
  geom_histogram(aes(x = x), bins = 20) +
  geom_vline(xintercept = 4.22, colour = "red", linewidth = 2) + # the true population mean
  geom_vline(xintercept = mean(data$x), colour = "blue", linewidth = 2) # the class mean

  • What are your major takeaways from this activity?

Taking a representative sample is hard! Notice that the class means cluster above the true mean of 4.22, because our eyes are drawn to longer, more memorable words.
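
To see this bias mechanically, here is a hedged sketch (reusing the assumed word_lengths vector from above) in which a word’s chance of being picked grows with its length:

set.seed(10)

# "Eye-catching" sampling: selection probability proportional to word length
biased_means <- replicate(
  1000,
  mean(sample(word_lengths, 10, prob = word_lengths))
)

mean(biased_means) # systematically above the population mean of 4.22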

  • How does this concept relate to bias in algorithms?

Algorithms are trained on data. If the data we train an algorithm on is collected with bias, the algorithm’s predictions may be biased as well.

Part 2 - Predicting ethnicity - data ethics

Your turn: Imai and Khanna (2016) built a racial prediction algorithm using a Bayes classifier trained on voter registration records from Florida and the U.S. Census Bureau’s Surname List.

Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records (Imai and Khanna, 2016)

In both political behavior research and voting rights litigation, turnout and vote choice for different racial groups are often inferred using aggregate election results and racial composition. Over the past several decades, many statistical methods have been proposed to address this ecological inference problem. We propose an alternative method to reduce aggregation bias by predicting individual-level ethnicity from voter registration records. Building on the existing methodological literature, we use Bayes’s rule to combine the Census Bureau’s Surname List with various information from geocoded voter registration records. We evaluate the performance of the proposed methodology using approximately nine million voter registration records from Florida, where self-reported ethnicity is available. We find that it is possible to reduce the false positive rate among Black and Latino voters to 6% and 3%, respectively, while maintaining the true positive rate above 80%. Moreover, we use our predictions to estimate turnout by race and find that our estimates yield substantially less bias and root mean squared error than standard ecological inference estimates. We provide open-source software to implement the proposed methodology.
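
To make the Bayes’s-rule idea concrete, here is a toy calculation of P(race | surname) ∝ P(surname | race) × P(race). All of the probabilities below are invented for illustration; they are not taken from the paper or the package:

# Made-up prior over racial categories and made-up surname likelihoods
p_race <- c(whi = 0.60, bla = 0.13, his = 0.18, asi = 0.06, oth = 0.03)
p_surname_given_race <- c(whi = 1e-4, bla = 5e-5, his = 2e-3, asi = 1e-5, oth = 2e-4)

# Bayes's rule: multiply likelihood by prior, then normalize
posterior <- p_surname_given_race * p_race
posterior / sum(posterior)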

The “open-source software” in question is the wru package: https://github.com/kosukeimai/wru.

  • Then, if you feel comfortable, install the wru package and try it out using the sample data provided in the package. And if you don’t feel comfortable doing so, take a look at the results below. Was the publication of this model ethical? Does the open-source nature of the code affect your answer? Is it ethical to use this software? Does your answer change depending on the intended use?
# install.packages("wru")  # run this in your console if wru is not installed on the server

library(tidyverse)
library(wru)

predict_race(voter.file = voters, surname.only = TRUE) |>
  select(surname, pred.whi, pred.bla, pred.his, pred.asi, pred.oth)
      surname    pred.whi    pred.bla     pred.his    pred.asi    pred.oth
1      Khanna 0.045110474 0.003067623 0.0068522723 0.860411906 0.084557725
2        Imai 0.052645440 0.001334812 0.0558160072 0.719376581 0.170827160
3      Rivera 0.043285692 0.008204605 0.9136195794 0.024316883 0.010573240
4     Fifield 0.895405704 0.001911388 0.0337464844 0.011079323 0.057857101
5        Zhou 0.006572555 0.001298962 0.0005388581 0.982365594 0.009224032
6    Ratkovic 0.861236727 0.008212824 0.0095395642 0.011334635 0.109676251
7     Johnson 0.543815322 0.344128607 0.0272403940 0.007405765 0.077409913
8       Lopez 0.038939877 0.004920643 0.9318797791 0.012154125 0.012105576
9       Morse 0.866360147 0.044429853 0.0246568086 0.010219712 0.054333479
10 Wantchekon 0.330697188 0.194700665 0.4042849478 0.021379541 0.048937658
  • If you have installed the package, re-run the code, this time to see what the package predicts for your race. Now consider the same questions again: Was the publication of this model ethical? Does the open-source nature of the code affect your answer? Is it ethical to use this software? Does your answer change depending on the intended use?
me <- tibble(surname = "") # type your surname between the quotes

predict_race(voter.file = me, surname.only = TRUE) #surname.only makes predictions of race based on name only
Warning: Unknown or uninitialised column: `state`.
Proceeding with last name predictions...
ℹ All local files already up-to-date!
ℹ All local files already up-to-date!
1 (100%) individuals' last names were not matched.
  surname pred.whi pred.bla pred.his pred.asi pred.oth
1               NA       NA       NA       NA       NA
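
The all-NA row appears because the surname field above was left blank, so nothing could be matched. For illustration only, here is the same call with a hypothetical surname filled in; your output will differ:

me <- tibble(surname = "Garcia") # an arbitrary example; use your own surname

predict_race(voter.file = me, surname.only = TRUE)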