AE 08: Data Types + R Practice - Suggested Answers

Application exercise
library(tidyverse)
library(scales)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Data Types

Type coercion

Demo: Determine the type of the following vector, then change its type to numeric.

x <- c("1", "2", "3")
typeof(x)
[1] "character"
as.numeric(x)
[1] 1 2 3

Let’s try another example…

y <- c("a", "b", "c")
typeof(y)
[1] "character"
as.numeric(y)
Warning: NAs introduced by coercion
[1] NA NA NA
z <- c("1", "2", "three")
typeof(z)
[1] "character"
as.numeric(z)
Warning: NAs introduced by coercion
[1]  1  2 NA
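
To see which elements are responsible for the warning, it can help to compare the vector before and after conversion. The following is a minimal sketch using the vector z from above; the names z_num and bad are hypothetical helpers introduced only for this illustration.

```r
z <- c("1", "2", "three")

# Convert once, silencing the coercion warning we already expect
z_num <- suppressWarnings(as.numeric(z))

# Elements that were not missing before conversion but are NA afterwards
bad <- z[is.na(z_num) & !is.na(z)]
bad
# [1] "three"
```

Checking this before overwriting a column shows exactly which responses would be lost.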

Survey Results

survey_results <- tibble(cars = c(1, 2, "three"))

survey_results
# A tibble: 3 × 1
  cars 
  <chr>
1 1    
2 2    
3 three

This is annoying because the third survey taker typed out the number as a word instead of providing it as a numeric value, so the whole column was read as character. Now you need to update the cars variable to be numeric. You do the following:

survey_results |>
  mutate(cars = as.numeric(cars))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `cars = as.numeric(cars)`.
Caused by warning:
! NAs introduced by coercion
# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3    NA

Why does this warning appear?

Because R doesn’t know how to convert “three” into a number.

as.numeric() can only convert strings that are formatted as numbers, such as “1” or “2.5”. Any other string, like “three”, cannot be translated to the numeric class and becomes NA.

And now things are even more annoying because you get a warning NAs introduced by coercion that happened while computing cars = as.numeric(cars) and the response from the third survey taker is now an NA (you lost their data). Fix your mutate() call to avoid this warning.

survey_results |> 
  mutate(
    cars = if_else(cars == "three", "3", cars),
    cars = as.numeric(cars)
  )
# A tibble: 3 × 1
   cars
  <dbl>
1     1
2     2
3     3

Note: There are many ways to handle problem values like this, depending on the situation. For example, case_when() can recode several values at once before converting, and replace_na() can fill in NAs with a chosen value after converting.
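
As a rough illustration of the case_when() approach, the sketch below recodes two spelled-out answers before converting. The data frame survey_more and its values are hypothetical and are not part of the exercise data.

```r
library(dplyr)
library(tibble)

# Hypothetical survey with more than one spelled-out response
survey_more <- tibble(cars = c("1", "two", "three", "4"))

survey_fixed <- survey_more |>
  mutate(
    cars = case_when(
      cars == "two" ~ "2",
      cars == "three" ~ "3",
      TRUE ~ cars # keep every other response as it is
    ),
    cars = as.numeric(cars) # no warning: every string is now a number
  )

survey_fixed
```

Because every string is recoded before as.numeric() runs, the conversion should produce no NAs and no warning.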

Your turn: First, guess the type of the vector. Then, check whether you guessed right. I’ve done the first one for you; you’ll see that it helps to check the type of each element of the vector first.

Different responses

v1 <- c(1, 1L, "C")

# to help you guess
typeof(1)
[1] "double"
typeof(1L)
[1] "integer"
typeof("C")
[1] "character"
# check the types of others here
# to check after you guess
typeof(v1)
[1] "character"
v2 <- c(1L , 0, "A")

# to help you guess
typeof(Inf)
[1] "double"
typeof(0)
[1] "double"
typeof("A")
[1] "character"
# check the types of others here
# to check after you guess
typeof(v2)
[1] "character"
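
The results above follow from how c() combines elements of different types: everything is converted to the most flexible type present, in the order logical, integer, double, character. A short sketch of that ordering, using made-up values:

```r
# Each pair is coerced to the more flexible of the two types
typeof(c(TRUE, 1L))   # "integer"
typeof(c(1L, 2.5))    # "double"
typeof(c(2.5, "a"))   # "character"
typeof(c(TRUE, "a"))  # "character"
```

A single character element is therefore enough to turn the whole vector into character, which is what happened with v1 and v2.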

R Practice

Joining Fisheries

fisheries <- read_csv("data/fisheries.csv")
Rows: 82 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (3): capture, aquaculture, total

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
continents <- read_csv("data/continents.csv")
Rows: 245 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Global aquaculture production

The Fisheries and Aquaculture Department of the Food and Agriculture Organization of the United Nations collects data on fisheries production of countries.

Our goal is to create a visualization of the mean share of aquaculture by continent.

Let’s start by looking at the fisheries data frame.

glimpse(fisheries)
Rows: 82
Columns: 4
$ country     <chr> "Angola", "Argentina", "Australia", "Bangladesh", "Brazil"…
$ capture     <dbl> 486490, 755226, 174629, 1674770, 705000, 629950, 233190, 8…
$ aquaculture <dbl> 655, 3673, 96847, 2203554, 581230, 172500, 2315, 200765, 9…
$ total       <dbl> 487145, 758899, 271476, 3878324, 1286230, 802450, 235505, …

We have the countries, but our goal is to make a visualization by continent. Let’s take a look at the continents data frame.

glimpse(continents)
Rows: 245
Columns: 2
$ country   <chr> "Afghanistan", "Åland Islands", "Albania", "Algeria", "Ameri…
$ continent <chr> "Asia", "Europe", "Europe", "Africa", "Oceania", "Europe", "…
  • Your turn (2 minutes):
    • Which variable(s) will we use to join the fisheries and continents data frames?
    • We want to keep all rows and columns from fisheries and add a column for corresponding continents. Which join function should we use?
  • Demo: Join the two data frames and assign the joined data frame back to fisheries.
fisheries <- left_join(fisheries, continents)
Joining with `by = join_by(country)`
  • Demo: Take a look at the updated fisheries data frame. There are some countries that were not in continents. First, identify which countries these are (they will have NA values for continent). Then, manually update the continent information for these countries using the case_when function. Finally, check that these updates have been made as intended and no countries are left without continent information.
fisheries |>
  filter(is.na(continent))
# A tibble: 3 × 5
  country                          capture aquaculture   total continent
  <chr>                              <dbl>       <dbl>   <dbl> <chr>    
1 Democratic Republic of the Congo  237372        3161  240533 <NA>     
2 Hong Kong                         142775        4258  147033 <NA>     
3 Myanmar                          2072390     1017644 3090034 <NA>     
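
Another way to find unmatched countries, without joining first, is anti_join(), which keeps the rows of the first data frame that have no match in the second. The sketch below uses two hypothetical miniature data frames, fish_small and cont_small, rather than the real files.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature versions of the two data frames
fish_small <- tibble(
  country = c("Angola", "Hong Kong", "Myanmar"),
  total = c(487145, 147033, 3090034)
)
cont_small <- tibble(
  country = c("Angola", "Argentina"),
  continent = c("Africa", "Americas")
)

# Rows of fish_small whose country does not appear in cont_small
unmatched <- anti_join(fish_small, cont_small, by = join_by(country))
unmatched
```

Here unmatched should contain the rows for Hong Kong and Myanmar, since only Angola appears in both data frames.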

Comment through the code below:

fisheries <- fisheries |> 
  mutate(
    continent = case_when(
    country == "Democratic Republic of the Congo" ~ "Africa",
    country == "Hong Kong" ~ "Asia",
    country == "Myanmar" ~ "Asia", 
    TRUE ~ continent # TRUE stands for everything else; ~ continent means to keep what continent is already there
    )
  )

When should we use if_else vs case_when?

Use if_else() when there is one condition (two possible outcomes) and case_when() when there are multiple conditions.
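
To make the contrast concrete, here is a small sketch on a hypothetical vector of proportions. The cutoffs and labels are made up for illustration.

```r
library(dplyr)

# Hypothetical aquaculture proportions
props <- c(0.001, 0.357, 0.568)

# One condition, two outcomes: if_else()
two_way <- if_else(props > 0.5, "mostly aquaculture", "mostly capture")

# Several conditions, checked in order from top to bottom: case_when()
three_way <- case_when(
  props < 0.01 ~ "negligible",
  props < 0.5 ~ "minority",
  TRUE ~ "majority" # everything not matched above
)

two_way
three_way
```

Each element gets the label of the first condition it satisfies, so the order of the case_when() conditions matters.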

  • Demo: Add a new column to the fisheries data frame called aq_prop. We will calculate it as aquaculture / total. Save the resulting frame as fisheries.
fisheries <- fisheries |>
  mutate(aq_prop = aquaculture / total)

glimpse(fisheries)
Rows: 82
Columns: 6
$ country     <chr> "Angola", "Argentina", "Australia", "Bangladesh", "Brazil"…
$ capture     <dbl> 486490, 755226, 174629, 1674770, 705000, 629950, 233190, 8…
$ aquaculture <dbl> 655, 3673, 96847, 2203554, 581230, 172500, 2315, 200765, 9…
$ total       <dbl> 487145, 758899, 271476, 3878324, 1286230, 802450, 235505, …
$ continent   <chr> "Africa", "Americas", "Oceania", "Asia", "Americas", "Asia…
$ aq_prop     <dbl> 0.0013445689, 0.0048399062, 0.3567424008, 0.5681717154, 0.…
  • Your turn (5 minutes): Now expand your calculations to also calculate the mean, minimum and maximum aquaculture proportion for continents in the fisheries data. Note that the functions for calculating minimum and maximum in R are min() and max() respectively.
fisheries |>
  group_by(continent) |>
  summarise(
    min_aq_prop = min(aq_prop),
    max_aq_prop = max(aq_prop),
    mean_aq_prop = mean(aq_prop)
  )
# A tibble: 5 × 4
  continent min_aq_prop max_aq_prop mean_aq_prop
  <chr>           <dbl>       <dbl>        <dbl>
1 Africa        0             0.803       0.0943
2 Americas      0             0.529       0.192 
3 Asia          0             0.782       0.367 
4 Europe        0.00682       0.618       0.165 
5 Oceania       0.0197        0.357       0.150 
  • Demo: Using your code above, create a new data frame called fisheries_summary that calculates minimum, mean, and maximum aquaculture proportion for each continent in the fisheries data.
fisheries_summary <- fisheries |>
  group_by(continent) |>
  summarize(
    min_aq_prop = min(aq_prop),
    max_aq_prop = max(aq_prop),
    mean_aq_prop = mean(aq_prop)
  )
  • Demo: Then, determine which continent has the largest value of max_aq_prop. Take the fisheries_summary data frame and order the results in descending order of maximum aquaculture proportion.
fisheries_summary |>
  arrange(desc(max_aq_prop))
# A tibble: 5 × 4
  continent min_aq_prop max_aq_prop mean_aq_prop
  <chr>           <dbl>       <dbl>        <dbl>
1 Africa        0             0.803       0.0943
2 Asia          0             0.782       0.367 
3 Europe        0.00682       0.618       0.165 
4 Americas      0             0.529       0.192 
5 Oceania       0.0197        0.357       0.150
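
If only the top row is needed rather than the full ordering, slice_max() is one alternative to arrange(desc(...)). The sketch below rebuilds part of the summary by hand, with values copied from the output above, under the hypothetical name summary_small.

```r
library(dplyr)
library(tibble)

# Continent names and maxima copied from the printed summary
summary_small <- tibble(
  continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  max_aq_prop = c(0.803, 0.529, 0.782, 0.618, 0.357)
)

# Keep only the row with the largest max_aq_prop
top_continent <- summary_small |>
  slice_max(max_aq_prop, n = 1)

top_continent
```

This returns the Africa row, matching the first row of the arranged output.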