AE 06: Joining Practice - Suggested Answers

Application exercise
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

if_else

Instead of changing vs to a factor and changing the labels, let’s create a new variable called vs_cat

mtcars |>
  mutate(vs_cat = if_else(vs == 1, "straight", "v-shaped")) |>
  select(vs, vs_cat)
                    vs   vs_cat
Mazda RX4            0 v-shaped
Mazda RX4 Wag        0 v-shaped
Datsun 710           1 straight
Hornet 4 Drive       1 straight
Hornet Sportabout    0 v-shaped
Valiant              1 straight
Duster 360           0 v-shaped
Merc 240D            1 straight
Merc 230             1 straight
Merc 280             1 straight
Merc 280C            1 straight
Merc 450SE           0 v-shaped
Merc 450SL           0 v-shaped
Merc 450SLC          0 v-shaped
Cadillac Fleetwood   0 v-shaped
Lincoln Continental  0 v-shaped
Chrysler Imperial    0 v-shaped
Fiat 128             1 straight
Honda Civic          1 straight
Toyota Corolla       1 straight
Toyota Corona        1 straight
Dodge Challenger     0 v-shaped
AMC Javelin          0 v-shaped
Camaro Z28           0 v-shaped
Pontiac Firebird     0 v-shaped
Fiat X1-9            1 straight
Porsche 914-2        0 v-shaped
Lotus Europa         1 straight
Ford Pantera L       0 v-shaped
Ferrari Dino         0 v-shaped
Maserati Bora        0 v-shaped
Volvo 142E           1 straight

Friendly reminder that if you wanted to change the vs variable instead of making a new one, we can do that too!

scales

scales

  • scale_x_continuous
  • scale_y_continuous
  • scale_x_discrete
  • scale_y_discrete

In the following activity, we are going to define the four following functions and nest label and breaks within each to really take control of our plots.

Dollar

Plots look better with labels. We can also label the units using the above functions + labels! Comment the code below.

mtcars |>
  ggplot(
    aes(x = wt, y = mpg)
  ) + 
  geom_point() + 
  scale_x_continuous(labels = label_dollar())

Now, suppose we wanted to change the scale of the x-axis and add a suffix. We can do that within label_dollar using the following:

Hint: Run ?label_dollar

mtcars |> #comment the code up
  ggplot(
    aes(x = wt, y = mpg)
  ) + 
  geom_point() + 
  scale_x_continuous(labels = label_dollar(scale = 1/10 , suffix = "USD"))

Percents

We can also add percents. See the bottom of this qmd file for more documentation on other labels we can add.

mtcars |>
  ggplot(
    aes(x = wt, y = mpg)
  ) + 
  geom_point() +
  scale_y_continuous(labels = percent)

Breaks

Often, we can use these scale functions to control the range of our axes. To do so, we set up a sequence with a start, finish, and tell R how to move between the points.

mtcars |>
  ggplot(
    aes(x = wt, y = mpg)
  ) + 
  geom_point() +
  scale_x_continuous(
    breaks = seq(from = 0, to = 5, by = 0.5), limits = c(0,5)) 
Warning: Removed 3 rows containing missing values (`geom_point()`).

How does this change if we are working with a categorical variable?

We don’t have a quantitative variable to sequence over in the following situation. When working with a quantitative variable, we may want to change the label of the groups. Their are multiple ways to do this. 1) We could change the physical data set using mutate, or we could relabel our plot using the following:

mtcars |>
  mutate(vs = as.factor(vs)) |>
  ggplot(
    aes(x = vs, y = mpg)
  ) + 
  geom_boxplot() + 
  scale_x_discrete(labels = c(
    "0" = "v-shape" , 
    "1" = "straight"
  ))

The entire package booklet can be found here

This is just a quick example. You can do so much more. For an outside source, please visit here

Important

Go to the course GitHub organization and locate the repo titled ae-06-YOUR_GITHUB_USERNAME to get started.

This AE is due Saturday, Sep 22nd at 11:59pm.

Working with multiple data frames

Often instead of being provided the data you need for your analysis in a single data frame, you will need to bring information from multiple datasets together into a data frame yourself. These datasets will be linked to each other via a column (usually an identifier, something that links the two datasets together) that you can use to join them together.

There are many possible types of joins. All have the format something_join(x, y).

Study Tip change the x and y data sets to different situations you are curious about and practice joining them together to see what happens!

x <- tibble(
  value = c(1, 2, 3),
  xcol = c("x1", "x2", "x3")
  )

y <- tibble(
  value = c(1, 2, 4),
  ycol = c("y1", "y2", "y4")
  )

x
# A tibble: 3 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
y
# A tibble: 3 × 2
  value ycol 
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y4   

We will demonstrate each of the joins on these small, toy datasets.

Note: These functions below know to join x and y by value because each dataset has value as a column. See for yourself!

[1] "value" "xcol" 
[1] "value" "ycol" 

inner_join() all rows in x matching y

Joining with `by = join_by(value)`
# A tibble: 2 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
x |>
  inner_join(y)
Joining with `by = join_by(value)`
# A tibble: 2 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   

left_join() include all rows in x and adds y information when there is a match

left_join(x , y)
Joining with `by = join_by(value)`
# A tibble: 3 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

right_join() includes all rows in y and adds x information when there is a match

right_join(x, y) 
Joining with `by = join_by(value)`
# A tibble: 3 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y4   

full_join() include all rows from both x and y

full_join(x, y)
Joining with `by = join_by(value)`
# A tibble: 4 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y4   

semi_join()return all rows of x that match y

Joining with `by = join_by(value)`
# A tibble: 2 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2   

anti_join() returns all rows of x that do not match with y

anti_join(x, y)
Joining with `by = join_by(value)`
# A tibble: 1 × 2
  value xcol 
  <dbl> <chr>
1     3 x3   

Joining by more than one variable

x2 <- tibble(
  value = c(1, 2, 3),
  value2 = c(4,5,6),
  xcol = c("x1", "x2", "x3")
  )

y2<- tibble(
  value = c(1, 2, 4), 
  value2 = c(4,5,7),
  ycol = c("y1", "y2", "y4")
  )

x2
# A tibble: 3 × 3
  value value2 xcol 
  <dbl>  <dbl> <chr>
1     1      4 x1   
2     2      5 x2   
3     3      6 x3   
y2
# A tibble: 3 × 3
  value value2 ycol 
  <dbl>  <dbl> <chr>
1     1      4 y1   
2     2      5 y2   
3     4      7 y4   

inner_join()

inner_join(x2, y2, by = c("value"))
# A tibble: 2 × 5
  value value2.x xcol  value2.y ycol 
  <dbl>    <dbl> <chr>    <dbl> <chr>
1     1        4 x1           4 y1   
2     2        5 x2           5 y2   
inner_join(x2, y2 , by = c("value" , "value2"))
# A tibble: 2 × 4
  value value2 xcol  ycol 
  <dbl>  <dbl> <chr> <chr>
1     1      4 x1    y1   
2     2      5 x2    y2   

left_join()

left_join(x2 , y2, by = c("value"))
# A tibble: 3 × 5
  value value2.x xcol  value2.y ycol 
  <dbl>    <dbl> <chr>    <dbl> <chr>
1     1        4 x1           4 y1   
2     2        5 x2           5 y2   
3     3        6 x3          NA <NA> 
left_join(x2 , y2, by = c("value" , "value2"))
# A tibble: 3 × 4
  value value2 xcol  ycol 
  <dbl>  <dbl> <chr> <chr>
1     1      4 x1    y1   
2     2      5 x2    y2   
3     3      6 x3    <NA> 

Takeaway: R will default to join 2 data sets together by ALL common variable names. If we don’t want this, we can override that by setting the key using by = . When we have more than one variable as the key, we need to look for matches across the first variable AND for matches across the second variable before joining data sets together.