Lab 3 - Data tidying

Lab
Important

This lab is due Thursday, September 28 at 5:00pm.

Learning goals

In this lab, you will…

  • use pivoting to reshape data
  • continue developing a workflow for reproducible data analysis
  • continue working with data visualization tools

Getting started

  • Go to the sta199-f23-1 organization on GitHub. Click on the repo with the prefix lab-03. It contains the starter documents you need to complete the lab.
  • Clone the repo and start a new project in RStudio. See the Lab 1 instructions for details on cloning a repo and starting a new R project.
  • First, open the Quarto document lab-03.qmd and Render it.
  • Make sure it compiles without errors.

Warm up

Before we introduce the data, let’s warm up with some simple exercises.

  • Update the YAML, changing the author name to your name, and render the document.
  • Commit your changes with a meaningful commit message.
  • Push your changes to GitHub.
  • Go to your repo on GitHub and confirm that your changes are visible in your .qmd and .pdf files. If anything is missing, render, commit, and push again.

Packages

We’ll use the tidyverse package for much of the data wrangling and the scales package for better plot labels. These packages are already installed for you. You can load them by running the following in your Console:

library(tidyverse)
library(scales)

Data

The datasets that you will work with in this lab come from the Organisation for Economic Co-operation and Development (OECD), stats.oecd.org.

Part 1: Inflation across the world

For this part of the analysis you will work with inflation data from various countries in the world over the last 30 years.

country_inflation <- read_csv("data/country-inflation.csv")
  1. The dataset contains the annual percent inflation experienced in each of several countries over a range of years. How is this information arranged in terms of rows and columns of the dataset? Specifically:

    • What does each row of the country_inflation dataset represent?
    • What are the columns in the dataset and what do they represent?
  2. Reshape (pivot) country_inflation such that each row represents a country/year combination, with columns country, year, and annual_inflation. Make sure that the newly created annual_inflation and year variables are numeric. Save the result as a new data frame named country_inflation_pivot. Glimpse the dataframe you created.
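
As a reminder of how pivoting from wide to long works, here is a minimal sketch on a made-up toy data frame. The object toy_wide, its column names, and its values are all hypothetical and are not the lab data, and the sketch assumes a layout with one column per year, so check your answer to Exercise 1 before adapting the idea.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one row per region, one column per year
toy_wide <- tribble(
  ~region, ~`2020`, ~`2021`,
  "North",     1.5,     2.0,
  "South",     0.8,     3.1
)

# Move the year columns into a single "year" column and their values into
# a single "value" column; names_transform converts the year labels, which
# start out as character strings, to numbers
toy_long <- toy_wide |>
  pivot_longer(
    cols = -region,
    names_to = "year",
    values_to = "value",
    names_transform = list(year = as.numeric)
  )

toy_long
```

Without names_transform, the new year column would be stored as character, because column names are always strings.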

  3. Create a vector called countries_of_interest which contains the names of the countries whose inflation rates you want to visualize over the years. For example, if these countries are “Country1”, “Country2”, and “Country3”, you can express this as follows:

countries_of_interest <- c("Country1", "Country2", "Country3")
  • Your countries_of_interest should consist of between three and five countries.
  • Make sure that the spelling of your countries matches how they appear in the dataset. Country names are case sensitive!
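
To see why spelling and capitalization matter, here is a small sketch of how exact string matching behaves in R. The names below are hypothetical placeholders, not values from the lab data.

```r
# Hypothetical names, not taken from the lab data
available_names <- c("North", "South", "West")
names_of_interest <- c("North", "south")

# %in% compares strings exactly, so "south" does not match "South"
names_of_interest %in% available_names

# setdiff() lists the entries that have no exact match
unmatched <- setdiff(names_of_interest, available_names)
unmatched
```

A check like this can help you spot a misspelled name before it silently drops out of a filtered dataset.
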
  4. In a single pipeline, filter your reshaped dataset to include only the countries_of_interest from the previous exercise and create a plot of annual inflation vs. year for these countries. Then, in a few sentences, describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of these countries’ economies.

    • Data should be represented with points as well as lines connecting the points for each country.
    • Each country should be represented by a different color line.
    • You may shorten/abbreviate country names as appropriate and necessary.
    • Axes and legend should be properly labeled.
    • The plot should have an appropriate title (and optionally a subtitle).
    • Axis labels for inflation should be shown in percentages (e.g., 25% not 25). Hint: The label_percent() function from the scales package will be useful.
    • Make sure you remove the eval: false option from the code chunk before you render.
ggplot(...) +
  ... +
  scale_y_continuous(labels = label_percent(scale = 1))
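
The scale = 1 argument in the hint matters when values are already stored as percentages (25 meaning 25%). A minimal sketch of the difference, using made-up numbers:

```r
library(scales)

# label_percent() returns a labelling function. By default it multiplies
# values by 100 first, which suits proportions such as 0.25
proportion_labels <- label_percent()(0.25)

# With scale = 1 the values are left as they are and only "%" is appended,
# which suits values already expressed in percent
percent_labels <- label_percent(scale = 1)(c(5, 25))

proportion_labels
percent_labels
```

Both calls describe the same quantity here: 0.25 as a proportion and 25 as a percentage are each labelled "25%".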

If you haven’t yet done so, now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.


Submission

Once you are finished with the lab, you will submit your final PDF document to Gradescope.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.

Make sure your code is tidy! That is, your code should not run off the page and should be spaced properly. See: https://style.tidyverse.org/ggplot2.html.

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials → Duke NetID and log in using your NetID credentials.
  • Click on your STA 199 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
  • Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

Component              Points
Ex 1                   3
Ex 2                   6
Ex 3                   4
Ex 4                   12
Workflow & formatting  5
Total                  30

Note

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:

  • linking all pages appropriately on Gradescope
  • putting your name in the YAML at the top of the document
  • committing the submitted version of your .qmd to GitHub
  • keeping each line of code under the 80-character limit (you shouldn’t have to scroll to see all your code) and starting a new line after each pipe (%>% or |>) and each ggplot layer +
  • being consistent with stylistic choices, e.g., using only one of = vs. <- and only one of %>% vs. |>
  • surrounding all binary operators with spaces; for example, x + y is appropriate, x+y is not
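
As an illustration of these style points, here is a short sketch using the built-in mtcars dataset (chosen only for illustration; it is unrelated to the lab data). The commented-out line shows the same pipeline written in the discouraged style.

```r
library(dplyr)

# Harder to read: no spaces around operators, whole pipeline on one line
# efficient_cars<-mtcars|>filter(mpg>25)|>summarize(mean_hp=mean(hp))

# Easier to read: spaces around binary operators, a new line after each
# pipe, and every line well under 80 characters
efficient_cars <- mtcars |>
  filter(mpg > 25) |>
  summarize(mean_hp = mean(hp))

efficient_cars
```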