project-description-f23
Timeline
Component | Due date |
---|---|
Proposal | Wednesday November 1st |
Draft 1 | Wednesday November 15th |
Peer review | Wednesday November 29th |
Presentation + slides | December 6th |
Final report and final GitHub repo | December 8th 5:00PM |
Teamwork
Teamwork is an important skill that we will continue to grow during the STA199 project. It is my job to help ensure that each group member is communicating and contributing at each step of the project. Please consider the following strategies in order to help facilitate this:
COMMUNICATE. Poor communication is often the root of teamwork issues. Talk to each other. Set up a group chat on Slack or another platform.
Show up to Lab, and communicate if you cannot make it.
Do NOT divide and conquer this project. Every part of this project is everyone’s responsibility.
If you are not contributing to the project, you will not earn the same grade as your group mates.
Important: You will be asked to fill out a survey where you rate the contribution and teamwork of each team member AFTER EACH PROJECT COMPONENT. This will be done on Sakai. Failure to fill out the survey will impact your own project component grade.
If you score abnormally low as an individual on the survey, and this is corroborated by your lab leader, lab participation, and number of commits on your repo, your grade will be scaled. If you, as an individual, earn a 0, you will earn a 0% on this portion of the project. If your score is above 0, but abnormally low, you will earn 50% of what the group earned.
If no contribution problems are indicated on the survey, I will assume everyone on the team contributed equally, and everyone will receive full credit for the teamwork portion of the grade.
How to get help
The feedback for each of your project components will be in the Issues tab in your project repository. If you have questions about your feedback, or would like to talk more about it, you are strongly encouraged to attend the office hours of, make an appointment with, or email your lab leader, lab helper, or instructor.
Introduction
TL;DR: Ask a question you’re curious about and answer it with a dataset of your choice. This is your project in a nutshell.
May be too long, but please do read
The final project for this class will consist of analysis on a dataset of your own choosing. The dataset may exist online, or it may be a data set that you already have access to personally (e.g., from work, an internship, etc.). You may not collect your own data for this project (there isn’t enough time for you to do this and still be successful). You can choose the data based on your team’s interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.
The goal is not to do an exhaustive data analysis i.e., do not calculate every statistic and procedure you have learned for every variable, but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and appropriateness of the statistical analysis should be discussed here.
The project is very open ended. You should create some kind of compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R, and all components of the project must be reproducible (with the exception of the presentation).
You will work on the project with your lab teams.
The four primary deliverables for the final project are
- A project proposal with three dataset ideas.
- A reproducible project writeup of your analysis, with one required and one optional draft along the way.
- Formal peer review on another team’s project.
- A presentation with slides.
You will not be submitting anything on Gradescope for the project. Submission of these deliverables will happen on GitHub and feedback will be provided as GitHub issues that you need to engage with and close. The collection of the documents in your GitHub repo will create a webpage for your project. To create the webpage go to the Build tab in RStudio, and click on Render Website.
Proposal
There are two main purposes of the project proposal:
- To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
- To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.
Identify 3 data sets you’re interested in potentially using for the final project. If you’re unsure where to find data, you can use the list of potential data sources in the Resources for datasets section as a starting point. It may also help to think of topics you’re interested in investigating and find data sets on those topics.
You must use one of the data sets in the proposal for the final project, unless instructed otherwise when given feedback.
Criteria for datasets
The data sets should meet the following criteria:
- At least 300 observations (or approved by me)
- At least 6 unique columns that are useful and not simply identifiers (or approved by me)
  - Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
  - If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
- Data must be real
You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
Please ask a member of the teaching team if you’re unsure whether your data set meets the criteria.
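As a rough illustration, the size criteria above can be checked in R before you commit to a data set. The sketch below uses a small made-up data frame, `candidate`, purely as a stand-in (its column names and values are hypothetical), and `not_useful` is a list you would fill in yourself. It only counts rows and columns; whether a column is genuinely useful or duplicates another still needs your judgment.

```r
# Hypothetical stand-in for a candidate data set; replace with your own data.
set.seed(199)
candidate <- data.frame(
  id        = 1:350,  # identifier, not a useful variable
  state_abb = sample(c("NC", "VA", "SC"), 350, replace = TRUE),
  income    = round(rnorm(350, mean = 50000, sd = 8000)),
  age       = sample(18:80, 350, replace = TRUE),
  employed  = sample(c(TRUE, FALSE), 350, replace = TRUE),
  education = sample(c("HS", "BA", "MA"), 350, replace = TRUE),
  household = sample(1:6, 350, replace = TRUE),
  commute   = round(runif(350, min = 0, max = 90))
)

# Columns you judge to be identifiers or duplicates (an assumption you supply).
not_useful <- c("id")

# Count observations and the columns that remain after dropping those.
n_obs <- nrow(candidate)
n_useful <- length(setdiff(names(candidate), not_useful))

# Compare against the thresholds stated in the criteria.
meets_rows <- n_obs >= 300
meets_cols <- n_useful >= 6
```

If either check is FALSE, the data set would need approval before you use it.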
Resources for datasets
You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format; you can start with data that needs cleaning and tidying. Note, we do not suggest the use of Kaggle. Often, these data sets are fake or lack proper documentation. If you are excited about a data set that is located on Kaggle, please find and download it from its original source to ensure its authenticity.
- Awesome public datasets
- Bikeshare data portal
- CDC
- Data.gov
- Data is Plural
- Durham Open Data Portal
- Edinburgh Open Data
- Election Studies
- European Statistics
- CORGIS: The Collection of Really Great, Interesting, Situated Datasets
- General Social Survey
- Google Dataset Search
- Harvard Dataverse
- International Monetary Fund
- IPUMS survey data from around the world
- Los Angeles Open Data
- NHS Scotland Open Data
- NYC OpenData
- Open access to Scotland’s official statistics
- Pew Research
- PRISM Data Archive Project
- Statistics Canada
- The National Bureau of Economic Research
- UCI Machine Learning Repository
- UK Government Data
- UNICEF Data
- United Nations Data
- United Nations Statistics Division
- US Census Data
- US Government Data
- World Bank Data
- Youth Risk Behavior Surveillance System (YRBSS)
Proposal components
For each data set, include the following:
Introduction and data
For each data set:
Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
Address ethical concerns about the data, if any.
Research question
Before writing your research question, please see the following handout
Your research question should be clear and easily understood by a general audience. Again, when writing a research question, please think about the following:
What is your target population?
Is the question original?
Can the question be answered?
For each data set, include the following:
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Statement on why this question is important.
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- Identify the types of variables in your research question. Categorical? Quantitative?
Literature
A literature review is an overview of the previously published works on a topic. It is good practice to familiarize yourself with published work that has already been done on the topic you are researching. Often, a literature review spans many articles, and allows the researcher to communicate how their research fits in / extends our current understanding of a topic. For this project, we will do a “mini” literature review. For each data set:
Find one published credible article on the topic you are interested in researching.
Provide a one paragraph summary about the article.
In 1-2 sentences, explain how your research question builds on or differs from the article you have cited.
You can find articles using Google Scholar or some other academic search engine. A credible article is defined as academic published literature. A blog post or newspaper article does not count. Please reach out if you have any questions about this.
Citation: A citation is the way you tell your readers that certain material in your work came from another source. We can make citations easily in Quarto using a BibTeX file. BibTeX is a bibliographic tool that is used with LaTeX to help the user create citations.
We are not going to manually create a BibTeX file (.bib), but we can have Quarto do this for us! Here are the steps you will use to cite your literature:
- Click the Visual tab
- Click Insert -> Citation
- Click From DOI and paste the DOI of the article into the text box
- Click Insert
This will create a references.bib file for you and will change your YAML to include references in your document. When you render, you will now see your references show up at the end of your document.
Glimpse of data
For each data set:
- Place the file containing your data in the data folder of the project repo.
- Use the glimpse() function to provide a glimpse of the data set.
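A minimal sketch of these two steps is shown below, assuming the readr and dplyr packages are installed. So that it runs anywhere, it first writes a tiny made-up file (`example.csv`, a hypothetical name with invented values) into a temporary data folder; in your project the file would already be in the repo's data folder and you would skip that part.

```r
library(readr)  # read_csv(), write_csv()
library(dplyr)  # glimpse()

# For this sketch only: write a tiny made-up file so the example is runnable.
# In your project the file would already sit in the repo's data folder.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
example_path <- file.path(data_dir, "example.csv")
write_csv(
  data.frame(city = c("Durham", "Raleigh", "Cary"), rides = c(120, 98, 45)),
  example_path
)

# Read the file, then list each column with its type and first few values.
example_data <- read_csv(example_path, show_col_types = FALSE)
glimpse(example_data)
```

In a project document, the read step would typically point at a relative path inside the data folder instead of a temporary directory.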
Workflow + Formatting
To earn full credit, please be mindful of the following:
Have an updated about.qmd for your project.
Your website must be able to be rendered by me and your lab leader prior to submission.
EVERYONE in your group must have at least 1 MEANINGFUL commit.
Pipes %>%, |> and ggplot layers + should be followed by a new line.
All binary operators should be surrounded by space. For example x + y is appropriate. x+y is not.
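The two style rules above might look like this in practice. This is only a layout sketch, assuming dplyr and ggplot2 are installed; the built-in `mtcars` data is used purely for illustration and is not a suggestion for project data.

```r
library(dplyr)
library(ggplot2)

# Each pipe ends its line, so every step of the pipeline sits on its own line.
mpg_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), n = n())

# Each ggplot layer ends its line with +, followed by a new line.
mpg_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Binary operators are surrounded by spaces: wt * 1000, not wt*1000.
weight_lbs <- mtcars$wt * 1000
```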
How to turn this in
Project components are turned in the same way as AEs, but it is critical that you build the website and push all documents. When you are done, render the website and push ALL documents up to GitHub. There is nothing to turn in on Gradescope.
Proposal grading
Total | 15 pts |
---|---|
Introduction and data | 3 |
Research question | 3 |
Literature | 3 |
Glimpse of data | 3 |
Workflow and formatting | 3 |
Each component will be graded as follows:
- Each of your three proposals will be graded out of 4 points (1 point for each criterion above). The remaining 3 points will be awarded for the Workflow and formatting expectations above.
Within each proposal, criteria will be graded as follows:
Meets expectations (full credit): All required elements are completed and are accurate. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.
Close to expectations (half credit): There are some elements missing and/or inaccurate. There are some issues with formatting.
Does not meet expectations (no credit): Major elements missing. Work is not neatly formatted and would not be presentable in a professional setting.
It is critical to check feedback on your project proposal. Even if you earn full credit, it may not mean that your proposal is perfect.
Draft
The draft of your project should be on a single data set of your choice and include the following:
- Introduction and data (updated from feedback)
- Methodology
- Results
When putting together these components, please take into account the feedback in the Issues tab. Note that you are expected to respond to and close the Issues for the data set you have chosen. You only have to close the Issues for the data set selected for your project.
See below for more detail on each section. In short, take this seriously and treat this as an opportunity to get feedback on your research project before turning it in for final grading. Your draft must include a reasonable attempt at each analysis component - exploratory data analysis, prediction or modeling, and deriving initial results and conclusions.
Write the draft in the report.qmd file in your project repo after you have selected ONE data set to use and have responded to all feedback given from the proposal. Note that your research question can change from the proposal if you are interested in some of the new methodology we are starting to learn now. If your research question has changed, please reach out to the instructor / lab leader to discuss the change.
Draft components
Introduction and data
This should be your writing from your proposal, updated based on feedback given. As a reminder, the introduction provides motivation and context for your research. Describe your topic (citing sources) and provide a concise, clear statement of your research question and hypotheses. Additionally, consider any potential ethical issues that you believe should be addressed (if any) regarding how your data were collected.
Then identify the source of the data, when and how it was collected, the cases, and a general description of the relevant variables.
Methodology
The methodology section should include visualizations and summary statistics relevant to your research question. You are required to provide writing that explains the trends / patterns the visualization(s) displays as it relates to your research question. This writing can be treated as a “rough draft” for your final, but should also provide enough detail to state and justify the choice of statistical method(s) used to answer your research question.
Results
Provide results from your analysis that answer your research question. Reminder, the goal is not to calculate every possible statistic and perform every possible procedure for all variables. You should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in interpreting and presenting results, and that you can accomplish these tasks using R. More is not better.
You are encouraged to provide writing that interprets your results and answers your research question. This includes writing a scope of inference. The writing is optional for the draft, but will be a requirement on the final.
Workflow + Formatting
To earn full credit, please be mindful of the following:
Have an updated about.qmd for your project.
Respond to and close all Issues from your proposal for your chosen data set.
Your website must be able to be rendered by me and your lab leader prior to submission.
EVERYONE in your group must have at least 3 commits.
Pipes %>%, |> and ggplot layers + should be followed by a new line.
All binary operators should be surrounded by space. For example x + y is appropriate. x+y is not.
All team members should contribute to the GitHub repository, with regular meaningful commits.
Draft grading
Total | 15 pts |
---|---|
Introduction and data | 3 |
Methodology | 5 |
Results | 5 |
Workflow and formatting | 2 |
The project draft report grade is worth 15 points. You may lose points on your draft for the following:
Introduction and data
If components are incomplete or missing
If prior feedback was not taken into account from the proposal
Methodology
If components are incomplete or missing
If methods are not properly justified and relevant to the research question
If methods are incorrect or not appropriate to answer the research question
If EDA is not created, discussed in detail, and relevant to the research question
Results
If components are incomplete or missing
If results do not relate to the proposed research question
Workflow + Formatting
- If there are any outstanding Issues from the data set you have chosen. Note, you only have to close the Issues from the data set selected for your project.
- Issues with Workflow + Formatting
Final Report
Your written report must be completed in the report.qmd file and must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.
Before you finalize your write up, make sure the printing of code chunks is off with the option echo: false in the YAML.
The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including visualizations, should be a reasonable length. Please try to keep it shorter than 10 pages (excluding figures / graphs). There is no minimum page requirement; however, you should comprehensively address all of the analysis in your report.
To check how many pages your report is, open it in your browser and go to File > Print > Save as PDF and review the number of pages.
You are welcome to include an appendix with additional work at the end of the written report document; however, grading will largely be based on the content in the main body of the report. You should assume the reader will ***not*** see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.
Report components
Introduction and data
This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables.
Grading criteria
The research question and motivation are clearly stated in the introduction, including citations for the data source and any external research. The data are clearly described, including a description about how the data were originally collected and a concise definition of the variables relevant to understanding the report. The data cleaning process is clearly described, including any decisions made in the process (e.g., creating new variables, removing observations, etc.)
Literature Review
This section includes an overview of one previously published work on your topic of interest. This includes a summary on the author’s study design, findings/results, as well as how your study is related to + adds to the existing literature on the topic. You may use footnote citations for referenced literature.
Grading criteria
The article is clearly referenced and the connection between the article you have chosen and your research is clear. The article is well summarized to the point that the reader understands the article “at a high level”, including how the study was designed and the results that came out of it. It is clearly articulated how your study extends the existing literature on your topic of interest, answering the question “Why is your research important?”
Methodology
This section includes any exploratory data analysis and a brief description of your statistical procedure. If you are fitting a model, explain the reasoning for the type of model you’re fitting and the predictor variables considered for the model, including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, interactions considered, variable transformations (if needed), assessment of conditions and diagnostics, and any other relevant considerations that were part of the model fitting process.
If you are conducting a hypothesis test or creating a confidence interval, justify your decision, specifically on how it relates to your research question. Justify if you are going to use theory or simulation based techniques. Use resulting output to help address your research question.
You may use more than one statistical method to answer your research question.
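To make the simulation-based option concrete, here is a minimal base R sketch of a bootstrap percentile confidence interval for a mean. The data (`wait_times`) are simulated and hypothetical, and the 1,000 resamples and 95% level are assumptions chosen for illustration. The course may expect different tools or a different workflow, so treat this as one possible approach rather than the required method.

```r
set.seed(199)

# Hypothetical sample: 60 simulated wait times in minutes.
wait_times <- rexp(60, rate = 1 / 12)

# Resample with replacement and record the mean of each resample.
n_reps <- 1000
boot_means <- replicate(
  n_reps,
  mean(sample(wait_times, size = length(wait_times), replace = TRUE))
)

# Percentile interval: the middle 95% of the bootstrap means.
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci
```

A theory-based alternative for the same question would be a t-interval, for example from `t.test(wait_times)`; whichever you choose, the report should say why it suits your data and research question.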
Grading criteria
The analysis steps are appropriate for the data and research question. The group used a thorough and careful approach to determine analysis types and addressed any concerns over the appropriateness of the analyses chosen.
Results
This is where you will discuss your overall finding and describe the key results from your analysis. The goal is not to interpret every single element of an output shown, but instead to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.
Grading criteria
The analysis results are clearly assessed and interesting findings from the analysis are described. Interpretations are used to support the key findings and conclusions, rather than merely listing, e.g., the interpretation of every model coefficient.
Discussion
In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.
Grading criteria
Overall conclusions from analysis are clearly described, and the analysis results are put into the larger context of the subject matter and original research question. There is thoughtful consideration of potential limitations of the data and/or analysis, and ideas for future work are clearly described.
Organization + formatting
This is an assessment of the overall presentation and formatting of the written report.
Grading criteria
The report is neatly written and organized with clear section headers and appropriately sized figures with informative labels. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted. All citations and links are properly formatted. If there is an appendix, it is reasonably organized and easy for the reader to find relevant information. All code, warnings, and messages are suppressed. The main body of the written report (not including the appendix or plots) is no longer than 10 pages.
Report grading
The written report is worth 40 points, broken down as follows
Total | 40 pts |
---|---|
Introduction/data | 5 pts |
Literature Review | 3 pts |
Methodology | 10 pts |
Results | 12 pts |
Discussion | 6 pts |
Organization + formatting | 4 pts |
Peer Review
See lab instructions on how to complete your peer review.
Presentation + slides
Slides
In addition to the written report, your team will also create presentation slides and give a presentation that summarizes and showcases your project. Introduce your research question and data set, showcase visualizations, and discuss the primary conclusions.
You can create your slides with any software you like (Keynote, PowerPoint, Google Slides, etc.). We recommend choosing an option that’s easy to collaborate with, e.g., Google Slides.
You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!
It is recommended that the slide deck have no more than 6 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.
- Title Slide
- Slide 1: Introduce the topic and motivation
- Slide 2: Introduce the data
- Slide 3: Highlights from EDA
- Slide 4-5: Inference/modeling/other analysis
- Slide 6: Conclusions + future work
Presentation
Presentations will take place in class during the last lab of the semester. The presentation must be no longer than 5 minutes.
Reproducibility + organization
All written work (with exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.
Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, there should be no extraneous files, and all text in the README should be easily readable.
Overall grading
The grade breakdown is as follows:
Total | 100 pts |
---|---|
Project proposal | 15 pts |
Draft report | 15 pts |
Peer review | 5 pts |
Final report | 40 pts |
Slides + presentation | 25 pts |
Grading summary
Grading of the project will take into account the following:
- Content - What is the quality of research and/or policy question and relevancy of data to those questions?
- Correctness - Are statistical procedures carried out and explained correctly?
- Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
- Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?
A general breakdown of scoring is as follows:
- 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
- 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
- 70%-79%: Passing effort. Student misunderstands concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
- 60%-69%: Struggling effort. Student is making some effort, but misunderstands many concepts and is unable to put together a cogent argument. Communication of results is unclear.
- Below 60%: Student is not making a sufficient effort.
Late work policy
There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.