library(tidyverse)
library(dplyr)
library(data.table)
Project title
Proposal
Data 1
Introduction and data
Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
The data were pulled from the Steam API and various rating websites and cover 2012 to 2022. Each observation is a game, and the variables include: Steam id, Steam store user score, name, full price, platforms, categories, genres, GFQ reported difficulty, GFQ reported rating, GFQ reported length, MetaCritic score, MetaCritic user score, IGDB score, and IGDB user score. The ratings therefore come from several different rating websites.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?
Research Question: What factors, such as price, platform availability, genres, tags, length, or difficulty, affect user ratings for games on Steam from 2012 to 2022?
This research question is important because it helps developers understand what the general gaming population wants, so they can improve the different genres of games being played. It may also provide insight into which parts of games and which activities people find enjoyable. The results could help developers create better games in the future and help gamers make informed decisions about whether to buy a game.
Literature
Find one published credible article on the topic you are interested in researching.
Provide a one paragraph summary about the article.
In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
According to the article, the 100 top-ranked games are represented by 16 genres: adventure, role-playing, shooter, platform, puzzle, strategy, hack and slash/beat ’em up, real-time strategy, turn-based strategy, point-and-click, indie, racing, sport, fighting, arcade, and simulator games. Six of the 16 genres account for 83% of the successful games: adventure, RPG, shooter, platform, puzzle, and strategy. This leads the authors to recommend that future games combine the other categories with these six genres.
Our question is slightly different: instead of looking at the popularity of a few games and a few genres, we look at many closely related factors across a large set of roughly 54,000 Steam games, analyzing variables such as price, platform availability, genres, tags, and difficulty for each game. We also plan to use this information to build a model that predicts user ratings and shows which factors make games popular.
Article: https://online-journals.org/index.php/i-jim/article/view/16691
Glimpse of data
simple_steam <- fread("data/simple_steam.csv")
glimpse(simple_steam)
Rows: 53,981
Columns: 31
$ sid <int> 10, 20, 30, 40, 50, 60, 70, 80, 130, 220, 240,…
$ store_uscore <int> 97, 84, 90, 82, 95, 81, 96, 90, 90, 97, 96, 78…
$ name <chr> "Counter-Strike", "Team Fortress Classic", "Da…
$ full_price <int> 999, 499, 499, 499, 499, 499, 999, 999, 499, 9…
$ platforms <chr> "WIN,MAC,LNX", "WIN,MAC,LNX", "WIN,MAC,LNX", "…
$ categories <chr> "Multi-player,PvP,Online PvP,Shared/Split Scre…
$ Action <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ `Free to Play` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Strategy <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Adventure <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Indie <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ RPG <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Casual <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Simulation <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Racing <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Massively Multiplayer` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Nudity <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Violent <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Sports <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Gore <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Early Access` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Sexual Content` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Movie <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Game Development` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ gfq_difficulty <chr> "Just Right-Tough", "Just Right-Tough", "Just …
$ gfq_rating <dbl> 3.90, 3.47, 3.69, 3.15, 3.88, 2.65, 4.23, 3.50…
$ gfq_length <dbl> 64.5, 50.6, 53.1, 2.9, 10.7, 14.5, 21.1, 36.2,…
$ meta_score <int> 88, NA, 79, NA, NA, NA, 96, 65, 71, 96, 88, NA…
$ meta_uscore <int> 92, 71, 91, 68, 86, 68, 90, 87, 82, 91, 89, 82…
$ igdb_score <int> 70, NA, 71, NA, 70, NA, 80, 66, 60, 91, 90, NA…
$ igdb_uscore <int> 83, 70, 76, 75, 82, 72, 90, 75, 72, 91, 79, 77…
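As a rough illustration of how the Data 1 research question might be approached with the columns shown above, the sketch below derives two predictors and fits a simple linear model. It runs on a small invented data frame (`toy_steam`), not on the real file. The derived names `price_usd` and `n_platforms` are hypothetical, and reading `full_price` as cents (999 as $9.99) is an assumption that should be checked against the data source.

```r
# Toy stand-in for simple_steam: all values are invented for illustration.
toy_steam <- data.frame(
  store_uscore = c(97, 84, 90, 82, 95, 81, 70, 65),
  full_price   = c(999, 499, 499, 499, 1999, 0, 2999, 1499),  # assumed to be in cents
  platforms    = c("WIN,MAC,LNX", "WIN", "WIN,MAC", "WIN",
                   "WIN,MAC,LNX", "WIN", "WIN,MAC", "WIN"),
  Action       = c(1, 1, 0, 1, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Derived predictors (hypothetical names): price in dollars and
# the number of platforms listed in the comma-separated string.
toy_steam$price_usd   <- toy_steam$full_price / 100
toy_steam$n_platforms <- lengths(strsplit(toy_steam$platforms, ",", fixed = TRUE))

# Linear model of the Steam user score on the derived predictors
# and one genre indicator.
fit <- lm(store_uscore ~ price_usd + n_platforms + Action, data = toy_steam)
coef(fit)
```

With the real data, the same pattern could be extended to the other genre indicator columns and to `gfq_length` or `gfq_difficulty`. Missing values in the external scores would need to be handled first.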
Data 2
Introduction and data
Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
Source Link: https://think.cs.vt.edu/corgis/csv/school_scores/
This source collected information on a random sample of students across the country from 2005 to 2015. The variables include SAT scores by subject, GPAs (cumulative and by subject), the total number of sampled test takers, and the number of test takers by income bracket.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?
Research Question: What is the correlation between family income and Math SAT scores for students, and how has this correlation evolved over the last ten years (2005-2015)? (Socioeconomics)
This question is important because education and test scores tie in with career opportunities. Schools should aim to minimize any disproportionate advantages for those with higher incomes to ensure fairness for all.
This research question falls under the topic of the evolution of socioeconomic differences. We hypothesize that there will be a strong, positive correlation between family income and Math SAT scores, but that this correlation will weaken over time.
The variables are Math SAT scores (quantitative, discrete), income brackets (categorical), and year (categorical).
Literature
Find one published credible article on the topic you are interested in researching.
Provide a one paragraph summary about the article.
In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
Article Link: http://academypublication.com/issues2/jltr/vol06/06/27.pdf
Literature Summary: An experiment in Iran examined the effects of parents’ “involvement, attitude, educational background and level of income” on their children’s English test scores. Two correlation tests, the Pearson product-moment correlation and the Spearman rank-order correlation, were used to measure the correlation between the respective variables. The experiment gathered a random sample of 140 parents and gave them a questionnaire; their responses placed them in specific categories. The authors hypothesized that each individual factor was positively correlated with test scores, and the results showed positive correlations of 0.58 and 0.51 from the respective correlation tests.
This ties in with our research question, but our question focuses on income rather than on all the other factors in the study above. Additionally, our question focuses on how the correlation has evolved over time.
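Since the cited study reports both a Pearson and a Spearman correlation, a small example may help show how the two can differ. The values below are invented: `y` increases with `x` in a perfectly monotonic but non-linear way, so the rank-based Spearman coefficient equals 1 while the Pearson coefficient, which measures linear association, is lower.

```r
# Invented toy values: y grows monotonically, but not linearly, with x.
x <- c(1, 2, 3, 4, 5, 6)
y <- x^3

# Pearson measures linear association; Spearman measures monotonic
# association by correlating the ranks of x and y.
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

For income brackets, which are ordered categories rather than exact dollar amounts, a rank-based measure such as Spearman may be the more natural choice, although that is a judgment call for the analysis stage.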
Glimpse of data
school_scores <- read_csv("data/school_scores.csv")
Rows: 577 Columns: 99
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): State.Code, State.Name
dbl (97): Year, Total.Math, Total.Test-takers, Total.Verbal, Academic Subjec...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(school_scores)
Rows: 577
Columns: 99
$ Year <dbl> 2005, 2005, …
$ State.Code <chr> "AL", "AK", …
$ State.Name <chr> "Alabama", "…
$ Total.Math <dbl> 559, 519, 53…
$ `Total.Test-takers` <dbl> 3985, 3996, …
$ Total.Verbal <dbl> 567, 523, 52…
$ `Academic Subjects.Arts/Music.Average GPA` <dbl> 3.92, 3.76, …
$ `Academic Subjects.Arts/Music.Average Years` <dbl> 2.2, 1.9, 2.…
$ `Academic Subjects.English.Average GPA` <dbl> 3.53, 3.35, …
$ `Academic Subjects.English.Average Years` <dbl> 3.9, 3.9, 3.…
$ `Academic Subjects.Foreign Languages.Average GPA` <dbl> 3.54, 3.34, …
$ `Academic Subjects.Foreign Languages.Average Years` <dbl> 2.6, 2.1, 2.…
$ `Academic Subjects.Mathematics.Average GPA` <dbl> 3.41, 3.06, …
$ `Academic Subjects.Mathematics.Average Years` <dbl> 4.0, 3.5, 3.…
$ `Academic Subjects.Natural Sciences.Average GPA` <dbl> 3.52, 3.25, …
$ `Academic Subjects.Natural Sciences.Average Years` <dbl> 3.9, 3.2, 3.…
$ `Academic Subjects.Social Sciences/History.Average GPA` <dbl> 3.59, 3.39, …
$ `Academic Subjects.Social Sciences/History.Average Years` <dbl> 3.9, 3.4, 3.…
$ `Family Income.Between 20-40k.Math` <dbl> 513, 492, 49…
$ `Family Income.Between 20-40k.Test-takers` <dbl> 324, 401, 21…
$ `Family Income.Between 20-40k.Verbal` <dbl> 527, 500, 49…
$ `Family Income.Between 40-60k.Math` <dbl> 539, 517, 52…
$ `Family Income.Between 40-60k.Test-takers` <dbl> 442, 539, 22…
$ `Family Income.Between 40-60k.Verbal` <dbl> 551, 522, 51…
$ `Family Income.Between 60-80k.Math` <dbl> 550, 513, 52…
$ `Family Income.Between 60-80k.Test-takers` <dbl> 473, 603, 23…
$ `Family Income.Between 60-80k.Verbal` <dbl> 564, 519, 52…
$ `Family Income.Between 80-100k.Math` <dbl> 566, 528, 53…
$ `Family Income.Between 80-100k.Test-takers` <dbl> 475, 444, 18…
$ `Family Income.Between 80-100k.Verbal` <dbl> 577, 534, 53…
$ `Family Income.Less than 20k.Math` <dbl> 462, 464, 48…
$ `Family Income.Less than 20k.Test-takers` <dbl> 175, 191, 89…
$ `Family Income.Less than 20k.Verbal` <dbl> 474, 467, 47…
$ `Family Income.More than 100k.Math` <dbl> 588, 541, 55…
$ `Family Income.More than 100k.Test-takers` <dbl> 980, 540, 30…
$ `Family Income.More than 100k.Verbal` <dbl> 590, 544, 54…
$ `GPA.A minus.Math` <dbl> 569, 544, 54…
$ `GPA.A minus.Test-takers` <dbl> 724, 673, 33…
$ `GPA.A minus.Verbal` <dbl> 575, 546, 53…
$ `GPA.A plus.Math` <dbl> 622, 600, 60…
$ `GPA.A plus.Test-takers` <dbl> 563, 173, 16…
$ `GPA.A plus.Verbal` <dbl> 623, 604, 59…
$ GPA.A.Math <dbl> 600, 580, 57…
$ `GPA.A.Test-takers` <dbl> 1032, 671, 3…
$ GPA.A.Verbal <dbl> 608, 578, 56…
$ GPA.B.Math <dbl> 514, 492, 49…
$ `GPA.B.Test-takers` <dbl> 1253, 1622, …
$ GPA.B.Verbal <dbl> 525, 499, 49…
$ GPA.C.Math <dbl> 436, 466, 45…
$ `GPA.C.Test-takers` <dbl> 188, 418, 11…
$ GPA.C.Verbal <dbl> 451, 472, 46…
$ `GPA.D or lower.Math` <dbl> 0, 424, 439,…
$ `GPA.D or lower.Test-takers` <dbl> 0, 12, 16, 0…
$ `GPA.D or lower.Verbal` <dbl> 0, 466, 435,…
$ `GPA.No response.Math` <dbl> 0, 0, 0, 0, …
$ `GPA.No response.Test-takers` <dbl> 225, 427, 91…
$ `GPA.No response.Verbal` <dbl> 0, 0, 0, 0, …
$ Gender.Female.Math <dbl> 538, 505, 51…
$ `Gender.Female.Test-takers` <dbl> 2072, 2161, …
$ Gender.Female.Verbal <dbl> 561, 521, 52…
$ Gender.Male.Math <dbl> 582, 535, 54…
$ `Gender.Male.Test-takers` <dbl> 1913, 1835, …
$ Gender.Male.Verbal <dbl> 574, 526, 53…
$ `Score Ranges.Between 200 to 300.Math.Females` <dbl> 22, 30, 119,…
$ `Score Ranges.Between 200 to 300.Math.Males` <dbl> 10, 20, 72, …
$ `Score Ranges.Between 200 to 300.Math.Total` <dbl> 32, 50, 191,…
$ `Score Ranges.Between 200 to 300.Verbal.Females` <dbl> 14, 26, 115,…
$ `Score Ranges.Between 200 to 300.Verbal.Males` <dbl> 17, 26, 86, …
$ `Score Ranges.Between 200 to 300.Verbal.Total` <dbl> 31, 52, 201,…
$ `Score Ranges.Between 300 to 400.Math.Females` <dbl> 173, 233, 88…
$ `Score Ranges.Between 300 to 400.Math.Males` <dbl> 93, 153, 450…
$ `Score Ranges.Between 300 to 400.Math.Total` <dbl> 266, 386, 13…
$ `Score Ranges.Between 300 to 400.Verbal.Females` <dbl> 123, 218, 73…
$ `Score Ranges.Between 300 to 400.Verbal.Males` <dbl> 84, 171, 613…
$ `Score Ranges.Between 300 to 400.Verbal.Total` <dbl> 207, 389, 13…
$ `Score Ranges.Between 400 to 500.Math.Females` <dbl> 514, 696, 32…
$ `Score Ranges.Between 400 to 500.Math.Males` <dbl> 293, 485, 19…
$ `Score Ranges.Between 400 to 500.Math.Total` <dbl> 807, 1181, 5…
$ `Score Ranges.Between 400 to 500.Verbal.Females` <dbl> 430, 656, 30…
$ `Score Ranges.Between 400 to 500.Verbal.Males` <dbl> 332, 552, 23…
$ `Score Ranges.Between 400 to 500.Verbal.Total` <dbl> 762, 1208, 5…
$ `Score Ranges.Between 500 to 600.Math.Females` <dbl> 722, 813, 35…
$ `Score Ranges.Between 500 to 600.Math.Males` <dbl> 614, 616, 31…
$ `Score Ranges.Between 500 to 600.Math.Total` <dbl> 1336, 1429, …
$ `Score Ranges.Between 500 to 600.Verbal.Females` <dbl> 690, 729, 36…
$ `Score Ranges.Between 500 to 600.Verbal.Males` <dbl> 617, 596, 31…
$ `Score Ranges.Between 500 to 600.Verbal.Total` <dbl> 1307, 1325, …
$ `Score Ranges.Between 600 to 700.Math.Females` <dbl> 485, 342, 16…
$ `Score Ranges.Between 600 to 700.Math.Males` <dbl> 611, 445, 21…
$ `Score Ranges.Between 600 to 700.Math.Total` <dbl> 1096, 787, 3…
$ `Score Ranges.Between 600 to 700.Verbal.Females` <dbl> 596, 423, 18…
$ `Score Ranges.Between 600 to 700.Verbal.Males` <dbl> 613, 375, 16…
$ `Score Ranges.Between 600 to 700.Verbal.Total` <dbl> 1209, 798, 3…
$ `Score Ranges.Between 700 to 800.Math.Females` <dbl> 156, 47, 327…
$ `Score Ranges.Between 700 to 800.Math.Males` <dbl> 292, 116, 63…
$ `Score Ranges.Between 700 to 800.Math.Total` <dbl> 448, 163, 95…
$ `Score Ranges.Between 700 to 800.Verbal.Females` <dbl> 219, 109, 41…
$ `Score Ranges.Between 700 to 800.Verbal.Males` <dbl> 250, 115, 50…
$ `Score Ranges.Between 700 to 800.Verbal.Total` <dbl> 469, 224, 91…
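The income-by-year question for Data 2 needs the wide `Family Income.*.Math` columns in long form, with one row per state, year, and bracket. The sketch below shows one way this could be done, using an invented tibble (`toy_scores`) that includes only four of the six brackets. The names `income_long`, `bracket`, `math_score`, and `cor_by_year` are hypothetical. Note that each value is a state-level average, so any correlation computed this way describes state-by-bracket averages rather than individual students.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for school_scores: numbers are invented, and only
# four of the six income brackets are included for brevity.
toy_scores <- tibble(
  Year = c(2005, 2005, 2006, 2006),
  State.Code = c("AL", "AK", "AL", "AK"),
  `Family Income.Less than 20k.Math`  = c(462, 464, 470, 460),
  `Family Income.Between 20-40k.Math` = c(513, 492, 510, 495),
  `Family Income.Between 40-60k.Math` = c(539, 517, 535, 520),
  `Family Income.More than 100k.Math` = c(588, 541, 580, 545)
)

# Bracket order from lowest to highest income.
bracket_levels <- c("Less than 20k", "Between 20-40k",
                    "Between 40-60k", "More than 100k")

# Reshape to one row per state, year, and bracket, keeping only Math columns.
income_long <- toy_scores %>%
  pivot_longer(
    cols = matches("^Family Income\\..*\\.Math$"),
    names_to = "bracket",
    names_pattern = "Family Income\\.(.*)\\.Math",
    values_to = "math_score"
  ) %>%
  mutate(bracket = factor(bracket, levels = bracket_levels, ordered = TRUE))

# Spearman correlation between bracket rank and Math score, per year.
cor_by_year <- income_long %>%
  group_by(Year) %>%
  summarise(
    rho = cor(as.integer(bracket), math_score, method = "spearman"),
    .groups = "drop"
  )
cor_by_year
```

Plotting `rho` against `Year` for the full data would then give a first look at whether the association has weakened over time, as hypothesized.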
Data 3
Introduction and data
Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
The data come from data.gov and are an anonymized record of drug use-related deaths in Connecticut from 2012 to 2022. The variables indicate whether each death involved morphine, heroin, and other opioids.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?
Research Question: What is the correlation between time and the popularity of each drug (its proportion of uses)? How has opioid use changed over time with regard to lethal overdoses? Type of drug will be a categorical variable, while number of deaths and time will be quantitative variables.
This question is important because opioid use and overdoses have been on the rise over the past decade, with many people throughout the United States falling victim to addiction as a result of either prescription opioids or exposure to harder (illegal) drugs. We hypothesize that there will be a strong correlation between time and the popularity of opioids and other illicit drugs.
Literature
Find one published credible article on the topic you are interested in researching.
Provide a one paragraph summary about the article.
In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
From 2013 to 2019, opioid overdose deaths in the United States increased substantially. In 2013, there were approximately 25,000 opioid-related deaths, and by 2019 this number had risen to around 49,860. This surge was largely driven by synthetic opioids like fentanyl, which were involved in more than half of these fatalities in 2019. Additionally, prescription opioid-related deaths, though they declined from their peak in 2011, still accounted for a significant portion of opioid overdose fatalities during this period.
Our research will differ from the article in that we will examine other types of drugs in addition to opioids. We want to see how the popularity of different drugs has changed over time.
Article Link: https://catalog.data.gov/dataset/accidental-drug-related-deaths-2012-2018
Glimpse of data
acc_drug_death <- read.csv("data/acc_drug_death.csv")
glimpse(acc_drug_death)
Rows: 10,654
Columns: 48
$ Date <chr> "05/29/2012", "06/27/2012", "03/24/2014"…
$ Date.Type <chr> "Date of death", "Date of death", "Date …
$ Age <int> 37, 37, 28, 26, 41, 57, 26, 64, 33, 23, …
$ Sex <chr> "Male", "Male", "Male", "Female", "Male"…
$ Race <chr> "Black", "White", "White", "White", "Whi…
$ Ethnicity <chr> "", "", "", "", "", "", "", "", "", "", …
$ Residence.City <chr> "STAMFORD", "NORWICH", "HEBRON", "BALTIC…
$ Residence.County <chr> "FAIRFIELD", "NEW LONDON", "", "", "FAIR…
$ Residence.State <chr> "", "", "", "", "CT", "MA", "CT", "CT", …
$ Injury.City <chr> "STAMFORD", "NORWICH", "HEBRON", "", "SH…
$ Injury.County <chr> "", "", "", "", "", "HARTFORD", "", "NEW…
$ Injury.State <chr> "CT", "CT", "CT", "", "", "CT", "", "CT"…
$ Injury.Place <chr> "Residence", "Residence", "Residence", "…
$ Description.of.Injury <chr> "Used Cocaine", "Drug Use", "Drug Use", …
$ Death.City <chr> "", "NORWICH", "MARLBOROUGH", "BALTIC", …
$ Death.County <chr> "", "NEW LONDON", "", "NEW LONDON", "", …
$ Death.State <chr> "", "", "", "", "", "CT", "CT", "CT", "C…
$ Location <chr> "Residence", "Hospital", "Hospital", "Re…
$ Location.if.Other <chr> "", "", "", "", "", "Roadway in vehicle"…
$ Cause.of.Death <chr> "Cocaine Toxicity", "Heroin Toxicity", "…
$ Manner.of.Death <chr> "Accident", "Accident", "Accident", "Acc…
$ Other.Significant.Conditions <chr> "", "", "", "", "", "", "", "", "", "", …
$ Heroin <chr> "", "Y", "Y", "Y", "", "", "", "", "", "…
$ Heroin.death.certificate..DC. <chr> "", "", "", "", "", "", "", "", "", "", …
$ Cocaine <chr> "Y", "", "", "", "", "Y", "", "", "Y", "…
$ Fentanyl <chr> "", "", "", "", "Y", "", "", "", "", "",…
$ Fentanyl.Analogue <chr> "", "", "", "", "", "", "", "", "", "", …
$ Oxycodone <chr> "", "", "", "", "", "", "", "Y", "", "",…
$ Oxymorphone <chr> "", "", "", "", "", "", "", "", "", "", …
$ Ethanol <chr> "", "", "", "", "", "", "", "", "", "", …
$ Hydrocodone <chr> "", "", "", "", "", "", "", "", "", "", …
$ Benzodiazepine <chr> "", "", "", "", "", "", "", "", "", "", …
$ Methadone <chr> "", "", "", "", "", "", "", "", "", "", …
$ Meth.Amphetamine <chr> "", "", "", "", "", "", "", "", "", "", …
$ Amphet <chr> "", "", "", "", "", "", "", "", "", "", …
$ Tramad <chr> "", "", "", "", "", "", "", "", "", "", …
$ Hydromorphone <chr> "", "", "", "", "", "", "", "", "", "", …
$ Morphine..Not.Heroin. <chr> "", "", "", "", "", "", "", "", "", "", …
$ Xylazine <chr> "", "", "", "", "", "", "", "", "", "", …
$ Gabapentin <chr> "", "", "", "", "", "", "", "", "", "", …
$ Opiate.NOS <chr> "", "", "", "", "", "", "", "", "", "", …
$ Heroin.Morph.Codeine <chr> "", "", "", "", "", "", "", "", "", "", …
$ Other.Opioid <chr> "", "", "", "", "", "", "", "", "", "", …
$ Any.Opioid <chr> "", "", "", "", "Y", "", "Y", "", "", ""…
$ Other <chr> "", "", "", "", "", "", "", "", "", "", …
$ ResidenceCityGeo <chr> "STAMFORD, CT\n(41.051924, -73.539475)",…
$ InjuryCityGeo <chr> "STAMFORD, CT\n(41.051924, -73.539475)",…
$ DeathCityGeo <chr> "CT\n(41.575155, -72.738288)", "Norwich,…
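For the Data 3 question about how drug involvement changes over time, the `Date` strings and the `"Y"`/empty drug flags shown above would first need to be converted into a year and logical indicators. The sketch below illustrates this on an invented data frame (`toy_deaths`) with two drug columns. The names `flags` and `share_by_year` are hypothetical, and treating every value other than `"Y"` as "not involved" is an assumption, since the real columns may contain other markers.

```r
# Toy stand-in for acc_drug_death: rows are invented for illustration.
toy_deaths <- data.frame(
  Date     = c("05/29/2012", "06/27/2012", "03/24/2014",
               "11/02/2014", "01/15/2016", "07/04/2016"),
  Heroin   = c("", "Y", "Y", "Y", "", "Y"),
  Fentanyl = c("", "", "", "Y", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings and extract the calendar year.
toy_deaths$year <- as.integer(
  format(as.Date(toy_deaths$Date, format = "%m/%d/%Y"), "%Y")
)

# Assumption: "Y" means the drug was involved; anything else means it was not.
drug_cols <- c("Heroin", "Fentanyl")
flags <- as.data.frame(lapply(toy_deaths[drug_cols], function(col) col == "Y"))

# Share of each year's deaths that involved each drug.
share_by_year <- aggregate(flags, by = list(year = toy_deaths$year), FUN = mean)
share_by_year
```

In the full data, the same yearly shares could be computed for every drug column and then plotted or correlated with `year` to examine the hypothesized trends.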