library(tidyverse)
Measuring the Madness
Proposal
Data 1
Introduction and data
- Identify the source of the data.
We found the dataset on https://data.world/sports/ncaa-mens-march-madness, an online repository of datasets. While the site was not on the recommended sites list, the original data curator (SportsReference)
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data in the dataset above is sourced from SportsReference (https://www.sports-reference.com/cbb/), an online database containing box scores from most major sports, notably including the NCAA college basketball tournaments. It was originally collected by Kevin Johnson and Dave Quinn and is continuously updated through different official sports data agencies.
- Write a brief description of the observations.
The data includes 2,205 NCAA tournament games from 1985 to 2015, with 10 columns: the year of the game, the round of the game, the region number, the region name, the seed of the winning team and of the losing team, the score of the winning team and of the losing team, and the name of the winning team and of the losing team.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How have college basketball seeds performed in March Madness relative to their expected outcomes between 1985 and 2019?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
Every year, the NCAA selection committee assigns seeds to college basketball teams across the country, and these seeds determine each team's path to the National Championship. The best teams in the country generally receive top seeds (1-2), while lower-performing, less well-known teams generally receive the lowest seeds (15-16). The seed is essentially a prediction of how far a team will progress in the tournament: 1 seeds are predicted to reach the Final Four, 4 seeds to make the Sweet Sixteen, and 16 seeds to be eliminated in the first round. However, March Madness is special because of its upsets. For this reason, we are interested in how seeds perform relative to expectations. Which seeds tend to overperform, and which tend to underperform? We hypothesize that because of upsets, higher ranked seeds on average underperform, while the lowest seeds tend to slightly overperform because of occasional upsets. We further hypothesize that 4 and 5 seeds will most frequently overperform relative to expectations because they are predicted to be eliminated by the Sweet Sixteen, despite generally having only a few more losses than the top seeds. To be clear, we selected 1985 as the starting point because that was the year the NCAA expanded to a 64-team tournament with 16 seeds. Still, we anticipate that we will begin to see more upsets after 2001 and 2011, when the NCAA expanded the field to 65 and 68 teams, respectively. Since “play-in” games help to filter in more experienced low seeds, those seeds will likely perform better.
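As a rough sketch of how "performance relative to expectations" could be measured, the code below compares each seed's wins per tournament appearance with the wins it would collect if every game went to the better seed. The toy games, the column names `game_round`, `winner_seed`, and `loser_seed`, and the helper `expected_wins()` are all hypothetical. The expectation also covers regional rounds only, since the seed alone does not say who should win a Final Four game between two 1 seeds.

```r
library(dplyr)

# Regional wins a seed would collect if every game went to the better seed:
# 1 seeds win 4 games, 2 seeds 3, 3-4 seeds 2, 5-8 seeds 1, 9-16 seeds 0.
expected_wins <- function(seed) {
  case_when(
    seed == 1 ~ 4,
    seed == 2 ~ 3,
    seed <= 4 ~ 2,
    seed <= 8 ~ 1,
    TRUE ~ 0
  )
}

# Hypothetical toy games (an 8-team mini bracket), not rows from the dataset.
games <- tibble::tribble(
  ~game_round, ~winner_seed, ~loser_seed,
  1,  1, 16,
  1, 12,  5,
  1,  4, 13,
  1,  8,  9,
  2,  1,  8,
  2, 12,  4,
  3,  1, 12
)

# Each first-round game contributes one appearance for each of its two seeds.
round_one <- filter(games, game_round == 1)
appearances <- tibble::tibble(
  seed = c(round_one$winner_seed, round_one$loser_seed)
) |>
  count(seed, name = "n_appearances")

# Total wins per seed across all rounds.
wins <- count(games, seed = winner_seed, name = "wins")

seed_performance <- appearances |>
  left_join(wins, by = "seed") |>
  mutate(
    wins = coalesce(wins, 0L),  # seeds that never won have no row in `wins`
    wins_per_appearance = wins / n_appearances,
    expected = expected_wins(seed),
    wins_above_expected = wins_per_appearance - expected
  ) |>
  arrange(seed)

seed_performance
```

Positive values of `wins_above_expected` would indicate seeds that overperform on average. How to treat play-in games and the national rounds is a design choice this sketch leaves open.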
- Identify the types of variables in your research question. Categorical? Quantitative?
There are 7 quantitative variables (the year of the game, the round of the game, the region number, the seed of the winning team and of the losing team, and the score of the winning team and of the losing team) and 3 categorical variables (the region name and the names of the winning and losing teams).
Literature
- Find one published credible article on the topic you are interested in researching.
https://digitalcommons.unomaha.edu/cgi/viewcontent.cgi?article=1180&context=university_honors_program
- Provide a one paragraph summary about the article.
This article attempted to predict the results of the 2022 NCAA March Madness tournament by drawing on data from 2011-2019 to identify “rules” for selecting teams. For example, the model showed that in the first round, all 1-2 seeds should be selected to win, but a 14 seed should be selected if it shoots over 40% on the year. The results were mixed: despite the hypothesis that these rules would predict upsets better, they overlooked certain upsets, such as 15 seed St. Peter’s victory over 2 seed Kentucky. This ultimately led Kanzmeier to a cynical take that perhaps such models are not the best approach, because there is always an “immeasurable upset factor” that is difficult to account for.
- In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
Our research differs from traditional prediction models because, rather than trying to predict the outcomes of the tournament, we want to determine which seeds generally outperform expectations. The scope of our project is also larger because our analysis will cover every tournament since the field expanded.
Glimpse of data
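# Read the raw CSV; read_csv() de-duplicates the repeated Seed/Score/Team
# column names by appending their positions (e.g. Seed...5).
march_madness <- read_csv("data/Big_Dance_CSV.csv")
# Give the de-duplicated columns descriptive names.
march_madness <- march_madness |>
  rename("WinningTeam.Seed" = Seed...5,
         "LosingTeam.Seed" = Seed...10,
         "WinningTeam.Score" = Score...6,
         "LosingTeam.Score" = Score...9,
         "WinningTeam.Name" = Team...7,
         "LosingTeam.Name" = Team...8)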
glimpse(march_madness)
Rows: 2,205
Columns: 10
$ Year <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985…
$ Round <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ `Region Number` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3…
$ `Region Name` <chr> "West", "West", "West", "West", "West", "West", "Wes…
$ WinningTeam.Seed <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2…
$ WinningTeam.Score <dbl> 83, 81, 65, 85, 58, 75, 50, 54, 68, 65, 76, 59, 85, …
$ WinningTeam.Name <chr> "St Johns", "VCU", "NC State", "UNLV", "Washington",…
$ LosingTeam.Name <chr> "Southern", "Marshall", "Nevada", "San Diego St", "K…
$ LosingTeam.Score <dbl> 59, 65, 56, 80, 65, 79, 41, 63, 43, 58, 57, 58, 68, …
$ LosingTeam.Seed <dbl> 16, 15, 14, 13, 12, 11, 10, 9, 16, 15, 14, 13, 12, 1…
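One thing that may be worth checking before the analysis: in the glimpse above, some rows show a WinningTeam.Score that is lower than the LosingTeam.Score (for example, 58 vs 65 in the fifth row), which suggests that the first team listed may not always be the winner. The sketch below re-enters the first eight rows shown in the glimpse and derives the actual winner from the scores. The new column names (`first_team_won`, `winner_seed`, `loser_seed`, `margin`, `upset`) are our own placeholders.

```r
library(dplyr)

# First eight rows as printed in the glimpse output (seeds and scores only).
first_rows <- tibble::tibble(
  WinningTeam.Seed  = c(1, 2, 3, 4, 5, 6, 7, 8),
  WinningTeam.Score = c(83, 81, 65, 85, 58, 75, 50, 54),
  LosingTeam.Score  = c(59, 65, 56, 80, 65, 79, 41, 63),
  LosingTeam.Seed   = c(16, 15, 14, 13, 12, 11, 10, 9)
)

checked <- first_rows |>
  mutate(
    # TRUE when the team in the "Winning" columns actually scored more points.
    first_team_won = WinningTeam.Score > LosingTeam.Score,
    winner_seed = if_else(first_team_won, WinningTeam.Seed, LosingTeam.Seed),
    loser_seed  = if_else(first_team_won, LosingTeam.Seed, WinningTeam.Seed),
    margin = abs(WinningTeam.Score - LosingTeam.Score),
    # An upset is a win by the numerically larger (worse) seed.
    upset = winner_seed > loser_seed
  )

count(checked, first_team_won)
```

If the same pattern holds in the full dataset, names such as `winner_seed` derived from the scores would be a safer basis for counting wins by seed than the renamed columns alone.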
Data 2
Introduction and data
- Identify the source of the data.
The data is sourced from the Ergast Developer API.
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data is collected from data published by the FIA, the organization that presides over F1. The database is updated 2-6 hours after each race and goes back to 1950.
- Write a brief description of the observations.
The full data set has information about all aspects of F1, including:
- a list of all drivers with personal information
- a list of driver standings for each year
- lap times for each driver in each race
- the time of each pit stop each driver took in each race
- the results of each race
- the results of the qualifying races
- a list of constructors and their standings (the companies that construct the cars, e.g. Ferrari)
and more.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
What factors are most important in predicting a driver’s final race position based on his qualifying position? (In other words, how can we predict whether a driver is going to have a big jump or fall relative to his qualifying position?)
How has the new rule regarding tire compounds in qualifying affected average position for different constructors?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
For each F1 race, there is a qualifying session the day before. The drivers first spend 18 minutes trying to set the fastest lap time possible, with the slowest 5 drivers being cut from the next round. The 5 drivers that get cut start the actual race in positions 16-20, based on lap time. The 2nd round is the same, except it is only 15 minutes long. The 3rd round lasts for 12 minutes, and the order of lap times determines the starting positions for the actual race (so the slowest of the 10 drivers in the 3rd round starts in 10th position). Qualifying position is extremely important, especially on circuits where passing is more difficult. However, there are instances where drivers make big climbs or fall badly from their starting position, and we are looking to investigate the reasons why that might happen.
With regard to the second question, starting in 2014 a rule was introduced where cars had to start the race on the same type of tire they used in the 2nd round of qualifying (there are 3 types of tires: soft, medium, and hard; softs are the fastest but have the lowest durability, while hards are the slowest but last longer). This rule was removed in 2022, and we want to investigate the effects of this rule change.
We hypothesize that different circuits will have different amounts of variability from qualifying positions, and that potential factors that could cause movement are drivers’ previous performance, better constructor standings, and big discrepancies in qualifying times. Additionally, we believe that after 2022, smaller constructors will have better average positions because they are not confined to certain tire types and can outperform better constructors using strategy.
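A minimal sketch of the first question's outcome variable, assuming each race result provides a starting grid position and a finishing position. The toy rows and the column names (`circuit`, `driver`, `grid`, `finish`) are placeholders, not necessarily the Ergast field names.

```r
library(dplyr)

# Hypothetical results for two circuits; not real race data.
results <- tibble::tribble(
  ~circuit, ~driver, ~grid, ~finish,
  "Monaco", "A", 1, 1,
  "Monaco", "B", 2, 3,
  "Monaco", "C", 3, 2,
  "Monza",  "A", 1, 4,
  "Monza",  "B", 5, 1,
  "Monza",  "C", 2, 2
)

# Positive values mean the driver finished ahead of where he started.
results <- mutate(results, positions_gained = grid - finish)

# Average absolute movement per circuit, as one measure of how much
# finishing order departs from qualifying order.
movement <- results |>
  group_by(circuit) |>
  summarise(mean_abs_change = mean(abs(positions_gained)), .groups = "drop")

movement
```

A circuit where passing is difficult would be expected to show a smaller `mean_abs_change`. Retirements and grid penalties would need separate handling, which this sketch does not attempt.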
- Identify the types of variables in your research question. Categorical? Quantitative?
Quantitative variables:
- Qualifying times
- Pit stop times
- Fastest lap times
- Number of pit stops
Categorical variables:
- driverID
- Driver standings
- Constructor standings
- Starting grid positions
- Constructor names
- Circuits
- Penalty/no penalty
- Year/Season
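For the second question, the season and constructor variables could be combined into a before/after comparison along the following lines. This is only a sketch: the rows are invented, the column names (`season`, `constructor`, `finish`) and era labels are placeholders, and 2022 is used as the cutoff because that is when the rule was removed.

```r
library(dplyr)

# Hypothetical finishing positions; not real race data.
finishes <- tibble::tribble(
  ~season, ~constructor, ~finish,
  2021, "Big Team",    1,
  2021, "Small Team", 12,
  2021, "Small Team", 14,
  2022, "Big Team",    2,
  2022, "Small Team",  9,
  2022, "Small Team", 11
)

era_summary <- finishes |>
  # Label each result by whether the qualifying tire rule still applied.
  mutate(era = if_else(season >= 2022, "after_rule_removed", "rule_in_force")) |>
  group_by(constructor, era) |>
  summarise(mean_finish = mean(finish), .groups = "drop")

era_summary
```

A lower `mean_finish` after 2022 for smaller constructors would be consistent with the hypothesis, although other regulation changes in the same season could also explain a shift.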
Literature
- Find one published credible article on the topic you are interested in researching.
https://onlinelibrary.wiley.com/doi/full/10.1111/1467-8462.12401
- Provide a one paragraph summary about the article.
In this study, a group of researchers looked at the data to determine the importance of starting in pole position (starting 1st). They controlled for a number of different factors, including rain, accidents, technical retirements, and others. What they concluded was that there was a 10% increase in a driver’s probability of winning a race if he started in pole position. They also found that the probability of winning went down as a driver started farther back, as one would expect. They also noted that rain did not seem to have an effect on these results, and their conclusions still applied for races where it rained. Importantly, they found that the importance of pole position has increased over time. In the 1970s, the increase in win probability was only 5% (compared with the 14.7% in the 2000s). Finally, they looked at 5 of the top drivers in F1 history to see if they outperformed the data. They conclude that, for those 5 drivers, there is no change in win probability, but they will outperform the average driver by about 2 full positions on average.
- In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
These researchers looked specifically at the importance of pole position for race outcome, while we want to look at which factors are most important in causing a big change from qualifying position to final position. While they do note other factors, such as rain and driver skill, they focus on drivers starting on pole, not on the difference between starting position and final position.
Data 3
Introduction and data
- Identify the source of the data.
Github and Data.world
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
A group of undergraduate students from Miami University (Peter Walsh, Samantha Erne, Brendan Beattie, and Jake Holroyd) and Farmer School associate professor of information systems and analytics Fadel Megahed conducted research on NFL football injuries from the 2019-2020 to the 2022-2023 seasons. They analyzed injury rates on different field surfaces (artificial turf, natural grass, and hybrid turf) while considering a variety of game, weather, team, and player factors. I was not able to find the site from which the group got their data.
The second piece of data was created 6 years ago by data.world user Alice-C and includes data from the 2012-2013 to 2014-2015 NFL seasons. Unfortunately this is all the information that I could find on this dataset.
- Write a brief description of the observations.
For both datasets, there are fewer injuries on turf playing surface, but this difference does not seem to be statistically significant. Domed stadiums do increase the risk of injuries, and the rate of concussions is greater for turf playing surfaces.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How does the surface of play (grass or turf) affect the number and types of injuries per game experienced by National Football League players from the 2014-2015 to 2022-2023 NFL seasons?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
We will compare the number of injuries per game that players receive by the stadium in which their game is played. Each stadium will be classified as having a turf or grass surface. We will group injuries into categories (head, contact, non-contact, muscle strain, broken bone, etc.; these are just a few potential groupings) to see which injuries are associated with the field surface. We believe that there will be more injuries per game on turf fields, with an increase in both non-contact and contact injuries per game, but that head injuries will stay relatively constant.
- Identify the types of variables in your research question. Categorical? Quantitative?
There will be both categorical and quantitative variables in our research. The number of injuries per game will be a quantitative variable. However, variables such as type of injury, field surface, etc. will be categorical.
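As a sketch of how the quantitative outcome could be built from the categorical variables, the code below computes injuries per game by surface. The rows are invented and the column names (`game_id`, `surface`, `injury_type`) are placeholders, since the layout of the two source datasets is not described here.

```r
library(dplyr)

# Hypothetical injury records: one row per injury, plus one row with a
# missing injury_type to represent a game with no recorded injuries.
injuries <- tibble::tribble(
  ~game_id, ~surface, ~injury_type,
  1, "Turf",  "non-contact",
  1, "Turf",  "head",
  2, "Turf",  "contact",
  3, "Grass", "head",
  4, "Grass", NA_character_
)

injury_rates <- injuries |>
  group_by(surface) |>
  summarise(
    games = n_distinct(game_id),
    injuries = sum(!is.na(injury_type)),
    injuries_per_game = injuries / games,
    .groups = "drop"
  )

injury_rates
```

Adding `injury_type` to the grouping would give per-category counts, which is where the head, contact, and non-contact comparison would come from.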
Literature
- Find one published credible article on the topic you are interested in researching.
(Ekstrand, Hägglund, and Fuller 2011)
https://onlinelibrary.wiley.com/doi/full/10.1111/j.1600-0838.2010.01118.x
- Provide a one paragraph summary about the article.
This research paper, published in the Scandinavian Journal of Medicine & Science in Sports, compared the injuries of male and female “football” (soccer) players while playing on turf vs grass. The paper begins by describing the newest development of third-generation turf, which is intended to reduce the risk of injuries. The data came from a study of 15 male teams and 5 female teams playing on artificial turf and grass between February 2003 and October 2008. Overall, 2105 injuries were recorded (1791 male and 314 female). Yet the authors concluded that, for both men and women, there were no significant differences between artificial turf and grass for traumatic injuries. There were some differences in overuse muscle injuries, with a quadriceps strain being less likely on turf but an ankle sprain being more likely; however, these differences were slight. Overall, they made sure that their findings complied with FIFA guidelines but, due to the limited number of research articles on the topic, recommended more research in the future.
- In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
Most of the existing research on the question concerns European “football” (soccer), while we will be analyzing American football, which should see significantly different results because there will be more traumatic injuries and fewer overuse injuries (also, our data will be exclusively male). Additionally, our data will be more recent, since the NFL’s policy changes regarding artificial turf did not exist at the time of this publication.