Lecture 20
Duke University
STA 199 - Fall 2023
2023-11-07
– Clone ae-20
– Draft Report Due November 15
– Exam-2 released November 16
— Cumulative with a focus on content post Exam-1
– In practice, we often define a success as the level we are most interested in
– In R, glm defaults to the first level being a failure, and the second level being a success
What would R consider a success?
– Variable Smoking: “Smoking” ; “Non-Smoking”
– Variable Smoking: 1 ; 0
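One way to reason about the two cases above is to mimic the level ordering. The sketch below is Python, not R, and only imitates the rule; it assumes the text variable is stored as a factor with R's default alphabetical level order, and the helper names are hypothetical.

```python
# Sketch only: imitates R's default level ordering, it does not call R.
def default_levels(values):
    """Return the distinct values in sorted order (assumed default factor order)."""
    return sorted(set(values))

def success_level(values):
    """With two levels, the second one is the level modeled as a success."""
    levels = default_levels(values)
    return levels[1]

smoking_text = ["Smoking", "Non-Smoking", "Smoking"]  # made-up example values
smoking_num = [1, 0, 0, 1]                            # made-up example values

print(default_levels(smoking_text), "-> success:", success_level(smoking_text))
print(default_levels(smoking_num), "-> success:", success_level(smoking_num))
```

Under these assumptions, "Non-Smoking" sorts before "Smoking", so "Smoking" is the second level; for the 0/1 coding, 1 is the second level.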
\[\log\Big(\frac{p}{1-p}\Big) = -1.9114 - 0.1684 \times exclaim\_mess\]
– What is our response?
– What does p stand for?
– Interpret -0.1684 in the context of the problem.
– What is the log odds of a spam email when the email has 3 exclamation points in it?
– The log odds is just the log of the odds.
– The odds are the ratio of the probability of a success to the probability of a failure
\[\frac{p}{1-p}\]
The probability of a spam email given that there are 3 exclamation points is ~ 0.08. The odds are then:
\[\frac{0.08}{0.92} = 0.087\]
Thus, we can say that an email with 3 exclamation points is 0.087 times as likely to be spam as it is to be not spam.
This may make more sense when the odds are > 1. Assume p is the probability that an email is not spam:
\[\frac{0.92}{0.08} = 11.5\]
Thus, we can say that an email with 3 exclamation points is 11.5 times as likely to be not spam as it is to be spam.
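For reference, the ~ 0.08 used above can be traced back to the fitted model. Plugging 3 exclamation points into the equation gives the log odds:
\[\log\Big(\frac{p}{1-p}\Big) = -1.9114 - 0.1684 \times 3 = -2.4166\]
Exponentiating gives the odds, and dividing the odds by one plus the odds gives the probability:
\[\frac{p}{1-p} = e^{-2.4166} \approx 0.089 \qquad p = \frac{0.089}{1 + 0.089} \approx 0.082\]
The 0.087 above differs slightly from 0.089 only because p was rounded to 0.08 before dividing.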
– The logit link function (the log odds) is used to model a binary categorical response, and makes our response more symmetric
– We can use the response of log-odds to calculate odds or probabilities
– Can interpret coefficients on the log-odds scale
– Can “math out” (i.e. set X = 3; X = 4) how odds and probabilities change at certain values of X
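As a rough illustration of "mathing out" the model, the sketch below plugs X = 3 and X = 4 into the fitted equation from earlier in the lecture. It is written in Python rather than R, and the function names are hypothetical; only the two coefficients come from the slides.

```python
import math

# Coefficients taken from the fitted equation in the slides.
INTERCEPT = -1.9114
SLOPE = -0.1684  # coefficient on exclaim_mess

def log_odds(exclaim_mess):
    """Predicted log odds of spam for a given number of exclamation points."""
    return INTERCEPT + SLOPE * exclaim_mess

def odds(exclaim_mess):
    """Predicted odds of spam: exponentiate the log odds."""
    return math.exp(log_odds(exclaim_mess))

def probability(exclaim_mess):
    """Predicted probability of spam: odds / (1 + odds)."""
    o = odds(exclaim_mess)
    return o / (1 + o)

for x in (3, 4):
    print(f"X={x}: log-odds={log_odds(x):.4f}, odds={odds(x):.4f}, p={probability(x):.4f}")

# Going from X = 3 to X = 4 multiplies the odds by exp(SLOPE).
odds_multiplier = odds(4) / odds(3)
print(f"odds multiplier per extra exclamation point: {odds_multiplier:.4f}")
```

The multiplier equals exp(-0.1684), roughly 0.845, no matter which two adjacent values of X are compared, while the change in probability depends on where X starts. That is why coefficients are usually interpreted on the log-odds or odds scale.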
Goals for Today
– The why and how of hypothesis testing using simulation techniques
Null Hypothesis \(H_o\) - “Nothing going on”; “No relationship between variables”
Alternative Hypothesis \(H_a\) - “Our research question”
As researchers, we are interested in what’s going on at the population level. We need to make this clear when writing out our hypotheses.
– \(\mu\) - Population Mean
– \(\pi\) - Population Proportion
Suppose we are interested in whether the body mass of penguins is larger than 50 pounds.
– \(H_o\):
– \(H_a\):
That is, we are interested in the true mean body mass of penguins, and hypothesize that their true mean body mass is larger than 50 pounds.
– \(H_o\): \(\mu\) = 50
– \(H_a\): \(\mu\) > 50
To test this hypothesis, we would need to go out and collect some data. Suppose I collected the body mass of 10 penguins, and found the mean to be 57.4 pounds.
– What is the proper notation for this?
– If you went out to collect data, do you think you would get the same sample mean?
– Are you comfortable concluding that \(\mu\) > 50 is true based on my data?
Different and no… because of variability!
Statistical inference is all about capturing variability about a statistic, and using it to make conclusions!
We use this variability to see whether any difference we observe can be explained by “random chance” or not.
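To make the idea of sampling variability concrete, here is a small Python sketch in which five researchers each weigh 10 penguins from the same population. The population mean and standard deviation are invented for illustration and do not come from real penguin data.

```python
import random
import statistics

random.seed(199)  # fixed seed so the sketch is reproducible

# Hypothetical population, assumed for illustration only.
POP_MEAN = 50.0   # assumed true mean body mass (pounds)
POP_SD = 8.0      # assumed true standard deviation (pounds)
SAMPLE_SIZE = 10

def one_sample_mean():
    """Draw one sample of penguins and return its sample mean."""
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    return statistics.mean(sample)

# Five researchers each collect their own sample from the same population.
sample_means = [one_sample_mean() for _ in range(5)]
for i, m in enumerate(sample_means, start=1):
    print(f"researcher {i}: sample mean = {m:.1f}")
```

Every researcher samples from the same population, yet each gets a different sample mean. A single sample mean above 50 is therefore not enough, on its own, to conclude that the population mean is above 50.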
Which letter is Bumba?
Please document your response here
– Set up your null and alternative hypotheses at the population level
– Collect data
– Calculate our sample statistic
– Calculate a p-value
– Write a decision
– Write a conclusion
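The steps above can be sketched for the penguin example with a simulation-based test. This is a minimal Python sketch, not the course's R workflow. The ten body masses are made up so that their mean is 57.4 pounds. The 0.05 cutoff and the shift-and-resample approach are also assumptions rather than something stated in the slides.

```python
import random
import statistics

random.seed(2023)  # fixed seed so the sketch is reproducible

# Step 1: hypotheses. H_o: mu = 50, H_a: mu > 50.
NULL_MEAN = 50.0

# Step 2: collect data. HYPOTHETICAL values, invented to have mean 57.4.
sample = [41.0, 72.5, 50.3, 66.8, 45.2, 69.1, 58.4, 63.7, 48.9, 58.1]

# Step 3: sample statistic.
observed_mean = statistics.mean(sample)

# Step 4: p-value by simulation. Shift the data so its mean equals the
# null value, then resample with replacement to see which sample means
# random chance alone would produce if H_o were true.
shifted = [x - observed_mean + NULL_MEAN for x in sample]

def simulate_null_mean():
    """Return the mean of one bootstrap resample from the shifted data."""
    resample = random.choices(shifted, k=len(shifted))
    return statistics.mean(resample)

N_SIMS = 5000
null_means = [simulate_null_mean() for _ in range(N_SIMS)]

# Proportion of simulated means at least as large as the observed mean.
p_value = sum(m >= observed_mean for m in null_means) / N_SIMS

# Step 5: decision, using an assumed significance level.
ALPHA = 0.05
decision = "reject H_o" if p_value < ALPHA else "fail to reject H_o"

print(f"observed mean = {observed_mean:.1f}, p-value = {p_value:.4f}, decision: {decision}")
```

The last step, the conclusion, restates the decision in context, for example whether the data provide convincing evidence that the true mean body mass of penguins is larger than 50 pounds.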