Lecture 18
Duke University
STA 199 - Fall 2023
2023-10-31
– Clone ae-17
– HW-5 due tonight
– Check Slack for formatting tips
– Schedule Change: Project Draft pushed back to Nov 15th
Feedback for draft will be in Issues tab on GitHub
Must respond to / close Issues specific to your selected data set
Logistic Regression (Today)
Statistical Inference
– Hypothesis Testing
– Confidence Intervals
Your research question can evolve a bit in STA 199 as we expand our statistical toolkit:
– Testing whether there are meaningful differences between population parameters (i.e., means or proportions)
Suppose I wanted to test whether the mean GPA for Duke students differs from that of UNC students. We can carry out a hypothesis test to gather evidence about whether an observed difference is meaningful or just arises by random chance:
\(H_0: \mu_{duke} = \mu_{unc}\)
\(H_a: \mu_{duke} \neq \mu_{unc}\)
Note: You can do something similar with proportions if your response variable is categorical
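As a sketch of how the Duke vs. UNC means test might be carried out in R with the infer package (the data frame `gpa_data` and its columns `gpa` and `school` are hypothetical):

```r
library(infer)

# hypothetical data frame `gpa_data` with columns `gpa` (numeric)
# and `school` ("Duke" or "UNC")
obs_diff <- gpa_data |>
  specify(gpa ~ school) |>
  calculate(stat = "diff in means", order = c("Duke", "UNC"))

null_dist <- gpa_data |>
  specify(gpa ~ school) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("Duke", "UNC"))

# p-value: how often a permuted difference is at least as extreme as the observed one
get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")
```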
Suppose now that, instead of testing for a difference, I wanted to estimate what that difference actually is. We can construct confidence intervals to estimate a range of plausible values for \(\mu_1 - \mu_2\).
Note: You can do something similar with proportions if your response variable is categorical
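Continuing the hypothetical `gpa_data` example, a bootstrap confidence interval for the difference in means might be constructed like this:

```r
# bootstrap distribution of the difference in sample means
boot_dist <- gpa_data |>
  specify(gpa ~ school) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("Duke", "UNC"))

# 95% percentile interval: a range of plausible values for the true difference
get_confidence_interval(boot_dist, level = 0.95)
```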
Model Data
– Look at relationships; predictions; estimate probabilities (today)
– Hypothesis Testing
– Confidence Intervals
– What is the difference between R-squared and Adjusted R-squared?
— How is each defined?
— When is each appropriate to use?
– How is each defined?
R-squared: The proportion of variability in our response that is explained by our model
Adjusted R-squared: A measure of overall model fit that applies a penalty for the number of explanatory variables in the model
— When is each appropriate to use?
R-squared: when comparing models with the same number of explanatory variables
Adjusted R-squared: when comparing models with different numbers of explanatory variables
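As a sketch in R (the model formulas and the data frame `my_data` are hypothetical), both quantities are reported by broom's glance():

```r
library(broom)

# two hypothetical linear models fit to a data frame `my_data`
model_small <- lm(y ~ x1, data = my_data)
model_large <- lm(y ~ x1 + x2, data = my_data)

# glance() returns r.squared and adj.r.squared for each fit;
# since these models differ in number of predictors, compare adj.r.squared
glance(model_small)
glance(model_large)
```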
– The What, Why, and How of Logistic Regression
– How to fit these types of models in R
– How to calculate probabilities using these models
Similar to linear regression… but
A modeling tool for when our response variable is categorical
– This type of model is called a generalized linear model
– Bernoulli Distribution
Two outcomes: success (with probability \(p\)) or failure (with probability \(1 - p\))
\(y_i \sim \text{Bern}(p)\)
We can use our explanatory variable(s) to model \(p\)
Note: We use \(p_i\) for estimated probabilities
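As a small illustration (the probability 0.3 and the sample size are arbitrary), Bernoulli outcomes can be simulated in R as binomial draws of size 1:

```r
# 10 draws from Bern(p = 0.3): 1 = success, 0 = failure
rbinom(n = 10, size = 1, prob = 0.3)
```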
– 1: Define a linear model
– 2: Define a link function
\(z_i = \beta_0 + \beta_1 X_i + \dots\)
We use \(z_i\) as a placeholder for our response variable
– Perform a transformation on our response variable so that it has the appropriate range of values
– Next, we need a link function that relates the linear model to the parameter of the outcome distribution, i.e., transforms the linear model so it has an appropriate range
– Or… takes values between \(-\infty\) and \(\infty\) and maps them to probabilities in [0, 1]
– The logit link function takes values between 0 and 1 and maps them to values between \(-\infty\) and \(\infty\)
The logit link function is defined as follows:
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
– Note: log refers to the natural log
It takes a probability between 0 and 1 and maps it to the log-odds scale (\(-\infty\) to \(\infty\)).
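A minimal sketch of this function in R (qlogis() is the base-R equivalent):

```r
# logit: map a probability to the log-odds scale
logit <- function(p) log(p / (1 - p))

logit(0.5)   # 0
logit(0.9)   # roughly 2.197
qlogis(0.9)  # built-in version, same value
```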
This isn't exactly what we need, though…
But it will help us get to our goal
– \(\text{logit}(p) = \widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots\)
– \(\text{logit}(p)\) is also known as the log-odds
– \(\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\)
– \(\log\left(\frac{p}{1-p}\right) = \widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots\)
– Recall, the goal is to take values between \(-\infty\) and \(\infty\) and map them to probabilities. We need the opposite of the link function… its inverse
– How do we take the inverse of a natural log?
– \(\text{logit}(p) = \widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots\)
– \[\log\left(\frac{p}{1-p}\right) = \widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots\]
– Let's take the inverse of the logit function
– Demo Together
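Filling in the algebra of the demo: exponentiating both sides undoes the natural log, and then we solve for \(p\):
\[\frac{p}{1-p} = e^{\widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots}\]
\[p\left(1 + e^{\widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots}\right) = e^{\widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots}\]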
\[p = \frac{e^{\widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots}}{1 + e^{\widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots}}\]
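In R, this inverse-logit transformation can be written directly, or with the built-in plogis():

```r
# inverse logit: map a log-odds value z back to a probability in (0, 1)
inv_logit <- function(z) exp(z) / (1 + exp(z))

inv_logit(0)      # 0.5
inv_logit(2.197)  # roughly 0.9
plogis(2.197)     # built-in version, same value
```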
Example Figure:
– We cannot model these data using the tools we currently have
– We can overcome some of the shortcomings of linear regression by fitting a generalized linear model
– We can model binary data by using the inverse logit function to estimate probabilities of success
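As a hedged end-to-end sketch of this workflow in R (the data frame `my_data`, binary response `y`, and predictor `x` are hypothetical): fit the generalized linear model with glm(), then convert the fitted log-odds into probabilities of success.

```r
# hypothetical data: binary response `y` (0/1) and numeric predictor `x`
logit_fit <- glm(y ~ x, data = my_data, family = "binomial")

# estimated coefficients are on the log-odds scale
coef(logit_fit)

# predicted probabilities of success: glm applies the inverse logit for us
predict(logit_fit, newdata = my_data, type = "response")
```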