2022-11-02
First, we need to understand what model selection is. Model selection is the process of choosing, from a set of candidate models, the one that addresses the problem best.
Likelihood is the probability of the observed data given a model: the higher, the better. For discrete data it lies between 0 and 1 (for continuous data it is a density, so it is not necessarily bounded by 1).
AIC (Akaike information criterion)
AIC = 2k - 2 ln(L)
k = number of parameters
ln(L) = log-likelihood of the data under the model (a large negative number)
Adding predictor variables will always improve (or at least never worsen) the fit of the data under the model, which is why AIC penalizes the number of parameters: the model with the lower AIC is preferred.
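The formula can be written directly in R as a quick sketch (aic_from_loglik is our own illustrative helper, not part of the notes; R's built-in AIC() computes the same quantity from a fitted model):
aic_from_loglik <- function(logL, k) 2 * k - 2 * logL   # AIC = 2k - 2 ln(L)
aic_from_loglik(-100, 3)   # e.g. a model with ln(L) = -100 and 3 parameters: 206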
R example:
set.seed(2)
x <- runif(100)
y <- rnorm(100, mean = 4 * x)   # y depends on x only
z <- runif(100)                 # z is unrelated to y
fit <- glm(y ~ x)               # model with the true predictor
fit2 <- glm(y ~ x + z)          # model with an extra, irrelevant predictor
summary(fit)
##
## Call:
## glm(formula = y ~ x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.23077 -0.85411 0.01703 0.92985 2.09850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04932 0.21727 0.227 0.821
## x 3.94852 0.37828 10.438 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.262504)
##
## Null deviance: 261.28 on 99 degrees of freedom
## Residual deviance: 123.73 on 98 degrees of freedom
## AIC: 311.08
##
## Number of Fisher Scoring iterations: 2
summary(fit2)
##
## Call:
## glm(formula = y ~ x + z)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.04788 -0.80474 -0.07834 0.83902 2.04717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1582 0.2579 -0.614 0.541
## x 3.8737 0.3795 10.208 <2e-16 ***
## z 0.5459 0.3706 1.473 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.247619)
##
## Null deviance: 261.28 on 99 degrees of freedom
## Residual deviance: 121.02 on 97 degrees of freedom
## AIC: 310.87
##
## Number of Fisher Scoring iterations: 2
Then, we can extract the log-likelihood of each model:
logLik(fit)
## 'log Lik.' -152.5386 (df=3)
logLik(fit2)
## 'log Lik.' -151.4327 (df=4)
From these results we can observe two things: fit2 has a slightly higher log-likelihood (-151.43 vs. -152.54), and it uses one more degree of freedom (df = 4 vs. 3). In other words, adding z improves the in-sample fit a little, as expected, even though z was not used to generate y.
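As a check, the AIC values reported in the two summaries can be recovered from these log-likelihoods (note that the df reported by logLik() counts the dispersion parameter as well), and AIC() compares the models directly:
AIC(fit, fit2)             # 311.08 vs. 310.87, as in the summaries above
2 * 3 - 2 * (-152.5386)    # manual AIC for fit,  about 311.08
2 * 4 - 2 * (-151.4327)    # manual AIC for fit2, about 310.87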
Next, run an analysis of deviance (a likelihood-ratio test) to compare the two models formally:
anova(fit, fit2, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: y ~ x
## Model 2: y ~ x + z
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 98 123.72
## 2 97 121.02 1 2.7064 0.1408
The residual deviance measures how well the response variable can be predicted by a model with the given predictor variables: the lower the value, the better the model predicts the response. Here the drop in deviance from adding z is small (2.71) and not significant (p = 0.14), so there is no evidence that z improves the model.
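For the Gaussian models fitted here, the residual deviance is simply the residual sum of squares, which we can confirm directly (both lines should reproduce the 123.73 shown above):
deviance(fit)           # residual deviance of the model with x only
sum(residuals(fit)^2)   # the same number: for a Gaussian GLM the deviance is the RSS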
We can also try a more complex simulation in which the response really does depend on both x and z, and see whether adding z now improves the fit:
y2 <- rnorm(100, mean=3*x+z)
fit3 <- glm(y2 ~ x)
fit4 <- glm(y2 ~ x + z)
summary(fit3)
##
## Call:
## glm(formula = y2 ~ x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4593 -0.6400 -0.0830 0.6321 3.1791
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5197 0.2074 2.506 0.0139 *
## x 3.1441 0.3611 8.708 7.65e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.150142)
##
## Null deviance: 199.93 on 99 degrees of freedom
## Residual deviance: 112.71 on 98 degrees of freedom
## AIC: 301.76
##
## Number of Fisher Scoring iterations: 2
summary(fit4)
##
## Call:
## glm(formula = y2 ~ x + z)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.68798 -0.60893 -0.09013 0.62783 2.90916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2823 0.2449 1.152 0.2520
## x 3.0585 0.3604 8.486 2.46e-13 ***
## z 0.6245 0.3520 1.774 0.0792 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.125482)
##
## Null deviance: 199.93 on 99 degrees of freedom
## Residual deviance: 109.17 on 97 degrees of freedom
## AIC: 300.56
##
## Number of Fisher Scoring iterations: 2
anova(fit3, fit4, test="LRT")
## Analysis of Deviance Table
##
## Model 1: y2 ~ x
## Model 2: y2 ~ x + z
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 98 112.71
## 2 97 109.17 1 3.5422 0.07606 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
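The p-value in this table appears to come from the drop in deviance scaled by the estimated dispersion of the larger model, compared against a chi-squared distribution with 1 degree of freedom; a rough check using the numbers printed above (our reconstruction, not code from the original notes):
pchisq(3.5422 / 1.125482, df = 1, lower.tail = FALSE)   # about 0.076, matching Pr(>Chi)
As before, we can also extract the log-likelihoods of the two models: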
logLik(fit3)
## 'log Lik.' -147.878 (df=3)
logLik(fit4)
## 'log Lik.' -146.2814 (df=4)
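AIC tells the same story: as reported in the two summaries above, fit4 has the lower AIC (300.56 vs. 301.76), and AIC() lets us compare the models side by side:
AIC(fit3, fit4)   # 301.76 vs. 300.56: the model that includes z is preferred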
Logistic regression models the probability of a discrete outcome as a function of one or more input variables.
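Concretely, the model puts a linear predictor on the log-odds scale and maps it to a probability through the logistic (inverse-logit) function; a minimal illustration (inv_logit is our own helper, plogis() is the built-in equivalent):
inv_logit <- function(eta) 1 / (1 + exp(-eta))   # maps any real number into (0, 1)
inv_logit(c(-4, 0, 4))                           # about 0.018, 0.500, 0.982
plogis(c(-4, 0, 4))                              # the built-in gives the same values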
R example
First, we simulate study times (hours of coding practice) for 68 students. If a simulated time is negative, we set it to 0.
set.seed(1)
cod.prac <- rnorm(68, mean=32, sd=16)
cod.prac[cod.prac<0] <- 0
Create an empty vector to store the results, and create a vector of passing probabilities by dividing the amount practiced by 60 (so the more hours you study, the higher your probability of passing).
result <- c()
probs.pass <- cod.prac/60
Simulate pass/fail data in a loop. Each iteration samples either "p" (pass) or "f" (fail); the prob argument of sample() is a vector of probability weights for the elements being sampled. Here the weight for "p" is probs.pass[i] and the weight for "f" is max(probs.pass) - probs.pass[i], so a student who studies more has a higher chance of passing and a lower chance of failing.
for (i in 1:68) {
  result[i] <- sample(c("p", "f"), 1,
                      prob = c(probs.pass[i], max(probs.pass) - probs.pass[i]))
}
Then we can take a quick look at the simulated results before fitting.
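For example (a sketch; the original notes do not show this output):
table(result)                                        # counts of passes and fails
boxplot(cod.prac ~ result,
        xlab = "outcome", ylab = "hours practiced")  # passers tend to have practiced more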
Now fit the logistic regression:
fit <- glm(as.factor(result) ~ cod.prac, family = binomial)
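Note that as.factor(result) orders its levels alphabetically ("f", "p"), so the binomial GLM models the probability of the second level, "p" (passing). A quick look at the fit:
levels(as.factor(result))   # "f" "p": the model estimates P(result == "p")
coef(fit)                   # intercept and slope on the log-odds scale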
We can also create new data on which to predict from the fitted model. Because the response is binary (p/f), the raw outcomes take only two values, so to see the fitted curve we instead predict probabilities on a fine grid. The code below creates a data frame of 1000 cod.prac values ranging from 0 to the maximum observed time, then uses predict() with type = "response" to get the predicted probability of passing at each value.
newdat <- data.frame(cod.prac=seq(0, max(cod.prac), length.out=1000))
newdat$result <- predict(fit, newdata = newdat, type="response")
ys converts the results for the 68 students (p or f) to numeric 0/1: 0 for f and 1 for p.
ys <- as.numeric(as.factor(result))-1
Letβs plot the results.
plot(ys ~ cod.prac, col = rgb(1, 0, 0, 0.5))            # observed outcomes (0 = fail, 1 = pass)
lines(newdat$result ~ newdat$cod.prac, col = "green")   # fitted probability-of-passing curve
Finally, estimate how much time we should study to have at least a 50% chance of passing the exam: the first grid value at which the predicted probability exceeds 0.5.
newdat$cod.prac[min(which(newdat$result > .5))]
## [1] 30.94791
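Equivalently, since the fitted probability is 0.5 exactly where the linear predictor (the log-odds) equals 0, the same threshold can be computed directly from the coefficients, and it should agree with the grid-based answer above up to the resolution of newdat:
unname(-coef(fit)[1] / coef(fit)[2])   # hours of practice at which P(pass) = 0.5, roughly 31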