Fitting a Generalized Linear Model (GLM) in R


I am learning about Generalized Linear Models and the use of the R statistical package, but, unfortunately, I am unable to understand some fundamental concepts.

I am trying to develop a Poisson GLM, using a specific log link function. The model is of the form

$$\ln(E(y_i)) = \ln(\beta_1) + \beta_2 \ln(\text{exp}_1) + \beta_3 \ln(\text{exp}_2).$$

In this equation, $\text{exp}_1$ and $\text{exp}_2$ are measures of exposure in the model. From my understanding, in R, I would first load all the data and ensure it was properly set up. I then believe I should run:

model <- glm(Y ~ exp1 + exp2, family = poisson(link = "log"), data = CSV_table)

As I am new to GLMs and R, I am not exactly sure what specifying poisson(link = "log") does. I hope this question isn't too trivial. I have been trying to Google clear, concise explanations for hours; however, many answers/links assume a level of knowledge higher than mine.
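One detail worth flagging: the equation above contains the logs of the exposures, so a formula matching it term for term would log-transform them as well. A sketch, assuming the data frame columns are named Y, exp1, and exp2 as in the call above:

    # log-transform the exposures so the fit matches the stated equation
    model_logged <- glm(Y ~ log(exp1) + log(exp2),
                        family = poisson(link = "log"), data = CSV_table)

Specifying poisson(link = "log") tells glm that Y is Poisson-distributed and that the log of its mean, $\ln(E(y_i))$, is what is modelled as linear in the predictors; the fitted intercept then corresponds to $\ln(\beta_1)$ in the equation.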

Can the likelihood function in a generalized linear model be written in terms of the model parameters and the input variables?


In Generalized Linear Models by Peter McCullagh and John Ashworth Nelder,

  1. the exponential family distribution has the following form:
    $$ f_Y(y; \theta, \phi) = \exp\{ (y\theta - b(\theta))/a(\phi) + c(y, \phi) \} \tag{2.4} $$
    Then the mean is $$ EY = b'(\theta). $$

    I was wondering if, in general, $b'$ is assumed to be invertible or
    injective in generalized linear models, so that we can write
    $\theta = (b')^{-1}(EY)$?

  2. The mean is related to the linear predictor via $g(EY) = X\beta$. Is the link function $g$ assumed to be injective or invertible in
    generalized linear models, so that $EY = g^{-1}(X\beta)$?

The purpose of the two questions is to help me understand how the parameter $\beta$ in generalized linear models is estimated via MLE. In particular, can the pdf of $Y$ be written as $f_Y(y; \beta, X, \phi)$, for example via $\theta = (b')^{-1}(g^{-1}(X\beta))$, so that we can apply MLE to $f_Y(y; \beta, X, \phi)$? Thanks!
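As a concrete worked example (my own, not from the book), take the Poisson distribution with its canonical log link. Writing the pmf in form (2.4),

$$ f_Y(y; \theta) = \exp\{ y\theta - e^{\theta} - \ln y! \}, \qquad b(\theta) = e^{\theta}, \quad a(\phi) = 1, $$

so that

$$ EY = b'(\theta) = e^{\theta}, \qquad \theta = (b')^{-1}(EY) = \ln EY. $$

Here $b'$ is strictly increasing, hence injective, and with the link $g = \ln$ the composition collapses to $\theta = (b')^{-1}(g^{-1}(X\beta)) = X\beta$, so the density can indeed be written as $f_Y(y; \beta, X)$ and maximized over $\beta$.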

Brier versus AIC


There are four models I'm looking at, and I'm trying to evaluate which will give me the best predictions going forward. So I loop through each model and look at the error rate, the Brier score, and the AIC. In some cases the AIC and Brier score are not consistent (results for models 3 and 4 below). If they give contrasting indications, which one should be paid attention to, and why? Or am I looking at this the wrong way? For those interested, my R code and results are below:

    line <- seq(71, 79, by = .5)

    models <- list("target~total+tot_eit_h_h1+tot_both_h_h1",
                   "target~total",
                   "target~total*spread*as.factor(loc)",
                   "target~total+spread*as.factor(loc)")

    for (j in 1:length(models)) {
      # start each model with empty metric accumulators
      aics <- c()
      briers <- c()
      error_rates <- c()

      for (i in 1:length(line)) {
        # binary target: 1 above the line, 0 below, a fair coin flip on a push
        data$target <- ifelse(data$result > line[i], 1,
                              ifelse(data$result < line[i], 0,
                                     rbinom(nrow(data), 1, .5)))

        model <- glm(models[[j]], data = data, family = "binomial")
        data$prediction <- predict(model, data, type = "response")

        brier <- with(data, mean((prediction - target)^2))
        error_rate <- mean((data$prediction > 0.5 & data$target == 0) |
                           (data$prediction < 0.5 & data$target == 1))
        aics <- c(aics, model$aic)
        briers <- c(briers, brier)
        error_rates <- c(error_rates, error_rate)
      }
      print(paste(mean(error_rates), "error rates"))
      print(paste(mean(aics), "AIC"))
      print(paste(mean(briers), "BRIER"))
      cat("\n")
    }

RESULTS

[1] "0.381418853255588 error rates"
[1] "1633.50190616553 AIC"
[1] "0.232600213678421 BRIER"

[1] "0.38491739552964 error rates"
[1] "1641.54699258567 AIC"
[1] "0.234847243715002 BRIER"

[1] "0.314091350826045 error rates"
[1] "1451.51883987259 AIC"
[1] "0.200957523053587 BRIER"

[1] "0.324509232264334 error rates"
[1] "1467.88024134523 AIC"
[1] "0.205168360087164 BRIER"

What is the best book about generalized linear models for novices?


I’m still pretty new to generalized linear models, and I struggle with a lot of the notation in most of the GLM texts I’ve picked up. Are there extremely popular GLM books that lend themselves better to readability?

Normalize the ratio predictor variable by logit or not?


I have a Poisson GLM model

m1 <- glm(count ~ x, data = data, family = "quasipoisson")

The predictor variable x is a cumulative sum of ratios, i.e., of real numbers between 0 and 1. (Not important for this question, but in case you are wondering why the sum is needed: it comes from a transformation of a simpler but recursive model.)

Is it desirable to normalize the predictor variable by a logit transform? That is, instead of using the cumulative sum of the ratios, should I use the cumulative sum of the logits of the ratios (x2)?

m2 <- glm(count ~ x2, data = data, family = "quasipoisson")
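For concreteness, a sketch of how x2 could be built, assuming a vector `ratios` of values strictly between 0 and 1 (the name is hypothetical):

    # qlogis() is the logit function in base R; cumsum() gives the running sum
    x2 <- cumsum(qlogis(ratios))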

I looked at the Q-Q plots (plot(model)), but they seem to be the same for both models. However, I also made another plot, which is not among the plot(model) output; I don't know if it is actually important or not. See the comparison of the two models on this plot:

[plot: comparison of the two models]

Is this plot actually important for this decision? Does the second plot actually show heteroscedasticity? If yes, then is the original model (without transformation) possibly better according to these plots?
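In case it helps others reproduce the comparison, here is a sketch of one common heteroscedasticity diagnostic (a scale-location-style plot; the model names m1 and m2 follow the question):

    # absolute Pearson residuals vs. fitted values for both models
    par(mfrow = c(1, 2))
    plot(fitted(m1), abs(residuals(m1, type = "pearson")),
         xlab = "Fitted values", ylab = "|Pearson residuals|", main = "m1")
    plot(fitted(m2), abs(residuals(m2, type = "pearson")),
         xlab = "Fitted values", ylab = "|Pearson residuals|", main = "m2")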

IRLS working weights for the proportional odds model


According to Reduced-Rank Vector Generalized Linear Models, the parameter estimates are obtained by a Fisher scoring algorithm. In the VGAM package the IRLS algorithm is used:

$$\beta^{(a+1)} = \left(\sum_i X_i^\top W_i^{(a)} X_i\right)^{-1} \left(\sum_i X_i^\top W_i^{(a)} z_i^{(a)}\right)$$

My aim is to reproduce the IRLS steps by extracting the working weights $W_i^{(a)}$ for each step $a$ and then fitting the parameters for the $(a+1)$-th step.
I am failing at the first step because the vglm function does not accept the working weights.

library(VGAM)

pneumo <- transform(pneumo, let = log(exposure.time))
fit1 <- vglm(cbind(normal, mild, severe) ~ let,
             cumulative(), data = pneumo, maxit = 1)
weight1 <- weights(fit1, type = "working")  # working weights after one step

# the weights can't be used to fit the second step
fit2 <- vglm(cbind(normal, mild, severe) ~ let,
             cumulative(), data = pneumo, maxit = 1, weights = weight1)
weight2 <- weights(fit2, type = "working")

?vglm tells me:
If the VGAM family function handles multiple responses (q > 1 of them, say) then weights can be a matrix with q columns. Each column matches the respective response.

?weightsvglm tells me:
Working weights are used by the IRLS algorithm. They correspond to the second derivatives of the log-likelihood function with respect to the linear predictors. The working weights correspond to positive-definite weight matrices and are returned in matrix-band form, e.g., the first M columns correspond to the diagonals, etc.
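To see what one such update looks like in the simplest case, here is a minimal sketch of the formula above for an ordinary one-response Poisson GLM, where each $W_i$ is a scalar (my own illustration; VGAM's multi-response working weights are the matrix-band generalization of this):

    # one IRLS loop for a Poisson GLM with log link:
    # working weights W = diag(mu), working response z = eta + (y - mu)/mu
    irls_poisson <- function(X, y, iter = 25) {
      beta <- rep(0, ncol(X))
      for (a in seq_len(iter)) {
        eta  <- drop(X %*% beta)
        mu   <- exp(eta)
        W    <- diag(mu)
        z    <- eta + (y - mu) / mu
        beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
      }
      drop(beta)
    }
    # at convergence this agrees with coef(glm(y ~ X - 1, family = poisson))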

Plot prediction for covariates from GLM


I have run GLMs and arrived at a final model that fits my data. Now I would like to plot each of my important covariates against the predicted values: hold all the other covariates of the model constant, vary one covariate, and plot the result. My goal is a line with confidence intervals on which I can read off the predicted value for a given value of my covariate. I have a really clear idea of the graph I want, but I have no idea how to produce it in R. Any suggestions?
Thank you,
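A sketch of one way to do this in base R, assuming a fitted GLM with log link called `fit` and numeric covariates `x1` and `x2` in a data frame `d` (all names hypothetical): build a grid where x1 varies and x2 is held at its mean, predict on the link scale with standard errors, and back-transform.

    nd <- data.frame(x1 = seq(min(d$x1), max(d$x1), length.out = 100),
                     x2 = mean(d$x2))                  # hold x2 fixed
    pr  <- predict(fit, newdata = nd, type = "link", se.fit = TRUE)
    est <- exp(pr$fit)                                 # back-transform (log link)
    lwr <- exp(pr$fit - 1.96 * pr$se.fit)              # approximate 95% CI
    upr <- exp(pr$fit + 1.96 * pr$se.fit)
    plot(nd$x1, est, type = "l", xlab = "x1", ylab = "Predicted response")
    lines(nd$x1, lwr, lty = 2)
    lines(nd$x1, upr, lty = 2)

Computing the interval on the link scale and then back-transforming keeps the bounds inside the response's valid range.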

Comparison of negative binomial and quasi-Poisson models


I have run negative binomial and quasi-Poisson models based on a hypothesis-testing approach. My final models from the two methods have different covariates and interactions. There are no obvious patterns when I plot the residuals in either case, so I was wondering which test I could use to see which model fits my data better, given that the quasi-Poisson model does not have a likelihood or AIC…

Also, I have a lot of overdispersion, which makes me think the negative binomial would be more appropriate, but I don't know if I can choose my model based on common sense…
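To make the "no likelihood or AIC" point concrete, a sketch (assuming a count response `count` and a predictor `x` in a data frame `d`; the names are hypothetical):

    library(MASS)  # for glm.nb
    m_qp <- glm(count ~ x, data = d, family = quasipoisson)
    m_nb <- glm.nb(count ~ x, data = d)
    AIC(m_qp)  # NA: quasi-likelihood models have no true likelihood
    AIC(m_nb)  # defined: the negative binomial has a full likelihood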

Getting started with VGAM::vglm


Trying to fit a zero-inflated Poisson model, I have trouble understanding the parameters of the vglm function in VGAM. As an example, here are some simulated data:

library(VGAM)

n <- 100
x <- factor(c(rep("a", n), rep("b", n), rep("c", n)))
lambda <- ifelse(x == "a", 1.3, ifelse(x == "b", 1.6, 1.1))
pstr0 <- 0.04

y <- rzipois(length(x), lambda = lambda, pstr0 = pstr0)  # one draw per row
dat <- data.frame(x = x, y = y)

(histogram)

Now I want to fit a model that estimates the four parameters (three for lambda: 1.1, 1.3, 1.6 and one for pstr0: 0.04). I tried

m.fit <- vglm(y~x, 
    family=zipoisson(zero=1, llambda="identity", lpstr0="identity"), 
    data=dat
)

with various parameters to zipoisson. I also tried adding "-1" to the formula and using zipoissonff as the family function. However, I get only very small estimates for the factor levels b and c:

coef(m.fit, matrix=TRUE)

                 pstr0        lambda
(Intercept) 0.06974519  1.305019e+00
xb          0.00000000 -5.308981e-15
xc          0.00000000  1.210689e-15

What am I doing wrong? Or is it so hard to estimate the parameters?

Categorical independent variables and binary dependent variable? GLM?


I've done a literature review. The variable treat takes only TRUE and FALSE values; it records whether a given treatment was applied in a given article.
I want to know whether the probability of applying the treatment changes depending on the journal and on the article.type. By the way, I should note that the journals do not all have the same frequencies of article types.

Here is a data to use as an example:

df <- data.frame(
  treat = c(T,F,T,T,F,T,T,T,F,T,T,T,T,F,T,T,F,T,F,T,T,T,F,F),
  journal = rep(c('a','b','c'), 8),
  article.type = c('a','b','c','a','b','a','a','b','c','a','a','b','c','a','c',
                   'a','a','a','b','a','c','a','a','c')
)

First question:

What statistical test should I use?

I thought of using a GLM (generalized linear model) with a binomial error distribution. Does that seem reasonable to you?

Second question: Already Answered

summary(glm(treat~journal*article.type, data=df, family='binomial'))

Coefficients: (1 not defined because of singularities)
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)             1.957e+01  4.390e+03   0.004    0.996
journalb               -1.997e+01  4.390e+03  -0.005    0.996
journalc                8.429e-09  8.781e+03   0.000    1.000
article.typeb          -3.913e+01  1.162e+04  -0.003    0.997
article.typec           4.134e-08  1.162e+04   0.000    1.000
journalb:article.typeb  3.884e+01  1.162e+04   0.003    0.997
journalc:article.typeb  3.913e+01  1.756e+04   0.002    0.998
journalb:article.typec         NA         NA      NA       NA
journalc:article.typec -1.916e+01  1.388e+04  -0.001    0.999

When running this, I get several p-values per variable: one p-value for "article.typeb" and one for "article.typec", for example. What does this mean?
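For what it's worth, each non-reference level of a factor gets its own dummy coefficient, hence one p-value per level. A single overall test per term can be obtained from a likelihood-ratio table (a sketch using base R):

    fit <- glm(treat ~ journal * article.type, data = df, family = 'binomial')
    anova(fit, test = "Chisq")  # sequential LR tests: one row per term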
