Looking for and dealing with collinearity in a GLM

I’ve got a dataset with one continuous dependent variable and two categorical explanatory variables. I want to run GLMs on the data, but I’m running into what I think is collinearity: when I fit the model I get an NA for the last coefficient. I can’t find a way to test this kind of data for collinearity, particularly not for a GLM, and I’m not sure what the best thing to do would be if I did find collinearity!

Please help! This is for my undergrad dissertation and I know practically nothing about R or statistics.

By the way, the dependent variable is proportion data, which is why I am using the binomial family.

> summary(Model1data)
  Host                 Parasite  Replicate   Mortality     
 1      : 1   Control          : 1   A:35      Min.   :0.0000  
 MCB4865:21   Mix              :38   B:37      1st Qu.:0.0100  
 MY10   :21   S.marcescens.2170:42   C:51      Median :0.1500  
 MY14   :21   S.marcescens.D   :42             Mean   :0.2055  
 MY17   :18                                    3rd Qu.:0.3096  
 MY8    :21                                    Max.   :0.9885  
 N2     :20                                                    

> glm1 <- glm(data = Model1data, 
+            Mortality ~ Host + Parasite, family = binomial)
Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
> summary(glm1)

glm(formula = Mortality ~ Host + Parasite, family = binomial, 
data = Model1data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.86613  -0.19364  -0.08645   0.12684   1.11525  

Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)                 -6.801     30.015  -0.227   0.8207  
HostMCB4865                  2.654     30.069   0.088   0.9297  
HostMY10                     1.466     30.073   0.049   0.9611  
HostMY14                     1.184     30.074   0.039   0.9686  
HostMY17                     2.247     30.071   0.075   0.9404  
HostMY8                      1.523     30.072   0.051   0.9596  
HostN2                       1.955     30.071   0.065   0.9482  
ParasiteMix                  4.118      1.788   2.304   0.0212 *
ParasiteS.marcescens.2170    4.165      1.782   2.337   0.0194 *
ParasiteS.marcescens.D          NA         NA      NA       NA  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 41.499  on 122  degrees of freedom
Residual deviance: 15.220  on 114  degrees of freedom
AIC: 88.973

Number of Fisher Scoring iterations: 9

Also, here is (I think) the contingency table:

> table(Model1data$Host, Model1data$Parasite)

          Control Mix S.marcescens.2170 S.marcescens.D
  1             1   0                 0              0
  MCB4865       0   7                 7              7
  MY10          0   7                 7              7
  MY14          0   7                 7              7
  MY17          0   4                 7              7
  MY8           0   7                 7              7
  N2            0   6                 7              7
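The table above actually shows where the singularity comes from: the Host level "1" occurs only with Parasite "Control", and "Control" occurs only with Host "1", so one Parasite dummy is a linear combination of the Host dummies. This is structural aliasing rather than ordinary collinearity. A sketch of how to confirm and work around it (this assumes `Model1data` as in the question; it is not runnable without that data):

```r
# alias() lists linearly dependent coefficients; it dispatches on the
# "lm" class that glm objects inherit from.
glm1 <- glm(Mortality ~ Host + Parasite, family = binomial, data = Model1data)
alias(glm1)

# One possible remedy: drop the single Host "1" / Control observation so
# that the two factors are fully crossed, then refit.
sub  <- droplevels(subset(Model1data, Host != "1"))
glm2 <- glm(Mortality ~ Host + Parasite, family = binomial, data = sub)
```

After dropping that row, every remaining Host level appears with every remaining Parasite level, so no coefficient should be aliased.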

Evaluating a binomial (success vs. failure) GLM

I’m familiar with (some) approaches to evaluating the fit (or accuracy) of a binary (logistic) model (e.g. AUC). Are there methods/approaches that are particularly well-suited for a binomial (success vs. failure) model?

If the suggestion is to use a variant of a pseudo-R-square, what are the recommended approaches? I can think of a few, using logit- vs. response-scale predictions and with and without weighting by the number of subjects, but am unsure of their appropriateness:

[s = # of successes, f = # of failures]

  1. summary(lm(predict(model)~I(s/(s+f))))$r.square

  2. summary(lm(predict(model,type='response')~I(s/(s+f))))$r.square

  3. summary(lm(predict(model)~I(s/(s+f)),weights=s+f))$r.square

  4. summary(lm(predict(model,type='response')~I(s/(s+f)),weights=s+f))$r.square

  5. 1-var(residuals(model))/(var(s/(s+f)))
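For concreteness, here is a self-contained toy version of variant 5 on invented success/failure counts (the predictor `dose` and all numbers are made up for illustration):

```r
# Toy grouped-binomial data: 5 groups of 20 trials each, success probability
# increasing with a hypothetical 'dose' covariate.
set.seed(1)
n    <- rep(20, 5)                       # trials per group (s + f)
dose <- 1:5
s    <- rbinom(5, n, plogis(-2 + 0.8 * dose))
f    <- n - s

model <- glm(cbind(s, f) ~ dose, family = binomial)

# Variant 5. Note that residuals() defaults to *deviance* residuals, which
# are on a different scale from the raw proportions s/(s+f) -- part of the
# appropriateness question being asked here.
r2_v5 <- 1 - var(residuals(model)) / var(s / (s + f))
```

Swapping in `residuals(model, type = "response")` puts the numerator on the same (proportion) scale as the denominator, which is one of the choices the list above is implicitly weighing.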

Was this the appropriate regression model?

I was recently proof-reading a friend’s thesis (for the writing, not the statistics) when I came across a use of a regression model that I would regard as incorrect. However, I’m pretty new to the game, so I wanted to see what other people thought.

In short: They used a linear regression model for a percentage. I think they should have used a logistic regression. I’ll explain below.

They were testing the percentage of seedlings which flowered and whether there were observable differences depending on which soil mixture was used, i.e. did soil A result in a higher percentage of seeds which flowered compared with soil B?

To me, this seems like a binary outcome, i.e. yes, it flowered, or no, it did not. Consequently, a logistic regression model would have been more appropriate, as a percentage is not a quantitative variable in the way that blood pressure is.

Is this incorrect? I would be very interested in knowing, chiefly because it’s bugging me.
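For what it’s worth, the logistic alternative being argued for could be fitted from the flowered/planted counts per tray. A minimal sketch, with entirely invented counts:

```r
# Hypothetical data: for each tray we know how many seeds flowered out of
# how many were planted, and which soil mixture was used.
soil     <- factor(rep(c("A", "B"), each = 2))
planted  <- c(50, 50, 50, 50)
flowered <- c(30, 28, 18, 20)

# Grouped binomial (logistic) regression: successes vs. failures.
fit <- glm(cbind(flowered, planted - flowered) ~ soil, family = binomial)
summary(fit)   # the soilB coefficient is the log-odds difference vs. soil A
```

This treats each seed as a Bernoulli trial, which is exactly the binary-outcome framing described above.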

Analysis for checking if an Ensemble model is a better fit for a dataset than Primitive model

I have a dataset and the option to apply either a GLM (primitive) or a Random Forest (ensemble). So far the Random Forest gives far better results than the GLM. Since it is generally believed that ensemble models should not be used unless absolutely necessary, I am looking for an analysis I could perform on the dataset to show that an ensemble model such as a Random Forest really is the only, or at least the better, way to model the relationships between the variables.
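One generic way to substantiate such a claim is to compare both models on the same cross-validation folds, so any gap in out-of-sample error is attributable to the model class rather than to a lucky split. A sketch on synthetic data (the nonlinear truth is invented; `randomForest` is the usual package for this):

```r
library(randomForest)

# Synthetic data with a nonlinear signal the GLM cannot capture.
set.seed(42)
dat   <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- sin(6 * dat$x1) + dat$x2 + rnorm(200, sd = 0.1)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))
mse   <- matrix(NA, k, 2, dimnames = list(NULL, c("glm", "rf")))

for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  g <- glm(y ~ x1 + x2, data = train)          # gaussian GLM for simplicity
  r <- randomForest(y ~ x1 + x2, data = train)
  mse[i, "glm"] <- mean((test$y - predict(g, newdata = test))^2)
  mse[i, "rf"]  <- mean((test$y - predict(r, newdata = test))^2)
}
colMeans(mse)   # average out-of-sample error per model
```

If the gap persists across folds, that is evidence the relationship has structure (nonlinearity, interactions) the GLM’s linear predictor misses; adding those terms to the GLM and re-running is the natural follow-up check.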

Post hoc test for GLM using quasi poisson on count data in R? glht?

I’m currently working with count data containing lots of zeros (the number of invertebrates within 6 types of hedge plots). I fitted a GLM with the quasipoisson family, which showed a significant difference between the plots. I would now like to know which hedge plots differ from which…

Can I use the glht function below to do this?
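(The original code block did not survive, but for reference, a typical all-pairwise comparison with `multcomp::glht` on a quasipoisson GLM looks like the sketch below; the data here are simulated stand-ins for the hedge counts.)

```r
library(multcomp)

# Simulated zero-heavy counts across 6 hedge types, as placeholders for
# the real data (variable names are assumptions, not the asker's).
set.seed(3)
mydata <- data.frame(hedge = gl(6, 10, labels = paste0("H", 1:6)))
mydata$count <- rnbinom(60, mu = 4, size = 1)

m <- glm(count ~ hedge, family = quasipoisson, data = mydata)

# Tukey-style all-pairwise contrasts among hedge types.
summary(glht(m, linfct = mcp(hedge = "Tukey")))
```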


Is there a simple model that assumes a negative relationship between the mean and variance of response?

Poisson models assume that the variance of the response equals its mean, and negative binomial models that the variance is greater than or equal to the mean. Is there a simple model where that relationship is reversed, i.e. where the variance goes down as the mean goes up?

Finding Bias$(s^2)$ in an incorrectly specified linear model

I am unable to find the bias of the sample variance estimator in this problem. Unfortunately I keep coming up with

Question: Suppose that the true linear model is

$y = X_1\beta_1 + X_2\beta_2 + e$,

where $e \sim MVN(0, \sigma^2 I)$, $X_1: N \times p_1$, $\beta_1: p_1 \times 1$, $X_2: N \times p_2$, $\beta_2: p_2 \times 1$. But we wrongly fit the model $y = X_1\beta_1 + e$ and calculate the estimators of $\beta_1$ and $\sigma^2$ as:

$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y$,

$s^2 = \large\frac{y'[I - X_1(X_1'X_1)^{-1}X_1']y}{N - p_1}$.

(a) Show that, in general, $\hat{\beta}_1$ is a biased estimator with

Bias$(\hat{\beta}_1) = (X_1'X_1)^{-1}X_1'X_2\beta_2$.

Under what condition on $X_1$ and $X_2$ is $\hat{\beta}_1$ unbiased?

(b) Show that, in general, $s^2$ is a biased estimator with

Bias$(s^2) = \frac{1}{N-p_1}\beta_2'X_2'[I - X_1(X_1'X_1)^{-1}X_1']X_2\beta_2$.

Is the bias positive or negative? Under what condition on $\beta_2$ is $s^2$ unbiased?


For part (a):

Bias$(\hat{\beta}_1) = E\left((X_1'X_1)^{-1}X_1'y\right) - \beta_1$

$\hspace{15mm} = (X_1'X_1)^{-1}X_1'E(y) - \beta_1$

$\hspace{15mm} = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2) - \beta_1$

$\hspace{15mm} = (X_1'X_1)^{-1}X_1'X_1\beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 - \beta_1$

$\hspace{15mm} = I\beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 - \beta_1$

$\hspace{15mm} = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 - \beta_1$

$\hspace{15mm} = (X_1'X_1)^{-1}X_1'X_2\beta_2$

If $X_1'X_2 = 0$, i.e. the columns of $X_1$ and $X_2$ are orthogonal, then $\hat{\beta}_1$ is unbiased.

For part (b), I am unable to figure out how to derive:

Bias$(s^2) = \frac{1}{N-p_1}\beta_2'X_2'[I - X_1(X_1'X_1)^{-1}X_1']X_2\beta_2$

I can get


after taking the expectation of y on both sides.

Any help would be much appreciated.
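A standard identity usually closes the gap in part (b): for a random vector $y$ with $E(y) = \mu$ and $Cov(y) = \Sigma$, and a symmetric matrix $A$,

$E(y'Ay) = \text{tr}(A\Sigma) + \mu'A\mu$.

Taking $A = I - X_1(X_1'X_1)^{-1}X_1'$, $\Sigma = \sigma^2 I$ and $\mu = X_1\beta_1 + X_2\beta_2$, and noting that $AX_1 = 0$ (so the $\beta_1$ terms vanish) and $\text{tr}(A) = N - p_1$, one gets $E[(N-p_1)s^2] = (N-p_1)\sigma^2 + \beta_2'X_2'AX_2\beta_2$; dividing by $N-p_1$ and subtracting $\sigma^2$ gives the stated bias.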

Low sample size: LR-test vs. F-test

Some of you might have read this nice paper:

O’Hara RB, Kotze DJ (2010) Do not log-transform count data. Methods in Ecology and Evolution 1:118–122.

Currently I am comparing negative binomial models with gaussian models on transformed data. Unlike O’Hara & Kotze (2010), I’m looking at the special case of low sample sizes, in a hypothesis-testing context.

I used simulations to investigate the differences between the two.

Type I Error simulations

All computations were done in R.

I simulated data from a factorial design with one control group ($\mu_c$) and 5 treatment groups ($\mu_{1-5}$). Abundances were drawn from a negative binomial distribution with fixed dispersion parameter ($\theta = 3.91$) and were equal in all treatments.

For the simulations I varied the sample size (3, 6, 9, 12) and the abundances (2, 4, 8, …, 1024). 100 datasets were generated per setting and analysed using a negative binomial GLM (MASS:::glm.nb()), a quasipoisson GLM (glm(..., family = 'quasipoisson')) and a gaussian GLM on log-transformed data (lm(...)).

I compared each model with the null model using a likelihood-ratio test (lmtest:::lrtest()) for the gaussian and neg. bin. GLMs, as well as F-tests (anova(..., test = 'F')) for the gaussian and quasipoisson GLMs.

If needed I can provide the R code, but see also here for a related question of mine.
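For reference, one cell of the simulation described above might look like the sketch below (my reconstruction, not the original code; in particular the log(y + 1) transform is an assumption, since the counts include zeros):

```r
library(MASS)     # glm.nb
library(lmtest)   # lrtest

# One simulated dataset: 6 groups (control + 5 treatments), equal means,
# negative binomial counts with theta = 3.91 as in the question.
set.seed(1)
n     <- 3
mu    <- 8
theta <- 3.91
group <- gl(6, n)
y     <- rnbinom(6 * n, mu = mu, size = theta)

# Neg. bin. GLM vs. its null model, likelihood-ratio test.
nb  <- glm.nb(y ~ group)
nb0 <- glm.nb(y ~ 1)
lrtest(nb0, nb)$`Pr(>Chisq)`[2]

# Gaussian model on transformed counts, F-test against the null.
g <- lm(log(y + 1) ~ group)
anova(lm(log(y + 1) ~ 1), g)$`Pr(>F)`[2]
```

Repeating this over many seeds and counting how often p < 0.05 under the null gives the Type I error rates plotted below.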

[Figure: Type I error rates of the four tests against abundance, by sample size]

For small sample sizes, the LR-tests (green: neg. bin.; red: gaussian) lead to an increased Type I error. The F-tests (blue: gaussian; purple: quasipoisson) seem to work even for small sample sizes.

LR tests give similar (increased) Type I errors for both LM and GLM.

Interestingly, the quasipoisson GLM works pretty well (but again with an F-test).

As expected, the LR-test also performs well as sample size increases (it is asymptotically correct).

For the smallest sample sizes there were some convergence problems for the GLM (not shown), but only at low abundances, so this source of error can be neglected.


  1. Note the data was generated from a neg. bin. model, so I would have expected the neg. bin. GLM to perform best.
    However, in this case a linear model on transformed abundances performs better, and the same holds for the quasipoisson GLM (F-test). I suspect this is because the F-test copes better with small sample sizes: is this correct, and why?

  2. The LR-test does not perform well because of its asymptotic nature. Are there possibilities for improvement?

  3. Are there other tests for GLMs which may perform better? How can I improve testing for GLMs?

  4. What type of models for count data with small sample sizes should be used?


Interestingly, the LR-Test for a binomial GLM does work pretty well:
[Figure: Type I error rates for the binomial setup]

Here I drew data from a binomial distribution, with a setup similar to the one above.

Red: gaussian model (LR-test + arcsine transformation); ochre: binomial GLM (LR-test); green: gaussian model (F-test + arcsine transformation); blue: quasibinomial GLM (F-test); purple: non-parametric.

Here only the gaussian model (LR-test + arcsine transformation) shows an increased Type I error, whereas the binomial GLM (LR-test) does pretty well. So there also seems to be a difference between distributions (or maybe between glm and glm.nb?).

Interaction in generalized linear model

I have 2 continuous variables as my predictors plus the interaction between them, so 3 effects altogether (when I center my predictors, only the interaction is significant). For my dependent variable I am using a binomial logit model with the number of trials fixed at 20.

My problem has to do with further understanding the nature of the interaction. I know that in generalized linear models the interaction is more complex than in the linear model because of the link function, and I have read that interpreting the sign of the parameter estimates is of very limited use. So I am trying to find a way to probe the interaction further, and graphing it would be nice as well. Unfortunately, I only have SPSS at my disposal. Jeff Gill (http://www.artsci.wustl.edu/~jgill/papers/interactions3.pdf) wrote about a method called first differences which seems very similar to the inteff command in Stata. If anybody could tell me how to further understand and test the interaction in a generalized linear model using SPSS or hand calculations, I would be eternally grateful.
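To make the first-differences idea concrete: hold one predictor at a low and then a high value, and in each case compute the change in the predicted probability as the other predictor moves across its range; if those two changes differ, that is the interaction on the response scale. A rough sketch on simulated data (in R notation, though the asker only has SPSS, the arithmetic is the same by hand):

```r
# Simulated grouped-binomial data with 20 trials per case, mirroring the
# question's setup; coefficients are invented.
set.seed(7)
x1 <- rnorm(200)
x2 <- rnorm(200)
p  <- plogis(0.2 + 0.3 * x1 + 0.3 * x2 + 0.5 * x1 * x2)
hits <- rbinom(200, 20, p)
fit  <- glm(cbind(hits, 20 - hits) ~ x1 * x2, family = binomial)

# First difference: change in predicted probability as x1 goes from
# x1_lo to x1_hi, with x2 held at x2_val.
fd <- function(fit, x1_lo, x1_hi, x2_val) {
  p_hi <- predict(fit, data.frame(x1 = x1_hi, x2 = x2_val), type = "response")
  p_lo <- predict(fit, data.frame(x1 = x1_lo, x2 = x2_val), type = "response")
  p_hi - p_lo
}

# Does the effect of x1 differ when x2 is low vs. high?
fd(fit, -1, 1, x2_val = -1)
fd(fit, -1, 1, x2_val =  1)
```

Plotting predicted probabilities against x1 at several fixed values of x2 gives the corresponding graph, and the same predictions can be produced in SPSS and differenced by hand.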

Increased Type I error – GLM

Some of you might have read this nice paper:

O’Hara RB, Kotze DJ (2010) Do not log-transform count data. Methods in Ecology and Evolution 1:118–122.

In my field of research (ecotoxicology) we’re dealing with poorly replicated experiments and GLMs are not widely used. So I did a similar simulation as O’Hara & Kotze (2010), but mimicked ecotoxicological data.

Power simulations:

I simulated data from a factorial design with one control group ($\mu_c$) and 5 treatment groups ($\mu_{1-5}$). Abundance in treatment 1 was identical to the control ($\mu_1 = \mu_c$); abundances in treatments 2–5 were half the abundance in the control ($\mu_{2-5} = 0.5\mu_c$).
For the simulations I varied the sample size (3, 6, 9, 12) and the abundance in the control group (2, 4, 8, …, 1024).
Abundances were drawn from a negative binomial distribution with fixed dispersion parameter ($\theta = 3.91$).
100 datasets were generated and analysed using a negative binomial GLM and a gaussian GLM on log-transformed data.

The results are as expected: the GLM has greater power, especially when not many animals were sampled.
[Figure: power of the neg. bin. GLM vs. the gaussian GLM on transformed data]
Code is here.

Type I Error:

Next I looked at Type I error.
Simulations were done as above, but all groups had the same abundance ($\mu_c = \mu_{1-5}$).

However, the results are not as expected:
[Figure: Type I error of the neg. bin. GLM vs. LM + transformation]
The negative binomial GLM showed a greater Type I error than LM + transformation. As expected, the difference vanished with increasing sample size.
Code is here.


Why is there an increased Type I error compared to lm + transformation?

If we have poor data (small sample size, low abundance, many zeros), should we then use lm + transformation? Small sample sizes (2–4 per treatment) are typical for such experiments and cannot easily be increased.

Although the neg. bin. GLM can be justified as appropriate for these data, lm + transformation may protect us from Type I errors.
