Post hoc test for a GLM using quasi-Poisson on count data in R: glht?


I’m currently working with count data with lots of zeros (the number of invertebrates within 6 types of hedge plots). I fitted a GLM with a quasi-Poisson family, which showed a significant difference between the plot types. I would now like to know which hedge plots differ from which.

Can I use the glht call below to do this?

summary(glht(model, linfct = mcp(Plot = "Tukey")))
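
For context, this is roughly the workflow I have in mind; the hedges data frame below is simulated stand-in data (my real data simply has a Count column and a 6-level factor Plot):

    library(multcomp)  # glht(), mcp()

    ## Sketch with simulated stand-in data: 6 plot types, counts with many zeros.
    set.seed(1)
    hedges <- data.frame(
      Plot  = factor(rep(paste0("hedge", 1:6), each = 10)),
      Count = rnbinom(60, mu = rep(c(1, 1, 2, 4, 4, 8), each = 10), size = 0.8)
    )

    model <- glm(Count ~ Plot, family = quasipoisson(link = "log"), data = hedges)
    summary(model)

    ## All pairwise (Tukey-type) comparisons between plot types
    summary(glht(model, linfct = mcp(Plot = "Tukey")))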

Is there a simple model that assumes a negative relationship between the mean and variance of response?


Poisson and negative binomial models assume that the variance of the response is greater than or equal to its mean. Is there a simple model where that assumption is reversed, i.e. where the variance goes down as the mean goes up?

Finding Bias$(s^2)$ in an incorrect linear model


I am unable to find the bias of the sample-variance estimator in this problem; my attempt, which stalls, is below.

Question: Suppose that the true linear model is

$y = X_1\beta_1 + X_2\beta_2 + e,$

where $e \sim MVN(0, \sigma^2 I)$, $X_1: N \times p_1$, $\beta_1: p_1 \times 1$, $X_2: N \times p_2$, $\beta_2: p_2 \times 1$. But we wrongly fit the model $y = X_1\beta_1 + e$ and calculate the estimators of $\beta_1$ and $\sigma^2$ as:

$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y,$

$s^2 = \dfrac{y'[I - X_1(X_1'X_1)^{-1}X_1']y}{N - p_1}.$

(a) Show that, in general, $\hat{\beta}_1$ is a biased estimator with

Bias$(\hat{\beta}_1) = (X_1'X_1)^{-1}X_1'X_2\beta_2$.

Under what condition on $X_1$ and $X_2$ is $\hat{\beta}_1$ unbiased?

(b) Show that, in general, $s^2$ is a biased estimator with

Bias$(s^2) = \dfrac{1}{N-p_1}\beta_2'X_2'[I - X_1(X_1'X_1)^{-1}X_1']X_2\beta_2$.

Is the bias positive or negative? Under what condition on $\beta_2$ is $s^2$ unbiased?

Attempt:

For part (a):

Bias$(\hat{\beta}_1) = E\left((X_1'X_1)^{-1}X_1'y - \beta_1\right)$

$\hspace{15mm} = (X_1'X_1)^{-1}X_1'E(y) - \beta_1$

$\hspace{15mm} = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2) - \beta_1$

$\hspace{15mm} = (X_1'X_1)^{-1}X_1'X_1\beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 - \beta_1$

$\hspace{15mm} = I\beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 - \beta_1$

$\hspace{15mm} = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 - \beta_1$

$\hspace{15mm} = (X_1'X_1)^{-1}X_1'X_2\beta_2$

If the columns of $X_1$ and $X_2$ are orthogonal (i.e. $X_1'X_2 = 0$), then $\hat{\beta}_1$ is unbiased.

For part (b), I am unable to figure out how to derive:

Bias$(s^2) = \dfrac{1}{N-p_1}\beta_2'X_2'[I - X_1(X_1'X_1)^{-1}X_1']X_2\beta_2$

I can get

$\dfrac{1}{N-p_1}(\beta_1'X_1' + \beta_2'X_2')[I - X_1(X_1'X_1)^{-1}X_1'](X_1\beta_1 + X_2\beta_2)$

after taking the expectation of $y$ on both sides.

Any help would be much appreciated.
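
In case it is useful, a small R simulation with an arbitrary made-up design can be used to check the claimed bias formula numerically (all matrices and coefficients below are purely illustrative):

    ## Numerical check of the claimed bias of s^2 under the misspecified model.
    set.seed(1)
    N <- 50; p1 <- 3; p2 <- 2; sigma <- 2
    X1 <- cbind(1, matrix(rnorm(N * (p1 - 1)), N))
    X2 <- matrix(rnorm(N * p2), N)
    beta1 <- c(1, -0.5, 2); beta2 <- c(0.8, -1.2)

    M <- diag(N) - X1 %*% solve(crossprod(X1)) %*% t(X1)   # I - X1(X1'X1)^{-1}X1'

    ## Monte Carlo estimate of E(s^2)
    s2 <- replicate(20000, {
      y <- X1 %*% beta1 + X2 %*% beta2 + rnorm(N, sd = sigma)
      sum((M %*% y)^2) / (N - p1)
    })
    mean(s2) - sigma^2                                          # simulated bias
    drop(t(beta2) %*% t(X2) %*% M %*% X2 %*% beta2) / (N - p1)  # formula in (b)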

Low sample size: LR test vs. F-test


Some of you might have read this nice paper:

O’Hara RB, Kotze DJ (2010) Do not log-transform count data. Methods in Ecology and Evolution 1:118–122.

Currently I am comparing negative binomial models with Gaussian models fitted to transformed data. Unlike O’Hara & Kotze (2010), I am looking at the special case of low sample sizes, in a hypothesis-testing context.

I used simulations to investigate the differences between the two approaches.

Type I Error simulations

All computations were done in R.

I simulated data from a factorial design with one control group ($\mu_c$) and 5 treatment groups ($\mu_{1-5}$). Abundances were drawn from a negative binomial distribution with fixed dispersion parameter ($\theta = 3.91$). Abundances were equal in all treatments.

For the simulations I varied the sample size (3, 6, 9, 12) and the abundances (2, 4, 8, …, 1024). 100 datasets were generated and analysed using a negative binomial GLM (MASS::glm.nb()), a quasi-Poisson GLM (glm(..., family = 'quasipoisson')) and a Gaussian GLM on log-transformed data (lm(...)).

I compared each model with its null model using a likelihood-ratio test (lmtest::lrtest()) for the Gaussian model and the negative binomial GLM, and using F-tests (anova(..., test = 'F')) for the Gaussian model and the quasi-Poisson GLM.

If needed I can provide the R code, but see also here for a related question of mine.
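
In the meantime, here is a stripped-down sketch of one simulation iteration (function and variable names are mine and only approximate the actual code):

    library(MASS)    # glm.nb()
    library(lmtest)  # lrtest()

    ## One Type-I-error iteration: all six groups share the same mean.
    ## n, mu and theta are placeholders for the values varied above;
    ## glm.nb() may warn or fail to converge for very small n / low mu.
    simulate_once <- function(n = 3, mu = 8, theta = 3.91) {
      dat <- data.frame(
        group = factor(rep(c("C", paste0("T", 1:5)), each = n)),
        count = rnbinom(6 * n, mu = mu, size = theta)
      )
      nb   <- glm.nb(count ~ group, data = dat)
      qp   <- glm(count ~ group, family = quasipoisson, data = dat)
      gaus <- lm(log(count + 1) ~ group, data = dat)

      c(nb_lr = lrtest(nb,   update(nb,   . ~ 1))$`Pr(>Chisq)`[2],
        lm_lr = lrtest(gaus, update(gaus, . ~ 1))$`Pr(>Chisq)`[2],
        qp_F  = anova(qp, test = "F")$`Pr(>F)`[2],
        lm_F  = anova(gaus)$`Pr(>F)`[1])
    }

    ## Proportion of rejections at alpha = 0.05 over 100 datasets
    p <- replicate(100, simulate_once())
    rowMeans(p < 0.05)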

Results
[Figure: simulated Type I error rates by test and sample size]

For small sample sizes, the LR tests (green: neg. bin.; red: Gaussian) lead to an inflated Type I error. The F-tests (blue: Gaussian; purple: quasi-Poisson) seem to work well even for small sample sizes.

LR tests give similar (increased) Type I errors for both LM and GLM.

Interestingly, the quasi-Poisson GLM works pretty well (but it, too, is tested with an F-test).

As expected, the LR test also performs well as the sample size increases (it is asymptotically correct).

For the small sample sizes there were some convergence problems (not shown) with the GLM, but only at low abundances, so this source of error can be neglected.

Questions

  1. The data were generated from a neg. bin. model, so I would have expected the negative binomial GLM to perform best.
    However, in this case a linear model on transformed abundances performs better, and the same holds for the quasi-Poisson GLM (F-test). I suspect this is because the F-test copes better with small sample sizes. Is this correct, and why?

  2. The LR test does not perform well because the sample sizes are too small for its asymptotics. Are there possibilities for improvement?

  3. Are there other tests for GLMs which may perform better? How can I improve testing for GLMs?

  4. What type of models for count data with small sample sizes should be used?

Edit:

Interestingly, the LR-Test for a binomial GLM does work pretty well:
[Figure: simulated Type I error rates for the binomial setting]

Here I drew data from a binomial distribution, with a setup similar to the one above.

Red: Gaussian model (LR test + arcsine transformation); Ocher: binomial GLM (LR test); Green: Gaussian model (F-test + arcsine transformation); Blue: quasibinomial GLM (F-test); Purple: non-parametric test.

Here only the Gaussian model (LR test + arcsine transformation) shows an inflated Type I error, whereas the binomial GLM (LR test) does pretty well. So there also seems to be a difference between distributions (or maybe between glm() and glm.nb()?).

Interaction in a generalized linear model


I have two continuous variables as my predictors, plus the interaction between them, so three effects altogether (when I center my predictors, only the interaction is significant). My dependent variable is binomial: I am using a logit model in which the number of trials is fixed at 20. My problem has to do with further understanding the nature of the interaction. I know that in generalized linear models the interaction is more complex than in a linear model because of the link function. I have read that interpreting the sign of the parameter estimates is of limited use, so I am trying to find a way to further understand the interaction, and graphing it would be nice as well.

Unfortunately, I only have SPSS at my disposal. Jeff Gill (http://www.artsci.wustl.edu/~jgill/papers/interactions3.pdf) wrote about a method called first differences, which seems very similar to the inteff command in Stata. If anybody could tell me how to further understand and test the interaction in a generalized linear model using SPSS or hand calculations, I would be eternally grateful.
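
As far as I understand Gill's first-differences idea, the hand calculation (which I could do in SPSS or a spreadsheet) would be something like this: with estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ from the model, the predicted proportion at $(x_1, x_2)$ is $\hat\pi(x_1, x_2) = 1/\big(1 + e^{-(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_1 x_2)}\big)$, and a first difference is $\Delta(x_2) = \hat\pi(x_1^{high}, x_2) - \hat\pi(x_1^{low}, x_2)$, computed and plotted over a range of $x_2$ values (for example, the mean of $x_2$ plus or minus one and two standard deviations). Is that roughly what the method amounts to, and is it a defensible way to probe the interaction?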

Increased Type I error – GLM


Some of you might have read this nice paper:

O’Hara RB, Kotze DJ (2010) Do not log-transform count data. Methods in Ecology and Evolution 1:118–122.

In my field of research (ecotoxicology) we deal with poorly replicated experiments, and GLMs are not widely used. So I ran a simulation similar to that of O’Hara & Kotze (2010), but mimicking ecotoxicological data.

Power simulations:

I simulated data from a factorial design with one control group ($\mu_c$) and 5 treatment groups ($\mu_{1-5}$). Abundance in treatment 1 was identical to the control ($\mu_1 = \mu_c$); abundances in treatments 2-5 were half the abundance in the control ($\mu_{2-5} = 0.5\mu_c$).
For the simulations I varied the sample size (3, 6, 9, 12) and the abundance in the control group (2, 4, 8, …, 1024).
Abundances were drawn from a negative binomial distribution with fixed dispersion parameter ($\theta = 3.91$).
100 datasets were generated and analysed using a negative binomial GLM and a Gaussian GLM on log-transformed data.
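
A stripped-down sketch of how one such dataset is generated and analysed (n and mu_c below are placeholders for the values varied in the simulations):

    library(MASS)  # glm.nb()

    ## One power-simulation dataset: control mean mu_c, treatments 2-5 at half
    ## that abundance.
    set.seed(1)
    n <- 3; mu_c <- 8; theta <- 3.91
    means <- c(C = mu_c, T1 = mu_c, T2 = 0.5 * mu_c, T3 = 0.5 * mu_c,
               T4 = 0.5 * mu_c, T5 = 0.5 * mu_c)
    dat <- data.frame(
      group = factor(rep(names(means), each = n), levels = names(means)),
      count = rnbinom(6 * n, mu = rep(means, each = n), size = theta)
    )

    ## Negative binomial GLM vs. Gaussian model on log-transformed counts
    summary(glm.nb(count ~ group, data = dat))
    summary(lm(log(count + 1) ~ group, data = dat))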

The results are as expected: the GLM has greater power, especially when not many animals were sampled.
[Figure: simulated power of the negative binomial GLM vs. the Gaussian model on log-transformed data]
Code is here.

Type I Error:

Next I looked at the Type I error.
Simulations were done as above, but all groups had the same abundance ($\mu_c = \mu_{1-5}$).

However, the results are not as expected:
[Figure: simulated Type I error rates of the negative binomial GLM vs. LM + transformation]
The negative binomial GLM showed a greater Type I error than LM + transformation. As expected, the difference vanished with increasing sample size.
Code is here.

Question:

Why is there an increased Type-I Error compared to lm+transformation?

If we have poor data (small sample size, low abundance (many zeros)), should we then use lm+transformation? Small sample sizes (2-4 per treatment) are typical for such experiments and cannot be increased easily.

Although the neg. bin. GLM can be justified as appropriate for these data, lm + transformation may better protect us from Type I errors.

How to apply transformations to the predictors of a GLM?


This post discusses why we need to transform $Y$ before estimating the predictors' exponents in order to reduce the problem to a linear fit. The example assumes $Y$ is log-normal. In the case of a GLM, we can run a Box-Cox likelihood estimate of the predictors against $\log(Y)$ and then use a logit function in the GLM.

How can we estimate the predictor transformation when $Y$ is not log-normal and we use a GLM? For example, in the case of a Poisson $Y$ with normal $X$, how do we linearize $Y \sim X$?
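
To make the Poisson case concrete, one crude possibility I can think of is to profile the deviance of a log-link Poisson GLM over a grid of candidate powers of $X$; the data and grid below are purely illustrative, and I am not sure this is the right way to do it:

    ## Crude sketch: profile the deviance of a log-link Poisson GLM over a grid
    ## of candidate powers of x (simulated data, arbitrary grid).
    set.seed(1)
    x <- rgamma(200, shape = 2, rate = 0.5)
    y <- rpois(200, lambda = exp(0.2 + 1.5 * sqrt(x)))   # true exponent is 0.5

    lambdas <- seq(0.1, 2, by = 0.1)
    dev <- sapply(lambdas, function(l) deviance(glm(y ~ I(x^l), family = poisson)))
    lambdas[which.min(dev)]   # exponent giving the lowest deviance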

How to structure data for SPSS (percentages for log-linear analysis)?


I am a bit desperate because I am writing my MSc thesis and I am not sure how to organize my data for use in SPSS.

My research examines the performance of mobile banner campaigns and how the three independent variables, Source (in-app or mobile-web promotion), Style (animated or static banner) and Size (full-screen or small banner), affect the dependent variables: Click-Through Rate ((total clicks / total impressions) * 100) and Conversion Rate ((total installs / total clicks) * 100).

Initially I was considering ANOVA tests like the ones below:

Click-Through Rate = β0 + β1*Source + β2*Style + β3*Size + β4*(Source*Style) +
                     β5*(Source*Size) + β6*(Style*Size) + β7*(Source*Style*Size)  + error 

Conversion Rate = β0 + β1*Source + β2*Style + β3*Size + β4*(Source*Style) + 
                  β5*(Source*Size) + β6*(Style*Size) + β7*(Source*Style*Size) + error

But my supervisor advises that I cannot use percentages within ANOVA and should use a log-linear analysis instead.

Can you please advise me whether I can somehow still use ANOVA for my tests, or whether I need to use a log-linear analysis, and how to structure my data for SPSS? My supervisor says I need to organize the data into counts but doesn't give me more details, and I am a bit confused: what does that mean?
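
For what it is worth, my current guess at a "counts" layout is one row per combination of the three factors, with the raw totals rather than the percentages (the numbers below are made up):

    Source      Style     Size    Impressions  Clicks  Installs
    in-app      animated  full          10000     250        40
    in-app      static    small         10000     110        12
    mobile web  animated  small          8000      95        10

Would that be the kind of structure a log-linear analysis in SPSS expects?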

Standard errors for the CV error curve using the boot package


Does anybody know how to obtain standard errors for the CV error curve using the boot package? I understand the boot package can compute the K-fold CV error for a fitted model, but I'd like to know whether the same package gives the corresponding standard error.
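
For reference, this is the kind of call I mean (mtcars and the model formula are just stand-ins for my data and model):

    library(boot)  # cv.glm()

    fit <- glm(mpg ~ wt + hp, data = mtcars)
    cv  <- cv.glm(mtcars, fit, K = 10)
    cv$delta  # raw and adjusted K-fold CV error estimates; no standard error is returned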

Thanks!

code for ordered probit model


I have a data set with 7 predictor variables and one dependent variable. The dependent variable has 4 categories, so it is not binary. I need to fit an ordered probit model and am looking for code in R, SPSS or Matlab.
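
To make the request concrete, here is the kind of R sketch I am after (the simulated data below merely stand in for a data set with 7 predictors and a 4-category ordered response); I assume MASS::polr() with a probit link is the relevant function, but corrections are welcome:

    library(MASS)  # polr()

    set.seed(1)
    n <- 200
    dat <- as.data.frame(matrix(rnorm(n * 7), nrow = n, ncol = 7,
                                dimnames = list(NULL, paste0("x", 1:7))))
    latent <- drop(as.matrix(dat) %*% rep(0.5, 7)) + rnorm(n)
    dat$y <- cut(latent, breaks = c(-Inf, -1, 0, 1, Inf),
                 labels = paste0("cat", 1:4), ordered_result = TRUE)

    ## Ordered probit: ordered 4-category response on all 7 predictors
    fit <- polr(y ~ ., data = dat, method = "probit", Hess = TRUE)
    summary(fit)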
