Logistic regression on binary response or ANOVA on proportions/percentages

I am analyzing survival of seedlings in Styrofoam blocks (known as styroblocks). These blocks have a certain size and contain cavities in which the seedlings are planted. All cavities within a block have the same dimensions (diameter and depth). Different blocks come with different cavity dimensions. I am analyzing whether there is an influence of cavity size on survival. I am using 5 different styroblocks with 5 different cavity volumes and plant seedlings in all cavities. For the analysis, however, I only consider 15 seedlings from the center of each block. Furthermore, I plant 3 different plant varieties and repeat each variety X styroblock combination 3 times. Lastly, this experiment was replicated at 2 different nurseries to test for nursery effects.

Here’s an example dataset:

xx <- data.frame(Nursery = rep(c("Nursery A", "Nursery B"), each = 675),
Styroblock = rep(c("Block A", "Block B", "Block C", "Block D", "Block E"), 6, each = 45),
Variety = rep(c("Variety A", "Variety B", "Variety C"), 2, each  = 225),
Replicate = rep(c(1,2,3), 30, each = 15), Survival = sample(c(1,0),1350, replace=T, prob=c(.85,.15)))

Here’s my first option:

fit = glm(Survival ~ Nursery * Variety * Styroblock, data =xx, family=binomial(link="logit"))
summary(fit) ## at this point some model simplification should be performed
Anova(fit, type="II", test="Wald")

HOWEVER, since cavities within styroblocks are spatially dependent, for my other measurements such as height (not shown here), I took an average across all 15 seedlings to get one measurement per styroblock (which avoids potential pseudo-replication). Given this, I summarized dead/alive (binary) by styroblock and clone, resulting in a proportion (or percentage) instead. This also reduces my sample size from 1350 to 90. See here:

xx.sum<-ddply(xx, .(Nursery, Styroblock, Variety, Replicate), summarise, Survival = sum(Survival)/15)

Question 1: Is this the correct way to do?

If yes, I am not sure if I can follow up with ANOVA on the proportions/percentages due to unequal variances:

ggplot(xx.sum, aes(x=Styroblock, y=Survival))+geom_boxplot()+facet_grid(Nursery~Variety)

Once option would be to transform, which I tried using arcsine and square root transformation but this does not help.

Question 2: How would you suggest proceeding in this case, i.e. which test is most appropriate to understand whether my predictors or predictor combinations influence survival?


Generalized gamma as a member of the exponential family

I want to show if this generalized gamma (GG) distribution is a member of the exponential family. I don’t know how to start since the exponential family has only 2 paramters and the GG has 3 parameters to be estimated. Below are definitions of the GG distr. and exponential family distr. for your perusal.

The density of the generalized gamma distribution with $y>0$ , shape parameter $kappa$, scale parameter $omega>0$ and location parameter $eta>0$.
$lambda=|kappa|^{-2}$ , $z=sign(kappa)frac{ln(y)-eta}{omega}$, and $u=lambda,exp(|kappa|,z)$.


A random variable $Y$ is a member of the exponential family if its probability density function can be written in the form
f(y;theta,phi)=expBigg[frac{y,theta-b(theta)}{a(phi)} + c(y,theta)Bigg]

Thank you all so much

interpreting poisson regression models with log transformation and factors/qualitative variable

I have a dataset with both quantitative ($x_1,x_2, text{and} x_3$) and qualitative variables ($x_4$ – 4 levels ~0,1,2,3). 3 variables ($x_1,x_2,x_3$) have been log transformed. I do not know how to interpret coefficients when its log transformed.

glm(formula = y ~ log(1 + x1) + log(1 + x2) +                   
      log(1 + x3) + factor(x4), family = "quasipoisson",data = data)

(Intercept)     0.20
log(1 + x1)     0.76
log(1 + x2)     -0.1
log(1 + x3)     0.25
factor(x4)1     0.4
factor(x4)2     0.45
factor(x4)3     0.57

Let’s suppose, if I want $x_4$ (for levels 0,1,2,3) to vary $x_1$ from 0,1,2,…,40 how would it effect my response considering everything being equal ? In addition, how to interpret $x_1,x_2, text{ and } x_3$ ?.

Numerical Example, I want to vary $x_3$ between 0,1,2,3,4,5,… and so on and determine its impact on y for 4 different levels in variable $x_4$:

Let’s suppose I want to predict for factor 0 which is when $x_4$ at 0 when $x_3 = 5$:

$$y = exp^{(0.20+0.25*5)}$$

Let’s suppose I want to predict for factor 2 which is when $x_4$ at level 1 when x3 = 5:

$$y = exp^{(0.25*5+0.45)}$$

is my interpretation correct ?

Can I use logistic regression if the distribution of proportions is skewed & lies in the middle of the [0,1] interval?

I am conducting a logistic regression in order to predict the service point win percentage of a tennis player.

In terms of data – I have (for each player A) approx 300 matches – for each match I have the total number of player A service points (points where he is the server), total number of player A service point wins and total number of player A service point losses.

To do so, I have service point win percentage as the DV, and my independent variables are:

+Average service win percentage of last 3 matches
+ln(player’s ranking points)
+ln(opposition’s ranking points)
+surface the match was played on

My dependent variable data, service win percentage, lies usually in the range of 0.4-0.8, there are pretty much no values greater that 0.8 (about 2.8% of values and this drops to < 1% at around 0.84) and there exists no values less than 0.22. In addition my data is much more concentrated above 0.5 than it is below 0.5.

Thus, I worry that since my data doesn’t have points close to zero or 1, and is not symmetrical around 0.5 (like the sigmoidal curve of logistic regression) that I am wasting my time with this model type. The results it is giving for my preliminary model outlined above are, although not shocking, pretty volatile.

I am conducting this in R and using the weights command to allow me use a proportion in the DV, giving the total number of trials as the weights. I use ln(points) because ranking points are exponential in nature.

The goal is to predict / forecast the service point win percentage of the player based on the IV’s. Considering my data distribution, and my goal, does logistic regression make sense? If not is there any other type of model that makes more sense?

Poisson distribution vs Three-way table to Compare multiple incidences

I need some help understanding the difference between comparing incidences of rare events with a Poisson distribution and using a three-way contingency table. I’m sure I have some fundamental misunderstanding here.

I counted the number of cases of food poisoning in each cafeteria on campus by the class of the student. In this chart below I am showing the data as incidence: the number of food poisoning cases per 100 student years. Each student always stays at the same cafeteria.

enter image description here

I have also displayed this data as a three-way contingency table (truncated here to save space): enter image description here

I want to determine if the distribution of food poisoning is different across different classes in each Cafeteria. What is the best way to do this??

From what I understand, I do not want to do a Chi-Square test on the incidence rates because 1) chi-square tests are for event counts, 2) some cells have less than 5 and a few have 0. I know I can’t just do an analysis on the counts of poisoning in each group and ignore the denominator since each class has a different number of people. I am aware of Fisher’s exact test, but I do not know if it is appropriate for incidences or cells with 0.

I have narrowed this down to using a GLM with a Poisson distribution on the incidence rates and log-linear analysis of the multi-way frequency table, but I am completely missing the significant difference between these two approaches.

Also, I know I might be completely wrong in my thinking here. Any help or guidance is much appreciated!!


How to get standard deviation from estimated marginal means (lsmeans)?

I am trying to calculate adjusted means and standard deviation across groups of people according to age.
Using SPSS, I use general linear model -> univariate, then calculate the means adjusting for other covariates. Usually I get estimated marginal means with standard error.

However, I need the standard deviation. I couldn’t find a way to do it using SPSS, but I searched that with R it’s possible, using lsmeans function.

Is that correct?
If so, how to gain the standard deviation?

Thank you

glht: difference between Tukey and manual matrix

I am testing for contrasts in contrasts in my dataset. Going in, I thought Tukey showed all the contrasts between the coefficients, but it seems that manually putting in a matrix yields different results. In addition, the Tukey findings seem to just take the regular coefficents from the glm (except for the last one), and the manual matrix is getting confused with adding and subtracting negative numbers… Can anyone explain what is happening here?

Here’s my glm model, glht contrast using Tukey and a manual matrix below.


        Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -3.0910     1.0225  -3.023   0.0025 **
swagtypeC    -0.4785     1.2488  -0.383   0.7016   
swagtypeD     0.8087     1.1251   0.719   0.4723  

glht tukey (note that “B” is the intercept):

summary(glht(gc.swag.odds, linfct = mcp(swagtype = "Tukey")))
Linear Hypotheses:
       Estimate Std. Error z value Pr(>|z|)
C - B == 0  -0.4785     1.2488  -0.383    0.921
D - B == 0   0.8087     1.1251   0.719    0.748
D - C == 0   1.2872     0.8570   1.502    0.284
(Adjusted p values reported -- single-step method) 

glht using my self- created matrix:

 K <- matrix(c(-1, 1, 0,
         -1, 0, 1,
         0, -1, 1),3)
 rownames(K) <- c("C - B", "D - B", "D - C")
 compare.levels <- glht(gc.swag.odds, linfct = K)

 Linear Hypotheses:
       Estimate Std. Error z value Pr(>|z|)    
  C - B == 0   3.5695     0.7170   4.978   <0.001 ***
  D - B == 0  -4.0820     2.0902  -1.953    0.110    
  D - C == 0   0.5125     2.2097   0.232    0.968  

GLM high standard errors, but variables are definitely not collinear

When I use a GLM using R, my standard errors are ridiculously high. It can’t be because the independent variables are related because they are all distinct ratings for an individual (i.e., interaction variables are out of the picture). Any idea on what is causing this?

Below is the contingency table and glm summary:

has.gc.swag  A  B  C  D
      FALSE  1 22 71 49
      TRUE   0  1  2  5

summary(glm(has.gc.swag~swagtype, family=binomial, data=tt.dataset))
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -15.57    1455.40  -0.011    0.991
swagtypeB      12.48    1455.40   0.009    0.993
swagtypeC      12.00    1455.40   0.008    0.993
swagtypeD      13.28    1455.40   0.009    0.993   

Note: I use swagtype instead of the real name since the info I am dealing with is confidential.

OLS vs. Poisson GLM with identity link

My question reveals my poor understanding of Poisson regression and GLMs in general. Here’s some fake data to illustrate my question:

### some fake data
y=c(0,  1,  2,  3,  1,  4,  9, 18, 23, 31, 20, 25, 37, 45)

Some custom functions to return psuedo-R2:

### functions of pseudo-R2

psuR2 <- function(null.dev, model.dev) { 1 - (model.dev / null.dev)}

predR2 <- function(actuals, predicted) { 1 - (sum((actuals - predicted)^2)) / sum((actuals - mean(actuals))^2)}

Fit four models: OLS, Gaussian GLM with identity link, Poisson GLM with log link, Poisson GLM with identity link

pred.ols = predict(mdl.ols)

predR2(y, pred.ols)

#### GLM MODEL, family=gaussian(link="identity")
mdl.guass <- glm(y~x, family=gaussian(link="identity"), maxit=500)
pred.guass = predict(mdl.guass)

psuR2(mdl.guass$null.deviance, mdl.guass$deviance)
predR2(y, pred.guass)

#### GLM MODEL, family=possion (canonical link)
mdl.poi_log <- glm(y~x, family=poisson(link="log"), maxit=500)
pred.poi_log= exp(predict(mdl.poi_log))  #transform

psuR2(mdl.poi_log$null.deviance, mdl.poi_log$deviance)
predR2(y, pred.poi_log)

#### GLM MODEL, family=poisson((link="identity")
mdl.poi_id <- glm(y~x, family=poisson(link="identity"), start=c(0.5,0.5), maxit=500)
pred.poi_id = predict(mdl.poi_id)

psuR2(mdl.poi_id$null.deviance, mdl.poi_id$deviance)
predR2(y, pred.poi_id)

Finally plot the predictions:

#### Plot the Fit
plot(x, y) 
lines(x, pred.ols)
lines(x, pred.guass, col="green")
lines(x,pred.poi_log, col="red")
lines(x,pred.poi_id, col="blue")

I have 2 questions:

  1. It appears that the coefficients and predictions coming out of OLS and Gaussian GLM with identity link are exactly the same. Is this always true?

  2. I’m very surprised that the OLS estimates and predictions are very different from the Poisson GLM with identity link. I thought both methods would try to estimate E(Y|X). What does the likelihood function look like when I use the identity link for Poisson?

How to fit a glm with sum to zero constraints in R (no reference level)

Question has been rewritten

I am trying to fit a glm to find out how the rate of events happens (counts/exposure) related to some covariates, with Poisson error.
Counts is the number of events happened and the covariates include age, sex, smoking status, policy class and exposure, where policy class has 6 levels: class 0, class 1, …, class 5

I was wondering how to:

  1. keeping the intercept (if possible? if not please would you explain me why not). [Answered by Affine]

  2. making the coefficients of polclass0, …, polclass5 sum to 0, I know somehow I need to use contr.sum, but it doesn’t work (see below) [An example from other site, but I couldn’t replicated it.

  3. I used the +0 in the formula to make the polclass0 “take over” the intercept. However, is it possible to have other categorial variable (e.g. sex, or some categorial variable that takes 3+ values) to have two all coefficients display?


    sex <- as.factor(sample(c(0,1), 50, replace=T, prob=c(0.5, 0.5)))
    smoker <- as.factor(sample(c(0,1), 50, replace=T, prob=c(0.3, 0.7)))
    polclass <- as.factor(sample(0:5, 50, replace=T))
    age <- sample(16:80, 50, replace=T)
    count <- rpois(50, 2.5)
    exposure <- rgamma(50, 100)

    glm1 <- glm(count ~ polclass + sex + smoker + age + offset(log(exposure)) + 0 ,
                family=poisson(link='log'), contrasts = list(polclass = contr.sum))


        glm(formula = count ~ polclass + sex + smoker + age + offset(log(exposure)) + 
            0, family = poisson(link = "log"), contrasts = list(polclass = contr.sum))

        Deviance Residuals: 
            Min       1Q   Median       3Q      Max  
        -2.3604  -0.5625  -0.1263   0.5246   1.5584  

                   Estimate Std. Error z value Pr(>|z|)    
        polclass0 -3.719037   0.423105  -8.790   <2e-16 ***
        polclass1 -3.347628   0.328248 -10.198   <2e-16 ***
        polclass2 -3.638992   0.368175  -9.884   <2e-16 ***
        polclass3 -3.833882   0.426819  -8.982   <2e-16 ***
        polclass4 -3.780837   0.399512  -9.464   <2e-16 ***
        polclass5 -3.838007   0.373120 -10.286   <2e-16 ***
        sex1      -0.057738   0.091936  -0.628    0.530    
        smoker1   -0.174328   0.118654  -1.469    0.142    
        age       -0.001846   0.006362  -0.290    0.772    
        Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        (Dispersion parameter for poisson family taken to be 1)

            Null deviance: 8764.667  on 50  degrees of freedom
        Residual deviance:   40.865  on 41  degrees of freedom
        AIC: 189

        Number of Fisher Scoring iterations: 5

as you can see, the contr.sum doesn’t work and I would like to know why and how to do that exactly?

Thanks in advance!

Question and Answer is proudly powered by WordPress.
Theme "The Fundamentals of Graphic Design" by Arjuna
Icons by FamFamFam