Modeling continuous abundance data with a GLM in R: how to select the correct distribution family?

I have abundance data (counts) that I have standardized by area sampled, making them continuous. I would like to explain them with my two independent variables using a GLM but I am having trouble specifying a model distribution. The data are derived from raw counts of salamanders at 40 sites standardized by area of pond sampled. The raw counts are as follows:

0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  2  2  2  2  3  3  3  4  5  8 10 11 12 21

And after standardizing the counts by area sampled:

0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.1754590  0.4491828  0.8517423  1.4341965  2.2777698  4.0467065  0.3889454  0.5935273  1.9376223  2.0924642  0.5110034  0.5418544  6.9358962  1.5491324  2.2689315 14.2278592  1.4483645  0.3947695  6.2244910  8.7240609

It seems that poisson and negative binomial are now out of the question because my data are not integers. My data has zeros so I don’t think I can use gamma distribution without transforming it.

I used the fitdist function in R (package fitdistrplus) to generate parameters for continuous distributions (exponential and normal). I then randomly sampled true exponential and normal distributions with the generated parameters [rexp(n,rate) and rnorm(n,mean,sd)], respectively. Using a two-sided KS test [ks.test(x,y)], I compared my data with the generated data and the distributions were significantly different (p<<<0.05), ruling out my data being normal or exponential. I transformed the data with an ln+1 transformation (to keep zero values) and they still deviate significantly from the hypothesized distributions.

Because my data has no negative values, contains zeros, and is not right and/or left bound, I’m not sure what distribution to specify for the glm.

My questions are:
1) How can I determine the distribution of my response variable, and if I need to transform, can I use a transformation that turns my zeros into non-zeros or even negative values?
2) If the model family depends on the distribution of errors, how can I know their distribution without first removing the potentially explainable variation? It seems like I would need to explain some variation with my predictor variables, and then model the distribution of the remaining error. Is there a better approach to this?

Any suggestions are helpful, specifically those for R or excel. My knowledge of statistical theory is limited.


Fitting a glm to a zero inflated positive continuous response

I’m trying to fit R glm’s to data sets where the response is zero inflated positive continuous. This is an example data set

df <- structure(list(y=c(0,0,0,0,0,0,0,0,4.09417142857143e-05,0,0,0,0,2.18718787234043e-05,0,0,0,0,0,0,0,0,0,0,0,0,0,4.60192985143918e-05,0,1.16500081901182e-05,0,1.13497040795196e-05,0,2.01959645093946e-05,0,0,2.24704801120448e-05,3.50195529889299e-05,2.26178388643899e-06,0,0,0,0,0,0,1.24748435005701e-05,0,0,0,1.14008923269261e-06,0,0,0,0,0,0,0,1.95445414576182e-05,0,0,0,0,0,0,0,0,1.72293987591653e-05,1.11122316638726e-05,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),behavior = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,4),time = c(272,148,132,854,234,340,90,282,840,230,118,620,130,1410,1272,684,6624,1220,250,4556,4500,2254,662,334,336,1364,346,4308,372,7448,1242,9658,926,6706,494,1996,3570,13550,62134,6876,3108,4722,794,2394,21502,21048,5442,2748,920,25596,1954,7896,996,12108,2046,23034,3216,7574,9882,5560,5610,4524,6846,3450,11630,3528,3546,9544,4820,4852,13508,3726,3744,1252,1254,2514,3786,2534,14080,3882,6510,10520,72928,10052)),.Names = c("y","behavior","time"),row.names = c(NA, 84),class = "data.frame")

set behavior to be a factor

df$behavior = as.factor(df$behavior)

The response is df$y, df$behavior and df$time are the fixed effects (categorical and continuous, respectively).

A glimpse at the distribution of the response:
enter image description here

I tried using the R tweedie package with glm2 (and base glm) with this code: <- tweedie.profile(y ~ behavior + time, do.plot=TRUE, data = df) <- glm2(y ~ behavior + time, family=tweedie(var.power=out$xi.max, link.power=1-out$xi.max), data = df)

which either does not converge (in case of other data sets) or shows poor qq-plot fit: enter image description here

Is there any hope to reliably fit a glm to such data? If so what am I doing wrong?

I also tried using the R gamlss package with a log normal family where I add .Machine$double.eps to df$y:

df$y <- df$y + .Machine$double.eps <- gamlss(y ~ behavior + time, data = df, family = LOGNO())

but it doesn’t seem to do much better:
enter image description here

Help would be greatly appreciated.

Extracting random variable coefficients from model averaged objects

I’d like to extract the coefficients of the random effects from a model averaged object of class averaging created in the package MuMIn.

I can’t seem to find a function comparable to the ranef function used to extract the random variable coefficients in the coxph model classes.

Is it possible to extract the random variable coefficients from a model of class averaging?

Any help would be greatly appreciated, Cheers.

vglm: Error in vglm.fitter, due to matrix dimension?

I’ve looked to other Error in vglm.fitter-related posts but they don’t seem (or I cannot) related to this one.

This the error I get when running vglm for multinomial regression (classification):

R: fit.tracks.vglm = vglm(formula = class ~ ., data =[c(-1)], family = multinomial())
Error in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2,  : 
  17115 parameters but only 14697 observations

My dimension is 4899 x 5706 including 5704 “X..” variables, see below:

  col1      class X2 X6 X7 X17 X18 X25 X33 X37
1 220351 class1  0  0  0   0   1   0   0   0
2  30981 class1  0  0  0   0   0   0   0   0
3 154632 class1  0  0  0   0   0   0   0   0

The matrix is very sparse (0.3% of non zero elements, but each variable has at least 10 non zero elements). The class column has 4 categories, thus I understand that 5705*3 = 17115 and 4899*3 = 14697.

Is it simply because there are too much variables ? is there a threshold ?, ratio rule of the thumb e.g. nrows/nvars, to follow ? …

Several hours later after post: I’ve run a successful test with 2000 variables … seems to be a max limit on variables number issue …?

Thanks in advance

Is my nested random-effect model non-hierarchical?

I have a problem with model structure because of the way factors are nested in a potentially non-hierarchical way. I’m not sure if I fully understand the issue but I can’t find a way to specify the model which Minitab or R will process without error.

The design is fairly simple. I have one response variable Reject which is measured under two levels of a within-subjects factor Equal and under two levels of a between-subject factor Cost. I also want to include Gender as a factor. I account for the within subject design by including Individual as a random factor.

I think the problem is that Individual is nested under both Gender (because each individual has only one gender) and Cost (because that is a between-subject factor), but neither of Cost or Gender is nested in the other. So that perhaps makes any valid model non-hierarchical (that is in any case Minitab’s complaint).

Has anyone got any tips for how to construct a valid model testing the significance of Equal, Cost, and Gender, implementable in R or Minitab?

confidence interval on expected number of cases in a set given a model

I have a set of units, where unit $i$ has:

  • a binary outcome $y_i$ (“case” or “non-case”)
  • a binary vector of treatments $mathbf{t_i}$, where $t_{i,j}=1$ if unit $i$ received treatment $j$.
  • several hundred confounding covariates $mathbf{x_i}$, some real-valued and some binary.

I expect that most treatments have no association with the outcome. I am trying to spot the outlier treatments that do have an effect, and estimate how many cases that effect was responsible for. For each treatment, I’d like to say something like:

Supposing there were no association between this treatment and the outcome, here is the number of cases (and a confidence interval) we would have expected to see for this treatment, predicted based on the covariates for the set of units that received that treatment.

Is there a standard way for doing this?

I fit a logistic regression model $m(mathbf{x})$ that gives $E[Y]$ for a given $mathbf{x_i}$. That is, it predicts the outcome given the covariates.

Then I can look at:

$$c_j=text{actual number cases with treatment $j$} = sum_i{t_{i,j} y_i}$$
$$h_j=text{predicted number of cases with treatment $j$} = sum_i{t_{i,j} m(mathbf{x_i})}$$

and compare the two.

These quantities are easily calculable, but I’m unsure how to test significance of the difference (call it $q_k$, where $k$ is the number of units with the treatment), since I don’t know what distribution $Q_k$ is drawn from under the null hypothesis.

I want $Q_k$ to be the distribution one would get by doing the following: sample $k$ many $(mathbf{x_i}, y_i)$ pairs according to the process that generated my data (or uniformly from the data itself), and compute the difference between the predicted and actual number of cases. Intuitively, this distribution should get tighter as $k$ increases and also as the model gets more accurate. That is, I don’t want to just assume that my model is accurate (and model $Q_k$ as something like a poisson binomial), but rather I want the distribution of $Q$ to take into account how well my model fits the data.

Any pointers or ideas would be very appreciated!


Thanks @Bill for the link! My understanding of that standardization approach is that their confidence interval takes into account the variance of the model (i.e. the uncertainty in the logistic regression coefficients), but not the performance of the model. They suggest separately testing the goodness of fit of the model. I am hoping to find something that takes directly into account the performance of the model in the confidence interval. Does that make sense?


By “performance”, I mean how well my model fits the data (perhaps something involving the AUC or the $R^2$).

Here’s the sort of thing I’m imagining.

Let $Y=y(x)$ be my outcome and $V=m(x)$ be the logistic regression model’s predicted probability for a data point $x$ drawn at random. Suppose I draw $k$ points, so I have $Y_1, … Y_k$ and $V_1, … V_k$. Define:

$$ M = sum_{i=0}^k V_i = text{sum of model estimates} $$
$$ Z = sum_{i=0}^k Y_i = text{sum of actual outcomes} $$
$$ Q = M – Z $$

I want to ideally know the distribution of $Q$, or at least know its first two moments. Trying to work those out:

$$ E[Q] = E[M] – E[Z] $$
$$ E[M] = k E[V] = kbar{v} $$
$$ E[Z] = kbar{y} $$

where $bar{v}$ and $bar{y}$ come from my data and are the average model prediction and the average outcome, respectively. (I would expect $E[Q]$ to usually be close to $0$.) For the variance:

$$ Var[Q] = E[Q^2] – E[Q]^2 $$
$$ E[Q^2] = E[(M-Z)^2] = E[M^2] – 2E[MZ] – E[Z^2] $$
$$ E[M^2] = E[(V_1 + ... + V_k)^2] = k E[V^2] = k bar{v^2} $$
$$ E[Z^2] = E[(Y_1 + ... + Y_k)^2] = k E[Y^2] = k E[Y] = k bar{y} $$
$$ E[MZ] = E[(V_1 + ... V_k)(Y_1 + ... + Y_k)] = k E[VY] = k (Pr[Y=1] E[VY vert Y=1] + Pr[Y=0] E[VY vert Y=0] = kPr[Y=1]E[V vert Y=1] = kbar{y} E[V vert Y=1] = 2kbar{y}bar{v}_{Y=1}$$

where in all of these, I’ve assumed independence between each of my $k$ draws. The $bar{v}_{Y=1} = E[V vert Y=1]$ is related to the accuracy of my model, and is the average model prediction across all the positive cases. Putting this together, if the algebra is right:

$$ E[Q] = kbar{v} – kbar{y} $$
$$ Var[Q] = k bar{v^2} – 2kbar{y}bar{v}_{Y=1} – k bar{y} – (kbar{v} – kbar{y})^2 $$

I’m interested if this makes sense, and if there’s a standard technique that does this (or if perhaps the technique @Bill mentioned is already doing this, but I’m just not understanding).

bayesglm (arm) versus MCMCpack

Both bayesglm() (in the arm R package) and various functions in the MCMCpack package are aimed at doing Bayesian estimation of generalized linear models, but I’m not sure they’re actually computing the same thing. The MCMCpack functions use Markov chain Monte Carlo to obtain a (dependent) sample from the joint posterior for the model parameters. bayesglm(), on the other hand, produces… I’m not sure what.

It looks like bayesglm() produces a point estimate, which would make it MAP (maximum a posteriori) estimation rather than a full Bayesian estimation, but there’s a sim() function that looks like it can be used to get posterior draws.

Can someone explain the difference in intended use for the two? Can bayesglm() + sim() produce true posterior draws, or is it some sort of approximation?

Bootstrap glm and extract pvalue

I am running a glm model using bootstrap, I can extract the coefficient mean and the confidence intervals for all the factors in my model.

But how can I get the pvalue from there?


rate.lm <- glm(one ~ two + factor(three), data = data)

lmfit <- function(data, indices) {
  fit <- glm(one ~ two + factor(three), data = data[indices, ])

results <- boot(data= data, statistic = lmfit, R = 1000, strata=data$two)

for (i in 2:length(rate.lm$coefficients)) {
      bci <-, type = "basic", index = i)
      print(sprintf("%s,%.4f,%.4f,%.4f", names(rate.lm$coefficients)[i], results$t0[i],
                    bci$basic[4], bci$basic[5]))

Which gives me back the coeff mean and the 95% confidence interval. In that case I know the variable two is significant but I need a pvalue absolutely for the rest of my analysis and I am unsure on how to get it. Thanks for your help!

[1] "two,0.0101,-0.0022,0.0229"
[1] "factor(three)2,-0.2029,-0.2185,-0.1888"
[1] "factor(three)3,-0.1752,-0.1914,-0.1595"
[1] "factor(three)4,-0.2093,-0.2259,-0.1947"
[1] "factor(three)5,-0.1745,-0.1910,-0.1588"
[1] "factor(three)6,-0.1988,-0.2139,-0.1849"

post hoc nested

Hy, need to do post hoc test for nested factor in SAS.
I have two factors, A and B with B nested within A and I want to perform post hoc test in GLM in SAS. Is it possible?

simulate GLM with square root link in R

I’m trying to simulate a fitted GLM using basic functions, not using the simulate() and predict() functions that are widely questioned and answered. I get different results when I compare my math function to the simulate() and predict() function. Most probably I’m doing something wrong but I cannot seem to find the error.

First I generate random correlated data with a skewed dependent variable:

n <- 1500
d <- mvrnorm(n=n, mu=c(0,0,0,0), Sigma=matrix(.7, nrow=4, ncol=4) + diag(4)*.3)
d[,1] <- qgamma(p=pnorm(q=d[,1]), shape=2, rate=2)

Next I fit a GLM with a square root link to adjust for the skewed data:

m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4], family=gaussian(link="sqrt"))

Next I generate predicted values first on the linear scale and then on the inverse scale inclusing stochastic uncertainty (eventually I want to use other than the source data as input values):

p_lin <- m$coef[1] + m$coef[2]*d[,2] + m$coef[3]*d[,3] + m$coef[4]*d[,4]
p <- rnorm(n=n, mean=p_lin^2, sd=sd(p_lin^2 - d[,1]))

I compare the results to the simulate() and predict() function:

par(mfrow=c(1,1), mar=c(4,2,2,1), pch=16, cex=0.8, pty="s")
xylim <- c(min(c(d[,1], p)), max(c(d[,1], p)))
plot(p, d[,1], xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(simulate(m)$sim_1, d[,1], col=rgb(0,1,0,alpha=0.1))
points(predict(m, type="response"), d[,1], col=rgb(1,0,0,alpha=0.1))
abline(a=0, b=1, col="red")

What is going wrong in my formula to predict values?
Where can I find what mathematical and R expressions are being used in the predict() and simulate() functions?
Is there a link explaining simulation of GLM including stochastic uncertainty (and eventually my next step is also parameter uncertainty) in various family/link combinations applied in R. I found one nice source on GLM simulation though not answering my specific questions:

Question and Answer is proudly powered by WordPress.
Theme "The Fundamentals of Graphic Design" by Arjuna
Icons by FamFamFam