Modelling count data: mean-variance relationship

I have fit a Poisson, a quasi-Poisson, and a negative binomial model to some count data.

To ensure that I have valid models, I am checking that the following assumptions are satisfied:

  1. No multicollinearity
  2. Independent errors
  3. Mean-variance relationship
  4. Linear on link scale

To visually check that assumption 3 is satisfied for each model, I created groups based on the linear predictor, computed the mean and variance for each group (left plot), and plotted the mean-variance relationship (right plot, red line).

First plot = Poisson model

Second plot = quasi-Poisson model

Third plot = negative binomial model


I am having a hard time telling from these plots which of the three models best satisfies assumption 3.

I would appreciate any advice on how to choose the model, out of these three, that comes closest to satisfying assumption 3.

As a final note, there is evidence of overdispersion ($\hat{\phi}=85.80$), and when I fit zero-inflated models, Vuong tests find the standard Poisson and negative binomial models to be superior.
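For what it's worth, the binned mean-variance diagnostic can be compared against the three theoretical curves directly: under Poisson, var = μ; under quasi-Poisson, var = φμ; under negative binomial, var = μ + μ²/θ. A minimal sketch in Python/NumPy (the data here are simulated with a hypothetical θ = 2, since the original data are not available):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate overdispersed counts: negative binomial with theta = 2,
# mean varying with a linear predictor.
n, theta = 5000, 2.0
eta = rng.uniform(0.0, 3.0, n)           # linear predictor
mu = np.exp(eta)                          # mean on the response scale
y = rng.negative_binomial(theta, theta / (theta + mu))

# Bin observations by the linear predictor; compute group mean/variance.
bins = np.quantile(eta, np.linspace(0, 1, 11))
idx = np.clip(np.digitize(eta, bins) - 1, 0, 9)
group_mean = np.array([y[idx == g].mean() for g in range(10)])
group_var = np.array([y[idx == g].var(ddof=1) for g in range(10)])

# Under Poisson, var = mu; under NB, var = mu + mu^2 / theta.
nb_var = group_mean + group_mean**2 / theta
poisson_gap = np.abs(group_var - group_mean).mean()
nb_gap = np.abs(group_var - nb_var).mean()
print(poisson_gap > nb_gap)  # NB curve tracks the binned variances better
```

The model whose theoretical curve stays closest to the binned variances across the range of the mean is the one that best satisfies assumption 3.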

Transform of predictor variables

Assume I have a linear model like this:

$$y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1,\dots,n$$

I know that if $y_i$ must be greater than zero, I should fit against $\log(y_i)$ rather than $y_i$. If $y_i$ must lie in $[0,1]$ (as in logistic regression), I should use a logistic link function.

On the other hand, in my experience, if $x_{ij}$ must be greater than zero, I usually get a better fit if I first log-transform it, i.e. I use $\log(x_{ij})$ rather than $x_{ij}$.

Is there any theory behind this experience?

Moreover, let's say:

(1) If I know $x_{ij}$ is an indicator variable (it must be either $0$ or $1$), should I first transform it with the inverse of the logistic function?

(2) What if I know $x_{ij}$ lies in the interval $[a,b]$?
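One common rationale for the log transform: strictly positive predictors often act multiplicatively, i.e. a proportional change in $x$ (not a unit change) shifts $y$, which makes $y$ linear in $\log(x)$. A small illustration with simulated data and hypothetical coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data where y responds to proportional changes in x:
# y = 2 + 1.5 * log(x) + noise, with x strictly positive and skewed.
n = 500
x = rng.lognormal(mean=0.0, sigma=1.0, size=n)
y = 2.0 + 1.5 * np.log(x) + rng.normal(0.0, 0.3, n)

def fit_rss(pred, y):
    """OLS of y on [1, pred]; return the residual sum of squares."""
    X = np.column_stack([np.ones_like(pred), pred])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

rss_raw = fit_rss(x, y)          # fit against x
rss_log = fit_rss(np.log(x), y)  # fit against log(x)
print(rss_log < rss_raw)         # the log scale fits markedly better here
```

This is only a demonstration that when the true relationship is multiplicative, the log scale wins; it does not by itself tell you whether a given positive predictor should be transformed.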

Probabilistic forecasting

My question relates to probabilistic forecasting. How does one actually go about computing a forecast?

Let's say I have some data that can be modelled by a specific distribution, and the values of the coefficients etc. are known. I know the mean of this distribution gives me the expected value, but is this value the forecast? How would the forecast evolve over time?

For example, with ARMA forecasts, the forecasted values are computed using a recursive formula which is updated at each time step with the new value. I don’t see how this applies to a probabilistic forecast.

Is it perhaps random (the term might be stochastic)? That is, for each time point you want to forecast, do you randomly generate a value from the distribution?
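One way to think of it: a probabilistic forecast is the whole predictive distribution, and the mean is just one summary of it. Simulation recovers the rest. A sketch assuming a hypothetical fitted AR(1) (φ = 0.8, σ = 1): simulate many future paths from the fitted model, then read off a point forecast and a prediction interval at each horizon.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fitted model: AR(1), y_t = 0.8 * y_{t-1} + eps, eps ~ N(0, 1).
phi, sigma, y_last = 0.8, 1.0, 5.0
horizon, n_paths = 10, 20000

# Simulate many future paths from the fitted model.
paths = np.empty((n_paths, horizon))
y = np.full(n_paths, y_last)
for h in range(horizon):
    y = phi * y + rng.normal(0.0, sigma, n_paths)
    paths[:, h] = y

point_forecast = paths.mean(axis=0)             # approximates phi^h * y_last
lower, upper = np.quantile(paths, [0.05, 0.95], axis=0)
print(point_forecast[0])                        # close to 0.8 * 5 = 4.0
```

So the forecast is not a single randomly drawn value: individual draws are samples from the predictive distribution, and summaries of many draws (mean, quantiles) give the point forecast and its uncertainty band.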


What's the difference between error distribution and residual distribution in generalized linear models?

I have encountered generalized linear models, but I'm confused about errors and residuals. Can anyone help me out? I have three questions.
(1) What is the difference between an error and a residual?
(2) What does the error distribution refer to in generalized linear models? The distribution of the response variable, or something else?
(3) Is assuming normally distributed residuals always needed, whether in a linear model or a generalized linear model?
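On (1), one useful distinction is that errors are unobservable model quantities, while residuals are computed from a fit, and GLMs offer several residual types (response, Pearson, deviance). A sketch with hypothetical Poisson fitted means, showing how the common residual types are computed:

```python
import numpy as np

# Hypothetical observed counts and fitted Poisson means (from some GLM fit).
y = np.array([0.0, 2.0, 5.0, 1.0, 7.0])
mu = np.array([1.2, 1.8, 4.5, 2.0, 6.1])

raw = y - mu                          # response residuals
pearson = (y - mu) / np.sqrt(mu)      # standardized by Var(Y) = mu for Poisson

# Deviance residuals: signed square root of each observation's
# contribution to the deviance (y log(y/mu) term is 0 when y = 0).
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(y > 0, y * np.log(y / mu), 0.0)
dev = np.sign(y - mu) * np.sqrt(2.0 * (term - (y - mu)))
print(np.round(dev, 3))
```

Note the standardization in the Pearson residuals already reflects the Poisson error distribution, which is exactly what the "family" specifies in a GLM: the conditional distribution of the response, not a normality assumption on residuals.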

What is the best way to test if an IV has a stronger effect on the DV than another IV that is a subset of the first IV?

I wish to investigate the effect of a corporation's general international experience (measured as the number of foreign subsidiaries it owns) versus the effect of its local experience in a region (measured as the total number of subsidiaries in that region) on a choice related to the region, expressed as a binary variable.

I am applying logistic regression, and I suspect there is a problem with entering both variables at the same time, since they most likely depend on each other and may be highly correlated (but maybe not; the data are not ready yet). In that case, would it be sensible to use a stepwise approach, entering general experience first and then adding local experience (basically a subset) to see if it captures some of the variance of the first and improves predictive power in general?

If, given the related nature of the variables, the above approach is a no-go, would a better approach be to compare two otherwise identical models, one containing the general experience variable and the other the local experience variable, and then compare the effect sizes (of the estimates and odds ratios) along with the overall model fit of both models?

I know that diagnosing variables and the assumptions related to them is generally very basic stuff, but I can't seem to reach a conclusion as to whether adding both variables is valid or not. I appreciate all the help I can get.
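For nested models like these, the standard comparison is a likelihood-ratio test: the drop in deviance when local experience is added to the model that already contains general experience. A self-contained sketch on simulated, deliberately correlated predictors (all coefficients hypothetical); the logistic fit is a bare-bones Newton-Raphson rather than a packaged routine:

```python
import numpy as np

rng = np.random.default_rng(7)

def fit_logistic(X, y, n_iter=50):
    """Logistic regression by Newton-Raphson; returns (beta, deviance)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        grad = X.T @ (y - p)                 # score
        hess = X.T @ (X * W[:, None])        # observed information
        beta += np.linalg.solve(hess, grad)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    dev = -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, dev

# Hypothetical data: "general" experience and a correlated "local" variable.
n = 2000
general = rng.normal(size=n)
local = 0.7 * general + 0.5 * rng.normal(size=n)   # correlated with general
eta = -0.5 + 0.4 * general + 0.8 * local
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

X1 = np.column_stack([np.ones(n), general])            # general only
X2 = np.column_stack([np.ones(n), general, local])      # general + local
_, dev1 = fit_logistic(X1, y)
_, dev2 = fit_logistic(X2, y)

# Likelihood-ratio statistic: deviance drop, ~ chi-square(1) under H0.
lr = dev1 - dev2
print(lr > 3.84)   # exceeds the 5% critical value: local adds information
```

Correlation between the predictors inflates standard errors but does not invalidate this nested comparison; the LR test asks exactly whether local experience explains variance beyond general experience.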

mixed models/glm – What is the best way to evaluate whether a variable over time can predict an outcome?

I am interested in how the slopes of a variable measured at different time points (e.g. 0,1,2,3,4 months) can predict another outcome variable measured at 4 months.

I have done this by creating a longitudinal model using the measured variable (Y) as the dependent variable and, as independent variables, time and outcome with their interactions, e.g. Y ~ time + time^2 + outcome + time*outcome + time^2*outcome, using glm or a linear mixed model.

Is this the correct method? Normally you would put the variable of interest (in this case, outcome) as the dependent variable, right? Is it also correct to adjust for variables measured at time = 0 this way? Are there other ways, perhaps by extracting the random effects?
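The "extracting the random effects" idea can be sketched as a two-stage approach: estimate each subject's slope over time, then relate the estimated slopes to the outcome. (Caveat: this ignores the uncertainty in the stage-1 slope estimates, which a joint mixed model would propagate.) Simulated data with hypothetical effect sizes:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical longitudinal data: 100 subjects, Y measured at months 0..4;
# each subject's true slope partly determines the month-4 outcome.
n_subj, times = 100, np.arange(5.0)
true_slope = rng.normal(1.0, 0.5, n_subj)
Y = 10.0 + true_slope[:, None] * times + rng.normal(0.0, 0.5, (n_subj, 5))
outcome = 2.0 * true_slope + rng.normal(0.0, 0.3, n_subj)

# Stage 1: estimate each subject's slope by per-subject OLS.
est_slope = np.array([np.polyfit(times, Y[i], 1)[0] for i in range(n_subj)])

# Stage 2: relate the estimated slopes to the outcome.
r = np.corrcoef(est_slope, outcome)[0, 1]
print(r)   # strongly positive: slopes predict the outcome
```

Here the stage-2 model has outcome as the dependent variable, matching the usual direction of prediction; the joint model in the question answers a related but reversed question (does the outcome group differ in trajectory?).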


Poisson model with fractions

I have a simple website with a home page that has 5 different images on it. All images have a fixed set of 'features' associated with them (size, color, position, etc.). When a visitor comes to the page, he can either click on one of the images or simply leave.

For a given time interval, since both the number of views of the page and the number of clicks on an image are 'counts', they can be modeled as Poisson RVs. I want to build a regression model for the click-through rate (CTR), i.e. #clicks / #views (where #views is common to all 5 images and #clicks obviously differs).

I’m using the generalized linear model (glm) in R for this:

model <- glm(log(numClicks) ~ (features) + offset(log(numViews)), family=poisson, data=mydata)

Is this the right approach? I don’t know if it’s correct to take the ratio of two Poissons in combination with the canonical log link function and glm.

Is there a better way to model CTR?
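One standard formulation keeps the raw counts as the response and moves log(numViews) into an offset, i.e. glm(numClicks ~ features + offset(log(numViews)), family = poisson), without logging the response: the log link then models log(clicks/views), which is the log CTR. A sketch of why the offset recovers the CTR coefficients, using a hand-rolled Newton-Raphson Poisson fit on simulated data (all parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

def fit_poisson(X, y, offset, n_iter=50):
    """Poisson regression with log link and an offset, via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)
        grad = X.T @ (y - mu)                # score
        hess = X.T @ (X * mu[:, None])       # information
        beta += np.linalg.solve(hess, grad)
    return beta

# Hypothetical data: clicks with a true CTR depending on one image feature.
n = 1000
views = rng.integers(200, 2000, n).astype(float)
feature = rng.normal(size=n)
true_ctr = np.exp(-4.0 + 0.5 * feature)      # clicks per view
clicks = rng.poisson(views * true_ctr)

X = np.column_stack([np.ones(n), feature])
beta = fit_poisson(X, clicks, offset=np.log(views))
print(beta)   # close to the true (-4.0, 0.5)
```

The offset enters with a fixed coefficient of 1, so the remaining coefficients describe the rate (CTR) directly; no ratio of Poissons is ever formed.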

Generalized Linear Models vs time series models for forecasting

What are the differences in using Generalized Linear Models, such as Automatic Relevance Determination (ARD) and Ridge regression, versus Time series models like Box-Jenkins (ARIMA) or Exponential smoothing for forecasting? Are there any rules of thumb on when to use GLM and when to use Time Series?

Modeling continuous abundance data with a GLM in R: how to select the correct distribution family?

I have abundance data (counts) that I have standardized by area sampled, making them continuous. I would like to explain them with my two independent variables using a GLM but I am having trouble specifying a model distribution. The data are derived from raw counts of salamanders at 40 sites standardized by area of pond sampled. The raw counts are as follows:

0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  2  2  2  2  3  3  3  4  5  8 10 11 12 21

And after standardizing the counts by area sampled:

0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.1754590  0.4491828  0.8517423  1.4341965  2.2777698  4.0467065  0.3889454  0.5935273  1.9376223  2.0924642  0.5110034  0.5418544  6.9358962  1.5491324  2.2689315 14.2278592  1.4483645  0.3947695  6.2244910  8.7240609

It seems that Poisson and negative binomial are now out of the question because my data are not integers. My data have zeros, so I don't think I can use a gamma distribution without transforming them.

I used the fitdist function in R (package fitdistrplus) to generate parameters for continuous distributions (exponential and normal). I then randomly sampled true exponential and normal distributions with the generated parameters [rexp(n, rate) and rnorm(n, mean, sd)], respectively. Using a two-sided KS test [ks.test(x, y)], I compared my data with the generated data, and the distributions were significantly different (p << 0.05), ruling out my data being normal or exponential. I transformed the data with an ln(x+1) transformation (to keep zero values) and they still deviate significantly from the hypothesized distributions.

Because my data have no negative values, contain zeros, and are not right- and/or left-bounded, I'm not sure what distribution to specify for the glm.

My questions are:
1) How can I determine the distribution of my response variable, and if I need to transform it, can I use a transformation that turns my zeros into non-zeros or even negative values?
2) If the model family depends on the distribution of the errors, how can I know their distribution without first removing the potentially explainable variation? It seems like I would need to explain some variation with my predictor variables and then model the distribution of the remaining error. Is there a better approach to this?

Any suggestions are helpful, especially those for R or Excel. My knowledge of statistical theory is limited.
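Two candidates worth knowing about for exactly this shape of data: keep the raw counts and put log(area) into an offset (so Poisson or negative binomial remain available), or model the standardized values directly with the Tweedie/compound Poisson-gamma family, which produces exact zeros plus continuous positives. A short simulation showing how a compound Poisson-gamma variable generates that zero-plus-continuous pattern (all parameters hypothetical):

```python
import numpy as np

rng = np.random.default_rng(11)

# Compound Poisson-gamma (Tweedie with 1 < p < 2): a Poisson number of
# gamma-distributed "events" summed per site gives exact zeros at sites
# with no events and continuous positive values elsewhere.
n = 40
n_events = rng.poisson(0.8, n)                 # many sites get zero events
y = np.array([rng.gamma(2.0, 1.5, k).sum() for k in n_events])

print((y == 0).sum(), y[y > 0][:5])
```

This is also why point 2) is answered the way it is in GLM practice: the family describes the conditional distribution of the response given the predictors, so it is chosen from the nature of the data (zeros, positivity, mean-variance behavior) and then checked with residual diagnostics after fitting, not determined from the marginal distribution beforehand.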


Fitting a glm to a zero inflated positive continuous response

I'm trying to fit R glms to data sets where the response is zero-inflated positive continuous. Here is an example data set:

df <- structure(list(y=c(0,0,0,0,0,0,0,0,4.09417142857143e-05,0,0,0,0,2.18718787234043e-05,0,0,0,0,0,0,0,0,0,0,0,0,0,4.60192985143918e-05,0,1.16500081901182e-05,0,1.13497040795196e-05,0,2.01959645093946e-05,0,0,2.24704801120448e-05,3.50195529889299e-05,2.26178388643899e-06,0,0,0,0,0,0,1.24748435005701e-05,0,0,0,1.14008923269261e-06,0,0,0,0,0,0,0,1.95445414576182e-05,0,0,0,0,0,0,0,0,1.72293987591653e-05,1.11122316638726e-05,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),behavior = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,4),time = c(272,148,132,854,234,340,90,282,840,230,118,620,130,1410,1272,684,6624,1220,250,4556,4500,2254,662,334,336,1364,346,4308,372,7448,1242,9658,926,6706,494,1996,3570,13550,62134,6876,3108,4722,794,2394,21502,21048,5442,2748,920,25596,1954,7896,996,12108,2046,23034,3216,7574,9882,5560,5610,4524,6846,3450,11630,3528,3546,9544,4820,4852,13508,3726,3744,1252,1254,2514,3786,2534,14080,3882,6510,10520,72928,10052)),.Names = c("y","behavior","time"),row.names = c(NA, 84),class = "data.frame")

Set behavior to be a factor:

df$behavior = as.factor(df$behavior)

The response is df$y; df$behavior and df$time are the fixed effects (categorical and continuous, respectively).

A glimpse at the distribution of the response:

I tried using the R tweedie package with glm2 (and base glm) with this code (assigning the results to out and fit):

out <- tweedie.profile(y ~ behavior + time, do.plot=TRUE, data = df)
fit <- glm2(y ~ behavior + time, family=tweedie(var.power=out$xi.max, link.power=1-out$xi.max), data = df)

which either does not converge (in the case of other data sets) or shows a poor Q-Q plot fit:

Is there any hope to reliably fit a glm to such data? If so what am I doing wrong?

I also tried using the R gamlss package with a log normal family where I add .Machine$double.eps to df$y:

df$y <- df$y + .Machine$double.eps
fit <- gamlss(y ~ behavior + time, data = df, family = LOGNO())

but it doesn’t seem to do much better:

Help would be greatly appreciated.
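A common alternative to Tweedie for this kind of response is a two-part (hurdle) model: a binary part for zero versus nonzero, and a lognormal part for the positive magnitudes, fit separately. A sketch on simulated data mimicking the scale of df$y (all parameters hypothetical; the binary part is reduced to an overall rate here rather than a regression on behavior and time):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical zero-inflated positive continuous response, like df$y above:
# a Bernoulli "any signal?" part plus a lognormal magnitude when nonzero.
n = 84
nonzero = rng.uniform(size=n) < 0.3
y = np.where(nonzero, rng.lognormal(mean=-11.0, sigma=1.0, size=n), 0.0)

# Part 1: probability of a nonzero response (in practice, a logistic
# regression on the covariates; here just the overall rate).
p_nonzero = (y > 0).mean()

# Part 2: lognormal fit to the positive values only.
logpos = np.log(y[y > 0])
mu_hat, sigma_hat = logpos.mean(), logpos.std(ddof=1)

# Combined mean prediction: E[y] = P(y > 0) * E[y | y > 0].
expected = p_nonzero * np.exp(mu_hat + 0.5 * sigma_hat**2)
print(p_nonzero, expected)
```

This sidesteps the .Machine$double.eps hack entirely: the zeros are handled by the binary part, so the lognormal only ever sees positive values.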
