glm with offset for rate data


I am attempting to fit a GLM to rate data: in my case, the number of provisioning visits to a bird nest per hour. I model the data as in the example below, including time as an offset.

glm(visits ~ x, offset = log(time), family = poisson(link = "log"))

However, this is complicated by the fact that my nests also have different numbers of chicks, and I want to include this in my model. This suggests to me that I need to adjust my offset to account for both time and number of chicks. Am I thinking about this the right way, or do I need a different approach?
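For reference, the per-hour rate model described above can be written out as follows. This is a minimal sketch with simulated data; the variable names (`visits`, `x`, `time_hrs`) are assumptions, not the asker's actual columns:

```r
# Simulated stand-in for the nest data (names and effect sizes are assumptions)
set.seed(42)
time_hrs <- runif(50, 1, 4)          # observation time per nest, in hours
x        <- rnorm(50)                # some covariate of interest
visits   <- rpois(50, lambda = time_hrs * exp(0.5 + 0.3 * x))

# Poisson GLM for visits per hour: log(time) enters as an offset,
# i.e. a regressor with its coefficient fixed at 1
fit <- glm(visits ~ x, offset = log(time_hrs),
           family = poisson(link = "log"))
coef(fit)  # intercept and slope; should be near the simulated 0.5 and 0.3
```

An equivalent formulation puts the offset inside the formula, as `visits ~ x + offset(log(time_hrs))`.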

Help with complex model formula in lmer (lme4) for R


Most examples of lmer formula specification in R target rather simple study designs. However, sometimes one is confronted with more complex designs, and there is no documentation of a general rule for formula construction. For instance, how would I formulate the following design:

  • 15 subjects (ID), treated as a random factor, in 3 groups (Group), treated as a fixed or random effect
  • 2 within-subject variables (Position, Size)
  • 1 random factor: primary eye (Eye)

I tried:

lmer(Y ~ Group*Position*Size + (1|ID) + (1|Eye) + (1|Position:ID) + (1|Size:ID))

However, with this formula I’m not sure:

  • whether Group should also appear in the 1|Position and 1|Size terms, since these are actually nested not only within ID but within ID/Group
  • a similar question for the random factor Eye
  • how I would go about defining Group as a random between-subjects factor instead of a fixed one, since the group sampling happened randomly
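One way to make the nesting explicit is shown below; note that lme4's `A/B` shorthand simply expands to `A + A:B`. This is a sketch on simulated data, and all names, level counts, and the choice of random terms are assumptions:

```r
library(lme4)

# Simulated stand-in for the design described above
set.seed(1)
dat <- expand.grid(ID = factor(1:15), Position = factor(1:3),
                   Size = factor(1:2), rep = 1:2)
dat$Group <- factor((as.integer(dat$ID) - 1) %% 3)  # 3 groups of 5 subjects
dat$Eye   <- factor(as.integer(dat$ID) %% 2)        # primary eye per subject
dat$Y     <- rnorm(nrow(dat))                       # pure-noise response

# ID nested in Group written explicitly as Group:ID; Group itself stays fixed.
# (Because the ID labels are already unique across groups, (1|ID) would be
# equivalent to (1|Group:ID) here.)
fit <- lmer(Y ~ Group * Position * Size +
              (1 | Group:ID) + (1 | Eye) +
              (1 | ID:Position) + (1 | ID:Size), data = dat)
```

With a pure-noise response like this, lme4 will typically report a singular fit, which is expected; the point is only the formula structure.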

Obtaining predicted values (Y=1 or 0) from a logistic regression model fit


Let’s say that I have an object of class glm (corresponding to a logistic regression model) and I’d like to turn the predicted probabilities given by predict.glm using the argument type="response" into binary responses, i.e. $Y=1$ or $Y=0$. What’s the quickest & most canonical way to do this in R?

While, again, I’m aware of predict.glm, I don’t know where exactly the cutoff value for $\hat{P}(Y_i = 1 \mid X_i)$ lives; I guess this is my main stumbling block here.
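A common sketch is the one below, on simulated data. The key point is that the 0.5 cutoff is the analyst's choice, not something predict.glm supplies:

```r
# Toy logistic fit (simulated data)
set.seed(7)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

# predict() without newdata returns fitted probabilities for the training data
p_hat <- predict(fit, type = "response")

# Turn predicted probabilities into 0/1 with an explicit cutoff
y_hat <- as.integer(p_hat > 0.5)   # the 0.5 threshold is a modelling choice
table(y_hat, y)                    # simple confusion table
```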

Binary logistic regression with interaction terms


I have been reading several CV posts on binary logistic regression, but I am still confused about my current situation.

I am attempting to fit a binary logistic regression to a series of continuous and categorical variables in order to predict the mortality or survival of animals (qual_status). Please see the str output below:

> str(logit)
'data.frame':   136 obs. of  9 variables:
 $ id         : Factor w/ 135 levels "01001","01002",..: 26 27 28 29 30 31 32 33 34 35 ...
 $ gear       : Factor w/ 2 levels "j","sc": 2 1 1 2 1 2 1 2 2 1 ...
 $ depth      : num  146 163 179 190 194 172 172 175 240 214 ...
 $ length     : num  37 35 42 38 37 41 37 52 38 37 ...
 $ condition  : Factor w/ 4 levels "1","2","3","4": 1 1 4 1 4 2 2 1 2 1 ...
 $ in_water   : num  80 45 114 110 60 121 56 140 93 68 ...
 $ in_air     : num  60 136 128 136 165 118 220 90 177 240 ...
 $ delta_temp : num  8.5 8.4 8.3 8.5 8.5 8.6 8.6 8.7 8.7 8.7 ...
 $ qual_status: Factor w/ 2 levels "0","1": 1 1 2 1 2 1 2 1 1 1 ...

I have no issues fitting the following additive binary logistic regression with the glm function:

glm(qual_status ~ gear + depth + length + condition + in_water + in_air + delta_temp, data = logit, family = binomial)

…but I am also interested in how these predictor variables interact with one another and possibly influence survival. However, when I attempt the following binary logistic regression with interactions:

glm(qual_status ~ gear * depth * length * condition * in_water * in_air * delta_temp, data = logit, family = binomial)

I receive the warning "glm.fit: fitted probabilities numerically 0 or 1 occurred", along with missing coefficients due to singularities (NA) and implausibly small p-values (<2e-16 ***) when I use summary:

Call:
glm(formula = qual_status ~ gear * depth * length * condition * 
    in_water * in_air * delta_temp, family = binomial, data = logit)

Deviance Residuals: 
  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [36]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [71]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[106]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Coefficients: (122 not defined because of singularities)
                                                            Estimate Std. Error    z value Pr(>|z|)    
(Intercept)                                                1.419e+30  5.400e+22   26274077   <2e-16 ***
gearsc                                                    -1.419e+30  5.400e+22  -26274077   <2e-16 ***
depth                                                      1.396e+28  4.040e+20   34539471   <2e-16 ***
length                                                     6.807e+28  1.836e+21   37079584   <2e-16 ***
condition2                                                -3.229e+30  8.559e+22  -37727993   <2e-16 ***
condition3                                                 1.747e+31  4.636e+23   37671986   <2e-16 ***
condition4                                                 9.007e+31  2.388e+24   37724167   <2e-16 ***
in_water                                                  -4.540e+28  1.263e+21  -35935748   <2e-16 ***
in_air                                                    -4.429e+28  1.182e+21  -37470809   <2e-16 ***
delta_temp                                                -1.778e+28  3.237e+21   -5492850   <2e-16 ***
gearsc:depth                                              -1.396e+28  4.040e+20  -34539471   <2e-16 ***
gearsc:length                                             -6.807e+28  1.836e+21  -37079584   <2e-16 ***
depth:length                                              -9.293e+26  2.450e+19  -37930778   <2e-16 ***
gearsc:condition2                                          1.348e+30  3.567e+22   37809001   <2e-16 ***
gearsc:condition3                                          2.816e+30  7.495e+22   37575317   <2e-16 ***
gearsc:condition4                                                 NA         NA         NA       NA    

Fitting only the continuous variables in a binary logistic regression doesn’t yield any warnings or singularities, but the addition of the ordinal predictor variables causes issues. Along with avoiding these warnings, is there a function/package that can handle dummy variables (I believe that is what I am looking for) in logistic regressions in R?
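With 136 observations, the full seven-way interaction model has far more parameters than the data can support, which is what produces the perfect-separation warning and the NA coefficients. One common sketch is to restrict the model to two-way interactions with the `( ... )^2` formula shorthand. The data below are simulated stand-ins, since the original `logit` data frame isn't available:

```r
# Simulated stand-in mimicking part of the `logit` data frame's structure
set.seed(3)
n <- 136
dat <- data.frame(
  gear        = factor(sample(c("j", "sc"), n, replace = TRUE)),
  depth       = rnorm(n, 180, 25),
  length      = rnorm(n, 40, 5),
  condition   = factor(sample(1:4, n, replace = TRUE)),
  qual_status = factor(sample(0:1, n, replace = TRUE))
)

# All main effects plus all two-way interactions only
fit2 <- glm(qual_status ~ (gear + depth + length + condition)^2,
            data = dat, family = binomial)
length(coef(fit2))   # far fewer terms than the full seven-way model
```

On the dummy-variable question: glm already dummy-codes factors automatically via model.matrix, so no extra package is needed for that part.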

Are too many p-values less than 0.0001 alarming?


Can anyone explain why having too many p-values less than 0.001 is alarming? For example, here is what my model looks like right now:

[screenshots of the model summary output]

All independent variables are categorical except for Calendar_Year. There are 682211 observations in my data set.

Tukey for GLM can't find data in model


I fit a GLM to a dataset. Now I want to see where the difference between my groups is, so I tried to run a Tukey HSD as a post-hoc test.

Because it is a GLM, I can’t use TukeyHSD(). So I tried to run a Tukey test with:

summary(glht(my.mod, mcp(treatment="Tukey")))  

When I try this, I get the following error:

Variable(s) ‘treatment’ have been specified in ‘linfct’ but cannot be found in ‘model’!

This is what it looks like in R:

> recol
   Sample At..Ind. At.Sp.   X treatment
1      C1        1      1   0         C
2      C2        2      2   0         C
3      C3        1      1   0         C
4     FS1        1      1  50        FS
5     FS2        4      3  80        FS
6     FS3        0      0  80        FS
7     FL1        3      3 280        FL
8     FL2        4      3 310        FL
9     FL3        2      2 220        FL
10    VS1        2      2  40        VS
11    VS2        0      0  50        VS
12    VS3        1      1   5        VS
13    VL1        3      3 400        VL
14    VL2        7      6 600        VL
15    VL3        3      3 500        VL
16    MS1        1      1  30        MS
17    MS2        3      3  30        MS
18    MS3        0      0  30        MS
19    ML1        2      1 400        ML
20    ML2        7      7 300        ML
21    ML3        7      7 300        ML
>  my.mod=glm(Arachnida~as.factor(treatment),data=soort)
> summary(my.mod)

Call:
glm(formula = Arachnida ~ as.factor(treatment), data = soort)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3333  -0.3333   0.0000   0.0000   2.6667  

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)   
(Intercept)             3.107e-16  6.901e-01   0.000  1.00000   
as.factor(treatment)FL -1.277e-16  9.759e-01   0.000  1.00000   
as.factor(treatment)FS -3.511e-16  9.759e-01   0.000  1.00000   
as.factor(treatment)ML  3.000e+00  9.759e-01   3.074  0.00825 **
as.factor(treatment)MS  3.333e-01  9.759e-01   0.342  0.73775   
as.factor(treatment)VL  2.333e+00  9.759e-01   2.391  0.03141 * 
as.factor(treatment)VS  3.333e-01  9.759e-01   0.342  0.73775   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 1.428571)

    Null deviance: 48.571  on 20  degrees of freedom
Residual deviance: 20.000  on 14  degrees of freedom
AIC: 74.571

Number of Fisher Scoring iterations: 2

> glht(my.mod, mcp(treatment="Tukey"))
Error in mcp2matrix(model, linfct = linfct) : 
Variable(s) ‘treatment’ have been specified in ‘linfct’ but cannot be found in ‘model’ 
> summary(glht(my.mod, mcp(treatment="Tukey")))
Error in mcp2matrix(model, linfct = linfct) : 
Variable(s) ‘treatment’ have been specified in ‘linfct’ but cannot be found in ‘model’ 

Could anyone help me with my problem?
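For context, this error usually means the variable name given to mcp() does not match a term in the fitted model: the model was fit with `as.factor(treatment)`, so glht looks for a term literally named `treatment` and fails (note also that the printed data is `recol` while the model is fit on `soort`). A sketch of the usual fix is to make the variable a factor in the data first, then use the bare name in the formula. The toy data below are assumptions, and a Poisson family is shown since Arachnida looks like a count:

```r
library(multcomp)

# Toy data shaped like `recol`/`soort` (an assumption)
set.seed(5)
soort <- data.frame(
  Arachnida = rpois(21, 2),
  treatment = factor(rep(c("C", "FS", "FL", "VS", "VL", "MS", "ML"), each = 3))
)

# treatment is a factor in the data, so the formula can use the bare name,
# and mcp(treatment = "Tukey") can find it in the model
my.mod <- glm(Arachnida ~ treatment, data = soort, family = poisson)
tuk <- glht(my.mod, linfct = mcp(treatment = "Tukey"))
summary(tuk)
```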

GLM, Regression analysis, or other


I need to answer a question using SPSS, but I don’t really know which model to use. The question concerns the environment: the outcome is an ordinal scale variable (from not at all concerned – very concerned). I need to compare it across variables like gender (male/female), age group, and education level (no formal education -…), and determine which groups are more or less concerned about the environment (without taking into account effects caused by other variables). How should I go about answering this question? I’m pretty sure I need to use the GLM model.

AR terms and independent variable as regressors


After trying several models with my data, the $R^2$ and p-values suggest my model looks like the one below. The ACF plot tells me the AR term is significant. Insight into the data tells me a change in $x$ would have an impact (say $y$ is A/C usage and $x$ is temperature).

$$
y_t = a + b\,y_{t-1} + c\,(x_t - x_{t-1}) + \varepsilon_t
$$

Question: Is the above model statistically meaningful? If so, what is this category of “linear model” called? I have come across AR, MA, and ARMA/ARIMA models, but I haven’t seen a model where an AR term and the differenced independent variable appear together.
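Models combining an AR term with exogenous regressors are often called ARX, or, when the error rather than the response follows the ARMA process, "regression with ARIMA errors" (sometimes ARIMAX). A sketch with base R's arima(), using the differenced $x$ as an external regressor; note this fits regression-with-AR(1)-errors, which is closely related to but not identical to the lagged-$y$ equation above. All data here are simulated:

```r
# Simulated series (assumed structure): y responds to changes in x
set.seed(9)
n  <- 200
x  <- cumsum(rnorm(n))                     # temperature-like random walk
dx <- c(0, diff(x))                        # x(t) - x(t-1)
y  <- as.numeric(arima.sim(list(ar = 0.5), n)) + 0.8 * dx

# AR(1) errors with the differenced covariate as an exogenous regressor
fit <- arima(y, order = c(1, 0, 0), xreg = dx)
fit$coef   # ar1 near 0.5, dx coefficient near 0.8
```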

glm or glmm model with unequal variance


I am fitting a GLM with a binomial family:

glm(response ~ Treatment, family = binomial, data=dat)

The only explanatory variable, treatment, is categorical. After fitting the model, the residuals clearly do not have equal variances among the treatment levels (heteroscedasticity).

[residual plot omitted]

My questions: 1) What can I do in this situation? A gam() is not meaningful for categorical variables, and gls() is only applicable to linear models.

2) What if I have an extra random factor in the model (a GLMM) and the residuals are still heterogeneous?

glmer(response~ Treatment + (1 | groupID), family = binomial, data=dat)

Are there any techniques or R packages that can solve this?

This question is different from Regression modelling with unequal variance, which asks about a similar situation with a linear model.

GLM analogue of weighted least squares


The short version:

I can fit a model using Weighted Least Squares, given a diagonal matrix of weights $W$, by solving $(X^TWX)\hat{\beta}=X^TWy$ for $\hat{\beta}$.

Is there a GLM analogue? If so, what is it?

There seems to be a GLM analogue, e.g. the weights argument in R’s glm function. How is R using these weights?
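For the normal-linear case, the weights argument and the WLS normal equations agree exactly, which can be checked directly (a small self-contained sketch with simulated data):

```r
set.seed(1)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
w <- runif(30, 0.5, 2)           # arbitrary positive weights

# Solve (X'WX) beta = X'Wy by hand; w * X scales row i of X by w_i
X <- cbind(1, x)
beta_manual <- solve(t(X) %*% (w * X), t(X) %*% (w * y))

# lm() (or gaussian glm()) with a weights argument gives the same answer
fit <- lm(y ~ x, weights = w)
all.equal(as.numeric(beta_manual), as.numeric(coef(fit)))  # TRUE
```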


The long version:

the situation

As a follow-up to my IPTW question, I just want to double-check that I understand how to fit a parametric model using inverse probability(-of-treatment) weights (IPTW). The idea with IPTW is to simulate a dataset in which the relationship between my independent variables $(a^1,a^2,a^3)$ and dependent variable $y$ is unconfounded and therefore causal. For argument’s sake, let’s say I have already estimated an IPT weight $\hat{w}_i$ for each observation. These weights are hypothetical probability weights from the simulated dataset.

the question

I now want to fit a GLM. I’d just use WLS, but I’m working with a binary outcome and an outcome truncated at zero. So I have a linear predictor $\eta_i=a^T\beta$, a link $\mu_i=g(\eta_i)$, and a variance $V(y_i)$ derived from my likelihood for $y$. Then the likelihood equations are
$$
\sum_{i=1}^N \frac{y_i-\mu_i}{V(y_i)}\frac{\partial\mu_i}{\partial\beta_j}=\sum_{i=1}^N \frac{y_i-\mu_i}{V(y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}x_{ij}\right)=0,\quad\forall j
$$ as per Categorical Data Analysis, Agresti, 2013, section 4.4.5.

So all I have to do is multiply $\operatorname{var}(\mu_i)$ by the weight $\hat{w}_i$, right? The same way I might if I wanted to incorporate an overdispersion parameter? If so, is this because the variance of, say, 5 independent observations is 5 times the variance of one independent observation?

Follow-up idea: since the likelihood is the product of the likelihood for each observation, is there some weighting procedure I can use to just weight the likelihoods?
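On the follow-up idea: R's glm prior weights do exactly that, maximizing $\sum_i \hat{w}_i \ell_i(\beta)$. A numeric check on simulated data is below (non-integer binomial weights trigger a harmless "non-integer #successes" warning, suppressed here):

```r
set.seed(2)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
w <- runif(n, 0.5, 2)                    # stand-in IPT weights

# glm with prior weights
fit <- suppressWarnings(glm(y ~ x, family = binomial, weights = w))

# Directly maximizing the weighted log-likelihood gives the same estimates
nll <- function(b) -sum(w * dbinom(y, 1, plogis(b[1] + b[2] * x), log = TRUE))
opt <- optim(c(0, 0), nll)
max(abs(coef(fit) - opt$par))   # ~0, up to optimizer tolerance
```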
