What GLM family and link function for “proportion of time”?

A simple question to which I don’t seem to find the answer anywhere.

I have a response variable, the duration of time spent doing A, for individuals tested up to a fixed $\text{max duration} = X$. Expressed as a proportion of $X$, the variance is therefore likely to be much wider near $0.5$ than near $0$ or $1$.

However, I am confused about which family and link function I need, as time is continuous but cannot be less than $0$ or greater than $X$.

PS: I am using R, so any code examples would be appreciated.

I have heard about the tobit() function from “AER”, which “is a convenience interface to survreg setting different defaults and providing a more convenient interface for specification of the censoring information.”

Therefore, it seems for my data I could run the model

tobit(duration.A ~ factor1*factor2, left=0, right=X)

However, it is unclear A) how I could do this for a mixed model, i.e. a model with random factors, B) what its assumptions are (they are not clear from the R documentation http://cran.r-project.org/web/packages/betareg/vignettes/betareg.pdf), and C) how I could get residuals and fitted values for plotting.
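On point (C), a minimal sketch, assuming a data frame `mydata` and censoring bound `X` (both hypothetical names): a tobit object inherits from survreg, so the usual survreg extractors should apply.

```r
library(AER)  # tobit() wraps survival::survreg()

# hypothetical data frame and censoring bound
fit <- tobit(duration.A ~ factor1 * factor2, left = 0, right = X, data = mydata)

summary(fit)
fv  <- predict(fit, type = "response")     # fitted values on the response scale
res <- residuals(fit, type = "deviance")   # survreg-style residuals
plot(fv, res)
```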

What exactly does a Type III test do?

I’m having trouble understanding what exactly a Type III test statistic does. Here is what I got from my book:

“Type III” tests test for the significance of each explanatory variable, under the assumption that all other variables entered in the model equation are present.

My questions are :

  1. What exactly does “all other variables entered in the model equation are present” mean? Say I have a Type III test statistic for variable $x_i$; does the Type III test tell us whether the coefficient on $x_i$ is equal to zero or not?

  2. If so, then what’s the difference between a Type III test statistic and a Wald test? (I believe they are essentially two different things, since SAS gives me two different numerical outputs.) Currently I have both outputs for my independent variables (which are all dummy variables), and I don’t know which p-value to look at to decide which $x$ to drop.
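As an aside, and only as a sketch since your output is from SAS: in R the analogous Type III table can be requested with car::Anova, which, for a single-coefficient term, amounts to testing whether that coefficient is zero with all other terms retained in the model (the model and data names below are hypothetical).

```r
library(car)  # for Anova()

# Type III tests are only meaningful with sum-to-zero contrasts
options(contrasts = c("contr.sum", "contr.poly"))

mod <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)  # hypothetical fit
Anova(mod, type = "III")  # one test per term, all other terms kept in the model
```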

SAS syntax to find differences in regards to a control treatment

I am working with a data set of bacterial cell counts, using flow cytometry. I recorded the cell number in 3 different species of bacteria, all treated with 3 different compounds (L-aspartic acid, L-asparagine and D-sorbitol), at 2 time points. My data set looks like this:

[image: the data set, with cell counts by species, compound, and time point]

The cell counts are the total number (tot) and the intact cells (int). I want to know whether the total number of cells differs among species or among treatments, and I would also like to check whether there are differences among time points within species. Further, I have to repeat the same analysis for the intact cells. I have the following syntax in SAS:

proc mixed data=flowcyt;
class count comp species;
model cells = comp|species / outp=resid1 ddfm=kr;
repeated / group=comp;
lsmeans comp / pdiff adjust=tukey cl;
ods output diffs=ppp lsmeans=mmm;
run;

But with this I do not get the differences by cell type (either the total or the intact cell counts). Therefore, I wrote the following:

proc mixed;
class count tpt comp;
model cells = tpt|comp / ddfm=kr;
random count;
lsmeans tpt|comp / pdiff;
contrast 'LAS - Ctr' Comp -1 0 -1 0 2 0;
contrast 'LAA - Ctr' Comp 1 1 1 -1 -1 -1;
contrast 'Sor - Ctr' Comp 0 0 0 0 1 -1;
run;

However, I do not get the differences among species.

Hence, I thought about using GLM, but if I write it like the syntax above, I get the differences at each level separately and have to split the data set.

proc glm;
class species comp;
model cells = comp;
means comp / snk duncan scheffe;
run;

With the following I get a huge output that I am not sure is correct:

proc glm;
class species count tpt comp;
model cells = tpt|comp|species;
means tpt|comp|species / snk scheffe duncan;
run;

Finally, I tried this:

proc mixed data=flowcyt covtest;
class species comp tpt;
model cells = comp|tpt / ddfm=kr;
repeated tpt / subject=species type=cs;
lsmeans tpt|comp / pdiff;
run;

and again, I cannot get the differences in cell count among bacterial species.

I am at a loss at this moment and I think I am a bit stuck. Does anyone know the right syntax to get the differences in cell counts among the 3 bacterial species within each time point, and to determine whether the cell counts differ significantly from the control treatment?

Thanks for your time!

Two simple questions regarding GLM

I’m currently doing a modelling project. However, I haven’t taken many statistics classes, so I have to teach myself generalized linear models. I’m reading Generalized Linear Models for Insurance Data (de Jong and Heller, 2008, CUP), and I have two questions:

1. On page 64, it says:

Given a response $y$, the generalized linear model is $f(y)=c(y,\phi)\exp\left\{\frac{y\theta - a(\theta)}{\phi}\right\}$. The equation for $f(y)$ specifies that the distribution of the response is in the exponential family.

Is that the equation for the distribution of $E[y_i|x_i]$, or for something else? If it is the distribution of $y$ corresponding to a fixed $x_i$, is it possible that, even if the plot of $y$ against $x$ looks like a straight line, I should still use a GLM instead of simple linear regression?

Update: I guess I should clarify a little. Currently I have a dataset whose dependent variable is $y$. I made a histogram of $y$ (with frequency on the y-axis) and a gamma curve appears to fit it well. Does that imply that I should choose $f(y)$ to be a gamma density? I rather doubt it, because I suppose $y_i \mid X = x_i$ and $y$ are essentially two different things. I hope I’m not confusing you.

2. The book suggests that when the response $y$ is assumed to follow a gamma distribution, it is common practice to use a logarithmic link function. I don’t quite understand the reason behind that.
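For what it’s worth, the usual motivation is that the log link keeps the fitted mean strictly positive (a gamma variable is positive) and turns covariate effects into multiplicative effects on the mean. A minimal sketch in R, with hypothetical variable and data set names:

```r
# Gamma GLM with log link: E(y) = exp(X beta) > 0 by construction
fit <- glm(claim_size ~ age + vehicle_group,
           family = Gamma(link = "log"),
           data = insurance)  # hypothetical data set

exp(coef(fit))  # coefficients act multiplicatively on the mean
```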

Any suggestion would be great. Thanks!

Zero inflated model problem: system is computationally singular

I’m using R. After getting an error asking me to provide starting values for a glm (Poisson family), I took a look at my data and realized I had quite a few zeroes. So I tried zeroinfl from pscl. I got the “computationally singular” error, so I tried dist="negbin". Same error. I looked at my data via with(bytype, table(Events, Type)) and with(bytype, table(Events, Generation)), and it looks like my data are lopsided. But I don’t know how to deal with that. I think I need to cut part of them out, but the easy attempts (dropping the group with the most zeroes) didn’t work.

First, my basic model: Events ~ Generation*Type + offset(log(Population))
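In pscl’s two-part notation the model might look like the sketch below. The intercept-only zero part (after the `|`) is my simplification, not your original specification; a fully crossed zero part could be what makes the system singular.

```r
library(pscl)

# count part: Generation*Type with an exposure offset; zero part: intercept only
m <- zeroinfl(Events ~ Generation * Type + offset(log(Population)) | 1,
              data = bytype, dist = "negbin")
summary(m)
```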

Second, data summaries:

by Type:

Events Familicide Home.Intruder Rampage Sabotage School Vehicular Workplace
     0         49            88      59      103     91       111        94
     1         40            21      37       10     11         1        17
     2         18             4      11        0     10         1         2
     3          4             0       5        0      1         0         0
     4          2             0       1        0      0         0         0

by Generation:

Events Missionary Lost  GI Silent Boomers   X
     0        139  103 110    103     106  34
     1         24   17  15     29      34  18
     2          4    5   1     13      17   6
     3          1    1   0      1       3   4
     4          0    0   0      1       1   1

Due to posting length limits, my entire data set won’t fit; I hope this is enough to go on.

Added in edit: “Generation” is ordinal and I’ve set contrasts to c("contr.sum", "contr.poly"); I don’t know whether that matters in this case.

Best Fit for Exponential Data

I’m trying to better understand some of the theory behind fitting models that have a nonlinear link between the response and the predictors.


#Create a random exponential sample
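Something along these lines, with the sample size and coefficients being my own choices (the original block is not shown):

```r
# create a random exponential sample whose mean increases with x
set.seed(1)
n   <- 1000
x   <- runif(n)
y   <- rexp(n, rate = exp(-(1 + 2 * x)))  # so E(y) = exp(1 + 2x)
df1 <- data.frame(x = x, y = y)
```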

Now I will attempt 4 different models fits and explain what I expect as a result versus the actual result.


I expect the m1 model to be the worst fit. This is OLS with the log of the response taken before fitting the model. By taking the log of the response I have a model that, when exponentiated, cannot be negative; therefore, the assumptions of normally distributed residuals with constant variance cannot hold. The graph appears to underweight the tails of the distribution significantly.
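A sketch of what I take m1 to be, given the description above (log-transformed response, then OLS):

```r
# OLS on the log of the response; exp(fitted(m1)) is then always positive
m1 <- lm(log(y) ~ x, data = df1)
```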

Next, I will fit the model using a GLM.


I expect the m2 model to be a slightly different fit than m1. This is because we are now modeling $\log(E(y)) = X\beta$ rather than $E(\log(y)) = X\beta$. I cannot say from a theoretical standpoint whether this fit should be better or worse. R gives an R-squared figure for lm() calls but not for glm() (it gives an AIC instead). Looking at the plot, m2 appears more poorly matched in the tails of the distribution than m1. The issues with normally distributed residuals should still be the same.
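From the description, m2 is presumably a Gaussian GLM with a log link:

```r
# Gaussian errors, log link: models E(y) = exp(X beta) with additive noise
m2 <- glm(y ~ x, family = gaussian(link = "log"), data = df1)
```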

For m3, I relax the residual distribution assumptions by changing to a Gamma family residual distribution. Since an exponential distribution is a Gamma family distribution, I expect this model to very closely match the sampled data.
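Presumably something like the following (the exponential distribution is a Gamma with shape 1):

```r
# Gamma family with log link; an exponential is Gamma(shape = 1)
m3 <- glm(y ~ x, family = Gamma(link = "log"), data = df1)
```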


AIC suggests that m3 is a better fit than m2, but it still does not appear to be a very good fit. In fact, it appears to be a worse estimate in the tails of the distribution than m1.

The m4 model will use nls just for a different approach.

m4<-nls(y~exp(a+b*x),data=df1, start = list(a = 0, b = 0))

#for direct comparison
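A plausible way to put the four fits on a common response scale (my own sketch; the original comparison code is not shown):

```r
# collect fitted means on the response scale for side-by-side plotting
preds <- data.frame(
  m1 = exp(fitted(m1)),  # back-transform the log-scale OLS fit
  m2 = fitted(m2),
  m3 = fitted(m3),
  m4 = fitted(m4)
)
AIC(m2, m3)  # AIC is comparable between the two GLMs
```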

This appears to very closely match m2. I also expected this to very closely fit the data.

My questions are: why does m3 not match my data more closely? Is there a way, using glm(), to fit this model more closely? I realize that with random sampling the tails of my sample will not fit the model exactly, but these appear nowhere close. Perhaps letting n approach infinity would allow one (or more) of the models to converge, but rerunning the code with n = 100,000 actually appears to give worse-fitting models (perhaps because, with 100,000 random samples, more outliers were drawn, and the way the graph is presented gives an undue amount of focus to these outliers?).

I also realize that an exponential distribution is not the same as an exponentiated normal distribution. That is to say, just because I took the log of the response doesn’t mean I should get a “perfect model” as a result; however, in m3, where I fit the model with a Gamma family distribution, I expected to get a very close fit.

Thanks in advance.

How do I choose between a simple and a mixed effect logistic regression?

I have a list of predictor variables to put into a logistic regression model. How do I know whether I should do a simple logistic regression (using the glm function in R) or a mixed-effects logistic regression (using the glmer function in R)?
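For concreteness, the two candidate calls, with hypothetical variable names (`resp` a binary outcome, `subject` a potential grouping factor):

```r
library(lme4)

# ordinary logistic regression: observations assumed independent
fit_simple <- glm(resp ~ x1 + x2, family = binomial, data = dat)

# random intercept per subject: appropriate when observations are
# clustered (repeated measures, sites, litters, ...)
fit_mixed <- glmer(resp ~ x1 + x2 + (1 | subject),
                   family = binomial, data = dat)
```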

How exactly is the sum (or mean) centering constraint for splines (also w.r.t. gam from mgcv) done?

The data-generating process is: $y = \sin\big(x+I(d=0)\big) + \sin\big(x+4\,I(d=1)\big) + I(d=0)z^2 + 3I(d=1)z^2 + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0,1)$.

Let $x,z$ each be a sequence from $-4$ to $4$ of length $100$, and let $d$ be the corresponding factor $d\in\{0,1\}$. Take all possible combinations of $x,z,d$ to calculate $y$:
[image: plots of $y$ over $x,z$ for each level of $d$]

Using the (uncentered) B-spline basis for $x,z$ within each level of $d$ will not work because of the partition-of-unity property (rows sum to 1): such a model is not identifiable (even without an intercept).

Example (setting: 5 inner knot intervals, uniformly distributed; B-spline of degree 2; the spline function is a custom one):

# drawing the sequence
library(data.table)  # for CJ()
n <- 100
x <- seq(-4,4,length.out=n)
z <- seq(-4,4,length.out=n)
d <- as.factor(0:1)
data <- CJ(x=x,z=z,d=d)

# setting up the response (rnorm(.N) so the noise is drawn per row, not recycled)
data[,y := sin(x+I(d==0)) + sin(x+4*I(d==1)) + I(d==0)*z^2 + 3*I(d==1)*z^2 + rnorm(.N,0,1)]

# creating the uncentered B-Spline-Basis for x and z
X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=FALSE)]
> head(X)
     x.1d0 x.2d0 x.3d0 x.4d0 x.5d0 x.6d0 x.7d0 x.1d1 x.2d1 x.3d1 x.4d1 x.5d1 x.6d1 x.7d1
[1,]   0.5   0.5     0     0     0     0     0   0.0   0.0     0     0     0     0     0
[2,]   0.0   0.0     0     0     0     0     0   0.5   0.5     0     0     0     0     0
[3,]   0.5   0.5     0     0     0     0     0   0.0   0.0     0     0     0     0     0

Z <- data[,spline(z,min(z),max(z),5,2,by=d)]
> head(Z)
         z.1d0     z.2d0      z.3d0 z.4d0 z.5d0 z.6d0 z.7d0     z.1d1     z.2d1      z.3d1 z.4d1 z.5d1 z.6d1
[1,] 0.5000000 0.5000000 0.00000000     0     0     0     0 0.0000000 0.0000000 0.00000000     0     0     0
[2,] 0.0000000 0.0000000 0.00000000     0     0     0     0 0.5000000 0.5000000 0.00000000     0     0     0
[3,] 0.4507703 0.5479543 0.00127538     0     0     0     0 0.0000000 0.0000000 0.00000000     0     0     0
     z.7d1
[1,]     0
[2,]     0
[3,]     0

# lm will drop one spline column for each factor level
lm(y ~ -1+X+Z, data=data)

Call:
lm(formula = y ~ -1 + X + Z, data = data)

Coefficients:

 Xx.1d0   Xx.2d0   Xx.3d0   Xx.4d0   Xx.5d0   Xx.6d0   Xx.7d0   Xx.1d1   Xx.2d1   Xx.3d1   Xx.4d1   Xx.5d1  
 23.510   19.912   18.860   22.177   23.080   19.794   18.727   68.572   69.185   67.693   67.082   68.642  
 Xx.6d1   Xx.7d1   Zz.1d0   Zz.2d0   Zz.3d0   Zz.4d0   Zz.5d0   Zz.6d0   Zz.7d0   Zz.1d1   Zz.2d1   Zz.3d1  
 69.159   67.496    1.381  -11.872  -19.361  -21.835  -19.698  -11.244       NA   -1.329  -38.449  -62.254  
 Zz.4d1   Zz.5d1   Zz.6d1   Zz.7d1  
-69.993  -61.438  -39.754       NA

To overcome this problem, Wood (Generalized Additive Models: An Introduction with R, pages 163–164) proposes the sum (or mean) centering constraint

$$\boldsymbol{1}^T\tilde{\boldsymbol{X}}_j\boldsymbol{\beta}_j = 0.$$

This can be done by reparametrization if a matrix $\boldsymbol{Z}$ is found such that

$$\boldsymbol{C}\boldsymbol{Z} = \boldsymbol{0}, \qquad \boldsymbol{\beta}_j = \boldsymbol{Z}\tilde{\boldsymbol{\beta}}_j.$$

The $\boldsymbol{Z}$ matrix can be found from the QR decomposition of the constraint matrix $\boldsymbol{C}^T = (\boldsymbol{1}^T\tilde{\boldsymbol{X}}_j)^T = \tilde{\boldsymbol{X}}_j^T\boldsymbol{1}$.

Note that $\tilde{\boldsymbol{X}}_j\boldsymbol{1} = \boldsymbol{1}$ by the partition-of-unity property.
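In R, the reparametrization described above can be sketched as follows, with `Xj` standing for the uncentered basis of one smooth (a hypothetical name):

```r
# constraint: C %*% beta = 0 with C = 1^T Xj (the column sums of the basis);
# find Z whose columns span the null space of C via a QR decomposition of C^T
C  <- matrix(colSums(Xj), nrow = 1)
Q  <- qr.Q(qr(t(C)), complete = TRUE)
Z  <- Q[, -1, drop = FALSE]   # columns 2..k of Q span null(C)
Xc <- Xj %*% Z                # centered basis, one fewer column
```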

The centered/constrained-version of my B-Spline-Matrix is:

X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=TRUE)]
         x.1d0      x.2d0      x.3d0      x.4d0      x.5d0       x.6d0     x.1d1      x.2d1      x.3d1      x.4d1
[1,] 0.2271923 -0.3225655 -0.3225655 -0.3225655 -0.2728077 -0.05790256 0.0000000  0.0000000  0.0000000  0.0000000
[2,] 0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.00000000 0.2271923 -0.3225655 -0.3225655 -0.3225655
[3,] 0.2271923 -0.3225655 -0.3225655 -0.3225655 -0.2728077 -0.05790256 0.0000000  0.0000000  0.0000000  0.0000000

          x.5d1       x.6d1
[1,]  0.0000000  0.00000000
[2,] -0.2728077 -0.05790256
[3,]  0.0000000  0.00000000

Z <- data[,spline(z,min(z),max(z),5,2,by=d,intercept=TRUE)]
         z.1d0      z.2d0      z.3d0      z.4d0      z.5d0       z.6d0     z.1d1      z.2d1      z.3d1      z.4d1
[1,] 0.2271923 -0.3225655 -0.3225655 -0.3225655 -0.2728077 -0.05790256 0.0000000  0.0000000  0.0000000  0.0000000
[2,] 0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.00000000 0.2271923 -0.3225655 -0.3225655 -0.3225655
[3,] 0.2875283 -0.3066501 -0.3079255 -0.3079255 -0.2604260 -0.05527458 0.0000000  0.0000000  0.0000000  0.0000000

          z.5d1       z.6d1
[1,]  0.0000000  0.00000000
[2,] -0.2728077 -0.05790256
[3,]  0.0000000  0.00000000

My question is: even though the fits are very similar, why do my constrained B-spline columns differ from what gam provides? What did I miss?

# comparing with gam from mgcv
mod.gam <- gam(y~d+s(x,bs="ps",by=d,k=7)+s(z,bs="ps",by=d,k=7),data=data)
X.gam <- model.matrix(mod.gam)
  (Intercept) d1 s(x):d0.1   s(x):d0.2  s(x):d0.3  s(x):d0.4  s(x):d0.5   s(x):d0.6 s(x):d1.1   s(x):d1.2
1           1  0 0.5465301 -0.05732768 -0.2351708 -0.2259983 -0.1201207 -0.01043987 0.0000000  0.00000000
2           1  1 0.0000000  0.00000000  0.0000000  0.0000000  0.0000000  0.00000000 0.5465301 -0.05732768
3           1  0 0.5465301 -0.05732768 -0.2351708 -0.2259983 -0.1201207 -0.01043987 0.0000000  0.00000000

   s(x):d1.3  s(x):d1.4  s(x):d1.5   s(x):d1.6 s(z):d0.1    s(z):d0.2  s(z):d0.3  s(z):d0.4  s(z):d0.5
1  0.0000000  0.0000000  0.0000000  0.00000000 0.5465301 -0.057327680 -0.2351708 -0.2259983 -0.1201207
2 -0.2351708 -0.2259983 -0.1201207 -0.01043987 0.0000000  0.000000000  0.0000000  0.0000000  0.0000000
3  0.0000000  0.0000000  0.0000000  0.00000000 0.5471108 -0.031559945 -0.2302910 -0.2213227 -0.1176356

    s(z):d0.6 s(z):d1.1    s(z):d1.2  s(z):d1.3  s(z):d1.4  s(z):d1.5   s(z):d1.6
1 -0.01043987 0.0000000  0.000000000  0.0000000  0.0000000  0.0000000  0.00000000
2  0.00000000 0.5465301 -0.057327680 -0.2351708 -0.2259983 -0.1201207 -0.01043987
3 -0.01022388 0.0000000  0.000000000  0.0000000  0.0000000  0.0000000  0.00000000

The dotted line corresponds to my fit, the solid line to the gam version:
[image: comparison of the two fitted curves]

Constrained Maximization and Likelihood Ratio Tests for Nested Linear Models

Suppose $\boldsymbol\beta \in \mathbb{R}^k$ is a vector of coefficients for a generalized linear model with $g\left[E(Y|X)\right] = X\beta$ for a link function $g$, and I wish to test the composite hypothesis
$$H_0: \beta_k = 0 \quad \text{versus} \quad H_1: \beta_k \neq 0,$$
where the other parameters $\beta_{-k}$ are nuisance parameters.

I can do that with a likelihood ratio test.
Formally, the likelihood ratio test of such a composite hypothesis is 2 times the difference of the maximized unrestricted and restricted log-likelihoods:
$$\lambda = 2\left[L^* - L^*_{\beta_k = 0}\right]$$
where $$L^* = \sup_{\beta \in \mathbb{R}^k} L(y, \boldsymbol\beta) \quad (a)$$
$$L^*_{\beta_k = 0} = \sup_{\beta_{-k} \in \mathbb{R}^{k-1},\, \beta_k = 0} L(y, \boldsymbol\beta) \quad (b)$$ and
$L(y, \boldsymbol\beta)$ is the log-likelihood of data $y$ evaluated at parameter $\boldsymbol\beta$.

But in practice, everyone instead replaces the constrained maximization $L^*_{\beta_k = 0}$ with

$$L^*_{-k} = \sup_{\beta_{-k} \in \mathbb{R}^{k-1}} L_{-k}(y, \boldsymbol\beta_{-k}) \quad (c)$$

where $L_{-k}$ is the log-likelihood of the model $g\left[E(Y|X)\right] = X_{-k}\beta_{-k}$ that deletes the last covariate, as in an analysis of deviance.
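In R, the practice described above is what anova() on two nested fits does; a sketch with hypothetical variables:

```r
# nested fits: the reduced model simply omits the k-th covariate
full    <- glm(y ~ x1 + x2 + x3, family = poisson, data = dat)
reduced <- glm(y ~ x1 + x2, family = poisson, data = dat)

# likelihood ratio test: 2 * (L_full - L_reduced) against chi^2(1)
anova(reduced, full, test = "LRT")
```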

Is this replacement justified through a Slutsky-type argument about the rate at which $L^*_{-k}$ converges to $L^*_{\beta_k = 0}$ (e.g. at least as fast as $O(n)$)? Or am I missing something that actually implies the two quantities are the same in finite samples? Assuming they are different, does anyone have a reference that discusses the finite-sample differences between the two estimates?

Edit: well, this is embarrassing

With a little more thought about the maximizations in (b) and (c), it is clear that they solve the same program, hence this entire exercise was ill-posed:

$$\begin{aligned}
\sup_{\beta_{-k} \in \mathbb{R}^{k-1},\, \beta_k = 0} L(y, \boldsymbol\beta) &= \sup_{\beta_{-k} \in \mathbb{R}^{k-1},\, \beta_k = 0} L(y, [\beta_{-k}, \beta_k]) \\
&= \sup_{\beta \in \mathbb{R}^k} \begin{cases} L(y, [\beta_{-k}, \beta_k]) & \beta_k = 0 \\ -\infty & \beta_k \neq 0 \end{cases} \\
&= \sup_{\beta \in \mathbb{R}^k} \begin{cases} L_{-k}(y, \beta_{-k}) & \beta_k = 0 \\ -\infty & \beta_k \neq 0 \end{cases} \\
&= \sup_{\beta_{-k} \in \mathbb{R}^{k-1}} L_{-k}(y, \beta_{-k})
\end{aligned}$$

GLM with categorical predictor on R

I need to fit a generalized linear model.
My data are these: habitat (0 or 1), group (1 or 0), mortality (yes or no), and the number of individuals for each case (habitat, group and mortality are factors).
My model is

I want to see the influence of group and habitat on mortality.

and I get this warning message:
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

So what’s wrong?
