What is a relevant level for percentage deviance explained in a GLM?


I am trying to sort out the unique effect of various environmental predictors on species occurrence (presence/absence data). I have been running glm models in R with family = binomial. Most of my variables have highly significant p-values, but I have a large data set (~8000 data points), so this is perhaps not surprising. I have also been calculating the percentage of deviance explained using the BiodiversityR package. Some of the variables that are highly significant by p-value explain very little deviance (about 1%).
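For concreteness, this is essentially how I obtain the percentage of deviance explained for a single predictor (a sketch; presence, env1 and dat are placeholder names, and as far as I understand it matches what BiodiversityR reports):

fit <- glm(presence ~ env1, family = binomial, data = dat)
# % deviance explained = reduction in deviance relative to the null model
100 * (fit$null.deviance - fit$deviance) / fit$null.deviance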

Is there a rule of thumb for saying a predictor is not meaningful based on the deviance explained? Are there any other tests I should be including? I wanted to look at each variable on its own, and then test specific interactions, but only if the variable was “important” in a model where it was the only predictor.

Unique spline for different groups in a linear model (and no spline at all for one group)


I have a problem that has puzzled me for a long time. It involves linear models and spline functions. I need to model “time since diagnosis” when some individuals never had a diagnosis. I use a Poisson model in which I have split time on age and time since diagnosis. The data set may look like this (with millions upon millions of rows, and with 95% having no diagnosis):

SUBJECT  AGE  DIAGNOSIS  TIME_SINCE_DIAGNOSIS  EVENT/OUTCOME
      1   30          A                     0              0
      1   31          A                     1              0
      1   32          A                     2              1
      2   45          B                     0              0
      2   46          B                     1              0
    103   22        NON              "0 or -"             0
    103   23        NON              "0 or -"             0
    ...

You get the idea. Now I want a model that looks something like this:

log(OUTCOME) = Spline(AGE) + DIAGNOSIS * Spline(TIME_SINCE_DIAGNOSIS) + offset(logrisktime)

and then plot the predicted risk for diagnoses A and B as a function of “time since diagnosis”, given that the diagnosis was made at age 30, and, in the same plot, for diagnosis = NON but as a function of age instead (with years since age 30 on the x-axis). I would like to have both those with a diagnosis and those with diagnosis = NON in the same model, so that I can form contrasts/linear combinations of effects.

The model and plot are done, but my problem is that the model above also creates a spline over “time since diagnosis” for those who have no diagnosis (i.e. DIAGNOSIS = NON). So I would instead like a model of the following type, for which I created indicator variables for the different diagnoses:

R: factor(A) factor(B) factor(NON)

SAS: class A B NON;

log(OUTCOME) = Spline(AGE) + NON + A*Spline(TIME_SINCE_DIAGNOSIS) + B*Spline(TIME_SINCE_DIAGNOSIS) + offset(logrisktime)

Are you with me?

So I want a unique spline of the variable “time since diagnosis” for each level of the variable “diagnosis” (which is what diagnosis*spline(time_since_diagnosis) does), with the exception that one level (those who had no diagnosis) should not have any spline assigned. In R, I imagine something like the sketch below.
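This is only a rough sketch of what I am after, and I am not sure it is the right way: the 0/1 indicators multiply the spline basis, so the spline drops out entirely for the NON group. It assumes TIME_SINCE_DIAGNOSIS is numeric and simply set to 0 for the NON group, that EVENT is the outcome column, that logrisktime exists, and that dat is the data set.

library(splines)

# 0/1 indicators for the diagnosis groups
dat$A   <- as.numeric(dat$DIAGNOSIS == "A")
dat$B   <- as.numeric(dat$DIAGNOSIS == "B")
dat$NON <- as.numeric(dat$DIAGNOSIS == "NON")

fit <- glm(EVENT ~ ns(AGE, df = 4) + NON +
             A:ns(TIME_SINCE_DIAGNOSIS, df = 3) +   # spline only contributes when A = 1
             B:ns(TIME_SINCE_DIAGNOSIS, df = 3) +   # spline only contributes when B = 1
             offset(logrisktime),
           family = poisson, data = dat)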

Any ideas?

Thanks,
Peter

GLM for count data


I ran an experiment with an eye tracker, and my data frame looks like this:

               Condition   DwellsAOI1 DwellsAOI2  TotalDwells
Participant1       1             12         13            25
Participant2       2            100         11           111
Participant3       1             50         50           100

and so on. DwellsAOI1 counts the number of dwells on AOI1. Each participant belongs to one condition only, and the duration of the experiment is not fixed, so TotalDwells differs across participants.

I want to check whether there is a significant difference between conditions in terms of DwellsAOI1. My first approach was to compute the proportions (DwellsAOI1/TotalDwells) and run an ANOVA; however, the data violate both ANOVA assumptions (normality and homogeneity of variance).
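For reference, that first attempt looked roughly like this (a sketch):

df$prop <- df$DwellsAOI1 / df$TotalDwells          # proportion of dwells on AOI1
summary(aov(prop ~ factor(Condition), data = df))  # one-way ANOVA on the proportions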

I searched on Google and I found that I can use a generalized linear model (GLM) with binomial family.

Currently I’m running it like this:

mod <- glm(cbind(DwellsAOI1,TotalDwells-DwellsAOI1) ~ Condition,
               data=df, family=binomial("logit"))
summary(aov(mod))

Is this the correct way to do it?

Thanks!

Linear Mixed Model for Longitudinal Data: Continuous vs. Discrete Time Parameter


I would like to conduct a power analysis for a linear mixed model with fixed effects for Treatment (two levels) and Time (four time points: pre, mid, post treatment, and 3 months post treatment) and correlated random effects for children (intercepts and slopes). I am using the following code in R with the lmer function:

library(lme4)
expdat <- expand.grid(kid = factor(1:500), Time = factor(1:4), Treat = c("XTx", "BAU"))
expdat$obs <- factor(seq(nrow(expdat)))
set.seed(101)
nsim <- 20
beta <- c(100, -7, 8, 15, 20, 0, 0, 0)
theta <- c(15.000000, 7.500000, 7.500000, 7.500000, 12.990381, 4.330127, 4.330127, 
           12.247449, 3.061862, 11.858541)
ss <- simulate(~Treat*Time + (1+Time | kid), nsim = nsim, family = gaussian, 
      weights = rep(25, nrow(expdat)), newdata = expdat, newparams = 
      list(theta = theta, beta = beta, sigma = 1))
expdat$Outcome <- ss[, 1]
fit1 <- lmer(Outcome ~ Treat*Time + (1+Time | kid), data = expdat)

I have the following questions:

  1. In the output I see the variance associated with the random intercepts at each time point and their correlation. However, I cannot find in the output the variance associated with the slopes, and I cannot find information about the correlation between the intercepts and the slopes. Where is that information?
  2. How does lmer know that Time has levels and that it should estimate a separate effect for each level of the Time factor instead of a single general slope parameter?

I am attaching the summary statement of the model:

Linear mixed model fit by REML ['lmerMod']
Formula: Outcome ~ Treat * Time + (1 + Time | kid)
Data: expdat

REML criterion at convergence: 22951.8

Scaled residuals: 
Min       1Q   Median       3Q      Max 
-2.55473 -0.47186 -0.00007  0.47268  2.57786 

Random effects:
Groups   Name        Variance Std.Dev. Corr          
 kid      (Intercept) 235.4531 15.3445                
          Time2       230.9249 15.1962  0.43          
          Time3       220.6076 14.8529  0.52 0.53     
          Time4       213.2725 14.6039  0.48 0.51 0.52
 Residual               0.9749  0.9874                
Number of obs: 4000, groups: kid, 500

Fixed effects:
                Estimate Std. Error t value
(Intercept)    100.92006    0.68765  146.76
TreatBAU        -6.94796    0.06245 -111.26
Time2            7.39371    0.68246   10.83
Time3           15.17849    0.66717   22.75
Time4           19.27623    0.65608   29.38
TreatBAU:Time2  -0.16052    0.08831   -1.82
TreatBAU:Time3  -0.05291    0.08831   -0.60
TreatBAU:Time4  -0.14215    0.08831   -1.61

Correlation of Fixed Effects:
            (Intr) TrtBAU Time2  Time3  Time4  TBAU:T2 TBAU:T3
TreatBAU    -0.045                                            
Time2        0.422  0.046                                     
Time3        0.510  0.047  0.528                              
Time4        0.468  0.048  0.506  0.520                       
TretBAU:Tm2  0.032 -0.707 -0.065 -0.033 -0.034                
TretBAU:Tm3  0.032 -0.707 -0.032 -0.066 -0.034  0.500         
TretBAU:Tm4  0.032 -0.707 -0.032 -0.033 -0.067  0.500   0.500 

Linear Mixed Model: Specification of matrices


I would like to conduct a power analysis for a linear mixed model with fixed effects for Treatment (two levels) and Time (four time points: pre, mid, post treatment, and 3 months post treatment) and correlated random effects for children (intercepts and slopes). I am using the following code in R with the lmer function:

library(lme4)
expdat <- expand.grid(kid = factor(1:500), Time = factor(1:4), Treat = c("XTx", "BAU"))
expdat$obs <- factor(seq(nrow(expdat)))

set.seed(101)
nsim <- 20
beta <- c(100, -7, 8, 15, 20, 0, 0, 0)
theta <- c(15.000000, 7.500000, 7.500000, 7.500000, 12.990381, 4.330127, 
           4.330127, 12.247449, 3.061862, 11.858541)
ss <- simulate(~Treat*Time + (1+Time | kid), nsim = nsim, family = gaussian, 
               weights = rep(25, nrow(expdat)), newdata = expdat, 
newparams = list(theta = theta, beta = beta, sigma = 1))
expdat$Outcome <- ss[, 1]
fit1 <- lmer(Outcome ~ Treat*Time + (1+Time | kid), data = expdat)

This is a repeated measures design where each person is measured four times. To simulate the data assuming random intercepts and random slopes and their correlations, shouldn’t I specify the covariance matrix for the random effects? How would one do that in R? What I have in mind is something like the sketch below.
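The following is only a sketch of the covariance structure I would like to specify explicitly (the numbers are made up, just to illustrate):

sds   <- c(15, 13, 12, 12)                # SDs of the intercept and the three Time effects
R     <- matrix(0.5, 4, 4); diag(R) <- 1  # their correlations
Sigma <- diag(sds) %*% R %*% diag(sds)    # 4 x 4 covariance matrix of the random effects
# How do I translate Sigma into the 'theta' vector that simulate() and lmer expect?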

Why does Poisson GLM fit better to non-integer data?


Let’s assume that I have a set of predictors and a non-negative integer response variable (a number of events). All observations are repeated a few times (that is, the same combination of predictor values occurs more than once). I need to predict the average number of events for every possible combination of predictor values. I therefore combined all observations with the same predictor values into one and assigned it the average number of events across those observations.

Next, I built four different models: OLS, OLS with a transformed response variable, a hurdle Gamma GLM and, I don’t know why, a Poisson GLM. Surprisingly, the Poisson model was the best one. Since this is my final qualification thesis I need some theoretical basis, but I can’t figure one out; I have always thought that Poisson regression assumes integer data. I hope somebody can help.
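For reference, here is roughly what I fitted (a sketch; the variable names are placeholders and the hurdle Gamma model is omitted for brevity):

# average the counts within each combination of predictor values
agg <- aggregate(events ~ x1 + x2 + x3, data = raw, FUN = mean)

fit.ols   <- lm(events ~ x1 + x2 + x3, data = agg)                     # OLS on the average
fit.trans <- lm(log1p(events) ~ x1 + x2 + x3, data = agg)              # OLS on a transformed response
fit.pois  <- glm(events ~ x1 + x2 + x3, family = poisson, data = agg)  # warns about non-integer y, but fits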

Log Likelihood for GLM


In the following code I perform a logistic regression on grouped data using glm and “by hand” using mle2. Why does the logLik function in R give me a log likelihood of logLik(fit.glm) = -2.336 that is different from the one I get by hand, logLik(fit.ml) = -5.514?

library(bbmle)

#successes in first column, failures in second
Y <- matrix(c(1,2,4,3,2,0),3,2)

#predictor
X <- c(0,1,2)

#use glm
fit.glm <- glm(Y ~ X, family = binomial(link = "logit"))
summary(fit.glm)

#use mle2
invlogit <- function(x) { exp(x) / (1 + exp(x)) }
nloglike <- function(a, b) {
  L <- 0
  for (i in 1:n) {
    L <- L + sum(y[i, 1] * log(invlogit(a + b * x[i])) +
                 y[i, 2] * log(1 - invlogit(a + b * x[i])))
  }
  return(-L)
}

fit.ml <- mle2(nloglike,
           start=list(
             a=-1.5,
             b=2),
           data=list(
             x=X,
             y=Y,
             n=length(X)),
           method="Nelder-Mead",
           skip.hessian=FALSE)
summary(fit.ml)

#log likelihoods
logLik(fit.glm)
logLik(fit.ml)


y <- Y
x <- X
n <- length(x)
nloglike(coef(fit.glm)[1],coef(fit.glm)[2])
nloglike(coef(fit.ml)[1],coef(fit.ml)[2])

A categorical variable in glm shows significance from analysis of deviance, but each level is not significant in z-test


I am fitting a generalized linear model (glm). The explanatory variable is categorical with three levels (control, treat1, treat2). The response variable is 0 or 1.
The observed response rate for each treatment level is plotted in the figure below (from left to right: control, treat1, treat2):

[Figure: observed response rates for control, treat1, and treat2]

There appears to be a large treatment effect of treat1 vs. control and of treat2 vs. control. I fitted the glm:

fit <- glm(response ~ treatment, family = binomial, data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -21.57    6536.57  -0.003    0.997
treat1         23.76    6536.57   0.004    0.997
treat2         43.13    9364.95   0.005    0.996

The z-test shows that neither treat1 nor treat2 is significant compared to the reference level control.

However, the analysis of deviance confirmed that the treatment factor as a whole is highly significant:

drop1(fit, test = "Chisq")

response ~ treatment
            Df   Deviance    AIC    LRT  Pr(>Chi)    
 <none>          13.003    19.003                     
 treatment   2   77.936    79.936 64.932 7.946e-15 ***

How should I interpret such a strange result? Why does the individual z-test not give me any significant result, while according to the plot there is obviously an effect of treat1 vs. control and of treat2 vs. control?

Designing a model for individual functions when only the sum of all functions is given


I am working on a data mining project and I have to design the following type of model.

For example, I have four features x1, x2, x3 and x4, and functions defined on these features such that each function depends on some subset of the available features, e.g.:

F1(x1, x2) = x1^2 + 2*x2^2   (each function could be polynomial or very complex)

F2(x2, x3) = 2*x2^2 + 3*x3^3

F3(x3, x4) = 3*x3^3 + 4*x4^4

That is, F1 is some function that depends on features x1 and x2, F2 is some function that depends on x2 and x3, and so on.

Now I have a training data set where the values of x1, x2, x3, x4 are known, together with the sum F1 + F2 + F3 (I know the total sum, but not the value of each individual function).

Using this training data I have to build a model that can predict each individual function.

I have thought about something like piecewise linear regression or MARS, but I am not able to apply it in this situation; the closest I have got is the sketch below.
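This is only a rough sketch, not a solution: since each function in my example is a sum of single-feature terms, the observed total is additive in x1 to x4, so a smooth additive fit on the total gives per-feature components, but not the grouping into F1, F2 and F3, which is where I am stuck ('train' and 'total' are placeholder names):

library(mgcv)
fit <- gam(total ~ s(x1) + s(x2) + s(x3) + s(x4), data = train)
plot(fit, pages = 1)   # estimated contribution of each feature to the total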

I am new to data mining and machine learning, so I apologize in advance if this question is too trivial or ill-posed. I have tried to model this in many ways but I do not have any clear idea how to proceed. I would appreciate any help.

“Bayesglm”, p-values and degrees of freedom?


I am trying to perform some logistic regressions (and I am a neophyte R user). Initially I used “glm” to compute coefficients, AIC and p-values; this worked great until I ran across a data set suffering from complete separation. In [1], Gelman et al. suggest using a weakly informative prior to address this problem; the corresponding algorithm is implemented in R as “bayesglm” (in the arm package).

Here is my problem. Previously, with “glm”, I would compute p-values as follows (shown here with the “bayesglm” fit):

library(arm)
mylogit <- bayesglm(a ~ b + c + d + e + f + g + h, data = mydata, family = "binomial")
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))

There are 53-48=5 degrees of freedom:

Null deviance: 71.188  on 53  degrees of freedom
Residual deviance: 37.862  on 48  degrees of freedom

However, if I use “bayesglm” instead of “glm”, the resulting degrees of freedom are a bit surprising to me:

Null deviance: 22.279  on 53  degrees of freedom
Residual deviance: 39.030  on 54  degrees of freedom

If I plug these into the preceding formula for a p-value, I have 53 - 54 = -1 degrees of freedom! Can someone help me get a more sensible answer (or help me interpret this one)?

By the way, the documentation on the “bayesglm” command includes the following ominous comment:

We include all the glm() arguments but we haven’t tested that all the options (e.g., offsets, contrasts, deviance for the null model) all work.

[1] Gelman, Andrew, et al. “A weakly informative default prior distribution for logistic and other regression models.” The Annals of Applied Statistics (2008): 1360-1383.
