Recalculate log-likelihood from a simple R lm model

I’m simply trying to recalculate with dnorm() the log-likelihood provided by the logLik function from a lm model (in R).

It works (almost perfectly) for high number of data (eg n=1000) :

> n <- 1000
> x <- 1:n
> set.seed(1)
> y <- 10 + 2*x + rnorm(n, 0, 2)
> mod <- glm(y ~ x, family = gaussian)
> logLik(mod)
'log Lik.' -2145.562 (df=3)
> sigma <- sqrt(summary(mod)$dispersion)
> sum(log(dnorm(x = y, mean = predict(mod), sd = sigma)))
[1] -2145.563
> sum(log(dnorm(x = resid(mod), mean = 0, sd = sigma)))
[1] -2145.563

but for small datasets there are clear differences :

> n <- 5
> x <- 1:n
> set.seed(1)
> y <- 10 + 2*x + rnorm(n, 0, 2)
> mod <- glm(y ~ x, family = gaussian)
> logLik(mod)
'log Lik.' -8.915768 (df=3)
> sigma <- sqrt(summary(mod)$dispersion)
> sum(log(dnorm(x = y, mean = predict(mod), sd = sigma)))
[1] -9.192832
> sum(log(dnorm(x = resid(mod), mean = 0, sd = sigma)))
[1] -9.192832

Because of small dataset effect I thought it could be due to the differences in residual variance estimates between lm and glm but using lm provides the same result as glm :

> modlm <- lm(y ~ x)
> logLik(modlm)
'log Lik.' -8.915768 (df=3)
> sigma <- summary(modlm)$sigma
> sum(log(dnorm(x = y, mean = predict(modlm), sd = sigma)))
[1] -9.192832
> sum(log(dnorm(x = resid(modlm), mean = 0, sd = sigma)))
[1] -9.192832

Where am I wrong ?

How to report significance of factor levels shown in summary of glm?

The summary of my GLM shows day 12 of factor Date to be significant, but anova(model, test=”Chisq”) shows Date to be not significant overall. I know how to report the statistics from the Chisq table, but as I have z values in the summary table I am unsure how to report, or if I should report, that day 12 is significant.

Similarly when finding that Date is significant, how to report what specific dates seem to be important.

Thanks in advance


I have Fertility and Fecundity of female flies measured over 12 days. I want to check if there is a decline (or otherwise) in fertility/fecundity over this time.

Additionally, at day 13 females are mated (with a male from one of two different groups), and fertility and fecundity are measured until day 20. I want to use Date as a factor to identify significant peaks in fertility/fecundity i.e. after mating, and potential difference in peaks between groups.

glm(formula = dda$Fertility.Absolute ~ sqrt(dda$Fecundity) + 
    dda$Group + dda$Date + sqrt(dda$Fecundity):dda$Group + sqrt(dda$Fecundity):dda$Date, 
    family = poisson)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1397  -0.6786  -0.4797   0.3596   3.7588  

                                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                    -1.91501    0.51539  -3.716 0.000203 ***
    sqrt(dda$Fecundity)             0.72372    0.12441   5.817 5.99e-09 ***
    dda$Group2                      0.19540    0.19230   1.016 0.309585    
    dda$Date4                       0.18117    0.62648   0.289 0.772439    
    dda$Date6                      -0.28952    0.68983  -0.420 0.674706    
    dda$Date8                       0.07111    0.60531   0.117 0.906480    
    dda$Date10                      0.19557    0.62232   0.314 0.753325    
    dda$Date12                      0.79619    0.60710   1.311 0.189696    
    dda$Date14                      1.93702    0.53938   3.591 0.000329 ***
    dda$Date16                      0.75623    0.58296   1.297 0.194554    
    dda$Date18                     -0.05392    0.67805  -0.080 0.936618    
    dda$Date20                     -0.26291    0.68841  -0.382 0.702530    
    sqrt(dda$Fecundity):dda$Group2 -0.07309    0.04822  -1.516 0.129583    
    sqrt(dda$Fecundity):dda$Date4   0.27388    0.17555   1.560 0.118734    
    sqrt(dda$Fecundity):dda$Date6   0.37684    0.22832   1.651 0.098836 .  
    sqrt(dda$Fecundity):dda$Date8   0.13017    0.13861   0.939 0.347674    
    sqrt(dda$Fecundity):dda$Date10  0.04552    0.15345   0.297 0.766722    
    sqrt(dda$Fecundity):dda$Date12 -0.16593    0.14861  -1.117 0.264186    
    sqrt(dda$Fecundity):dda$Date14 -0.24864    0.12754  -1.949 0.051240 .  
    sqrt(dda$Fecundity):dda$Date16  0.05496    0.14578   0.377 0.706170    
    sqrt(dda$Fecundity):dda$Date18  0.15439    0.19341   0.798 0.424715    
    sqrt(dda$Fecundity):dda$Date20 -0.02006    0.16314  -0.123 0.902161    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1327.36  on 359  degrees of freedom
Residual deviance:  298.26  on 338  degrees of freedom
AIC: 873.65

Number of Fisher Scoring iterations: 5


Analysis of Deviance Table    
Model: poisson, link: log    
Response: dda$Fertility.Absolute    
Terms added sequentially (first to last)    

                                  Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                            359    1327.36              
    sqrt(dda$Fecundity)            1   893.88       358     433.48 < 2.2e-16 ***
    dda$Group                      1     0.03       357     433.45    0.8699    
    dda$Date                       9    82.16       348     351.29  6.01e-14 ***
    sqrt(dda$Fecundity):dda$Group  1     0.07       347     351.22    0.7859    
    sqrt(dda$Fecundity):dda$Date   9    52.96       338     298.26  2.97e-08 ***

Inverted SPSS results: Logistic regression command vs. Genlin?

I want to do a logistic regression in SPSS. However, since I analyse unemployment spells the subjects are sometimes repeated (violating the independence assumption of the regression). One way of removing the within subject variation is by applying a Genlin model with the repeated subject subcommand (in essence a GEE model). Thus, I tried out a Genlin model with binomal probability and the logit link, comparing it to a standard logistic regression. I used the exact same variables in the two procedures.

However, the results that was delivered from the Genlin procedure was inverted relative to that of the logistic regression. For instance: Exp(B) for women (of the independent variable sex/gender) was just above 2.0 in logistic regression while being at 0.49 in Genlin. The same happened with every independent variable.

  • Any suggestions to why the results of the Genlin procedure is
  • Is there any way to get the Genlin results in accordance to the logistic regression?

Calculate pseudo-$R^2$ from R's zero-inflated negative binomial regression

I’m looking into calculating a Pseudo $R^2$ used McFadden’s method for a zero-inflated negative binomial regression. I’m unclear how to go about evaluating $hat L(M_{intercept})$ in R. Any suggestions for how this might be easily done?

R Code thus far:

> require(pscl)
> require(MASS)
> ZerInflNegBinRegress<-zeroinfl(y~.|x+z, data=DATASET, dist="negbin", EM=TRUE)

Which returns the Log Likelihood for the model using the summary function. It’s finding the Log Likelihood for an intercept only function that I’m unsure of.

Factors missing in summary and Anova output from glm in R

I’m running a GLM model in R, but when I add additional variables, the degrees of freedom (df) for two of the variables (level and type, see models below) goes from 1 to 0, and because of this there is no information about them in either the Anova table or the summary output. Interactions involving these two variables however, continue to have one degree of freedom. I can’t figure out why the DF goes to zero when I add additional covariates, or what I can do to address this and was hoping someone could point me in the right direction.

The code is below. To try to figure out what is going on, I have run three models:

  1. Model 1 with just a few factors (plus their interactions)
  2. Model 2 with all of the factors (but no interactions)
  3. Model 3 with all of the factors plus the interactions I want

The first model runs fine, with the two variables level and type having one df (and therefore output exists for these two variables in the Anova and summary outputs). But for the second and third models, the two variables level and type each have zero df (although each of their interactions with medium has one df).

modeloutput <- glm(model, data=dataset, family=binomial)
## formulas:
# Model 1:
model <- as.formula(success ~ 1 + medium + level + type + motivation + level*medium +
                   type*medium + motivation*medium)
# Model 2:
model <- as.formula(success ~ 1 + course + medium + level + type + motivation + 
                    ethnicity + gender + age + enrollment + Pell + TANF + GPA + 
# Model 3:
model <- as.formula(success ~ 1 + course + medium + level + type + motivation + 
                    ethnicity + gender + age + enrollment + Pell + TANF + GPA + 
                    onlineExp + level*medium + type*medium + motivation*medium)

I feel like I must be missing something obvious, but I’m just not seeing it. If anyone has any tips about what I can do to find out more information about why this is happening (and about what my options are in trying to address it), I would be very grateful for the help!

Edit to include sample Anova output:

                   LR Chisq Df  Pr(>Chisq)
course             339.1891235  13  1.45477E-64
medium             51.82448562  1   6.06902E-13
ethnicity          66.43311272  4   1.28372E-13
gender             9.557732159  1   0.001991089
age                8.182416401  1   0.004229838
enrollment         0.149034707  1   0.699459539
Pell               1.435801885  1   0.230819897
TANF               13.54535541  1   0.000232867
GPA                179.7209374  3   1.01328E-38
onlineExp          12.78010977  3   0.005137086
level                           0   
type                            0   
motivation         18.00342825  3   0.000439134
medium:level       5.328089832  1   0.020984371
medium:type        5.98948654   1   0.014391391
medium:motivation  6.747183841  3   0.080407666

For the summary output, it looks totally normal, except that none of the values for level or type appear in the list of coefficients.

When I can model count data as continuous

I have time series of several variables of 60 or so rows of count data. I want to do a regression model y ~ x. I’ve chosen to use a Quasipoisson & Negative Binomial GLMs as there’s overdispersion etc.

Min.   : 24000  
1st Qu.: 72000  
Median :117095  
Mean   :197607  
3rd Qu.:291388  
Max.   :607492  

Min.   : 136345
1st Qu.: 405239
Median : 468296
Mean   : 515937
3rd Qu.: 633089
Max.   :1218937

The data itself are very high and so it may be best to model these as count data (this is what I’m trying to investigate – at which point I can model count data as continuous). It seems to be very common practice, what I want to know is the motivation for this?

Are there any texts that actually show the problem of modelling high count data with Poisson distribution? Perhaps something that shows the factorial in the distribution makes things difficult.

Spatial autocorrelation (SAC) while analysing survey data

I am confused about some aspects of spatial autocorrelation usind survey data (survey which is repeated every year). I have data from 1991 to 2012 with sampling region pretty consistent every year. I am studying the species distribtion of two species in the area, in this goal I used binomial GLM. After running the GLM, I noted that they were some spatial autocorrelation in the model error residuals (Moran’s I index from 0.2 to 0.05 significant on 6 first lags). Therefore I decided to take in account the spatial autocorrelation using Moran’s Eingenvector Mapping. However this method is generally described for dataset over 1 year. As the eigenvector are calculated using the coordinates (longitude and latitude) of my points in function of the neighbourhood response I assumed it should take in account the different distribution over the years. Thus I applied a 10 factor on longitude and lagitude to artificially separate points at the same location but sampled at different years (point A year 1 will have (1,1) coordinate; point B year 2 will have (10,10) coordinate.

My first question is to know if this is valid? If not what would you advise?

The outcomes of this method is that I have different eigenvector value for a point at the same position but at different years.

Is this not a problem for the model including MEM variables power of prediction?

Indeed following a question not answered yet I posted here (Spatial autocorrelation — GLM, autocovariate, MEM (Moran's eigenvector mapping)), it appears that my GLM including MEM suffer a loss of 0.25 or more power of prediction (ROC Curve – AUC value). I am digging since one week to understand what is happening and why spatial autocorrelation included using MEM method are so unefficient while they seems to be the best method following litteratur.

Any clue or help will be really appreciated,

Thank you,

Xochitl C.

Negative binomial GLM with 2 factor variables: adding interaction completely changes effect of factor levels

I am emailing for a sanity-check please!

I am analysing some marine wildlife monitoring data from an offshore construction site. The response data are counts of animals (corrected for detection and survey effort), and the model has 2 covariates, both of which are factors:

  1. Season (with 4 levels related to the animals’ seasonal migrations),
  2. Period of Construction (with 3 levels: ‘Before’, ‘During’ and ‘After’ construction).

I am using a negative binomial GLM model structure because the Poisson was overdispersed. I am working in R, using the glm.nb function in the MASS package.

When I model the count data as a function of the 2 factor variables, but without any interactions between the two factors, the model indicates that there was a significantly negative impact on animal abundance ‘During’ construction (i.e. the coefficient estimate for animal counts was significantly lower ‘During’ construction when compared to ‘Before’ construction, which is the base level for the ‘Period of Construction’ factor variable).

However, when I include an interaction between ‘Season’ and ‘Construction Period’, the coefficient estimate for ‘During’ construction changes to be positive (although non-significant). I know that by including interactions in the model, I am changing the model structure and I would expect some changes to coefficient estimates; however I am surprised by the magnitude of the change that occurs by adding the interaction between ‘Season’ and ‘Period of Construction’.

I am obviously keen to make sure I haven’t misunderstood/made some mistake! Below I have copied my Model summary tables and the Anova table for the model containing interactions (to show the covariate main effects).

P.S. The data are not balanced (i.e. not every pairwise combination of ‘Season’ and ‘Construction Period’ were sampled). I know that this should preclude assessing interactions, but I have been instructed to assess them anyway! I am wondering whether this could be causing the unexpected results??


                                  Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                    1.78284    0.06203  28.741  < 2e-16 
    as.factor(Season)Migration     0.15741    0.05078   3.100 0.001935  
    as.factor(Season)Pre-Breeding  0.78840    0.07332  10.753  < 2e-16 
    as.factor(Season)Winter        0.57884    0.13065   4.430 9.41e-06 
    as.factor(Period)During       -0.37198    0.07126  -5.220 1.79e-07 
    as.factor(Period)After         0.19159    0.05621   3.409 0.000653 

    Coefficients: (3 not defined because of singularities)
                                               Estimate Std. Error z value Pr(>|z|)    
       (Intercept)                               1.3648     0.1255  10.878  < 2e-16 
       as.factor(Season)Migration                0.4515     0.1373   3.288  0.00101 
       as.factor(Season)Pre-Breeding             1.3620     0.1354  10.058  < 2e-16 
       as.factor(Season)Winter                   0.9969     0.1683   5.924 3.15e-09 
       as.factor(Period)During                   0.3294     0.1691   1.948  0.05139  
       as.factor(Period)After                    0.6119     0.1332   4.593 4.36e-06 
                as.factor(Period)During           -0.1849     0.2066  -0.895  0.37079    
                as.factor(Period)During           -1.5185     0.2004  -7.578 3.50e-14 
                as.factor(Period)During             NA         NA      NA       NA    
                as.factor(Period)After            -0.2978     0.1488  -2.001  0.04535   
                as.factor(Period)After              NA         NA      NA       NA    
                as.factor(Period)After              NA         NA      NA       NA    

    Analysis of Deviance Table (Type II tests)
     Response: RU_density
                                               LR   Chisq Df  Pr(>Chisq)    
       as.factor(Season)                    121.927    3     < 2.2e-16
       as.factor(Period)                     64.713    2     8.865e-15
       as.factor(Season):as.factor(Period)   90.975    3     < 2.2e-16    

Weighted generalized regression in BUGS, JAGS

In R we can “prior weight” a glm regression via the weights parameter. For example:

glm.D93 <- glm(counts ~ outcome + treatment, family = poisson(), weights=w)

How can this be accomplished in a JAGS or BUGS model?

I found some paper discussing this, but none of them provides an example. I’m interested mainly into Poisson and logistic regression examples.

Diagnostic plots for count regression

What diagnostic plots (and perhaps formal tests) do you find most informative for regressions where the outcome is a count variable?

I’m especially interested in Poisson and negative binomial models, as well as zero-inflated and hurdle counterparts of each. Most of the sources I’ve found simply plot the residuals vs. fitted values without discussion of what these plots “should” look like.

Wisdom and references greatly appreciated. The back story on why I’m asking this, if it’s relevant, is my other question.

Related discussions:

Question and Answer is proudly powered by WordPress.
Theme "The Fundamentals of Graphic Design" by Arjuna
Icons by FamFamFam