Is a mixed/random effects model required if a fixed effects model shows no pattern in the residuals?


I’ve got some data on discrete flood events (response variable: duration) on several rivers.

I’ve modelled the response variable by site and time using a Generalized Linear Model:

 mod <- glm(duration ~ site + time)

I’ve then looked at the residual plots and cannot find any evidence of temporal correlation in the residuals. Does this mean that this modelling approach is valid (and that a mixed/random effects model is unnecessary), or should one always use mixed/random effects models for time-series data?

Any links to reading references on this subject would be brilliant.
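For what it’s worth, one way to settle this empirically is to fit both specifications and compare them directly. A minimal sketch with lme4 on made-up data (the data frame `floods`, its sites, and the simulated effects are invented here, not the questioner’s rivers):

```r
# Sketch: fixed-effects fit vs. a river-level random-intercept fit
# on the same (simulated) data.
library(lme4)

set.seed(1)
floods <- data.frame(site = factor(rep(1:6, each = 20)),
                     time = rep(1:20, 6))
floods$duration <- 3 + 0.1 * floods$time +
  rep(rnorm(6, sd = 0.5), each = 20) + rnorm(120)

fixed <- lm(duration ~ site + time, data = floods)
mixed <- lmer(duration ~ time + (1 | site), data = floods, REML = FALSE)

AIC(fixed, mixed)  # rough comparison; note the fixed-effect structures differ
```

If the random-intercept model offers no improvement and the residuals are clean, the fixed-effects model is a defensible choice; clean residuals are evidence for, not proof of, adequacy.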

Parameters as a sort of weights


I’m after a model that would output weights that sum to 1, rather than the raw parameters themselves.

This is what I have done:

I have fitted 3 logit models, each with 3 independent variables (and y as the dependent variable):

logit1 <- glm(y ~ x + o + z, family = binomial)
logit2 <- glm(y ~ r + q + a, family = binomial)
logit3 <- glm(y ~ b + n + m, family = binomial)

I have fitted 3 separate models because fitting one model with all the variables would deem some coefficients insignificant (due to correlation among the predictors). Please note that the dependent variable y is binary (0/1).

Now, when I fit each model to my new data for prediction, I would like to do something like this:

pred <- w1 * predict(logit1, newdata, type = "response") +
        w2 * predict(logit2, newdata, type = "response") +
        w3 * predict(logit3, newdata, type = "response")

where w1, w2, w3 represent weights.

I then fitted another logit model, logit(y ~ logit1 + logit2 + logit3), i.e. regressing y on the three fitted probabilities, and calculated the weights as param1/(param1 + param2 + param3). The resulting model did not provide a reasonable fit. (This is very likely the wrong approach anyway.)

The second approach I have tried:

For the second approach, I sampled possible weights from 0.05 to 1 (23 combinations), multiplied the corresponding weights with the three models’ in-sample fitted values, and calculated the AUC (area under the ROC curve). The AUC, however, was reasonably good for nearly all weight combinations, with slightly higher values at two of them.

What model would suit this approach? Or how should I approach this modelling problem?
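The second approach amounts to a grid search over convex weights. As a minimal sketch (the vectors p1, p2, p3 stand in for the three models’ fitted probabilities and are simulated here), one can enumerate weight triples summing to 1 and score each by AUC, computed via the rank-sum identity so no extra packages are needed:

```r
# AUC via the Mann-Whitney rank-sum identity
auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
y  <- rbinom(200, 1, 0.4)         # toy outcome
p1 <- plogis(2 * y + rnorm(200))  # stand-ins for fitted(logit1), ...
p2 <- plogis(2 * y + rnorm(200))
p3 <- plogis(rnorm(200))          # a deliberately weak "model"

# all weight triples on a 0.05 grid with w1 + w2 + w3 = 1
grid <- expand.grid(w1 = seq(0, 1, 0.05), w2 = seq(0, 1, 0.05))
grid <- grid[grid$w1 + grid$w2 <= 1 + 1e-9, ]
grid$w3 <- pmax(0, 1 - grid$w1 - grid$w2)

grid$auc <- apply(grid, 1, function(w)
  auc(w[1] * p1 + w[2] * p2 + w[3] * p3, y))
grid[which.max(grid$auc), ]       # best-scoring weight combination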

Multicollinearity and Splines :: Is there a problem?


When using natural (i.e. restricted) cubic splines, the basis functions created are highly collinear, and when used in a regression they seem to produce very high variance inflation factor (VIF) statistics, signalling multicollinearity. If the model is intended purely for prediction, is this an issue? It seems it will always be the case, given how the spline basis is constructed.

Here is an example in R:

library(caret)
library(Hmisc)
library(car)
data(GermanCredit)

spl_mat<-rcspline.eval(GermanCredit$Amount,  nk=5, inclx=TRUE) #natural cubic splines with 5 knots

class<-ifelse(GermanCredit$Class=='Bad',1,0) #binary target variable
dat<-data.frame(cbind(spl_mat,class))

cor(spl_mat)

OUTPUT:

            x        V2        V3        V4
x   1.0000000 0.9386463 0.9270723 0.9109491
V2  0.9386463 1.0000000 0.9994380 0.9969515
V3  0.9270723 0.9994380 1.0000000 0.9989905
V4  0.9109491 0.9969515 0.9989905 1.0000000


mod<-glm(class~.,data=dat,family=binomial()) #model

vif(mod) #massively high

OUTPUT:
         x         V2         V3         V4 
   319.573 204655.833 415308.187  45042.675

UPDATE:

I reached out to Dr. Harrell, the author of the Hmisc package in R (among others), and he responded that as long as the algorithm (e.g. the logistic regression) converges, the standard errors have not exploded (as Maarten said below), and the model fits well, best shown on a test set, then there is no issue with this collinearity.

Further, he stated (and this appears on page 65 of his excellent Regression Modeling Strategies book) that collinearity between variables constructed in an algebraic fashion, like restricted cubic splines, is not an issue, as multicollinearity only matters when that collinearity changes from sample to sample.
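A quick way to check the “standard errors have not exploded” criterion is to look at the standard errors of the linear predictor rather than of the individual coefficients. A small synthetic sketch (data simulated here, not the GermanCredit example):

```r
library(Hmisc)

set.seed(1)
x <- runif(500, 0, 10)
y <- rbinom(500, 1, plogis(sin(x)))

spl <- rcspline.eval(x, nk = 5, inclx = TRUE)  # highly collinear basis columns
mod <- glm(y ~ spl, family = binomial())

# the fitted curve is well-determined even though coefficient VIFs are huge
pred <- predict(mod, type = "link", se.fit = TRUE)
summary(pred$se.fit)
```

Individual coefficients are unstable because the basis columns nearly duplicate one another, but the curve they jointly define, which is all that matters for prediction, is not.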

BIC vs. Out of sample performance


I have two statistical models. Model 1 uses a GLM approach while model 2 uses a time-series approach for fitting. I want to compare these two models.

Model 1 (the GLM) has better out-of-sample performance; model 2 has a better BIC. So based on out-of-sample performance I should pick model 1, and based on BIC I should pick model 2 as the preferred model.

I should add that in this context, and for the question I am trying to answer, both BIC and out-of-sample performance are important. How should I choose the best model in this case? Should I consider other criteria? Please let me know if you know any good references covering similar cases.

Spatial autocorrelation — GLM, autocovariate, MEM (Moran's eigenvector mapping)


I am currently working on distribution models for two marine species, and also on their overlap distribution. For this I use a binomial logistic regression model (GLM), with the response being, respectively, presence/absence of species 1, presence/absence of species 2, and presence/absence of the overlap. My question concerns my first results for the overlap response (coded 1 when both species are present and 0 when at least one species is absent). The explanatory variables are various environmental parameters (temperature, depth, etc.). After selecting a final model I examined spatial autocorrelation (SAC) and found it in the residuals using Moran’s I (the first 6 lags are significantly spatially correlated). I decided to apply two different methods: an autocovariate and Moran’s eigenvector mapping. I am using R (packages spdep, spacemakeR and packfor), the appendix of Dormann et al. (2007), “Methods to account for spatial autocorrelation in the analysis of species distributional data – a review”, and Borcard et al. (2011), Numerical Ecology with R, pp. 275-277.

My question relates to the results of these two methods (see the enclosed figure). It seems that my GLM and GLM+ac models are better than the GLM+MEM model (MEM: Moran’s eigenvector maps). Isn’t that weird? From the literature it appeared to me that MEM should be the best of the three methods. Do you have any clue why this can happen? Could a theoretical reason explain it, or should I investigate the way I apply the functions? I have already applied the MEM function using two different neighbourhood matrices (one leaving “empty” neighbours, the other leaving no 0-regions). Maybe I should add that my data are on a large scale (statistical rectangles of the North Sea) and that a factor has been added to spatially separate observations at identical positions but at different times (20 years covered).

Any help or clue would be welcome.
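For reference, a minimal sketch of the Moran’s I residual check with spdep (the coordinates and residuals are simulated here; in the real analysis they would come from the fitted GLM, and `k = 8` is an arbitrary choice):

```r
library(spdep)

set.seed(1)
coords <- cbind(runif(100), runif(100))  # stand-in sampling positions
resids <- rnorm(100)                     # stand-in for residuals(mod)

nb <- knn2nb(knearneigh(coords, k = 8))  # 8-nearest-neighbour graph
lw <- nb2listw(nb, style = "W")          # row-standardised spatial weights

moran.test(resids, lw)                   # Moran's I on the residuals
```

Since MEM results are sensitive to the neighbourhood definition, rerunning this check under each candidate weights matrix is one way to see whether the GLM+MEM ranking is a property of the method or of the chosen graph.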


Ordinal Regression


In my data, the response variable is ordinal and there are 16 predictor variables which are also ordinal. I plan to perform an ordinal regression. But before doing that I want to graphically show the relationship between the response and the predictor variables. In regular regression, where the variables are numerical, we can produce a scatterplot matrix for that purpose. I am wondering whether, for ordinal regression with ordinal predictors, there is any graph that can be used to visually display the relationship before the formal analysis. Another possibility is to calculate the association between the response and the ordinal predictors before the analysis. Any recommendations on that front?
Also, I plan to fit a proportional odds model using R. Can anyone recommend an R package or functions that can be used to test the proportionality assumption and also to run the analysis?
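One common route (a sketch with simulated data, not the questioner’s 16 predictors) is MASS::polr for the proportional-odds fit; for a quick visual of an ordinal-by-ordinal association, a mosaic plot of the cross-tabulation works in base graphics:

```r
library(MASS)

set.seed(1)
x1 <- factor(sample(1:4, 300, replace = TRUE), ordered = TRUE)
x2 <- factor(sample(1:4, 300, replace = TRUE), ordered = TRUE)
y  <- factor(sample(1:3, 300, replace = TRUE), ordered = TRUE)

fit <- polr(y ~ x1 + x2, Hess = TRUE)  # proportional-odds model
summary(fit)

# mosaic plot: visual check of the y-by-x1 association before modelling
plot(table(x1, y), main = "y by x1")
```

For testing the proportional-odds assumption, the brant package (Brant test on a polr fit) and the ordinal package’s clm machinery are the options I’d look at first.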

Recalculate log-likelihood from a simple R lm model


I’m simply trying to recalculate, with dnorm(), the log-likelihood reported by the logLik() function for an lm model in R.

It works (almost perfectly) for large samples (e.g. n = 1000):

> n <- 1000
> x <- 1:n
> set.seed(1)
> y <- 10 + 2*x + rnorm(n, 0, 2)
> mod <- glm(y ~ x, family = gaussian)
> logLik(mod)
'log Lik.' -2145.562 (df=3)
> sigma <- sqrt(summary(mod)$dispersion)
> sum(log(dnorm(x = y, mean = predict(mod), sd = sigma)))
[1] -2145.563
> sum(log(dnorm(x = resid(mod), mean = 0, sd = sigma)))
[1] -2145.563

but for small datasets there are clear differences:

> n <- 5
> x <- 1:n
> set.seed(1)
> y <- 10 + 2*x + rnorm(n, 0, 2)
> 
> mod <- glm(y ~ x, family = gaussian)
> logLik(mod)
'log Lik.' -8.915768 (df=3)
> sigma <- sqrt(summary(mod)$dispersion)
> sum(log(dnorm(x = y, mean = predict(mod), sd = sigma)))
[1] -9.192832
> sum(log(dnorm(x = resid(mod), mean = 0, sd = sigma)))
[1] -9.192832

Because of the small sample size, I thought the difference could be due to differing residual-variance estimates between lm and glm, but using lm gives the same result as glm:

> modlm <- lm(y ~ x)
> logLik(modlm)
'log Lik.' -8.915768 (df=3)
> 
> sigma <- summary(modlm)$sigma
> sum(log(dnorm(x = y, mean = predict(modlm), sd = sigma)))
[1] -9.192832
> sum(log(dnorm(x = resid(modlm), mean = 0, sd = sigma)))
[1] -9.192832

Where am I wrong?
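One difference worth checking: for a Gaussian lm/glm, logLik() plugs in the maximum-likelihood variance estimate RSS/n, whereas summary() reports the unbiased estimate RSS/(n − p). The two only agree asymptotically, which would explain why n = 1000 nearly matches while n = 5 does not. A sketch of the check, reusing the small example:

```r
# Recompute the log-likelihood with the ML variance estimate RSS/n
# instead of summary()'s unbiased RSS/(n - 2).
n <- 5
x <- 1:n
set.seed(1)
y   <- 10 + 2 * x + rnorm(n, 0, 2)
mod <- lm(y ~ x)

sigma_ml <- sqrt(sum(resid(mod)^2) / n)  # divide by n, not n - 2
ll <- sum(dnorm(y, mean = fitted(mod), sd = sigma_ml, log = TRUE))
c(ll, logLik(mod))                       # the two now agree
```

Algebraically, with sigma² = RSS/n the sum of log-densities collapses to −n/2 · (log(2π) + log(RSS/n) + 1), which is exactly the expression logLik() evaluates.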

How to report significance of factor levels shown in summary of glm?


The summary of my GLM shows day 12 of the factor Date to be significant, but anova(model, test="Chisq") shows Date not to be significant overall. I know how to report the statistics from the chi-squared table, but as I have z values in the summary table I am unsure how to report, or whether I should report, that day 12 is significant.

Similarly, when Date is found to be significant, how do I report which specific dates seem to be important?

Thanks in advance

Lara

I have the fertility and fecundity of female flies measured over 12 days. I want to check whether there is a decline (or otherwise) in fertility/fecundity over this time.

Additionally, at day 13 the females are mated (with a male from one of two different groups), and fertility and fecundity are measured until day 20. I want to use Date as a factor to identify significant peaks in fertility/fecundity, i.e. after mating, and potential differences in peaks between the groups.

Call:
glm(formula = dda$Fertility.Absolute ~ sqrt(dda$Fecundity) + 
    dda$Group + dda$Date + sqrt(dda$Fecundity):dda$Group + sqrt(dda$Fecundity):dda$Date, 
    family = poisson)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1397  -0.6786  -0.4797   0.3596   3.7588  

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                    -1.91501    0.51539  -3.716 0.000203 ***
    sqrt(dda$Fecundity)             0.72372    0.12441   5.817 5.99e-09 ***
    dda$Group2                      0.19540    0.19230   1.016 0.309585    
    dda$Date4                       0.18117    0.62648   0.289 0.772439    
    dda$Date6                      -0.28952    0.68983  -0.420 0.674706    
    dda$Date8                       0.07111    0.60531   0.117 0.906480    
    dda$Date10                      0.19557    0.62232   0.314 0.753325    
    dda$Date12                      0.79619    0.60710   1.311 0.189696    
    dda$Date14                      1.93702    0.53938   3.591 0.000329 ***
    dda$Date16                      0.75623    0.58296   1.297 0.194554    
    dda$Date18                     -0.05392    0.67805  -0.080 0.936618    
    dda$Date20                     -0.26291    0.68841  -0.382 0.702530    
    sqrt(dda$Fecundity):dda$Group2 -0.07309    0.04822  -1.516 0.129583    
    sqrt(dda$Fecundity):dda$Date4   0.27388    0.17555   1.560 0.118734    
    sqrt(dda$Fecundity):dda$Date6   0.37684    0.22832   1.651 0.098836 .  
    sqrt(dda$Fecundity):dda$Date8   0.13017    0.13861   0.939 0.347674    
    sqrt(dda$Fecundity):dda$Date10  0.04552    0.15345   0.297 0.766722    
    sqrt(dda$Fecundity):dda$Date12 -0.16593    0.14861  -1.117 0.264186    
    sqrt(dda$Fecundity):dda$Date14 -0.24864    0.12754  -1.949 0.051240 .  
    sqrt(dda$Fecundity):dda$Date16  0.05496    0.14578   0.377 0.706170    
    sqrt(dda$Fecundity):dda$Date18  0.15439    0.19341   0.798 0.424715    
    sqrt(dda$Fecundity):dda$Date20 -0.02006    0.16314  -0.123 0.902161    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1327.36  on 359  degrees of freedom
Residual deviance:  298.26  on 338  degrees of freedom
AIC: 873.65

Number of Fisher Scoring iterations: 5

###

Analysis of Deviance Table    
Model: poisson, link: log    
Response: dda$Fertility.Absolute    
Terms added sequentially (first to last)    

                                  Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                            359    1327.36              
    sqrt(dda$Fecundity)            1   893.88       358     433.48 < 2.2e-16 ***
    dda$Group                      1     0.03       357     433.45    0.8699    
    dda$Date                       9    82.16       348     351.29  6.01e-14 ***
    sqrt(dda$Fecundity):dda$Group  1     0.07       347     351.22    0.7859    
    sqrt(dda$Fecundity):dda$Date   9    52.96       338     298.26  2.97e-08 ***
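On reporting specific dates: the per-level z tests in summary() only compare each date with the baseline level, so they are not overall tests of “day 12 matters”. One option (a sketch on made-up data, not the fly dataset) is Tukey-adjusted all-pairwise date contrasts via the multcomp package:

```r
library(multcomp)

# toy Poisson data with a Date factor (invented for illustration)
set.seed(1)
dat <- data.frame(Date  = factor(rep(c(2, 4, 6), each = 30)),
                  count = rpois(90, lambda = rep(c(2, 2, 5), each = 30)))

mod <- glm(count ~ Date, family = poisson, data = dat)

# all pairwise comparisons between dates, with multiplicity adjustment
cmp <- summary(glht(mod, linfct = mcp(Date = "Tukey")))
cmp
```

These adjusted contrasts give defensible “date A differs from date B” statements, whereas the raw summary() z tests are only contrasts against the reference date.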

Inverted SPSS results: Logistic regression command vs. Genlin?


I want to do a logistic regression in SPSS. However, since I analyse unemployment spells, subjects are sometimes repeated (violating the independence assumption of the regression). One way of removing the within-subject variation is to apply a Genlin model with the repeated-subjects subcommand (in essence a GEE model). So I tried a Genlin model with the binomial probability distribution and the logit link, comparing it to a standard logistic regression. I used exactly the same variables in the two procedures.

However, the results delivered by the Genlin procedure were inverted relative to those of the logistic regression. For instance, Exp(B) for women (on the independent variable sex/gender) was just above 2.0 in the logistic regression but 0.49 in Genlin. The same happened with every independent variable.

  • Any suggestions as to why the results of the Genlin procedure are inverted?
  • Is there any way to get the Genlin results to agree with the logistic regression?

Calculate pseudo-$R^2$ from R's zero-inflated negative binomial regression


I’m looking into calculating a pseudo-$R^2$ using McFadden’s method for a zero-inflated negative binomial regression. I’m unclear how to go about evaluating $\hat{L}(M_{intercept})$ in R. Any suggestions for how this might be easily done?

R Code thus far:

> require(pscl)
> require(MASS)
> ZerInflNegBinRegress<-zeroinfl(y~.|x+z, data=DATASET, dist="negbin", EM=TRUE)

This returns the log-likelihood for the fitted model via the summary function. It’s finding the log-likelihood for an intercept-only model that I’m unsure about.
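McFadden’s measure is $1 - \hat{L}(M_{full}) / \hat{L}(M_{intercept})$ on the log-likelihood scale, and the intercept-only fit can be obtained by giving zeroinfl() intercept-only formulas in both components, y ~ 1 | 1. A sketch on simulated data (x and z here are made-up predictors, not the questioner’s DATASET):

```r
library(pscl)

# simulated zero-inflated negative binomial data
set.seed(1)
x <- rnorm(500); z <- rnorm(500)
y <- ifelse(rbinom(500, 1, plogis(-1 + z)) == 1, 0,
            rnbinom(500, size = 1, mu = exp(1 + x)))

full <- zeroinfl(y ~ x | z, dist = "negbin")
null <- zeroinfl(y ~ 1 | 1, dist = "negbin")  # intercepts only, both parts

mcfadden <- 1 - as.numeric(logLik(full)) / as.numeric(logLik(null))
mcfadden
```

Fitting the null model on the same observations as the full model is important; if any rows are dropped for missing predictors, the two log-likelihoods are not comparable.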
