Logistic glm: marginal predicted probabilities and log-odds coefficients support different hypotheses

I’m running a fixed effects logistic regression in R. The model consists of a binary outcome and two binary predictors, with no interaction term. On the log-odds scale, and as an odds-ratio, the coefficient for one of the predictors (carbf in the mocked-up example below) indicates that the expected probability of Y=1 (“success”) is different between the two levels of the factor (i.e., the effect is significant).

When I use the effects package to get marginal predicted probabilities, the 95% CIs for the two levels of carbf overlap considerably, indicating there is no evidence of a difference in the expected probability of Y=1 between the two factor levels.

When I use the mfx package to get average marginal effects for the coefficients (i.e., for the expected difference in the probability of Y=1 between the two factor levels), I do get a significant difference.

I’m confused as to whether this discrepancy arises because:

1) the output from the model and from the mfx package is an expected difference in the probability of Y=1 between the factor levels, rather than a predicted probability for each level, or

2) the effects package is calculating the marginal effect in a different way.

In an effort to determine this, I modified the source code from the mfx package to give me the average predicted probability for each level of the carbf factor. The 95% CIs for these predictions do not overlap, indicating a significant difference. This makes me wonder why I get such different results using the effects package. Or is it that I’m just confused about the difference between marginal effects for coefficients and predicted probabilities?

# packages
library(effects)
library(mfx)
library(ggplot2)

# data
carsdat <- mtcars
carsdat$carb <- ifelse(carsdat$carb %in% 1:3, 0, 1)
facvars <- c("vs", "am", "carb")
carsdat[, paste0(facvars, "f")] <- lapply(carsdat[, facvars], factor)

# model
m1 <- glm(vsf ~ amf + carbf, 
    family = binomial(link = "logit"), 
    data = carsdat)

# effects package
eff <- allEffects(m1)
plot(eff, rescale.axis = FALSE)
eff_df <- data.frame(eff[["carbf"]])

#   carbf   fit    se  lower upper
# 1     0 0.607 0.469 0.3808 0.795
# 2     1 0.156 0.797 0.0375 0.469

# mfx package marginal effects (at mean)
mfx1 <- logitmfx(vsf ~ amf + carbf, data = carsdat, atmean = TRUE, robust = FALSE)

#         dF/dx Std. Err.     z  P>|z|
# amf1    0.217     0.197  1.10 0.2697
# carbf1 -0.450     0.155 -2.91 0.0037

# mfx package marginal effects (averaged)
mfx2 <- logitmfx(vsf ~ amf + carbf, data = carsdat, atmean = FALSE, robust = FALSE)

#         dF/dx Std. Err.     z  P>|z|
# amf1    0.177     0.158  1.12 0.2623
# carbf1 -0.436     0.150 -2.90 0.0037

# mfx source code
fit <- m1
x1 <- model.matrix(fit)
be <- as.matrix(na.omit(coef(fit)))
k1 <- length(na.omit(coef(fit)))
fxb <- mean(plogis(x1 %*% be) * (1 - plogis(x1 %*% be)))  # average logistic density (the scale factor mfx uses for marginal effects)
vcv <- vcov(fit)

# data frame for predictions
mfx_pred <- data.frame(mfx = rep(NA, 4), se = rep(NA, 4), 
    row.names = c("amf0", "amf1", "carbf0", "carbf1"))
disc <- rownames(mfx_pred)

# hard coded prediction estimates and SE  
disx0c <- disx1c <- disx0a <- disx1a <- x1 
disx1a[, "amf1"] <- max(x1[, "amf1"]) 
disx0a[, "amf1"] <- min(x1[, "amf1"]) 
disx1c[, "carbf1"] <- max(x1[, "carbf1"]) 
disx0c[, "carbf1"] <- min(x1[, "carbf1"])
mfx_pred["amf0", 1] <- mean(plogis(disx0a %*% be))
mfx_pred["amf1", 1] <- mean(plogis(disx1a %*% be))
mfx_pred["carbf0", 1] <- mean(plogis(disx0c %*% be))
mfx_pred["carbf1", 1] <- mean(plogis(disx1c %*% be))
# standard errors
gr0a <- as.numeric(dlogis(disx0a %*% be)) * disx0a
gr1a <- as.numeric(dlogis(disx1a %*% be)) * disx1a
gr0c <- as.numeric(dlogis(disx0c %*% be)) * disx0c
gr1c <- as.numeric(dlogis(disx1c %*% be)) * disx1c
avegr0a <- as.matrix(colMeans(gr0a))
avegr1a <- as.matrix(colMeans(gr1a))
avegr0c <- as.matrix(colMeans(gr0c))
avegr1c <- as.matrix(colMeans(gr1c))
mfx_pred["amf0", 2] <- sqrt(t(avegr0a) %*% vcv %*% avegr0a)
mfx_pred["amf1", 2] <- sqrt(t(avegr1a) %*% vcv %*% avegr1a)
mfx_pred["carbf0", 2] <- sqrt(t(avegr0c) %*% vcv %*% avegr0c)
mfx_pred["carbf1", 2] <- sqrt(t(avegr1c) %*% vcv %*% avegr1c)  

mfx_pred$pred <- rownames(mfx_pred)
mfx_pred$lcl <- mfx_pred$mfx - (mfx_pred$se * 1.96)
mfx_pred$ucl <- mfx_pred$mfx + (mfx_pred$se * 1.96)

#          mfx    se   pred     lcl   ucl
# amf0   0.366 0.101   amf0  0.1682 0.563
# amf1   0.543 0.122   amf1  0.3041 0.782
# carbf0 0.601 0.107 carbf0  0.3916 0.811
# carbf1 0.165 0.105 carbf1 -0.0412 0.372

ggplot(mfx_pred, aes(x = pred, y = mfx)) +
    geom_point() +
    geom_errorbar(aes(ymin = lcl, ymax = ucl))
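
As a cross-check, I believe (though I may be wrong) the same averaged predictions can be obtained with predict(), by setting carbf to each level for every row and averaging over the sample:

# sketch: average predicted probability at each level of carbf, holding amf at its observed values
nd0 <- nd1 <- carsdat
nd0$carbf <- factor(rep("0", nrow(carsdat)), levels = levels(carsdat$carbf))
nd1$carbf <- factor(rep("1", nrow(carsdat)), levels = levels(carsdat$carbf))
mean(predict(m1, newdata = nd0, type = "response"))  # should match mfx_pred["carbf0", "mfx"]
mean(predict(m1, newdata = nd1, type = "response"))  # should match mfx_pred["carbf1", "mfx"]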

Multilevel Modeling in Stata

I would like to make a model that calculates the probability of disease.
The ranges of the variables are as follows:
disease ~ (0, 1); score ~ (1-10); test ~ (0-30)
Large values of test and score indicate that disease is present.

id   disease    score   test index_score  index_test
1    0           2       10      1            1
1    0           2       20      1            2
1    0           6       10      2            1
1    0           6       20      2            2 
2    1           9       27      1            1
2    1           9       29      1            2
3    0           3       12      1            1
3    0           4       10      2            2
3    0           3       10      3            2
3    0           2       10      4            2

The index variables indicate how many times test or score has been measured.
I am trying to model this with a logit model, with disease as the response and score and test as the predictors.

I want to get an updated probability of disease with respect to test and score. What would be the best model for these unbalanced data (score is measured frequently)?
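
For concreteness, the kind of structure I have in mind (sketched here in R for illustration, although I will fit it in Stata) is a random-intercept logistic model with the repeated measurements nested within id:

# assumes a long-format data frame `dat` with the columns shown above
library(lme4)
fit <- glmer(disease ~ score + test + (1 | id),
             family = binomial(link = "logit"), data = dat)
summary(fit)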

Banding variables for initial GLM fit, then replacing with continuous versions. Any pitfalls?

I am aware of some of the pitfalls of using banded versions of continuous variables when fitting a GLM (or other model); there are a number of discussions on the subject on this forum (and an excellent list here: http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous).

However, I’m interested to know what the costs or benefits may be of developing a model that uses only banded (binned) versions of all continuous variables, with associated weighted mean bin values, and then adding the selected continuous variables later on the basis of the contribution made by their banded versions.

I am asking as I am currently working in general insurance in the UK where a package that does just this is used extensively (it’s called Emblem and is part of a ‘pricing optimisation’ suite made by a financial services company). It does not accept any variables with more than 255 levels and so all continuous predictors have to be banded. The processed data files produced are very much smaller than their initial R or SAS counterparts and are then processed with an internal ‘R engine’.

The package allows models to be developed iteratively, adding and removing variables and interactions manually (often on the basis of expert knowledge, regulatory requirements etc.), re-fitting the model and then making a judgement on whether to keep them on the basis of graphs for each individual predictor and comparison with up to five saved ‘reference models’. Fitting times are very fast – a fraction of the time taken for the unbanded R or SAS equivalent.

All of this may be fine for a jobbing practitioner in the industry, but it got me wondering whether this approach might be appropriate for something more rigorous. What would be the pitfalls of building a model in this way and then, for a final evaluation, replacing the banded versions with their original continuous values?
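
For concreteness, here is a minimal sketch of the workflow I mean, using made-up data and variable names (the pricing package itself works differently, of course):

set.seed(1)
dat <- data.frame(x = rexp(1000, rate = 0.1))
dat$y <- rpois(1000, exp(0.5 + 0.02 * dat$x))

# band the continuous predictor into deciles and record the within-band mean
dat$x_band <- cut(dat$x, breaks = quantile(dat$x, probs = seq(0, 1, 0.1)),
                  include.lowest = TRUE)
dat$x_bandmean <- ave(dat$x, dat$x_band, FUN = mean)

# develop the model using the banded version ...
fit_banded <- glm(y ~ x_band, family = poisson, data = dat)

# ... then swap in the original continuous variable for a final evaluation
fit_cont <- glm(y ~ x, family = poisson, data = dat)
AIC(fit_banded, fit_cont)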

Adding a square root link function to an overdispersed negative binomial GLM

I’m analyzing nematode count data (80 data points) from a randomized block design with two factors, each with four levels (Plant and Inoc). The data show heavy overdispersion when analyzed with a Poisson GLM. For one of the plant levels the counts are very low (below 100), while for others they almost reach 100,000. I therefore analyzed the data with a negative binomial distribution (glm.nb in the R MASS package), which leaves limited overdispersion of ~1.5 (residual deviance 89.12 on 60 df). The predicted-versus-residual plots look fine, but I’m still worried about the amount of overdispersion.

I thought it might be worth changing the link function from log to sqrt to see whether that helps. To do this you have to give starting values for your coefficients, like:

model <- glm.nb(Nem_gramroot ~ Block + Plant + Plant*Inoc, data = data,
    link = sqrt, start = c(10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10))

I’ve tried putting in 0s, 1s, 10s, and the coefficients of the model based on the log link function. However, I always get the message "step size truncated: out of bounds". Any ideas on this? And moreover, any thoughts on using sqrt link functions to reduce overdispersion in negative binomial GLMs?
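
One thing I am considering (a sketch only; I have not verified that it converges with these data) is to take the starting values from a Poisson fit with the same sqrt link rather than guessing them:

library(MASS)

# Poisson fit with a sqrt link, used only to generate plausible starting values
start_fit <- glm(Nem_gramroot ~ Block + Plant + Plant*Inoc,
                 family = poisson(link = "sqrt"), data = data)

# pass those coefficients to glm.nb as starting values
model_sqrt <- glm.nb(Nem_gramroot ~ Block + Plant + Plant*Inoc,
                     link = sqrt, data = data, start = coef(start_fit))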

Probability that LM with lesser RSS has greater residual for individual i (or opposite sign)?

You have fitted a basic linear Model #1 (i.e., GLM with identity-link) based on observed data with residuals:

Model 1: $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + R_i$

A colleague claims to have a basic linear Model #2 which exhibits a better fit with $y$, i.e., “the residual sum of squares is significantly lower” in Model #2.
Model 2: $y_i = \beta'_0 + \beta'_1 x_{1i} + \beta'_2 x_{2i} + \beta'_3 x_{3i} + \dots + R'_i$

Before they reveal the model, they want you to estimate:

  • (1) the probability, for each observation $y_i$, that $R'_i$ has the same sign as $R_i$;
  • (2) the probability, for each observation $y_i$, that $R'_i$ has the same sign as $R_i$ and is larger in absolute value;
  • (3) the probability, for each observation $y_i$, that $y_{i(Fitted)} - \bar{y}$ has the same sign in both models.

For example, Model #1:

  • $y_1 = 100$; $\bar{y} = 50$
  • $y_{1(Fitted)} = 60$
  • $y_{1(Residual)} = +40$
  • By definition, Model #2's RSS is less than Model #1's RSS.
  • You and your colleague use the same $y$.
  • The mean of the residuals for both models should be 0.
  • You know the RSS for your model and for the null model.
  • You know the sd of $R_1$ among your residuals $R$.
  • You know the sd of $y_1$ among $y$.

(1) Can we estimate the probability that $y_{1(Fitted)}$ from Model #2 will be less than 100?

(2) Can we estimate the probability that $y_{1(Fitted)}$ from Model #2 will be less than 60?

(3) Can we estimate the probability that $y_{1(Fitted)}$ from Model #2 will be less than the mean of 50?

(4) Is our colleague asking impossible nonsense?

  • Advanced extension: What if Model #1 and Model #2 (with “greater percent deviance explained”) were both GAMs instead of LMs, i.e., generalized additive models with an identity link?

Assumptions behind multinomial logistic regression

What are the proper assumptions behind multinomial logistic regression? What are the best ways to check these assumptions in any statistical software? And what other models are suitable if those assumptions are not satisfied?

Higher Moments from Factor Models

Suppose that we are fitting a linear factor model to our data, where we assume the factors, $f$, are orthogonal. Using this structure we can estimate the mean vector and the covariance matrix
$$\Sigma = BB^T + D,$$
where $D$ is the diagonal matrix of error variances. So here is my question: using the factor model structure, how can we compute the coskewness and cokurtosis tensors?

Is it valid to use a generalized model with no replicates

I’ve got a designed experiment at a long-term agricultural station. The field is divided into 11 rows and 6 columns (each column has two sub-columns). Each resulting plot represents a unique combination of certain fertilizers and plants, and each combination is present in two versions: with and without liming. The plots are not replicated. I’ve got 16S high-throughput sequencing output, i.e. taxonomic compositional data on the soil prokaryotic communities. I basically want to build a model of some kind to test the significance of the factors and their interactions. I’ve been thinking about a generalized linear model with a Poisson distribution, but I’m not sure whether there is a valid way to analyze my data given that there are no replicates in the experiment.

Edit: adding the experiment design table (attached as an image).

Testing the difference between two parameter estimates in binomial GLM

There are some related posts on this issue, but I could not find any answers that actually demonstrate the mechanics of how to accomplish the task. I want to compare two parameter estimates in a binomial GLM (but I expect the answer to this question will work for any GLM).

My model is of the form:

y ~ b0 + b1x1 + b2x2 + e(binom)

My question: How do you test for a difference between b1 and b2?

Running this model using glm() in R provides parameter estimates and standard errors for all b. So comparing them should be easy, but I’m just having a hard time figuring out exactly what to do.

Here are some example data to work with:

y <- rbinom(n = 100, size = 1, prob = 0.5) # binomial response variable
x1a <- rnorm(n = 100, mean = 2, sd = 3)
x1b <- rnorm(n = 100, mean = 0, sd = 3)
x1 <- ifelse(y == 0, x1a, x1b) # negative relationship between y and x1
x2a <- rnorm(n = 100, mean = 2, sd = 5)
x2b <- rnorm(n = 100, mean = 15, sd = 5)
x2 <- ifelse(y == 0, x2a, x2b) # a strong, positive relationship between y and x2

Run this model and return relevant output:

an1 <- glm(y ~ x1 + x2, family=binomial)
s_an1 <- summary(an1)

    paste("residual df = ", an1$df.residual)
paste("null.df = ", an1$df.null)

And get these estimates:

              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -4.2711210  1.0451565 -4.086585 4.377686e-05
x1          -0.1638442  0.1433595 -1.142891 2.530841e-01
x2           0.6130596  0.1378431  4.447519 8.686793e-06

residual df = 97
null df = 99

Notice that x2 is a significant, positive predictor of y, and x1 is a non-significant, negative predictor of y. As we all know, however, this is not good evidence that the parameter estimates are different from each other.

(How) Can I use the standard errors to compare the estimates?

I’ll point out that there appears to be a lot of information on the web about how to calculate confidence intervals, which are nice, but I want:

1) a parameter estimate for the difference (yes, I know — it’s just the difference between them),

2) a standard error (probably sqrt(se.x1^2 + se.x2^2)),

3) a t-value (or equivalent distribution statistic), and

4) a p-value.

In my real-world scenario, x1 and x2 are measured on the same scale, but don’t have the same mean or standard deviations (as in the example data above — see sd arguments to random number generators). Is standardization of some kind necessary in this case?
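
For concreteness, here is the calculation I think items 1)–4) amount to, sketched with the coefficient covariance matrix from the model above (please correct me if this is wrong):

b <- coef(an1)
V <- vcov(an1)

diff_est <- b["x1"] - b["x2"]                                        # 1) estimate of the difference
diff_se  <- sqrt(V["x1", "x1"] + V["x2", "x2"] - 2 * V["x1", "x2"])  # 2) its standard error
z_val    <- diff_est / diff_se                                       # 3) Wald z statistic
p_val    <- 2 * pnorm(-abs(z_val))                                   # 4) two-sided p-value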

Help much appreciated, cheers!

Can I Use Distance Matrix (Weighted) For GLM Fit (Using R)

I am new to statistics, but I am a business user who likes to base decisions on prediction models, and I am relatively new to regression.

I have data in the format below (the values are imaginary). The first block of columns is a distance matrix.

  A B C D E   N(No of observations)   Y(Success, Failure)
A 0 2 3 2 3   300                     (50,250)
B 2 0 1 4 5   240                     (20,220)
C 3 1 0 3 1   110                     (10,100)
D 2 4 3 0 2   550                     (50,500) 
E 3 5 1 2 0   650                     (60,590)

In my case, the Y (success, failure) of, say, “A” depends upon its relation with B, C, D and E; that of “B” depends upon its relation with A, C, D and E; and so on.

I actually want to model Y (success, failure) based on A + B + C + D + E.

I am not interested in the impact (coefficient) of individual predictors but in the predictive capacity of the system as a whole. (Hence I guess collinearity among the predictors is not a problem. Am I right?)

Using R, I am thinking of using a weighted GLM fit with family = binomial.

Am I right in my approach, both in using the distance matrix and in using a GLM fit for my case?
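
For concreteness, this is roughly the call I have in mind (a sketch using the toy numbers above; the column names dA–dE are made up):

# one row per point, with its distances to A..E and its success/failure counts
dat <- data.frame(
    dA = c(0, 2, 3, 2, 3),
    dB = c(2, 0, 1, 4, 5),
    dC = c(3, 1, 0, 3, 1),
    dD = c(2, 4, 3, 0, 2),
    dE = c(3, 5, 1, 2, 0),
    success = c(50, 20, 10, 50, 60),
    failure = c(250, 220, 100, 500, 590)
)

# a two-column (success, failure) response already weights each row by its N
fit <- glm(cbind(success, failure) ~ dA + dB + dC + dD + dE,
           family = binomial, data = dat)
summary(fit)
# note: with only five rows and six coefficients, some estimates will be NA;
# this only illustrates the call structure, not a model I would rely on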
