Is my nested random-effect model non-hierarchical?


I have a problem with model structure because my factors are nested in a potentially non-hierarchical way. I'm not sure I fully understand the issue, but I can't find a way to specify the model that Minitab or R will process without error.

The design is fairly simple. I have one response variable, Reject, measured under two levels of a within-subjects factor Equal and two levels of a between-subjects factor Cost. I also want to include Gender as a factor. I account for the within-subjects design by including Individual as a random factor.

I think the problem is that Individual is nested under both Gender (because each individual has only one gender) and Cost (because that is a between-subjects factor), but neither Cost nor Gender is nested in the other. So perhaps any valid model is non-hierarchical (that is, in any case, Minitab's complaint).

Has anyone got any tips on how to construct a valid model, implementable in R or Minitab, that tests the significance of Equal, Cost, and Gender?
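For what it's worth, here is the kind of R specification I have in mind: a mixed model with a random intercept per individual, which sidesteps explicit nesting syntax. This is only a sketch, assuming a long-format data frame d with the column names above (if Reject is binary, I would swap lmer() for glmer(..., family = binomial)):

library(lme4)

# Nesting of Individual under Cost and Gender is implicit as long as
# each individual has a unique ID across the whole dataset.
fit <- lmer(Reject ~ Equal * Cost * Gender + (1 | Individual), data = d)
summary(fit)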

confidence interval on expected number of cases in a set given a model


I have a set of units, where unit $i$ has:

  • a binary outcome $y_i$ (“case” or “non-case”)
  • a binary vector of treatments $\mathbf{t}_i$, where $t_{i,j}=1$ if unit $i$ received treatment $j$.
  • several hundred confounding covariates $\mathbf{x}_i$, some real-valued and some binary.

I expect that most treatments have no association with the outcome. I am trying to spot the outlier treatments that do have an effect, and estimate how many cases that effect was responsible for. For each treatment, I’d like to say something like:

Supposing there were no association between this treatment and the outcome, here is the number of cases (and a confidence interval) we would have expected to see for this treatment, predicted based on the covariates for the set of units that received that treatment.

Is there a standard way for doing this?

I fit a logistic regression model $m(\mathbf{x})$ that gives $E[Y]$ for a given $\mathbf{x}_i$. That is, it predicts the outcome given the covariates.

Then I can look at:

$$c_j=\text{actual number of cases with treatment } j = \sum_i t_{i,j}\, y_i$$
$$h_j=\text{predicted number of cases with treatment } j = \sum_i t_{i,j}\, m(\mathbf{x}_i)$$

and compare the two.
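As a concrete sketch, these counts are a couple of matrix operations in R; the object names are assumptions (trt is the n × J 0/1 treatment matrix $\mathbf{t}$, y the outcome vector, X the covariate data frame, m the fitted model):

# m(x_i) for each unit: predicted probability of being a case
p_hat <- predict(m, newdata = X, type = "response")

c_obs <- colSums(trt * y)      # c_j: actual cases among units with treatment j
h_exp <- colSums(trt * p_hat)  # h_j: expected cases if treatment has no effect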

These quantities are easily calculable, but I'm unsure how to test the significance of the difference (call it $q_k$, where $k$ is the number of units with the treatment), since I don't know the distribution of $Q_k$ under the null hypothesis.

I want $Q_k$ to be the distribution one would get by doing the following: sample $k$ pairs $(\mathbf{x}_i, y_i)$ according to the process that generated my data (or uniformly from the data itself), and compute the difference between the predicted and actual number of cases. Intuitively, this distribution should get tighter as $k$ increases and as the model gets more accurate. That is, I don't want to just assume that my model is accurate (and model $Q_k$ as something like a Poisson binomial); rather, I want the distribution of $Q_k$ to take into account how well my model fits the data.
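Under the "uniformly from the data itself" reading, $Q_k$ can be approximated by Monte Carlo; a hedged sketch, reusing the assumed trt, p_hat, and y objects from above:

# Resample k units with replacement and compare the model-expected
# count of cases to the actual count.
n <- length(y)
k <- sum(trt[, j])  # number of units that received treatment j
Q_sim <- replicate(10000, {
  idx <- sample(n, k, replace = TRUE)
  sum(p_hat[idx]) - sum(y[idx])
})
quantile(Q_sim, c(0.025, 0.975))  # approximate null interval for q_k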

Any pointers or ideas would be very appreciated!

EDIT

Thanks @Bill for the link! My understanding of that standardization approach is that their confidence interval takes into account the variance of the model (i.e. the uncertainty in the logistic regression coefficients), but not the performance of the model. They suggest separately testing the goodness of fit of the model. I am hoping to find something that takes directly into account the performance of the model in the confidence interval. Does that make sense?

EDIT 2

By “performance”, I mean how well my model fits the data (perhaps something involving the AUC or the $R^2$).

Here’s the sort of thing I’m imagining.

Let $Y=y(x)$ be my outcome and $V=m(x)$ be the logistic regression model's predicted probability for a data point $x$ drawn at random. Suppose I draw $k$ points, so I have $Y_1, \dots, Y_k$ and $V_1, \dots, V_k$. Define:

$$ M = \sum_{i=1}^k V_i = \text{sum of model estimates} $$
$$ Z = \sum_{i=1}^k Y_i = \text{sum of actual outcomes} $$
$$ Q = M - Z $$

I want to ideally know the distribution of $Q$, or at least know its first two moments. Trying to work those out:

$$ E[Q] = E[M] - E[Z] $$
$$ E[M] = k\,E[V] = k\bar{v} $$
$$ E[Z] = k\bar{y} $$

where $\bar{v}$ and $\bar{y}$ come from my data and are the average model prediction and the average outcome, respectively. (I would expect $E[Q]$ to usually be close to $0$.) For the variance:

$$ \operatorname{Var}[Q] = E[Q^2] - E[Q]^2 $$
$$ E[Q^2] = E[(M-Z)^2] = E[M^2] - 2E[MZ] + E[Z^2] $$
$$ E[M^2] = E[(V_1 + \dots + V_k)^2] = k\,E[V^2] + k(k-1)E[V]^2 = k\,\overline{v^2} + k(k-1)\bar{v}^2 $$
$$ E[Z^2] = E[(Y_1 + \dots + Y_k)^2] = k\,E[Y^2] + k(k-1)E[Y]^2 = k\bar{y} + k(k-1)\bar{y}^2 $$

(using $Y^2 = Y$ for a binary outcome), and

$$ E[MZ] = E[(V_1 + \dots + V_k)(Y_1 + \dots + Y_k)] = k\,E[VY] + k(k-1)E[V]E[Y] $$
$$ E[VY] = \Pr[Y=1]\,E[V \mid Y=1] + \Pr[Y=0] \cdot 0 = \bar{y}\,\bar{v}_{Y=1} $$

so $E[MZ] = k\bar{y}\,\bar{v}_{Y=1} + k(k-1)\bar{v}\bar{y}$,

where in all of these I've assumed independence between my $k$ draws, so the cross terms factor into products of means. The quantity $\bar{v}_{Y=1} = E[V \mid Y=1]$ is related to the accuracy of my model: it is the average model prediction across the positive cases. Putting this together, if the algebra is right:

$$ E[Q] = k\bar{v} - k\bar{y} $$
$$ \operatorname{Var}[Q] = k\left(\overline{v^2} - 2\bar{y}\,\bar{v}_{Y=1} + \bar{y} - (\bar{v} - \bar{y})^2\right) $$

(the $k(k-1)$ terms cancel against $E[Q]^2$, leaving $\operatorname{Var}[Q] = k\operatorname{Var}[V - Y]$, as expected for i.i.d. draws).
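These moment estimates are one-liners in R; a sketch, again with the assumed p_hat and y objects standing in for $V$ and $Y$, and k as before:

vbar  <- mean(p_hat)          # average model prediction, v-bar
ybar  <- mean(y)              # average outcome, y-bar
v2bar <- mean(p_hat^2)
vbar1 <- mean(p_hat[y == 1])  # average prediction over positive cases

EQ   <- k * (vbar - ybar)
VarQ <- k * (v2bar - 2 * ybar * vbar1 + ybar - (vbar - ybar)^2)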

I’m interested if this makes sense, and if there’s a standard technique that does this (or if perhaps the technique @Bill mentioned is already doing this, but I’m just not understanding).

bayesglm (arm) versus MCMCpack


Both bayesglm() (in the arm R package) and various functions in the MCMCpack package are aimed at doing Bayesian estimation of generalized linear models, but I’m not sure they’re actually computing the same thing. The MCMCpack functions use Markov chain Monte Carlo to obtain a (dependent) sample from the joint posterior for the model parameters. bayesglm(), on the other hand, produces… I’m not sure what.

It looks like bayesglm() produces a point estimate, which would make it MAP (maximum a posteriori) estimation rather than full Bayesian estimation, but there's a sim() function that looks like it can be used to get posterior draws.

Can someone explain the difference in intended use for the two? Can bayesglm() + sim() produce true posterior draws, or is it some sort of approximation?
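For concreteness, here is the sort of comparison I have in mind (a sketch, with a hypothetical data frame d containing a binary outcome y and predictor x):

library(arm)       # bayesglm() and sim()
library(MCMCpack)  # MCMClogit()

f1 <- bayesglm(y ~ x, family = binomial, data = d)  # point estimate
s1 <- sim(f1, n.sims = 2000)  # draws from an approximation around it
b1 <- coef(s1)                # matrix of simulated coefficient vectors

s2 <- MCMClogit(y ~ x, data = d)  # dependent MCMC draws from the posterior

My impression (an assumption worth checking against the arm documentation) is that sim() samples from a normal-style approximation centered at the point estimate rather than from the exact posterior.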

Bootstrap glm and extract pvalue


I am running a glm model with the bootstrap, and I can extract the coefficient means and the confidence intervals for all the factors in my model.

But how can I get a p-value from there?

Model:

library(boot)

rate.lm <- glm(one ~ two + factor(three), data = data)

# Statistic for boot(): refit the model on the resampled rows and
# return the coefficient vector.
lmfit <- function(data, indices) {
  fit <- glm(one ~ two + factor(three), data = data[indices, ])
  coef(fit)
}

results <- boot(data = data, statistic = lmfit, R = 1000, strata = data$two)

# Print each coefficient with its 95% basic bootstrap interval.
for (i in 2:length(rate.lm$coefficients)) {
  bci <- boot.ci(results, type = "basic", index = i)
  print(sprintf("%s,%.4f,%.4f,%.4f", names(rate.lm$coefficients)[i],
                results$t0[i], bci$basic[4], bci$basic[5]))
}

This gives me back the coefficient means and the 95% confidence intervals. From the intervals I can see whether a variable such as two is significant, but I absolutely need a p-value for the rest of my analysis, and I am unsure how to get it. Thanks for your help!

[1] "two,0.0101,-0.0022,0.0229"
[1] "factor(three)2,-0.2029,-0.2185,-0.1888"
[1] "factor(three)3,-0.1752,-0.1914,-0.1595"
[1] "factor(three)4,-0.2093,-0.2259,-0.1947"
[1] "factor(three)5,-0.1745,-0.1910,-0.1588"
[1] "factor(three)6,-0.1988,-0.2139,-0.1849"

post hoc nested


Hi, I need to do a post hoc test for a nested factor in SAS.
I have two factors, A and B, with B nested within A, and I want to perform a post hoc test in PROC GLM in SAS. Is it possible?
Stevan

simulate GLM with square root link in R


I'm trying to simulate from a fitted GLM using basic functions, rather than the simulate() and predict() functions that are widely covered in other questions and answers. I get different results when I compare my own calculation to the simulate() and predict() functions. Most probably I'm doing something wrong, but I cannot seem to find the error.

First I generate random correlated data with a skewed dependent variable:

library(MASS)
n <- 1500
d <- mvrnorm(n=n, mu=c(0,0,0,0), Sigma=matrix(.7, nrow=4, ncol=4) + diag(4)*.3)
d[,1] <- qgamma(p=pnorm(q=d[,1]), shape=2, rate=2)

Next I fit a GLM with a square root link to adjust for the skewed data:

m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4], family=gaussian(link="sqrt"))

Next I generate predicted values, first on the linear scale and then on the inverse scale, including stochastic uncertainty (eventually I want to use input values other than the source data):

p_lin <- m$coef[1] + m$coef[2]*d[,2] + m$coef[3]*d[,3] + m$coef[4]*d[,4]
p <- rnorm(n=n, mean=p_lin^2, sd=sd(p_lin^2 - d[,1]))

I compare the results to the simulate() and predict() function:

par(mfrow=c(1,1), mar=c(4,2,2,1), pch=16, cex=0.8, pty="s")
xylim <- c(min(c(d[,1], p)), max(c(d[,1], p)))
plot(p, d[,1], xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(simulate(m)$sim_1, d[,1], col=rgb(0,1,0,alpha=0.1))
points(predict(m, type="response"), d[,1], col=rgb(1,0,0,alpha=0.1))
abline(a=0, b=1, col="red")

What is going wrong in my formula for predicting values?
Where can I find the mathematical and R expressions used in the predict() and simulate() functions?
Is there a reference explaining simulation of GLMs, including stochastic uncertainty (and, as my next step, parameter uncertainty), for the various family/link combinations available in R? I found one good source on GLM simulation, though it doesn't answer my specific questions: http://www.sagepub.com/upm-data/57233_Chapter_6.pdf
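For reference on the second question, the method source can be printed directly in R with standard inspection calls (glm objects also inherit from class "lm", so the "lm" method is the one that runs for simulate):

getS3method("predict", "glm")  # shows how type="response" applies the inverse link
getS3method("simulate", "lm")  # shows the per-family noise model used by simulate()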

Including an interaction term in GLM with full factorial experiment data


I collected data from a 2 × 2 × 2 full factorial experiment. Because the DV is a count variable with unequal mean and variance, I ran a negative binomial model.
It was hard for me to come up with a good rationale for a three-way interaction effect, but I still want to explore it by including the three-way interaction term in the model.
Does this inclusion reduce statistical power for the other effects, or does the orthogonality of the full factorial design prevent that?
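One direct way to explore the term is to fit the model with and without it and compare; a sketch with hypothetical names (count for the DV, A, B, C for the factors):

library(MASS)

m_full <- glm.nb(count ~ A * B * C, data = d)      # all terms, incl. 3-way
m_red  <- glm.nb(count ~ (A + B + C)^2, data = d)  # main effects + 2-way only
anova(m_red, m_full)  # likelihood-ratio test of the 3-way interaction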

Repeated Measures with a covariate


I have two continuous variables: a pretest and a post test. A paired-samples test shows a significant difference (p < .001). However, I wonder what caused the post test to be better than the pretest, and I suspect that another continuous variable predicted the increase. Here is my research design:
Learners took a pretest on 15 vocabulary items. Then they read a paragraph that included these words. While they read, I recorded their eye movements and fixation durations. Afterwards they took a post test, and the post test results are better. I also observed that learners paid more attention to unknown words than to known words while reading. So after reading and attending to unknown words, they became familiar with them and did better on the post test. "Fixation on unknown words" is my predictor of the pre-to-post improvement. All variables are continuous. Which test should I use: a generalized linear model, a multivariate analysis, or repeated measures?
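One simple option I have considered is an ANCOVA-style regression of the post test on the pretest plus the eye-tracking predictor; a sketch with hypothetical column names:

# Does fixation on unknown words predict post-test scores over and
# above the pretest baseline?
fit <- lm(post ~ pre + fixation_unknown, data = d)
summary(fit)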

Problem fitting a geeglm regression


I am fitting a model using geeglm in geepack and ran into a problem.

I have a dataset pertaining to oil consumption and fit the model below.

geeglm(formula = Consumption~Income + Price + Observation, 
       id=Id, 
       corstr="ar1")

but its output looks like this:

Coefficients:
  (Intercept)  Income 1,014.99   Income 1,047.31   Income 1,064.58  
      1316992           -453526          -2734582           1626503 
  Income 1,138.72   Income 1,153.81   Income 1,158.30   Income 1,160.69  
     -2879197           -639736            212062           -409109 
  Income 1,182.30   Income 1,249.24   Income 1,294.34   Income 1,300.16  
      1314828          -3177208             -4425           1214973 
  Income 1,316.89   Income 1,336.83   Income 1,339.14   Income 1,407.93  
      -821846          -3295240           -641897           2029621 
  Income 1,430.15   Income 1,432.48   Income 1,433.70   Income 1,441.98  
      -273614           -858012          -3472286           -851193 
  Income 9,187.94   Income 9,224.06   Income 9,313.42   Income 9,720.17  
       621587          -2171161             15996          -4233896 
  Income 964.01              Price       Observation 
     -2469417           -439571            203036 

Why is it giving me a separate coefficient for every Income value when I set id = Id?
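The comma-formatted level names (Income 1,014.99 and so on) make me suspect Income was read in as a factor or character rather than as a number; would coercing it and refitting, as in this sketch (object names assumed), be the right fix?

library(geepack)

d$Income <- as.numeric(gsub(",", "", as.character(d$Income)))
m <- geeglm(Consumption ~ Income + Price + Observation,
            id = Id, data = d, family = gaussian, corstr = "ar1")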

How to account for overdispersion in a glm with negative binomial distribution?


I'm analysing count data with a generalised linear model in R. I started with a Poisson family, but then realised that the data were clearly overdispersed. I then switched to a GLM with a negative binomial distribution (using the function glm.nb() from the MASS package). Interestingly, forward and backward stepwise selection both give the same best model, which is:

m.step2 <- glm.nb(round(N.FLOWERS) ~ Hs_obs+RELATEDNESS+CLONALITY+PRODUCTION, data = flower[c(-12, -17), ])

Then, to test the fixed effects, I use the anova() function, which gives:

anova(m.step2, test = "Chi")
Analysis of Deviance Table
Model: Negative Binomial(1.143), link: log
Response: round(N.FLOWERS)
Terms added sequentially (first to last)
              Df Deviance Resid. Df  Resid. Dev   Pr(>F)   

 NULL                           15     40.674                   
 Hs_obs       1   9.5978        14     31.076    0.001948 **
 RELATEDNESS  1   9.4956        13     21.581    0.002060 **
 CLONALITY    1   3.0411        12     18.540    0.081181 . 
 PRODUCTION   1   3.7857        11     14.754    0.051693 .
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 Warning messages: 
 1: In anova.negbin(m.step2, test = "F") : tests made without re-estimating 'theta'

However, if there is overdispersion (even with the negative binomial), these p-values should be corrected, shouldn't they? In my case, the residual deviance (obtained from summary(m.step2)) is 14.754 on 11 residual degrees of freedom, so the dispersion is 14.754/11 = 1.34.

How do I correct the p-values to account for the small amount of overdispersion detected in this negative binomial model?
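For reference, the dispersion is often estimated from the Pearson statistic rather than the deviance; a sketch using the fitted object above:

# Pearson-based dispersion estimate; values near 1 suggest the negative
# binomial variance function is adequate.
phi <- sum(residuals(m.step2, type = "pearson")^2) / df.residual(m.step2)
phi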
