I have a set of units, where unit $i$ has:

- a binary outcome $y_i$ (“case” or “non-case”)
- a binary vector of treatments $mathbf{t_i}$, where $t_{i,j}=1$ if unit $i$ received treatment $j$.
- several hundred confounding covariates $mathbf{x_i}$, some real-valued and some binary.

I expect that most treatments have no association with the outcome. I am trying to spot the outlier treatments that do have an effect, and estimate how many cases that effect was responsible for. For each treatment, I’d like to say something like:

Supposing there were no association between this treatment and the outcome, here is the number of cases (and a confidence interval) we would have expected to see for this treatment, predicted based on the covariates for the set of units that received that treatment.

**Is there a standard way for doing this?**

I fit a logistic regression model $m(mathbf{x})$ that gives $E[Y]$ for a given $mathbf{x_i}$. That is, it predicts the outcome given the covariates.

Then I can look at:

$$c_j=text{actual number cases with treatment $j$} = sum_i{t_{i,j} y_i}$$

$$h_j=text{predicted number of cases with treatment $j$} = sum_i{t_{i,j} m(mathbf{x_i})}$$

and compare the two.

These quantities are easily calculable, but I’m unsure how to test significance of the difference (call it $q_k$, where $k$ is the number of units with the treatment), since I don’t know what distribution $Q_k$ is drawn from under the null hypothesis.

I want $Q_k$ to be the distribution one would get by doing the following: sample $k$ many $(mathbf{x_i}, y_i)$ pairs according to the process that generated my data (or uniformly from the data itself), and compute the difference between the predicted and actual number of cases. Intuitively, this distribution should get tighter as $k$ increases and **also as the model gets more accurate**. That is, I don’t want to just assume that my model is accurate (and model $Q_k$ as something like a poisson binomial), but rather I want the distribution of $Q$ to take into account how well my model fits the data.

Any pointers or ideas would be very appreciated!

**EDIT**

Thanks @Bill for the link! My understanding of that standardization approach is that their confidence interval takes into account the variance of the model (i.e. the uncertainty in the logistic regression coefficients), but not the performance of the model. They suggest separately testing the goodness of fit of the model. I am hoping to find something that takes directly into account the performance of the model in the confidence interval. Does that make sense?

**EDIT 2**

By “performance”, I mean how well my model fits the data (perhaps something involving the AUC or the $R^2$).

Here’s the sort of thing I’m imagining.

Let $Y=y(x)$ be my outcome and $V=m(x)$ be the logistic regression model’s predicted probability for a data point $x$ drawn at random. Suppose I draw $k$ points, so I have $Y_1, … Y_k$ and $V_1, … V_k$. Define:

$$ M = sum_{i=0}^k V_i = text{sum of model estimates} $$

$$ Z = sum_{i=0}^k Y_i = text{sum of actual outcomes} $$

$$ Q = M – Z $$

I want to ideally know the distribution of $Q$, or at least know its first two moments. Trying to work those out:

$$ E[Q] = E[M] – E[Z] $$

$$ E[M] = k E[V] = kbar{v} $$

$$ E[Z] = kbar{y} $$

where $bar{v}$ and $bar{y}$ come from my data and are the average model prediction and the average outcome, respectively. (I would expect $E[Q]$ to usually be close to $0$.) For the variance:

$$ Var[Q] = E[Q^2] – E[Q]^2 $$

$$ E[Q^2] = E[(M-Z)^2] = E[M^2] – 2E[MZ] – E[Z^2] $$

$$ E[M^2] = E[(V_1 + ... + V_k)^2] = k E[V^2] = k bar{v^2} $$

$$ E[Z^2] = E[(Y_1 + ... + Y_k)^2] = k E[Y^2] = k E[Y] = k bar{y} $$

$$ E[MZ] = E[(V_1 + ... V_k)(Y_1 + ... + Y_k)] = k E[VY] = k (Pr[Y=1] E[VY vert Y=1] + Pr[Y=0] E[VY vert Y=0] = kPr[Y=1]E[V vert Y=1] = kbar{y} E[V vert Y=1] = 2kbar{y}bar{v}_{Y=1}$$

where in all of these, I’ve assumed independence between each of my $k$ draws. The $bar{v}_{Y=1} = E[V vert Y=1]$ is related to the accuracy of my model, and is the average model prediction across all the positive cases. Putting this together, if the algebra is right:

$$ E[Q] = kbar{v} – kbar{y} $$

$$ Var[Q] = k bar{v^2} – 2kbar{y}bar{v}_{Y=1} – k bar{y} – (kbar{v} – kbar{y})^2 $$

I’m interested if this makes sense, and if there’s a standard technique that does this (or if perhaps the technique @Bill mentioned is already doing this, but I’m just not understanding).