Methods to analyze a cross-over interaction between a factor and a continuous variable?

I’m wondering what would be the best method to analyze a “cross-over” interaction between a factor and a continuous variable.

Here’s my experimental set-up and hypotheses in a nutshell:
58 participants were randomly assigned to one of two conditions (each n=29) with different manipulations. It was hypothesized that a continuous individual difference variable would make participants more susceptible to the manipulations. The different manipulations were intended to have the opposite effects on the DV. Thus, it was hypothesized that in Condition 1 the DV and the IDV would have a negative correlation, and in Condition 2 they would have a positive correlation.

Indeed, correlation analyses (between the DV and the IDV) revealed that r = -.37, p < .05 in Condition 1, and r = .26, p = .18 in Condition 2.

I’ve tried the General and Generalized Linear Models in SPSS to properly test my hypothesis. In the models, I included main effects of Condition and the IDV (which I entered as a covariate) and an interaction effect, but the results don’t appear to be very robust (i.e., even small changes to the models seem to have a big effect on the results). I’ve used dummy coding (0s and 1s) for Condition, and the DV and the IDV appear to be normally distributed.

Does this seem like an appropriate way of testing my hypothesis, or can someone suggest a better approach?
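The dummy-coded model with a Condition × IDV interaction is the standard test of exactly this cross-over hypothesis: the interaction coefficient is the difference between the two within-condition slopes. Here is a minimal sketch in Python with numpy (simulated stand-in data, since the real measurements aren’t available; the slope values are made up to mimic the hypothesized pattern):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for the real measurements: 29 participants per condition.
n = 29
cond = np.repeat([0.0, 1.0], n)        # dummy-coded Condition (0/1)
idv = rng.normal(size=2 * n)           # individual-difference variable
# Cross-over pattern: negative IDV slope in Condition 0, positive in Condition 1.
dv = -0.4 * idv + 0.8 * cond * idv + rng.normal(0.0, 0.5, 2 * n)

# Design matrix: intercept, Condition, IDV, Condition x IDV interaction.
X = np.column_stack([np.ones(2 * n), cond, idv, cond * idv])
beta, *_ = np.linalg.lstsq(X, dv, rcond=None)
b0, b_cond, b_idv, b_int = beta

print(f"slope in Condition 0: {b_idv:.2f}")          # about -0.4
print(f"slope in Condition 1: {b_idv + b_int:.2f}")  # about +0.4
```

With 0/1 coding, `b_idv` is the IDV slope in Condition 0 and `b_idv + b_int` is the slope in Condition 1, so the single test of `b_int` is the test of the cross-over; mean-centering the IDV additionally makes the Condition main effect interpretable at the average IDV.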

Fitting an exponential decay onto a regression line

I have data for adherence to medicines which follows a downward linear trend for about 6 months (from 100%) and then plateaus at about 50%. Another way of describing it is that adherence reaches steady state (≈50%) about 6 months after starting a medicine.

I have been exploring the effect of an intervention on adherence to medicines, using a longitudinal study design and a linear model. However, I would like to try fitting an exponential decay into the model to see if I can improve the fit. I’m using geeglm in R.

My original model is straightforward:

y = B0 + B1(time) + B2(trt_group) + B3(post) + B4(time_post) + B5(post*trt_group) + B6(time_post*trt_group)

where post = the intervention occurring at a defined point in time (change in intercept)
time_post = change in slope after the intervention.

A function that may work to fit the decay is

$$y(t) = A + B\,e^{-kt}$$

where A = value at plateau (estimated at 50%)
B = difference between plateau value and original intercept value (the intercept should typically be 100%, so B = 50%)
k = elimination rate constant
t = time

I’m not sure how I would fit this into my model. If anyone has any suggestions for how I could work this out, or could point me to some web-help, that would be really helpful.

Thank you.

[Plot of raw data and predicted values from the linear model]
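Before working the decay into the GEE model, it may help to confirm that the plateau function fits the raw trend at all. A minimal sketch (in Python with scipy rather than R, with simulated adherence data standing in for the real series):

```python
import numpy as np
from scipy.optimize import curve_fit

def adherence(t, A, B, k):
    # A = plateau value, B = drop from the initial intercept, k = rate constant
    return A + B * np.exp(-k * t)

# Simulated stand-in: adherence falling from 100% toward a 50% plateau.
t = np.linspace(0, 12, 60)    # months since starting the medicine
rng = np.random.default_rng(1)
y = adherence(t, 50.0, 50.0, 0.5) + rng.normal(0.0, 2.0, t.size)

# p0 encodes the guesses from the question: plateau ~50, drop ~50.
(A_hat, B_hat, k_hat), _ = curve_fit(adherence, t, y, p0=[50.0, 50.0, 0.5])
print(A_hat, B_hat, k_hat)
```

If the shape fits, one pragmatic way to keep using geeglm is to precompute exp(-k*t) for a grid of candidate k values, enter it as a regressor in place of the linear time terms, and pick the k that gives the best fit (e.g., by QIC); jointly estimating k inside the GEE would need a nonlinear modelling tool instead.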

How to promote a regression tree over a GLM?

Does anybody have any suggestions about promoting the use of a regression tree over a GLM when the two models fit the data almost exactly the same?

My team’s current arguments are: a) a tree is easier for non-technical people to understand, and b) when the two models’ predictions are implemented (simplified as needed) as tables, which is necessary for use in actuarial software, the GLM’s results will be distorted more than the tree’s predictions.

Any other ideas?

Background: Office politics. We want to use one of our own models instead of having our parent company’s GLM forced on us without any comparison of the models, so to give our work a chance to be evaluated fairly we need to promote our tree’s advantages over the GLM.

How to represent a Bayesian loss function in binary classification

I am studying classification using linear regression. Now I want to map it into Bayesian regression. Let’s talk about binary classification using linear regression again. Assume that I have a set $X=\{x_1, x_2, \dots, x_n\}$ and binary labels $y \in \{0, 1\}$. The binary classification task using linear regression can be embedded into minimizing the loss function
$$L=\sum_i \left(h(x_i)-y_i\right)^2$$ where $h(x)=ax+b$ is the linear regression line. This formula is very clear and is discussed in the [lecture note][1]. Now I want to map it onto Bayes’ rule, which can be expressed as:
$$p(y=0,1 \mid X,a,b)=p(X \mid y,a,b)\,P(X)$$
Hence, the loss function in the Bayesian classification case is given as follows.
Is this Bayesian loss formula correct? Thank you so much
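For concreteness, the squared-error loss $L=\sum_i (h(x_i)-y_i)^2$ above can be computed directly; a tiny numpy sketch with made-up points and arbitrary line parameters:

```python
import numpy as np

# Made-up toy data: four points with binary labels.
x = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0.0, 0.0, 1.0, 1.0])

a, b = 1.5, -0.25          # arbitrary line parameters for h(x) = a*x + b
h = a * x + b
L = np.sum((h - y) ** 2)   # L = sum_i (h(x_i) - y_i)^2
print(L)
```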

Generalized linear model with lasso regularization for continuous non-negative response

I have a big data problem with a large number of predictors and a non-negative response (time until inspection).
For a full model I would use a glm with Gamma distributed response (link=”log”).

However I would like to find a small model. The “best subset glm” approach does not work for me as I run out of memory – it seems that it is not efficient enough for my setting (big data, weak computer).

So I switched to the LASSO approach (using R packages lars or glmnet).
glmnet even offers some distribution families besides the Gaussian, but not the Gamma family. How can I do lasso regularization for a GLM with Gamma-distributed response in R? Could it be a Cox model (Cox net) for modelling some kind of waiting time?

EDIT: As my data consists of all data points with the information about the time since the last inspection it really seems appropriate to apply a COX model. Putting data in the right format (as Surv does) and calling glmnet with family="cox" could do the job in my case of “waiting times” or survival analysis. In my data all data points “died” and the Cox model allows to analyse which ones “died” sooner. It seems as if in this case family="gamma" is not needed. Comments are very welcome.
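If staying with the Gamma/log-link model were preferred, one workaround when no package supports the penalty directly is to minimize the L1-penalized Gamma negative log-likelihood yourself. A rough sketch in Python with scipy rather than R (simulated data; the penalty weight `lam` is an arbitrary illustration value, and a derivative-free optimizer is used because the L1 term is non-smooth):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p = 300, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 0.8, 0.0, 0.0, -0.5, 0.0])   # sparse truth
mu = np.exp(X @ beta_true)
y = rng.gamma(shape=2.0, scale=mu / 2.0)                # Gamma with mean mu

def objective(beta, lam=0.05):
    eta = X @ beta
    # Gamma negative log-likelihood with log link (up to constants and the
    # shape parameter) plus an L1 penalty on the non-intercept coefficients.
    nll = np.mean(eta + y * np.exp(-eta))
    return nll + lam * np.sum(np.abs(beta[1:]))

res = minimize(objective, x0=np.zeros(p + 1), method="Powell")
print(np.round(res.x, 2))
```

That said, the glmnet `family="cox"` route described in the edit is the more standard choice here, since every observation has an observed event time.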

Poisson regression on the means of count data

I just finished a small research project about hummingbirds and the effect of hummingbird feeders. I am a bit unsure about how to proceed with the statistics.

We placed 15 points along a distance gradient away from the feeders, where we sampled visitation rates, pollination, and bird/floral abundance (all count data). We sampled all points twice. What I would like to do is use Poisson regression on the mean of the two samples. Does that make statistical sense? First, I would guess that using the means would push the data distribution towards normality. Second, can you use Poisson regression on non-whole numbers?
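Averaging the two rounds breaks the integer support the Poisson assumes. A common alternative (sketched here in Python with scipy-based maximum likelihood rather than R’s glm, on simulated counts) is to model the sum of the two counts and add a log(2) exposure offset, which keeps the rate on the per-visit scale:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
dist = np.linspace(0, 14, 15)            # 15 points along the distance gradient
rate = np.exp(2.0 - 0.1 * dist)          # true per-sample visitation rate

# Two sampling rounds per point; model the SUM, not the mean.
y = rng.poisson(rate) + rng.poisson(rate)

def nll(beta):
    # Poisson log link with a log(2) offset for the two sampling occasions.
    mu = np.exp(beta[0] + beta[1] * dist + np.log(2.0))
    return np.sum(mu - y * np.log(mu))

res = minimize(nll, x0=np.zeros(2), method="BFGS")
print(np.round(res.x, 2))
```

The fitted coefficients stay interpretable as per-sample rates, and the likelihood remains a genuine Poisson likelihood, which it would not be for the non-integer means.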

In linear regression the prediction error range is increasing while the mean of the error is decreasing

I conducted a linear regression on a large and highly skewed data set that contains 80 variables, about 1.0 million users who didn’t spend money, and about 15k users who spent varying amounts of money. My aim was to build a regression model that would allow me to predict the amount of money a user will spend over 180 days. I built several models (for the first registration day, after 7 days from registration, and after 14 days from registration). In all of the models I transformed the data to log scale. I got results that I find hard to explain: while the R-squared improves over the three models, the range of the errors (after I back-transformed the predictions with the anti-log) gets larger. I don’t understand how these can be the final results:

Day0 R^2 = 92.4%, average mean of errors: 2.22, Range of errors: 26,570
Day7 R^2 = 94% , average mean of errors: 1.65, Range of errors: 28,873
Day14 R^2 = 95% , average mean of errors: 1.19, Range of errors: 45,400 

I would expect that the range of errors (actual values − predictions) would decrease alongside an improvement in R^2 and other parameters such as the mean of the errors.
Any idea is welcome.

Edit: the range of errors was computed as follows: abs(min(actual - predicted)) + max(actual - predicted)
the average of errors was computed as follows: avg(actual - predicted)
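The two metrics can move in opposite directions after back-transforming, because the error range is driven by the single most extreme user while the mean is driven by the bulk. A deterministic toy illustration with made-up numbers (pure Python):

```python
import math

actual = [10.0, 100.0, 100_000.0]      # heavy-tailed spend, like the data
pred_early = [20.0, 50.0, 100_000.0]   # weaker model: big % errors on small users
pred_late = [11.0, 110.0, 80_000.0]    # better model: small % errors everywhere

def mean_abs_log_error(pred):
    # typical error on the log scale, which is what R^2 tracks here
    return sum(abs(math.log(a) - math.log(p))
               for a, p in zip(actual, pred)) / len(actual)

def error_range(pred):
    # max minus min of (actual - predicted), on the raw money scale
    errs = [a - p for a, p in zip(actual, pred)]
    return max(errs) - min(errs)

print(mean_abs_log_error(pred_early), error_range(pred_early))  # ~0.46, 60
print(mean_abs_log_error(pred_late), error_range(pred_late))    # ~0.14, 20010
```

The "late" model is better on the log scale for every user, yet a modest 20% relative error on the one 100k spender dominates the raw-scale range. This is exactly the pattern in the Day0/Day7/Day14 numbers.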

Residuals from model missing interaction

In a plot of residuals against fitted values from a generalised linear model, I’m wondering what the plot would look like if an interaction were missing from the model. Can anyone simulate a model that is missing an interaction? My expectation is that without the interaction included in the model, the residuals-against-fitted plot shows clear patterns, and with the interaction it does not.
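This is easy to simulate; a sketch in Python with numpy rather than R, omitting a true x1*x2 term from a linear model. One caveat: the clearest signature is in the residuals plotted against the omitted product term itself, while the residuals-vs-fitted plot may show only a subtler curved or funnel-shaped pattern:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# True model contains a strong x1*x2 interaction.
y = 1.0 + x1 + x2 + 2.0 * x1 * x2 + rng.normal(0.0, 0.5, n)

def residuals(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

ones = np.ones(n)
res_no_int = residuals(np.column_stack([ones, x1, x2]))
res_with_int = residuals(np.column_stack([ones, x1, x2, x1 * x2]))

# The omitted x1*x2 term ends up in the residuals of the smaller model.
corr_no = np.corrcoef(res_no_int, x1 * x2)[0, 1]
corr_yes = np.corrcoef(res_with_int, x1 * x2)[0, 1]
print(round(corr_no, 2), round(abs(corr_yes), 6))
```

In the misspecified fit the residuals are almost perfectly correlated with the omitted product; once the interaction is included they are orthogonal to it by construction, and the residuals-vs-fitted plot loses its structure.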

Possible to evaluate GLM in Python/scikit-learn using the Poisson, Gamma, or Tweedie distributions as the family for the error distribution?

Trying to learn some Python and SKLearn, but for my work I need to run regressions that use error distributions from the Poisson, Gamma, and especially Tweedie families.

I don’t see anything in the documentation about them, but they are in several parts of the R distribution, so I was wondering if anyone has seen implementations anywhere for Python. It would be extra cool if you can point me towards SGD implementations of the Tweedie distribution!
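For what it’s worth, newer scikit-learn releases (0.23 and later) added exactly these estimators: `sklearn.linear_model.PoissonRegressor`, `GammaRegressor`, and `TweedieRegressor`, where the `power` parameter selects the family. A minimal sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
# Positive, skewed response with a log-linear mean.
y = rng.gamma(shape=2.0, scale=np.exp(X @ np.array([0.3, -0.2, 0.1])) / 2.0)

# power=0: Normal, 1: Poisson, 1<power<2: compound Poisson-Gamma, 2: Gamma.
model = TweedieRegressor(power=1.5, alpha=0.01, link="log", max_iter=1000)
model.fit(X, y)
pred = model.predict(X)
print(pred.min() > 0)   # the log link keeps predictions positive
```

As far as I know scikit-learn’s SGDRegressor has no built-in Tweedie deviance loss, so an SGD variant would mean coding the deviance gradient yourself.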


How to report log linear models of contingency tables

I am using log linear models (loglm function, library MASS of R) to evaluate if 3 variables in a 3 way contingency table are independent.

I built the model of mutual independence:

loglm(formula = ~A + B + C, data = test.t)

Which gives me

                      X^2 df P(> X^2)
Likelihood Ratio 264.7872 50        0
Pearson          292.6937 50        0

From what I understand, this LR test compares my model to the saturated model and shows that there is unexplained variance in my model and significant interactions need to be incorporated, which means I can reject my hypothesis that the 3 variables are mutually independent.

How exactly should I report this analysis in my report? Do I need to state the LR test value, degrees of freedom, and p-value?

Is the second line the Pearson chi-square test? I was under the impression that the Pearson chi-square test is only for 2×2 tables (chisq.test() throws an error on bigger tables). Or is it the Pearson chi-squared comparing the 2 models (my model vs. the saturated model)?
