Possible to evaluate GLM in Python/scikit-learn using the Poisson, Gamma, or Tweedie distributions as the family for the error distribution?


Trying to learn some Python and SKLearn, but for my work I need to run regressions that use error distributions from the Poisson, Gamma, and especially Tweedie families.

I don’t see anything in the documentation about them, but they are in several parts of the R distribution, so I was wondering if anyone has seen implementations anywhere for Python. It would be extra cool if you can point me towards SGD implementations of the Tweedie distribution!

Thanks.

How to report log linear models of contigency tables


I am using log linear models (loglm function, library MASS of R) to evaluate if 3 variables in a 3 way contingency table are independent.

I build the model of mutual independence

loglm(formula = ~A + B + C, data = test.t)

Which gives me

Statistics:
                      X^2 df P(> X^2)
Likelihood Ratio 264.7872 50        0
Pearson          292.6937 50        0

From what I understand this is the LR test compares my model to the saturated model and observes that there is unexplained variance in my model and significant interactions need to be incorporated, which means I can reject my hypothesis that the 3 variables are independent.

How exactly should I report them this analysis in my report? Do I need to state the LR test values, degrees of freedom and P?

Is this the Pearson Chi square test in the second line? I was under the impression that the Pearson chi square test is only for 2×2 tables (the chisq.test() throws an error in bigger tables). Or is it the Pearson chi squared for the 2 models (my model vs the saturated model)?

Post-hoc after GLM


I am running a GLM, using the function glm.nb (pscl package) trying to figure how what could influence a particular trait in several locations and years. The output as follow (with slight modification and removing the things beyond this question)

          Estimate Std. Error z value Pr(>|z|)    
          state2        -0.163190   0.807307  -0.202  0.00081    
          state3        -0.530588   0.783023  -0.678  0.00302 
          state4         0.942953   0.737328   1.279  0.00094    
          year2         -1.759102   0.733214  -2.399  0.01643 
          year3         -0.633870   0.662903  -0.956  0.33897    

I understand the output as such: compared to year1, year 2 is statistically different, but year3 is not different from year1. but what about year2 and year3?

For the state factor, state2, state3 and state4 are all statistically different than state1. so I need a post-hoc test between state2 state3 and state4, am I right?

thanks

Question about Dummy Variable in Cross Level Interaction – GENLINEMIXED (SPSS)


Hi I am running linear GENLINEMIXED in SPSS 22. What does it mean when one of the dummy variables you are using in a model is significant when one group is coded as a reference group, and when you swap the reference coding to the other group, it is no longer significant. Could it be that one group has a linear relationship to the DV, and the other non-linear? The dummy variable is a level 2 predictor for a level 1 DV.

In my case, I have been working with some other co-authors on a paper that is using GENLINMIXED to run a linear multi-level model nested by 54 countries in SPSS on three different dependent variables. These DV variables are ipsative (they add up to 100%). The items ask entrepreneurs and business owners to allocate 100 points across the economic, social, and environmental values of their respective venture to try and capture some quantification of what we call in business research the “triple bottom line.” I originally used MANOVA to test the relationships on the DVs simultaneously for this study. However, a reviewer suggested I do a multilevel model for the analysis, and another reviewer I have another reviewer suggesting a Multi-level Seemingly Unrelated Regression. Any suggestions on which route would be best? The following exerpt from our working paper has utilized the multi-level model approach (e.g., GENLINMIXED in SPSS).

My goal is to identify a cross-level interaction between gender (coded as binary 0=men 1=women) and culture as operationalized by the World Values Survey (coded binary 0= post-materialist culture and 1=materialist culture). I want to identify if culture attenuates or increases the value goals between the sexes.

A weird thing happening with my data when I pursued the GENLINMIXED linear multi-level model is that if I make the reference group for my two binary independent variables men and post-materialist, the cross level interaction for women and materialism is significant across all three DVs (economic, social, and environmental). Also, the main effect of gender is significant, and culture not significant. However, if I make the reference group for the binary independent variables women and materialism, the cross level interaction term men and post-materialisim is not significant. Again gender is significant and culture not significant.

Again, could it be that the cross level interaction is linearly related to the DVs for women in materialist countries, and not linearly related to the DVs for men in post-materialist countries? I have just the methods write-up of that paper attached two with the three hypothesis if you would like further detail into the situation. Any advice or feedback would be most appreciated.

Statistical test to use?


Which statistical test should I use for prediction if there are 5 qualitative independent variables and 1 quantitative independent variable?

Why does poisson regression need to assume observations are poisson distributed?


Zuur (2013) ‘Beginners Guide to GLM and GLMM’ states that if the Pearson residuals, when plotted against fitted values from a poisson regression, show the pattern below then the assumption of poisson distributed observations is violated. Zuur urges to try a negative binomial regression if the residuals show a pattern like below. If a poisson regression is used, then Zuur states the standard errors will be wrong. But why does the model need to make assumptions about the distribution of observations around the fitted values? Why can’t it use information contained in the distribution of the observed residuals to calculate the standard errors? My best guess is that for every fitted values, there might only be a few observed values, and there is not enough information contained in these few observed values to calculate the standard errors.

enter image description here

Logistic regression: Strange standard errors from glm() in R


To my surprise I found that standard errors and thus Wald confidence intervals became smaller when I removed the intercept from a simple logistic regression model, using glm() and R.

# load an object named "my.df" to the global enviroment
load(url("http://hansekbrand.se/code/test.RData"))
# fit a model with intercept to data
my.fit <- glm(deprived.of.education ~ religion, data = my.df, family = binomial("logit"))
# fit a model without any intercept to data
my.fit.without.intercept <- glm(deprived.of.education ~ 0 + religion, data = my.df, family = binomial("logit"))

# inspect the first fit
summary(my.fit)$coefficients
#                        Estimate Std. Error   z value     Pr(>|z|)
# (Intercept)          -2.8718056 0.03175130 -90.44687 0.000000e+00
# religionChristianity  0.4934891 0.03234887  15.25522 1.519805e-52
# religionHinduism      0.5257316 0.03376535  15.57015 1.161317e-54
# religionIslam         1.5734832 0.03231692  48.68914 0.000000e+00
# religionNonreligious  1.5975456 0.03555164  44.93592 0.000000e+00

# inspect the second fit
summary(my.fit.without.intercept)$coefficients
#                       Estimate  Std. Error    z value Pr(>|z|)
# religionBuddhism     -2.871806 0.031751299  -90.44687        0
# religionChristianity -2.378317 0.006189045 -384.27842        0
# religionHinduism     -2.346074 0.011487113 -204.23530        0
# religionIslam        -1.298322 0.006019850 -215.67354        0
# religionNonreligious -1.274260 0.015992939  -79.67642        0

I understand why the z values are different, because the null hypotheses in the two cases are different. In the first case, with the intercept, the null is “same as the reference category”, while without the intercept, the null becomes “zero”.
But I do not understand the large difference in standard errors between the two models.

Without the intercept, the standard errors seem to vary with n of each level, i.e. there are many cases of “Christianity” and “Islam”, and they have small standard errors, but with the intercept, there is essentially no variation in the standard errors.

Could someone please explain the reason for the differences in the magnitude of the standard errors between the two models?

I would like to calculate probabilities and confidence intervals around them, and I have done so using the estimates from the first model. If I would do that with the estimates from the second model, the confidence intervals would be much smaller, but would they be reliable?

Parameter estimation with generalized linear models


By default when we use a glm function in R, it uses the iteratively reweighted least squares (IWLS) method to find the maximum likelihood estimation the parameters. Now I have two questions.

  1. Does IWLS estimations guarantee the global maximum of the likelihood function? Based on the last slide in this presentation, I think it does not! I just wanted to make sure of that.
  2. Can we say that the reason for question 1 above is because of the fact that almost all the numerical optimization methods may stuck at a local maximum rather than a global maximum?

What is a relevant level for percentage deviance explained in a GLM?


I am trying to sort out the unique effect of various environmental predictors on species occurrence (presence/absence data). I have been running glm models in R with family=binomial. Most of my variables have very highly significant P values, but I have a large data set (~8000 data points), so this is maybe not so surprising. I have also been calculating the percentage deviance explained using the BiodiversityR package. Some of my very significant variables based on the P value have very low explained deviance (1% or so).

Is there a rule of thumb for saying a predictor is not significant based on the deviance explained? Any other tests I should be including? I wanted to look at each variable on its own, and then test out specific interactions, but only if the variable was “important” in a model where it was the only predictor.

unique spline for different groups in a linear model (and no spline at all for one group)


I have a problem that has puzzled me for a long time. It involves linear models and spline functions. I need to model “time since diagnosis” when some individuals never had a diagnosis. I use Poisson model where I have split time on age and time since diagnosis. The data set may look like this (with millions upon millions of rows,and with 95 % having no diagnosis ) :

SUBJECT AGE DIAGNOSIS TIME_SINCE_DIAGNOSIS EVENT/OUTCOME

1 30 A 0 0

1 31 A 1 0

1 32 A 2 1

2 45 B 0 0

2 46 B 1 0

103 22 NON “0 or -” 0

103 23 NON “0 or -” 0

.
.
.

You get the idea .. Now I want to have a model that looks something like this:

log ( OUTCOME ) = Spline( AGE ) + DIAGNOSIS * Spline( TIME_SINCE_DIAGNOSIS ) + offset ( logrisktime )

and then plotting the predicted risk for the different diagnoses of A and B as a function of ” time since diagnosis ” given that the diagnosis was at age = 30, and in the same plot for diagnosis = NON but then instead as a function of age ( with year since 30 on x-axis). I like to have both those with a diagnosis and those with diagnosis=NON in the same model to be able to do contrasts/linear combinations of effects.

Model and plot is done, but my problem is that the model above also creates a spline over the ” time since diagnosis ” for those who have no diagnosis (i.e. DIAGNOSIS = NON ). So, I would like to have a model of this type which I created indicator variables for the different diagnoses :

R: factor(A) factor(B) fator(NON)

SAS: class A B NON;

log ( OUTCOME ) = Spline ( AGE ) + NON + A*Spline( TIME_SINCE_DIAGNOSIS ) + B*Spline ( TIME_SINCE_DIAGNOSIS ) + offset ( logrisktime )

Are you with me ?

So I want to create a unique spline of the variable “time since diagnosis” for each level of the variable “diagnosis” (which is done with diagnosis*spline(time_since_diagnosis) ), but with the exception that one level (those who had no diagnosis) does not to have any spline assigned.

Idea ?

Thanks,
Peter

Question and Answer is proudly powered by WordPress.
Theme "The Fundamentals of Graphic Design" by Arjuna
Icons by FamFamFam