How to structure data for SPSS (percentages for log-linear analysis)?


I am a bit desperate because I am writing my MSc thesis and I am not sure how to organize my data for use in SPSS.

My research examines the performance of mobile banner campaigns and how the 3 independent variables, Source (in-app or mobile web promotion), Style (animated or static banner) and Size (full screen or small banner), affect the dependent variables: Click-Through Rate ((total clicks / total impressions) * 100) and Conversion Rate ((total installs / total clicks) * 100).

Initially I was considering ANOVA tests like the ones below:

Click-Through Rate = β0 + β1*Source + β2*Style + β3*Size + β4*(Source*Style) +
                     β5*(Source*Size) + β6*(Style*Size) + β7*(Source*Style*Size)  + error 

Conversion Rate = β0 + β1*Source + β2*Style + β3*Size + β4*(Source*Style) + 
                  β5*(Source*Size) + β6*(Style*Size) + β7*(Source*Style*Size) + error

But my supervisor advises that I cannot use percentages within ANOVA and should use log-linear analyses instead.

Can you please advise me whether I can somehow still use ANOVA for my tests, or whether I need to use log-linear analyses, and how to structure my data for SPSS? My supervisor says I need to organize it into counts, but doesn't give me more details and I am a bit confused: what does that mean?
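To make "organize it into counts" concrete, here is a minimal sketch in R (the column names are my own, hypothetical ones) of the layout that is usually meant: one row per cell of the 2x2x2 design, with raw counts instead of rates. The same row layout works as an SPSS data sheet.

banners <- data.frame(
  Source      = rep(c("in-app", "mobile web"), each = 4),
  Style       = rep(c("animated", "static"), each = 2, times = 2),
  Size        = rep(c("full screen", "small"), times = 4),
  impressions = NA,  # total impressions served in that cell
  clicks      = NA,  # total clicks in that cell
  installs    = NA   # total installs in that cell
)
# CTR can then be modelled as clicks out of impressions (counts),
# e.g. with a logistic regression, rather than as a percentage:
# glm(cbind(clicks, impressions - clicks) ~ Source * Style * Size,
#     family = binomial, data = banners)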

Standard errors for the CV error curve using the boot package


Does anybody know how to obtain the standard errors for the CV error curve using the boot package? I understand the boot package can compute the K-fold CV error for a fitted model, but I'd like to know whether the same package gives the corresponding standard error.
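For what it is worth, cv.glm() only returns the point estimate of the CV error in its $delta component, with no standard error. A minimal sketch of one workaround (my own, not a built-in boot feature): repeat the K-fold split several times and take the standard deviation across repeats.

library(boot)
fit <- glm(mpg ~ wt + hp, data = mtcars)  # any fitted glm, as an example
cv_reps <- replicate(50, cv.glm(mtcars, fit, K = 10)$delta[1])
mean(cv_reps)  # K-fold CV error estimate
sd(cv_reps)    # spread across random fold assignments (not full sampling error)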

Thanks!

code for ordered probit model


I have a data set with 7 predictor variables and one dependent variable. The dependent variable has 4 ordered categories, so it is not binomial. I need to fit an ordered probit model, and I am looking for code in R, SPSS or Matlab.
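If R is an option, here is a minimal sketch using MASS::polr(), which fits an ordered probit when method = "probit" (the variable names are hypothetical):

library(MASS)
dat$y <- factor(dat$y, ordered = TRUE)  # the 4-category outcome
fit <- polr(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7,
            data = dat, method = "probit")
summary(fit)  # slope coefficients and the three cut-points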

linear regression intercept does not match


I have done a linear regression in R using the glm function. The calculated intercept is 0.98, but when I plot it, the line does not seem to hit the estimated intercept on the y axis; it is far below. Here are my data and code:

event = c(2.2, 6.4, 3.4, 10.2, 4.45, 2.65, 8.25, 4.65, 3, 6.5, 5.25,
8.65, 7.25, 6.4, 7.75, 7.45)

size = c(230208, 813178, 316617, 1531919, 576869, 270148, 1090947, 562643,
439885, 745741, 666454, 1078175, 924429, 784333, 1091289, 948062)

fit = glm(event ~ size)

Call:  glm(formula = event ~ size)

Coefficients:
(Intercept)         size  
  9.783e-01    6.528e-06  

Degrees of Freedom: 15 Total (i.e. Null);  14 Residual
Null Deviance:      83.08 
Residual Deviance: 2.849    AIC: 23.8




plot(size, event, col="blue", pch=16, xlab="size", ylab="event", ylim=c(0,12), frame.plot=FALSE, xlim=c(0,2000000), axes = F)

axis(side = 1,at = c(0,0.5e6,1e6,1.5e6),labels =  c(0,0.5e6,1e6,1.5e6))
axis(side = 2,at = seq(from = 0,to = 12,by = 0.5),labels = seq(from = 0,to = 12,by = 0.5))
abline(fit)

[plot of event vs. size with the fitted line]

Why is there this discrepancy? Am I missing something here? I have also checked the standard error, which is 0.27; the gap I see on the plot is still larger than that.
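Two quick checks that may help (a sketch): where the fitted line sits at size = 0, and where the plot region actually begins, since by default R pads the x-range by about 4%, so the drawn y-axis does not sit at size = 0.

predict(fit, newdata = data.frame(size = 0))  # should return ~0.98
par("usr")[1]  # left edge of the plot region after plot(); it is negative
               # here, so the y-axis is drawn to the left of size = 0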

Thank you.

How to choose data for training a predictive model for attrition prediction


I am trying to build a predictive model for attrition at a service desk / call center.

I have daily data on the following parameters:

1. Call quality – QTM (0–100%)
2. No. of calls – Calls (number)
3. Attendance
4. Customer feedback (1/0), Q1 and Q2 (0–100%)

for both agents who left the job and those who are still there, over a duration of 6 months.

Aim: to predict an agent's tendency/probability of staying or leaving based on his/her daily performance.

Doubts I have:

1. How should I use the data to train the model (logistic regression)? Should it be trained on the averages of the parameters taken over the 6 months?
2. If so, can we then score the daily metrics with a model trained on the 6-month means of the parameters?

Please advise.

This is my first attempt at building a predictive model. I have gone through various case studies/models, such as the Titanic survival model using logistic regression and the Wisconsin DEWS model.

I decided to model using the weekly aggregates of the two populations (attrites and non-attrites).

The data set (approx. 5 months of data, with weekly aggregates of the two populations, i.e. attrites and non-attrites):

AW1: Week 1 aggregates of the performance metrics for attrites
NAW1: Week 1 aggregates of the performance metrics for non-attrites

[screenshot of the data set]

After this I ran a logistic regression on 80% of this data set and kept aside the other 20% for testing.
Results of the logistic regression: [screenshot of the glm output]

I then used the predict function on the 20% test set, which contained 3 data points each for attrites and non-attrites. To be 100% accurate the model should have predicted 3 attrites and 3 non-attrites; the actual result was 5/6 correct, i.e. one wrong prediction out of 6.

Please help me interpret the results of the model: all the z values are zero and I am not sure what that signifies.

I googled a little regarding the z values = 0 issue and came across some posts on Stack Overflow suggesting "bayesglm" instead of "glm". I did that and the results look good at first glance, but being a newbie in the field I would like guidance on the statistical significance of the issue: is the model really as good as the "bayesglm" results suggest, or is it just a fluke?

Results with bayesglm: [screenshot of the bayesglm output]

The model now gives a 100% accurate prediction on the test set, 6/6.
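For reference, a minimal sketch of the glm vs. bayesglm comparison described above (the column names are hypothetical; "left" is the 1/0 attrition outcome):

library(arm)  # provides bayesglm()
fit_glm   <- glm(left ~ QTM + Calls + Attendance + Q1 + Q2,
                 data = train, family = binomial)
fit_bayes <- bayesglm(left ~ QTM + Calls + Attendance + Q1 + Q2,
                      data = train, family = binomial)
# z values near 0 with huge standard errors in fit_glm are a classic
# sign of complete separation; bayesglm's weak prior keeps the
# estimates finite.
summary(fit_bayes)
pred <- predict(fit_bayes, newdata = test, type = "response") > 0.5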

What GLM family and link function for “proportion of time”?


A simple question to which I don’t seem to find the answer anywhere.

I have a response variable, the duration of time spent doing A, for individuals tested for $\text{max duration} = X$. The variance is therefore likely to be much wider at durations around $0.5X$ than near $0$ or $X$.

However, I am confused about what family and link function I need, as time is continuous but cannot be less than $0$ or more than $X$.

PS: I am using R, so any code examples would be appreciated.

Update
I have heard about the tobit() function from "AER", which "is a convenience interface to 'survreg' setting different defaults and providing a more convenient interface for specification of the censoring information."

Therefore, it seems for my data I could run the model

tobit(duration.A ~ factor1*factor2, left=0, right=X)

However, it is unclear (A) how I could do this for a mixed model, i.e. a model with random factors, (B) what its assumptions are (they are not clear from the R documentation http://cran.r-project.org/web/packages/betareg/vignettes/betareg.pdf), and (C) how I could get residuals and fitted values for plotting.
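In the meantime, one hedged alternative I am considering (my own sketch, with hypothetical variable names): rescale the response to a proportion and fit a quasi-binomial GLM, which respects the [0, X] bounds and the larger variance near the middle; beta regression (betareg, or glmmTMB's beta_family() for random factors) would be another route.

dat$p <- dat$duration.A / X  # proportion of the maximum duration
fit <- glm(p ~ factor1 * factor2,
           family = quasibinomial(link = "logit"), data = dat)
summary(fit)
fitted(fit)                       # fitted values for plotting
residuals(fit, type = "pearson")  # residuals for plotting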

What exactly does a Type III test do?


I'm having trouble understanding what exactly the Type III test statistic does. Here is what I got from my book:

“Type III” tests test for the significance of each explanatory variable, under the assumption that all other variables entered in the model equation are present.

My questions are :

  1. What exactly does "all other variables entered in the model equation are present" mean? Say I have a Type III test statistic for variable $x_i$: does the Type III test tell us whether the coefficient in front of $x_i$ is equal to zero or not?

  2. If so, what is the difference between the Type III test statistic and a Wald test? (I believe they are essentially two different things, since SAS gives me two different numerical outputs.) Currently I have both outputs for my independent variables (which are all dummy variables), and I don't know which p-value to look at to decide which $x$ to drop.
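A small R illustration of the distinction I mean (a sketch, with hypothetical data and factors a and b):

library(car)  # provides Anova()
fit <- glm(y ~ a + b, family = binomial, data = dat)
Anova(fit, type = "III")  # one test per term, all of its dummies jointly
summary(fit)              # one Wald z-test per individual dummy coefficient
# For a factor with more than 2 levels the two differ: the Type III test
# asks whether the whole factor can be dropped, not one dummy at a time.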

SAS syntax to find differences with respect to a control treatment


I am working with a data set of bacterial cell counts, using flow cytometry. I recorded the cell number in 3 different species of bacteria, all treated with 3 different compounds (L-aspartic acid, L-asparagine and D-sorbitol), at 2 time points. My data set looks like this:

[screenshot of the data set]

The cell count is recorded as the total number (tot) and the number of intact cells (int). I want to test whether the total number of cells differs among species or among treatments. I would also like to check that there are no differences among time points within species. Further, I have to repeat the same analysis for the intact cells. I have the following syntax in SAS:

proc mixed data=flowcyt;
class count comp species;
model cells = comp|species / outp=resid1 ddfm=kr;
repeated / group=comp;
lsmeans comp / pdiff adjust=tukey cl;
ods output diffs=ppp lsmeans=mmm;
run;

But with this I do not get the differences by cell type (either total cell number or intact cell number). Therefore, I wrote the following:

proc mixed;
class count tpt comp;
model cells= tpt|comp /ddfm=kr;
random count;
lsmeans tpt|comp/pdiff;
contrast 'LAS - Ctr' Comp -1 0 -1 0 2 0;
contrast 'LAA - Ctr' Comp 1 1 1 -1 -1 -1;
contrast 'Sor - Ctr' Comp 0 0 0 0 1 -1;
run;

However, I do not get the differences among species.

Hence, I thought about using GLM, but if I write it like the above syntax, I will get the differences on each level and I have to separate the data set.

proc glm;
class species;
model cells= comp;
means comp/snk duncan scheffe;
run;
quit;

With the following I get a huge output that I am not sure is correct:

proc glm;
class species count tpt comp;
model cells= tpt|comp|species;
means tpt|comp|species/ snk scheffe duncan ;
run;
quit;

Finally, I tried this:

proc mixed data= flowcyt covtest;
class species comp tpt;
model cells= comp|tpt/ddfm= kr;
repeated tpt/ subject=species type=cs;
lsmeans tpt|comp/ pdiff;
run;
quit;

and again, I cannot get the differences in cell count among bacterial species.

I am at a loss at this moment and I think I am a bit stuck. Does anyone know what the right syntax would be to get the differences in cell counts among the 3 bacterial species, within each time point, and to determine whether the cell counts differ significantly from the control treatment?
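One possibility (a sketch, assuming the control level of comp is literally coded 'Ctr', as in the contrast labels above): LSMEANS with PDIFF=CONTROL and a Dunnett adjustment compares each compound to the control, and a BY statement keeps the comparisons within each time point.

proc sort data=flowcyt; by tpt; run;
proc mixed data=flowcyt;
  by tpt;  /* separate analysis within each time point */
  class species comp;
  model cells = species|comp / ddfm=kr;
  /* Dunnett-type comparisons of each compound against the control */
  lsmeans comp / pdiff=control('Ctr') adjust=dunnett;
  /* pairwise differences among the 3 species */
  lsmeans species / pdiff adjust=tukey;
run;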

Thanks for your time!

Two simple questions regarding GLM


I'm currently doing a modelling project. However, I haven't taken many statistics classes, so I am teaching myself generalized linear models. I'm reading Generalized Linear Models for Insurance Data (de Jong and Heller, 2008, CUP), and I have two questions:

1. On page 64, it says:

Given a response $y$, the generalized linear model is $f(y) = c(y, \phi)\exp\left\{\frac{y\theta - a(\theta)}{\phi}\right\}$. The equation for $f(y)$ specifies that the distribution of the response is in the exponential family.

Is that the equation for the distribution of $y_i \mid x_i$ or some other thing? If it is the distribution of $y$ corresponding to a fixed $x_i$, is it possible that even if the plot of $y$ against $x$ looks like a straight line, I should still use a GLM instead of simple regression?

Update: I guess I should clarify myself a little bit. Currently I have a data set and my dependent variable is $y$. I made a histogram of $y$ (with frequency on the y-axis) and it looks like a gamma curve fits well. Does that essentially imply that I should choose $f(y)$ to be gamma? I rather doubt it, because I suppose $y_i \mid X = x_i$ and $Y$ are essentially two different things. I hope I'm not confusing you guys.

2. The book suggests that when the response $y$ is assumed to follow a gamma distribution, it is common practice to use a logarithmic link function. I don't quite understand the reason behind that.
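A short R sketch of the practical effect (my own, not from the book): with a log link the fitted mean is guaranteed positive, as a gamma mean must be, and covariate effects become multiplicative.

fit <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = dat)
exp(coef(fit))  # multiplicative effects on the (always positive) mean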

Any suggestion would be great. Thanks!

Zero inflated model problem: system is computationally singular


I'm using R. After getting an error asking me to provide starting values for a glm (Poisson family), I took a look at my data and realized I had quite a few zeroes. So I tried zeroinfl from pscl. I got the "computationally singular" error, so I tried dist="negbin". Same error. I looked at my data via with(bytype, table(Events, Type)) and with(bytype, table(Events, Generation)). It looks like my data are wonky, but I don't know how to deal with that. I think I need to cut part of it out, but the easy attempts (dropping the group with the most zeroes) didn't work.

First, my basic model: Events ~ Generation*Type + offset(log(Population))

Second, data summaries:

by Event Type:

Events Familicide Home.Intruder Rampage Sabotage School Vehicular Workplace
     0         49            88      59      103     91       111        94
     1         40            21      37       10     11         1        17
     2         18             4      11        0     10         1         2
     3          4             0       5        0      1         0         0
     4          2             0       1        0      0         0         0

by Generation:

Events Missionary Lost  GI Silent Boomers   X
     0        139  103 110    103     106  34
     1         24   17  15     29      34  18
     2          4    5   1     13      17   6
     3          1    1   0      1       3   4
     4          0    0   0      1       1   1

Due to posting length limits, my entire data set won't fit. I hope this is enough to go on.

Added in edit: "Generation" is ordinal and I've set contrasts to c("contr.sum", "contr.poly"); I don't know whether that matters in this case.
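For reference, a sketch of the fit described above. Since the singularity usually comes from empty cells in the Generation x Type interaction (several columns in the tables above have no counts beyond 0 or 1), an intercept-only zero-inflation part and/or dropping the interaction are the next things I plan to try:

library(pscl)
fit <- zeroinfl(Events ~ Generation * Type + offset(log(Population)) | 1,
                data = bytype, dist = "negbin")  # "| 1" makes the
                                                 # zero-inflation part
                                                 # intercept-only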
