Statistics Forum Functional Ecology Research School: 2006

Thursday, December 07, 2006

multinom

Hi Tessa,
To compare the fit of two models, you can first fit each one separately, and compare them using the anova() function. It will do a likelihood ratio test.
I think you have to construct your response differently, and not use "replica" to group observations.
The response variable has to be a matrix, with a row for each replica, and the counts on each patch in the columns. The explanatory variables will have as many entries as there are lines in the matrix. I now see that your covariates are patch-specific, which is more involved than what multinom() allows. Multinom() only allows an offset with that structure.
Maybe you will have to fit your data using a log-linear poisson model then, or function vglm() from the "VGAM" library can also handle multinomial data. Agresti, Categorical Data Analysis has a lot of info on the analysis of multinomial data.
Scripts for that book are available at http://www-stat.stanford.edu/~owen/courses/306a/Splusdiscrete2.pdf.
Cheers, Tom

Tuesday, November 28, 2006

planned comparisons

Hi Nicola,
you told me it's data on selection experiments. Are the data rally three blocks, meaning crosses between selection lines started from three different base populations?

If you write it like this:
model<-aov(egg~block+block:cross)
there there are no main effects for cross, and you only want to compare crosses within blocks. That means that you only want contrasts between different groups in the same block.

To compare these groups, first look at the coefficients of the model, and what they represent.
Reformulating the model like this probably makes comparisons easier:
model<-aov(egg~block:cross)
because all estimates are means per level.

With four crosses per block, are the coefficients of block 1, the first four coefficients in the model?
Say, you want to make simultaneous confidence intervals for all pairwise comparisons for this block:

# contrast matrix

cmatrix<-matrix(rep(0,6*4),nrow=6)
cmatrix[1,1]<-1;cmatrix[1,2]<--1;
cmatrix[2,1]<-1;cmatrix[2,3]<--1;
cmatrix[3,1]<-1;cmatrix[3,4]<--1;
cmatrix[4,2]<-1;cmatrix[4,3]<--1;
cmatrix[5,2]<-1;cmatrix[5,4]<--1;
cmatrix[6,3]<-1;cmatrix[6,4]<--1;

cmatrix

# when this matrix is multiplied with the vector of coefficients,
# every element of the result, is an estimated difference between two means.

# library(multcomp)

# psubset selects the parameters of the block you want to investigate.

simint(model,psubset=1:4,cmatrix=cmatrix)

You should probably make a contrast matrix for all 18 pairwise comparisons simultaneously.

cmatrix<-matrix(rep(0,3*6*4*3),nrow=3*6)
cmatrix[1,1]<-1;cmatrix[1,2]<--1;
cmatrix[2,1]<-1;cmatrix[2,3]<--1;
cmatrix[3,1]<-1;cmatrix[3,4]<--1;
cmatrix[4,2]<-1;cmatrix[4,3]<--1;
cmatrix[5,2]<-1;cmatrix[5,4]<--1;
cmatrix[6,3]<-1;cmatrix[6,4]<--1;

cmatrix[7,5]<-1;cmatrix[7,6]<--1;
cmatrix[8,5]<-1;cmatrix[8,7]<--1;
cmatrix[9,5]<-1;cmatrix[9,8]<--1;
cmatrix[10,6]<-1;cmatrix[10,7]<--1;
cmatrix[11,6]<-1;cmatrix[11,8]<--1;
cmatrix[12,7]<-1;cmatrix[12,8]<--1;

cmatrix[13,9]<-1;cmatrix[13,10]<--1;
cmatrix[14,9]<-1;cmatrix[14,11]<--1;
cmatrix[15,9]<-1;cmatrix[15,12]<--1;
cmatrix[16,10]<-1;cmatrix[16,11]<--1;
cmatrix[17,10]<-1;cmatrix[17,12]<--1;
cmatrix[18,11]<-1;cmatrix[18,12]<--1;

simint(model,psubset=1:12,cmatrix=cmatrix)

Wednesday, November 15, 2006

Multinomial

Hi Tessa,

Answers:

1) how do I define my distribution in a glm when I have multinomial data (5 patches)?

Well, the glm() function cannot handle multinomial data, but function multinom() from the "nnet" library can.
Another package which fits a multinomial probit model is "MNP", there is also the "multinomrob" library. I think multinom() is easiest to use.

2) the same question but for a generalized linear mixed model (I would like to include time as a factor)?

When you need to add a random effect, "MNP" is the only option in R as far as I can see. Things become complicated in that case. Research on multinomial random effects models is still ongoing.
Some references: Multinomial logit random effects models, Statistical Modelling (2001) (J. Hartzel, A. Agresti, and B. Caffo); The analysis of ordered categorical data: An overview and a survey of recent developments, invited discussion paper for the Spanish Statistical Journal, TEST (2005) (I. Liu and A. Agresti).

3) In Crawley they talk about GAM’s (general additive models). What I understand is that for these models you don’t need to define a distribution because they use non parametric smoothers. Can these models be used for multinomial data?

Generalized additive models use non-parametric smoothers to model a non-linear dependence of a response variable on a predictor variable.
The distribution of the response variable is among the list of usual suspects: binomial, poisson, normal, gamma.
It has been claimed that package "gbm" (R) or "MART" (S-Plus) can do this for multinomial data, but I haven't tried those yet.

4) Spatial distributions are difficult. Package "spatstat" could contain the solution. I am not very familiar with analysing spatial patterns.

Wednesday, November 08, 2006

random effects

Hi Fransesc,
you seem to have only a small number of levels for the plot effects.
The consequence will be that estimates of variance between these random effects are very imprecise, and that it is not very solid to state that these levels are a representative sample from a large population of plot levels.
I would fit all effects as fixed.
Continuous variables such as time cannot define levels random effects, but they can appear in random regressions. Anyway, if you want to fit higher-order models of time, all you have to do is make new variables, like
timesq<-time^2
and add them to the model.
In a model with interactions of these time variables with treatment (or other factors), different polynomials of time will be fit for each treatment group.
Hope that helps,
Tom

Monday, November 06, 2006

variable lengths

Hi Cas,
something has gone wrong with attaching the data, or calculating new variables from old ones. The error message you get means that different variables in your model have different lenghts (so there's a mismatch).
for example, for a model of the form
glm(y~x+z)
you can check the lengths of x,z and y by
length(y)
length(x)
length(z)
The numbers returned should all be the same for a correct model specification. That makes it easy to spot where things went wrong.
good luck, Tom

Monday, October 30, 2006

mixed model

Hi Barbara,
I guess you want to fit a linear mixed model, nlme is specific for non-linear models.

Did you try this code, which assumes a normally distributed response:

model1<-lme(mfreq~avergTemp+altitude, random=~1|country)
summary(model1)

or this code for poisson response:

library(MASS)
model2<-glmmPQL(mfreq~avergTemp+altitude, random=~1|country, family=poisson)
summary(model2)

library(lme4)
lmer(mfreq~avergTemp+altitude+(1|country), family=poisson, method="Laplace")

but first you should check whether country is a factor!
summary(country)

Cheers, Tom

count data and dispersion

First of all,
long ago I picked up the idea that corrections for overdispersion are only necessary when the scale parameters is above 3 to 4. The idea is understandable, you should only correct if likely to be present, but the exact reason for the "3 to 4" limit is now unclear to me.
I tried to locate where I got the idea from, and it seems to be from here. I did not find any confirmation of that rule of thumb anywhere else.
However, Lindsey (1999) suggests, based on the analysis of examples, that corrections are necessary when the overdispersion parameter is at least two.

You should keep in mind, however, that overdispersion is impossible in some cases (so you should then never correct for it):

- when the dependent variable is a Bernoulli 0 - 1 variable
- when the maximal model considered, is the saturated model

Overdispersion can also be caused by

- using the wrong link function
- a missing covariate
- the necessity to transform covariates
- outliers

so you should check these potential causes.

Concerning underdispersion, Venables and Ripley (2002) show that dispersion estimates indicating underdispersion can be caused by small counts. See e.g. MASS4 p.208.
The ratio of residual deviance to degrees of freedom can be seriously biased downward, for extreme p (and small n) in the case of binomial data, and for low lambda.

Instead of relying on rules of thumb, one can also model distributions for counts which allow both under-and overdispersion.
Some possibilities are the double binomial (poisson) distribution and the multiplicative binomial (poisson) distribution. Using AIC, you can then incorporate the decision to correct for under(over)dispersion into your model selection procedure.
You can do glm with these distributions in R, using the "gnlm" library by (Jim Lindsey. It is not easy to work with.
I have adapted some examples from Lindsey 1999 Models for Repeated Measurements to indicate how you can fit such models.
The R input code is here, and the data file here.

Monday, October 16, 2006

GLM course

The glm course has started this morning. Thomas Tully, Wolf Mooij and Tom Van Dooren are the instructors. All participants have been invited to become members of this blog.
Tom

Wednesday, August 02, 2006

Quickstart R

I've made a one-hour tutorial to get you started in R statistical software. It's stored at:
http://biology.leidenuniv.nl/~dooren/Rglimpse.html
Enjoy!
Tom

Tuesday, February 14, 2006

First Announcement

Welcome to the statistics weblog of the functional ecology school.
This forum can be used to post questions and ideas on the statistical analysis of ecological data.
It can also be used to keep in touch with other participants and instructors of statistics courses by the functional ecology research school.

Statistics Forum Functional Ecology Research School