Thursday, December 07, 2006


Hi Tessa,
To compare the fit of two models, you can fit each one separately and compare them with the anova() function, which performs a likelihood ratio test.
I think you have to construct your response differently, and not use "replica" to group observations.
The response variable has to be a matrix, with a row for each replica and the counts for each patch in the columns. The explanatory variables will have as many entries as there are rows in the matrix. I now see that your covariates are patch-specific, which is more involved than what multinom() allows: multinom() only allows an offset with that structure.
In that case, maybe you will have to fit your data with a log-linear Poisson model, or use function vglm() from the "VGAM" library, which can also handle multinomial data. Agresti's Categorical Data Analysis has a lot of information on the analysis of multinomial data.
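A minimal sketch of that layout, using toy counts and a hypothetical two-level covariate (the real data would have one row per replica and five patch columns), fitting two multinom() models and comparing them with anova():

```r
library(nnet)  # provides multinom()

# toy response matrix: one row per replica, one column per patch
counts <- matrix(c(2, 10, 3,
                   1,  9, 5,
                   4,  6, 5,
                   3,  8, 4), nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("a", "b", "c")))
treat <- factor(c("ctrl", "ctrl", "exp", "exp"))  # hypothetical covariate

m0 <- multinom(counts ~ 1, trace = FALSE)      # null model
m1 <- multinom(counts ~ treat, trace = FALSE)  # adds the covariate
anova(m0, m1)  # likelihood ratio test between the two fits
```

The matrix response is what makes multinom() treat each row as a multinomial observation; with one row per (replica, patch) pair instead, the grouping is lost.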
Scripts for that book are available at
Cheers, Tom

Tuesday, December 05, 2006

Statistics Forum Functional Ecology Research School

Dear Tom,

I have tried to use the function multinom on my data. My experiment consists of multiple replicas of batches of 15 mites that distribute themselves over 5 patches, so the response variable is multinomially distributed. I included 3 more explanatory variables as well (idad, nceggs and ideggs), assuming for now that they are independent.

My data look like this (but I have some more replicas):

replica patch ncad idad ideggs nceggs
aaaa a 2 0 0 2
aaaa b 10 5 1 17
aaaa c 0 1 0 2
aaaa d 0 2 12 1
aaaa e 0 7 4 2
bbbb a 1 3 5 1
bbbb b 9 3 1 8
bbbb c 0 0 1 4
bbbb d 2 0 11 6
bbbb e 2 9 4 2

My questions are:

1. whether I have used the function multinom correctly? I have tried to find documentation that I could understand, but I failed. For example, Venables and Ripley spend less than a page on it.

2. how should I define "replica"? Now I have it as a factor, but I wonder if that is correct.

3. how to do model simplification on the multinom fit? Since the summary is a matrix and the drop1 function doesn't work for multinom, I don't know how to get to the minimal model. Maybe the most straightforward thing is to start with the simplest model, add terms, and see if the AIC decreases?

I will send you my data file by email.

This is what I did:

mites <- read.table("NcIdtryout.txt", header = TRUE)
model <- multinom(patch ~ replica * idad * nceggs * ideggs, weights = ncad, data = mites)

Thanks in advance,



Thursday, November 30, 2006

planned comparisons

Hi Tom,

the idea is to do genuinely planned comparisons, so that the error rate does not have to be adjusted (as it is in the simultaneous CIs method).

I found the following way (on the net) to do planned comparisons; what do you think of it?

cmat <- rbind("HL v LH" = c(0, 1, -1, 0), "HH v LH/HL" = c(2, -1, -1, 0), "LL v LH/HL" = c(0, -1, -1, 2))
fit <- glm(ei2 ~ kruising, contrasts = list(kruising = make.contrasts(cmat)))


Tuesday, November 28, 2006

planned comparisons

Hi Nicola,
you told me it's data on selection experiments. Are the data really three blocks, meaning crosses between selection lines started from three different base populations?

If you write it like this:
there are no main effects for cross, and you only want to compare crosses within blocks. That means that you only want contrasts between different groups in the same block.

To compare these groups, first look at the coefficients of the model, and what they represent.
Reformulating the model like this probably makes comparisons easier:
because all estimates are means per level.

With four crosses per block, are the coefficients of block 1 the first four coefficients in the model?
Say, you want to make simultaneous confidence intervals for all pairwise comparisons for this block:

# contrast matrix



# when this matrix is multiplied with the vector of coefficients,
# every element of the result is an estimated difference between two means.

# library(multcomp)

# psubset selects the parameters of the block you want to investigate.


You should probably make a contrast matrix for all 18 pairwise comparisons simultaneously.
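As an illustration of the idea in the comments above (the data and level names here are hypothetical toy values, not Nicola's), a contrast matrix for the pairwise differences among four level means, using a cell-means parameterisation so that each coefficient is a group mean:

```r
set.seed(1)
d <- data.frame(g = gl(4, 5, labels = c("HH", "HL", "LH", "LL")),
                y = rnorm(20))
fit <- lm(y ~ g - 1, data = d)  # cell-means coding: coefficients are group means

# each row of K, multiplied with the coefficient vector,
# gives an estimated difference between two means
K <- rbind("HH-HL" = c(1, -1,  0,  0),
           "HH-LH" = c(1,  0, -1,  0),
           "HH-LL" = c(1,  0,  0, -1),
           "HL-LH" = c(0,  1, -1,  0),
           "HL-LL" = c(0,  1,  0, -1),
           "LH-LL" = c(0,  0,  1, -1))
K %*% coef(fit)  # estimated pairwise differences
# simultaneous confidence intervals, if "multcomp" is installed:
# library(multcomp); confint(glht(fit, linfct = K))
```

With 18 comparisons over three blocks, the same idea applies: stack one row per comparison, with zeros in the columns of the other blocks.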





Wednesday, November 22, 2006

planned comparisons in a nested ANOVA

Hi Tom and others,

How do you perform planned comparisons within a nested ANOVA? I want to look at the differences between levels of the nested factor.

When trying to build the appropriate contrast coefficients, I couldn't make matrices built from a combination of 2 factors (so main factor:nested factor). Is this the right way forward at all? And if so, how? A book reference would be fine too!
Or would it be easier to just do t-tests?
Also, in the booklet of the course it was mentioned you could "construct appropriate independent variables" for planned comparisons. Is this the same as constructing contrast coefficients?

Thanks in advance,

Wednesday, November 15, 2006


Hi Tessa,


1) how do I define my distribution in a glm when I have multinomial data (5 patches)?

Well, the glm() function cannot handle multinomial data, but function multinom() from the "nnet" library can.
Another package which fits a multinomial probit model is "MNP"; there is also the "multinomrob" library. I think multinom() is easiest to use.

2) the same question but for a generalized linear mixed model (I would like to include time as a factor)?

When you need to add a random effect, "MNP" is the only option in R as far as I can see. Things become complicated in that case. Research on multinomial random effects models is still ongoing.
Some references: Multinomial logit random effects models, Statistical Modelling (2001) (J. Hartzel, A. Agresti, and B. Caffo); The analysis of ordered categorical data: An overview and a survey of recent developments, invited discussion paper for the Spanish Statistical Journal, TEST (2005) (I. Liu and A. Agresti).

3) In Crawley they talk about GAMs (generalized additive models). What I understand is that for these models you don't need to define a distribution because they use non-parametric smoothers. Can these models be used for multinomial data?

Generalized additive models use non-parametric smoothers to model a non-linear dependence of a response variable on a predictor variable.
The distribution of the response variable is among the list of usual suspects: binomial, Poisson, normal, gamma.
It has been claimed that package "gbm" (R) or "MART" (S-Plus) can do this for multinomial data, but I haven't tried those yet.

4) Spatial distributions are difficult. Package "spatstat" could contain the solution. I am not very familiar with analysing spatial patterns.

Wednesday, November 08, 2006

multinomial distribution

Hoi Tom and others

I have a couple of questions about how to analyse my data. The data are on spatial distributions in 2 predatory mite species. The mites were given a choice between five patches. I would like to test for differences in distribution when both species are present and compare it with a control where only one species is present.


1) how do I define my distribution in a glm when I have multinomial data (5 patches)?

2) the same question but for a generalized linear mixed model (I would like to include time as a factor)?

3) In Crawley they talk about GAMs (generalized additive models). What I understand is that for these models you don't need to define a distribution because they use non-parametric smoothers. Can these models be used for multinomial data?

4) The mites lay eggs during the experiment. I would like to know how the spatial distribution of the eggs, the eggs of the second species, and the second species itself influence the spatial distribution of one mite. I thought about ANCOVA, but my variables are not independent, so I suppose the correct way is to do several correlations? Or maybe multivariate analysis? I am also interested in interactions, so I wondered whether it is possible to include this information in my model.

Thanks in advance!


random effects

Hi Fransesc,
you seem to have only a small number of levels for the plot effects.
The consequence is that estimates of the variance between these random effects are very imprecise, and that it is not very solid to state that these levels are a representative sample from a large population of plot levels.
I would fit all effects as fixed.
Continuous variables such as time cannot define the levels of random effects, but they can appear in random regressions. Anyway, if you want to fit higher-order models of time, all you have to do is make new variables, like
and add them to the model.
In a model with interactions of these time variables with treatment (or other factors), different polynomials of time will be fit for each treatment group.
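For instance (variable names hypothetical, toy data), quadratic and cubic time terms can be constructed and crossed with treatment like this:

```r
# toy data with a hypothetical time variable and two treatment groups
dat <- data.frame(time = rep(1:6, 2),
                  treatment = gl(2, 6),
                  y = rnorm(12))
dat$time2 <- dat$time^2   # quadratic term
dat$time3 <- dat$time^3   # cubic term

# the interaction fits a separate polynomial of time per treatment group
fit <- lm(y ~ treatment * (time + time2 + time3), data = dat)
```

Equivalently, poly(time, 3) inside the formula fits the same degree of polynomial with orthogonal terms.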
Hope that helps,

mixed effects

Hi Tom, and the rest of the experts ;-)

I've been struggling quite a bit to fit a mixed effects model to my data. It is an RCB (or CR) design, with repeated measurements (RM) and, as said, both fixed and random effects. Treatment (fixed) = 2, plot (random) = 3 and nested in treatment. There is also replicate (random) = 2, nested in plot, but this is the lowest level of variability, so I won't include this term.
Now here come the uncertainties: Time...I think is fixed...and treatment is also nested in time, because in the course of time the effect of the treatment seems to be different. The graph shows a large hump, which makes me suspect a higher-order term (time^2 or even time^3) . My problem is that I can't seem to find a relatively clear and simple explanation on how to deal with this. Crawley certainly doesn't treat it as it does the glm's. I reckon it goes a bit too far for a glm...
I hope I explained it clear enough...


Monday, November 06, 2006

variable lengths

Hi Cas,
something has gone wrong with attaching the data, or with calculating new variables from old ones. The error message you get means that different variables in your model have different lengths (so there's a mismatch).
For example, for a model of the form y ~ x + z,
you can check the lengths of x, z and y with length().
The numbers returned should all be the same for a correct model specification. That makes it easy to spot where things went wrong.
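A small self-contained illustration of this check (the variables here are hypothetical stand-ins for the ones in the model):

```r
# hypothetical model variables; y is deliberately one element short,
# as in a typical mismatch after recalculating a variable
x <- rnorm(10)
z <- rnorm(10)
y <- rnorm(9)

sapply(list(y = y, x = x, z = z), length)  # y stands out as the culprit
# lm(y ~ x + z) would now fail with "variable lengths differ"
```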
good luck, Tom
Hi Tom and others of the GLM-R course,

I am trying to run a GLM, but get an error saying "variable lengths differ (found for 'density')". The variable "density" simply is a number from 3 to 7. Do you know what I am doing wrong?


Cas Eikenaar

Monday, October 30, 2006

mixed model

Hi Barbara,
I guess you want to fit a linear mixed model; nlme() is specifically for non-linear models.

Did you try this code (lme() from the "nlme" library), which assumes a normally distributed response:

model1 <- lme(mfreq ~ avergTemp + altitude, random = ~1 | country)

or this code for a Poisson response (glmmPQL() from the "MASS" library):

model2 <- glmmPQL(mfreq ~ avergTemp + altitude, random = ~1 | country, family = poisson)

or lmer() from the "lme4" library:

model3 <- lmer(mfreq ~ avergTemp + altitude + (1 | country), family = poisson, method = "Laplace")

but first you should check whether country is a factor!

Cheers, Tom
Hi Tom,
I'm having trouble getting my mixed model to run. The dependent variable gives frequencies, the fixed variables are numbers, and the random variable consists of categories:
model1<-nlme(mfreq~avergTemp+altitude+country, fixed=avergTemp+altitude~1,
but then I get the error message:
Error in eval(parse(text = paste("~1", deparse(groups[[2]]), : Invalid formula for groups.
Unfortunately I can neither figure out with Crawley nor with the help files what sort of parameter or function I need in this case for the random variable. Any suggestions?
Thanks a lot, Barbara

count data and dispersion

First of all,
long ago I picked up the idea that corrections for overdispersion are only necessary when the scale parameter is above 3 to 4. The idea is understandable (you should only correct when overdispersion is likely to be present), but the exact reason for the "3 to 4" limit is now unclear to me.
I tried to locate where I got the idea from, and it seems to be from here. I did not find any confirmation of that rule of thumb anywhere else.
However, Lindsey (1999) suggests, based on the analysis of examples, that corrections are necessary when the overdispersion parameter is at least two.

You should keep in mind, however, that overdispersion is impossible in some cases (so you should then never correct for it):

- when the dependent variable is a Bernoulli 0 - 1 variable
- when the maximal model considered is the saturated model

Overdispersion can also be caused by

- using the wrong link function
- a missing covariate
- the necessity to transform covariates
- outliers

so you should check these potential causes.
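A quick way to inspect the dispersion parameter of a fitted Poisson GLM, on simulated toy data (the threshold mentioned in the comment is the rule of thumb discussed above):

```r
set.seed(42)
x <- rep(1:5, each = 10)
y <- rpois(50, lambda = exp(0.2 * x))  # simulated, so roughly equidispersed
fit <- glm(y ~ x, family = poisson)

# Pearson estimate of the dispersion parameter
phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi  # values well above ~2 would suggest refitting with family = quasipoisson
```

The same summary(fit)$dispersion is 1 by construction for a poisson family; the Pearson statistic above is what a quasipoisson fit would report.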

Concerning underdispersion, Venables and Ripley (2002) show that dispersion estimates indicating underdispersion can be caused by small counts; see e.g. MASS4 p. 208.
The ratio of residual deviance to degrees of freedom can be seriously biased downward for extreme p (and small n) in the case of binomial data, and for low lambda in the Poisson case.

Instead of relying on rules of thumb, one can also model distributions for counts which allow both under- and overdispersion.
Some possibilities are the double binomial (Poisson) distribution and the multiplicative binomial (Poisson) distribution. Using AIC, you can then incorporate the decision to correct for under- or overdispersion into your model selection procedure.
You can fit GLMs with these distributions in R using the "gnlm" library by Jim Lindsey. It is not easy to work with.
I have adapted some examples from Lindsey 1999 Models for Repeated Measurements to indicate how you can fit such models.
The R input code is here, and the data file here.

Tuesday, October 24, 2006

count data and underdispersion

Hi Tom!
I have two questions;
1) If you have count data with underdispersion, from which value of the dispersion parameter should you choose to use the quasipoisson distribution? Is there a rule of thumb for underdispersion, like there is for overdispersion?
2) If you have count data with a high mean (17) and 'good' variance on the left but a short tail on the right: is it enough to use the quasipoisson distribution to correct for this, i.e. can you use the quasipoisson distribution when there is underdispersion on only one side? Or should you transform your data and use the normal distribution?

Monday, October 16, 2006

GLM course

The glm course has started this morning. Thomas Tully, Wolf Mooij and Tom Van Dooren are the instructors. All participants have been invited to become members of this blog.

Wednesday, August 02, 2006

Quickstart R

I've made a one-hour tutorial to get you started in R statistical software. It's stored at:

Tuesday, February 14, 2006

First Announcement

Welcome to the statistics weblog of the functional ecology school.
This forum can be used to post questions and ideas on the statistical analysis of ecological data.
It can also be used to keep in touch with other participants and instructors of statistics courses by the functional ecology research school.