Introduction: Generalized linear mixed-effects models (GLMMs) are an extension of linear mixed-effects models (Module M.5) used when there is clustering (i.e., nested data structures) or non-independence (i.e., repeated measurements) among observations and the response variable is not normally distributed. As generalized linear models (Module M. 6), GLMMs consider distributions from an exponential family (e.g., binomial, Poisson) but can also incorporate random effects.
In this module, you will expand your skills on linear models (Modules M.4) to GLMMs while testing associations between Cayo Santiago rhesus macaque male social dispersal and population density.
Upon completion of this module, you will be able to:
References:
Extra training:
Associated literature:
Notation:
Commands:
Base R:
lmer4:
MuMIn:
sjPlot:
tidyverse:
In this module, you will use a subset of the Cayo Santiago rhesus macaque data of male social dispersal events from 1983 to 2018 (disp; Table 1) to explore associations between the annual probability of social dispersal and population density in male rhesus macaques. This is authentic data collected through daily visual censuses performed by the staff of the Caribbean Primate Research Center (CPRC) of the University of Puerto Rico-Medical Sciences Campus.
Before coding, explore the data in Table 1.
Metadata of disp: Social dispersal data of 1,611 adult male rhesus macaques born between mating seasons 1983 and 2018 at Cayo Santiago.
As before, check the data and understand its attributes. Refer to Module M.1 for data structure exploration. Code first before clicking on the answer!
Guiding questions:
disp is of class data.frame.
Click for
Answer!
# structure of disp
str(disp)
disp has 7 variables with up to 8,654 observations;
“ID”, “mating_season”, “onset_mating”, “dob”, “dispersal”, “pop_dens”,
“group_dens”.
Click for
Answer!
Now, take time to think about the research question: Are there associations between social dispersal and population density in relation to the available data.
Guiding questions:
dispersal (response variable), pop_dens
(explanatory variable), group_dens (explanatory variable)
Click for
Answer!
Age at dispersal because age is a strong predictor of dispersal.
Click for
Answer!
Challenge!
Is the dispersal data nested? If yes, what do you propose for the modeling approach?
Another important variable explaining the probability of annual dispersal for males could be age at the onset of the mating season. Estimate this new variable and add it to disp. Hint: You created a similar variable in Module M.1.
When fitting linear models, it is important to ensure that you do not include explanatory variables that correlate strongly in the same model (i.e., multicollinearity). It is difficult for models to differentiate effects between strongly correlated variables because both variables change in unison. Thus, multicollinearity may reduce model inference.
As a refresher, a correlation coefficient that is closer to 1 or -1 indicates a strong correlation, whereas a coefficient closer to 0 indicates a weak correlation. There is no rule of thumb about what level of correlation is considered “too strong” to exclude in the same model. For the purposes of this analysis, you will exclude variables with an r > 0.5 or < -0.5.
Below, you will use the function cor() to estimate the pearson correlation coefficient r among the predictor variables.
# column names in disp
names(disp)
# correlation analysis for density variables in columns 6 to 7
cor(disp[,c(6:7)], method = c("pearson")) # [,c(6:7)] selects columns 6 through 7
Output guidance: the cor() function produces a matrix showing r for all combinations of variables. Each variable is represented by a column and a row. Each variable will correlate with itself by 1 (diagonal = 1).
Guiding questions:
No, group_dens and pop_dens are weakly
correlated.
Click for
Answer!
The strongest correlation coefficient is between group_dens
and pop_dens. This is expected; as the total population size
increases, the size of each group is likely to increase.
Click for
Answer!
Click for
Answer!
No, no predictor variables correlate more than 0.5.
Now that you confirmed that none of the predictor variables are strongly correlated (-0.5 < r < 0.5), you can fit a GLMM. For this, you will use R package lme4 which includes the function glmer().
The function glmer() is very similar to lmer() (refer to Module M.5). The main difference is that glmer() uses different distributions for the family argument. Since your response variable is binary, you will be using the argument family = binomial.
Below you will test associations between age and density and the annual probability of dispersal. For this analysis, you will include a random intercept for mating_season to account for unobserved temporal variation. As you did in Module M.5, for glmer(), you also need to specify the structure of the random effects; (slope | intercept).
# loading lme4
library(lme4)
# GLMM
glmm1 <- glmer(dispersal ~ age_onset + pop_dens + group_dens + (1|mating_season), data = disp, family = "binomial")
# model output
summary(glmm1)
Note. Even though the model ran, a warning message appeared. You should not ignore these messages because the fitted model may not produce reliable coefficients. Below, you will learn how to deal with this issue.
After fitting glmm1, you got the following warning message:
# Warning: Model is nearly unidentifiable: very large eigenvalue
# - Rescale variables?
This warning suggests you rescale the variables. This is likely caused by the different ranges of the predictor variables; while age_onset and group_dens range from ~1 to 32, pop_dens ranges from 45 to 103. The simplest way to rescale variables is to subtract each observation by its mean (i.e., \(X_i\) \(-\) \(\bar{X}\)̄). This is called centering the variable. Only numeric variables should be centered.
To rescale the variables, subtract the entire column from the variable mean.
# Centering age at onset of mating season
disp$c_age_onset <- disp$age_onset-mean(disp$age_onset)
# range
range(disp$c_age_onset)
# Centering population density
disp$c_pop_dens <- disp$pop_dens-mean(disp$pop_dens)
# range
range(disp$c_pop_dens)
# Centering group density
disp$c_group_dens <- disp$group_dens-mean(disp$group_dens)
# range
range(disp$c_group_dens)
# viewing the updated data frame
View(disp)
Now, fit a new GLMM with the centered predictors to see if this fixed the issues.
# Creating a test model with centered predictor variables.
glmm2 <- glmer(dispersal ~ c_age_onset + c_pop_dens + c_group_dens + (1|mating_season), data = disp, family = "binomial")
# model output
summary(glmm2)
The warning no longer appears. You can now move on to creating models using the centered predictors.
To address the research question, you identified three predictor variables (age_onset, pop_dens, and group_dens). However, you do not know which combination of these variables best describes the dispersal data (i.e., none of these variables, only one of these variables, two of them, or all three). Therefore, you must create a model for each possible combination of the variables and compare them. This also includes an intercept-only model (i.e., the null model which has no fixed effects).
First, create the intercept-only model. When creating such model with no fixed effects, you add a 1 to the model argument. As our GLMM includes random effects, you still need to include them in the null model.
# Null model
glmm.null <- glmer(dispersal ~ 1 + (1|mating_season), data = disp, family = "binomial")
# model output
summary(glmm.null)
This null model will be compared to each model representing all combinations of predictors.
Challenge!
Now that you have fitted all competing models, you will compare them using the Akaike’s Information Criterion (AIC). The AIC is a widely used criterion that evaluates how well a model fits the data while also penalizing it for using a larger number of predictor variables (parameters). In other words, the AIC allows you to determine which model has the best balance between fit and parsimony. For this, you will use R package MuMIn which includes the function model.sel(). This function will produce an AIC table.
# installing MuMIn
install.packages("MuMIN",repos="http://cran.us.r-project.org")
# loading MuMIn
library(MuMIn)
# Model selection
model.sel(glmm.null,m1,m2,m3,m4)
Model selection output guide: The first column includes the name of each model. The next few columns list the intercept estimates (Int) and the fixed effects. The rest of the columns are as follows:
The top model is the one with the lowest AICc. Prior studies suggest that models with a \(\Delta\)AIC > 2 are significantly different. Therefore you can conclude that the model including the fixed effects of age at the onset of mating season and population density is the most supported model (df = 4, AICc = 5035.2, weight = 0.586)
Challenge!
Use your acquired skills from prior modules to complete the following:
Discussion question: Does social dispersal appear to be driven by population and or social group density? Explain.
Acknowledgements: The creation of this module was funded by the National Science Foundation DBI BRC-BIO award 2217812. Cayo Santiago is supported by the Office of Research Infrastructure Programs (ORIP) of the National Institutes of Health, grant 2 P40 OD012217, and the University of Puerto Rico (UPR), Medical Sciences Campus. We acknowledge the use of BioRender.com to build Fig 1.
Authors: Alexandra L. Bland and Raisa Hernández-Pacheco, California State University, Long Beach