Introduction: Generalized linear mixed-effects models (GLMMs) are an extension of linear mixed-effects models (Module M.5) used when there is clustering (i.e., nested data structures) or non-independence (i.e., repeated measurements) among observations and the response variable is not normally distributed. As generalized linear models (Module M. 6), GLMMs consider distributions from an exponential family (e.g., binomial, Poisson) but can also incorporate random effects.

In this module, you will expand your skills on linear models (Modules M.4) to GLMMs while testing associations between Cayo Santiago rhesus macaque male social dispersal and population density.



Upon completion of this module, you will be able to:


References:


Extra training:


Associated literature:


Expand for R Notation and functions index

Notation:


Commands:




Testing associations between social dispersal and population density


Rhesus macaque populations are socially organized into multi-male, multi-female social groups (Fig 1). At Cayo Santiago, these groups formed naturally and most females remain in their natal groups their entire life. However, males migrate out of their natal social group at puberty (social dispersal; Fig 1). This behavior likely evolved to avoid inbreeding. Although natal social group dispersal is expected, individual males show high variability in the timing of migration. One important factor driving annual social dispersal in males could be high population density due to increased competition. As Cayo Santiago is an 15.2 ha island, metrics of population density are accurately estimated.
**Fig 1**. Multi-male, multi-female social group of rhesus macaques in Cayo Santiago (left) and a diagram for male social group dispersal (right). Photo by Raisa Hernandez-Pacheco.

Fig 1. Multi-male, multi-female social group of rhesus macaques in Cayo Santiago (left) and a diagram for male social group dispersal (right). Photo by Raisa Hernandez-Pacheco.


In this module, you will use a subset of the Cayo Santiago rhesus macaque data of male social dispersal events from 1983 to 2018 (disp; Table 1) to explore associations between the annual probability of social dispersal and population density in male rhesus macaques. This is authentic data collected through daily visual censuses performed by the staff of the Caribbean Primate Research Center (CPRC) of the University of Puerto Rico-Medical Sciences Campus.

Before coding, explore the data in Table 1.


Table 1: Cayo Santiago rhesus macaque male social dispersal data

Metadata of disp: Social dispersal data of 1,611 adult male rhesus macaques born between mating seasons 1983 and 2018 at Cayo Santiago.



1. Exploring data structure and performing quality check

As before, check the data and understand its attributes. Refer to Module M.1 for data structure exploration. Code first before clicking on the answer!


Guiding questions:

  1. What type of data is disp?

Click for Answer!
# structure of disp
str(disp)

disp is of class data.frame.


  1. How many variables and observations does disp have?

Click for Answer!

disp has 7 variables with up to 8,654 observations; “ID”, “mating_season”, “onset_mating”, “dob”, “dispersal”, “pop_dens”, “group_dens”.


Now, take time to think about the research question: Are there associations between social dispersal and population density in relation to the available data.


Guiding questions:

  1. Which variables can be explored to answer the research question?

Click for Answer!

dispersal (response variable), pop_dens (explanatory variable), group_dens (explanatory variable)


  1. Are there other important variables in the dataset that could or should be controlled for in the model?

Click for Answer!

Age at dispersal because age is a strong predictor of dispersal.


Challenge!

  1. Is the dispersal data nested? If yes, what do you propose for the modeling approach?

  2. Another important variable explaining the probability of annual dispersal for males could be age at the onset of the mating season. Estimate this new variable and add it to disp. Hint: You created a similar variable in Module M.1.


2. Testing for multicollinearity

When fitting linear models, it is important to ensure that you do not include explanatory variables that correlate strongly in the same model (i.e., multicollinearity). It is difficult for models to differentiate effects between strongly correlated variables because both variables change in unison. Thus, multicollinearity may reduce model inference.

As a refresher, a correlation coefficient that is closer to 1 or -1 indicates a strong correlation, whereas a coefficient closer to 0 indicates a weak correlation. There is no rule of thumb about what level of correlation is considered “too strong” to exclude in the same model. For the purposes of this analysis, you will exclude variables with an r > 0.5 or < -0.5.

Below, you will use the function cor() to estimate the pearson correlation coefficient r among the predictor variables.

# column names in disp
names(disp)

# correlation analysis for density variables in columns 6 to 7
cor(disp[,c(6:7)], method = c("pearson")) # [,c(6:7)] selects columns 6 through 7

Output guidance: the cor() function produces a matrix showing r for all combinations of variables. Each variable is represented by a column and a row. Each variable will correlate with itself by 1 (diagonal = 1).


Guiding questions:

  1. Do the density variables strongly correlate? Note, you may also want to correlate age at the onset of the mating season.

Click for Answer!

No, group_dens and pop_dens are weakly correlated.


  1. After considering age, what is the relative strongest correlation? Why do you think this is the strongest one?

Click for Answer!

The strongest correlation coefficient is between group_dens and pop_dens. This is expected; as the total population size increases, the size of each group is likely to increase.


  1. Do any predictor variables correlate more than 0.5?

Click for Answer!


No, no predictor variables correlate more than 0.5.


3. Fitting generalized linear mixed-effects models

Now that you confirmed that none of the predictor variables are strongly correlated (-0.5 < r < 0.5), you can fit a GLMM. For this, you will use R package lme4 which includes the function glmer().


A. Fitting a generalized linear mixed-effects model

The function glmer() is very similar to lmer() (refer to Module M.5). The main difference is that glmer() uses different distributions for the family argument. Since your response variable is binary, you will be using the argument family = binomial.

Below you will test associations between age and density and the annual probability of dispersal. For this analysis, you will include a random intercept for mating_season to account for unobserved temporal variation. As you did in Module M.5, for glmer(), you also need to specify the structure of the random effects; (slope | intercept).

# loading lme4
library(lme4)

# GLMM
glmm1 <- glmer(dispersal ~ age_onset + pop_dens + group_dens + (1|mating_season), data = disp, family = "binomial")

# model output
summary(glmm1)

Note. Even though the model ran, a warning message appeared. You should not ignore these messages because the fitted model may not produce reliable coefficients. Below, you will learn how to deal with this issue.


B. Troubleshooting: rescaling predictors

After fitting glmm1, you got the following warning message:

# Warning: Model is nearly unidentifiable: very large eigenvalue
#  - Rescale variables?

This warning suggests you rescale the variables. This is likely caused by the different ranges of the predictor variables; while age_onset and group_dens range from ~1 to 32, pop_dens ranges from 45 to 103. The simplest way to rescale variables is to subtract each observation by its mean (i.e., \(X_i\) \(-\) \(\bar{X}\)̄). This is called centering the variable. Only numeric variables should be centered.


To rescale the variables, subtract the entire column from the variable mean.

# Centering age at onset of mating season
disp$c_age_onset <- disp$age_onset-mean(disp$age_onset)

# range
range(disp$c_age_onset)

# Centering population density
disp$c_pop_dens <- disp$pop_dens-mean(disp$pop_dens)

# range
range(disp$c_pop_dens)

# Centering group density
disp$c_group_dens <- disp$group_dens-mean(disp$group_dens)
   
# range
range(disp$c_group_dens)  

# viewing the updated data frame 
View(disp)


Now, fit a new GLMM with the centered predictors to see if this fixed the issues.

# Creating a test model with centered predictor variables.
glmm2 <- glmer(dispersal ~ c_age_onset + c_pop_dens + c_group_dens + (1|mating_season), data = disp, family = "binomial")

# model output
summary(glmm2)

The warning no longer appears. You can now move on to creating models using the centered predictors.


C. Model selection with AIC for multiple competing models

To address the research question, you identified three predictor variables (age_onset, pop_dens, and group_dens). However, you do not know which combination of these variables best describes the dispersal data (i.e., none of these variables, only one of these variables, two of them, or all three). Therefore, you must create a model for each possible combination of the variables and compare them. This also includes an intercept-only model (i.e., the null model which has no fixed effects).

First, create the intercept-only model. When creating such model with no fixed effects, you add a 1 to the model argument. As our GLMM includes random effects, you still need to include them in the null model.

# Null  model
glmm.null <- glmer(dispersal ~ 1 + (1|mating_season), data = disp, family = "binomial")

# model output
summary(glmm.null)

This null model will be compared to each model representing all combinations of predictors.


Challenge!

  1. Create a list of competing models using centered predictors. Given your knowledge on the life history of rhesus macaques, you can assume that a male’s age will influence their annual probability of dispersal. Because of this, it is recommended to control for age in all competing models. When comparing models, the models should also have the same random effects structure. Thus, all competing models should also include a random intercept for mating_season. Hint: There is a maximum of 5 possible models, including the null model. Label the first model as m1, your second as m2, etc.


Now that you have fitted all competing models, you will compare them using the Akaike’s Information Criterion (AIC). The AIC is a widely used criterion that evaluates how well a model fits the data while also penalizing it for using a larger number of predictor variables (parameters). In other words, the AIC allows you to determine which model has the best balance between fit and parsimony. For this, you will use R package MuMIn which includes the function model.sel(). This function will produce an AIC table.

# installing MuMIn
install.packages("MuMIN",repos="http://cran.us.r-project.org")

# loading MuMIn
library(MuMIn)

# Model selection
model.sel(glmm.null,m1,m2,m3,m4)

Model selection output guide: The first column includes the name of each model. The next few columns list the intercept estimates (Int) and the fixed effects. The rest of the columns are as follows:

  • df: degrees of freedom of the model
  • logLik: log-likelihood
  • AICc: Corrected Alkaike information criterion
  • delta: the difference in AICc points between the top model and the current model
  • weight: model weight is a value between 0 and 1 that represents a proportion of the total amount of predictive power provided by the full set of models contained in the model being assessed)


The top model is the one with the lowest AICc. Prior studies suggest that models with a \(\Delta\)AIC > 2 are significantly different. Therefore you can conclude that the model including the fixed effects of age at the onset of mating season and population density is the most supported model (df = 4, AICc = 5035.2, weight = 0.586)


Challenge!


Use your acquired skills from prior modules to complete the following:

  1. Interpret the top model output.


  1. Plot the model results with sjPlot. Hint: Adding the argument: terms = “c_age_onset [all]” creates a smooth curve.


Discussion question: Does social dispersal appear to be driven by population and or social group density? Explain.


FIN



Acknowledgements: The creation of this module was funded by the National Science Foundation DBI BRC-BIO award 2217812. Cayo Santiago is supported by the Office of Research Infrastructure Programs (ORIP) of the National Institutes of Health, grant 2 P40 OD012217, and the University of Puerto Rico (UPR), Medical Sciences Campus. We acknowledge the use of BioRender.com to build Fig 1.


Authors: Alexandra L. Bland and Raisa Hernández-Pacheco, California State University, Long Beach