Palmetto binary logistic regression

Overview:

In this report we explore, a dataset of palmetto species. The dataset was originally collected by the author Abrahamson, W.G. to investigate survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 to 2017. Using a binary logistic regression to test feasibility of using different variables of plant height (height), canopy length (length), canopy width (width), and number of green leaves (green_lvs)(i.e.explanatory variables); and their relationship with the independent variable. We try to evaluate via two models whether these independent variables/ plant characteristics can help classify a correct palmetto specie (Serenoa repens or Sabal etonia).

Data source: Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017, ongoing at 5-year intervals ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/f2f96ec76fbbd4b9db431c79a770c4d5

knitr::opts_chunk$set(echo = TRUE,message = FALSE,warning = FALSE)
library(tidyverse)
library(here)
library(GGally)
library(broom)
library(jtools)
library(caret)
library(AICcmodavg)
library(equatiomatic) 
library(kableExtra)
library(kimisc)
library(sjPlot)

Data wrangling

# Convert species to a 'factor'
palmetto_clean <- read_csv(here('data','palmetto.csv'))  %>% 
  rename(length_cm = length, height_cm= height, width_cm=width) %>% 
  
  mutate(species = case_when(
                     species == 1 ~ "Serenoa repens",
                     species == 2 ~ "Sabal etonia",)) %>%
   mutate(species = as.factor(species))%>%
   # select(- comments, -year) %>% 
  select(height_cm,length_cm,width_cm,green_lvs,species)  %>% 
  drop_na()

Data visualization

p1 <-ggplot(data = palmetto_clean, aes(x = species, y = green_lvs, fill = species)) + 
  geom_violin(scale = "count", color = "#abbe9e") + # mirrored density plot
  geom_boxplot(color = "#abbe9e", fill = NA, width = 0.1, outlier.color = NA) + 
  stat_summary(fun=mean, 
               geom="point", 
               shape=20, 
               size=2, 
               color="#abbe9e", 
               fill="#abbe9e") +
  scale_fill_manual(values = c("#304545", "#597567")) + 
  theme_minimal(11) + # change theme and font size
  theme(legend.position = "none",
        axis.text.x = element_text(vjust = 5, size = 12)) + 
  labs(x = element_blank(), y = "Count of green leaves")

p2 <-ggplot(data = palmetto_clean, aes(x = species, y = height_cm, fill = species)) + 
  geom_violin(scale = "count", color = "#1a4da9") + 
  geom_boxplot(color = "#1a4da9", fill = NA, width = 0.1, outlier.color = NA) + 
  stat_summary(fun=mean,
               geom="point", 
               shape=20, 
               size=2, 
               color="#1a4da9", 
               fill="#1a4da9") +
  scale_fill_manual(values = c("#cceeff", "#b4bcdc")) + 
  theme_minimal(11) + 
  theme(legend.position = "none",
        axis.text.x = element_text(vjust = 5, size = 12)) + 
  labs(x = element_blank(), y = "Height of plant (cm)")

p3 <-ggplot(data = palmetto_clean, aes(x = species, y = length_cm, fill = species)) + 
  geom_violin(scale = "count", color = "#eed7da") + 
  geom_boxplot(color = "#eed7da", fill = NA, width = 0.1, outlier.color = NA) + 
  stat_summary(fun=mean,
               geom="point", 
               shape=25, 
               size=2, 
               color="#eed7da", 
               fill="#eed7da") +
  scale_fill_manual(values = c("#674655", "#94657a")) + 
  theme_minimal(11) + 
  theme(legend.position = "none",
        axis.text.x = element_text(vjust = 5, size = 12)) + 
  labs(x = element_blank(), y = "Canopy length (cm)")

p4 <-ggplot(data = palmetto_clean, aes(x = species, y = width_cm, fill = species)) + 
  geom_violin(scale = "count", color = "#e7d0d0") + 
  geom_boxplot(color = "#e7d0d0", fill = NA, width = 0.1, outlier.color = NA) + 
  stat_summary(fun=mean,
               geom="point", 
               shape=25, 
               size=2, 
               color="#e7d0d0", 
               fill="#e7d0d0") +
  scale_fill_manual(values = c("#e55763", "#5d040b")) + 
  theme_minimal(11) + 
  theme(legend.position = "none",
        axis.text.x = element_text(vjust = 5, size = 12)) + 
  labs(x = element_blank(), y = "Canopy width (cm)")



cowplot::plot_grid(
          p1, p2, p3, p4,
          nrow = 2,
          labels = c('A', 'B', 'C', 'D'), 
          label_size = 12, 
      
          align="hv")

Figure 1: Exploratory graphs of palmetto species vs other independent variables

Plot (A): Sabal etonial has on average lower green leaves count compared to Serenoa repens.
Plot(B): Both palmetto plant species seem to have similar plant heights.
Plot(C): Sabal etonia seems to have a slight longer canopy compared to Serenoa repens.
Plot(D): In terms of canopy width, both plant species have almost comparable canopy width.

Data analysis

Binary logistic regression

In this section, we explore the probability of a plant being either “Serenoa repens” or Sabal etonia” based on predictor variables. We perform the analysis twice, using cross validation to compare two models:

1. Log odds of plant type using plant height, canopy length, canopy width and green leaves as predictor variable.
1. Log odds of plant type using plant height, canopy width and green leaves (i.e., drop canopy length for this model)

# start by defining 
f1 <- species ~ height_cm + length_cm + width_cm + green_lvs
f2 <- species~ height_cm +  width_cm + green_lvs
                    
palmetto_blr1 <- glm(formula = f1,
                    data = palmetto_clean,
                    family = "binomial")

# palmetto_blr1
# summary(palmetto_blr1)

palmetto_blr2 <- glm(formula = f2,
                    data = palmetto_clean,
                    family = "binomial")

# palmetto_blr2
# summary(palmetto_blr2)

Model selection

Let’s compare the models using AICc

# AICcmodavg::aictab(list(palmetto_blr1, palmetto_blr2))
model_list <- nlist(palmetto_blr1, palmetto_blr2)
aic_table <- aictab(model_list)
aic_table_stats <-  aic_table %>%
  kable(caption="Table 1.Summary statistics of AIC values from Model 1 & Model 2 ") %>% 
  kable_styling(
                bootstrap_options = c("striped","hover","bordered","italic"), full_width = TRUE,position = "left", html_font="Cambria", font_size = 14) %>% 
  add_footnote(c("Data source:Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two
dominant palmetto species of south-central Florida from 1981 - 2017."), notation = "symbol") %>% 
  remove_column(1)

aic_table_stats

Table 1.Summary statistics of AIC values from Model 1 & Model 2
K	AICc	Delta_AICc	ModelLik	AICcWt	LL	Cum.Wt
5	5194.567	0.0000	1	1	-2592.281	1
4	5987.475	792.9086	0	0	-2989.736	1
^* Data source:Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017.

From table 1, we can see that the value of AICc from model 1 is lower with a value of 5194.57 and the AICc from model 2 is higher with a value of 5987.48.

Let’s compare the models using caret (“Classification And REgression Training”):

set.seed(123) 
# tr_ctrl <- trainControl(method = "cv", number = 10)
tr_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
 
# Train the model
model1 <- train(f1, data = palmetto_clean, 
               method = "glm", family = 'binomial',
               trControl = tr_ctrl)
# model1

model2 <- train(f2, data = palmetto_clean, 
               method = "glm", family = 'binomial',
               trControl = tr_ctrl)
# model2

Even though the two models were very close in performance, both AIC and cross-validation indicate model 1 as “best” model (i.e. lower values). To summarize our analysis, we will use the entire dataset, rather than testing/training sets, to identify the coefficients for the final predictive model, based on model 1.

final_mdl <- glm(formula = f1,
                    data = palmetto_clean,
                    family = "binomial")

broom::tidy(final_mdl) %>% 
   mutate(p.value = case_when( # rename p-values for readability
    p.value <0.001 ~ "<0.001"
  )) %>% 
  mutate(estimate = round(estimate,3),std.error = round(std.error,3)) %>% 
  mutate(term = case_when(
         term == "(Intercept)" ~ "Intercept",
         term == "height_cm" ~ "Height (cm)",
         term == "length_cm" ~ "Length (cm)",
         term == "width_cm" ~ "Width (cm)",
         term == "green_lvs" ~ "Green leaves count")) %>% 
  select(-statistic) %>% 
  rename( " " = term, "Coefficient" = estimate, "Standard error" = std.error, "P-Value"=p.value
) %>% 
  
  kable(align="lccc",caption="Table 2.Summary statistics of Model 1") %>% 
  kable_styling(
                bootstrap_options = c("striped","hover","bordered","italic"), full_width = TRUE,position = "left", html_font="Cambria", font_size = 14) %>% 
  add_footnote(c("Data source:Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two
dominant palmetto species of south-central Florida from 1981 - 2017."), notation = "symbol")

Table 2.Summary statistics of Model 1
	Coefficient	Standard error	P-Value
Intercept	-3.227	0.142	<0.001
Height (cm)	0.029	0.002	<0.001
Length (cm)	-0.046	0.002	<0.001
Width (cm)	-0.039	0.002	<0.001
Green leaves count	1.908	0.039	<0.001
^* Data source:Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017.

Our final model: \[ \begin{aligned} \log\left[ \frac { P( \operatorname{species} = \operatorname{Serenoa\ repens} ) }{ 1 - P( \operatorname{species} = \operatorname{Serenoa\ repens} ) } \right] &= \alpha + \beta_{1}(\operatorname{height\_cm}) + \beta_{2}(\operatorname{length\_cm}) + \beta_{3}(\operatorname{width\_cm})\ + \\ &\quad \beta_{4}(\operatorname{green\_lvs}) \end{aligned} \]

and with coefficients in place: \[ \begin{aligned} \log\left[ \frac { \widehat{P( \operatorname{species} = \operatorname{Serenoa\ repens} )} }{ 1 - \widehat{P( \operatorname{species} = \operatorname{Serenoa\ repens} )} } \right] &= -3.23 + 0.03(\operatorname{height\_cm}) - 0.05(\operatorname{length\_cm}) - 0.04(\operatorname{width\_cm})\ + \\ &\quad 1.91(\operatorname{green\_lvs}) \end{aligned} \]

# make predictions

predict_acc <- function(x, y) {
  accurate <- ifelse(x == y, 1, 0)
  return(accurate)
}


blr1_fitted <- palmetto_blr1 %>%
  broom::augment(type.predict = "response") %>% 
    mutate(predicted_species = case_when(
    .fitted >= 0.5 ~ "Serenoa repens", 
    .fitted < 0.5 ~ "Sabal etonia" 
  )) %>% 
  mutate(prediction_result = predict_acc(species, predicted_species))
  
  
blr2_fitted <- palmetto_blr2 %>%
  broom::augment(type.predict = "response") %>% 
    mutate(predicted_species = case_when(
    .fitted >= 0.5 ~ "Serenoa repens", 
    .fitted < 0.5 ~ "Sabal etonia" 
  )) %>% 
  mutate(prediction_result = predict_acc(species, predicted_species))


# See what percentage of the time the model predicted correctly
blr1_acc <- blr1_fitted  %>% 
  group_by(species) %>% 
  summarize(pred_accuracy_v = round(mean(prediction_result), 2)) %>% 
  kable(align = "ll",
      col.names = c("Species", "% correctly classified"),caption="Table 3. Model 1 probability of correctly classifying palmetto species") %>% kable_styling(
                bootstrap_options = c("striped","hover","bordered","italic"), full_width = TRUE,position = "left", html_font="Cambria", font_size = 14) %>% 
  add_footnote(c("Data source:Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two
dominant palmetto species of south-central Florida from 1981 - 2017."), notation = "symbol") 

blr2_acc <- blr2_fitted  %>% 
  group_by(species) %>% 
  summarize(pred_accuracy_v = round(mean(prediction_result), 2)) %>% 
  kable(align = "ll",
      col.names = c("Species", "% correctly classified"),caption="Table 4. Model 2 probability of correctly classifying palmetto species") %>% kable_styling(
                bootstrap_options = c("striped","hover","bordered","italic"), full_width = TRUE,position = "left", html_font="Cambria", font_size = 14) %>% 
  add_footnote(c("Data source:Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two
dominant palmetto species of south-central Florida from 1981 - 2017."), notation = "symbol") 

blr1_acc

Table 3. Model 1 probability of correctly classifying palmetto species
Species	% correctly classified
Sabal etonia	0.93
Serenoa repens	0.91
^* Data source:Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017.

blr2_acc

Table 4. Model 2 probability of correctly classifying palmetto species
Species	% correctly classified
Sabal etonia	0.91
Serenoa repens	0.89
^* Data source:Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017.

*Model 1 was more accurate in predicting Sabal etonia than Serenoa repens. The model correctly predicted that a palmetto was Sabal etonia.

Summary

Model 1 AIC was lower that model 2.
Training dataset for model1 and model2 showed better results for model1