Introduction
Set up R and read in data
Step 1: Predict 2023 bloom dates for Washington, Liestal, and Kyoto
Step 2: Predict 2024 Vancouver and New York City Bloom Dates
Last Step: Produce csv-document of results

Introduction

Goal of this contest is to predict the 2024 cherry tree bloom dates at five different locations around the world. I will predict this year’s bloom dates using four different predictors in a linear regression:

Past bloom dates starting in 1954
Each year’s estimated dormacy release date (DRD; explanation see below)
The modeled temperature at the DRD
The mean temperature in January

For Vancouver and New York City, I have to proceed differently, because there exist no (or only few) records of past bloom dates. I thus use the USA-NPN dataset (containing records of bloom dates in different years and different cities in the US) to model past Vancouver and New York City bloom dates. I then use this modeled data to predict future bloom dates.

Set up R and read in data

setwd("/Users/lunafrauhammer/Library/CloudStorage/OneDrive-UniversitaetDuisburg-Essen/Sonstiges/CherryBlossom/CherryBlossom24")
# set working directory to folder CherryBlossom24 to run this code. 
library(tidyverse)
library(data.table)
library(lubridate)
library(Metrics)
library(showtext)
library(sysfonts)#
library(caret)

doy_to_date <- function (year, doy) {
  strptime(paste(year, doy, sep = '-'), '%Y-%j') %>% # create date object
    strftime('%Y-%m-%d') # translate back to date string in ISO 8601 format
}

cherry1 <- read.csv("peak-bloom-prediction/data/washingtondc.csv") %>% 
  bind_rows(read.csv("peak-bloom-prediction/data/liestal.csv")) %>% 
  bind_rows(read.csv("peak-bloom-prediction/data/kyoto.csv")) %>%
  bind_rows(read.csv("peak-bloom-prediction/data/vancouver.csv"))

cherry1 <- subset(cherry1, year >= 1954) 
# I will only use data from 1954 onwards. 

# Because there are only two dates for Vancouver, I will remove this for now: 
cherry <- subset(cherry1, location != "vancouver")

# Let's take a look at the historic bloom dates: 
city_Colors<- c(kyoto = "lightpink1", liestal = "lightskyblue", 
                washingtondc = "palegreen3")
ggplot(cherry, aes(x = year, y = bloom_doy)) +
    geom_text(label = "\u273F", aes(color = location),
              size=4, family = "Arial Unicode MS", alpha = 1) +
   #  scale_color_discrete(name = "location") +
    theme_classic() +
    labs(x = "Year", y = "Full Blossom Day of the Year") +
    geom_smooth(method = "lm", fullrange = TRUE, se = FALSE,
                aes(color = location), show.legend = FALSE) +
   # geom_smooth(method = "lm", fullrange = TRUE, aes(fill = location), alpha = 0.2,)
    scale_color_manual(values=city_Colors, name = "Location", 
                       labels = c("Kyoto", "Liestal-Weideli", "Washington, D.C.")) + 
    ggtitle("Historic Bloom Dates") + 
    theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
    ylim(70, 120)

## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 1 rows containing missing values (`geom_text()`).

# ggsave("historicDates.png", p1)
# get the weather information: 
library(rnoaa) 
# stations <- ghcnd_stations()
weatherWash <- ghcnd_search("USC00186350")
weatherLies <- ghcnd_search("GME00127786")
weatherKyot <- ghcnd_search("JA000047759")
weatherVan <- ghcnd_search("CA001108395")
weatherNY <- ghcnd_search("USW00094728")
weather <- rbind(weatherWash$tmax, weatherLies$tmax, weatherKyot$tmax, 
                 weatherVan$tmax, weatherNY$tmax)
# I am using the maximum daily temperatures. 

# Prepare data set: 
weather$location <- NA
weather[weather$id == "USC00186350", ]$location <- "Washington"
weather[weather$id == "GME00127786", ]$location <- "Liestal"
weather[weather$id == "JA000047759", ]$location <- "Kyoto"
weather[weather$id == "CA001108395", ]$location <- "Vancouver"
weather[weather$id == "USW00094728", ]$location <- "NewYork"

summary(weather[weather$location == "Washington", ]$date) # data from 10/01/1948 - 12/31/2023
summary(weather[weather$location == "Liestal", ]$date) # data from 10/01/1953 - 12/31/2023
summary(weather[weather$location == "Kyoto", ]$date) # data from 01/01/1951 - 02/29/2024
summary(weather[weather$location == "Vancouver", ]$date) # data from 01/01/1957 - 02/29/2024
summary(weather[weather$location == "NewYork", ]$date) # data from 01/01/1869 - 02/29/2024

weather$date <- as.Date(weather$date, "%Y/%m/%d")
weather$year <- year(weather$date)
weather$date.within <- format(weather$date, format="%m-%d") 
weather$doy <- yday(weather$date) # day of the year

Step 1: Predict 2023 bloom dates for Washington, Liestal, and Kyoto

In the first step, I only predict the 2023 bloom dates for those cities for which I have historic bloom data, i.e., Washington, Liestal, and Kyoto. Like stated above, I will use the historic bloom dates as well as each year’s DRD, modeled DRD temperature and mean January temperature as predictors.

Calculate Historic Dormacy Release Dates

The dormacy release date (DRD) is the day in late Winter / early Spring when the temperature trend reverses from going down to going up again. Since cherry blossoms need both periods of decreasing / colder and increasing / warmer temperatures to bloom, the DRD can be an important predictor of the cherry blossom bloom date (Chung et al., 2011).

In order to get each year’s DRD, I fit a quadratic regression on the temperatures between October 1st and March 31st. I then use the minimum of this function as estimation of the DRD.

# I first need a new yearly organizing variable that changes on October 1st 
# instead of on January 1st:  
weather$yearWinter <- NA
for(y in 1:nrow(weather)){
  year.margin <- as.Date(paste(weather$year[y], "10-01", sep = "-"))
  weather$yearWinter[y] <- ifelse(weather$date[y] >= year.margin, 
                                  weather$year[y] + 1, weather$year[y])
}

# I then exclude all data points that lie in between June and October because 
# I don't need them
weather <- weather[!(month(weather$date) %in% c(6:9)), ]

# I now add a new DOY variable with negative numbers from October 1st to 
# December 31st
weather$doy.n <- 0
for(y in unique(weather$yearWinter)){
  weather[weather$yearWinter == y & weather$location == "Washington", ]$doy.n <- 
    seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y & 
    weather$location == "Washington", ]))
  weather[weather$yearWinter == y & weather$location == "Liestal", ]$doy.n <- 
    seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y & 
    weather$location == "Liestal", ]))
  weather[weather$yearWinter == y & weather$location == "Kyoto", ]$doy.n <- 
    seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y & 
    weather$location == "Kyoto", ]))
  weather[weather$yearWinter == y & weather$location == "Vancouver", ]$doy.n <- 
    seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y & 
    weather$location == "Vancouver", ]))
  weather[weather$yearWinter == y & weather$location == "NewYork", ]$doy.n <- 
    seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y & 
    weather$location == "NewYork", ]))
}

# I now estimate the DRDs. 
# Illustration for the year 2022 in Liestal: 
temp23 <- subset(weather, location == "Liestal" & yearWinter == 2023)
fit23 <- lm(tmax ~ doy.n + I(doy.n^2), temp23)
coefs23 <- coef(fit23)
doy.p <- -91: nrow(temp23) - 92
fit.p <- predict(fit23, newdata = data.frame(doy.n = doy.p))
fun23 <- function(x){
  coefs23[1] + coefs23[2]*x + coefs23[3]*x^2
}
DRD23 <- as.numeric(optimize(fun23, interval = c(1, 365), maximum = FALSE)[1])
temp_pred23 <- as.numeric(optimize(fun23, interval = c(1, 365), maximum = FALSE)[2])

with(temp23, plot(tmax ~ doy.n, ylab = "Maximum Temperature (1/10 °Celsius)", 
                  xlab = "Day of the Year", pch = 19, cex = 0.8, bty = "L", 
                  main = "Illustration: 2023 DRD in Liestal"))
lines(fit.p ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD23, col = "red")
text(DRD23,  220, substitute(paste(bold("Dormancy Release Date"))), cex = 0.6, col = "red", pos = 4)
abline(h = temp_pred23, col = "red")
text(DRD23 +60,  temp_pred23 - 10, substitute(paste(bold("Modeled DRD Temperature"))), cex = 0.6, col = "red", pos = 4)

Now, I model the DRD for each year in the temperature data set for each of the three locations.

for(l in c("Washington", "Liestal", "Kyoto", "Vancouver", "NewYork")){
  DRD <- year <- DRD.t.pred <- NULL
  weather_Loc <- subset(weather, location == l)
  for(y in 1:(length(unique(weather_Loc$yearWinter))-1)){
    year.y <- unique(weather_Loc$yearWinter)[y]
    temp.year <- weather_Loc[weather_Loc$yearWinter == year.y, ]
    if(nrow(temp.year) > 200){ # not all years are complete
      coefs <- coef(lm(tmax ~ doy.n + I(doy.n^2), temp.year))
      fun <- function(x){
        coefs[1] + coefs[2]*x + coefs[3]*x^2
      }
      DRD[y] <- as.numeric(optimize(fun, interval = c(-91, 200), 
                                    maximum = FALSE)[1])
      year[y] <- unique(weather_Loc$yearWinter)[y]
      DRD.t.pred[y] <- as.numeric(optimize(fun, interval = c(-91, 200), 
                                           maximum = FALSE)[2]) / 10
    }
  }
  assign(paste("DRD_", l, sep = ""), DRD)
  assign(paste("DRD.t.pred_", l, sep = ""), DRD.t.pred)
  assign(paste("year_", l, sep = ""), year)
}

temp_Wash <- data.frame(year = year_Washington, DRD = DRD_Washington, 
                        t.pred = DRD.t.pred_Washington)
# boxplot.stats(temp_Wash$DRD)$out
temp_Lies <- data.frame(year = year_Liestal, DRD = DRD_Liestal, 
                        t.pred = DRD.t.pred_Liestal)
# boxplot.stats(temp_Wash$DRD)$out
temp_Kyot <- data.frame(year = year_Kyoto, DRD = DRD_Kyoto, 
                        t.pred = DRD.t.pred_Kyoto)
# boxplot.stats(temp_Kyot$DRD)$out
temp_Vanc <- data.frame(year = year_Vancouver, DRD = DRD_Vancouver, 
                        t.pred = DRD.t.pred_Vancouver)
# boxplot.stats(temp_Vanc$DRD)$out
temp_NY <- data.frame(year = year_NewYork, DRD = DRD_NewYork, 
                        t.pred = DRD.t.pred_NewYork)


temp <- rbind(temp_Wash, temp_Lies, temp_Kyot, temp_Vanc, temp_NY)
temp$location <- rep(c("Washington", "Liestal", "Kyoto", "Vancouver", "NewYork"), 
                     c(nrow(temp_Wash), nrow(temp_Lies), nrow(temp_Kyot), 
                       nrow(temp_Vanc), nrow(temp_NY)))

ggplot(temp, aes(x = year, y = DRD)) +
  geom_point() +
  ylim(-5, 30) +
  xlim(1950, 2020) + 
  facet_grid(~location) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) + 
  ylab("Estimated DRD") + 
  xlab("Year") +
  ggtitle("Historic DRDs") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

## `geom_smooth()` using formula = 'y ~ x'

ggplot(temp, aes(x = year, y = t.pred)) +
  geom_point() +
  facet_grid(~location) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  ylim(-5, 15) + 
  xlim(1950, 2020) + 
  ylab("Modeled temperature at DRD (°C)") + 
  xlab("Year") + 
  ggtitle("Historic DRD Temperatures") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

## `geom_smooth()` using formula = 'y ~ x'

Calculate Mean January Temperature

As an additional predictor, I will use each year’s mean temperature in January. In the following steps, I get this data from the historic weather data set.

## Washington, D.C. 
mean.jan_Wash <- NULL
year_Wash <- NULL
weather_Wash <- subset(weather, location == "Washington")
weather_Wash_jan <- subset(weather_Wash, month(date) == 1)
for(y in 1:(length(unique(weather_Wash_jan$yearWinter))-1)){
  year <- unique(weather_Wash_jan$yearWinter)[y]
  mean.jan_Wash[y] <- mean(weather_Wash_jan[weather_Wash_jan$yearWinter == year,
                                            ]$tmax,na.rm = TRUE) / 10
  year_Wash[y] <- year
}
jan_Wash <- data.frame(mean.jan = mean.jan_Wash, year = year_Wash)

## Liestal:
mean.jan_Lies <- NULL
year_Lies <- NULL
weather_Lies <- subset(weather, location == "Liestal")
weather_Lies_jan <- subset(weather_Lies, month(date) == 1)
for(y in 1:(length(unique(weather_Lies_jan$yearWinter))-1)){
  year <- unique(weather_Lies_jan$yearWinter)[y]
  mean.jan_Lies[y] <- mean(weather_Lies_jan[weather_Lies_jan$yearWinter == year,
                                            ]$tmax,na.rm = TRUE) / 10
  year_Lies[y] <- year
}
jan_Lies <- data.frame(mean.jan = mean.jan_Lies, year = year_Lies)


## Kyoto: 
mean.jan_kyot <- NULL
year_kyot <- NULL
weather_kyot <- subset(weather, location == "Kyoto")
weather_kyot_jan <- subset(weather_kyot, month(date) == 1)
for(y in 1:(length(unique(weather_kyot_jan$yearWinter))-1)){
  year <- unique(weather_kyot_jan$yearWinter)[y]
  mean.jan_kyot[y] <- mean(weather_kyot_jan[weather_kyot_jan$yearWinter == year,
                                            ]$tmax,na.rm = TRUE) / 10
  year_kyot[y] <- year
}
jan_Kyot <- data.frame(mean.jan = mean.jan_kyot, year = year_kyot)


## Vancouver: 
mean.jan_Vanc <- NULL
year_Vanc <- NULL
weather_Vanc <- subset(weather, location == "Vancouver")
weather_Vanc_jan <- subset(weather_Vanc, month(date) == 1)
for(y in 1:(length(unique(weather_Vanc_jan$yearWinter))-1)){
  year <- unique(weather_Vanc_jan$yearWinter)[y]
  mean.jan_Vanc[y] <- mean(weather_Vanc_jan[weather_Vanc_jan$yearWinter == year,
                                            ]$tmax,na.rm = TRUE) / 10
  year_Vanc[y] <- year
}
jan_Vanc <- data.frame(mean.jan = mean.jan_Vanc, year = year_Vanc)


### New York: 
mean.jan_NY <- NULL
year_NY <- NULL
weather_NY <- subset(weather, location == "NewYork")
weather_NY_jan <- subset(weather_NY, month(date) == 1)
for(y in 1:(length(unique(weather_NY_jan$yearWinter))-1)){
  year <- unique(weather_NY_jan$yearWinter)[y]
  mean.jan_NY[y] <- mean(weather_NY_jan[weather_NY_jan$yearWinter == year,
                                            ]$tmax,na.rm = TRUE) / 10
  year_NY[y] <- year
}
jan_NY <- data.frame(mean.jan = mean.jan_NY, year = year_NY)


temp_Wash <- merge(temp_Wash, jan_Wash, by = "year")
temp_Lies <- merge(temp_Lies, jan_Lies, by = "year")
temp_Kyot <- merge(temp_Kyot, jan_Kyot, by = "year")
temp_Vanc <- merge(temp_Vanc, jan_Vanc, by = "year")
temp_NY <- merge(temp_NY, jan_NY, by = "year")
temp <- rbind(temp_Wash, temp_Lies, temp_Kyot, temp_Vanc, temp_NY)
temp$location <- rep(c("Washington", "Liestal", "Kyoto", "Vancouver", "NewYork"), 
                     c(nrow(temp_Wash), nrow(temp_Lies), nrow(temp_Kyot), 
                       nrow(temp_Vanc), nrow(temp_NY)))

ggplot(temp, aes(x = year, y = mean.jan)) +
  geom_point() +
  facet_grid(~location) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  ylim(-5, 15) + 
  xlim(1950, 2020) +
  ylab("Mean January temperature (°C)") + 
  xlab("Year") + 
  ggtitle("Historic Mean January Temperatures") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

## `geom_smooth()` using formula = 'y ~ x'

Estimate 2024 DRD

Before I can estimate the 2024 bloom dates, I need to estimate this year’s DRD. However, the ghcnd weather data ends in 2023, so I looked for more current temperature data online and retrieved free temperature data from https://www.visualcrossing.com. These data sets include the weather data between 10/01/2023 and 02/29/2024 (forecast) for each city.

1. Washington:

Wash_2024 <- read.csv("peak-bloom-prediction/data/washington dc 2023-10-01 to 2024-02-29.csv", sep = ",", header = TRUE)

# check whether both data sources match: 
wGHCN <- subset(weather, location == "Washington" & 
                  date >= as.Date("2023-10-01") & date <=
                  as.Date("2023-10-31"))$tmax / 10
wVC <- subset(Wash_2024, as.Date(datetime) >= as.Date("2023-10-01") & as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC, use = "complete.obs") # r = .83 so this seems okay!

## [1] 0.8289739

Wash_2024$doy <- yday(as.Date(Wash_2024$datetime))
Wash_2024$doy.n <- -91:60
Wash_2024$tmax <- Wash_2024$tempmax

# the problem is that I have to little 2023 data to sensibly model the DRD. I will
# thus use the last 10 years' mean temperature data between March and June to 
# improve the DRD prediction. However, these values will be weighted less strongly 
# than the actually observed temperature data. 

yday(as.Date("2024-03-01"))

## [1] 61

yday(as.Date("2024-05-31"))

## [1] 152

# I get the 61th - 153rd days of the year; for leap years this is March 1st - June 1st 
# for other years it is March 2nd - June 2nd. 

# I now get the mean temperature of the last 10 years for all of these days: 
mw_Wash <- subset(aggregate(tmax ~ doy, subset(weather_Wash, year >= 2013), mean), 
                  doy %in% 61:152)
mw_Wash$tmax <- mw_Wash$tmax / 10

w24_Wash <- Wash_2024[, c("doy.n", "tmax")]
names(mw_Wash) <- c("doy.n", "tmax")
w_Wash <- rbind(w24_Wash, mw_Wash)

# fit quadratic regression with specified weights: 
weights <- rep(c(1, 0.2), c(153, 91))
fit24_Wash <- lm(tmax ~ doy.n + I(doy.n^2), w_Wash, weights = weights)

coefs24_Wash <- coef(fit24_Wash)
doy.p <- -91: 152
fit.p_Wash <- predict(fit24_Wash, newdata = data.frame(doy.n = doy.p))
fun24_Wash <- function(x){
  coefs24_Wash[1] + coefs24_Wash[2]*x + coefs24_Wash[3]*x^2
}

DRD24_Wash <- as.numeric(optimize(fun24_Wash, interval = c(1, 365), 
                                  maximum = FALSE)[1])
t.DRD24_Wash <- w_Wash[w_Wash$doy.n == round(DRD24_Wash), ]$tmax
tpred24_Wash <- as.numeric(optimize(fun24_Wash, interval = c(1, 365), 
                                    maximum = FALSE)[2])
with(subset(w_Wash, doy.n <= 60), 
     plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)", 
                     xlab = "Day of the Year", pch = 19, cex = 0.8, 
                     xlim = c(-91, 150), bty = "L", 
                     main = "Predicted 2024 DRD: Washington, D.C."))
with(subset(w_Wash, doy.n > 60), 
     points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_Wash ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_Wash, col = "red")
text(DRD24_Wash,  20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)

DRD_Wash <- c(DRD_Washington, DRD24_Wash) 
t.pred_Wash <- c(temp_Wash$t.pred, tpred24_Wash)

# Calculate mean January temperature
j24_Wash <- mean(subset(Wash_2024, as.Date(datetime) >= as.Date("2024-01-01") &
              as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)

2. Liestal:

Lies_2024 <- read.csv("peak-bloom-prediction/data/liestal 2023-10-01 to 2024-02-29.csv", 
                      sep = ",", header = TRUE)

# check whether both data sources match: 
wGHCN <- subset(weather, location == "Liestal" & 
                  date >= as.Date("2023-10-01") & date <=
                  as.Date("2023-10-31"))$tmax / 10
wVC <- subset(Lies_2024, as.Date(datetime) >= as.Date("2023-10-01") & 
                as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC) # very high correlation -> looks good!

## [1] 0.9899423

Lies_2024$doy <- yday(as.Date(Lies_2024$datetime))
Lies_2024$doy.n <- -91:60
Lies_2024$tmax <- Lies_2024$tempmax


mw_Lies <- subset(aggregate(tmax ~ doy, subset(weather_Lies, year >= 2013), mean), 
                  doy %in% 61:152)
mw_Lies$tmax <- mw_Lies$tmax / 10

w24_Lies <- Lies_2024[, c("doy.n", "tmax")]
names(mw_Lies) <- c("doy.n", "tmax")
w_Lies <- rbind(w24_Lies, mw_Lies)

# fit quadratic regression with specified weights: 
weights <- rep(c(1, 0.2), c(153, 91))
fit24_Lies <- lm(tmax ~ doy.n + I(doy.n^2), w_Lies, weights = weights)

coefs24_Lies <- coef(fit24_Lies)
doy.p <- -91: 153
fit.p_Lies <- predict(fit24_Lies, newdata = data.frame(doy.n = doy.p))
fun24_Lies <- function(x){
  coefs24_Lies[1] + coefs24_Lies[2]*x + coefs24_Lies[3]*x^2
}

DRD24_Lies <- as.numeric(optimize(fun24_Lies, interval = c(1, 365), 
                                  maximum = FALSE)[1])
t.DRD24_Lies <- w_Lies[w_Lies$doy.n == round(DRD24_Lies), ]$tmax
tpred24_Lies <- as.numeric(optimize(fun24_Lies, interval = c(1, 365), 
                                    maximum = FALSE)[2])
with(subset(w_Lies, doy.n <= 60), 
     plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)", 
                     xlab = "Day of the Year", pch = 19, cex = 0.8, 
                     xlim = c(-91, 150), bty = "L", 
                     main = "Predicted 2023 DRD: Liestal"))
with(subset(w_Lies, doy.n > 60), 
     points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_Lies ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_Lies, col = "red")
text(DRD24_Lies,  20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)

DRD_Lies <- c(DRD_Liestal, DRD24_Lies) 
t.pred_Lies <- c(temp_Lies$t.pred, tpred24_Lies)


# Calculate mean January temperature: 
j24_Lies <- mean(subset(Lies_2024, as.Date(datetime) >= as.Date("2024-01-01") &
              as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)

3. Kyoto:

Kyot_2024 <- read.csv("peak-bloom-prediction/data/kyoto 2023-10-01 to 2024-02-29.csv", 
                      sep = ",", header = TRUE)

# check whether both data sources match: 
wGHCN <- subset(weather, location == "Kyoto" & 
                  date >= as.Date("2023-10-01") & date <=
                  as.Date("2023-10-31"))$tmax / 10
wVC <- subset(Kyot_2024, as.Date(datetime) >= as.Date("2023-10-01") & 
                as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC, use = "complete.obs") # very high correlation -> no problem!

## [1] 0.9914435

Kyot_2024$doy <- yday(as.Date(Kyot_2024$datetime))
Kyot_2024$doy.n <- -91:60
Kyot_2024$tmax <- Kyot_2024$tempmax

weather_Kyot <- weather_kyot

mw_Kyot <- subset(aggregate(tmax ~ doy, subset(weather_Kyot, year >= 2013), mean), 
                  doy %in% 61:152)
mw_Kyot$tmax <- mw_Kyot$tmax / 10

w24_Kyot <- Kyot_2024[, c("doy.n", "tmax")]
names(mw_Kyot) <- c("doy.n", "tmax")
w_Kyot <- rbind(w24_Kyot, mw_Kyot)

# fit quadratic regression with specified weights: 
weights <- rep(c(1, 0.2), c(153, 91))
fit24_Kyot <- lm(tmax ~ doy.n + I(doy.n^2), w_Kyot, weights = weights)

coefs24_Kyot <- coef(fit24_Kyot)
doy.p <- -91: 152
fit.p_Kyot <- predict(fit24_Kyot, newdata = data.frame(doy.n = doy.p))
fun24_Kyot <- function(x){
  coefs24_Kyot[1] + coefs24_Kyot[2]*x + coefs24_Kyot[3]*x^2
}

DRD24_Kyot <- as.numeric(optimize(fun24_Kyot, interval = c(1, 365), 
                                  maximum = FALSE)[1])
t.DRD24_Kyot <- w_Kyot[w_Kyot$doy.n == round(DRD24_Kyot), ]$tmax
tpred24_Kyot <- as.numeric(optimize(fun24_Kyot, interval = c(1, 365), 
                                    maximum = FALSE)[2])
with(subset(w_Kyot, doy.n <= 60), 
     plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)", 
                     xlab = "Day of the Year", pch = 19, cex = 0.8, 
                     xlim = c(-91, 150), bty = "L", 
                     main = "Predicted 2023 DRD: Kyoto"))
with(subset(w_Kyot, doy.n > 60), 
     points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_Kyot ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_Kyot, col = "red")
text(DRD24_Kyot,  20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)

DRD_Kyot <- c(DRD_Kyoto, DRD24_Kyot) 
t.pred_Kyot <- c(temp_Kyot$t.pred, tpred24_Kyot)


# Calculate mean January temperature
# 2024: 
j24_Kyot <- mean(subset(Kyot_2024, as.Date(datetime) >= as.Date("2024-01-01") &
              as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)

Prediction: Washington, Liestal, Kyoto

After modeling the DRDs, DRD temperatures, and calculating each year’s mean January temperature, I can now predict the 2024 bloom dates. I do this seperately for each city since I want to consider the goodness of fit indices for all cities.

1. Washington

tc <- trainControl(method = "LOOCV") # set caret train function to leave one out


cherry_Wash <- subset(cherry, location == "washingtondc")
# now use data from temperature data set: temp_Lies
dat_Wash <- merge(cherry_Wash, temp_Wash, by = "year")

### Different Models
mYearW <- train(bloom_doy ~ year, data = dat_Wash,
           method = "lm", trControl = tc)
# mYearW
# summary(mYearW) # R^2 = 0.23

mDRDW <- train(bloom_doy ~ DRD, data = dat_Wash, 
               method = "lm", trControl = tc)
# summary(mDRDW) # R^2 = 0.13

mtpredW <- train(bloom_doy ~ t.pred, data = dat_Wash, 
                 method = "lm", trControl = tc)
# summary(mtpredW) # R^2 = 0.25

mjanW <- train(bloom_doy ~ mean.jan, data = dat_Wash, 
               method = "lm", trControl = tc)
# summary(mjanW) # R^2 = 0.02

mAllW <- train(bloom_doy ~ year + DRD + t.pred + mean.jan, data = dat_Wash, 
               method = "lm", trControl = tc)
summary(mAllW) # R^2 = 0.59 -> I will take this model!

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3155 -3.0399 -0.1244  3.0013 10.1582 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 385.31877   57.72019   6.676 8.34e-09 ***
## year         -0.14336    0.02929  -4.894 7.57e-06 ***
## DRD           0.50699    0.12226   4.147 0.000106 ***
## t.pred       -3.00484    0.53672  -5.598 5.46e-07 ***
## mean.jan      0.99505    0.34480   2.886 0.005390 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.722 on 61 degrees of freedom
## Multiple R-squared:  0.5857, Adjusted R-squared:  0.5585 
## F-statistic: 21.56 on 4 and 61 DF,  p-value: 4.011e-11

# check cross validation results (using leave-one-out method): 
mAllW

## Linear Regression 
## 
## 66 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 65, 65, 65, 65, 65, 65, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.906488  0.5168813  4.131986
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

# RMSE = 4.91, R^2 = 0.52, MAE = 4.13

plot(bloom_doy ~ year, dat_Wash, cex = 0.5, pch = 19, 
     ylab = "Full Bloom Day of the Year", bty = "L",
     xlab = "Year", main = "Best Predictor for Washington, D.C.: Year")
mYearW <- lm(bloom_doy ~ year, data = dat_Wash)
abline(mYearW)

### Predict Bloom Date 2024
data24_Wash <- data.frame(year = 2024, DRD = DRD24_Wash, t.pred = tpred24_Wash, 
                          mean.jan = j24_Wash)
predict(mAllW, newdata = data24_Wash) # 85.86

##        1 
## 85.85945

doy_to_date(2024, 86) # I rounded it up to 86 since 85.86 is pretty late on March 25th.

## [1] "2024-03-26"

# I will compare this multiple regression approach to a random forest model: 
library(caret)
rf_Wash_1 <- train(bloom_doy ~ year + DRD + t.pred + mean.jan,
               data = dat_Wash, 
               method = "cforest", 
               tuneGrid = data.frame(.mtry = 2),
               trControl = trainControl(method = "oob"))
rf_Wash_1

## Conditional Inference Random Forest 
## 
## 66 samples
##  4 predictor
## 
## No pre-processing
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   5.872503  0.321994  4.767238
## 
## Tuning parameter 'mtry' was held constant at a value of 2

# OOB RMSR = 5.78 -> this is higher than the multiple regression model. 
predict(rf_Wash_1, newdata = data24_Wash)

## .outcome 
## 89.19813

doy_to_date(2024, 89)

## [1] "2024-03-29"

# random forest prediction would be March 29th. However, I will stick to the 
# regression prediction, since it has better fit indices.

My prediction for Washington is March 25th 2024

2. Liestal

cherry_lies <- subset(cherry, location == "liestal")
# now use data from temperature data set: temp_Lies
dat_lies <- merge(cherry_lies, temp_Lies, by = "year")

mYearL <- train(bloom_doy ~ year, data = dat_lies, 
                method = "lm", trControl = tc)
# summary(mYearL) # R^2 = 0.22
mDRDL <- train(bloom_doy ~ DRD, data = dat_lies, 
               method = "lm", trControl = tc)
# summary(mDRDL) # R^2 = 0.34
mtpredL <- train(bloom_doy ~ t.pred, data = dat_lies, 
                 method = "lm", trControl = tc)
# summary(mtpredL) # R^2 = 0.57
mjanL <- train(bloom_doy ~ mean.jan, data = dat_lies, 
               method = "lm", trControl = tc)
# summary(mjanL) # R^2 = 0.08
mAllL <- train(bloom_doy ~ year + DRD + t.pred + mean.jan, data = dat_lies,
               method = "lm", trControl = tc)
# summary(mAllL) # R^2 = 0.72
mAllL2 <- train(bloom_doy ~ year + DRD * t.pred + mean.jan, data = dat_lies, 
                method = "lm", trControl = tc)
summary(mAllL2) # R^2 = 0.73 -> I will take this model.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.3303  -4.2909   0.5589   3.5102  18.0839 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  342.90449   90.28503   3.798 0.000339 ***
## year          -0.12232    0.04597  -2.661 0.009946 ** 
## DRD            1.26673    0.58305   2.173 0.033707 *  
## t.pred        -3.28782    1.65645  -1.985 0.051662 .  
## mean.jan       1.71900    0.42677   4.028 0.000158 ***
## `DRD:t.pred`  -0.12579    0.09684  -1.299 0.198834    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.411 on 61 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7067 
## F-statistic: 32.81 on 5 and 61 DF,  p-value: 4.36e-16

# check cross validation results (using leave-one-out method): 
mAllL2

## Linear Regression 
## 
## 67 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 66, 66, 66, 66, 66, 66, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   6.823537  0.6653337  5.274706
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

# RMSE = 6.82, R^2 = 0.67, MAE = 5.27


# For Liestal, the modeled temperature at DRD is the best single predictor: 
rbPal <- colorRampPalette(c('cadetblue2','maroon3'))
dat_lies$Col <- rbPal(10)[as.numeric(cut(dat_lies$year,breaks = 10))]

plot(bloom_doy ~ t.pred, dat_lies, cex = 0.5, pch = 19, 
     ylab = "Full Bloom Day of the Year", 
     xlab = "ModeledTemperature at Dormancy Release Date", 
     main = "Best Predictor for Liestal: DRD Temperature",
     bty = "L", col = dat_lies$Col)
with(dat_lies, text((bloom_doy -1.75) ~ t.pred, labels = year, cex = 0.58, col = scales::alpha(dat_lies$Col, alpha = 0.9)))
mtpredL <- lm(bloom_doy ~ t.pred, data = dat_lies)
abline(mtpredL)

### Predict Bloom Date 2023
data24_Lies <- data.frame(year = 2024, DRD = DRD24_Lies, t.pred = tpred24_Lies, 
                          mean.jan = j24_Lies)
predict(mAllL2, newdata = data24_Lies) # 91.40

##        1 
## 91.39672

doy_to_date(2024, 91)

## [1] "2024-03-31"

# random forest method: 
rf_Lies_1 <- train(bloom_doy ~ year + DRD + t.pred + mean.jan,
               data = dat_lies, 
               method = "cforest", 
               tuneGrid = data.frame(.mtry = 2),
               trControl = trainControl(method = "oob"))
rf_Lies_1

## Conditional Inference Random Forest 
## 
## 67 samples
##  4 predictor
## 
## No pre-processing
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   7.491085  0.6069835  5.727468
## 
## Tuning parameter 'mtry' was held constant at a value of 2

# RMSE = 7.43, R^2 = 0.61, MAE = 5.67 -> Again, the regression approach gives the 
# better results

My prediction for Liestal is March 31st 2024

3. Kyoto

cherry_kyot <- subset(cherry, location == "kyoto")
# now use data from temperature data set: temp_kyot
dat_kyot <- merge(cherry_kyot, temp_Kyot, by = "year")

mYearK <- train(bloom_doy ~ year, data = dat_kyot, 
                method = "lm", trControl = tc)
# summary(mYearK) # R^2 = 0.24
mDRDK <- train(bloom_doy ~ DRD, data = dat_kyot, 
               method = "lm", trControl = tc)
# summary(mDRDK) # R^2 = 0.40
mtpredK <- train(bloom_doy ~ t.pred, data = dat_kyot, 
              method = "lm", trControl = tc)
# summary(mtpredK) # R^2 = 0.33
mjanK <- train(bloom_doy ~ mean.jan, data = dat_kyot, 
            method = "lm", trControl = tc)
# summary(mjanK) # R^2 = 0.05
mAllK <- train(bloom_doy ~ year + DRD + t.pred + mean.jan, data = dat_kyot, 
            method = "lm", trControl = tc)
# summary(mAllK) # R^2 = 0.76
mAllK2 <- train(bloom_doy ~ year + DRD * t.pred + mean.jan, data = dat_kyot, 
                method = "lm", trControl = tc)
summary(mAllK2) # R^2 = 0.78-> I will take this model.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.975 -1.716 -0.111  1.573  6.235 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  168.49585   40.80485   4.129 0.000111 ***
## year          -0.07749    0.01492  -5.194 2.43e-06 ***
## DRD            4.19033    1.29115   3.245 0.001893 ** 
## t.pred         5.02142    3.27446   1.534 0.130237    
## mean.jan       1.44701    0.38591   3.750 0.000392 ***
## `DRD:t.pred`  -0.34390    0.13219  -2.601 0.011593 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.353 on 62 degrees of freedom
## Multiple R-squared:  0.7847, Adjusted R-squared:  0.7673 
## F-statistic: 45.18 on 5 and 62 DF,  p-value: < 2.2e-16

# check cross validation results (using leave-one-out method): 
mAllK2

## Linear Regression 
## 
## 68 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 67, 67, 67, 67, 67, 67, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.463822  0.7414536  1.987877
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

# RMSE = 2.46, R^2 = 0.74, MAE = 1.99

rbPal <- colorRampPalette(c('cadetblue2','maroon3'))
dat_kyot$Col <- rbPal(10)[as.numeric(cut(dat_kyot$year,breaks = 10))]

plot(bloom_doy ~ DRD, dat_kyot, cex = 0.5, pch = 19, 
     ylab = "Full Bloom Day of the Year", main = "Best Predictor for Kyoto: DRD",
     xlab = "Dormacy Release Date", bty = "n", col = dat_kyot$Col)
box(bty = "L")
with(dat_kyot, text((bloom_doy -1) ~ DRD, labels = year, cex = 0.58, 
     col = scales::alpha(dat_kyot$Col, alpha = 0.8)))
mDRDK <- lm(bloom_doy ~ DRD, data = dat_kyot)
abline(mDRDK)

# check prediction accuracy: 
pbloomK <- predict(mAllK2, newdata = data.frame(year = dat_kyot$year, 
  DRD = dat_kyot$DRD, t.pred = dat_kyot$t.pred, mean.jan = dat_kyot$t.pred))
rmse(dat_kyot$bloom_doy, pbloomK) # 2.83

## [1] 2.826971

mae(dat_kyot$bloom_doy, pbloomK) # 2.28

## [1] 2.283398

cor(dat_kyot$bloom_doy, pbloomK) # 0.86

## [1] 0.8562401

### Predict Bloom Date 2024
data24_Kyot <- data.frame(year = 2024, DRD = DRD24_Kyot, t.pred = tpred24_Kyot, 
                          mean.jan = j24_Kyot)
predict(mAllK2, newdata = data24_Kyot) # 93.20

##        1 
## 93.20421

doy_to_date(2024, 93)

## [1] "2024-04-02"

rf_Kyot_1 <- train(bloom_doy ~ year + DRD + t.pred + mean.jan,
               data = dat_kyot, 
               method = "cforest", 
               tuneGrid = data.frame(.mtry = 2),
               trControl = trainControl(method = "oob"))
rf_Kyot_1 # RMSE = 3.21, R  2 = 0.63, MAE = 2.32 -> Again, regression is better.

## Conditional Inference Random Forest 
## 
## 68 samples
##  4 predictor
## 
## No pre-processing
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.188488  0.6394563  2.292674
## 
## Tuning parameter 'mtry' was held constant at a value of 2

My prediction for Kyoto is April 2nd 2024.

Step 2: Predict 2024 Vancouver and New York City Bloom Dates

For Vancouver and New York City, we do not have a lot of historic bloom dates. I therefore try to predict these dates from the USA NPN dataset. To evaluate how good this works, I first do the same for Washington and compare the predicted with the actual dates.

To estimate model fit: Model historic bloom dates in Washington, D.C.

dat <- read.csv("peak-bloom-prediction/data/USA-NPN_individual_phenometrics_data.csv")
dat$year <- dat$First_Yes_Year
dat$Species <- as.factor(dat$Species)
m_wash <- train(First_Yes_DOY ~ year + Latitude + Longitude + 
               Elevation_in_Meters + Species + Tmax_Winter, data = dat, 
               method = "lm", trControl = tc)

w_wash <- subset(weather, location == "Washington")
w_wash <- subset(w_wash, month(date) %in% c(12, 1, 2))
tmaxWinter_wash <- aggregate(tmax ~ yearWinter, w_wash, mean)
tmaxWinter_wash$tmax <- tmaxWinter_wash$tmax / 10

data_wash <- data.frame(year = unique(dat$year), Latitude = 38.8853, 
                        Longitude = -77.0386, Elevation_in_Meters = 0, 
                        Species = "yedoensis", 
                        Tmax_Winter = tmaxWinter_wash[60:74, ]$tmax)
doy.pred <- predict(m_wash, newdata = data_wash)
doy.actual <- subset(cherry, year %in% unique(dat$year) & location == "washingtondc")$bloom_doy

doy <- data.frame(doy.pred = doy.pred, doy.actual = doy.actual, 
                  year = unique(dat$year))

cor(doy.pred, doy.actual) # 0.40

## [1] 0.3954145

rmse(doy.pred, doy.actual) # 8.36

## [1] 8.357482

mae(doy.pred, doy.actual) # 7.36

## [1] 7.356402

plot(doy.pred ~ year, doy, ylim = c(80, 100), pch = 19, ylab = 
       "Bloom DOY (modeled / actual)", xlab = "Year", bty = "L", 
     main = "Modeled vs. actual bloom dates: Wahsington, D.C.")
points(doy.actual ~ year, doy, col = "red", pch = 19)
arrows(x0 = doy$year, x1 = doy$year, y0 = doy$doy.pred, y1 = doy$doy.actual, 
      length = 0, col = "red", lty = "dashed")
points(doy.pred ~ year, doy, col = "black", pch = 19)
legend("topright", legend = c("predicted", "actual"), title = "Washington D.C. Bloom DOY", bty = "n", col = c("black", "red"), pch = 19, cex = 0.6)

rf_old_Wash <- train(First_Yes_DOY ~ year + Latitude + Longitude + 
               Elevation_in_Meters + Species + Tmax_Winter, 
               data = dat, 
               method = "cforest", 
               tuneGrid = data.frame(.mtry = 5),
               trControl = trainControl(method = "oob"))
# again, regression performs better than random forest.

The predictions are on average wrong by 7 days which isn’t very good. However, I tried a few different things and nothing seemed to work better, so I will go with it.

I can now estimate the historic bloom dates in Vancouver from the available data.

Model historic bloom dates in Vancouver

m_vanc <- lm(First_Yes_DOY ~ year + Latitude + Longitude + 
               Elevation_in_Meters + Species + Tmax_Winter, dat)

w_vanc <- subset(weather, location == "Vancouver")
w_vanc <- subset(w_vanc, month(date) %in% c(12, 1, 2))
tmaxWinter_vanc <- aggregate(tmax ~ yearWinter, w_vanc, mean)
tmaxWinter_vanc$tmax <- tmaxWinter_vanc$tmax / 10


data_vanc <- data.frame(year = unique(dat$year), Latitude = 49.2237, 
                        Longitude =  -123.1636, Elevation_in_Meters = 24, 
                        Species = "yedoensis", 
                        Tmax_Winter = tmaxWinter_vanc[35:49, ]$tmax)
doy.pred <- predict(m_vanc, newdata = data_vanc)
doy.pred

##        1        2        3        4        5        6        7        8 
## 113.1107 112.4823 111.8566 111.2300 110.6037 109.9780 109.3497 108.7234 
##        9       10       11       12       13       14       15 
## 108.0989 107.4716 106.8449 106.2179 105.5914 104.9657 104.3390

data_vanc$bloom_doy <- doy.pred

# For the last two years, I do, in fact, have the true bloom dates. 
# I thus change the last two doys to the actual ones. 

data_vanc[14:15, ]$bloom_doy <- c(86, 97)

With these modeled bloom dates, I can now estimate the 2024 bloom date like I did for the other locations. But first, I have to estimate the Vancouver DRDs and calculate the past mean temperatures for January.

Estimate 2024 DRD and mean January temperature:

Vanc_2024 <- read.csv("peak-bloom-prediction/data/vancouver 2023-10-01 to 2024-02-29.csv", sep = ",", header = TRUE)

# check whether both data sources match: 
wGHCN <- subset(weather, location == "Vancouver" & 
                  date >= as.Date("2023-10-01") & date <=
                  as.Date("2023-10-31"))$tmax / 10
wVC <- subset(Vanc_2024, as.Date(datetime) >= as.Date("2023-10-01") & as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC, use = "complete.obs") # correlation is very high -> everything okay!

## [1] 0.9629359

Vanc_2024$doy <- yday(as.Date(Vanc_2024$datetime))
Vanc_2024$doy.n <- -91:60
Vanc_2024$tmax <- Vanc_2024$tempmax

mw_Vanc <- subset(aggregate(tmax ~ doy, subset(weather_Vanc, year >= 2013), mean), 
                  doy %in% 61:152)
mw_Vanc$tmax <- mw_Vanc$tmax / 10

w24_Vanc <- Vanc_2024[, c("doy.n", "tmax")]
names(mw_Vanc) <- c("doy.n", "tmax")
w_Vanc <- rbind(w24_Vanc, mw_Vanc)

# fit quadratic regression with specified weights: 
weights <- rep(c(1, 0.2), c(153, 91))
fit24_Vanc <- lm(tmax ~ doy.n + I(doy.n^2), w_Vanc, weights = weights)

coefs24_Vanc <- coef(fit24_Vanc)
doy.p <- -91: 153
fit.p_Vanc <- predict(fit24_Vanc, newdata = data.frame(doy.n = doy.p))
fun24_Vanc <- function(x){
  coefs24_Vanc[1] + coefs24_Vanc[2]*x + coefs24_Vanc[3]*x^2
}

DRD24_Vanc <- as.numeric(optimize(fun24_Vanc, interval = c(1, 365), 
                                  maximum = FALSE)[1])
t.DRD24_Vanc <- w_Vanc[w_Vanc$doy.n == round(DRD24_Vanc), ]$tmax
tpred24_Vanc <- as.numeric(optimize(fun24_Vanc, interval = c(1, 365), 
                                    maximum = FALSE)[2])
with(subset(w_Vanc, doy.n <= 60), 
     plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)", 
                     xlab = "Day of the Year", pch = 19, cex = 0.8, 
                     xlim = c(-91, 150), bty = "L", 
                     main = "Predicted 2023 DRD: Vancouver"))
with(subset(w_Vanc, doy.n > 60), 
     points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_Vanc ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_Vanc, col = "red")
text(DRD24_Vanc,  20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)

DRD_Vanc <- c(DRD_Vancouver, DRD24_Vanc) 
t.pred_Vanc <- c(temp_Vanc$t.pred, tpred24_Vanc)

# Calculate mean January temperature
j24_Vanc <- mean(subset(Vanc_2024, as.Date(datetime) >= as.Date("2024-01-01") &
              as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)

Predict 2024 Bloom Date in Vancouver

I now have all the data to predict the 2024 bloom date in Vancouver.

dat_Vanc <- merge(data_vanc, temp_Vanc, by = "year")

### Different Models
mYearV <- train(bloom_doy ~ year, data = dat_Vanc, 
                method = "lm", trControl = tc)
# summary(mYearV) # R^2 = 0.61
mDRDV <- train(bloom_doy ~ DRD, data = dat_Vanc, 
            method = "lm", trControl = tc)
# summary(mDRDV) # R^2 = 0.01
mtpredV <- train(bloom_doy ~ t.pred, data = dat_Vanc, 
                 method = "lm", trControl = tc)
# summary(mtpredV) # R^2 = 0.08
mjanV <- train(bloom_doy ~ mean.jan, data = dat_Vanc, 
            method = "lm", trControl = tc)
# summary(mjanV) # R^2 < .01
mAllV <- train(bloom_doy ~ year + DRD * t.pred + mean.jan, data = dat_Vanc, 
            method = "lm", trControl = tc)
summary(mAllV) # R^2 = 0.63

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9958  -1.2376   0.4798   2.6473   4.7052 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2738.7915   779.9940   3.511  0.00661 **
## year           -1.2962     0.3814  -3.398  0.00789 **
## DRD            -1.2371     2.5400  -0.487  0.63787   
## t.pred         -2.8939     5.8467  -0.495  0.63248   
## mean.jan        0.5430     1.5139   0.359  0.72810   
## `DRD:t.pred`    0.1582     0.3623   0.437  0.67263   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.327 on 9 degrees of freedom
## Multiple R-squared:  0.6268, Adjusted R-squared:  0.4194 
## F-statistic: 3.023 on 5 and 9 DF,  p-value: 0.0712

# I will use the model with everything but year as predictors. 

# check prediction accuracy: 
mAllV

## Linear Regression 
## 
## 15 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 14, 14, 14, 14, 14, 14, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   6.946136  0.2909384  5.019634
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

# RMSE = 6.95, R^2 = 0.29, MAE = 5.02
# of course, this needs to be interpreted with caution, since the DV (past blooom
# dates) are not the real ones. I am not sure why R^2 is so low, but the other
# values seem okay. 

### Predict Bloom Date 2024
data24_Vanc <- data.frame(year = 2024, DRD = DRD24_Vanc, t.pred = tpred24_Vanc, 
                          mean.jan = j24_Vanc)
predict(mYearV, newdata = data24_Vanc) # 97.24

##        1 
## 97.23967

doy_to_date(2024, 97)

## [1] "2024-04-06"

My prediction for Vancouver is April 6th, 2024.

Model historic bloom dates in New York City

m_ny <- lm(First_Yes_DOY ~ year + Latitude + Longitude + 
               Elevation_in_Meters + Species + Tmax_Winter, dat)

w_ny <- subset(weather, location == "NewYork")
w_ny <- subset(w_ny, month(date) %in% c(12, 1, 2))
tmaxWinter_ny <- aggregate(tmax ~ yearWinter, w_ny, mean)
tmaxWinter_ny$tmax <- tmaxWinter_ny$tmax / 10


data_ny <- data.frame(year = unique(dat$year), Latitude = 40.73040, 
                        Longitude =  -73.99809, Elevation_in_Meters = 8.5, 
                        Species = "yedoensis", 
                        Tmax_Winter = tmaxWinter_ny[35:49, ]$tmax)
doy.pred <- predict(m_ny, newdata = data_ny)
doy.pred

##         1         2         3         4         5         6         7         8 
## 102.64284 102.01767 101.39106 100.76176 100.13646  99.50985  98.88219  98.25755 
##         9        10        11        12        13        14        15 
##  97.63052  97.00433  96.37551  95.75077  95.12393  94.49763  93.87138

data_ny$bloom_doy <- doy.pred

With these modeled bloom dates, I can now estimate the 2024 bloom date like I did for the other locations. But first, I have to estimate the New York City DRDs and calculate the past mean temperatures for January.

Estimate 2023 DRD and mean January temperature:

NY_2024 <- read.csv("peak-bloom-prediction/data/new york city 2023-10-01 to 2024-02-29.csv", sep = ",", header = TRUE)

# check whether both data sources match: 
wGHCN <- subset(weather, location == "NewYork" & 
                  date >= as.Date("2023-10-01") & date <=
                  as.Date("2023-10-31"))$tmax / 10
wVC <- subset(NY_2024, as.Date(datetime) >= as.Date("2023-10-01") & as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC, use = "complete.obs") # correlation is very high -> everything okay!

## [1] 0.9814348

NY_2024$doy <- yday(as.Date(NY_2024$datetime))
NY_2024$doy.n <- -91:60
NY_2024$tmax <- NY_2024$tempmax

mw_NY <- subset(aggregate(tmax ~ doy, subset(weather_NY, year >= 2013), mean), 
                  doy %in% 61:152)
mw_NY$tmax <- mw_NY$tmax / 10

w24_NY <- NY_2024[, c("doy.n", "tmax")]
names(mw_NY) <- c("doy.n", "tmax")
w_NY <- rbind(w24_NY, mw_NY)

# fit quadratic regression with specified weights: 
weights <- rep(c(1, 0.2), c(153, 91))
fit24_NY <- lm(tmax ~ doy.n + I(doy.n^2), w_NY, weights = weights)

coefs24_NY <- coef(fit24_NY)
doy.p <- -91: 153
fit.p_NY <- predict(fit24_NY, newdata = data.frame(doy.n = doy.p))
fun24_NY <- function(x){
  coefs24_NY[1] + coefs24_NY[2]*x + coefs24_NY[3]*x^2
}

DRD24_NY <- as.numeric(optimize(fun24_NY, interval = c(1, 365), 
                                  maximum = FALSE)[1])
t.DRD24_NY <- w_NY[w_NY$doy.n == round(DRD24_NY), ]$tmax
tpred24_NY <- as.numeric(optimize(fun24_NY, interval = c(1, 365), 
                                    maximum = FALSE)[2])
with(subset(w_NY, doy.n <= 60), 
     plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)", 
                     xlab = "Day of the Year", pch = 19, cex = 0.8, 
                     xlim = c(-91, 150), bty = "L", 
                     main = "Predicted 2023 DRD: New York City"))
with(subset(w_NY, doy.n > 60), 
     points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_NY ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_NY, col = "red")
text(DRD24_NY,  20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)

DRD_NY <- c(DRD_NewYork, DRD24_NY) 
t.pred_NY <- c(temp_NY$t.pred, tpred24_NY)

# Calculate mean January temperature
j24_NY <- mean(subset(NY_2024, as.Date(datetime) >= as.Date("2024-01-01") &
              as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)

I now have all the data to predict the 2024 bloom date in New York City

dat_NY <- merge(data_ny, temp_NY, by = "year")

### Different Models
mYearN <- train(bloom_doy ~ year, data = dat_NY, 
                method = "lm", trControl = tc)
# summary(mYearV) # R^2 = 0.61
mDRDN <- train(bloom_doy ~ DRD, data = dat_NY, 
               method = "lm", trControl = tc)
# summary(mDRDV) # R^2 = 0.11
mtpredN <- train(bloom_doy ~ t.pred, data = dat_NY, 
                 method = "lm", trControl = tc)
# summary(mtpredV) # R^2 = 0.13
mjanN <- train(bloom_doy ~ mean.jan, data = dat_NY, 
               method = "lm", trControl = tc)
# summary(mjanV) # R^2 = 0.27
mAllN <- train(bloom_doy ~ year + DRD * t.pred + mean.jan, data = dat_NY, 
               method = "lm", trControl = tc)
summary(mAllV) # R^2 = 0.63

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9958  -1.2376   0.4798   2.6473   4.7052 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2738.7915   779.9940   3.511  0.00661 **
## year           -1.2962     0.3814  -3.398  0.00789 **
## DRD            -1.2371     2.5400  -0.487  0.63787   
## t.pred         -2.8939     5.8467  -0.495  0.63248   
## mean.jan        0.5430     1.5139   0.359  0.72810   
## `DRD:t.pred`    0.1582     0.3623   0.437  0.67263   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.327 on 9 degrees of freedom
## Multiple R-squared:  0.6268, Adjusted R-squared:  0.4194 
## F-statistic: 3.023 on 5 and 9 DF,  p-value: 0.0712

# I will use the model with everything but year as predictors. 

# check prediction accuracy: 
mAllV

## Linear Regression 
## 
## 15 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 14, 14, 14, 14, 14, 14, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   6.946136  0.2909384  5.019634
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

# RMSE = 6.95, R^2 = 0.29, MAE = 5.02
# of course, this needs to be interpreted with caution, since the DV (past blooom
# dates) are not the real ones. 

### Predict Bloom Date 2024
data24_NY <- data.frame(year = 2024, DRD = DRD24_NY, t.pred = tpred24_NY, 
                          mean.jan = j24_NY)
predict(mYearN, newdata = data24_NY) # 93.24

##        1 
## 93.24417

doy_to_date(2024, 93)

## [1] "2024-04-02"

My prediction for New York City is April 2nd, 2024.

I know that this is probably too late for New York City (news articles on cherry blossom in NY say it is in March rather than April). However, without any past dates, I don’t know how to improve my prediction.

Last Step: Produce csv-document of results

I now put all predictions into one csv document:

location <- c("washingtondc", "liestal", "kyoto", "vancouver", "newyorkcity")
prediction <- c(86, 91, 93, 97, 93)
prediction_df <- data.frame(location = location, prediction = prediction)
write_csv(prediction_df, file = "CherryBlossom.csv")

prediction_df

##       location prediction
## 1 washingtondc         86
## 2      liestal         91
## 3        kyoto         93
## 4    vancouver         97
## 5  newyorkcity         93

Cherry Blossom Prediction Contest 2024

Luna Frauhammer

2/23/2024