Goal of this contest is to predict the 2024 cherry tree bloom dates
at five different locations around the world. I will predict this year’s
bloom dates using four different predictors in a linear regression:
For Vancouver and New York City, I have to proceed differently,
because there exist no (or only few) records of past bloom dates. I thus
use the USA-NPN dataset (containing records of bloom dates in different
years and different cities in the US) to model past Vancouver and New
York City bloom dates. I then use this modeled data to predict future
bloom dates.
setwd("/Users/lunafrauhammer/Library/CloudStorage/OneDrive-UniversitaetDuisburg-Essen/Sonstiges/CherryBlossom/CherryBlossom24")
# set working directory to folder CherryBlossom24 to run this code.
library(tidyverse)
library(data.table)
library(lubridate)
library(Metrics)
library(showtext)
library(sysfonts)#
library(caret)
doy_to_date <- function (year, doy) {
strptime(paste(year, doy, sep = '-'), '%Y-%j') %>% # create date object
strftime('%Y-%m-%d') # translate back to date string in ISO 8601 format
}
cherry1 <- read.csv("peak-bloom-prediction/data/washingtondc.csv") %>%
bind_rows(read.csv("peak-bloom-prediction/data/liestal.csv")) %>%
bind_rows(read.csv("peak-bloom-prediction/data/kyoto.csv")) %>%
bind_rows(read.csv("peak-bloom-prediction/data/vancouver.csv"))
cherry1 <- subset(cherry1, year >= 1954)
# I will only use data from 1954 onwards.
# Because there are only two dates for Vancouver, I will remove this for now:
cherry <- subset(cherry1, location != "vancouver")
# Let's take a look at the historic bloom dates:
city_Colors<- c(kyoto = "lightpink1", liestal = "lightskyblue",
washingtondc = "palegreen3")
ggplot(cherry, aes(x = year, y = bloom_doy)) +
geom_text(label = "\u273F", aes(color = location),
size=4, family = "Arial Unicode MS", alpha = 1) +
# scale_color_discrete(name = "location") +
theme_classic() +
labs(x = "Year", y = "Full Blossom Day of the Year") +
geom_smooth(method = "lm", fullrange = TRUE, se = FALSE,
aes(color = location), show.legend = FALSE) +
# geom_smooth(method = "lm", fullrange = TRUE, aes(fill = location), alpha = 0.2,)
scale_color_manual(values=city_Colors, name = "Location",
labels = c("Kyoto", "Liestal-Weideli", "Washington, D.C.")) +
ggtitle("Historic Bloom Dates") +
theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
ylim(70, 120)
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_text()`).
# ggsave("historicDates.png", p1)
# get the weather information:
library(rnoaa)
# stations <- ghcnd_stations()
weatherWash <- ghcnd_search("USC00186350")
weatherLies <- ghcnd_search("GME00127786")
weatherKyot <- ghcnd_search("JA000047759")
weatherVan <- ghcnd_search("CA001108395")
weatherNY <- ghcnd_search("USW00094728")
weather <- rbind(weatherWash$tmax, weatherLies$tmax, weatherKyot$tmax,
weatherVan$tmax, weatherNY$tmax)
# I am using the maximum daily temperatures.
# Prepare data set:
weather$location <- NA
weather[weather$id == "USC00186350", ]$location <- "Washington"
weather[weather$id == "GME00127786", ]$location <- "Liestal"
weather[weather$id == "JA000047759", ]$location <- "Kyoto"
weather[weather$id == "CA001108395", ]$location <- "Vancouver"
weather[weather$id == "USW00094728", ]$location <- "NewYork"
summary(weather[weather$location == "Washington", ]$date) # data from 10/01/1948 - 12/31/2023
summary(weather[weather$location == "Liestal", ]$date) # data from 10/01/1953 - 12/31/2023
summary(weather[weather$location == "Kyoto", ]$date) # data from 01/01/1951 - 02/29/2024
summary(weather[weather$location == "Vancouver", ]$date) # data from 01/01/1957 - 02/29/2024
summary(weather[weather$location == "NewYork", ]$date) # data from 01/01/1869 - 02/29/2024
weather$date <- as.Date(weather$date, "%Y/%m/%d")
weather$year <- year(weather$date)
weather$date.within <- format(weather$date, format="%m-%d")
weather$doy <- yday(weather$date) # day of the year
In the first step, I only predict the 2023 bloom dates for those cities for which I have historic bloom data, i.e., Washington, Liestal, and Kyoto. Like stated above, I will use the historic bloom dates as well as each year’s DRD, modeled DRD temperature and mean January temperature as predictors.
The dormacy release date (DRD) is the day in late Winter / early
Spring when the temperature trend reverses from going down to going up
again. Since cherry blossoms need both periods of decreasing / colder
and increasing / warmer temperatures to bloom, the DRD can be an
important predictor of the cherry blossom bloom date (Chung et al.,
2011).
In order to get each year’s DRD, I fit a quadratic regression on the temperatures between October 1st and March 31st. I then use the minimum of this function as estimation of the DRD.
# I first need a new yearly organizing variable that changes on October 1st
# instead of on January 1st:
weather$yearWinter <- NA
for(y in 1:nrow(weather)){
year.margin <- as.Date(paste(weather$year[y], "10-01", sep = "-"))
weather$yearWinter[y] <- ifelse(weather$date[y] >= year.margin,
weather$year[y] + 1, weather$year[y])
}
# I then exclude all data points that lie in between June and October because
# I don't need them
weather <- weather[!(month(weather$date) %in% c(6:9)), ]
# I now add a new DOY variable with negative numbers from October 1st to
# December 31st
weather$doy.n <- 0
for(y in unique(weather$yearWinter)){
weather[weather$yearWinter == y & weather$location == "Washington", ]$doy.n <-
seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y &
weather$location == "Washington", ]))
weather[weather$yearWinter == y & weather$location == "Liestal", ]$doy.n <-
seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y &
weather$location == "Liestal", ]))
weather[weather$yearWinter == y & weather$location == "Kyoto", ]$doy.n <-
seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y &
weather$location == "Kyoto", ]))
weather[weather$yearWinter == y & weather$location == "Vancouver", ]$doy.n <-
seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y &
weather$location == "Vancouver", ]))
weather[weather$yearWinter == y & weather$location == "NewYork", ]$doy.n <-
seq(-91, by = 1, length.out = nrow(weather[weather$yearWinter == y &
weather$location == "NewYork", ]))
}
# I now estimate the DRDs.
# Illustration for the year 2022 in Liestal:
temp23 <- subset(weather, location == "Liestal" & yearWinter == 2023)
fit23 <- lm(tmax ~ doy.n + I(doy.n^2), temp23)
coefs23 <- coef(fit23)
doy.p <- -91: nrow(temp23) - 92
fit.p <- predict(fit23, newdata = data.frame(doy.n = doy.p))
fun23 <- function(x){
coefs23[1] + coefs23[2]*x + coefs23[3]*x^2
}
DRD23 <- as.numeric(optimize(fun23, interval = c(1, 365), maximum = FALSE)[1])
temp_pred23 <- as.numeric(optimize(fun23, interval = c(1, 365), maximum = FALSE)[2])
with(temp23, plot(tmax ~ doy.n, ylab = "Maximum Temperature (1/10 °Celsius)",
xlab = "Day of the Year", pch = 19, cex = 0.8, bty = "L",
main = "Illustration: 2023 DRD in Liestal"))
lines(fit.p ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD23, col = "red")
text(DRD23, 220, substitute(paste(bold("Dormancy Release Date"))), cex = 0.6, col = "red", pos = 4)
abline(h = temp_pred23, col = "red")
text(DRD23 +60, temp_pred23 - 10, substitute(paste(bold("Modeled DRD Temperature"))), cex = 0.6, col = "red", pos = 4)
Now, I model the DRD for each year in the temperature data set for each of the three locations.
for(l in c("Washington", "Liestal", "Kyoto", "Vancouver", "NewYork")){
DRD <- year <- DRD.t.pred <- NULL
weather_Loc <- subset(weather, location == l)
for(y in 1:(length(unique(weather_Loc$yearWinter))-1)){
year.y <- unique(weather_Loc$yearWinter)[y]
temp.year <- weather_Loc[weather_Loc$yearWinter == year.y, ]
if(nrow(temp.year) > 200){ # not all years are complete
coefs <- coef(lm(tmax ~ doy.n + I(doy.n^2), temp.year))
fun <- function(x){
coefs[1] + coefs[2]*x + coefs[3]*x^2
}
DRD[y] <- as.numeric(optimize(fun, interval = c(-91, 200),
maximum = FALSE)[1])
year[y] <- unique(weather_Loc$yearWinter)[y]
DRD.t.pred[y] <- as.numeric(optimize(fun, interval = c(-91, 200),
maximum = FALSE)[2]) / 10
}
}
assign(paste("DRD_", l, sep = ""), DRD)
assign(paste("DRD.t.pred_", l, sep = ""), DRD.t.pred)
assign(paste("year_", l, sep = ""), year)
}
temp_Wash <- data.frame(year = year_Washington, DRD = DRD_Washington,
t.pred = DRD.t.pred_Washington)
# boxplot.stats(temp_Wash$DRD)$out
temp_Lies <- data.frame(year = year_Liestal, DRD = DRD_Liestal,
t.pred = DRD.t.pred_Liestal)
# boxplot.stats(temp_Wash$DRD)$out
temp_Kyot <- data.frame(year = year_Kyoto, DRD = DRD_Kyoto,
t.pred = DRD.t.pred_Kyoto)
# boxplot.stats(temp_Kyot$DRD)$out
temp_Vanc <- data.frame(year = year_Vancouver, DRD = DRD_Vancouver,
t.pred = DRD.t.pred_Vancouver)
# boxplot.stats(temp_Vanc$DRD)$out
temp_NY <- data.frame(year = year_NewYork, DRD = DRD_NewYork,
t.pred = DRD.t.pred_NewYork)
temp <- rbind(temp_Wash, temp_Lies, temp_Kyot, temp_Vanc, temp_NY)
temp$location <- rep(c("Washington", "Liestal", "Kyoto", "Vancouver", "NewYork"),
c(nrow(temp_Wash), nrow(temp_Lies), nrow(temp_Kyot),
nrow(temp_Vanc), nrow(temp_NY)))
ggplot(temp, aes(x = year, y = DRD)) +
geom_point() +
ylim(-5, 30) +
xlim(1950, 2020) +
facet_grid(~location) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
ylab("Estimated DRD") +
xlab("Year") +
ggtitle("Historic DRDs") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
ggplot(temp, aes(x = year, y = t.pred)) +
geom_point() +
facet_grid(~location) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
ylim(-5, 15) +
xlim(1950, 2020) +
ylab("Modeled temperature at DRD (°C)") +
xlab("Year") +
ggtitle("Historic DRD Temperatures") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
As an additional predictor, I will use each year’s mean temperature in January. In the following steps, I get this data from the historic weather data set.
## Washington, D.C.
mean.jan_Wash <- NULL
year_Wash <- NULL
weather_Wash <- subset(weather, location == "Washington")
weather_Wash_jan <- subset(weather_Wash, month(date) == 1)
for(y in 1:(length(unique(weather_Wash_jan$yearWinter))-1)){
year <- unique(weather_Wash_jan$yearWinter)[y]
mean.jan_Wash[y] <- mean(weather_Wash_jan[weather_Wash_jan$yearWinter == year,
]$tmax,na.rm = TRUE) / 10
year_Wash[y] <- year
}
jan_Wash <- data.frame(mean.jan = mean.jan_Wash, year = year_Wash)
## Liestal:
mean.jan_Lies <- NULL
year_Lies <- NULL
weather_Lies <- subset(weather, location == "Liestal")
weather_Lies_jan <- subset(weather_Lies, month(date) == 1)
for(y in 1:(length(unique(weather_Lies_jan$yearWinter))-1)){
year <- unique(weather_Lies_jan$yearWinter)[y]
mean.jan_Lies[y] <- mean(weather_Lies_jan[weather_Lies_jan$yearWinter == year,
]$tmax,na.rm = TRUE) / 10
year_Lies[y] <- year
}
jan_Lies <- data.frame(mean.jan = mean.jan_Lies, year = year_Lies)
## Kyoto:
mean.jan_kyot <- NULL
year_kyot <- NULL
weather_kyot <- subset(weather, location == "Kyoto")
weather_kyot_jan <- subset(weather_kyot, month(date) == 1)
for(y in 1:(length(unique(weather_kyot_jan$yearWinter))-1)){
year <- unique(weather_kyot_jan$yearWinter)[y]
mean.jan_kyot[y] <- mean(weather_kyot_jan[weather_kyot_jan$yearWinter == year,
]$tmax,na.rm = TRUE) / 10
year_kyot[y] <- year
}
jan_Kyot <- data.frame(mean.jan = mean.jan_kyot, year = year_kyot)
## Vancouver:
mean.jan_Vanc <- NULL
year_Vanc <- NULL
weather_Vanc <- subset(weather, location == "Vancouver")
weather_Vanc_jan <- subset(weather_Vanc, month(date) == 1)
for(y in 1:(length(unique(weather_Vanc_jan$yearWinter))-1)){
year <- unique(weather_Vanc_jan$yearWinter)[y]
mean.jan_Vanc[y] <- mean(weather_Vanc_jan[weather_Vanc_jan$yearWinter == year,
]$tmax,na.rm = TRUE) / 10
year_Vanc[y] <- year
}
jan_Vanc <- data.frame(mean.jan = mean.jan_Vanc, year = year_Vanc)
### New York:
mean.jan_NY <- NULL
year_NY <- NULL
weather_NY <- subset(weather, location == "NewYork")
weather_NY_jan <- subset(weather_NY, month(date) == 1)
for(y in 1:(length(unique(weather_NY_jan$yearWinter))-1)){
year <- unique(weather_NY_jan$yearWinter)[y]
mean.jan_NY[y] <- mean(weather_NY_jan[weather_NY_jan$yearWinter == year,
]$tmax,na.rm = TRUE) / 10
year_NY[y] <- year
}
jan_NY <- data.frame(mean.jan = mean.jan_NY, year = year_NY)
temp_Wash <- merge(temp_Wash, jan_Wash, by = "year")
temp_Lies <- merge(temp_Lies, jan_Lies, by = "year")
temp_Kyot <- merge(temp_Kyot, jan_Kyot, by = "year")
temp_Vanc <- merge(temp_Vanc, jan_Vanc, by = "year")
temp_NY <- merge(temp_NY, jan_NY, by = "year")
temp <- rbind(temp_Wash, temp_Lies, temp_Kyot, temp_Vanc, temp_NY)
temp$location <- rep(c("Washington", "Liestal", "Kyoto", "Vancouver", "NewYork"),
c(nrow(temp_Wash), nrow(temp_Lies), nrow(temp_Kyot),
nrow(temp_Vanc), nrow(temp_NY)))
ggplot(temp, aes(x = year, y = mean.jan)) +
geom_point() +
facet_grid(~location) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
ylim(-5, 15) +
xlim(1950, 2020) +
ylab("Mean January temperature (°C)") +
xlab("Year") +
ggtitle("Historic Mean January Temperatures") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
Before I can estimate the 2024 bloom dates, I need to estimate this
year’s DRD. However, the ghcnd weather data ends in 2023, so I looked
for more current temperature data online and retrieved free temperature
data from https://www.visualcrossing.com. These data sets include
the weather data between 10/01/2023 and 02/29/2024 (forecast) for each
city.
Wash_2024 <- read.csv("peak-bloom-prediction/data/washington dc 2023-10-01 to 2024-02-29.csv", sep = ",", header = TRUE)
# check whether both data sources match:
wGHCN <- subset(weather, location == "Washington" &
date >= as.Date("2023-10-01") & date <=
as.Date("2023-10-31"))$tmax / 10
wVC <- subset(Wash_2024, as.Date(datetime) >= as.Date("2023-10-01") & as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC, use = "complete.obs") # r = .83 so this seems okay!
## [1] 0.8289739
Wash_2024$doy <- yday(as.Date(Wash_2024$datetime))
Wash_2024$doy.n <- -91:60
Wash_2024$tmax <- Wash_2024$tempmax
# the problem is that I have to little 2023 data to sensibly model the DRD. I will
# thus use the last 10 years' mean temperature data between March and June to
# improve the DRD prediction. However, these values will be weighted less strongly
# than the actually observed temperature data.
yday(as.Date("2024-03-01"))
## [1] 61
yday(as.Date("2024-05-31"))
## [1] 152
# I get the 61th - 153rd days of the year; for leap years this is March 1st - June 1st
# for other years it is March 2nd - June 2nd.
# I now get the mean temperature of the last 10 years for all of these days:
mw_Wash <- subset(aggregate(tmax ~ doy, subset(weather_Wash, year >= 2013), mean),
doy %in% 61:152)
mw_Wash$tmax <- mw_Wash$tmax / 10
w24_Wash <- Wash_2024[, c("doy.n", "tmax")]
names(mw_Wash) <- c("doy.n", "tmax")
w_Wash <- rbind(w24_Wash, mw_Wash)
# fit quadratic regression with specified weights:
weights <- rep(c(1, 0.2), c(153, 91))
fit24_Wash <- lm(tmax ~ doy.n + I(doy.n^2), w_Wash, weights = weights)
coefs24_Wash <- coef(fit24_Wash)
doy.p <- -91: 152
fit.p_Wash <- predict(fit24_Wash, newdata = data.frame(doy.n = doy.p))
fun24_Wash <- function(x){
coefs24_Wash[1] + coefs24_Wash[2]*x + coefs24_Wash[3]*x^2
}
DRD24_Wash <- as.numeric(optimize(fun24_Wash, interval = c(1, 365),
maximum = FALSE)[1])
t.DRD24_Wash <- w_Wash[w_Wash$doy.n == round(DRD24_Wash), ]$tmax
tpred24_Wash <- as.numeric(optimize(fun24_Wash, interval = c(1, 365),
maximum = FALSE)[2])
with(subset(w_Wash, doy.n <= 60),
plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)",
xlab = "Day of the Year", pch = 19, cex = 0.8,
xlim = c(-91, 150), bty = "L",
main = "Predicted 2024 DRD: Washington, D.C."))
with(subset(w_Wash, doy.n > 60),
points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_Wash ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_Wash, col = "red")
text(DRD24_Wash, 20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)
DRD_Wash <- c(DRD_Washington, DRD24_Wash)
t.pred_Wash <- c(temp_Wash$t.pred, tpred24_Wash)
# Calculate mean January temperature
j24_Wash <- mean(subset(Wash_2024, as.Date(datetime) >= as.Date("2024-01-01") &
as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)
Lies_2024 <- read.csv("peak-bloom-prediction/data/liestal 2023-10-01 to 2024-02-29.csv",
sep = ",", header = TRUE)
# check whether both data sources match:
wGHCN <- subset(weather, location == "Liestal" &
date >= as.Date("2023-10-01") & date <=
as.Date("2023-10-31"))$tmax / 10
wVC <- subset(Lies_2024, as.Date(datetime) >= as.Date("2023-10-01") &
as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC) # very high correlation -> looks good!
## [1] 0.9899423
Lies_2024$doy <- yday(as.Date(Lies_2024$datetime))
Lies_2024$doy.n <- -91:60
Lies_2024$tmax <- Lies_2024$tempmax
mw_Lies <- subset(aggregate(tmax ~ doy, subset(weather_Lies, year >= 2013), mean),
doy %in% 61:152)
mw_Lies$tmax <- mw_Lies$tmax / 10
w24_Lies <- Lies_2024[, c("doy.n", "tmax")]
names(mw_Lies) <- c("doy.n", "tmax")
w_Lies <- rbind(w24_Lies, mw_Lies)
# fit quadratic regression with specified weights:
weights <- rep(c(1, 0.2), c(153, 91))
fit24_Lies <- lm(tmax ~ doy.n + I(doy.n^2), w_Lies, weights = weights)
coefs24_Lies <- coef(fit24_Lies)
doy.p <- -91: 153
fit.p_Lies <- predict(fit24_Lies, newdata = data.frame(doy.n = doy.p))
fun24_Lies <- function(x){
coefs24_Lies[1] + coefs24_Lies[2]*x + coefs24_Lies[3]*x^2
}
DRD24_Lies <- as.numeric(optimize(fun24_Lies, interval = c(1, 365),
maximum = FALSE)[1])
t.DRD24_Lies <- w_Lies[w_Lies$doy.n == round(DRD24_Lies), ]$tmax
tpred24_Lies <- as.numeric(optimize(fun24_Lies, interval = c(1, 365),
maximum = FALSE)[2])
with(subset(w_Lies, doy.n <= 60),
plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)",
xlab = "Day of the Year", pch = 19, cex = 0.8,
xlim = c(-91, 150), bty = "L",
main = "Predicted 2023 DRD: Liestal"))
with(subset(w_Lies, doy.n > 60),
points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_Lies ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_Lies, col = "red")
text(DRD24_Lies, 20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)
DRD_Lies <- c(DRD_Liestal, DRD24_Lies)
t.pred_Lies <- c(temp_Lies$t.pred, tpred24_Lies)
# Calculate mean January temperature:
j24_Lies <- mean(subset(Lies_2024, as.Date(datetime) >= as.Date("2024-01-01") &
as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)
Kyot_2024 <- read.csv("peak-bloom-prediction/data/kyoto 2023-10-01 to 2024-02-29.csv",
sep = ",", header = TRUE)
# check whether both data sources match:
wGHCN <- subset(weather, location == "Kyoto" &
date >= as.Date("2023-10-01") & date <=
as.Date("2023-10-31"))$tmax / 10
wVC <- subset(Kyot_2024, as.Date(datetime) >= as.Date("2023-10-01") &
as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC, use = "complete.obs") # very high correlation -> no problem!
## [1] 0.9914435
Kyot_2024$doy <- yday(as.Date(Kyot_2024$datetime))
Kyot_2024$doy.n <- -91:60
Kyot_2024$tmax <- Kyot_2024$tempmax
weather_Kyot <- weather_kyot
mw_Kyot <- subset(aggregate(tmax ~ doy, subset(weather_Kyot, year >= 2013), mean),
doy %in% 61:152)
mw_Kyot$tmax <- mw_Kyot$tmax / 10
w24_Kyot <- Kyot_2024[, c("doy.n", "tmax")]
names(mw_Kyot) <- c("doy.n", "tmax")
w_Kyot <- rbind(w24_Kyot, mw_Kyot)
# fit quadratic regression with specified weights:
weights <- rep(c(1, 0.2), c(153, 91))
fit24_Kyot <- lm(tmax ~ doy.n + I(doy.n^2), w_Kyot, weights = weights)
coefs24_Kyot <- coef(fit24_Kyot)
doy.p <- -91: 152
fit.p_Kyot <- predict(fit24_Kyot, newdata = data.frame(doy.n = doy.p))
fun24_Kyot <- function(x){
coefs24_Kyot[1] + coefs24_Kyot[2]*x + coefs24_Kyot[3]*x^2
}
DRD24_Kyot <- as.numeric(optimize(fun24_Kyot, interval = c(1, 365),
maximum = FALSE)[1])
t.DRD24_Kyot <- w_Kyot[w_Kyot$doy.n == round(DRD24_Kyot), ]$tmax
tpred24_Kyot <- as.numeric(optimize(fun24_Kyot, interval = c(1, 365),
maximum = FALSE)[2])
with(subset(w_Kyot, doy.n <= 60),
plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)",
xlab = "Day of the Year", pch = 19, cex = 0.8,
xlim = c(-91, 150), bty = "L",
main = "Predicted 2023 DRD: Kyoto"))
with(subset(w_Kyot, doy.n > 60),
points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_Kyot ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_Kyot, col = "red")
text(DRD24_Kyot, 20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)
DRD_Kyot <- c(DRD_Kyoto, DRD24_Kyot)
t.pred_Kyot <- c(temp_Kyot$t.pred, tpred24_Kyot)
# Calculate mean January temperature
# 2024:
j24_Kyot <- mean(subset(Kyot_2024, as.Date(datetime) >= as.Date("2024-01-01") &
as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)
After modeling the DRDs, DRD temperatures, and calculating each year’s mean January temperature, I can now predict the 2024 bloom dates. I do this seperately for each city since I want to consider the goodness of fit indices for all cities.
tc <- trainControl(method = "LOOCV") # set caret train function to leave one out
cherry_Wash <- subset(cherry, location == "washingtondc")
# now use data from temperature data set: temp_Lies
dat_Wash <- merge(cherry_Wash, temp_Wash, by = "year")
### Different Models
mYearW <- train(bloom_doy ~ year, data = dat_Wash,
method = "lm", trControl = tc)
# mYearW
# summary(mYearW) # R^2 = 0.23
mDRDW <- train(bloom_doy ~ DRD, data = dat_Wash,
method = "lm", trControl = tc)
# summary(mDRDW) # R^2 = 0.13
mtpredW <- train(bloom_doy ~ t.pred, data = dat_Wash,
method = "lm", trControl = tc)
# summary(mtpredW) # R^2 = 0.25
mjanW <- train(bloom_doy ~ mean.jan, data = dat_Wash,
method = "lm", trControl = tc)
# summary(mjanW) # R^2 = 0.02
mAllW <- train(bloom_doy ~ year + DRD + t.pred + mean.jan, data = dat_Wash,
method = "lm", trControl = tc)
summary(mAllW) # R^2 = 0.59 -> I will take this model!
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3155 -3.0399 -0.1244 3.0013 10.1582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 385.31877 57.72019 6.676 8.34e-09 ***
## year -0.14336 0.02929 -4.894 7.57e-06 ***
## DRD 0.50699 0.12226 4.147 0.000106 ***
## t.pred -3.00484 0.53672 -5.598 5.46e-07 ***
## mean.jan 0.99505 0.34480 2.886 0.005390 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.722 on 61 degrees of freedom
## Multiple R-squared: 0.5857, Adjusted R-squared: 0.5585
## F-statistic: 21.56 on 4 and 61 DF, p-value: 4.011e-11
# check cross validation results (using leave-one-out method):
mAllW
## Linear Regression
##
## 66 samples
## 4 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 65, 65, 65, 65, 65, 65, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.906488 0.5168813 4.131986
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE = 4.91, R^2 = 0.52, MAE = 4.13
plot(bloom_doy ~ year, dat_Wash, cex = 0.5, pch = 19,
ylab = "Full Bloom Day of the Year", bty = "L",
xlab = "Year", main = "Best Predictor for Washington, D.C.: Year")
mYearW <- lm(bloom_doy ~ year, data = dat_Wash)
abline(mYearW)
### Predict Bloom Date 2024
data24_Wash <- data.frame(year = 2024, DRD = DRD24_Wash, t.pred = tpred24_Wash,
mean.jan = j24_Wash)
predict(mAllW, newdata = data24_Wash) # 85.86
## 1
## 85.85945
doy_to_date(2024, 86) # I rounded it up to 86 since 85.86 is pretty late on March 25th.
## [1] "2024-03-26"
# I will compare this multiple regression approach to a random forest model:
library(caret)
rf_Wash_1 <- train(bloom_doy ~ year + DRD + t.pred + mean.jan,
data = dat_Wash,
method = "cforest",
tuneGrid = data.frame(.mtry = 2),
trControl = trainControl(method = "oob"))
rf_Wash_1
## Conditional Inference Random Forest
##
## 66 samples
## 4 predictor
##
## No pre-processing
## Resampling results:
##
## RMSE Rsquared MAE
## 5.872503 0.321994 4.767238
##
## Tuning parameter 'mtry' was held constant at a value of 2
# OOB RMSR = 5.78 -> this is higher than the multiple regression model.
predict(rf_Wash_1, newdata = data24_Wash)
## .outcome
## 89.19813
doy_to_date(2024, 89)
## [1] "2024-03-29"
# random forest prediction would be March 29th. However, I will stick to the
# regression prediction, since it has better fit indices.
My prediction for Washington is March 25th 2024
cherry_lies <- subset(cherry, location == "liestal")
# now use data from temperature data set: temp_Lies
dat_lies <- merge(cherry_lies, temp_Lies, by = "year")
mYearL <- train(bloom_doy ~ year, data = dat_lies,
method = "lm", trControl = tc)
# summary(mYearL) # R^2 = 0.22
mDRDL <- train(bloom_doy ~ DRD, data = dat_lies,
method = "lm", trControl = tc)
# summary(mDRDL) # R^2 = 0.34
mtpredL <- train(bloom_doy ~ t.pred, data = dat_lies,
method = "lm", trControl = tc)
# summary(mtpredL) # R^2 = 0.57
mjanL <- train(bloom_doy ~ mean.jan, data = dat_lies,
method = "lm", trControl = tc)
# summary(mjanL) # R^2 = 0.08
mAllL <- train(bloom_doy ~ year + DRD + t.pred + mean.jan, data = dat_lies,
method = "lm", trControl = tc)
# summary(mAllL) # R^2 = 0.72
mAllL2 <- train(bloom_doy ~ year + DRD * t.pred + mean.jan, data = dat_lies,
method = "lm", trControl = tc)
summary(mAllL2) # R^2 = 0.73 -> I will take this model.
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.3303 -4.2909 0.5589 3.5102 18.0839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.90449 90.28503 3.798 0.000339 ***
## year -0.12232 0.04597 -2.661 0.009946 **
## DRD 1.26673 0.58305 2.173 0.033707 *
## t.pred -3.28782 1.65645 -1.985 0.051662 .
## mean.jan 1.71900 0.42677 4.028 0.000158 ***
## `DRD:t.pred` -0.12579 0.09684 -1.299 0.198834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.411 on 61 degrees of freedom
## Multiple R-squared: 0.7289, Adjusted R-squared: 0.7067
## F-statistic: 32.81 on 5 and 61 DF, p-value: 4.36e-16
# check cross validation results (using leave-one-out method):
mAllL2
## Linear Regression
##
## 67 samples
## 4 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 66, 66, 66, 66, 66, 66, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6.823537 0.6653337 5.274706
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE = 6.82, R^2 = 0.67, MAE = 5.27
# For Liestal, the modeled temperature at DRD is the best single predictor:
rbPal <- colorRampPalette(c('cadetblue2','maroon3'))
dat_lies$Col <- rbPal(10)[as.numeric(cut(dat_lies$year,breaks = 10))]
plot(bloom_doy ~ t.pred, dat_lies, cex = 0.5, pch = 19,
ylab = "Full Bloom Day of the Year",
xlab = "ModeledTemperature at Dormancy Release Date",
main = "Best Predictor for Liestal: DRD Temperature",
bty = "L", col = dat_lies$Col)
with(dat_lies, text((bloom_doy -1.75) ~ t.pred, labels = year, cex = 0.58, col = scales::alpha(dat_lies$Col, alpha = 0.9)))
mtpredL <- lm(bloom_doy ~ t.pred, data = dat_lies)
abline(mtpredL)
### Predict Bloom Date 2023
data24_Lies <- data.frame(year = 2024, DRD = DRD24_Lies, t.pred = tpred24_Lies,
mean.jan = j24_Lies)
predict(mAllL2, newdata = data24_Lies) # 91.40
## 1
## 91.39672
doy_to_date(2024, 91)
## [1] "2024-03-31"
# random forest method:
rf_Lies_1 <- train(bloom_doy ~ year + DRD + t.pred + mean.jan,
data = dat_lies,
method = "cforest",
tuneGrid = data.frame(.mtry = 2),
trControl = trainControl(method = "oob"))
rf_Lies_1
## Conditional Inference Random Forest
##
## 67 samples
## 4 predictor
##
## No pre-processing
## Resampling results:
##
## RMSE Rsquared MAE
## 7.491085 0.6069835 5.727468
##
## Tuning parameter 'mtry' was held constant at a value of 2
# RMSE = 7.43, R^2 = 0.61, MAE = 5.67 -> Again, the regression approach gives the
# better results
My prediction for Liestal is March 31st 2024
cherry_kyot <- subset(cherry, location == "kyoto")
# now use data from temperature data set: temp_kyot
dat_kyot <- merge(cherry_kyot, temp_Kyot, by = "year")
mYearK <- train(bloom_doy ~ year, data = dat_kyot,
method = "lm", trControl = tc)
# summary(mYearK) # R^2 = 0.24
mDRDK <- train(bloom_doy ~ DRD, data = dat_kyot,
method = "lm", trControl = tc)
# summary(mDRDK) # R^2 = 0.40
mtpredK <- train(bloom_doy ~ t.pred, data = dat_kyot,
method = "lm", trControl = tc)
# summary(mtpredK) # R^2 = 0.33
mjanK <- train(bloom_doy ~ mean.jan, data = dat_kyot,
method = "lm", trControl = tc)
# summary(mjanK) # R^2 = 0.05
mAllK <- train(bloom_doy ~ year + DRD + t.pred + mean.jan, data = dat_kyot,
method = "lm", trControl = tc)
# summary(mAllK) # R^2 = 0.76
mAllK2 <- train(bloom_doy ~ year + DRD * t.pred + mean.jan, data = dat_kyot,
method = "lm", trControl = tc)
summary(mAllK2) # R^2 = 0.78-> I will take this model.
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.975 -1.716 -0.111 1.573 6.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.49585 40.80485 4.129 0.000111 ***
## year -0.07749 0.01492 -5.194 2.43e-06 ***
## DRD 4.19033 1.29115 3.245 0.001893 **
## t.pred 5.02142 3.27446 1.534 0.130237
## mean.jan 1.44701 0.38591 3.750 0.000392 ***
## `DRD:t.pred` -0.34390 0.13219 -2.601 0.011593 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.353 on 62 degrees of freedom
## Multiple R-squared: 0.7847, Adjusted R-squared: 0.7673
## F-statistic: 45.18 on 5 and 62 DF, p-value: < 2.2e-16
# check cross validation results (using leave-one-out method):
mAllK2
## Linear Regression
##
## 68 samples
## 4 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 67, 67, 67, 67, 67, 67, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.463822 0.7414536 1.987877
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE = 2.46, R^2 = 0.74, MAE = 1.99
rbPal <- colorRampPalette(c('cadetblue2','maroon3'))
dat_kyot$Col <- rbPal(10)[as.numeric(cut(dat_kyot$year,breaks = 10))]
plot(bloom_doy ~ DRD, dat_kyot, cex = 0.5, pch = 19,
ylab = "Full Bloom Day of the Year", main = "Best Predictor for Kyoto: DRD",
xlab = "Dormacy Release Date", bty = "n", col = dat_kyot$Col)
box(bty = "L")
with(dat_kyot, text((bloom_doy -1) ~ DRD, labels = year, cex = 0.58,
col = scales::alpha(dat_kyot$Col, alpha = 0.8)))
mDRDK <- lm(bloom_doy ~ DRD, data = dat_kyot)
abline(mDRDK)
# check prediction accuracy:
pbloomK <- predict(mAllK2, newdata = data.frame(year = dat_kyot$year,
DRD = dat_kyot$DRD, t.pred = dat_kyot$t.pred, mean.jan = dat_kyot$t.pred))
rmse(dat_kyot$bloom_doy, pbloomK) # 2.83
## [1] 2.826971
mae(dat_kyot$bloom_doy, pbloomK) # 2.28
## [1] 2.283398
cor(dat_kyot$bloom_doy, pbloomK) # 0.86
## [1] 0.8562401
### Predict Bloom Date 2024
data24_Kyot <- data.frame(year = 2024, DRD = DRD24_Kyot, t.pred = tpred24_Kyot,
mean.jan = j24_Kyot)
predict(mAllK2, newdata = data24_Kyot) # 93.20
## 1
## 93.20421
doy_to_date(2024, 93)
## [1] "2024-04-02"
rf_Kyot_1 <- train(bloom_doy ~ year + DRD + t.pred + mean.jan,
data = dat_kyot,
method = "cforest",
tuneGrid = data.frame(.mtry = 2),
trControl = trainControl(method = "oob"))
rf_Kyot_1 # RMSE = 3.21, R 2 = 0.63, MAE = 2.32 -> Again, regression is better.
## Conditional Inference Random Forest
##
## 68 samples
## 4 predictor
##
## No pre-processing
## Resampling results:
##
## RMSE Rsquared MAE
## 3.188488 0.6394563 2.292674
##
## Tuning parameter 'mtry' was held constant at a value of 2
My prediction for Kyoto is April 2nd 2024.
For Vancouver and New York City, we do not have a lot of historic bloom dates. I therefore try to predict these dates from the USA NPN dataset. To evaluate how good this works, I first do the same for Washington and compare the predicted with the actual dates.
dat <- read.csv("peak-bloom-prediction/data/USA-NPN_individual_phenometrics_data.csv")
dat$year <- dat$First_Yes_Year
dat$Species <- as.factor(dat$Species)
m_wash <- train(First_Yes_DOY ~ year + Latitude + Longitude +
Elevation_in_Meters + Species + Tmax_Winter, data = dat,
method = "lm", trControl = tc)
w_wash <- subset(weather, location == "Washington")
w_wash <- subset(w_wash, month(date) %in% c(12, 1, 2))
tmaxWinter_wash <- aggregate(tmax ~ yearWinter, w_wash, mean)
tmaxWinter_wash$tmax <- tmaxWinter_wash$tmax / 10
data_wash <- data.frame(year = unique(dat$year), Latitude = 38.8853,
Longitude = -77.0386, Elevation_in_Meters = 0,
Species = "yedoensis",
Tmax_Winter = tmaxWinter_wash[60:74, ]$tmax)
doy.pred <- predict(m_wash, newdata = data_wash)
doy.actual <- subset(cherry, year %in% unique(dat$year) & location == "washingtondc")$bloom_doy
doy <- data.frame(doy.pred = doy.pred, doy.actual = doy.actual,
year = unique(dat$year))
cor(doy.pred, doy.actual) # 0.40
## [1] 0.3954145
rmse(doy.pred, doy.actual) # 8.36
## [1] 8.357482
mae(doy.pred, doy.actual) # 7.36
## [1] 7.356402
plot(doy.pred ~ year, doy, ylim = c(80, 100), pch = 19, ylab =
"Bloom DOY (modeled / actual)", xlab = "Year", bty = "L",
main = "Modeled vs. actual bloom dates: Wahsington, D.C.")
points(doy.actual ~ year, doy, col = "red", pch = 19)
arrows(x0 = doy$year, x1 = doy$year, y0 = doy$doy.pred, y1 = doy$doy.actual,
length = 0, col = "red", lty = "dashed")
points(doy.pred ~ year, doy, col = "black", pch = 19)
legend("topright", legend = c("predicted", "actual"), title = "Washington D.C. Bloom DOY", bty = "n", col = c("black", "red"), pch = 19, cex = 0.6)
rf_old_Wash <- train(First_Yes_DOY ~ year + Latitude + Longitude +
Elevation_in_Meters + Species + Tmax_Winter,
data = dat,
method = "cforest",
tuneGrid = data.frame(.mtry = 5),
trControl = trainControl(method = "oob"))
# again, regression performs better than random forest.
The predictions are on average wrong by 7 days which isn’t very good.
However, I tried a few different things and nothing seemed to work
better, so I will go with it.
I can now estimate the historic bloom dates in Vancouver from the available data.
m_vanc <- lm(First_Yes_DOY ~ year + Latitude + Longitude +
Elevation_in_Meters + Species + Tmax_Winter, dat)
w_vanc <- subset(weather, location == "Vancouver")
w_vanc <- subset(w_vanc, month(date) %in% c(12, 1, 2))
tmaxWinter_vanc <- aggregate(tmax ~ yearWinter, w_vanc, mean)
tmaxWinter_vanc$tmax <- tmaxWinter_vanc$tmax / 10
data_vanc <- data.frame(year = unique(dat$year), Latitude = 49.2237,
Longitude = -123.1636, Elevation_in_Meters = 24,
Species = "yedoensis",
Tmax_Winter = tmaxWinter_vanc[35:49, ]$tmax)
doy.pred <- predict(m_vanc, newdata = data_vanc)
doy.pred
## 1 2 3 4 5 6 7 8
## 113.1107 112.4823 111.8566 111.2300 110.6037 109.9780 109.3497 108.7234
## 9 10 11 12 13 14 15
## 108.0989 107.4716 106.8449 106.2179 105.5914 104.9657 104.3390
data_vanc$bloom_doy <- doy.pred
# For the last two years, I do, in fact, have the true bloom dates.
# I thus change the last two doys to the actual ones.
data_vanc[14:15, ]$bloom_doy <- c(86, 97)
With these modeled bloom dates, I can now estimate the 2024 bloom date like I did for the other locations. But first, I have to estimate the Vancouver DRDs and calculate the past mean temperatures for January.
Vanc_2024 <- read.csv("peak-bloom-prediction/data/vancouver 2023-10-01 to 2024-02-29.csv", sep = ",", header = TRUE)
# check whether both data sources match:
wGHCN <- subset(weather, location == "Vancouver" &
date >= as.Date("2023-10-01") & date <=
as.Date("2023-10-31"))$tmax / 10
wVC <- subset(Vanc_2024, as.Date(datetime) >= as.Date("2023-10-01") & as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC, use = "complete.obs") # correlation is very high -> everything okay!
## [1] 0.9629359
Vanc_2024$doy <- yday(as.Date(Vanc_2024$datetime))
Vanc_2024$doy.n <- -91:60
Vanc_2024$tmax <- Vanc_2024$tempmax
mw_Vanc <- subset(aggregate(tmax ~ doy, subset(weather_Vanc, year >= 2013), mean),
doy %in% 61:152)
mw_Vanc$tmax <- mw_Vanc$tmax / 10
w24_Vanc <- Vanc_2024[, c("doy.n", "tmax")]
names(mw_Vanc) <- c("doy.n", "tmax")
w_Vanc <- rbind(w24_Vanc, mw_Vanc)
# fit quadratic regression with specified weights:
weights <- rep(c(1, 0.2), c(153, 91))
fit24_Vanc <- lm(tmax ~ doy.n + I(doy.n^2), w_Vanc, weights = weights)
coefs24_Vanc <- coef(fit24_Vanc)
doy.p <- -91: 153
fit.p_Vanc <- predict(fit24_Vanc, newdata = data.frame(doy.n = doy.p))
fun24_Vanc <- function(x){
coefs24_Vanc[1] + coefs24_Vanc[2]*x + coefs24_Vanc[3]*x^2
}
DRD24_Vanc <- as.numeric(optimize(fun24_Vanc, interval = c(1, 365),
maximum = FALSE)[1])
t.DRD24_Vanc <- w_Vanc[w_Vanc$doy.n == round(DRD24_Vanc), ]$tmax
tpred24_Vanc <- as.numeric(optimize(fun24_Vanc, interval = c(1, 365),
maximum = FALSE)[2])
with(subset(w_Vanc, doy.n <= 60),
plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)",
xlab = "Day of the Year", pch = 19, cex = 0.8,
xlim = c(-91, 150), bty = "L",
main = "Predicted 2023 DRD: Vancouver"))
with(subset(w_Vanc, doy.n > 60),
points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_Vanc ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_Vanc, col = "red")
text(DRD24_Vanc, 20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)
DRD_Vanc <- c(DRD_Vancouver, DRD24_Vanc)
t.pred_Vanc <- c(temp_Vanc$t.pred, tpred24_Vanc)
# Calculate mean January temperature
j24_Vanc <- mean(subset(Vanc_2024, as.Date(datetime) >= as.Date("2024-01-01") &
as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)
I now have all the data to predict the 2024 bloom date in Vancouver.
dat_Vanc <- merge(data_vanc, temp_Vanc, by = "year")
### Different Models
mYearV <- train(bloom_doy ~ year, data = dat_Vanc,
method = "lm", trControl = tc)
# summary(mYearV) # R^2 = 0.61
mDRDV <- train(bloom_doy ~ DRD, data = dat_Vanc,
method = "lm", trControl = tc)
# summary(mDRDV) # R^2 = 0.01
mtpredV <- train(bloom_doy ~ t.pred, data = dat_Vanc,
method = "lm", trControl = tc)
# summary(mtpredV) # R^2 = 0.08
mjanV <- train(bloom_doy ~ mean.jan, data = dat_Vanc,
method = "lm", trControl = tc)
# summary(mjanV) # R^2 < .01
mAllV <- train(bloom_doy ~ year + DRD * t.pred + mean.jan, data = dat_Vanc,
method = "lm", trControl = tc)
summary(mAllV) # R^2 = 0.63
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9958 -1.2376 0.4798 2.6473 4.7052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2738.7915 779.9940 3.511 0.00661 **
## year -1.2962 0.3814 -3.398 0.00789 **
## DRD -1.2371 2.5400 -0.487 0.63787
## t.pred -2.8939 5.8467 -0.495 0.63248
## mean.jan 0.5430 1.5139 0.359 0.72810
## `DRD:t.pred` 0.1582 0.3623 0.437 0.67263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.327 on 9 degrees of freedom
## Multiple R-squared: 0.6268, Adjusted R-squared: 0.4194
## F-statistic: 3.023 on 5 and 9 DF, p-value: 0.0712
# I will use the model with everything but year as predictors.
# check prediction accuracy:
mAllV
## Linear Regression
##
## 15 samples
## 4 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 14, 14, 14, 14, 14, 14, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6.946136 0.2909384 5.019634
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE = 6.95, R^2 = 0.29, MAE = 5.02
# of course, this needs to be interpreted with caution, since the DV (past blooom
# dates) are not the real ones. I am not sure why R^2 is so low, but the other
# values seem okay.
### Predict Bloom Date 2024
data24_Vanc <- data.frame(year = 2024, DRD = DRD24_Vanc, t.pred = tpred24_Vanc,
mean.jan = j24_Vanc)
predict(mYearV, newdata = data24_Vanc) # 97.24
## 1
## 97.23967
doy_to_date(2024, 97)
## [1] "2024-04-06"
My prediction for Vancouver is April 6th, 2024.
m_ny <- lm(First_Yes_DOY ~ year + Latitude + Longitude +
Elevation_in_Meters + Species + Tmax_Winter, dat)
w_ny <- subset(weather, location == "NewYork")
w_ny <- subset(w_ny, month(date) %in% c(12, 1, 2))
tmaxWinter_ny <- aggregate(tmax ~ yearWinter, w_ny, mean)
tmaxWinter_ny$tmax <- tmaxWinter_ny$tmax / 10
data_ny <- data.frame(year = unique(dat$year), Latitude = 40.73040,
Longitude = -73.99809, Elevation_in_Meters = 8.5,
Species = "yedoensis",
Tmax_Winter = tmaxWinter_ny[35:49, ]$tmax)
doy.pred <- predict(m_ny, newdata = data_ny)
doy.pred
## 1 2 3 4 5 6 7 8
## 102.64284 102.01767 101.39106 100.76176 100.13646 99.50985 98.88219 98.25755
## 9 10 11 12 13 14 15
## 97.63052 97.00433 96.37551 95.75077 95.12393 94.49763 93.87138
data_ny$bloom_doy <- doy.pred
With these modeled bloom dates, I can now estimate the 2024 bloom date like I did for the other locations. But first, I have to estimate the New York City DRDs and calculate the past mean temperatures for January.
NY_2024 <- read.csv("peak-bloom-prediction/data/new york city 2023-10-01 to 2024-02-29.csv", sep = ",", header = TRUE)
# check whether both data sources match:
wGHCN <- subset(weather, location == "NewYork" &
date >= as.Date("2023-10-01") & date <=
as.Date("2023-10-31"))$tmax / 10
wVC <- subset(NY_2024, as.Date(datetime) >= as.Date("2023-10-01") & as.Date(datetime) <= as.Date("2023-10-31"))$tempmax
cor(wGHCN, wVC, use = "complete.obs") # correlation is very high -> everything okay!
## [1] 0.9814348
NY_2024$doy <- yday(as.Date(NY_2024$datetime))
NY_2024$doy.n <- -91:60
NY_2024$tmax <- NY_2024$tempmax
mw_NY <- subset(aggregate(tmax ~ doy, subset(weather_NY, year >= 2013), mean),
doy %in% 61:152)
mw_NY$tmax <- mw_NY$tmax / 10
w24_NY <- NY_2024[, c("doy.n", "tmax")]
names(mw_NY) <- c("doy.n", "tmax")
w_NY <- rbind(w24_NY, mw_NY)
# fit quadratic regression with specified weights:
weights <- rep(c(1, 0.2), c(153, 91))
fit24_NY <- lm(tmax ~ doy.n + I(doy.n^2), w_NY, weights = weights)
coefs24_NY <- coef(fit24_NY)
doy.p <- -91: 153
fit.p_NY <- predict(fit24_NY, newdata = data.frame(doy.n = doy.p))
fun24_NY <- function(x){
coefs24_NY[1] + coefs24_NY[2]*x + coefs24_NY[3]*x^2
}
DRD24_NY <- as.numeric(optimize(fun24_NY, interval = c(1, 365),
maximum = FALSE)[1])
t.DRD24_NY <- w_NY[w_NY$doy.n == round(DRD24_NY), ]$tmax
tpred24_NY <- as.numeric(optimize(fun24_NY, interval = c(1, 365),
maximum = FALSE)[2])
with(subset(w_NY, doy.n <= 60),
plot(tmax ~ doy.n, ylab = "Maximum Temperature (°Celsius)",
xlab = "Day of the Year", pch = 19, cex = 0.8,
xlim = c(-91, 150), bty = "L",
main = "Predicted 2023 DRD: New York City"))
with(subset(w_NY, doy.n > 60),
points(tmax ~ doy.n, pch = 19, col = alpha("grey", alpha = 0.8)))
lines(fit.p_NY ~ doy.p, col = "forestgreen", lwd = 2)
abline(v = DRD24_NY, col = "red")
text(DRD24_NY, 20, "Dormacy Release Date", cex = 0.6, col = "red", pos = 4)
DRD_NY <- c(DRD_NewYork, DRD24_NY)
t.pred_NY <- c(temp_NY$t.pred, tpred24_NY)
# Calculate mean January temperature
j24_NY <- mean(subset(NY_2024, as.Date(datetime) >= as.Date("2024-01-01") &
as.Date(datetime) <= as.Date("2024-01-31"))$tempmax)
I now have all the data to predict the 2024 bloom date in New York City
dat_NY <- merge(data_ny, temp_NY, by = "year")
### Different Models
mYearN <- train(bloom_doy ~ year, data = dat_NY,
method = "lm", trControl = tc)
# summary(mYearV) # R^2 = 0.61
mDRDN <- train(bloom_doy ~ DRD, data = dat_NY,
method = "lm", trControl = tc)
# summary(mDRDV) # R^2 = 0.11
mtpredN <- train(bloom_doy ~ t.pred, data = dat_NY,
method = "lm", trControl = tc)
# summary(mtpredV) # R^2 = 0.13
mjanN <- train(bloom_doy ~ mean.jan, data = dat_NY,
method = "lm", trControl = tc)
# summary(mjanV) # R^2 = 0.27
mAllN <- train(bloom_doy ~ year + DRD * t.pred + mean.jan, data = dat_NY,
method = "lm", trControl = tc)
summary(mAllV) # R^2 = 0.63
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9958 -1.2376 0.4798 2.6473 4.7052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2738.7915 779.9940 3.511 0.00661 **
## year -1.2962 0.3814 -3.398 0.00789 **
## DRD -1.2371 2.5400 -0.487 0.63787
## t.pred -2.8939 5.8467 -0.495 0.63248
## mean.jan 0.5430 1.5139 0.359 0.72810
## `DRD:t.pred` 0.1582 0.3623 0.437 0.67263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.327 on 9 degrees of freedom
## Multiple R-squared: 0.6268, Adjusted R-squared: 0.4194
## F-statistic: 3.023 on 5 and 9 DF, p-value: 0.0712
# I will use the model with everything but year as predictors.
# check prediction accuracy:
mAllV
## Linear Regression
##
## 15 samples
## 4 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 14, 14, 14, 14, 14, 14, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6.946136 0.2909384 5.019634
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE = 6.95, R^2 = 0.29, MAE = 5.02
# of course, this needs to be interpreted with caution, since the DV (past blooom
# dates) are not the real ones.
### Predict Bloom Date 2024
data24_NY <- data.frame(year = 2024, DRD = DRD24_NY, t.pred = tpred24_NY,
mean.jan = j24_NY)
predict(mYearN, newdata = data24_NY) # 93.24
## 1
## 93.24417
doy_to_date(2024, 93)
## [1] "2024-04-02"
My prediction for New York City is April 2nd, 2024.
I know that this is probably too late for New York City (news articles on cherry blossom in NY say it is in March rather than April). However, without any past dates, I don’t know how to improve my prediction.
I now put all predictions into one csv document:
location <- c("washingtondc", "liestal", "kyoto", "vancouver", "newyorkcity")
prediction <- c(86, 91, 93, 97, 93)
prediction_df <- data.frame(location = location, prediction = prediction)
write_csv(prediction_df, file = "CherryBlossom.csv")
prediction_df
## location prediction
## 1 washingtondc 86
## 2 liestal 91
## 3 kyoto 93
## 4 vancouver 97
## 5 newyorkcity 93