IntroTVGEV.Rmd
This document was created with NSGEV 0.1.8.
The NSGEV package provides a class of R objects representing models for observations following a Time-Varying Generalized Extreme Value (GEV) distribution, usually block maxima, as described in Chapter 6 of the book by Coles (Coles 2001). This class of models is named `"TVGEV"`. These models are specific Non-Stationary GEV models: only slowly varying functions of time can be used as covariates. Such models can be fitted by using CRAN packages such as ismev (Heffernan and Stephenson 2018), evd (Stephenson 2002), extRemes (Gilleland and Katz 2016) and more. However, with these packages the particularities of the Time-Varying context are not taken into account, and it is quite difficult, for instance, to compute predictions. The `TVGEV` class and its methods have the following features.
- Return levels are easily computed for a given time.
- The inference on return levels can be obtained by profile-likelihood, using a dedicated algorithm.
- The return level specific to the time-varying framework of Parey et al. (Parey, Hoang, and Dacunha-Castelle 2010) can be computed, as well as profile-likelihood confidence intervals on it.
- Graphical diagnostics favour the ggplot system (Wickham 2016). The `autoplot` and `autolayer` S3 methods are consequently used instead of the `plot` and `lines` S3 methods of the classical graphics package.
The present vignette shows how models can be specified and fitted, and illustrates the use of the principal methods.
We begin by using the dataset `TXMax_Dijon` shipped with the package.
head(TXMax_Dijon)
## Year TXMax
## 1 1921 NA
## 2 1922 NA
## 3 1923 NA
## 4 1924 33.6
## 5 1925 34.2
## 6 1926 34.8
The data represent the annual maxima of the daily maximal temperature (`TX`) in Dijon (France). The time series is shown on the next plot (R code not shown). The data are provided by the European Climate Assessment and Dataset. The observations from 1980 on seem to be larger than those at the beginning of the series, motivating the use of a time-varying model.
As a possible model, a time-varying block maxima model relates the distribution of the response \(Y_b\) for the block (year) \(b\) to the corresponding time \(t_b\), assuming that the observations are independent across blocks. We will assume here that \(Y_b\) has a GEV distribution with constant scale \(\sigma\), a constant shape \(\xi\) and a time-varying location \(\mu(t) = \mu_0 + \mu_1 t\). This model involves a vector \(\boldsymbol{\psi} := [\mu_0, \,\mu_1, \, \sigma, \, \xi]\) of four unknown parameters.
We can fit this model by using several CRAN packages, for instance the extRemes package. With this package we need to exclude the missing values for the response
fite1 <- extRemes::fevd(x = TXMax,
                        data = subset(TXMax_Dijon, !is.na(TXMax)),
                        type = "GEV", location.fun = ~ Year)
Note that a classical formula is used to specify the model, as is done for linear models with `lm` or `glm`. By default, a formula implies the use of a constant. So the formula `~ Year` indeed implies the existence of the coefficient \(\mu_0\) in \(\mu(t) = \mu_0 + \mu_1 t\). We could equivalently have used an explicit constant with `~ 1 + Year`. On the contrary, a model with no constant, \(\mu(t) = \mu_1 t\), would have been specified by using the formula `~ Year - 1`. This convention is used by most model-fitting functions requiring a formula in the linear model fashion, and `TVGEV` is no exception.
By inspecting the content of the object or by reading the help, we can find the estimated values of the parameters
fite1$results$par
## mu0 mu1 scale shape
## 5.07814887 0.01414038 1.84726804 -0.19990410
fite1$results$value
## [1] 180.6135
We see that the estimated location increases by about \(100 \times \widehat{\mu}_1 =\) 1.41 Celsius per century, with an intercept value \(\widehat{\mu}_0 =\) 5.08 Celsius corresponding to the year \(0\). We also extracted the negative maximized log-likelihood stored in the element `results$value`.
To achieve nearly the same thing with the NSGEV package, we need a little extra work: adding a column with class `"Date"` to the data frame. Each observation of this date column represents the beginning of a block. We can simply paste a month and day indication `-mm-dd` to the year in order to get a POSIX format.
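The R code creating the data frame `df` is not shown; a minimal sketch consistent with the output below (the name `df` is taken from its later use) is:

```r
## copy the data and add a "Date" column marking the beginning of each
## block: paste "-01-01" to the year and convert with as.Date
df <- TXMax_Dijon
df$Date <- as.Date(paste0(df$Year, "-01-01"))
class(df$Date)
```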
## [1] "Date"
head(df$Date, n = 5)
## [1] "1921-01-01" "1922-01-01" "1923-01-01" "1924-01-01" "1925-01-01"
Note that we no longer have to remove the observations with a missing response. Now we can fit the same model
fitNLin <- TVGEV(data = df, response = "TXMax", date = "Date",
                 design = polynomX(date = Date, degree = 1),
                 loc = ~ t1)
coef(fitNLin)
## mu_0 mu_t1 sigma_0 xi_0
## 32.93752186 0.01527735 1.84567285 -0.20471258
logLik(fitNLin)
## [1] -180.6005
## attr(,"df")
## [1] 4
## attr(,"nobs")
## [1] 87
Note that we used a formula as before, but now the covariate is a function of the date which is computed by the design function, here `polynomX`. The design function essentially returns a numeric matrix, the columns of which can be used as covariates. So the names of the covariates depend on the design function used. This will be further explained later.

The estimated parameters are nearly the same as before, except for the first one, due to a different origin taken for the linear trend. Indeed, the time origin for R Dates is `"1970-01-01"`. The maximized log-likelihood is extracted by using the `logLik` method, which works for many classes of fitted models.
The “fitted model” object has class `"TVGEV"`, and a number of classical S3 methods with simple interpretation are implemented.
class(fitNLin)
## [1] "TVGEV"
methods(class = "TVGEV")
## [1] anova autoplot bs cdf cdfMaxFun
## [6] coef confint density logLik mean
## [11] modelMatrices moment plot predict print
## [16] profLik psi2theta quantile quantMax quantMaxFun
## [21] residuals simulate summary vcov
## see '?methods' for accessing help and source code
For example, we can use the classical `confint` method to display confidence intervals on the parameters.
confint(fitNLin)
## , , 95%
##
## L U
## mu_0 32.5048 33.3702
## mu_t1 -0.0005 0.0310
## sigma_0 1.5412 2.1502
## xi_0 -0.3489 -0.0605
The results (by default based on the delta method) show that \(0\) is within the \(95\%\) interval on the slope \(\mu_1\), though very close to its lower bound. The confidence interval on the shape \(\xi\) suggests that this parameter is negative.
As is generally the case with the S3 class/methods system of R, the methods should be used by invoking the generic function, e.g. `confint`. However, the help for the methods can be found by using the “dotted” name such as `?confint.TVGEV`. A method implemented for a specific class usually has specific arguments that are not found in the generic. For instance, the `confint` method for the class `"TVGEV"` has a specific `method` argument that allows choosing the inference method, with the three choices `"delta"` (default), `"boot"` and `"proflik"`.
Quite obviously, we can fit a model with a constant trend by using the relevant formula
fitNConst <- TVGEV(data = df, response = "TXMax", date = "Date",
                   design = polynomX(date = Date, degree = 1),
                   loc = ~ 1)
coef(fitNConst)
## mu_0 sigma_0 xi_0
## 32.94616 1.87935 -0.19645
Since the design function is not used when the three GEV parameters are constant, we can use `TVGEV` without its `design` argument, as sketched below.
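A minimal sketch (the object name `fitNConst2` is ours; the call reproduces the output below):

```r
## constant GEV model fitted without any design function
fitNConst2 <- TVGEV(data = df, response = "TXMax", date = "Date", loc = ~ 1)
coef(fitNConst2)
```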
## mu_0 sigma_0 xi_0
## 32.94616 1.87935 -0.19645
Note however that we still require that the response and the date are provided as columns of a data frame.
As a general rule, a time-varying model should always be compared to a simpler model with three constant GEV parameters. For that aim, we can perform a likelihood-ratio test, which is by convention attached to the `anova` method in R.
anova(fitNConst, fitNLin)
## Analysis of Deviance Table
##
## df deviance W Pr(>W)
## fitNConst 3 364.74
## fitNLin 4 361.20 3.5412 0.05986 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The \(p\)-value is \(> 0.05\), so if we assume that there could be a global linear trend on the location, we would not reject the null hypothesis \(H_0: \mu_1 = 0\) at the \(5\%\) level.
We can investigate different forms of trend, using different designs.
Now let us have a look at the design function `polynomX` which was used before. We can call it using the date column of the data frame `df`, passed to the formal argument `date` of the function.
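The call is not shown; it is presumably of the form:

```r
## evaluate the design basis on the dates of the data
X <- polynomX(date = df$Date, degree = 1)
class(X)
```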
## [1] "bts" "matrix"
head(X)
## Block time series
##
## o First block: 1921-01-01
## o Last block: 2016-01-01
##
## date Cst t1
## 1921-01-01 1.00000 -48.99932
## 1922-01-01 1.00000 -48.00000
## 1923-01-01 1.00000 -47.00068
## 1924-01-01 1.00000 -46.00137
## 1925-01-01 1.00000 -44.99932
## 1926-01-01 1.00000 -44.00000
The result is an object with class `"bts"`, inheriting from the class `"matrix"`, with two columns. We guess that the column named `Cst` is a column of ones, as commonly used in linear regression, and that the column `t1` is a linear trend in years with its origin set 49 years after 1921-01-01, so near 1970-01-01. Note that a `TVGEV` model does not need to use all the basis functions of the design: we did not use the constant `Cst`, because a constant basis function is generated by default when the formula is parsed.
The class `"bts"` describes objects representing block time series. It is quite similar to the `"mts"` class used by the forecast package. A few methods are implemented for this class, e.g. the `autoplot` method. To illustrate this method, we consider a different design function involving broken-line splines, namely the function `breaksX`, as sketched below.
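A sketch (the break dates here are illustrative, not from the original chunk):

```r
## a broken-line design with two illustrative break dates, shown with one
## facet per basis function
X2 <- breaksX(date = df$Date, breaks = c("1950-01-01", "1985-01-01"), degree = 1)
autoplot(X2, facets = TRUE)
```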
The basis functions are now splines of degree \(1\). The first function, named `t1`, is a linear function and, for each “break” date given in `breaks`, a broken-line function is added to the basis.
There are other design functions. These are consistently used in `TVGEV` as follows.
- When the model is fitted, the data frame specified with the `data` formal argument must contain a column giving the date of the beginning of each block. The column name is specified by using the `date` formal argument of `TVGEV`.
- The date column is passed to a design function given in the `design` formal argument. The value of `design` is a call to the chosen function, using the chosen column of `data` as the `date` of the design function.
- The block maxima model is specified through formulas linking the GEV parameters to the columns of the design matrix.
- The fitted model object keeps track of the design function used and therefore can use it on a “new” period of time to make a prediction.
We can use the `breaksX` design function to specify a continuous broken-line form for the location \(\mu(t)\).
fitNBreak <- TVGEV(data = df, response = "TXMax", date = "Date",
                   design = breaksX(date = Date, breaks = "1970-01-01", degree = 1),
                   loc = ~ t1 + t1_1970)
coef(fitNBreak)
## mu_0 mu_t1 mu_t1_1970 sigma_0 xi_0
## 32.06638460 -0.02392656 0.07728411 1.75541862 -0.18112018
The location parameter now involves three parameters according to
\[ \mu(t) = \mu_0 + \mu_1 t + \mu_2 [t - t_0]_+ \]
where \(x_+ := \max(0,\, x)\) is the “positive part” function and where \(t_0\) corresponds to the break point in time. Note that the slope of the location \(\mu(t)\) is given by \(\mu_1\) before the break point, and by \(\mu_1 + \mu_2\) after that point. This corresponds to the two estimated slopes in Celsius per century: \(100 \times \widehat{\mu}_1 =\) -2.4 (before the break) and \(100 \times [\widehat{\mu}_1 + \widehat{\mu}_2] =\) 5.3 (after). Since \(\mu_2\) represents the change of slope, we could decide whether the change of slope is significant or not by querying whether a confidence interval on \(\mu_2\) contains \(0\) or not. Another solution based on `anova` is described later.
To assess the change in the distribution of \(Y_b\), many graphical diagnostics can be produced with NSGEV. For instance we can use the `quantile` method, which produces a `bts` block time series that can be autoplotted, as sketched below.
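A sketch, assuming the standard `probs` argument of the `quantile` generic (see `?quantile.TVGEV`):

```r
## quantiles of the fitted time-varying GEV distributions, as a 'bts'
## object; probs = 1.0 gives the estimated upper end-point since xi-hat < 0
q <- quantile(fitNBreak, probs = c(0.90, 0.99, 1.00))
autoplot(q)
```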
Since the estimated value of the shape parameter \(\xi\) is negative, all the GEV distributions for the observations \(Y_b\) are estimated as having a finite upper end-point \(\omega := \mu - \sigma / \xi\). We can find this value by setting the probability to \(1.0\) in the quantile function. Recall that we use the estimated distribution, so the shown upper end-point is the estimate \(\widehat{\omega}_b\). We could infer on the unknown value \(\omega_b\) by using confidence intervals. The value of \(\omega_b\) is the return level (conditional on \(t_b\)) corresponding to an infinite return period \(T=\infty\).
The two models corresponding to the objects `fitNLin` and `fitNBreak` are nested: by setting \(\mu_2 = 0\) in the second model, we retrieve the first one. So we can perform a likelihood-ratio test as above.
anova(fitNLin, fitNBreak)
## Analysis of Deviance Table
##
## df deviance W Pr(>W)
## fitNLin 4 361.2
## fitNBreak 5 354.4 6.7982 0.009125 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see from the very small \(p\)-value labelled `Pr(>W)` that the null hypothesis \(\mu_2 = 0\) must be rejected in favour of a break at the chosen break time, as used in `fitNBreak`.
A limitation of the `fitNBreak` model is that the break time cannot be estimated simply. The same problem arises for the Gaussian regression known as kink regression. So we can try another basis of slowly varying functions which is flexible enough to describe a smooth transition between two linear trends with different slopes.
As a third example of design, we will use the natural spline design as implemented by `natSplineX`. With this design, we need to give two boundary knots, as well as a number of knots between the boundary knots. Along with its flexibility, an interesting property of this basis is that the extrapolation outside of the boundary knots is “only” linear. By using polynomials of degree \(2\), we would get a quadratic trend which becomes unrealistic even for a small amount of extrapolation.
X1 <- natSplineX(date = df$Date, knots = "1970-01-01",
                 boundaryKnots = c("1921-01-01", "2017-01-01"))
autoplot(X1, facets = TRUE)
The basis functions are linearly independent, but they span a linear space which contains the constant \(1\); in other words, if we add a column of ones to `unclass(X1)` we get a matrix with four columns but with rank \(3\). So we can either use a constant and two of the basis functions, or use the three basis functions but then remove the default constant as follows.
fitNNS <- TVGEV(data = df, response = "TXMax", date = "Date",
                design = natSplineX(date = Date, knots = "1970-01-01",
                                    boundaryKnots = c("1920-01-01", "2017-01-01")),
                loc = ~ ns1 + ns2 + ns3 - 1)
As before, the models corresponding to `fitNLin` and `fitNNS` are nested because the basis functions used in `fitNLin` (the implicit constant \(1\) and the linear function `t1`) are both in the linear space generated by the natural spline basis. So we can use the `anova` method.
anova(fitNLin, fitNNS)
## Analysis of Deviance Table
##
## df deviance W Pr(>W)
## fitNLin 4 361.20
## fitNNS 5 354.14 7.0651 0.00786 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again we see that `fitNNS` fits the data significantly better than `fitNLin` does. We cannot compare the two models `fitNBreak` and `fitNNS` by using `anova` because they are not nested. However, since these two models have the same number of parameters, using AIC or BIC would give a slight advantage to `fitNNS`, which has a slightly smaller deviance, hence a slightly larger log-likelihood; see the comparison below.
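The chunk producing the output below is not shown; the printed values match the BIC (deviance plus \(k \log n\)), so it was presumably of the form:

```r
## BIC comparison of the two non-nested models having the same number of
## parameters
c("Breaks" = BIC(fitNBreak), "Natural Spline" = BIC(fitNNS))
```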
## Breaks Natural Spline
## 376.7323 376.4654
A possible source of problems is that the design matrices created by the design functions can take large values. Indeed, a date is converted into a number of days from a given origin; for a long series we will get basis functions with values of several thousands, or even much more when polynomials with degree \(>1\) are used. In order to facilitate the task of maximizing the likelihood, it is worth using the `origin` formal argument of the provided design functions, setting it to a date close to the mean date of the series, as sketched below.
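A sketch (the `origin` formal argument is described in the design functions' help pages):

```r
## a quadratic design whose time variable is centred near the middle of
## the series, which keeps the basis function values reasonably small
Xq <- polynomX(date = df$Date, degree = 2, origin = "1970-01-01")
```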
With time-varying models or non-stationary models, the concept of return level becomes quite messy, because many reasonable definitions can be given. The simplest return levels are the conditional ones, obtained by fixing a time \(t\) and by considering the quantiles of the fitted distribution for that specific time.
pBreak <- predict(fitNBreak)
## Since 'object' is really time-varying, the Return Levels
## depend on the date. A default choice of dates is made here.
## Use the 'newdate' formal to change this.
autoplot(pBreak)
As suggested by the message, three “round” values are chosen by default for the time \(t\), and the return level plots are shown for each of these in a specific facet of the plot. Note that the axis could have used a probability of exceedance on a yearly basis rather than a period in years. We can make a different choice by using the `newdate` optional argument of `predict`, see `?predict.TVGEV`.
Each of the return levels is computed with a confidence interval, with a default level set to \(0.95\). By default, the delta method is used, resulting in symmetric confidence intervals. Two alternative methods can be used by setting the optional argument `confintMethod`: profile-likelihood and bootstrap. With the profile-likelihood method, a constrained optimization is repeatedly used and convergence diagnostics are printed by default. To save space we remove these by setting the `trace` argument.
pBreakPL <- predict(fitNBreak, confintMethod = "proflik", trace = 0)
## Since 'object' is really time-varying, the Return Levels
## depend on the date. A default choice of dates is made here.
## Use the 'newdate' formal to change this.
plot(pBreakPL)
## Warning in plot.predict.TVGEV(pBreakPL): Inasmuch the plot is actually a
## 'ggplot', it is better to use the 'autoplot' method for consistency
Unconditional return levels can also be defined in the time-varying framework. These are obtained by aggregating over several times corresponding to a prediction period. Many definitions are possible; a natural requirement is that the usual definition continues to hold when a stationary model is considered as time-varying.
Following Parey et al. (Parey, Hoang, and Dacunha-Castelle 2010), we can define the return level corresponding to \(T\) years as the level \(\rho\) such that the number \(N\) of exceedances over \(\rho\) in the \(T\) “next years” has unit expectation \(\mathbb{E}(N) = 1\). The notion of “next” year is relative to a time origin that can be chosen.
pUCBreak <- predictUncond(fitNBreak, newdateFrom = "2020-01-01",
                          confintMethod = "proflik", trace = 0)
plot(pUCBreak)
## Warning in plot.predict.TVGEV(pUCBreak): Inasmuch the plot is actually a
## 'ggplot', it is better to use the 'autoplot' method for consistency
kable(pUCBreak, digits = 2)
| Date       | Period | Level | Quant | L     | U     |
|------------|--------|-------|-------|-------|-------|
| 2020-01-01 | 5      | 95%   | 37.15 | 36.13 | 38.21 |
| 2020-01-01 | 10     | 95%   | 38.23 | 37.15 | 39.50 |
| 2020-01-01 | 20     | 95%   | 39.31 | 37.99 | 40.93 |
| 2020-01-01 | 30     | 95%   | 40.04 | 38.49 | 42.00 |
| 2020-01-01 | 40     | 95%   | 40.65 | 38.77 | 42.87 |
| 2020-01-01 | 50     | 95%   | 41.22 | 39.10 | 43.71 |
| 2020-01-01 | 70     | 95%   | 42.30 | 39.64 | 45.34 |
| 2020-01-01 | 100    | 95%   | 43.90 | 40.30 | 47.85 |
For instance the return level corresponding to \(T=30\) years for the origin `"2020-01-01"` should be exceeded once (on average) during the thirty years 2021 to 2050. This level is estimated here to be 40.04 Celsius.
Note that an alternative definition of the return level can be implemented as a function \(\rho(\boldsymbol{\psi})\) of the parameter vector \(\boldsymbol{\psi}\) of the model. Under quite general conditions, such a return level can be regarded as a parameter of the model in a re-parameterization of it, hence we can infer on it as we do for a parameter. To achieve this in practice, the return level must be implemented as an R function with signature `function(object, psi)` where `object` has class `"TVGEV"` and `psi` stands for the parameter vector. The `profLik` method can then be used to infer on the return level using the profile-likelihood method, see `?profLik.TVGEV`. This method can work for any smooth function having the previous signature, as sketched below.
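A minimal sketch of such a function, using `psi2theta` as shown later in this vignette; the GEV quantile is written out by hand (assuming \(\xi \neq 0\)), and the argument name `fun` in the final call is hypothetical:

```r
## conditional 100-year return level for the last block, as a function of
## the parameter vector 'psi'
rho100 <- function(object, psi) {
    theta <- psi2theta(psi = psi, model = object)  # one row of GEV parameters per block
    mu <- theta[nrow(theta), 1]
    sigma <- theta[nrow(theta), 2]
    xi <- theta[nrow(theta), 3]
    mu + sigma * ((-log(1 - 1 / 100))^(-xi) - 1) / xi  # GEV quantile, xi != 0
}
## hypothetical call: profLik(fitNBreak, fun = rho100)
```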
Rather than trying to generalize the definition of a return level to the time-varying framework, one can simply consider the random maximum \(M^\star := \max_{b} Y_{b}\) where the maximum is over a set \(\mathcal{B}^\star\) of \(n^\star\) blocks. The most common case is when \(\mathcal{B}^\star\) is a future period \(b = b_0^\star + 1\), \(\dots\), \(b_0^\star + n^\star\), such as the lifetime of an equipment, and one wants to assess a risk related to a possible exceedance over some high level during this period. Since the blocks are assumed to be independent, we have
\[ F_{M^\star}(m^\star;\,\boldsymbol{\psi}) = \prod_{b \in \mathcal{B}^\star} F_{\texttt{GEV}}(m^\star;\,\boldsymbol{\theta}_{b}) \qquad \text{for all } m^\star, \]
where the dependence on the parameter \(\boldsymbol{\psi}\) is through \(\boldsymbol{\theta}_b(\boldsymbol{\psi})\). Note that in general \(M^\star\) does not follow a GEV distribution.
The corresponding quantile function \(q_{M^\star}(p)\) can be computed numerically, either by solving \(F_{M^\star}(q) = p\) or by interpolating from pairs \([p_i,\,q_i]\) of probability and quantile values. For a given probability \(p\), the derivative of the corresponding quantile w.r.t. \(\boldsymbol{\psi}\) can be obtained by the implicit function theorem and can be used to infer on the quantile.
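A minimal numerical sketch of the first approach (not the package's implementation): `theta` is assumed to be a matrix of GEV parameters with columns \(\mu\), \(\sigma\), \(\xi\), one row per block of \(\mathcal{B}^\star\), with \(\xi \neq 0\), and the GEV cdf is written out by hand:

```r
## quantile of the maximum M* over a period: solve F_{M*}(q) = p, the cdf
## being the product of the per-block GEV cdfs
qMaxNum <- function(p, theta, interval = c(20, 60)) {  # bounds suited to Celsius here
    FMaxMinusP <- function(q) {
        z <- pmax(1 + theta[ , 3] * (q - theta[ , 1]) / theta[ , 2], 0)
        prod(exp(-z^(-1 / theta[ , 3]))) - p
    }
    uniroot(FMaxMinusP, interval = interval)$root
}
```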
In NSGEV, the `quantMax` method can be used to produce a table of values of the quantile function from a `TVGEV` object. Such a table is given the S3 class `"quantMax.TVGEV"`, for which an `autoplot` method has been implemented.
newDate <- as.Date(sprintf("%4d-01-01", 2025:2054))
qM <- quantMax(fitNBreak, date = newDate, level = c(0.70, 0.95))
gqM <- autoplot(qM) +
    ggtitle("Maximum on years 2025-2054 with delta confidence intervals")
print(gqM)
The plot shows the quantile \(q_{M^\star}(p)\) against the probability of exceedance \(1 - p\), using a Gumbel scale. So if \(M^\star\) happens to be Gumbel (as can happen in the stationary case), the quantile line will be straight. When the period of interest \(\mathcal{B}^\star\) is extended, the maximum becomes stochastically larger. For instance, with a fixed beginning, if the duration \(n^\star\) of the period is increased then the maximum \(M^\star\) becomes stochastically larger: all quantiles are increased.
It is worth noting that as far as a small probability of exceedance \(\bar{p} := 1 -p\) is concerned, the expected value of the random number \(N\) of exceedances over the corresponding quantile is \(\mathbb{E}[N] \approx \bar{p}\). This approximation is acceptable when \(\bar{p} < 0.05\).
Note: at this time, only confidence intervals based on the delta method can be computed.
The `simulate` method is implemented in many R packages, with the aim of simulating new data from a given model. It has been implemented for the `TVGEV` class. A fitted model object can be used to simulate new paths, either for the fitting time period or for a new one.
We can use a different period to better understand how the model anticipates new data.
nd <- seq(from = as.Date("2020-01-01"), length = 10, by = "years")
autoplot(simulate(fitNBreak, nsim = 10, newdate = nd))
The simulated paths are here drawn in colour with a legend, because their number is small enough. Note that `simulate` is functionally close to a prediction: the mean of a large number of simulated paths is close to the expectation used in a prediction, and the uncertainty of the prediction can be assessed by using the dispersion of the simulated paths.
The generalized residuals are random variables \(e_b\) related to the observations \(y_b\), so they form a block time series. The distribution of \(e_b\) should be nearly the same for all blocks \(b\), and the dependence between the blocks should be weak. Several definitions of the residuals are possible, in relation with their target distribution. A simple choice is \(e_b := F_{\text{GEV}}(y_b;\, \widehat{\boldsymbol{\theta}}_b)\), which should lead to a nearly uniform distribution on the interval \((0,\,1)\) (Panagoulia, Economou, and Caroni 2014). The inverse transformation is \(y_b = q_{\text{GEV}}(e_b;\, \widehat{\boldsymbol{\theta}}_b)\). When the GEV marginal distributions have a nearly exponential tail (\(\xi \approx 0\)), using the standard exponential distribution for the generalized residuals can be preferred, because this uses a milder transformation of the data. This is the default choice, used in the sketch below.
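A sketch (the `residuals` method is listed above; its result is a block time series):

```r
## generalized residuals, with the standard exponential target by default
e <- residuals(fitNBreak)
autoplot(e)
```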
The horizontal lines show an interval with approximate probability \(95\%\). The use of these residuals is similar to that in the usual linear regression framework where the distribution of the standardized residuals is close to the standard normal.
As shown before, the parameters of the model can be extracted by using the `coef` method. Using a non-standard terminology which seems specific to the package, we can say that these are the “psi” parameters. But we can be interested as well in the GEV parameters \(\mu\), \(\sigma\) and \(\xi\), which we can call “theta”: \(\boldsymbol{\theta} = [\mu,\, \sigma,\, \xi]\). Since these parameters depend on the time \(t\), we can consider them as forming a block time series. We can “extract” these parameters by using the optional formal argument `type` of the `coef` method; this leads to a `bts` object that can be autoplotted, as sketched below. Recall that the physical dimensions of the parameters differ (\(\mu\) and \(\sigma\) have the dimension of the \(Y_b\) while \(\xi\) is dimensionless), so we had better show each series in a facet with its own \(y\) axis.
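A sketch (the value `"theta"` for the `type` formal is our guess; see `?coef.TVGEV`):

```r
## GEV parameters mu(t), sigma, xi as a block time series, one facet per
## parameter since their physical dimensions differ
theta <- coef(fitNBreak, type = "theta")
autoplot(theta, facets = TRUE)
```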
Several methods can be used to assess the marginal distribution of \(Y_b\) for a specific time \(t_b\), or at a collection of times given by a `date` argument. With the default `date = NULL`, the methods `quantile`, `mean` and `moment` for the class `"TVGEV"` return time series objects that can be “autoplotted” to plot the quantiles or the mean (expectation) against the date, as was shown before. The help page with examples can be displayed in the usual way, e.g. with `?quantile.TVGEV`.
The `density` and `cdf` methods both return a functional time series: a collection of functions depending on the date. These can be autoplotted as well.
If needed, a time series with class `"bts"` can be coerced into a standard time series inheriting from `"ts"`, for which many methods have been written in different packages.
A large number \(R\) of response vectors \(\mathbf{y}^{[r]}\) are simulated or “re-sampled” corresponding to \(r=1\), \(\dots\), \(R\). These simulated values can be stored into a matrix with \(B\) rows and \(R\) columns. For each vector \(\mathbf{y}^{[r]}\), the MLE \({\widehat{\boldsymbol{\psi}}}^{[r]}\) of the “psi” parameter vector \(\boldsymbol{\psi}\) is computed. The sample quantiles of the \({\widehat{\boldsymbol{\psi}}}_i^{[r]}\) can then be used to provide a confidence interval on \(\psi_i\).
The `bs` method takes a `TVGEV` object as its first argument and computes the bootstrapped “psi” parameters \(\widehat{\boldsymbol{\psi}}^{[r]}\). These bootstrapped parameters are stored as the rows of the `estimate` element of the returned list.
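The call creating `myBoot` is not shown; it is presumably of the form:

```r
## bootstrap the "psi" parameters, with the (small) default number of
## replicates R
myBoot <- bs(fitNBreak)
names(myBoot)
```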
## [1] "estimate" "negLogLik" "R" "optim" "type"
head(myBoot$estimate, n = 3)
## mu_0 mu_t1 mu_t1_1970 sigma_0 xi_0
## [1,] 31.53458 -0.04764366 0.10654700 1.550017 -0.1142792
## [2,] 32.48720 -0.01944965 0.06659356 1.686552 -0.2510816
## [3,] 31.92257 -0.04299220 0.11209443 1.641668 -0.1688012
Note that the default number of replicates `R` is very small and has been set for a purely illustrative use. A well-accepted choice is \(R \geqslant 3000\).
It is easy to derive from the bootstrapped “psi” parameters the corresponding block time series of “thetas” \(\widehat{\boldsymbol{\theta}}_b^{[r]}\). Broadly speaking, we have to apply the function `psi2theta` to each row of the matrix returned as the `estimate` element of the result.
mu <- apply(myBoot$estimate, MARGIN = 1,
            FUN = function(x, model) psi2theta(psi = x, model = model)[ , 1],
            model = fitNBreak)
autoplot(bts(mu, date = fitNBreak$fDate)) + ggtitle("bootstrapped values for mu")
Once the bootstrapped values for the “psi” parameters have been computed, they can be used later to infer on the return levels at a small computing cost. For that aim, a new element named `boot` must be added to the fitted model.
fitNBreak$boot <- myBoot
Note however that this is a kind of “hack”. Some care is needed to ensure that the new `boot` element is consistent with the right fitted model using the right data.
Since most modern computers use multicore processors, a parallel computation can be used to achieve substantial savings in computing time. See the example in the help page of `boot.ts` to see how this can be achieved.
Two methods can be used to “resample” \(\mathbf{y}\) (Panagoulia, Economou, and Caroni 2014):

- Parametric. For \(r=1\), \(\dots\), \(R\), the value \(y_b^{[r]}\) is drawn as \(y_b^{[r]} \sim \text{GEV}(\widehat{\boldsymbol{\theta}}_b)\), where \(\widehat{\boldsymbol{\theta}}_b\) is the MLE of the vector of the three GEV parameters corresponding to the block \(b\), which is a function of \(\widehat{\boldsymbol{\psi}}\).
- Non-parametric. For \(r=1\), \(\dots\), \(R\), the generalized residuals \(e_b\) are resampled to form a vector \(\mathbf{e}^{[r]}\) with the wanted size \(B\). Each residual \(e_b^{[r]}\) is back-transformed to a response \(y_b^{[r]} := q_{\text{GEV}}(e_b^{[r]};\, \widehat{\boldsymbol{\theta}}_b)\).

The second method is also known under the names fixed-\(t\) resampling and resampling from the residuals.
The bootstrapped \(\widehat{\boldsymbol{\psi}}^{[r]}\) can also be used to infer on the return levels, either conditional or unconditional. For example, the conditional return level \(\rho_b(T)\) corresponding to the period of \(T\) blocks and to the given time \(t_b\) can be inferred by using the sample quantiles of the bootstrapped return levels, which are computed as \(q_{\text{GEV}}(p; \,\widehat{\boldsymbol{\theta}}_b^{[r]})\) with \(p := 1 - 1/T\), as sketched below.
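A minimal sketch following the `psi2theta` usage above, with the GEV quantile written out by hand (assuming \(\xi \neq 0\)):

```r
## bootstrapped conditional 100-year return levels, one column per
## replicate, then a 95% percentile interval for each block
p <- 1 - 1 / 100
rho <- apply(myBoot$estimate, MARGIN = 1, FUN = function(psi) {
    theta <- psi2theta(psi = psi, model = fitNBreak)
    theta[ , 1] + theta[ , 2] * ((-log(p))^(-theta[ , 3]) - 1) / theta[ , 3]
})
rhoCI <- apply(rho, MARGIN = 1, FUN = quantile, probs = c(0.025, 0.975))
```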
The `"TVGEV"` S3 class and its dedicated methods allow a simple use of time-varying GEV models, including prediction. Some functionality of the package, such as the simulation, is not found in more classical packages, to the best of our knowledge. However:
1. Only known functions of time \(x(t)\) can be used as covariates.
2. Each of the GEV parameters can depend on the time covariates only through a linear function of these.
3. The parameters \(\psi_i\) cannot be shared across GEV parameters.
To illustrate 1., we cannot use a binary indicator variable for a site in order to build a regional model. To illustrate 2., the GEV location \(\mu\) can be \(\psi_1 x_1(t) + \psi_2 x_2(t)\) but cannot have the form \(x(t)^\psi\). To illustrate 3., we cannot have both \(\mu(t) = \psi_0 + \psi_1 x(t)\) and \(\sigma(t) = \psi_3 \mu(t)\), because the parameters \(\psi_0\) and \(\psi_1\) would then be used both by the location \(\mu\) and the scale \(\sigma\).
In practice, the time will be represented by a date object, and the POSIX format `"yyyy-mm-dd"`, in R `"%Y-%m-%d"`, will be favored. The R base package has a `"Date"` class which is used by NSGEV.
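The chunk producing the output below is not shown; presumably:

```r
myDate <- as.Date("2020-01-01")
class(myDate)
```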
## [1] "Date"
myDate
## [1] "2020-01-01"
format(myDate, "%d %B %Y")
## [1] "01 January 2020"
myDate <- seq(from = as.Date("1980-01-01"), by = "years", length = 40)
x <- runif(length(myDate))
plot(myDate, x, type = "o", xlab = "")
The lubridate package (Grolemund and Wickham 2011) has functions that can help to cope with dates.
You can add new design functions. A design function:
- Must have its first argument named `date`. This argument is intended to be passed either as an object with class `"Date"` or as an object which will be coerced automatically to this class by using `as.Date`. For instance, a character vector in POSIX format should work.
- Must return a matrix with the basis functions as its columns. It must also have an attribute named `"date"` containing the date vector. This object should be created either by using the `bts` function (the creator for the class) or by simply giving the class `c("bts", "matrix")` to a matrix with the suitable attribute.
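As an illustration, here is a minimal sketch of a custom design function following these two rules; the function and column names are ours, not from the package:

```r
## one basis function: (signed) square root of the time in years counted
## from a fixed origin
sqrtX <- function(date, origin = "1970-01-01") {
    date <- as.Date(date)
    tYear <- as.numeric(date - as.Date(origin)) / 365.25
    X <- cbind(sqrt1 = sign(tYear) * sqrt(abs(tYear)))
    attr(X, "date") <- date            # required "date" attribute
    class(X) <- c("bts", "matrix")     # give the "bts" class directly
    X
}
X4 <- sqrtX(date = df$Date)
```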
Let us consider again the broken line trend discussed above, namely
\[ \mu(t) = \mu_0 + \mu_1 t + \mu_2 [t - t_0]_+. \]
As mentioned before, the date \(t_0\) of a change of slope can be considered as an unknown parameter, but the model then no longer has the form required for `TVGEV` objects, so it can no longer be estimated by directly using the `TVGEV` function. Yet we can find the value of \(t_0\) by maximizing the log-likelihood over a vector of candidate values: estimating a `TVGEV` model for each value will provide the corresponding log-likelihoods. For a given year, we can create a suitably formatted date with the `sprintf` function as follows.
yearBreaks <- c(1940, 1950, 1955, 1960:2000, 2005, 2010)
res <- list()
for (ib in seq_along(yearBreaks)) {
    d <- sprintf("%4d-01-01", yearBreaks[[ib]])
    floc <- as.formula(sprintf("~ t1 + t1_%4d", yearBreaks[[ib]]))
    res[[d]] <- TVGEV(data = df, response = "TXMax", date = "Date",
                      design = breaksX(date = Date, breaks = d, degree = 1),
                      loc = floc)
}
We now have a list `res` of `length(yearBreaks)` objects, and the log-likelihood of each object can be extracted by using `logLik`. So we can plot the log-likelihood against the year and find the “best” object, with maximal log-likelihood.
ll <- sapply(res, logLik)
plot(yearBreaks, ll, type = "o", pch = 21, col = "orangered",
lwd = 2, bg = "gold", xlab = "break", ylab = "log-lik")
grid()
iMax <- which.max(ll)
abline(v = yearBreaks[iMax])
abline(h = ll[iMax] - c(0, qchisq(0.95, df = 1) /2),
col = "SpringGreen3", lwd = 2)
Although the points show a profiled log-likelihood, the regularity conditions that must hold in ML estimation are not fulfilled, because the log-likelihood is not differentiable w.r.t. \(t_0\). As a consequence, the usual level (shown as the lower horizontal line in green here) does not provide a confidence interval on the unknown \(t_0\).
As an experimental feature, inequality constraints on the parameters \(\psi_i\) can be imposed. The constraints are simple “box” constraints of the form \(\psi^{\text{L}}_i \leqslant \psi_i \leqslant \psi^{\text{U}}_i\), where \(\psi^{\text{L}}_i\) and \(\psi^{\text{U}}_i\) are given. These constraints can only be used when the likelihood is maximized with the nloptr package. If needed, they can be given for some parameters only, leaving the other bounds fixed at their default values \(\psi^{\text{L}}_i = -\infty\) and \(\psi^{\text{U}}_i = \infty\).

In practice, the constraints are given by using a named vector, and hence the names of the parameters must be known. These names can be retrieved if needed by first using `estim = "none"`.
fitNBreak2 <-
    TVGEV(data = df, response = "TXMax", date = "Date",
          design = breaksX(date = Date, breaks = "1970-01-01", degree = 1),
          loc = ~ t1 + t1_1970,
          estim = "nloptr",
          coefLower = c("xi_0" = -0.15), coefUpper = c("xi_0" = 0.0))
## Warning in MLE.TVGEV(object = tv, y = NULL, psi0 = psi0, estim = estim, : some
## estimated parameters at the bounds, inference results are misleading
coef(fitNBreak2)
## mu_0 mu_t1 mu_t1_1970 sigma_0 xi_0
## 32.02318421 -0.02420595 0.07812270 1.72977933 -0.15000000
The GEV shape parameter \(\xi\) is usually taken as constant, not time-varying. Since the log-likelihood is unbounded when \(\xi < -1\), a lower bound \(\psi^{\text{L}}_i > -1\) should normally be set for this parameter. In practice this is usually not needed, because for numerical reasons the optimization will converge to a local maximum with \(\xi > -1\) and a finite log-likelihood rather than to a parameter vector with an infinite maximal log-likelihood.
The constraints can be used to set a parameter to a given value \(\psi_i^\star\) by taking \(\psi^{\text{L}}_i := \psi_i^\star\) and \(\psi^{\text{U}}_i := \psi_i^\star\).
fitNBreak3 <-
    TVGEV(data = df, response = "TXMax", date = "Date",
          design = breaksX(date = Date, breaks = "1970-01-01", degree = 1),
          loc = ~ t1 + t1_1970,
          estim = "nloptr",
          coefLower = c("mu_t1" = -0.02, "xi_0" = -0.2),
          coefUpper = c("mu_t1" = -0.02, "xi_0" = 0.1))
## Warning in MLE.TVGEV(object = tv, y = NULL, psi0 = psi0, estim = estim, : some
## estimated parameters at the bounds, inference results are misleading
coef(fitNBreak3)
## mu_0 mu_t1 mu_t1_1970 sigma_0 xi_0
## 32.13621050 -0.02000000 0.07108899 1.75700684 -0.18234743