Tuesday, September 15, 2015

Trying out ExtRemes R package : Part 2 - Threshold Approach

I am using  http://www.ral.ucar.edu/~ericg/extRemes/extRemes2.pdf to try out "extRemes" R package.
Below is my notes and thoughts.
On my desk is  a copy of
Coles, S. (2001). An introduction to statistical modeling of extreme values / Stuart Coles, London ; New York : Springer, [2001].

Hurricane Data from "extRemes" package 
Estimated economic damage caused by U.S. hurricanes from 1926 to 1995

Note the cost are normalised.


data("damage")
par(mfrow=c(2,2))
 
plot(damage$Year, damage$Dam, xlab = "Year", ylab = "U.S. Hurricane Damage (billion USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21) 
 plot(damage$Year, log(damage$Dam), xlab = "Year", ylab = "U.S. Hurricane Damage (billion USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21)



Need to consider a threshold, balance between bias and variance
threshrange.plot(damage$Dam, r=c(1, 15), nint=20)

Having never seen a plot like before I look at
  • the size of uncertainty bound stabilising
    • for the reparameterized scale 
      •  10 billion too large
    • for shap
      • Just below 6 billion the uncertainty bounds appear to stabilise. 



mrlplot(damage$Dam, xlim = c(0, 12))
I could form a straight line (linear) from the local minima within the CI bounds to 12 billion. 

My own limited understanding of these plots agrees 6 billion may be a reasonable threshold.

Average number of hurricanes  is 2.06 per year

fitD <- fevd(Dam, damage, threshold = 6, type = "GP", time.units = "2.06/year")


plot(fitD)
What can I see

  • the outlier Miami 1926.  
    • This is right on the lower bound of the time series. Has the bound been choose on purpose?, Have the measurement tools changed? 
  • studying extremes so elimination not  options
  • population increase and spread to previously low population areas occurring
  • oil rigs

pextRemes(fitD, c(20, 40, 60, 100), lower.tail = FALSE)[1] 0.159285373 0.046886265 0.022234267 0.008513151

Katrina cost more than 40 billion (guesstimate)
Sandy cost more than 20 billion (guesstimate)

Lets extend the time series with more available data including the Galverston storms

Pielke et al. 1900-2005 Data


library(xlsx)
dat <- read.xlsx("public_data_may_2007-2.xls", sheetIndex=1 ,header = TRUE)

hur <- dat[,c(1,6)]
hur$damage <- hur$Normalized.PL05 / 100000000
hur <- hur[complete.cases(hur),]

plot(hur$Year, hur$damage, xlab = "Year", ylab = "U.S. Hurricane Damage (billion USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21)


 Galverston 1900, Miami 1926, and Katerina 2005 are the top three.


plot(hur$Year, log(hur$damage*1000), xlab = "Year", ylab = "U.S. Hurricane Damage (million USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21)
The plot seems more dense in the recent past. Record keeping, population and what is considered a major event may have changed.

Threshold
There are 207 events, 12 above 20 Billion. A suitable threshold should be less than 20 billion.



mrlplot(hur$damage, xlim = c(0, 20))
Choosing a threshold of 8 Billion
  • Below 10 billion few differences in uncertainty of shape.
  • At 8 billion the reparameterised scale uncertainty is stabilising
  • Mean Excess plot is remarkable boring below 15 Billion.

Average number of hurricanes 207 / 105 = 1.97


fitD <- fevd(damage, hur, threshold = 8, type = "GP", time.units = "1.97/year"
plot(fitD)

pextRemes(fitD, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.4800389 0.1962596 0.1009108 0.0382478

What can I see? 
  • Miami 1926 is not as lonely. 
  • The limits on all the axii are much greater.    
  • All the points in the return period/return level plot are within the confidence intervals
What was changed in the data?
  • time period
  • number of large events
  • normalisation of damages
Want more insight into how these changes are reflected in results
  • Trimming to 1901-2004 would remove two major events
  • Trimming to 1926-1995 to remove normalisation of damages
Trim 1901-2004
hur2004 <- hur[which(hur$Year > 1900),]
hur2004 <- hur2004[which(hur$Year <2005),]

average hurricane per year 1.97
keep same threshold 8 billion

 hur1901 <- hur[which(hur$Year > 1900),]
 hur1901 <- hur1901[which(hur1901$Year < 2005),]
fit1901 <- fevd(damage, hur1901, threshold = 8, type = "GP", time.units = "1.97/year")
hur1926 <- hur[which(hur$Year > 1925),] 
hur1926 <- hur1926[which(hur1926$Year < 1996),]
fit1926 <- fevd(damage, hur1926, threshold = 8, type = "GP", time.units = "1.97/year") 





> pextRemes(fit1901, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.44262508 0.16345777 0.07780206 0.02630498
> pextRemes(fit1926, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.46102420 0.17505044 0.08403011 0.02836343
> pextRemes(fitD, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.4800389 0.1962596 0.1009108 0.0382478


fit return level lower ci approx upper ci
1900-2005 20 25.01 42.15 59.28
1901-2004 20 20.95 35.05 49.15
1926-1995 20 20.84 38.644 56.45
1900-2005 100 26.77 102.70 178.63
1901-2004 100 25.96 82.19 138.41
1926-1995 100 20.41 88.80 157.18

Normalisation effected the axii.

Otherwise still feel insight low.

Initial look at EVT Return Levels

Continuing my study of Classical EVT

Main text

Coles, S. (2001). An introduction to statistical modeling of extreme values / Stuart Coles, London ; New York : Springer, [2001].

Also http://www.ral.ucar.edu/~ericg/extRemes/extRemes2.pdf which has been submitted to Journal of Statistical Software

fit1 <- fevd(TMX1, PORTw, units = "deg C")


plot(fit1) 




What the heck am I looking at? Visual ways to evaluate the fit of the model, which I don't have a feel for. The 2 quantiles plots are not convincing. 

Look at the Gumbel model

fit0 <- fevd(TMX1, PORTw, type = "Gumbel", units = "deg C")
plot(fit0) 





What do I see, Gumbel fit is worse than fit1
  • Quantiles differ from regression lines more at higher quantiles ( I am not certain if that matters)
  • The shape of the empirical and modelled density are less similar
  • The Gumbel return levels exceed the CI and appears to be bias to lower temperatures
Bayesian
The most obvious difference is in the Return Period and Level plot. The Baysian return levels are more within the 95% CI. Not certain what this implies.