Tuesday, September 15, 2015

Trying out ExtRemes R package : Part 2 - Threshold Approach

I am using http://www.ral.ucar.edu/~ericg/extRemes/extRemes2.pdf to try out the "extRemes" R package.
Below are my notes and thoughts.
On my desk is a copy of
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London; New York: Springer.

Hurricane Data from "extRemes" package 
Estimated economic damage caused by U.S. hurricanes from 1926 to 1995

Note the costs are normalised.


data("damage")
par(mfrow=c(2,2))
 
plot(damage$Year, damage$Dam, xlab = "Year", ylab = "U.S. Hurricane Damage (billion USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21) 
 plot(damage$Year, log(damage$Dam), xlab = "Year", ylab = "U.S. Hurricane Damage (billion USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21)



Need to choose a threshold, balancing bias against variance.
threshrange.plot(damage$Dam, r=c(1, 15), nint=20)

Having never seen a plot like this before, I look at
  • where the size of the uncertainty bounds stabilises
    • for the reparameterised scale
      • 10 billion is too large
    • for the shape
      • just below 6 billion the uncertainty bounds appear to stabilise.



mrlplot(damage$Dam, xlim = c(0, 12))
I could draw a straight line from the local minimum, staying within the CI bounds, out to 12 billion.

My own limited understanding of these plots suggests 6 billion may be a reasonable threshold.

The average number of hurricanes is 2.06 per year.
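A quick sanity check of that rate from the data itself (a sketch; the damage records run from 1926 to 1995):

# events per year over the record; should come out near 2.06
n_years <- diff(range(damage$Year)) + 1
nrow(damage) / n_years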

fitD <- fevd(Dam, damage, threshold = 6, type = "GP", time.units = "2.06/year")


plot(fitD)
What can I see?

  • the outlier Miami 1926
    • This is right at the start of the time series. Has the start date been chosen on purpose? Have the measurement tools changed?
  • we are studying extremes, so eliminating the outlier is not an option
  • population increase, and spread into previously low-population areas, is occurring
  • oil rigs

pextRemes(fitD, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.159285373 0.046886265 0.022234267 0.008513151

Katrina cost more than 40 billion (guesstimate)
Sandy cost more than 20 billion (guesstimate)

Let's extend the time series with more of the available data, including the Galveston storms.

Pielke et al. 1900-2005 Data


library(xlsx)
dat <- read.xlsx("public_data_may_2007-2.xls", sheetIndex = 1, header = TRUE)

hur <- dat[, c(1, 6)]
hur$damage <- hur$Normalized.PL05 / 100000000
hur <- hur[complete.cases(hur), ]

plot(hur$Year, hur$damage, xlab = "Year", ylab = "U.S. Hurricane Damage (billion USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21)


Galveston 1900, Miami 1926, and Katrina 2005 are the top three.


plot(hur$Year, log(hur$damage*1000), xlab = "Year", ylab = "U.S. Hurricane Damage (million USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21)
The plot is denser in the recent past. Record keeping, population, and what counts as a major event may all have changed.

Threshold
There are 207 events, 12 of them above 20 billion. A suitable threshold should be less than 20 billion.
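Those counts can be checked directly from the data (a sketch):

nrow(hur)             # total number of events
sum(hur$damage > 20)  # events above 20 billion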



mrlplot(hur$damage, xlim = c(0, 20))
Choosing a threshold of 8 billion:
  • Below 10 billion there is little difference in the uncertainty of the shape (from a threshold range plot, not shown here).
  • At 8 billion the uncertainty in the reparameterised scale is stabilising.
  • The mean residual life plot is remarkably boring below 15 billion.

Average number of hurricanes: 207 / 105 ≈ 1.97 per year.


fitD <- fevd(damage, hur, threshold = 8, type = "GP", time.units = "1.97/year")
plot(fitD)

pextRemes(fitD, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.4800389 0.1962596 0.1009108 0.0382478

What can I see? 
  • Miami 1926 is not as lonely. 
  • The limits on all the axes are much greater.
  • All the points in the return period/return level plot are within the confidence intervals
What was changed in the data?
  • time period
  • number of large events
  • normalisation of damages
I want more insight into how these changes are reflected in the results:
  • Trimming to 1901-2004 would remove two major events (Galveston 1900 and Katrina 2005).
  • Trimming to 1926-1995 matches the period of the extRemes damage data, isolating the effect of the different normalisation.
Trim to 1901-2004 and 1926-1995

Keep the average of 1.97 hurricanes per year and the same 8 billion threshold.

hur1901 <- hur[which(hur$Year > 1900),]
hur1901 <- hur1901[which(hur1901$Year < 2005),]
fit1901 <- fevd(damage, hur1901, threshold = 8, type = "GP", time.units = "1.97/year")

hur1926 <- hur[which(hur$Year > 1925),]
hur1926 <- hur1926[which(hur1926$Year < 1996),]
fit1926 <- fevd(damage, hur1926, threshold = 8, type = "GP", time.units = "1.97/year")





> pextRemes(fit1901, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.44262508 0.16345777 0.07780206 0.02630498
> pextRemes(fit1926, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.46102420 0.17505044 0.08403011 0.02836343
> pextRemes(fitD, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.4800389 0.1962596 0.1009108 0.0382478


fit         return period (years)   lower CI   return level   upper CI
1900-2005   20                        25.01       42.15          59.28
1901-2004   20                        20.95       35.05          49.15
1926-1995   20                        20.84       38.64          56.45
1900-2005   100                       26.77      102.70         178.63
1901-2004   100                       25.96       82.19         138.41
1926-1995   100                       20.41       88.80         157.18
(Return levels and confidence limits are in billion USD.)
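A sketch of how a table like this can be pulled from the fitted objects with extRemes (I have not confirmed these are the exact calls used above):

return.level(fitD, return.period = c(20, 100), do.ci = TRUE)
return.level(fit1901, return.period = c(20, 100), do.ci = TRUE)
return.level(fit1926, return.period = c(20, 100), do.ci = TRUE)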

Normalisation affected the axes.

Otherwise I still feel the insight gained is low.

Initial look at EVT Return Levels

Continuing my study of Classical EVT

Main text

Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London; New York: Springer.

Also http://www.ral.ucar.edu/~ericg/extRemes/extRemes2.pdf, which has been submitted to the Journal of Statistical Software.

fit1 <- fevd(TMX1, PORTw, units = "deg C")


plot(fit1) 




What the heck am I looking at? Visual ways to evaluate the fit of the model, which I don't yet have a feel for. The two quantile plots are not convincing.

Look at the Gumbel model

fit0 <- fevd(TMX1, PORTw, type = "Gumbel", units = "deg C")
plot(fit0) 





What do I see? The Gumbel fit is worse than fit1:
  • The quantiles differ from the regression lines more at higher quantiles (I am not certain whether that matters).
  • The shapes of the empirical and modelled densities are less similar.
  • The Gumbel return levels exceed the CI and appear to be biased towards lower temperatures.
Bayesian
The most obvious difference is in the return period / return level plot. The Bayesian return levels lie more within the 95% CI. Not certain what this implies.
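To compare numbers rather than plots, return levels at a few return periods can be pulled out directly; a sketch using the fit1 and fit0 objects above:

return.level(fit1, return.period = c(2, 20, 100))
return.level(fit0, return.period = c(2, 20, 100))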



Monday, September 15, 2014

Trying out ExtRemes R package : Part 1 - Block Maxima Approach


I used http://www.ral.ucar.edu/~ericg/extRemes/extRemes2.pdf to try out the "extRemes" R package.
Below are my notes and thoughts.
On my desk was a copy of
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London; New York: Springer.


Fitting the Generalized Extreme Value distribution function  to the Port Jervis, New York annual maximum winter temperature data using "extRemes" R package

library(extRemes)
data("PORTw") 
plot(PORTw$TMX1, type = "l", xlab = "Year", ylab = "Maximum winter temperature", col = "darkblue")

Default Run

fit1 <- fevd(TMX1, PORTw, units = "deg C") 
fit1

Output

fevd(x = TMX1, data = PORTw, units = "deg C")
[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value:  172.7426 

 Estimated parameters:
   location      scale      shape
 15.1406132  2.9724952 -0.2171486

 Standard Error Estimates:
   location      scale      shape
 0.39745119 0.27521741 0.07438302

 Estimated parameter covariance matrix.
             location       scale        shape
 location  0.15796745  0.01028664 -0.010869596
 scale     0.01028664  0.07574462 -0.010234809
 shape    -0.01086960 -0.01023481  0.005532834

 AIC = 351.4853  

 BIC = 358.1438 

GEV model has 3 parameters

  • location
  • scale
  • shape
The simplest distribution function is the Gumbel, which has only 2 parameters.
When the Gumbel df is a good fit, the estimate of the shape parameter should be approximately 0.

Here the shape estimate is not close to 0, so the Gumbel is probably not a good fit.
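For reference, the GEV distribution function (Coles 2001), written with the three parameters above, is

G(z) = exp{ -[1 + shape * (z - location) / scale]^(-1/shape) },  valid where 1 + shape * (z - location) / scale > 0

and in the limit shape -> 0 it reduces to the Gumbel form

G(z) = exp{ -exp( -(z - location) / scale ) }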

Three measures of goodness of fit are provided:

  • Negative Log-Likelihood Value 
  • AIC
  • BIC

Gumbel Run

fit0 <- fevd(TMX1, PORTw, type = "Gumbel", units = "deg C")
fit0
fevd(x = TMX1, data = PORTw, type = "Gumbel", units = "deg C")
[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value:  175.7782 

 Estimated parameters:
 location     scale
14.799982  2.886128 
 
 Standard Error Estimates:
 location     scale
0.3709054 0.2585040 
 Estimated parameter covariance matrix.
           location      scale
location 0.13757080 0.03173887
scale    0.03173887 0.06682429
 AIC = 355.5563  
 BIC = 359.9954 
All the goodness-of-fit measures are larger, as expected.

lr.test Likelihood-Ratio Test

Use the likelihood ratio test to confirm this

lr.test(fit1,fit0)
Likelihood-ratio Test
data:  TMX1TMX1
Likelihood-ratio = 6.0711, chi-square critical value = 3.841, alpha = 0.050,
Degrees of Freedom = 1.000, p-value = 0.01374
          alternative hypothesis: greater 
I look at the p-value, because it makes sense to me.
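The statistic can also be checked by hand from the two reported negative log-likelihoods: 2 * (175.7782 - 172.7426) = 6.071, matching the reported likelihood-ratio, which is compared against the chi-square critical value (3.841) on 1 degree of freedom.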


Bayesian estimation 

> fB <- fevd(TMX1, PORTw, method = "Bayesian")
> fB
 
fevd(x = TMX1, data = PORTw, method = "Bayesian")
[1] "Estimation Method used: Bayesian"

 Acceptance Rates:
log.scale     shape
0.5379076 0.4971994 
fevd(x = TMX1, data = PORTw, method = "Bayesian")
[1] "Quantiles of MCMC Sample from Posterior Distribution"
               2.5% Posterior Mean         97.5%
location 14.2452034     15.1496558 16.0945492194
scale     2.5276179      3.0935399  3.8803429287
shape    -0.3654228     -0.1975376 -0.0004970114

 Estimated parameter covariance matrix.
              location    log.scale        shape
location   0.240496184  0.009813165 -0.015884315
log.scale  0.009813165  0.124166098 -0.013129196
shape     -0.015884315 -0.013129196  0.008814744
DIC =  1045.476 

L-moments 

fitLM <- fevd(TMX1, PORTw, method = "Lmoments")
fitLM

fevd(x = TMX1, data = PORTw, method = "Lmoments")
[1] "GEV  Fitted to  TMX1  of  PORTw data frame, using L-moments estimation."
  location      scale      shape
15.1775146  3.0286294 -0.2480594 
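The L-moments point estimates are close to the MLE values. One way to check consistency is to put confidence intervals on the MLE parameters (a sketch using extRemes' ci method):

ci(fit1, type = "parameter")

Using the standard errors reported earlier, the L-moments estimates fall inside the approximate 95% intervals for all three parameters.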

Significance of Covariate Information

plot(PORTw$TMX1, PORTw$AOindex)


> fit2 <- fevd(TMX1, PORTw, location.fun = ~AOindex, units = "deg C")
> fit2

fevd(x = TMX1, data = PORTw, location.fun = ~AOindex, units = "deg C")
[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value:  166.7992 

 Estimated parameters:
       mu0        mu1      scale      shape
15.2538412  1.1518782  2.6809613 -0.1812824 
 Standard Error Estimates:
       mu0        mu1      scale      shape
0.35592663 0.31800904 0.24186870 0.06725912 

 Estimated parameter covariance matrix.
               mu0          mu1        scale        shape
mu0    0.126683767  0.002230374  0.010009100 -0.008065698
mu1    0.002230374  0.101129752 -0.002538585  0.002075487
scale  0.010009100 -0.002538585  0.058500466 -0.007020374
shape -0.008065698  0.002075487 -0.007020374  0.004523789
 AIC = 341.5984  
 BIC = 350.4764 
What is going on here?
The location parameter is now modelled in two parts, mu0 and mu1:


mu(x) = mu0 + mu1 * x,   where x = AO Index
lr-test
All the goodness-of-fit measures are lower when the AO Index is considered.
I expect that utilising the AO Index information improves the fit of the model.
lr.test(fit1,fit2)
Likelihood-ratio Test
data:  TMX1TMX1
Likelihood-ratio = 11.8869, chi-square critical value = 3.841, alpha = 0.050,
Degrees of Freedom = 1.000, p-value = 0.0005653

p-value very small, result as expected

Is the covariate information significant for the scale parameter?

fit3 <- fevd(TMX1, PORTw, location.fun= ~AOindex, scale.fun = ~AOindex, units = "deg C")
> fit3
fevd(x = TMX1, data = PORTw, location.fun = ~AOindex, scale.fun = ~AOindex,
    units = "deg C")
[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value:  166.6593 

 Estimated parameters:
       mu0        mu1     sigma0     sigma1      shape
15.2616373  1.1802783  2.6782395 -0.1464682 -0.1860603 
 Standard Error Estimates:
       mu0        mu1     sigma0     sigma1      shape
0.35610192 0.31724472 0.24242220 0.27629143 0.06862845 

 Estimated parameter covariance matrix.
                mu0           mu1        sigma0       sigma1         shape
mu0     0.126808580 -0.0085301401  0.0098703920 -0.001720545 -0.0084519671
mu1    -0.008530140  0.1006442141 -0.0001225007 -0.014111326  0.0004891586
sigma0  0.009870392 -0.0001225007  0.0587685234 -0.005405442 -0.0073208829
sigma1 -0.001720545 -0.0141113262 -0.0054054420  0.076336953  0.0015286996
shape  -0.008451967  0.0004891586 -0.0073208829  0.001528700  0.0047098637
 AIC = 343.3185  
 BIC = 354.4161 

lr-test
All the goodness-of-fit measures are HIGHER when the AO Index is considered for both location and scale than for location only.
lr.test(fit2,fit3)
Likelihood-ratio Test
data:  TMX1TMX1
 
Likelihood-ratio = 0.2798, chi-square critical value = 3.841, alpha = 0.050,
Degrees of Freedom = 1.000, p-value = 0.5968
alternative hypothesis: greater
The p-value is large, so adding the AO Index to the scale parameter is not justified.

Thursday, September 11, 2014

Basic Time Series with R



Currently playing with the Port Jervis, New York annual maximum and minimum winter temperature data provided with the extRemes R package.

Extreme value analysis deals with rare events, which are unlikely to follow the patterns and assumptions that hold for the complete set of events, but standard initial analysis tools can still provide insight.

library(extRemes)
data("PORTw")

plot(PORTw$TMX1, type = "l", xlab = "Year", ylab = "Maximum winter temperature", col = "red")
plot(PORTw$TMN0, type = "l", xlab = "Year", ylab = "Minimum winter temperature", col = "darkblue")


Visual inspection

  • Trend - none
  • Cycle - none
  • Clustering- none
  • Pair wise correlation - maybe


Both series are non-cyclic and can probably be described using an additive model, since the random fluctuations in the data are roughly constant in size over time.

Exponential Model

Let's try fitting simple exponential smoothing.
Convert to a time series object:
> maxts <- ts(PORTw$TMX1, start=c(1927))
Fit a model with no trend or seasonal component:
> fit1 <- HoltWinters(maxts, beta=FALSE, gamma=FALSE)
> fit1
Holt-Winters exponential smoothing without trend and without seasonal component.
Call:
HoltWinters(x = maxts, beta = FALSE, gamma = FALSE)
Smoothing parameters:
 alpha: 0.1368945
 beta : FALSE
 gamma: FALSE
Coefficients:
      [,1]
a 16.49764
plot(fit1)
A simple measure of fit: the sum of squared errors.
fit1$SSE
[1] 763.3207 
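As a follow-on sketch, the fitted level can be projected forward with predict (predict.HoltWinters is in base R's stats package); with simple exponential smoothing the point forecast is flat at the last smoothed level, about 16.5 here:

predict(fit1, n.ahead = 5, prediction.interval = TRUE)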

ARIMA Model

Difference the time series:
maxtsdiff <- diff(maxts, differences = 1)

acf(maxtsdiff, lag.max = 20)


Basic visual inspection: maybe something at a 1 year lag.
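The partial autocorrelation function is also worth a look when guessing at AR versus MA orders (a sketch):

pacf(maxtsdiff, lag.max = 20)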

 maxtsarima <- arima(maxts, order=c(0,1,1))
> maxtsarima
Series: maxts
ARIMA(0,1,1)                  
Coefficients:
          ma1
      -1.0000

s.e.   0.0532
sigma^2 estimated as 9.636:   
log likelihood=-173.07 
AIC=350.15   
AICc=350.33  

BIC=354.56 
And let R work out whether, and with what order, an ARIMA model would be appropriate (auto.arima() is from the forecast package, so library(forecast) is needed first):

> mtsARIMA <- auto.arima(maxts)
> mtsARIMA
Series: maxts
ARIMA(0,0,0) with non-zero mean
Coefficients:
      intercept
        16.3154
s.e.     0.3737
sigma^2 estimated as 9.494:
 
log likelihood=-173.01
AIC=350.02  
AICc=350.21  
          BIC=354.46

No moving average happening

Pair Wise Correlation

Start with scatter plots
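A minimal sketch of the sort of scatter plots used here (the choice of variables is my assumption):

pairs(PORTw[, c("TMX1", "TMN0", "AOindex")])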




Check correlation coefficients

> cor(PORTw$TMX1, PORTw$TMN0)
[1] 0.1802413
> cor(PORTw$TMX1, PORTw$AOindex)
[1] 0.3944286
> cor(PORTw$TMN0, PORTw$AOindex)
[1] 0.01206764
> cor(PORTw$TMX1, PORTw$AOindex, method = "kendall")
[1] 0.3019692

Possibly something to work with between the Arctic Oscillation Index and the maximum winter temperature.
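A quick significance check on that correlation (a sketch):

cor.test(PORTw$TMX1, PORTw$AOindex)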

Sunday, August 24, 2014

Update QGIS from 2.2.0-8 to 2.4.0-1


Updating QGIS on Mac OSX 10.9.4

To update my QGIS I need to download 2 packages:

http://www.kyngchaos.com/software/qgis

http://www.kyngchaos.com/software/frameworks#gdal_complete

I deleted the old QGIS from the Applications directory.

I investigated updating GDAL to 1.11 using Anaconda, but got:

 the following combinations of packages create a conflict with the
remaining packages:
  - python 2.7*
  - gdal 1.11*

So I installed GDAL 1.11 using
http://www.kyngchaos.com/software/frameworks#gdal_complete

Then installed QGIS 2.4.0

Friday, August 15, 2014

Days since 0001-01-01

Viewing output from a CSIRO ACCESS 1.3 CMIP5 run, I saw the time units were 'days since 0001-01-01', and the coder in me had heart palpitations.

The joys of Julian calendars and leap years; does anyone know how many days lie between now and then?

Looking at an ACCESS post-processing script:
https://trac.nci.org.au/trac/access_tools/browser/app/trunk/app.py?rev=85

The reference dates (refdate) are 1 (0001-01-01) and 719,163 (1970-01-01).

As I am currently just mucking around, I am just glad that someone knows how many days have passed since January 1, Year 1.

Now I can transform to my frame of reference more readily.
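A minimal sketch of that conversion in R, assuming day 1 = 0001-01-01 and a proleptic Gregorian calendar (many model runs use 365-day or 360-day calendars, where this arithmetic would be wrong):

# ASSUMPTION: day counting starts at 1 on 0001-01-01, proleptic Gregorian calendar
days_since_year1 <- 719163                             # the 1970-01-01 refdate from app.py
as.Date(days_since_year1 - 1, origin = "0001-01-01")   # expect "1970-01-01"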

Sunday, June 15, 2014

Climate Change in 4 Dimensions


From April 8, 2014 until June 17 I participated in the online course

www.coursera.org

Climate Change in 4 Dimensions
University of California, San Diego.

Overall I learnt a lot about climate change beyond the science aspects.

During the course I kept notes to provide some feedback and ideas for improvement.
I have captured them in this post.

Lecture 5

The early part is very US-centric.

In the Global Average Warming chart (after Ocean Acidification and Coral), the text is too small to read.

Lecture 6

Very informative and enjoyable; Professor David G. Victor engaged me well. He introduced the stock problem economic model to me, and it is already bringing new ideas into my thinking.


Lecture 9


Weekly activity
The survey is USA-oriented. Of course the US should act independently of other countries; the question for me is whether Australia should act independently of other countries.

This week's graded quiz was not up to scratch. Usually only one question, or none, is badly worded, easy to misunderstand, or difficult to attribute to any of the readings or lectures. This week there were multiple, including:
Questions 8 and 9: difficult to understand/read.
"Heat waves and heavy precipitation" is not a trend.

Lecture 10

Feedback: 
The slide from Rogelj & Meinshausen 2012 is out of date, and the name is spelt inconsistently with other slides.
Chaparral is not a common international word, and is not what would grow in Australia.
A question in the ungraded quiz is unclear:
"Based on global climate models, what regions will see the biggest precipitation increase with global warming?"
The question is actually about North America, which is not made clear.

Concepts

Uses indices to show warming, particularly in the USA; definitely regional/spatial.
Indices

  • Mean annual temperature
  • Number of cold days
  • Number of hot days
  • Agricultural Based - frost, super hot days, frost free (growing season) forecast.
    • The Keeling curve shows the growing season has increased by about 10 days, in agreement with agricultural models.
Heat Wave

  • Multiple factors, some of which are more common due to climate change.
  • Europe 2003, 2006, Australia 2009 (Karoly), Russia 2010, USA 2009


Models don't seem to capture very cold days. 

Midterm

Question 7 is unclear:
"If someone makes a prediction, and it comes true, does that mean that the prediction is correct?"
Does it mean the hypothesis or the prediction is true?

Question 9 
It is debatable whether one or more answers are correct.

Week 8 

Lecture 14 
Ice, snow and water
Does NOT mention the West Antarctic ice shelf information?
Projections continue to change with increasing understanding?
Parts of the lecture are dating quickly.

Lecture 15
Very directed at a Californian audience. The assumed level of knowledge about the geography of California was higher than my own, though I have travelled there. I think this would exclude some of the Coursera audience.

Quiz  Question 9
"Please fill in the blank: By 2050, the number of ‘extremely hot’ days could increase ________."
This statement was not in the lecture notes, so how does one know which study location to cite? Does it relate to global, USA, California, San Diego, or Australian 'extremely hot' days?

Lecture 19

Spelling mistake in the "Check your knowledge quiz" question 4
Arpanet