Tuesday, September 15, 2015

Trying out ExtRemes R package : Part 2 - Threshold Approach

I am using http://www.ral.ucar.edu/~ericg/extRemes/extRemes2.pdf to try out the "extRemes" R package.
Below are my notes and thoughts.
On my desk is a copy of
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London; New York: Springer.

Hurricane Data from "extRemes" package 
Estimated economic damage caused by U.S. hurricanes from 1926 to 1995

Note the costs are normalised.


data("damage")
par(mfrow=c(2,2))
 
plot(damage$Year, damage$Dam, xlab = "Year", ylab = "U.S. Hurricane Damage (billion USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21) 
 plot(damage$Year, log(damage$Dam), xlab = "Year", ylab = "U.S. Hurricane Damage (billion USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21)



Need to choose a threshold, balancing bias against variance.
threshrange.plot(damage$Dam, r=c(1, 15), nint=20)

Having never seen a plot like this before, I look at
  • where the size of the uncertainty bounds stabilises
    • for the reparameterised scale
      • 10 billion is too large
    • for the shape
      • just below 6 billion the uncertainty bounds appear to stabilise.



mrlplot(damage$Dam, xlim = c(0, 12))
I could draw a straight line from the local minimum, staying within the CI bounds, out to 12 billion.

My own limited understanding of these plots suggests 6 billion may be a reasonable threshold.

The average number of hurricanes is 2.06 per year.
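A quick sanity check of that rate from the data itself (a sketch; the damage records run from 1926 to 1995):

# events per year over the record; should come out near 2.06
n_years <- diff(range(damage$Year)) + 1
nrow(damage) / n_years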

fitD <- fevd(Dam, damage, threshold = 6, type = "GP", time.units = "2.06/year")


plot(fitD)
What can I see?

  • the outlier Miami 1926
    • This is right at the start of the time series. Has the start date been chosen on purpose? Have the measurement tools changed?
  • we are studying extremes, so eliminating the outlier is not an option
  • population increase, and spread into previously low-population areas, is occurring
  • oil rigs

pextRemes(fitD, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.159285373 0.046886265 0.022234267 0.008513151

Katrina cost more than 40 billion (guesstimate)
Sandy cost more than 20 billion (guesstimate)

Let's extend the time series with more of the available data, including the Galveston storms.

Pielke et al. 1900-2005 Data


library(xlsx)
dat <- read.xlsx("public_data_may_2007-2.xls", sheetIndex = 1, header = TRUE)

hur <- dat[, c(1, 6)]
hur$damage <- hur$Normalized.PL05 / 100000000
hur <- hur[complete.cases(hur), ]

plot(hur$Year, hur$damage, xlab = "Year", ylab = "U.S. Hurricane Damage (billion USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21)


Galveston 1900, Miami 1926, and Katrina 2005 are the top three.


plot(hur$Year, log(hur$damage*1000), xlab = "Year", ylab = "U.S. Hurricane Damage (million USD)",cex = 0.75, col = "darkblue", bg = "lightblue", pch = 21)
The plot is denser in the recent past. Record keeping, population, and what counts as a major event may all have changed.

Threshold
There are 207 events, 12 of them above 20 billion. A suitable threshold should be less than 20 billion.
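Those counts can be checked directly from the data (a sketch):

nrow(hur)             # total number of events
sum(hur$damage > 20)  # events above 20 billion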



mrlplot(hur$damage, xlim = c(0, 20))
Choosing a threshold of 8 billion:
  • Below 10 billion there is little difference in the uncertainty of the shape (from a threshold range plot, not shown here).
  • At 8 billion the uncertainty in the reparameterised scale is stabilising.
  • The mean residual life plot is remarkably boring below 15 billion.

Average number of hurricanes: 207 / 105 ≈ 1.97 per year.


fitD <- fevd(damage, hur, threshold = 8, type = "GP", time.units = "1.97/year")
plot(fitD)

pextRemes(fitD, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.4800389 0.1962596 0.1009108 0.0382478

What can I see? 
  • Miami 1926 is not as lonely. 
  • The limits on all the axes are much greater.
  • All the points in the return period/return level plot are within the confidence intervals
What was changed in the data?
  • time period
  • number of large events
  • normalisation of damages
I want more insight into how these changes are reflected in the results:
  • Trimming to 1901-2004 would remove two major events (Galveston 1900 and Katrina 2005).
  • Trimming to 1926-1995 matches the period of the extRemes damage data, isolating the effect of the different normalisation.
Trim to 1901-2004 and 1926-1995

Keep the average of 1.97 hurricanes per year and the same 8 billion threshold.

hur1901 <- hur[which(hur$Year > 1900),]
hur1901 <- hur1901[which(hur1901$Year < 2005),]
fit1901 <- fevd(damage, hur1901, threshold = 8, type = "GP", time.units = "1.97/year")

hur1926 <- hur[which(hur$Year > 1925),]
hur1926 <- hur1926[which(hur1926$Year < 1996),]
fit1926 <- fevd(damage, hur1926, threshold = 8, type = "GP", time.units = "1.97/year")





> pextRemes(fit1901, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.44262508 0.16345777 0.07780206 0.02630498
> pextRemes(fit1926, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.46102420 0.17505044 0.08403011 0.02836343
> pextRemes(fitD, c(20, 40, 60, 100), lower.tail = FALSE)
[1] 0.4800389 0.1962596 0.1009108 0.0382478


fit         return period (years)   lower CI   return level   upper CI
1900-2005   20                        25.01       42.15          59.28
1901-2004   20                        20.95       35.05          49.15
1926-1995   20                        20.84       38.64          56.45
1900-2005   100                       26.77      102.70         178.63
1901-2004   100                       25.96       82.19         138.41
1926-1995   100                       20.41       88.80         157.18
(Return levels and confidence limits are in billion USD.)
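A sketch of how a table like this can be pulled from the fitted objects with extRemes (I have not confirmed these are the exact calls used above):

return.level(fitD, return.period = c(20, 100), do.ci = TRUE)
return.level(fit1901, return.period = c(20, 100), do.ci = TRUE)
return.level(fit1926, return.period = c(20, 100), do.ci = TRUE)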

Normalisation affected the axes.

Otherwise I still feel the insight gained is low.

Initial look at EVT Return Levels

Continuing my study of Classical EVT

Main text

Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London; New York: Springer.

Also http://www.ral.ucar.edu/~ericg/extRemes/extRemes2.pdf, which has been submitted to the Journal of Statistical Software.

fit1 <- fevd(TMX1, PORTw, units = "deg C")


plot(fit1) 




What the heck am I looking at? Visual ways to evaluate the fit of the model, which I don't yet have a feel for. The two quantile plots are not convincing.

Look at the Gumbel model

fit0 <- fevd(TMX1, PORTw, type = "Gumbel", units = "deg C")
plot(fit0) 





What do I see? The Gumbel fit is worse than fit1:
  • The quantiles differ from the regression lines more at higher quantiles (I am not certain whether that matters).
  • The shapes of the empirical and modelled densities are less similar.
  • The Gumbel return levels exceed the CI and appear to be biased towards lower temperatures.
Bayesian
The most obvious difference is in the return period / return level plot. The Bayesian return levels lie more within the 95% CI. Not certain what this implies.
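To compare numbers rather than plots, return levels at a few return periods can be pulled out directly; a sketch using the fit1 and fit0 objects above:

return.level(fit1, return.period = c(2, 20, 100))
return.level(fit0, return.period = c(2, 20, 100))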



Monday, September 15, 2014

Trying out ExtRemes R package : Part 1 - Block Maxima Approach


I used http://www.ral.ucar.edu/~ericg/extRemes/extRemes2.pdf to try out the "extRemes" R package.
Below are my notes and thoughts.
On my desk was a copy of
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London; New York: Springer.


Fitting the Generalized Extreme Value distribution function  to the Port Jervis, New York annual maximum winter temperature data using "extRemes" R package

library(extRemes)
data("PORTw") 
plot(PORTw$TMX1, type = "l", xlab = "Year", ylab = "Maximum winter temperature", col = "darkblue")

Default Run

fit1 <- fevd(TMX1, PORTw, units = "deg C") 
fit1

Output

fevd(x = TMX1, data = PORTw, units = "deg C")
[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value:  172.7426 

 Estimated parameters:
   location      scale      shape
 15.1406132  2.9724952 -0.2171486

 Standard Error Estimates:
   location      scale      shape
 0.39745119 0.27521741 0.07438302

 Estimated parameter covariance matrix.
             location       scale        shape
 location  0.15796745  0.01028664 -0.010869596
 scale     0.01028664  0.07574462 -0.010234809
 shape    -0.01086960 -0.01023481  0.005532834

 AIC = 351.4853  

 BIC = 358.1438 

GEV model has 3 parameters

  • location
  • scale
  • shape
The simplest distribution function is the Gumbel, which has only 2 parameters.
When the Gumbel df is a good fit, the estimate of the shape parameter should be approximately 0.

Here the shape estimate is not close to 0, so the Gumbel is probably not a good fit.
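For reference, the GEV distribution function (Coles 2001), written with the three parameters above, is

G(z) = exp{ -[1 + shape * (z - location) / scale]^(-1/shape) },  valid where 1 + shape * (z - location) / scale > 0

and in the limit shape -> 0 it reduces to the Gumbel form

G(z) = exp{ -exp( -(z - location) / scale ) }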

Three measures of goodness of fit are provided:

  • Negative Log-Likelihood Value 
  • AIC
  • BIC

Gumbel Run

fit0 <- fevd(TMX1, PORTw, type = "Gumbel", units = "deg C")
fit0
fevd(x = TMX1, data = PORTw, type = "Gumbel", units = "deg C")
[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value:  175.7782 

 Estimated parameters:
 location     scale
14.799982  2.886128 
 
 Standard Error Estimates:
 location     scale
0.3709054 0.2585040 
 Estimated parameter covariance matrix.
           location      scale
location 0.13757080 0.03173887
scale    0.03173887 0.06682429
 AIC = 355.5563  
 BIC = 359.9954 
All the goodness-of-fit measures are larger, as expected.

lr.test Likelihood-Ratio Test

Use the likelihood ratio test to confirm this

lr.test(fit1,fit0)
Likelihood-ratio Test
data:  TMX1TMX1
Likelihood-ratio = 6.0711, chi-square critical value = 3.841, alpha = 0.050,
Degrees of Freedom = 1.000, p-value = 0.01374
          alternative hypothesis: greater 
I look at the p-value, because it makes sense to me.
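The statistic can also be checked by hand from the two reported negative log-likelihoods: 2 * (175.7782 - 172.7426) = 6.071, matching the reported likelihood-ratio, which is compared against the chi-square critical value (3.841) on 1 degree of freedom.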


Bayesian estimation 

> fB <- fevd(TMX1, PORTw, method = "Bayesian")
> fB
 
fevd(x = TMX1, data = PORTw, method = "Bayesian")
[1] "Estimation Method used: Bayesian"

 Acceptance Rates:
log.scale     shape
0.5379076 0.4971994 
fevd(x = TMX1, data = PORTw, method = "Bayesian")
[1] "Quantiles of MCMC Sample from Posterior Distribution"
               2.5% Posterior Mean         97.5%
location 14.2452034     15.1496558 16.0945492194
scale     2.5276179      3.0935399  3.8803429287
shape    -0.3654228     -0.1975376 -0.0004970114

 Estimated parameter covariance matrix.
              location    log.scale        shape
location   0.240496184  0.009813165 -0.015884315
log.scale  0.009813165  0.124166098 -0.013129196
shape     -0.015884315 -0.013129196  0.008814744
DIC =  1045.476 

L-moments 

fitLM <- fevd(TMX1, PORTw, method = "Lmoments")
fitLM

fevd(x = TMX1, data = PORTw, method = "Lmoments")
[1] "GEV  Fitted to  TMX1  of  PORTw data frame, using L-moments estimation."
  location      scale      shape
15.1775146  3.0286294 -0.2480594 
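The L-moments point estimates are close to the MLE values. One way to check consistency is to put confidence intervals on the MLE parameters (a sketch using extRemes' ci method):

ci(fit1, type = "parameter")

Using the standard errors reported earlier, the L-moments estimates fall inside the approximate 95% intervals for all three parameters.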

Significance of Covariate Information

plot(PORTw$TMX1, PORTw$AOindex)


> fit2 <- fevd(TMX1, PORTw, location.fun = ~AOindex, units = "deg C")
> fit2

fevd(x = TMX1, data = PORTw, location.fun = ~AOindex, units = "deg C")
[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value:  166.7992 

 Estimated parameters:
       mu0        mu1      scale      shape
15.2538412  1.1518782  2.6809613 -0.1812824 
 Standard Error Estimates:
       mu0        mu1      scale      shape
0.35592663 0.31800904 0.24186870 0.06725912 

 Estimated parameter covariance matrix.
               mu0          mu1        scale        shape
mu0    0.126683767  0.002230374  0.010009100 -0.008065698
mu1    0.002230374  0.101129752 -0.002538585  0.002075487
scale  0.010009100 -0.002538585  0.058500466 -0.007020374
shape -0.008065698  0.002075487 -0.007020374  0.004523789
 AIC = 341.5984  
 BIC = 350.4764 
What is going on here?
The location parameter is now modelled in two parts, mu0 and mu1:


mu(x) = mu0 + mu1 * x,   where x = AO Index
lr-test
All the goodness-of-fit measures are lower when the AO Index is considered.
I expect that utilising the AO Index information improves the fit of the model.
lr.test(fit1,fit2)
Likelihood-ratio Test
data:  TMX1TMX1
Likelihood-ratio = 11.8869, chi-square critical value = 3.841, alpha = 0.050,
Degrees of Freedom = 1.000, p-value = 0.0005653

p-value very small, result as expected

Is the covariate information significant for the scale parameter?

fit3 <- fevd(TMX1, PORTw, location.fun= ~AOindex, scale.fun = ~AOindex, units = "deg C")
> fit3
fevd(x = TMX1, data = PORTw, location.fun = ~AOindex, scale.fun = ~AOindex,
    units = "deg C")
[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value:  166.6593 

 Estimated parameters:
       mu0        mu1     sigma0     sigma1      shape
15.2616373  1.1802783  2.6782395 -0.1464682 -0.1860603 
 Standard Error Estimates:
       mu0        mu1     sigma0     sigma1      shape
0.35610192 0.31724472 0.24242220 0.27629143 0.06862845 

 Estimated parameter covariance matrix.
                mu0           mu1        sigma0       sigma1         shape
mu0     0.126808580 -0.0085301401  0.0098703920 -0.001720545 -0.0084519671
mu1    -0.008530140  0.1006442141 -0.0001225007 -0.014111326  0.0004891586
sigma0  0.009870392 -0.0001225007  0.0587685234 -0.005405442 -0.0073208829
sigma1 -0.001720545 -0.0141113262 -0.0054054420  0.076336953  0.0015286996
shape  -0.008451967  0.0004891586 -0.0073208829  0.001528700  0.0047098637
 AIC = 343.3185  
 BIC = 354.4161 

lr-test
All the goodness-of-fit measures are HIGHER when the AO Index is considered for both location and scale than for location only.
lr.test(fit2,fit3)
Likelihood-ratio Test
data:  TMX1TMX1
 
Likelihood-ratio = 0.2798, chi-square critical value = 3.841, alpha = 0.050,
Degrees of Freedom = 1.000, p-value = 0.5968
alternative hypothesis: greater
The p-value is large, so adding the AO Index to the scale parameter is not justified.

Thursday, September 11, 2014

Basic Time Series with R



Currently playing with the Port Jervis, New York annual maximum and minimum winter temperature data provided with the extRemes R package.

Extreme value analysis deals with rare events, which are unlikely to follow the patterns and assumptions that hold for the complete set of events, but standard initial analysis tools can still provide insight.

library(extRemes)
data("PORTw")

plot(PORTw$TMX1, type = "l", xlab = "Year", ylab = "Maximum winter temperature", col = "red")
plot(PORTw$TMN0, type = "l", xlab = "Year", ylab = "Minimum winter temperature", col = "darkblue")


Visual inspection

  • Trend - none
  • Cycle - none
  • Clustering- none
  • Pair wise correlation - maybe


Both series are non-cyclic and can probably be described using an additive model, since the random fluctuations in the data are roughly constant in size over time.

Exponential Model

Let's try fitting simple exponential smoothing.
Convert to a time series object:
> maxts <- ts(PORTw$TMX1, start=c(1927))
Fit a model with no trend or seasonal component:
> fit1 <- HoltWinters(maxts, beta=FALSE, gamma=FALSE)
> fit1
Holt-Winters exponential smoothing without trend and without seasonal component.
Call:
HoltWinters(x = maxts, beta = FALSE, gamma = FALSE)
Smoothing parameters:
 alpha: 0.1368945
 beta : FALSE
 gamma: FALSE
Coefficients:
      [,1]
a 16.49764
plot(fit1)
A simple measure of fit: the sum of squared errors.
fit1$SSE
[1] 763.3207 
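As a follow-on sketch, the fitted level can be projected forward with predict (predict.HoltWinters is in base R's stats package); with simple exponential smoothing the point forecast is flat at the last smoothed level, about 16.5 here:

predict(fit1, n.ahead = 5, prediction.interval = TRUE)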

ARIMA Model

Difference the time series:
maxtsdiff <- diff(maxts, differences = 1)

acf(maxtsdiff, lag.max = 20)


Basic visual inspection: maybe something at a 1 year lag.
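The partial autocorrelation function is also worth a look when guessing at AR versus MA orders (a sketch):

pacf(maxtsdiff, lag.max = 20)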

 maxtsarima <- arima(maxts, order=c(0,1,1))
> maxtsarima
Series: maxts
ARIMA(0,1,1)                  
Coefficients:
          ma1
      -1.0000

s.e.   0.0532
sigma^2 estimated as 9.636:   
log likelihood=-173.07 
AIC=350.15   
AICc=350.33  

BIC=354.56 
And let R work out whether, and with what order, an ARIMA model would be appropriate (auto.arima() is from the forecast package, so library(forecast) is needed first):

> mtsARIMA <- auto.arima(maxts)
> mtsARIMA
Series: maxts
ARIMA(0,0,0) with non-zero mean
Coefficients:
      intercept
        16.3154
s.e.     0.3737
sigma^2 estimated as 9.494:
 
log likelihood=-173.01
AIC=350.02  
AICc=350.21  
          BIC=354.46

No moving average happening

Pair Wise Correlation

Start with scatter plots
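A minimal sketch of the sort of scatter plots used here (the choice of variables is my assumption):

pairs(PORTw[, c("TMX1", "TMN0", "AOindex")])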




Check correlation coefficients

> cor(PORTw$TMX1, PORTw$TMN0)
[1] 0.1802413
> cor(PORTw$TMX1, PORTw$AOindex)
[1] 0.3944286
> cor(PORTw$TMN0, PORTw$AOindex)
[1] 0.01206764
> cor(PORTw$TMX1, PORTw$AOindex, method = "kendall")
[1] 0.3019692

Possibly something to work with between the Arctic Oscillation Index and the maximum winter temperature.
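A quick significance check on that correlation (a sketch):

cor.test(PORTw$TMX1, PORTw$AOindex)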

Sunday, August 24, 2014

Update QGIS from 2.2.0-8 to 2.4.0-1


Updating QGIS on Mac OSX 10.9.4

To update my QGIS I need to download 2 packages:

http://www.kyngchaos.com/software/qgis

http://www.kyngchaos.com/software/frameworks#gdal_complete

I deleted the old QGIS from the Applications directory.

I investigated updating GDAL to 1.11 using Anaconda, but got:

 the following combinations of packages create a conflict with the
remaining packages:
  - python 2.7*
  - gdal 1.11*

So I installed GDAL 1.11 using
http://www.kyngchaos.com/software/frameworks#gdal_complete

Then installed QGIS 2.4.0

Friday, August 15, 2014

Days since 0001-01-01

Viewing output from a CSIRO ACCESS 1.3 CMIP5 run, I saw the time units were 'days since 0001-01-01', and the coder in me had heart palpitations.

The joys of Julian calendars and leap years; does anyone know how many days lie between now and then?

Looking at an ACCESS post-processing script:
https://trac.nci.org.au/trac/access_tools/browser/app/trunk/app.py?rev=85

The reference dates (refdate) are 1 (0001-01-01) and 719,163 (1970-01-01).

As I am currently just mucking around, I am just glad that someone knows how many days have passed since January 1, Year 1.

Now I can transform to my frame of reference more readily.
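A minimal sketch of that conversion in R, assuming day 1 = 0001-01-01 and a proleptic Gregorian calendar (many model runs use 365-day or 360-day calendars, where this arithmetic would be wrong):

# ASSUMPTION: day counting starts at 1 on 0001-01-01, proleptic Gregorian calendar
days_since_year1 <- 719163                             # the 1970-01-01 refdate from app.py
as.Date(days_since_year1 - 1, origin = "0001-01-01")   # expect "1970-01-01"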

Sunday, June 15, 2014

Climate Change in 4 Dimensions


From April 8, 2014 until June 17 I participated in the online course

www.coursera.org

Climate Change in 4 Dimensions
University of California, San Diego.

Overall I learnt a lot about climate change beyond the science aspects.

During the course I kept notes to provide some feedback and ideas for improvement.
I have captured them in this post.

Lecture 5

The early part is very US-centric.

In the Global Average Warming chart (after Ocean Acidification and Coral), the text is too small to read.

Lecture 6

Very informative and enjoyable; Professor David G. Victor engaged me well. He introduced the stock problem economic model to me, and it is already bringing new ideas into my thinking.


Lecture 9


Weekly activity
The survey is USA-oriented. Of course the US should act independently of other countries; the question for me is whether Australia should act independently of other countries.

This week's graded quiz was not up to scratch. Usually only one question, or none, is badly worded, easy to misunderstand, or difficult to attribute to any of the readings or lectures. This week there were multiple, including:
Questions 8 and 9: difficult to understand/read.
"Heat waves and heavy precipitation" is not a trend.

Lecture 10

Feedback: 
The slide from Rogelj & Meinshausen 2012 is out of date, and the name is spelt inconsistently with other slides.
Chaparral is not a common international word, and is not what would grow in Australia.
A question in the ungraded quiz is unclear:
"Based on global climate models, what regions will see the biggest precipitation increase with global warming?"
The question is actually about North America, which is not made clear.

Concepts

Uses indices to show warming, particularly in the USA; definitely regional/spatial.
Indices

  • Mean annual temperature
  • Number of cold days
  • Number of hot days
  • Agricultural Based - frost, super hot days, frost free (growing season) forecast.
    • The Keeling curve shows the growing season has increased by about 10 days, in agreement with agricultural models.
Heat Wave

  • Multiple factors, some of which are more common due to climate change.
  • Europe 2003, 2006, Australia 2009 (Karoly), Russia 2010, USA 2009


Models don't seem to capture very cold days. 

Midterm

Question 7 is unclear:
"If someone makes a prediction, and it comes true, does that mean that the prediction is correct?"
Does it mean the hypothesis or the prediction is true?

Question 9 
It is debatable whether one or more answers are correct.

Week 8 

Lecture 14 
Ice, snow and water
Does NOT mention the West Antarctic ice shelf information?
Projections continue to change with increasing understanding?
Parts of the lecture are dating quickly.

Lecture 15
Very directed at a Californian audience. The assumed level of knowledge about the geography of California was higher than my own, though I have travelled there. I think this would exclude some of the Coursera audience.

Quiz  Question 9
"Please fill in the blank: By 2050, the number of ‘extremely hot’ days could increase ________."
This statement was not in the lecture notes, so how does one know which study location to cite? Does it relate to global, USA, California, San Diego, or Australian 'extremely hot' days?

Lecture 19

Spelling mistake in the "Check your knowledge quiz" question 4
Arpanet