COTOR Challenge Round 3

Estimate $500k x $500k layer

Solution by Steve Fiete

Given data and notation:

70 claims per year for each of seven years. The process which generates claims is the same each year, except that the expected value increases by a constant inflation factor each year. Enumerate the years starting with year=0 for the first year.

If x is a random claim in year 7, then the challenge is to estimate the expected value of min(500,000,max(0,x-500,000)) using a 95% confidence interval.
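As a small illustration of the layer definition above, the following sketch computes the loss to the 500k x 500k layer for a single claim. The function name layer_loss and the sample claim sizes are made up for illustration; they are not part of the challenge data.

```python
def layer_loss(x, attachment=500_000.0, limit=500_000.0):
    """Loss to the layer: min(limit, max(0, x - attachment))."""
    return min(limit, max(0.0, x - attachment))


# Illustrative claim sizes only (hypothetical, not from the challenge data):
# one below the attachment, one inside the layer, one exhausting the layer.
for claim in (200_000.0, 750_000.0, 2_000_000.0):
    print(claim, layer_loss(claim))
```

A claim below 500,000 contributes nothing to the layer, and a claim above 1,000,000 contributes the full 500,000 limit.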

Also calculate a 95% confidence interval for the sample mean of 70 claims in year 7. This will be denoted as xbar.

Analysis

The basic approach is to fit loss distributions to first-dollar, unlimited claims for each year, and to estimate the annual inflation rate. We try a variety of distributions, then use the one with the best fit to make inferences about losses in the 500k x 500k layer. While this is a common method, two aspects of this analysis are not standard industry practice:

  1. Estimating the inflation rate simultaneously with the loss distribution parameters for all 7 years using all 490 claims.
  2. Evaluating goodness of fit of the 7 distributions (1 for each year) by rolling the expected and actual severity distributions into a single p-p plot.

Typical methods of estimating severity trends involve estimating the average severity in each year (or quarter, or month), smoothing those averages, then fitting a trend. Since we are assuming a uniform trend over time we can simply incorporate the trend estimation into the estimation procedure for the other loss distribution parameters. This way all of the claims in all years influence the estimate of the scale parameter in the first year, the trend parameter, and any other loss distribution parameters. The advantage is that the maximum amount of information is used to estimate every parameter.

To estimate confidence intervals for both the true expected cost of the layer and the actual mean cost of the layer, we use simulation. Most of the calculations are done in SAS; the parts done in Excel are included with this document.

The claims trend rate is denoted as r. For simplicity, since each distribution considered has 2 parameters, we will use theta as the scale parameter and alpha as the second parameter. Theta varies by year so that the expected value increases by the trend rate. For year=k, theta is denoted as theta(k). The following distributions were evaluated:

Gamma:

Pdf: f(x)=(x/theta)^alpha*exp(-x/theta)/(x*G(alpha)), where G is the gamma function

theta(k+1)=(1+r)*theta(k)

Lognormal: x~LN(theta,alpha)

E(log(x))=theta

theta(k+1)=theta(k)+log(1+r)

Weibull:

Pdf: f(x)=alpha*(x/theta)^alpha * exp(-(x/theta)^alpha)/x

theta(k+1)=(1+r)*theta(k)

Inverse Weibull:

Pdf: f(x)=alpha*(theta/x)^alpha * exp(-(theta/x)^alpha)/x

theta(k+1)=(1+r)*theta(k)

Pareto:

Pdf: f(x)=alpha*theta^alpha * (x+theta)^(-alpha-1)

theta(k+1)=(1+r)*theta(k)

The pdf parameterizations were taken from Loss Models by Klugman, Panjer, and Willmot.

For each distribution theta(0), alpha, and r are estimated using maximum likelihood estimation. The next step is to evaluate goodness of fit.
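To make the joint estimation concrete, here is a minimal sketch of the log-likelihood that would be maximized for the Pareto case, with all years sharing alpha and r. It is written in Python rather than the SAS used for the actual analysis, and the function names pareto_logpdf and trended_loglik are hypothetical. The parameter values in the usage lines are placeholders, not the fitted estimates.

```python
import math


def pareto_logpdf(x, theta, alpha):
    """Log of f(x) = alpha * theta^alpha * (x + theta)^(-alpha - 1)."""
    return (math.log(alpha) + alpha * math.log(theta)
            - (alpha + 1.0) * math.log(x + theta))


def trended_loglik(claims_by_year, theta0, alpha, r):
    """Joint log-likelihood over all years, with theta(k) = theta0 * (1 + r)^k.

    claims_by_year[k] holds the claims observed in year k (k = 0, 1, ...).
    """
    total = 0.0
    for k, claims in enumerate(claims_by_year):
        theta_k = theta0 * (1.0 + r) ** k
        total += sum(pareto_logpdf(x, theta_k, alpha) for x in claims)
    return total


# Placeholder data and parameters, for illustration only.
toy_claims = [[12_000.0, 3_500.0], [20_000.0, 800.0]]
print(trended_loglik(toy_claims, theta0=10_000.0, alpha=2.0, r=0.10))
```

Passing the negative of trended_loglik to a general-purpose numerical optimizer over (theta0, alpha, r) would give the maximum likelihood estimates; the other distributions differ only in the log-density and, for the lognormal, in the additive trend on theta.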

The Excel workbook “pp plots.xls” shows p-p plots for each model comparing actual versus fitted loss distributions. The workbook also shows the parameter estimates for each distribution, the first-dollar expected value of severity in year 0, and the expected value of loss in the 500k x 500k layer in year 7. For each model there are 7 different distributions: 1 for each year. These can be combined into a single p-p plot. The Excel workbook “pp build.xls” shows how to combine data from 7 different gamma distributions into a single size-of-loss table. This table can then be made into a p-p plot. The equivalent tables for the other distributions are not shown, but the same approach was used to build their p-p plots.

P-P plots were used because they allow us to examine goodness of fit across the entire range of possible outcomes. A single summary number, such as the log-likelihood, does not.
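One way to pool several years into a single p-p plot is to evaluate each claim's fitted CDF under its own year's distribution, sort the results, and compare them with evenly spaced empirical probabilities. This is the probability integral transform, and it may differ in detail from the size-of-loss table built in the workbook. The sketch below uses the Pareto CDF implied by the pdf above; the function names and the i/(n+1) plotting position are assumptions made for illustration.

```python
def pareto_cdf(x, theta, alpha):
    """F(x) = 1 - (theta / (x + theta))^alpha."""
    return 1.0 - (theta / (x + theta)) ** alpha


def pooled_pp_points(claims_by_year, theta0, alpha, r):
    """Return (empirical, fitted) probability pairs pooled over all years."""
    fitted = []
    for k, claims in enumerate(claims_by_year):
        # Each year has its own scale parameter theta(k) = theta0 * (1 + r)^k.
        theta_k = theta0 * (1.0 + r) ** k
        fitted.extend(pareto_cdf(x, theta_k, alpha) for x in claims)
    fitted.sort()
    n = len(fitted)
    # Empirical probability of the i-th smallest value, taken as i / (n + 1).
    return [((i + 1) / (n + 1), p) for i, p in enumerate(fitted)]
```

If the model fits well, the returned pairs should lie close to the 45-degree line when plotted against each other.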

The Pareto distribution has the best fit, so we will assume claims within each year are generated by a Pareto distribution. The next step is to estimate a distribution of possible parameter estimates.
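For the Pareto parameterization above, the expected loss in a layer has a closed form as the difference of two limited expected values, where E[min(X, d)] = theta/(alpha-1) * (1 - (theta/(d+theta))^(alpha-1)) for alpha not equal to 1. The sketch below applies this to the year 7 scale parameter theta(7) = theta(0)*(1+r)^7. The function names are hypothetical, and no fitted parameter values are assumed here.

```python
import math


def pareto_lev(d, theta, alpha):
    """Limited expected value E[min(X, d)] for the two-parameter Pareto."""
    if alpha == 1.0:
        # Limiting case of the general formula as alpha approaches 1.
        return -theta * math.log(theta / (d + theta))
    return theta / (alpha - 1.0) * (1.0 - (theta / (d + theta)) ** (alpha - 1.0))


def expected_layer_loss(theta0, alpha, r, year=7,
                        attachment=500_000.0, limit=500_000.0):
    """Expected loss per claim in the layer `limit` excess of `attachment`."""
    theta_k = theta0 * (1.0 + r) ** year
    return (pareto_lev(attachment + limit, theta_k, alpha)
            - pareto_lev(attachment, theta_k, alpha))
```

Calling expected_layer_loss with the maximum likelihood estimates of theta(0), alpha, and r would give the point estimate of the expected value in the 500k x 500k layer.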

Using the estimates from the original data, we simulate 70 claims per year for each of the 7 years. With the simulated data we estimate the Pareto and inflation parameters again. Using these parameter estimates we calculate the expected value of the 500k x 500k layer. Finally, we simulate 70 claims for year 7 and calculate the sample mean of the 500k x 500k layer. This process is iterated 100 times. After it is done we have 100 expected layer values and 100 sample layer means. The 5th and 95th percentiles of these samples provide our confidence intervals.

The Excel workbook “simuations.xls” shows the results of 100 trials of simulating 70 claims per year in years 0 through 6 and estimating r, alpha, and theta(0). Using these parameters the expected value of loss in the 500k x 500k layer is calculated. The 100 trials are sorted by this expected value. The 5th and 95th values determine the confidence interval for the true mean in the layer.

With each simulation trial we also simulate 70 claims for year 7. Using this simulated data we can calculate an actual sample mean for the layer. The entire simulation and calculation is shown in “year 7 simulation.xls”. Note that alpha and theta are different for each of the 100 trials because in each trial they were estimated from claims simulated for years 0-6.
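The year 7 part of the simulation can be sketched as follows. This is a simplified illustration, not the procedure in the workbooks: it holds theta(0), alpha, and r fixed across trials instead of re-estimating them from simulated years 0-6, so it captures only the sampling variation of 70 claims. The parameter values in the usage lines are hypothetical placeholders, apart from the 18.5% trend reported in the Results; claims are drawn by inverting the Pareto CDF.

```python
import random


def simulate_pareto(n, theta, alpha, rng):
    """Draw n Pareto claims by inverting F(x) = 1 - (theta / (x + theta))^alpha."""
    # 1 - rng.random() lies in (0, 1], which avoids raising zero to a negative power.
    return [theta * ((1.0 - rng.random()) ** (-1.0 / alpha) - 1.0)
            for _ in range(n)]


def layer_sample_mean(claims, attachment=500_000.0, limit=500_000.0):
    """Average loss to the layer over a set of claims."""
    return sum(min(limit, max(0.0, x - attachment)) for x in claims) / len(claims)


def simulate_year7_means(theta0, alpha, r, trials=100, n=70, seed=0):
    """Sorted layer sample means for `trials` simulated sets of n year 7 claims."""
    rng = random.Random(seed)
    theta7 = theta0 * (1.0 + r) ** 7  # parameters held fixed: a simplification
    return sorted(layer_sample_mean(simulate_pareto(n, theta7, alpha, rng))
                  for _ in range(trials))


# Hypothetical theta0 and alpha; r = 0.185 is the trend quoted in the Results.
means = simulate_year7_means(theta0=20_000.0, alpha=1.5, r=0.185)
print(means[4], means[94])  # 5th and 95th of the 100 sorted values
```

To reproduce the full procedure, each trial would first simulate years 0 through 6, refit the three parameters, and then use the refitted values both for the expected layer value and for the year 7 draw.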

Results

The 95% confidence interval for the expected value in the layer has a lower bound of 6,984, and an upper bound of 17,669. The point estimate from the original data set is 12,738. The 95% confidence interval for the actual mean of 70 claims has a lower bound of 0, and an upper bound of 29,384.

The trend rate estimated in the Pareto model is 18.5%. If we apply this annual trend rate to each of the 490 claims to bring them to the year 7 level, then calculate the mean of each set of 70 claims, we get the following:

7,924 / 7,143 / 10,385 / 12,745 / 16,985 / 906 / 13,013

All 7 sample means lie within the 95% confidence interval for xbar. This last observation is really just a sniff test to make sure we did not simulate our way into a clearly unreasonable conclusion.
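The trending step used in this sniff test can be sketched as below: a claim from year k is multiplied by (1+r)^(7-k) to bring it to the year 7 level. The function name trend_to_year and the toy claims are hypothetical; only the 18.5% trend rate comes from the analysis.

```python
def trend_to_year(claims_by_year, r, target_year=7):
    """Scale each claim from year k by (1 + r)^(target_year - k)."""
    return [[x * (1.0 + r) ** (target_year - k) for x in claims]
            for k, claims in enumerate(claims_by_year)]


# Toy claims for years 0 and 1 (illustrative only), trended at 18.5% per year.
trended = trend_to_year([[10_000.0, 40_000.0], [25_000.0]], r=0.185)
print(trended)
```

The per-year means of the trended claims can then be compared with the simulated confidence interval for xbar.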