Data Analysis Project
December 13, 2004
Wave Height Prediction and the Time of Significant Swell
I have always had a strong affinity for the ocean and its environment. When I was quite young I was introduced to surfing and have never looked back. Surfing, being a lifestyle and not just a hobby, is ever present on my mind. With the ever-growing demands on my time from the other areas of my life, it is difficult to sit around, board under arm, waiting for the next swell to appear. I am looking for a reliable and accurate method to determine when swell events are going to occur and how long they will last, so that I can plan ahead for the waves and keep myself and those around me happy while sating my surfing itch.
Waves are the product of wind and fetch (the distance over which the wind blows). Ideally, waves are generated by a strong wind over a long fetch. This produces what is called a ground swell. A ground swell is a long-period swell that tends to have a greater significant wave height (dependent on the wind speeds generating the swell), and by observation it seems to last longer than a shorter-period swell caused by wind over a smaller fetch. Storms out at sea, in combination with local weather, have an impact on the size, frequency, and length of a swell window (the time that a dominant wave period begins to show on a near-shore buoy).
By analyzing the conditions at different buoys I would like to be able to determine the size of a swell and the length of time that it will be adequate for surfing. Local wave heights will be considered sufficient for surfing if they register on the local buoy at 3 feet or higher.
In general, I expect that I will find the most significant swell patterns to result from hurricanes that will arrive from the south in the late summer and early fall months and from Northeaster storms that dominate the winter and early spring weather pattern. In addition, there are rare events that occur throughout the year where local weather patterns are not at all conducive to wave generation, but a major swell event occurs. These instances may be the most useful in identifying the offshore weather conditions that produce swell and allow me the necessary information to predict localized swell events before they physically evidence themselves.
There are other factors besides the weather that affect waves. The bottom contours of the ocean floor, the steepness of the shoreline, and the material (sand, reef, rock) that the ocean floor is made of all affect the local wave patterns. I am choosing to ignore these factors in my study because the New Jersey shore is mostly sandy bottom, resulting in what is called a beach break, as opposed to a reef or a point break. Beach breaks have sand bars, which are the shelves that slow the lower (underwater) portion of the wave while the top of the wave continues at speed, resulting in a breaking wave. Beach breaks themselves tend to be highly variable, with waves breaking in different areas along the beach depending upon the shifting of the sand bars. I am choosing to ignore this variability because I am not looking at specific breaks along the shore; rather, I am using the data buoy to indicate general conditions that produce waves of a significant height, knowing that these general conditions will result in variable conditions depending on one's location along the New Jersey shore. Having local knowledge of the areas I most frequent and having built up years of “eyes on data,” I can then translate the general buoy data into more specific data about my favorite spots, but it is nearly impossible to quantify my local knowledge, and I don’t want to give away all of my secrets anyway.
I have chosen to record ten months' worth of data at three buoys. These ten months are important because the summer months, when there is traditionally little wave-producing activity, should provide a nice contrast to the rest of the year, when waves are generated by the late summer / early fall hurricane season and the winter / early spring Northeaster storms.[1]
The data I have gathered is from three NOAA buoy stations. I have chosen these stations because two of them will capture the storm activity out at sea that produces swell for the New Jersey shore, and the third will give an indication of local wave height and swell period.
- Station 44011 – Georges Bank; to capture the northeastern storm patterns in the winter months
- Station 41001 – Cape Hatteras; to capture the hurricane patterns in the fall
- Station 44009 – Delaware Bay; to capture the local wave height and weather conditions for the New Jersey Shore affecting the surfing conditions
From these buoys I have collected six variables: wind direction, wind speed, wave height, dominant wave period, barometric pressure, and the time at which these readings were recorded. Wind speed, wind direction, and barometric pressure are all contributing factors in the size and duration of the swell events. Wave height and dominant wave period are the resulting readings from the weather patterns that generate a swell event. The time variable will allow me to identify the length of time that a swell event impacts the local beaches. Before beginning my multiple regression analysis I have evaluated my set of predictors and decided to focus on the wave height at the Delaware Bay Buoy (44009) as the target variable and the wave height, dominant wave period, and barometric pressure at the Georges Bank Buoy (44011) and the Hatteras Buoy (41001) as my predictor variables. I have chosen to dismiss wind direction and wind speed at the offshore buoys because they are captured in the barometric pressure readings, being the likely result of changes in weather patterns. Additionally, the wave heights at the offshore buoys will be affected directly by the wind speed. Because the wind data is therefore embedded in the wave heights and barometric readings, excluding it will result in a simpler and cleaner multiple regression.
The data I have used is from January through October 2004. NOAA historical data is problematic because it is voluminous and not well presented. NOAA presents its historical buoy data as readings for every hour of the day, resulting in a data sheet 17 columns wide and 8,760 rows long. Therefore it was necessary for me to cull the data from a text file in order to have a manageable data set. I decided to take the buoy data twice a day, at 6:00 and 18:00. A twelve-hour gap between readings should be a good balance between keeping the data manageable and not missing any significant swell train, since in my experience swell patterns tend to last longer than 12 hours, barring rare circumstances (offshore gusts of 30+ mph) that would register as outliers.
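As a rough illustration, the twice-daily culling could be sketched in Python along the following lines. This is only a sketch under stated assumptions: it assumes a whitespace-delimited text file with a header row whose hour column is named hh, and the sample rows and column subset are made up for illustration rather than taken from the NOAA files.

```python
# Sketch: keep only the 06:00 and 18:00 readings from an hourly buoy file.
# ASSUMPTION: whitespace-delimited text with a header row that contains an
# "hh" (hour) column. The sample rows below are hypothetical values.

SAMPLE = """\
YYYY MM DD hh WVHT DPD BAR
2004 01 01 05 1.10 7.1 1021.3
2004 01 01 06 1.20 7.7 1021.0
2004 01 01 07 1.25 7.7 1020.8
2004 01 01 18 1.60 8.3 1018.4
2004 01 02 06 2.10 9.1 1012.9
"""

KEEP_HOURS = {6, 18}


def cull_readings(text, keep_hours=KEEP_HOURS):
    """Return the rows (as dicts keyed by header name) recorded at keep_hours."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    hour_idx = header.index("hh")
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip malformed rows rather than guess at their layout
        if int(fields[hour_idx]) in keep_hours:
            rows.append(dict(zip(header, fields)))
    return rows


culled = cull_readings(SAMPLE)
for row in culled:
    print(row["YYYY"], row["MM"], row["DD"], row["hh"], row["WVHT"])
```

Keeping the rows as dictionaries keyed by the header names means the same function works regardless of how many of the 17 columns are carried along.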
I have chosen 2004 as my year to study because I can compare the actual data with journal entries of swell events that I kept throughout the year. Cross-referencing my own observations with the statistical analysis should provide a better learning tool for me as to how to read the buoy data. While the journal entries will not be recorded in this project, they will serve as a guide and yardstick to remember swell events and help to identify those buoy characteristics that lead to conditions suited to surfing.
Wave Height Prediction at Delaware Buoy
Can the wave height for onshore buoys be predicted by using data from offshore buoys? Before beginning my multiple regression analysis I have reevaluated my set of predictors and decided to focus on the wave height at the Delaware Bay Buoy (44009) as the target variable and the wave height, dominant wave period, and barometric pressure at the Georges Bank Buoy (44011) and the Hatteras Buoy (41001) as my predictor variables. I have chosen to dismiss wind direction and wind speed at the offshore buoys because they are captured in the barometric pressure readings, being the likely result of changes in weather patterns. Additionally, the wave heights at the offshore buoys will be affected directly by the wind speed. Because the wind data is therefore embedded in the wave heights and barometric readings, excluding it will result in a simpler and cleaner multiple regression.
First I want to get an idea of how my target data is distributed:
Looking at the histograms of the data, the first chart shows that the wave heights are not normally distributed; they are long right tailed. This makes sense because it is not possible to have wave heights of less than zero, and while the majority of points are within the one-foot to five-foot range, it is not all that abnormal to have wave heights exceeding this range, and on rare occasions wave heights will exceed ten feet. Taking the log10 of the wave heights makes the data more normally distributed, an indication that perhaps we will want to continue with the logged values.
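The effect of the log transform on a long right tail can be sketched numerically. The example below is only an illustration: the wave heights are hypothetical values chosen to be right skewed, not the buoy data, and skewness is used here as a simple stand-in for what the histograms show visually.

```python
import math
import statistics

# Hypothetical wave heights in feet (illustrative only, not the buoy data).
heights_ft = [1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 2.7, 3.0, 3.4, 4.1, 5.0, 7.5, 11.0]


def skewness(values):
    """Population skewness: mean cubed deviation divided by stdev cubed."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


log_heights = [math.log10(h) for h in heights_ft]

raw_skew = skewness(heights_ft)
log_skew = skewness(log_heights)

# A value near zero suggests a roughly symmetric distribution.
print(f"skewness of raw heights:   {raw_skew:.2f}")
print(f"skewness of log10 heights: {log_skew:.2f}")
```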
Scatter Plots
As the first step of my multiple regression I want to look at the scatter plots of my target variable versus each of my predictor variables:
The scatter plots of the target variable relative to the predictor variables confirm that the data is in fact long right tailed, thus the spray effect. Because of the long right-tailed nature of the data, it now makes sense to use the logged data in order to get a relationship that is more linear with better variance properties.
Looking at the scatter plots of the logged data:
By taking the logged values we have a more normalized data set that should lead to a more accurate regression model, and these values should be used going forward. The scatter plots indicate that the strongest relationship is between the wave heights at the offshore buoys and the wave height at the onshore buoy. Barometric pressure seems to have some relationship, but not nearly as strong as the wave height relationships. The dominant period relationship appears to be the weakest relationship in the data. This intuitively makes sense. Wave height to wave height is the most direct relationship, and the effects of very large waves offshore are often visible at the onshore buoys and even at the beach. This is an apples to apples comparison.
A note on the relationship between barometric pressure and wave height: a lower barometric pressure indicates more unstable, or stormy, weather. Waves are most often created by winds from storms, therefore a negative relationship is expected. The lower the barometric pressure, the more volatile the weather will be, increasing the chance of winds. Lower barometric pressures will therefore indicate, at least to some degree, storm activity and subsequently the possibility of wave-generating conditions. It is particularly interesting to see that the relationship between barometric pressure and onshore wave height is stronger in the southern region (Hatteras) than in the northern region (Georges Bank). A possible explanation for this may be the intense and concentrated low pressure events and resultant wave activity from hurricanes. The Hatteras buoy sits directly in the path of some storms and tends to be at least in the close vicinity of the tracks of most hurricanes that affect swell events on the eastern seaboard. The proximity to such intense low pressure systems could have the effect of magnifying the relationship between barometric pressure and wave height.
The relationship between dominant period and wave height seems to be the weakest and appears to have the most variability of the relationships. This makes sense because, while longer periods tend to be indicative of more organized wave events, the fetch we are looking at in the Atlantic tends to be relatively small in comparison to the Pacific Ocean, where storms are generated much further offshore, resulting in more organized swell events with a more consistent and longer dominant wave period.
Regression Analysis: LOGD_Bay-WVH versus LOGHatteras-, LOGHatteras-, ...
The regression equation is
LOGD_Bay-WVHT = 13.6 + 0.411 LOGHatteras-WVHT + 0.289 LOGHatteras-DPD
- 24.5 LOGHatteras-BARO + 0.189 LOGG_Bank-WVHT
- 0.186 LOGG_Bank-DPD + 20.0 LOGG_Bank-BARO
Predictor             Coef  SE Coef      T      P
Constant            13.568    8.027   1.69  0.091
LOGHatteras-WVHT   0.41106  0.04306   9.55  0.000
LOGHatteras-DPD    0.28928  0.06721   4.30  0.000
LOGHatteras-BARO   -24.529    3.515  -6.98  0.000
LOGG_Bank-WVHT     0.18854  0.03559   5.30  0.000
LOGG_Bank-DPD     -0.18627  0.06363  -2.93  0.004
LOGG_Bank-BARO      20.024    2.552   7.85  0.000
S = 0.149992 R-Sq = 47.2% R-Sq(adj) = 46.7%
Analysis of Variance
Source           DF       SS      MS      F      P
Regression        6  11.6986  1.9498  86.67  0.000
Residual Error  581  13.0711  0.0225
Total           587  24.7698
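As a cross-check, the summary figures S, R-Sq, R-Sq(adj), and F can be recomputed from the sums of squares and degrees of freedom in the Analysis of Variance table. The short Python sketch below uses the standard formulas for these quantities; the variable names are my own and nothing in it comes from the statistics package itself.

```python
import math

# Values copied from the Analysis of Variance table above.
ss_regression, df_regression = 11.6986, 6
ss_residual, df_residual = 13.0711, 581
ss_total, df_total = 24.7698, 587

ms_regression = ss_regression / df_regression   # mean square, regression
ms_residual = ss_residual / df_residual         # mean square, residual error

s = math.sqrt(ms_residual)                       # standard error of the estimate
r_sq = ss_regression / ss_total                  # share of variability explained
r_sq_adj = 1 - ms_residual / (ss_total / df_total)
f_stat = ms_regression / ms_residual

print(f"S         = {s:.6f}")
print(f"R-Sq      = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"F         = {f_stat:.2f}")
```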
The standard error of the estimate (S = 0.15 in log10 units) indicates that, given these logged predictors, I can predict to within roughly two standard errors, or ±0.3 on the log scale. That corresponds to a multiplicative factor of 10^0.3, or about 1.99, on either side of a predicted wave height. Therefore, for a predicted wave of 3 feet, I can expect an actual wave height of about 1.5 feet to about 6 feet, and the actual height should fall within this range roughly 95% of the time. This ±0.3 margin indicates that there is a strong relationship between the target and its predictors and that the offshore wave heights, dominant periods, and barometric pressures are doing a relatively good job of predicting the near-shore wave heights at the Delaware Bay buoy.
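The arithmetic behind this range can be sketched in a few lines of Python. It uses the rough rule of plus or minus two standard errors on the log10 scale, which is an approximation rather than an exact prediction interval; the function name is hypothetical.

```python
S = 0.149992  # standard error of the estimate from the regression, log10 units


def approx_95_range(predicted_ft, s=S):
    """Approximate 95% range using +/- 2 standard errors on the log10 scale."""
    factor = 10 ** (2 * s)  # multiplicative half-width, about 2
    return predicted_ft / factor, predicted_ft * factor


low, high = approx_95_range(3.0)
print(f"multiplicative factor: {10 ** (2 * S):.3f}")
print(f"predicted 3.0 ft -> roughly {low:.2f} ft to {high:.2f} ft")
```

Because the model is fit on logged values, the margin is a ratio rather than a fixed number of feet, so the range widens as the predicted height grows.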
Looking at the R-Sq, 47.2% of the variability of the logged wave height at the Delaware buoy is accounted for by the model. Because both the target and the predictors are logged, each coefficient acts as an exponent. For example, an increase of 1 in the logged Hatteras wave height (a tenfold increase in the wave height itself) multiplies the predicted Delaware Bay wave height by 10^0.411, or about 2.58, all else being held constant.
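To make the coefficient interpretation concrete, the sketch below computes the factor applied to the predicted Delaware Bay wave height when the Hatteras wave height is scaled by some ratio and everything else is held constant. It relies only on the log-log form of the fitted equation; the function name is hypothetical.

```python
COEF_HATTERAS_WVHT = 0.41106  # coefficient from the regression table


def height_multiplier(coef, predictor_ratio):
    """In a log-log model, scaling a predictor by r scales the target by r**coef."""
    return predictor_ratio ** coef


tenfold = height_multiplier(COEF_HATTERAS_WVHT, 10)   # +1 in log10 of the predictor
doubling = height_multiplier(COEF_HATTERAS_WVHT, 2)   # Hatteras wave height doubles

print(f"tenfold Hatteras increase -> Delaware Bay height x {tenfold:.2f}")
print(f"doubled Hatteras height   -> Delaware Bay height x {doubling:.2f}")
```

A doubling is probably the more practical case to keep in mind, since a tenfold change in offshore wave height spans nearly the whole observed range.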
The only variable that looks at all suspect is the logged Georges Bank dominant period, with a P value of .004, and even that is significant at conventional levels. All of the other P values indicate that the predictors are strong in telling me what the wave height will be at the Delaware Bay buoy.
An unexpected result is that the two barometric pressure coefficients have opposite signs (-24.5 at Hatteras versus +20.0 at Georges Bank). This could be explained by the full-year readings: high pressure systems dominate the Georges Bank area for a great deal of the year due to the convergence of the jet stream air flow and the Gulf Stream water flow, and this combination leads to little storm activity from May through October.
Descriptive Statistics: LOGD_Bay-WVHT
Variable         N  N*     Mean  SE Mean    StDev  Minimum       Q1   Median       Q3  Maximum
LOGD_Bay-WVHT  588   0  0.55414  0.00847  0.20542  0.09577  0.40808  0.52031  0.70068  1.12344
The logged wave height at the Delaware Bay buoy covers a range of 1.028 (maximum minus minimum), and the middle 50% of the data covers a range of .293. Therefore, predicting within a factor of about 2 is somewhat useful in predicting wave heights using the defined predictors. However, the prediction is not refined enough to tell me whether the waves are adequate to surf if I define a surfable wave as 3 feet or higher. Any predicted wave of 6 feet or higher indicates that 95% of the time I will have waves that I can surf. However, for any predicted wave below 6 feet and larger than 1.5 feet, I cannot be sure that I will be able to ride the waves.
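The decision rule described here can be written out as a small Python sketch. It assumes the 3 foot surfable threshold and the approximate factor-of-2 margin from the regression; the labels and the function name are hypothetical and only meant to show where the 6 foot and 1.5 foot cutoffs come from.

```python
SURFABLE_FT = 3.0                  # minimum wave height I consider surfable
FACTOR = 10 ** (2 * 0.149992)      # approximate 95% multiplicative margin, about 2


def classify(predicted_ft):
    """Label a predicted height by whether its whole range clears the threshold."""
    if predicted_ft / FACTOR >= SURFABLE_FT:
        return "likely surfable"      # even the low end of the range is 3 ft or more
    if predicted_ft * FACTOR < SURFABLE_FT:
        return "likely too small"     # even the high end of the range is under 3 ft
    return "uncertain"                # the range straddles the 3 ft threshold


for predicted in (1.2, 4.0, 6.5):
    print(f"predicted {predicted} ft: {classify(predicted)}")
```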
Residual Plots: Is this the right model?
The residual plots for the model look reasonably good, which indicates that I am using the right model. What the residuals do tell me is that there is evidence of many unusual observations that contribute to the inability of the model to predict wave heights in a narrow enough range to determine whether the waves are fit for surfing.
Further analysis will be useful if I break out the predictors and perform multiple regressions for the Hatteras Buoy vs. the Delaware Buoy and the Georges Bank Buoy vs. the Delaware Buoy independently. The wide net I have cast may be affecting the results of the model. I have taken two offshore buoys and tried to determine their effect on one onshore buoy's wave heights over a whole year. In order to refine the model, it will be useful to focus my data and regressions. If I focus on the Georges Bank buoy and its effects on the onshore wave heights for a better defined period, say January through April, and focus on the Hatteras Buoy’s effects on the onshore wave heights from August through October, the model may reveal greater predictive abilities. I have demonstrated something that intuitively makes sense. It is necessary to take into account the areas and times where each offshore buoy is most affected and determine that relationship to the resulting onshore wave heights.
Separate Models for Each Predictor Buoy
To see if I can get more accurate models, I have broken out the regression for the two buoys, using the same predictors and tailoring the data to reflect only those time periods when the buoys are most affected by consistent storm activity. For the Hatteras buoy, I have focused the data on the months August through October (A-O), and for the Georges Bank buoy I have focused on the months January through April (J-AP).
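The seasonal split could be sketched in Python as follows. The records here are made-up tuples used only to show the filtering; the layout (month, day, hour, wave height) and the names are assumptions, not the structure of the actual worksheet.

```python
# Sketch: split twice-daily records into the seasonal subsets used for the
# separate models. The records below are hypothetical, for illustration only.

HATTERAS_MONTHS = {8, 9, 10}          # August through October (A-O)
GEORGES_BANK_MONTHS = {1, 2, 3, 4}    # January through April (J-AP)

# Each record: (month, day, hour, Delaware Bay wave height in feet).
records = [
    (1, 15, 6, 1.9),
    (3, 2, 18, 2.4),
    (6, 20, 6, 0.8),
    (8, 14, 18, 1.7),
    (9, 18, 6, 3.1),
    (10, 5, 18, 1.2),
]


def season_subset(rows, months):
    """Return only the rows whose month falls in the given set of months."""
    return [row for row in rows if row[0] in months]


hatteras_rows = season_subset(records, HATTERAS_MONTHS)
georges_bank_rows = season_subset(records, GEORGES_BANK_MONTHS)

print(len(hatteras_rows), "rows for the A-O model")
print(len(georges_bank_rows), "rows for the J-AP model")
```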
Regression Analysis: A-O:LOGD_Bay versus A-O:LOGHatte, A-O:LOGHatte, ...
The regression equation is
A-O:LOGD_Bay-WVHT = 42.7 + 0.444 A-O:LOGHatteras-WVHT
+ 0.190 A-O:LOGHatteras-DPD - 14.2 A-O:LOGHatteras-BARO
Predictor                 Coef  SE Coef      T      P
Constant                 42.69    18.17   2.35  0.020
A-O:LOGHatteras-WVHT   0.44358  0.07037   6.30  0.000
A-O:LOGHatteras-DPD     0.1898   0.1020   1.86  0.064
A-O:LOGHatteras-BARO   -14.171    6.042  -2.35  0.020
S = 0.144582 R-Sq = 41.1% R-Sq(adj) = 40.1%
Analysis of Variance
Source      DF       SS       MS      F      P
Regression   3  2.62202  0.87401  41.81  0.000