Age-based Multilevel Regression Modelling of Melanoma Incidence in the USA

Antony Brown, Carsten Maple, Malcolm Keech

Computing and Information Systems

University of Luton

Park Square, Luton, Bedfordshire, LU1 3JU

ENGLAND

Abstract

The changing incidence of cancer is an increasing problem, and it would be of great benefit to be able to predict future changes. Brown and Maple have previously proposed a method for modelling such data [2]. The model appeared to perform better than existing methods when applied to data available at that time. In this work we apply the techniques to newly available data that confirms the accuracy of the estimates that were predicted in [2]. We also present suggestions for improvements to the method.

Key-Words: Epidemiology, Melanoma, Modelling, Regression, Prediction.

1  Introduction

Prediction of future cancer incidence rates allows better planning and resource allocation for prevention and treatment. This becomes increasingly important for cancers whose incidence is rising, as without proper prediction the burden could far outstrip the provisions made. In the past, a range of modelling techniques has been applied to various aspects of this problem.

Accurate modelling and prediction are useful since any trends that are identified can be compared to underlying trends in other phenomena. This comparison can then be used to help confirm or rule out causative factors for the disease. Likewise, known (or suspected) causative factors may be included in a model to improve the results of predictions.

The type of data required for a study depends on the model used and the intentions of the research. There is a balance to be struck between depth and breadth of data. Generally, the more variables obtained for each individual, the fewer individuals are included in the data. For example, a clinical trial would be able to obtain a large amount of information about its participants, including such things as familial history, eye colour and even genetic samples. However, the number of participants would be relatively small, since collecting data this detailed is very time consuming and costly. Conversely, a cancer registry would hold records for a large number of people but, as the data would be drawn from a variety of sources such as hospitals and clinics, only information that is routinely collected would be available. Hence, the overall volume of data may be considered to be of the same order for any reasonable study.

Models considering associations between genetic makeup and melanoma would obviously need access to the in-depth data that would come from a special clinical study and would therefore be limited in the breadth of data. A study looking at the links between age and melanoma would benefit from having access to a much larger pool of data but any predictions would be more generalised than a model with more predictive variables. This illustrates how the aim of a study determines the breadth and hence depth of the data required.

This article uses data from the National Cancer Institute's SEER program [1]. The data originally consisted of incidence rates for the U.S. for 1973-1997, categorised by sex and split into 5-year age groups. Since the publication of [2], data for three further years, 1998-2000, have been publicly released, and the models have been used to predict these new data points. The data comes from a large population, and so can be used to predict the general trends in melanoma incidence for the U.S. It is important to note that the number of melanoma cases for ages below 20 is too low to produce reliable incidence rates, so these ages are excluded from this study, as was the case in our previous study.

This article reintroduces the previously presented novel prediction method [2], evaluates its performance with some newly available data and presents a proposed extension of the method.

2  Epidemiology of Melanoma

Whilst the positive relationship between UV exposure and other forms of skin cancer has long been shown to exist, the relationship with melanoma is much more complicated, as can be seen in [1], [3] and [4].

For example, from [3] it can be seen that the body sites most commonly affected by melanoma are not those that receive the most exposure to sunlight. In addition, occupations that usually involve large amounts of sun exposure (such as unskilled labour and farming) have a lower risk of melanoma than other occupations, contrary to what might first be thought.

One proposed explanation for the relationship between melanoma and sunlight is that it is not a cumulative effect, but depends on sporadic exposure to larger amounts of UV radiation than is usual. Positive links have been suggested between melanoma incidence and economic factors, as well as with managerial occupations [3]. This might possibly be explained by the link between increased salary and number of exotic foreign holidays, which would expose the individual to unaccustomedly large amounts of UV radiation for brief periods of time. However, this type of association is very hard to confirm as there are numerous other factors that could also play a role.

As with the vast majority of cancers, gender has also been shown to have a significant role in the epidemiology of melanoma [3]. The overall incidence rates for women are lower than for men, and the age-specific distributions for the sexes differ as well, as will be shown later. There are also differences between the sites of the body most commonly affected in men and women.

Race also has an effect on melanoma incidence, with darker skinned races having much lower rates than lighter skinned races. The link between fair complexion and increased risk of melanoma has been shown, with features such as fair hair, light-coloured eyes and a tendency to freckle all showing an increased risk [4].

The presented model, however, only looks at the effects of age, sex and time on melanoma incidence, as we are interested in a general population model rather than one that concerns the risks of individuals. The data for these variables is more readily available on the larger scale needed to make a useful population-specific prediction. The data used will only come from those races classified as 'white', as their incidence is much higher than that of other races, and it is important to keep as many potential causative variables as possible constant.

3  Existing Methods

A variety of modelling techniques have been applied to cancer incidence and mortality. In general, non-linear models give better representations of the actual patterns present in the data, and so are useful in etiological studies.

However, non-linear models can prove unreliable outside the range of data they were fitted to. This is because they respond most strongly to local rather than global features of the data, and as such they are better suited to interpolation than to extrapolation. Without some way to govern this effect, they are often unsuitable for predictive purposes, for which linear techniques prove more useful, since their behaviour is more stable. This section will review some of the techniques that have been most influential in the development of the proposed novel method.
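The contrast between interpolation and extrapolation can be illustrated with a small sketch. It is not one of the models reviewed in this section: an interpolating polynomial (Lagrange form) stands in for a flexible non-linear model and an ordinary least-squares line for a linear one. The series and the names `lagrange_eval` and `line_fit` are assumptions made purely for illustration.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate at x the polynomial passing through every (xs[k], ys[k])."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        weight = 1.0
        for m, xm in enumerate(xs):
            if m != k:
                weight *= (x - xm) / (xk - xm)
        total += yk * weight
    return total


def line_fit(xs, ys):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up series that fluctuates between 1 and 2 (not SEER data).
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 2, 1, 2, 1, 2]

inside = lagrange_eval(xs, ys, 3)        # at a fitted point: reproduces the data
outside_poly = lagrange_eval(xs, ys, 6)  # one step beyond the fitted range
intercept, slope = line_fit(xs, ys)
outside_line = intercept + slope * 6     # the line stays close to the data range
```

For this series the polynomial reproduces the observed value 2 at x = 3 but returns 33 one step beyond the data, whereas the straight line gives about 1.8.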

3.1  Linear, Log-linear and Non-Linear models

Linear and log-linear models for the prediction of cancer were proposed by Dyba et al. [5]. These models are non-linear in parameters, but linear in form. This combines the flexibility of a non-linear model with the stability of a linear one.

The models produce separate predictions for various age groups within the population, but the parameters for all age groups are determined at the same time. The advantage of this is that it allows individual age groups to have separate rates of increase, whilst still allowing each rate to be influenced by all of the data.

These models are specifically designed for prediction purposes, and as such they are not bound by any constraints that closely resemble the physical processes that actually take place. However, the predictions given have wide prediction intervals, which decrease their usefulness.

The ability to allow the future behaviour of each age group to be influenced by the overall trend in the rest of the data is a useful one, and efforts have been made to incorporate this into the novel method proposed.

3.2 Spline Regression

The spline regression technique fits a series of polynomial equations (usually quadratic or cubic) to the historical data of the disease [6]. The flexibility of the spline allows an extremely accurate fit to rapidly changing data. This means that the spline model will usually give very accurate results within the interval for which it was fitted, but is much less accurate for prediction of values outside that region.

Therefore, splines are well suited to studies trying to find or confirm associations between cancer incidence and some causative factor, for example annual sunlight exposure. The spline model can provide a very useful guide to the relationship between exposure and disease [6].

It is possible to apply cubic splines to the data, but careful consideration has to be given to ensure that their behaviour outside the data they are based on is consistent with historic data.

3.3 Multilevel Modelling

A multilevel model simultaneously models a situation at several levels of detail, with the purpose of seeing these levels as a whole, rather than as independent pieces. Such models can allow trends on one scale of the model to affect those on another, giving a more complete picture of the real situation.

In the case of melanoma, this technique has been applied to mortality by Langford, Bentham and McDonald [7]. The model was used to investigate the relationship between ultraviolet B (UVB) exposure and melanoma mortality in Europe. Thus, geographical groupings were used as the different levels of the model. The data is grouped into countries, which in turn are split into geographical regions, each containing counties. Several models were produced by iteratively fitting a generalized least squares estimation to the mortality data from nine European countries.

Previously, a quadratic relationship between melanoma and latitude, with regard to UVB exposure, had been identified. However, the results of the multilevel model showed that there was no clear relationship between melanoma mortality and UVB. Some countries showed a positive relationship, others showed no significant relationship and a few even displayed a negative relationship. This discrepancy is explained by the fact that whilst the whole data set may display a positive correlation with UVB, the multilevel model shows the underlying trends that contribute to this overall effect. Thus, it becomes apparent that the relationship with UVB varies between populations, and so is more complicated than was previously supposed.

This model was applied to the evaluation of past trends and their relationship to various causative factors, and not to prediction. However, the ability to see trends in the data from a variety of perspectives, and to determine how one trend is affected by another, would be very useful in a prediction model. Therefore the proposed method includes a limited multilevel aspect and can be extended to include further levels of detail.

4  The Proposed Method

This method incorporates the idea of a multilevel approach, combined with a proportional view of the data. All of the data is expressed as a proportion of the larger data set, which produces two levels to the model. This novel application of proportions to this type of data helps produce more consistent predictions.

The raw data from the cancer registry is converted into incidence rates, age-adjusted to the standard world population, for each age group i and gender j, to allow comparisons between age groups. In order to help distinguish the incidence trend from random fluctuations, data smoothing techniques are applied. In the case of this method, a 3-year moving mean is used. This is applied a limited number of times to ensure the loss of detail does not interfere with the regression process.
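A minimal sketch of the smoothing step is given below, assuming a centred 3-year window applied repeatedly. The treatment of the first and last years (left unchanged here), the default number of passes and the name `moving_mean_3` are assumptions, as these details are not specified above.

```python
def moving_mean_3(values, passes=1):
    """Apply a centred 3-point moving mean `passes` times.

    Assumption: the first and last values are left unchanged on each pass,
    since they do not have a neighbour on both sides.
    """
    smoothed = [float(v) for v in values]
    for _ in range(passes):
        if len(smoothed) < 3:
            break  # too short to smooth
        prev = smoothed
        interior = [
            (prev[k - 1] + prev[k] + prev[k + 1]) / 3.0
            for k in range(1, len(prev) - 1)
        ]
        smoothed = [prev[0]] + interior + [prev[-1]]
    return smoothed


# Illustrative yearly rates for one age group (made-up values, not SEER data).
once = moving_mean_3([1, 2, 6, 2, 1])
twice = moving_mean_3([1, 2, 6, 2, 1], passes=2)
```

Each additional pass flattens the series further, which is why the number of passes is kept small.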

The model is created using two different sets of data, the first being the yearly sum of the incidence rates from all age groups, S. The second is the age specific proportion of this sum that the incidence of each age group makes up. The product of these two values gives the age specific incidence rate. Therefore, the product of estimates of these two values will be an estimate of the age specific incidence rate.

The proportion value contains information about the age specific incidence, but it is in relation to S. This helps reduce the effects of random variations in the data, by expressing them as a proportion of the total as well as in absolute terms. This allows regressions to be fitted to the proportion and to the sum, and the two combined, to produce a prediction for incidence.
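The decomposition into a yearly sum and age specific proportions can be sketched as follows for a single sex. The function names, the dictionary layout and the numbers are illustrative assumptions, not the implementation used in this work.

```python
def decompose(rates):
    """Split age specific rates into yearly totals and proportions.

    rates: dict mapping an age-group label to a list of yearly incidence
    rates (all lists the same length). Returns (totals, proportions).
    """
    groups = list(rates)
    n_years = len(rates[groups[0]])
    totals = [sum(rates[g][t] for g in groups) for t in range(n_years)]
    proportions = {
        g: [rates[g][t] / totals[t] for t in range(n_years)] for g in groups
    }
    return totals, proportions


def recombine(totals, proportions):
    """Multiply proportions by totals to recover age specific rates."""
    return {
        g: [p * s for p, s in zip(props, totals)]
        for g, props in proportions.items()
    }


# Made-up rates for two age groups over two years (not SEER data).
rates = {"20-24": [2.0, 2.5], "25-29": [6.0, 7.5]}
totals, proportions = decompose(rates)
rebuilt = recombine(totals, proportions)
```

Because the proportions in any one year add up to 1, multiplying them by the total recovers the original rates; in the method, both factors are replaced by their regression estimates.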

The advantage of this is that the trend for each age group is kept intact, but it is present in the context of the other data groups. This allows more data points to influence each model, which in turn increases the accuracy of the predictions by allowing them to be guided by the overall trend as well as the age specific trends.

Exponential regressions are used for the total incidence, as it can be shown that these reflect the general trend. A variety of regression types are used for the proportion model, to help determine the ideal model type.
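As an illustration, an exponential regression of the total incidence on time can be obtained by fitting a straight line to the logarithm of the totals. The fitting procedure is not specified above, so least squares in log space, the name `fit_exponential` and the synthetic data are all assumptions.

```python
import math


def fit_exponential(years, totals):
    """Fit totals ~ a * exp(b * (year - years[0])) by least squares on log(totals).

    Returns (a, b). Assumption: all totals are strictly positive.
    """
    x = [y - years[0] for y in years]
    z = [math.log(s) for s in totals]
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    b = sxz / sxx
    a = math.exp(mean_z - b * mean_x)
    return a, b


# Synthetic totals growing at 5% per year in log terms (not SEER data).
years = list(range(1988, 1998))
totals = [10.0 * math.exp(0.05 * (y - 1988)) for y in years]
a, b = fit_exponential(years, totals)
```

Fitting in log space minimises relative rather than absolute error, so a direct non-linear fit could give slightly different coefficients on noisy data.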

The models are fitted to a ten-year range of data, and forecast over the five years following that range. The use of just ten years' worth of data for each model is intended to ensure the fit represents only the latest trend of the data, as the purpose of the models is to predict future incidence.

The reason for approaching the problem in this manner is that the actual incidence rates themselves contain too many fluctuations, even after smoothing, to allow exact fitting with regression techniques. In addition, the variations in behaviour between age groups mean that no single regression model would be able to fit all age groups well.

The motivation behind the desire to apply a single type of model to all of the age groups is that each set of data is not independent from the others. The people from one age group are exposed to a similar environment to all of the other groups; they also share some cultural effects with the other groups. This means that the different incidence rates are a reflection of each group's different biological and behavioural responses to their common environment and culture. Therefore, we make the assumption that the trend for each age group is based, in part, on this underlying function.
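The fit-and-forecast procedure can be sketched as below: the most recent ten years are selected, an exponential is fitted to the totals and a regression to each proportion, and their product is evaluated for the following five years. A straight line is used for the proportions purely as one example of the regression types considered; this choice, the helper names and the data layout are assumptions.

```python
import math


def _line_fit(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def forecast_rates(years, rates, fit_years=10, horizon=5):
    """Forecast age specific rates as predicted total * predicted proportion.

    years: ascending list of calendar years.
    rates: dict mapping age-group label -> list of rates aligned with years.
    Returns (future_years, forecast) with forecast keyed like rates.
    """
    window_years = years[-fit_years:]
    x = [y - window_years[0] for y in window_years]
    window = {g: series[-fit_years:] for g, series in rates.items()}

    # Level 1: exponential trend of the yearly sum, fitted in log space.
    totals = [sum(window[g][t] for g in window) for t in range(len(x))]
    log_a, b = _line_fit(x, [math.log(s) for s in totals])

    future_years = [window_years[-1] + h for h in range(1, horizon + 1)]
    future_x = [y - window_years[0] for y in future_years]
    total_hat = [math.exp(log_a + b * fx) for fx in future_x]

    # Level 2: trend of each age group's share of the sum (linear here).
    forecast = {}
    for g, series in window.items():
        props = [r / s for r, s in zip(series, totals)]
        c, d = _line_fit(x, props)
        forecast[g] = [(c + d * fx) * s for fx, s in zip(future_x, total_hat)]
    return future_years, forecast
```

Called with, for example, yearly rates up to 1997, the function returns forecasts for 1998-2002 using only the 1988-1997 observations.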