Paradigm shift in Applied Mathematics

Agnar Höskuldsson, Centre for Advanced Data Analysis, Eremitageparken 301, 2800 Kgs Lyngby, Denmark

Summary.

It is characteristic of data from ‘nature and industry’ that they have reduced rank for inference. This means that full-rank solutions, which usually can be computed, normally do not give satisfactory results. Consider a set of linear equations, $Xb = y$, where $b$ is the unknown solution and $X$ is an $N \times K$ matrix. With invertible $X$ the solution is $b = X^{-1}y$. Suppose that the coefficients of $X = (x_{ij})$ are functions of $t$. Then $\partial b/\partial t = -X^{-1}(\partial X/\partial t)\,b$. This shows that the solution is very sensitive to the coefficient values when $X$ is close to singular. The paradigm shift in applied mathematics is to build up the mathematical model in steps by using weighing schemes. Each weighing scheme produces a rank-one addition containing a score and a loading vector that are expected to perform a certain task. The weighing scheme reflects the emphasis of the specific mathematical model. Optimisation procedures are used to obtain ‘the best’ solution at each step. At each step the optimisation is concerned with finding a balance between the estimation task and the prediction task of the model. The mathematical modelling stops when the prediction aspect of the model cannot be improved. This approach has been applied to a wide range of fields within the applied sciences. In each case it provides superior solutions compared to the traditional ones, because it finds the rank (possibly the full rank) that should be used. The types of mathematical models where this approach has been applied include linear models, generalised linear models, non-linear models, dynamic models, and different types of extensions.
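The sensitivity claim above can be illustrated numerically. The following is a minimal Python sketch (using numpy, with numbers invented purely for illustration) of how a tiny perturbation of a near-singular coefficient matrix changes the full-rank solution $b = X^{-1}y$ noticeably.

```python
import numpy as np

# Near-singular system: the two columns of X are almost collinear.
X = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
y = np.array([2.0, 2.0001])
b = np.linalg.solve(X, y)          # exact solution: b = (1, 1)

# Perturb a single coefficient of X by 1e-5 and solve again.
X_pert = X.copy()
X_pert[1, 1] += 1e-5
b_pert = np.linalg.solve(X_pert, y)

print("cond(X) :", np.linalg.cond(X))  # roughly 4e4: X is close to singular
print("b       :", b)                  # [1. 1.]
print("b_pert  :", b_pert)             # about [1.09 0.91]: ~9% change in b
```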

1 Data in Research and Industry

In research and industry certain basic features and objectives prevail when data analysis is needed. Some basic points are commented on below.

Many variables. It is common that many variables enter the analysis. An example is a NIR (Near Infra-Red) instrument. It may automatically give 1056 data values when a measurement of an object is carried out. This results in 1056 variables that enter the model. Other optical measurement instruments (Raman and others) may give 3200 measurement values. Digitisation of images may similarly result in thousands or tens of thousands of variables.

Few samples (objects). It is common that only few objects are measured. There can be many different reasons for this. It may be time-consuming or expensive to carry out the measurement of an object. It may also be the policy of the company not to measure too many objects. E.g., a company producing measurement equipment wants data to reflect the situation at the customer, and the customer may typically only work with a small number of objects at a time.

Important to find appropriate variables. Even though the instrument gives many variables, it may not be appropriate to use all of them in a given modelling task. E.g., when using NIR data to analyse the chemical composition of materials (oil, milk, etc.), it is often best to use only around 10% or so of the variables. The remaining 90% of the variables do not contain ‘predictive’ information: if they were included in the model, worse predictions would be obtained than from the model where they are excluded.
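As a hedged illustration of this point (not the author's own procedure), the following Python sketch simulates 100 variables of which only the first 10 carry predictive information; all names and numbers are invented. Ordinary least squares on all 100 variables predicts a held-out test set worse than least squares on the informative 10% alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, k_inform, k_noise = 40, 40, 10, 90

# Simulated data: only the first k_inform variables carry signal.
X = rng.normal(size=(n_train + n_test, k_inform + k_noise))
beta = np.zeros(k_inform + k_noise)
beta[:k_inform] = 1.0
y = X @ beta + rng.normal(scale=0.5, size=n_train + n_test)

def test_rmse(cols):
    """Fit least squares on the training part, evaluate on the test part."""
    Xtr, Xte = X[:n_train, cols], X[n_train:, cols]
    ytr, yte = y[:n_train], y[n_train:]
    b, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return np.sqrt(np.mean((Xte @ b - yte) ** 2))

print("informative 10% only:", test_rmse(slice(0, k_inform)))
print("all 100 variables   :", test_rmse(slice(None)))  # clearly worse
```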

Reduced rank for inference from data. Typically the useful rank of data is much smaller than the number of variables. A company that has worked intensively for more than 10 years with IR/NIR data has the following experience: out of $K = 1056$ variables it is adequate to work with only between 60 and 140 variables, and to use a model rank of only between 4 and 8. This is also the typical situation when the new approach is applied to the case where the X-data (design or instrumental data) are NIR data and the Y-data (the response or output data) are chemical concentrations. It is important to note that there usually is no numerical problem in obtaining the full inverse or generalised inverse.
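A minimal sketch of this point, under the assumption of simulated data with a low-rank latent structure of the kind described: both the reduced-rank solution and the full-rank generalised inverse are computable without numerical difficulty, but the low-rank solution predicts better. The rank-$a$ solution below keeps only the first $a$ singular directions of the training data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, true_rank = 50, 200, 5

# Simulated NIR-like X: high-dimensional but driven by few latent factors.
T = rng.normal(size=(n, true_rank))          # latent scores
P = rng.normal(size=(true_rank, k))          # loadings
X = T @ P + 0.01 * rng.normal(size=(n, k))   # small measurement noise
y = T @ rng.normal(size=true_rank) + 0.1 * rng.normal(size=n)

Xtr, Xte, ytr, yte = X[:30], X[30:], y[:30], y[30:]

def rmse_at_rank(a):
    """Test error of the rank-a (truncated SVD) least-squares solution."""
    U, s, Vt = np.linalg.svd(Xtr, full_matrices=False)
    b = Vt[:a].T @ ((U[:, :a].T @ ytr) / s[:a])
    return np.sqrt(np.mean((Xte @ b - yte) ** 2))

for a in (3, 5, 8, 30):   # 30 = full rank of the 30-sample training set
    print(f"rank {a:2d}: test RMSE {rmse_at_rank(a):.3f}")
```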

Prediction is the objective of primary interest. In research and industry the primary interest is in how good the predictions obtained from the estimated model are. Statisticians often emphasise the importance of interpreting the solution values. In many cases, perhaps in most cases, when there are many variables, the interpretation of parameter values, as is standard in statistical program packages, is not reliable. One example is the following. Typically, parameter estimates are given together with their standard deviations. From this list of pairs, people infer which variables are significant and which are not. People tend to forget that these interpretations are marginal ones. Some variables may be significant, but if the effects of some other ones are removed from the model, they may turn out not to be significant. Stepwise searches in data, which are very popular, are another example: the results are very data dependent and may not be reliable when new data become available. This is often easily demonstrated by dividing the samples into two parts of equal size. A stepwise search in one part may give very different results compared to the other part.
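The split-half demonstration can be sketched as follows (Python with numpy; the greedy forward-selection routine and the simulated correlated data are hypothetical stand-ins for the stepwise procedures meant in the text). With correlated predictors, the two halves often select different variable sets.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 60, 30
X = rng.normal(size=(n, k))
X += rng.normal(size=(n, 1))        # shared component: correlated predictors
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

def forward_select(Xs, ys, n_pick=3):
    """Greedy forward selection: repeatedly add the variable most
    correlated with the current residual (a crude stepwise search)."""
    Xs = Xs - Xs.mean(axis=0)
    ys = ys - ys.mean()
    chosen, resid = [], ys
    for _ in range(n_pick):
        corr = np.abs(Xs.T @ resid)
        corr[chosen] = -np.inf          # never pick a variable twice
        chosen.append(int(np.argmax(corr)))
        b, *_ = np.linalg.lstsq(Xs[:, chosen], ys, rcond=None)
        resid = ys - Xs[:, chosen] @ b
    return sorted(chosen)

half = n // 2
print("half 1 picks:", forward_select(X[:half], y[:half]))
print("half 2 picks:", forward_select(X[half:], y[half:]))  # often differs
```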

Difficult to specify a detailed model. It is usually difficult for the experimenter to specify a detailed model for given data. It is a part of the tradition of the natural sciences that one should formulate a model that reflects the physics, chemistry, mechanics, etc., of the situation. But typically the knowledge available is only vague. Knowledge of the situation is obtained by studying the data and the variation that they show. From this information new models are built up with the purpose of getting good predictions.

Important to test the solution found. In practice it is important to study the sensitivity of the solution. This is often done by cross-validation, where a part of the data is excluded and the excluded part is evaluated by the model estimated from the data used. Consider an example where the data are geometrically situated like ‘a comet with a long tail’. The model may look good, but if the data that correspond to the ‘tail’ are removed and only the ‘head’ is modelled, the new solution found may be bad.
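A minimal sketch of the cross-validation idea in Python (ordinary least squares is used here only as a placeholder model; the function names and data are invented for illustration):

```python
import numpy as np

def cross_validated_rmse(X, y, fit, predict, n_splits=5):
    """Leave out one segment at a time; predict it from the remaining data."""
    idx = np.arange(len(y))
    errors = []
    for s in range(n_splits):
        test = idx[s::n_splits]                 # every n_splits-th sample
        train = np.setdiff1d(idx, test)
        model = fit(X[train], y[train])
        errors.append(predict(model, X[test]) - y[test])
    return np.sqrt(np.mean(np.concatenate(errors) ** 2))

# Example with ordinary least squares as the model:
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
y = X @ np.ones(5) + rng.normal(scale=0.3, size=40)
ls_fit = lambda Xs, ys: np.linalg.lstsq(Xs, ys, rcond=None)[0]
ls_pred = lambda b, Xs: Xs @ b
print("CV RMSE:", cross_validated_rmse(X, y, ls_fit, ls_pred))
```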

Graphic illustrations of variation in data. It is important to study the inherent variation in the data graphically. There is often unexpected variation that only appears when the data are studied more closely in graphical form. An example is the case when there is grouping in the data. It might be that the main variation is the variation between groups. If the data are not analysed graphically, one might not detect the variation within the groups, which might be the important one.
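A sketch of the kind of graphical check meant (Python with matplotlib; the two simulated groups are hypothetical): a simple scatter plot makes the large between-group variation and the much smaller within-group variation visible at a glance, whereas summary statistics pooled over all samples would hide the latter.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
# Two groups: large between-group separation, small within-group spread
# in the second direction.
g1 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.1], size=(50, 2))
g2 = rng.normal(loc=[5.0, 5.0], scale=[1.0, 0.1], size=(50, 2))

plt.scatter(g1[:, 0], g1[:, 1], label="group 1")
plt.scatter(g2[:, 0], g2[:, 1], label="group 2")
plt.xlabel("variable 1")
plt.ylabel("variable 2")
plt.legend()
plt.show()
```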

Presentation of results in simple terms. Scientists and industrial people want a simple and clear presentation of the results of the modelling task. The main emphasis should be on the prediction tasks derived from the model. This is in contrast to statistical program packages, where in many cases one must be an expert in the specific methods in order to interpret the presented results correctly.

