METHOD FOR APPROXIMATION
OF DIVERSE INDIVIDUAL SORTING RULES
Alexey B.Petrovsky
Institute for Systems Analysis, Russian Academy of Sciences,
Prospect 60 Let Octyabrya, 9, Moscow 117312, Russia,
Introduction
The need to classify objects by their properties arises often in multiple criteria decision making, patterns recognition, cluster analysis, and so on. The difficulties of object classification increase when the same objects can exist in several copies, which are described with the similar or various values of attributes, and the different attributes can be repeated within the object description. These kinds of problems are, for example, the competitive selection of projects by several experts with many qualitative criteria, the recognition of graphic symbols, the problem-oriented sorting textual documents. In all these cases an object A (a project, a symbol, a document) can be presented as the following set of attributes gj:
A = {nA(g1)*g1, nA(g2)*g2,...,nA(gj)*gj,...}.
Here nA(gj)*gj denotes that the attribute gj occurrs nA(gj) times within the description of the object A. In the examples above nA(gj) is equal to the number of experts who evaluated the project A with the criteria estimate gj, or nA(gj) is equal to the valuation of recognized symbol A by a comparison with the standard symbol gj, or nA(gj) is equal to the number of the word gj within the text of document A.
The multiset or set with repeating elements is the very convenient mathematical model to present and analyze a collection of objects that are described with many qualitative attributes and can exist in several copies with various values of attributes. This paper considers the problems how to generate classes of such objects and define boundaries between classes. The technique for the object classification and construction of general decision rule is used searching the best decomposition of multisets in the metric spaces.
Multiset Model
The multiset or set with repeating elements (also called the bag [1,3,4,5]) is the very convenient mathematical model to present and analyze a collection of objects that are described with many qualitative attributes and can exist in several copies with various values of attributes. Unlike the set, each element may occur in the multiset more than once. Let G={g1,g2,...,gj,...} be a crisp set where all elements gj are different. A is called a multiset drawn from G if A can be presented by the set of pairs as A={(nA(g)*g)}, where nA(g) is called a counting function. This function defines the number of occurrences of the element g in the multiset A, and nA: G®N+. The element g is said to be a member of the multiset A (gÎA), and there are k copies of g in A, if nA(g)=k>0. If nA(g)=0, then gÏA. When nA(g) is equal to cA(g)={0,1}, the multiset A becomes an ordinary set. The set G={gj} is said to be a generic domain for the collection of multisets X, if all multisets from X are composed from the elements of G. The multiset is called: the empty multiset Æ, if nÆ(g)=0 for "gÎG; the maximal multiset U, if nU(g) = sup nA(g) for all multisets. The cardinality of multiset A is |A|=; and the dimensionality of multiset A is [A]=.
Define the following operations under multisets:
a union of multisets AB = {g | nAÈB(g) = max (nA(g), nB(g)) };
an intersection of multisets AB = {g | nAÇB(g) = min (nA(g), nB(g)) };
a sum or addition of multisets A+B = {g | nA+B(g) = nA(g) + nB(g) };
a difference of multisets A-B = {g | nA-B(g) = nA(g) - nAÇB(g) };
a symmetrical difference of multisets AΔB = {g | nAΔB(g) = |nA(g) - nB(g)| };
a multiplication of multiset A on scalar k k×A = {g | nk×A(g) = knA(g), k>0 };
a multiplication of multisets A×B = {g | nA×B(g) = nA(g)×nB(g) };
the complement of multiset A = {g | = nU(g) - nA(g) }.
The metric spaces of multisets were introduced in [3]. Different metric spaces (X,d) can be determined for the same collection of objects by introducing various types of distances d(A,B). The following metrics can exist in multiset spaces:
d0(A,B) = m(AΔB); d1(A,B) = m(AΔB)/m(U); d2(A,B) = m(AΔB)/m(AB).
Here m(A) is the measure of multiset A that can be determined in various ways, for instance, as a linear combination m(A)=åwjnA(gj), wj>0. Functions d1(A,B) and d2(A,B) satisfy the normalization condition 0£d(A,B)£1. Note, that due to the continuity of the multiset measure, the distance d2(A,B) is undefined for A=B=Æ. So d2(Æ,Æ)=0 by the definition.
Problem of Object Classification
The classification deals with combining the initial collection of objects into several groups or sorting them out the predefined categories. Information about the object properties can be presented with the set of attributes which values are numerical and/or verbal. From the formal logic point of view, a procedure of object classification can be written as a sequence of the following decision rules:
IF áconditionsñ, THEN ádecisionñ. (1)
There are direct and indirect classifications. The direct classification is an enumeration of the objects within the class. So in this case the term áconditionsñ includes the names of objects or the list of attribute values that describe the objects. The indirect classification is based on the properties common for the class. And the term áconditionsñ expresses the relations between different attributes and/or their values. The term ádecisionñ marks that the object(s) belongs to the specific class.
When the objects are sorted by many experts, there is a family of decision rules which may be similar, diverse and contradictory. Individual sorting rules are coincident or similar when the objects with the identical or resemble values of attributes are included in the same class. Contradictory rules assign the weak-discernible objects into diverse classes. The inconsistencies of individual rules may be caused, for instance, by errors in the expert classification of objects, the incoherence between expert' estimates of objects and decision classes, and other reasons. Note that knowledge bases of expert systems are built in the same manner.
If the number of objects and attributes is rather small, then decision rules are reviewed and utilized relatively easily. The more the family of decision rules is, the more difficult to analyze these rules. In this case the problem arises, how to generate the simple approximating rule(s) which would be the most coincident with the individual sorting rules. The final decision rule would include a minimal number of attributes and assign the objects into the given classes with the admitted accuracy. A construction of the generalized decision rule allows also to discover divergences in the initial sorting rules and correct them if it would be necessary.
The problem of classification of multiattribute qualitative objects has some additional peculiarities. First, the amount, complexity and peculiarities of information to specify qualitative objects are essentially lager and more varied then for the quantitative objects. Second, a multiplicity and redundancy of factors, that express a substance of the problem considered, are possible. Third, the multiattribute space and indexes of similarity/difference between objects are to be chosen corresponding to the qualitative nature of object properties. And finally, in order to classify qualitative objects, a lot of verbal and numerical data are to be taken into consideration simultaneously and processed without unfounded transformations (like “averaging”, “mixing”, “weighing” data, and so on). So the special procedures to collect and process these kinds of data are needed [2-4].
Presentation of multiattribute objects
Let X={A1,...,Ak } be a collection of k objects, that are described with many qualitative attributes, and sorted previously into some classes X1,...,Xf. Suppose that A1,...,Ak are projects which are considered to be included in the program or get a grant. These projects are evaluated by n experts with m qualitative criteria Q1,Q2,...,Qm. For instance, the questionnaire includes the following criteria for the project estimation: a project contribution to the program goals, a novelty of the approach to solve the task, a character of project results, a qualification level of the team, and so on. Each criterion has a nominative or ordered scale of verbal estimates. The scale of the criterion Qs “Qualification level of the team” can look like this:
qs1 – the team is one of the best by the experience and qualification level;
qs2 – the team has the experience and qualification level sufficient for the project realization;
qs3 – the team has the experience and qualification level insufficient for the project realization;
qs4 – an experience and qualification level of the team is unknown.
Several experts evaluate each project with all criteria Q1,...,Qm and make one of the following recommendations:
r1 – to approve the project;
r2 – to reject the project;
r3 – to consider the project later after its correction.
Obviously, estimates of projects and expert recommendations may coincide or not.
So the object description consists of several groups of attributes G={Q1,...,Qm,R}. Each group Qs={qses} (s=1,m; es=1,hs) is the attribute family which expresses the object properties. The group of attributes R={rt} characterizes the object assignment into the class Xt (t=1,f). For example, Qs is the criterion estimates of projects like above, R is expert recommendations concerning the project approval. The object Ai (i=1,k) can be presented as the following multiset AiÎX drawn from the domain G={gj}:
Ai = {(ni(gj)*gj)} = {(ni(qses)*qses), (ni(rt)*rt)}. (2)
Here ni(gj) is a number of attribute gj (j=1,h, h=h1+...+hm+f). For example, ni(gj) is equal to a number of experts who has estimated the object Ai with the attribute gj. The arguments in the formula (2) are associated with the decision rule (1) as follows: the various combinations of attribute values ni(qses) correspond to the term áconditionsñ, the object Ai belonging to the class Xt (AiÎXt iff ni(rt)¹0 and ni(rp)=0, p¹t) reflects the term ádecisionñ.
In order to simplify the problem, assume that the collection of objects X=(A1,...,Ak} is to be sorted into only two classes Xa and Xb. In these case the objects' collection X can be presented as the following decompositions of multisets:
, (3)
where Istes=IsesIt; Irt=IrIt; It is the subset of indexes i for AiÎX with ni(rt)¹0 and ni(rp)=0, p¹t; Ises is the subset of indexes i for AiÎX with ni(qses)¹0, ni(qvev)=0, v¹s, ni(rt)=0; Ir is the subset of indexes i for AiÎX with ni(rt)¹0, ni(qses)=0.
The demand to sort objects into two classes is not the principle restriction. Everytime when objects are to be classified into more than two classes, it is possible to divide the collection X into two groups, then into subgroups, and so on. For instance, the competitive projects can be classified into the projects approved and not approved, then the not approved projects can be divided into the projects rejected and considered later, and so on.
Approximation of individual rules
The main idea how to approximate a large family of sorting rules with a compact decision algorithm or simple decision rule can be formulate as follows. In the metric space of multisets (X,d) the pairs of new multisets are generated for every group of attributes Q1,...,Qm,R. The multisets within each pair are to be on the maximal distance d with each other, and the most coincident with the initial expert sorting the objects into the classes Xa and Xb. Combinations of the attributes, that define the pairs of the multisets generated, produce the generalized decision rule. Obviously, the decomposition R={Ra, Rb} is the best partition of the object collection X={Ai} into the classes Xa and Xb. The distance between multisets Ra, Rb in the metric space (X,d) is maximal and equal to
d(Ra,Rb) = max d(Ra,Rb) = d*. (4)
In the case of the ideal classification, the maximal distances are d0*=knh, and d1*=d2*=1.
The problem to approximate rules for sorting a collection of multiattribute objects is transformed into the problem how to find the best binary decompositions Qs={Qsa,Qsb) where the multisets Qsa,Qsb are the most far from each other in the metric space (X,d). In other words, the following m optimization problems should be solved:
d(Qsa,Qsb) ® max d(Qsa,Qsb) = d(Qsa*,Qsb*). (5)
Here {Qsa*,Qsb*} is the best binary decomposition of the multiset Qs. The solution of each problem (5) is submultisets Qst*1, Qst*2 (t=a,b), which are determined by the so-called approximating attribute qs*. Combinations of the approximating attributes {qs*} for various numbers s of attribute groups define the conditions to assign the object AiÎX into a certain class Xt.
The approximating attributes qs* can be ordered according to values of distances d(Qsa*,Qsb*). Then attributes qs* which occupy the first places in this ranking are to be included in the generalized decision rule. The more the distances d(Qsa*,Qsb*) are near to the maximal distance d*, the more the approximation of individual sorting rules will be accurate. The accuracy of sorting rules approximation can be estimated by the expression
δ = 1 - [d(Qsa*,Qsb*)/d*], (6)
where the distance d* is determined by (4). The approximating attributes {qs*}, which provide the demanded accuracy of approximation δ, are included in the generalized decision rule (1) for the objects classification.
The relations between the collection of objects X={Ai} and the set of their attributes G={gj} can be expressed with the matrix C=||ni(gj)||. The matrix C is used often in data analysis, pattern recognition and called the "object-attribute" table, information table or decision table. In our case information on properties of the multiattribute objects Ai and belonging Ai to a certain decision class can be presented as the decision table C, that has a dimension k´h, and consists of 2(m+1) boxes which correspond to multisets Qsa , Qsb and Ra , Rb (k is a number of objects, h is a number of object attributes, m+1 is a number of attributes groups). The reduced decision table C’=||ni’(gj)|| has a dimension 2´h, and consists of two rows na’(gj) and nb’(gj) which correspond to the classes Xa and Xb.