Disseminating statistical data by short quantified sentences of natural language

Miroslav Hudec [1]

Keywords: data dissemination, linguistic summary, linguistic quantifier, quality of summary

1.  Introduction

Summarizing data by statistical methods is convenient, but the results are understandable only to a rather small group of specialists [1]. Another option is summarization in natural language, which is not as terse as summarization by numbers. For example, we can say: the mean value is 2358.42 with a standard deviation of 428.3265, or linguistically: most of the entities are near the mean value, few entities are near the mean value, and the like. The latter structure, known as a linguistic summary (LS), provides valuable summarization for a variety of statistical data users, such as businesses and journalists.

Further, LSs can be used as query conditions in data retrieval tasks [2]. An example of such a condition is SELECT regions WHERE most of municipalities have high unemployment rate and high ratio of arable land. Obviously, a region may fully or partially meet the query condition, which allows us to rank regions downwards from the best one and, moreover, to visualize the result on a thematic map by marking regions with different hues according to their query matching degrees.

Generally, entities are described by attributes or dimensions. Users of data from national statistical institutes (NSIs) may be interested in whether a particular summary, such as most of visits from remote countries have short stay, holds [3]. Another option is mining all relevant summaries (summaries with high validity) for a particular data set. This case can be solved as an operational research task [4]. To illustrate this approach, Section 2 briefly explains LSs, Section 3 is dedicated to short examples and discussion, and Section 4 concludes the paper.

2.  Methodology of linguistic summarization from the data

Linguistic summaries have been developed to express relational, concise and easily understandable knowledge about the data. The concept of LSs was introduced by Yager [5] and further developed in, e.g., [6] and [7]. Since natural language is the best way for people to communicate and mine information, LSs are in line with the concept of computing with words introduced by Zadeh [8].

An LS summarizing the whole data set has the following structure: Q entities in database are (have) S, where Q is a relative quantifier and S is a summarizer, both expressed by linguistic terms. The validity is computed in the following way [5]:

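$$v = \mu_Q\left(\frac{1}{n}\sum_{i=1}^{n}\mu_S(x_i)\right) \qquad (1)$$

where v is the validity of the summary, n is the number of tuples (records) in a data set (cardinality), $\mu_S(x_i)$ is the degree to which the i-th tuple satisfies S, $\frac{1}{n}\sum_{i=1}^{n}\mu_S(x_i)$ is the proportion of tuples in a data set that satisfy summarizer S and µQ is the membership function of the chosen relative quantifier (few, about half, most of, …).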
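The computation in (1) can be sketched in a few lines of Python. This is only an illustration: the function name `validity_basic` and the two toy membership functions (`is_large`, `identity`) are assumptions made for the example and are not taken from the paper.

```python
from typing import Callable, Sequence


def validity_basic(
    values: Sequence[float],
    mu_s: Callable[[float], float],
    mu_q: Callable[[float], float],
) -> float:
    """Validity of 'Q entities are S' following Eq. (1)."""
    if not values:
        raise ValueError("the data set must contain at least one tuple")
    # Proportion of tuples satisfying S: mean of the membership degrees.
    proportion = sum(mu_s(x) for x in values) / len(values)
    # The relative quantifier maps the proportion to a validity in [0, 1].
    return mu_q(proportion)


# Illustrative assumptions only: a crisp summarizer and an identity quantifier.
is_large = lambda x: 1.0 if x >= 10 else 0.0
identity = lambda y: y
print(validity_basic([5, 12, 15, 20], is_large, identity))  # 0.75
```

Passing the summarizer and the quantifier as functions keeps the sketch independent of any particular shape of the fuzzy sets.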

An LS focused on a restricted part of a data set has the form Q R entities in database are (have) S, where R is a restriction (expressed by a linguistic term). The validity is computed in the following way [6]:

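$$v = \mu_Q\left(\frac{\sum_{i=1}^{n} t\left(\mu_S(x_i), \mu_R(x_i)\right)}{\sum_{i=1}^{n}\mu_R(x_i)}\right) \qquad (2)$$

where $\frac{\sum_{i=1}^{n} t(\mu_S(x_i), \mu_R(x_i))}{\sum_{i=1}^{n}\mu_R(x_i)}$ is the proportion of the tuples belonging to R that also satisfy S, t is a t-norm (the minimum function is often used) and µQ is the membership function of the chosen relative quantifier.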
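A minimal Python sketch of (2), using the minimum t-norm by default, could look as follows. The function name `validity_restricted`, the toy degrees and the decision to return 0 when no tuple belongs to R are assumptions of this example, not part of the original method description.

```python
from typing import Callable, Sequence


def validity_restricted(
    s_degrees: Sequence[float],
    r_degrees: Sequence[float],
    mu_q: Callable[[float], float],
    t_norm: Callable[[float, float], float] = min,
) -> float:
    """Validity of 'Q R entities are S' following Eq. (2).

    s_degrees[i] and r_degrees[i] are the membership degrees of the
    i-th tuple in the summarizer S and in the restriction R.
    """
    if len(s_degrees) != len(r_degrees):
        raise ValueError("both sequences must describe the same tuples")
    denominator = sum(r_degrees)
    if denominator == 0:
        # Assumption of this sketch: an empty restriction yields validity 0.
        return 0.0
    numerator = sum(t_norm(s, r) for s, r in zip(s_degrees, r_degrees))
    return mu_q(numerator / denominator)


# Illustrative degrees only; the identity quantifier exposes the proportion.
s = [1.0, 0.0, 1.0, 1.0]
r = [1.0, 1.0, 0.5, 0.0]
print(validity_restricted(s, r, lambda y: y))
```

In this toy case the numerator is 1.5 and the denominator is 2.5, so the proportion passed to the quantifier is 0.6.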

For instance, the summarizer or restriction high pollution (HP) can be expressed as an R-type fuzzy set (Fig. 1a).

Figure 1. Concept high pollution expressed as fuzzy set (a) and crisp set (b)

In Fig. 1a, the values 50 and 60 delimit the uncertain area, i.e. the area where belonging to the set is a matter of degree. If we apply a classical set (Fig. 1b), then two similar values are treated differently: 55 mg of measured pollutant is not considered high pollution, whereas 55.000003 mg is.
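The difference between the two sets can be sketched in Python. The limits 50 and 60 and the crisp threshold of 55 come from the text above; the linear increase between 50 and 60 is an assumption of this sketch, and the function names are hypothetical.

```python
def mu_high_pollution(x: float, lower: float = 50.0, upper: float = 60.0) -> float:
    """R-type fuzzy set: 0 up to `lower`, 1 from `upper`.

    The linear increase in between is an assumed shape.
    """
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    return (x - lower) / (upper - lower)


def crisp_high_pollution(x: float, threshold: float = 55.0) -> float:
    """Classical set: a value belongs fully only above the threshold."""
    return 1.0 if x > threshold else 0.0


for value in (55.0, 55.000003):
    print(value, crisp_high_pollution(value), round(mu_high_pollution(value), 4))
```

The crisp set jumps from 0 to 1 between the two values, whereas the fuzzy degrees of both values stay at roughly 0.5.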

The quantifier most of is a relaxation of the universal quantifier all (Fig. 2). It is expressed by a non-decreasing membership function µQ(y), where y stands for the proportion in (1) and (2). The domain of a relative quantifier is the unit interval.

Figure 2. Linguistic quantifier most of
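One possible shape of the quantifier most of can be sketched as a piecewise linear, non-decreasing function on the unit interval. The break points 0.5 and 0.85 are assumptions chosen only for illustration; the parameters used in Fig. 2 are not recoverable from the text.

```python
def mu_most_of(y: float, lower: float = 0.5, upper: float = 0.85) -> float:
    """Membership degree of proportion y in the quantifier 'most of'.

    `lower` and `upper` are assumed break points, not values from the paper.
    """
    if not 0.0 <= y <= 1.0:
        raise ValueError("a relative quantifier is defined on the unit interval")
    if y <= lower:
        return 0.0
    if y >= upper:
        return 1.0
    return (y - lower) / (upper - lower)


for proportion in (0.4, 0.675, 0.9, 1.0):
    print(proportion, round(mu_most_of(proportion), 4))
```

With these parameters the quantifier all corresponds to the single point y = 1, while most of already reaches full validity at y = 0.85, which is the sense in which it relaxes all.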

3.  Illustrative examples and discussion

This section illustrates the approach suggested in Section 2 and provides further discussion.

3.1.  Illustrative examples

Illustrative example 1 A user wishes to get regions where most of municipalities have small altitude above sea level. Parameters expressing the fuzzy set small altitude are mined from the data. The result is shown in Table 1. Regions Bratislava, Trnava and Nitra fully meet the query condition, whereas region Banská Bystrica is more hilly than flat. Two regions are not selected, because they do not meet the query condition. The result corresponds with the map of the Slovak Republic.

Table 1: Retrieved regions

Region              Validity of the summary
Bratislava          1
Trnava              1
Nitra               1
Trenčín             0.7719
Košice              0.6314
Banská Bystrica     0.2116

In addition, this way of dissemination keeps sensitive data undisclosed, because LSs are calculated from data on the lower level (microdata), but the result is aggregated to the respective higher levels [9].
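The aggregation from municipalities to regions can be sketched as follows. Everything in this example is hypothetical: the region labels A, B and C, the altitudes, the parameters of the fuzzy set small altitude (200 m and 400 m) and the break points of most of (0.5 and 0.85). It only illustrates that microdata enter the calculation while only region-level validities leave it.

```python
from collections import defaultdict


def mu_small_altitude(alt_m: float, full: float = 200.0, none: float = 400.0) -> float:
    """L-type fuzzy set with assumed parameters: 1 up to `full`, 0 from `none`."""
    if alt_m <= full:
        return 1.0
    if alt_m >= none:
        return 0.0
    return (none - alt_m) / (none - full)


def mu_most_of(y: float, lower: float = 0.5, upper: float = 0.85) -> float:
    """Quantifier 'most of' with assumed break points."""
    if y <= lower:
        return 0.0
    if y >= upper:
        return 1.0
    return (y - lower) / (upper - lower)


def rank_regions(microdata):
    """Aggregate municipality-level degrees to region-level validities (Eq. 1)."""
    degrees = defaultdict(list)
    for region, altitude in microdata:
        degrees[region].append(mu_small_altitude(altitude))
    validities = {r: mu_most_of(sum(d) / len(d)) for r, d in degrees.items()}
    # Regions with zero validity do not meet the condition and are not selected.
    selected = [(r, v) for r, v in validities.items() if v > 0]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Hypothetical municipalities: (region, altitude above sea level in metres).
microdata = [
    ("A", 120), ("A", 150), ("A", 180),
    ("B", 150), ("B", 190), ("B", 300), ("B", 450),
    ("C", 500), ("C", 620), ("C", 710),
]
ranking = rank_regions(microdata)
print(ranking)
```

In this toy data set region A fully meets the condition, region B meets it partially and region C is not selected, which mirrors the structure of Table 1 without reproducing its values.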

Illustrative example 2 An agency wishes to know which summaries explain the length of visits of tourists from remote countries. The attribute length is divided into three overlapping granules: short, medium and long. The term set for the relative quantifier consists of the terms few, about half and most of. Hence, we should evaluate nine sentences. Construction of the sets short, medium and long for the attribute length depends on the particular categorization or the user's preferences, which are not examined further due to limited space. A possible answer is shown in Table 2. We see that short and long stays dominate, with few stays of medium length.
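Evaluating all nine sentences amounts to combining each quantifier with each summarizer. The sketch below assumes that the data set already contains only visits from remote countries, so that (1) applies. The trapezoidal parameters of all six fuzzy sets and the list of stays (in days) are hypothetical and are not meant to reproduce Table 2.

```python
from itertools import product

INF = float("inf")


def trapezoid(a: float, b: float, c: float, d: float):
    """Return a trapezoidal membership function with support (a, d) and core [b, c]."""
    def mu(x: float) -> float:
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu


# All parameters below are assumptions made for this illustration.
quantifiers = {
    "few": trapezoid(0.0, 0.0, 0.15, 0.35),
    "about half": trapezoid(0.25, 0.4, 0.6, 0.75),
    "most of": trapezoid(0.5, 0.85, 1.0, 1.0),
}
summarizers = {
    "short": trapezoid(0, 0, 3, 5),
    "medium": trapezoid(3, 5, 8, 10),
    "long": trapezoid(8, 10, INF, INF),
}
stays = [1, 2, 2, 3, 4, 9, 10, 12, 14, 15]  # hypothetical lengths of stay in days

results = {}
for (q_name, mu_q), (s_name, mu_s) in product(quantifiers.items(), summarizers.items()):
    proportion = sum(mu_s(x) for x in stays) / len(stays)
    results[(q_name, s_name)] = mu_q(proportion)

for (q_name, s_name), validity in results.items():
    print(f"{q_name} visits are of {s_name} stay: {validity:.4f}")
```

For this toy data the sentences with full validity are few visits are of medium stay, about half visits are of short stay and about half visits are of long stay.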

3.2.  Discussion

LSs are able to capture the vagueness or semantic uncertainty of the analysed phenomena by fuzzy sets and to present results in an understandable way. A linguistically summarized sentence can be read out by a text-to-speech synthesis system, which is a valuable option when visual attention should not be disturbed, as well as for disabled people.

Although LSs are applied at the final stages of statistical data production, they could improve data collection through tailored motivation of respondents [10]. For example, we could offer sophisticated LSs to businesses that cooperate extensively in surveys. Businesses are often interested in summarized information rather than long sheets of data. This holds especially for smaller businesses, which cannot afford data mining specialists. By this approach we can mitigate the paradox explained by Ross [11]: “We find that a paradox is steadily developing in a rapidly changing world, in that statistical users are becoming ever more demanding for timely data, but are less willing to provide their own data to statistical institutes”. This paradox presumably arises from the fact that respondents cooperate in many official surveys, but often cannot easily find relevant information extracted from databases on NSI data portals.

Table 2: Summaries and their respective validities

LS                                                            Validity
few visits from remote countries are of short stay           0.1472
few visits from remote countries are of medium stay          0.8575
few visits from remote countries are of long stay            0
about half visits from remote countries are of short stay    0.8528
about half visits from remote countries are of medium stay   0.1425
about half visits from remote countries are of long stay     1
most of visits from remote countries are of short stay       0
most of visits from remote countries are of medium stay      0
most of visits from remote countries are of long stay        0

4.  Conclusion

Linguistic summaries play a pivotal role in summarizing information from the data when uncertainty related to the semantic meaning of the phenomena cannot be neglected. In this paper we have explored possibilities for applying LSs in statistical data dissemination, because linguistically summarized information is understandable to a wide range of statistical data users. Furthermore, a linguistically summarized sentence can be read out by a text-to-speech synthesis system, which benefits disabled people and users whose visual attention is focused on something else. In addition, when summarization is focused on territorial units, validities of summaries can be visualised on thematic maps by different hues of the selected colour. Finally, this novel way of data dissemination could motivate respondents to cooperate in surveys.

Future tasks should focus on adjusting quality measures of LSs to the particularities of statistical data, analysing dissemination needs, summarizing from SDMX data cubes and developing a software tool. These tasks can be solved in cooperation between NSI data dissemination units and scientists working in this field.

References

[1] Yager, R.R., Ford, M., Canas, A.J.: An approach to the linguistic summarization of data. In: 3rd International Conference of Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU 1990), pp. 456-468, Paris (1990).

[2] Hudec, M.: Fuzziness in Information Systems. Springer International Publishing, Switzerland (2016).

[3] Mišút, M., Hudec, M.: Linguistically summarizing mobile positioning data managed in the STAR scheme. In review.

[4] Liu, B.: Uncertain logic for modeling human language. J. Uncertain Syst. 5, 3–20 (2011).

[5] Yager, R.R.: A new approach to the summarization of data. Inf. Sciences 28, 69–86 (1982).

[6] Rasmussen, D., Yager, R.R.: Summary SQL - A fuzzy tool for data mining. Intell. Data Analysis 1, 49–58 (1997).

[7] Kacprzyk, J., Zadrożny, S.: Protoforms of linguistic database summaries as a human consistent tool for using natural language in data mining. International Journal of Software Science and Computational Intelligence 1, 100–111 (2009).

[8] Zadeh, L.A.: From computing with numbers to computing with words - from manipulation of measurements to manipulation of perceptions. In: Wang, P. (ed.) Computing with Words, pp. 35–68. John Wiley & Sons, New York (2001).

[9] Hudec, M.: Fuzzy database queries in official statistics: Perspective of using linguistic terms in query conditions. Statistical Journal of the IAOS 29(4), 315–323 (2013).

[10] Hudec, M., Torres Van Grinsven, V.: Business’ participants motivation in official surveys by fuzzy logic. In: 1st Eurasian Multidisciplinary Forum (EMF 2013), Vol. 3, pp. 42–52, Tbilisi, 24–26 October (2013).

[11] Ross, M. P.: Official Statistics in Malta – implications of Membership of the European Statistical System for a small country/NSI. 95th DGINS Conference, 2009.


[1] Faculty of Economic Informatics, University of Economics in Bratislava, Slovakia