Profiling Users by Estimating Composite 
and Multi-valued Attributes from Big Data Sources for Social Statistics Purposes
Jacek Maślankowski [1]
Keywords: Big Data, Social Statistics, Text Mining, Web Mining, Social Mining
1. Introduction
In recent years the value of big data in statistics has grown with the development of tools and methods that help ensure the reliability of analysis results. In particular, comparisons between social media sentiment analysis and other indicators, such as consumer confidence, look promising [1]. However, the methodology still has gaps, especially regarding the representativeness of data scraped from the web. It is therefore necessary to describe the population with detailed attributes that can be extracted from the data or at least estimated. These attributes can be divided into composite attributes (e.g., address, name) and multi-valued attributes (e.g., phone numbers). The low quality of such attributes is a long-known problem, and several methods for improving it have been proposed [2]. The goal is to incorporate such methods into big data methodology to ensure high data quality. This is usually referred to as the reliability of the data, in terms of completeness, accuracy, consistency and integrity [3].
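As a minimal illustration of this distinction, the sketch below models a profiled user with one composite attribute (an address made of several parts) and one multi-valued attribute (a list of phone numbers), and computes a simple completeness ratio. The class names, the fields and the definition of completeness are assumptions made for illustration only; they are not taken from the paper.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Address:
    """Composite attribute: one logical value made of several parts."""
    street: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None


@dataclass
class UserProfile:
    """Hypothetical statistical unit described by estimated attributes."""
    name: Optional[str] = None
    address: Address = field(default_factory=Address)
    phone_numbers: List[str] = field(default_factory=list)  # multi-valued


def completeness(profile: UserProfile) -> float:
    """Share of filled slots (an assumed, simplified definition).

    Each part of the composite attribute counts as one slot; the
    multi-valued attribute counts as filled if it holds at least one value.
    """
    slots = [profile.name is not None, len(profile.phone_numbers) > 0]
    slots += [getattr(profile.address, f.name) is not None
              for f in fields(profile.address)]
    return sum(slots) / len(slots)


# Placeholder values, not data from the case study.
profile = UserProfile(
    name="user_1",
    address=Address(city="Gdańsk", country="Poland"),
    phone_numbers=["000-000-001", "000-000-002"],
)
print(completeness(profile))  # 4 of 5 slots are filled
```

Treating every part of a composite attribute as its own slot makes a partly known address visible as partial completeness rather than as a missing value.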
The aim of the paper is to present a methodology for extracting data from big data sources for social statistics purposes. Because users are difficult to profile on most web pages, a methodology for estimating attributes is proposed. It assumes that, in some circumstances, the set of attributes describing a statistical unit can be estimated from the content of the text being analysed. The paper presents both the proposed methodology and the results of a case study that applies it. The case study, based on web page data, confirms the need to estimate attributes when conducting big data analysis.
The hypotheses of the paper are as follows. H1: The reliability of the data can be increased by estimating values for specific entities. H2: The representativeness of web data does not allow it to be applied directly for social statistics purposes. Consequently, text mining and machine learning tools cannot be applied without knowing which types of entities will be used in further analysis. This leads to the conclusion that selecting proper entities and attributes from big data sources makes it possible to enhance social statistics surveys.
2. Methods
The simplest form of big data analysis may be related to the MapReduce paradigm, a programming model that research and industrial communities find easy to implement [4]. However, this model does not provide most of the tools needed to extract an entity with all its attributes from an unstructured data source. Typical MapReduce algorithms, such as WordCount and regular expression matching, analyse the text without separating the results into the different attributes of the entities. Although text mining and machine learning tools can be applied to increase the value of the results, new methods need to be developed according to the requirements of the particular survey.
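The limitation can be illustrated with a small sketch in plain Python (not Spark), offered as an illustration under stated assumptions rather than as the implementation used in the paper. A WordCount-style map and reduce returns one flat table of token counts, whereas attribute-oriented extraction keeps the matches separated per record and per attribute. The sample texts and the attribute names `hashtags` and `mentions` are hypothetical.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical sample records; not data from the case study.
records = [
    "Voting tomorrow in #Gdansk with @anna",
    "No trust in the news today #media #Gdansk",
]

# WordCount in map/reduce style: map each record to token counts,
# then reduce by summing the counters. Attributes are not separated.
mapped = map(lambda text: Counter(text.lower().split()), records)
word_counts = reduce(lambda left, right: left + right, mapped, Counter())

# Attribute-oriented extraction: every record keeps its own
# multi-valued attributes, each filled by a separate pattern.
PATTERNS = {
    "hashtags": re.compile(r"#(\w+)"),
    "mentions": re.compile(r"@(\w+)"),
}


def extract_attributes(text):
    """Return a dict mapping attribute name to the list of matched values."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


entities = [extract_attributes(text) for text in records]

print(word_counts["#gdansk"])  # 2
print(entities[0])  # {'hashtags': ['Gdansk'], 'mentions': ['anna']}
```

The flat counter cannot tell which record a token came from or which attribute it belongs to, while the second structure preserves both.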
Before the analysis starts, the first step is to assess the readiness of the data source [5]. Although several different methods have been proposed, there is still no framework that can be applied to any data source.
Therefore we propose a method, based on text mining techniques, for profiling users from social media and other data sources (web pages in the case study). Methods for analysing social media data are sometimes known as social big data [6] or social mining. The paper also uses a subclass of text mining known as web mining. In particular, this makes it possible to identify demographic attributes related to human-generated information [7]. User profiling in this paper refers to using this information for social statistics purposes.
The data was extracted using algorithms implemented in Python on Apache Spark. The survey population was a selected group of social media users and people who comment on selected web portals. Because of the accessibility of its API, Twitter was chosen for the social media part of the case study. Web scraping was used as a second method, to access data from public news portals. For this purpose, machine learning algorithms were prepared and tested. It can also be concluded that current machine learning algorithms do not fulfil all the expectations for big data analysis [8].
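The web scraping step could look roughly like the sketch below. It uses only Python's standard `html.parser` module and a hard-coded HTML string instead of a live portal or Spark. The markup, the `comment` class name and the `CommentParser` class are hypothetical; a real news portal would need its own selectors.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real portals use their own markup.
SAMPLE_HTML = """
<div class="article">Some news text</div>
<div class="comment">I will vote on Sunday.</div>
<div class="comment">I do not trust this portal.</div>
"""


class CommentParser(HTMLParser):
    """Collects the text of <div class="comment"> elements.

    Assumes comment elements are not nested inside each other.
    """

    def __init__(self):
        super().__init__()
        self.in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # Start a new comment buffer when a comment container opens.
        if tag == "div" and dict(attrs).get("class") == "comment":
            self.in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_comment = False

    def handle_data(self, data):
        # Text outside comment containers is ignored.
        if self.in_comment:
            self.comments[-1] += data


parser = CommentParser()
parser.feed(SAMPLE_HTML)
comments = [text.strip() for text in parser.comments]
print(comments)
```

The resulting list of comment texts is the kind of unstructured input from which attributes would then be estimated.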
The paper concentrates on unstructured data analysis. Such data accounts for up to approximately 95% of the sources used for big data processing [9]. This is an effect of the data-driven character of methods that apply big data tools [10]. Therefore, it was decided to develop a framework for gathering composite and multi-valued attributes from unstructured datasets. However, to verify the results of the analysis, structured and semi-structured data were used to confirm the first hypothesis of the paper.
3. Results
The main finding of the paper is a proposed set of combined methods for extracting user profiles from both social media and web pages. The results of the case study, which was in fact a survey conducted on social media and web pages, show that several useful and reliable attributes can be extracted to enhance social statistics surveys. These include, in particular, social confidence and intention to vote. In addition, new phenomena such as media education can also be analysed. The results are presented using geographic and demographic attributes of the entities.
In the survey, an entity is a person who is active on social media or who comments on various events in the country. Although the data is noisy and opinions can, in some circumstances, be confusing for the algorithms, valuable and reliable information can be extracted. The general conclusion from the analysis, however, is that machine learning algorithms have to be modified as the data patterns in the data source change. Therefore the testing phase is repeated several times at regular intervals.
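A repeated testing phase of this kind might be organised as in the sketch below, which scores a classifier separately for each time period and flags the periods in which accuracy falls below a cut-off. The labelled samples, the period labels and the threshold value are all invented for illustration and do not come from the paper.

```python
from collections import defaultdict

# Hypothetical test samples: (period, true_label, predicted_label).
samples = [
    ("2016-01", "pos", "pos"), ("2016-01", "neg", "neg"),
    ("2016-01", "pos", "pos"), ("2016-01", "neg", "pos"),
    ("2016-02", "pos", "neg"), ("2016-02", "neg", "pos"),
    ("2016-02", "pos", "pos"), ("2016-02", "neg", "pos"),
]

ACCURACY_THRESHOLD = 0.7  # assumed cut-off, chosen only for this example


def accuracy_by_period(rows):
    """Return a dict mapping each period to the share of correct predictions."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, truth, predicted in rows:
        totals[period] += 1
        hits[period] += int(truth == predicted)
    return {period: hits[period] / totals[period] for period in totals}


scores = accuracy_by_period(samples)
# Periods in which the algorithm should be revisited.
needs_revision = sorted(p for p, acc in scores.items() if acc < ACCURACY_THRESHOLD)
print(scores)
print(needs_revision)
```

A drop in one period, as in the second month of the placeholder data, would be the signal to adjust the algorithm to the changed data patterns.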
Three cases were used in the paper to carry out the analysis and enhance social statistics: intention to vote (mostly covered in OECD statistics [11]), media education (how much people trust the media), and social confidence. The framework presented in the paper can also be applied for other social statistics purposes, taking into account the specifics of the data source.
4. Conclusions
The goal of the paper has been achieved, and the results of the analysis using the framework have been presented. The survey conducted with big data tools allows conclusions to be drawn on the application of big data sources in social statistics, especially regarding the quality of the data and the availability of methods for increasing it. Firstly, the data sources are very noisy, which is well known. There are still no methods that cleanse them and provide reliable information on the attributes of the entities being analysed.
Hypothesis H1 was confirmed by comparing the results of the analysis with data from official statistics. Although the results from traditional surveys and from big data sources differ, the changes over time are correlated in both sources. It should be noted that the big data source covers a larger population than traditional surveys. On the other hand, the population represented in big data sources is limited to active social media users and people who leave comments on web pages. This means that some differences in the results have to be expected. This confirms the second hypothesis, which concerns the representativeness of alternative data sources, such as big data, in social statistics.
Therefore there is a need to build a framework for extracting high-quality multi-valued and composite attributes from unstructured datasets. However, this will not resolve all the issues related to providing reliable information. In fact, it is very risky to substitute big data analysis for traditional data sources when identifying the scale of intention to vote, social confidence and media education. Nevertheless, the results presented in the paper are very promising and may have a big impact on the way social surveys are conducted in the future.
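The comparison with official statistics, in which levels differ but changes over time move together, could be checked along the lines of the sketch below. Both series are invented placeholder numbers, not results from the paper, and the use of Pearson correlation on period-to-period differences is an assumption about how such a comparison might be made.

```python
from math import sqrt


def first_differences(series):
    """Period-to-period changes of a time series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Placeholder indicator values for five periods (NOT real survey results).
official = [50.0, 52.0, 51.0, 55.0, 54.0]
big_data = [30.0, 33.0, 32.0, 37.0, 35.0]

r = pearson(first_differences(official), first_differences(big_data))
print(round(r, 2))  # close to 1 for these placeholder series
```

Correlating differences rather than levels matches the observation that the two sources disagree on absolute values while still tracking the same movements.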
References
[1] P.J.H. Daas, M.J.H. Puts, Social media sentiment and consumer confidence, Statistics Paper Series, No. 5, September, ECB, (2014).
[2] I. Kononenko, On Biases in Estimating Multi-Valued Attributes, IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2, Morgan Kaufmann Publishers, (1995), 1034-1040.
[3] L. Cai, Y. Zhu, The Challenges of Data Quality and Data Quality Assessment in the Big Data Era, Data Science Journal 14, May, (2015).
[4] S. Sakr, A. Liu, A. G. Fayoumi, The family of MapReduce and large-scale data processing systems, ACM Computing Surveys (CSUR): Volume 46 Issue 1, October, (2013).
[5] Y. Lu, X. Fang, J. Zhan, Data Readiness Level for Unstructured Data, BigDataScience '14 Proceedings of the 2014 International Conference on Big Data Science and Computing, No. 36, ACM New York (2014).
[6] G. Bello-Orgaz, J.J. Jung, D. Camacho, Social big data: Recent achievements and new challenges, Information Fusion, Volume 28, March (2016), 45–59.
[7] B. Fortuna, D. Mladenic, M. Grobelnik, Application of semantic annotations to predicting users' demographics, ESAIR '10 Proceedings of the third workshop on Exploiting semantic annotations in information retrieval, ACM New York (2010).
[8] T. Condie, P. Mineiro, N. Polyzotis, M. Weimer, Machine learning for big data, SIGMOD '13 Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, (2013).
[9] A. Gandomi, M. Haider, Beyond the hype: Big data concepts, methods, and analytics, International Journal of Information Management, Volume 35, Issue 2, April (2015), 137–144.
[10] X. Wu, X. Zhu, G.-Q. Wu, Data mining with big data, IEEE Transactions on Knowledge and Data Engineering, Volume 26, Issue 1, January (2014), 97-107.
[11] Education at a Glance 2016, OECD, Paris (2016).
[1] Department of Business Informatics, Faculty of Management, University of Gdańsk, Poland;
Central Statistical Office, Statistical Office in Gdańsk, Poland
