Scope of security & security testing in BIG DATA
Abstract
Introduction and Scope:
Industries today handle huge volumes of data, which is processed and segregated according to their needs. Once the required data is gathered, it is stored in databases or, increasingly, in the cloud. Here we need to think about one key aspect: security. How would we deal with a hack or leak of the data? One option is to store the data in some secured form, such as encrypted or tokenized. And if the data is stored in a secured form, how does the testing industry ensure that it really is secure?
Business benefits:
- Reduced chances of losing consumers or business
- Protection of end users' data
- Avoidance of penalties or fines from clients or under IT law for breaching security norms
- Improved quality of data handling in the market
- High rating for security compliance in the market
Research/ Study content:
- Research was carried out to find solutions for securing sensitive data. A few options are encryption & decryption, tokenization, masking, etc.
- Testers who test such applications need knowledge of databases and data classification in order to understand which data must be secured and why.
- With big data, security becomes even more important, as we deal with huge volumes such as terabytes of data within an hour.
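The protection options named above can be illustrated with a small sketch. This is a hypothetical example using only standard-library primitives; the function names and the in-memory "vault" are illustrative assumptions, not any vendor's API, and a real tokenization vault would live in a hardened, access-controlled store.

```python
import hashlib
import secrets

# Illustrative sketch of three common ways to protect a sensitive field.
# The vault dict stands in for a secured token store.
_token_vault = {}  # token -> original value

def mask(card_number: str) -> str:
    """Masking: hide all but the last four digits (irreversible)."""
    return "*" * (len(card_number) - 4) + card_number[-4:]

def tokenize(value: str) -> str:
    """Tokenization: replace the value with a random surrogate, keep a vault mapping."""
    token = secrets.token_hex(8)
    _token_vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Reverse lookup in the vault restores the original value."""
    return _token_vault[token]

def fingerprint(value: str) -> str:
    """One-way hash: allows matching records without storing the plaintext."""
    return hashlib.sha256(value.encode()).hexdigest()

card = "4111111111111111"
print(mask(card))              # ************1111
token = tokenize(card)
print(detokenize(token) == card)  # True: tokenization is reversible via the vault
```

Note the design difference: masking and hashing are one-way, while tokenization is reversible only for systems that can reach the vault, which is what makes it attractive for payment data.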
Overview of Big Data
What is Big Data?
Big data is a collection of varied data in sets, called data sets, that are so large that processing and using them with traditional databases becomes a tedious and complex task. To handle such huge data, or in industry terms BIG DATA, we need an architecture, as big data is basically characterized by the 4Vs: Volume, Velocity, Variety and Veracity. Some of this data may be sensitive, while other data causes no harm even if it is exposed without security. The picture below best describes why an architecture is important.
IMG-1: Why architecture is needed for big data.
Data Storage & Management:
As we see in the architecture, after data acquisition we need to store the data somewhere so we can process and use it when needed. Traditional databases are structured, but much of the data we use nowadays is unstructured and comes from different sources, so we use NoSQL for storing big data. NoSQL differs from a traditional RDBMS in that an RDBMS requires a schema, whereas a schema does not play a major role in NoSQL, which makes handling unstructured data easier.
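The schema-less point above can be sketched in a few lines. This toy `DocumentStore` class is an illustrative stand-in for a NoSQL document database, not a real driver API: it accepts documents of any shape, which is exactly what a fixed relational schema would reject.

```python
class DocumentStore:
    """Toy schema-less store: each document may carry different fields."""

    def __init__(self):
        self._docs = []

    def insert(self, doc: dict) -> None:
        self._docs.append(doc)  # no schema check: any shape is accepted

    def find(self, **criteria):
        """Return documents whose fields match all the given criteria."""
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

store = DocumentStore()
store.insert({"user": "alice", "age": 30})                      # structured record
store.insert({"user": "bob", "tweets": ["hello", "big data"]})  # different shape, still accepted
print(store.find(user="alice"))  # [{'user': 'alice', 'age': 30}]
```

In an RDBMS, the second insert would fail or require schema migration; here both shapes coexist, which is the flexibility (and, as the next section shows, the security challenge) of NoSQL.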
Challenges in Big Data:
- Insecure Computation
There are many ways an insecure program can create big security challenges for a big data solution, including:
- An insecure program can access sensitive data such as personal profiles, age, credit cards, etc.
- An insecure program can corrupt the data, leading to incorrect results.
- An insecure program can perform a Denial of Service attack on your Big Data solution, leading to financial loss.
- End-point input validation/filtering
Big data collects data from a variety of sources. There are two fundamental challenges in the data collection process:
- Input Validation: How can we trust data? What kind of data is untrusted? What are untrusted data sources?
- Data Filtering: Filter rogue or malicious data.
The sheer volume of data collected in Big Data makes it difficult to validate and filter data on the fly.
The behavioural aspect of data poses additional challenges for input validation and filtering. Traditional signature-based data filtering may not solve the input validation and data filtering problem completely. For example, a rogue or malicious data source can insert a large volume of legitimate-looking but incorrect data into the system to influence prediction results.
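The distinction above can be made concrete with a sketch of two complementary checks: a signature-style blocklist, which catches known-bad sources, and a simple statistical range check, which catches the "legitimate-looking but incorrect" values that signatures miss. The source names and the three-sigma threshold are illustrative assumptions.

```python
from statistics import mean, stdev

# Assumed blocklist of known-bad feeds (signature-based filtering).
BLOCKED_SOURCES = {"known-bad-feed"}

def passes_signature_filter(record: dict) -> bool:
    """Reject records from sources already known to be malicious."""
    return record.get("source") not in BLOCKED_SOURCES

def passes_range_check(value: float, history: list, k: float = 3.0) -> bool:
    """Reject values more than k standard deviations from the recent mean.

    This is the kind of behavioural check that can flag plausible-looking
    but incorrect data a signature filter would let through.
    """
    if len(history) < 2:
        return True  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) <= k * max(sigma, 1e-9)

history = [10.0, 11.0, 9.5, 10.2, 10.8]
print(passes_range_check(10.5, history))   # plausible reading -> True
print(passes_range_check(500.0, history))  # outlier -> False
```

Neither check alone is sufficient: the blocklist fails for new sources, and the statistical check can be gamed by slowly drifting inputs, which is why the text says signature-based filtering "may not solve the problem completely".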
- Granular access control
Existing Big Data solutions are designed for performance and scalability, with almost no security in mind. Traditional relational databases have fairly comprehensive access-control features covering users, tables, rows and even individual cells. However, several fundamental challenges prevent Big Data solutions from providing comprehensive access control:
- Security of Big Data is still an area of ongoing research.
- The non-relational nature of the data breaks the traditional paradigm of table-, row- or cell-level access control. Current NoSQL databases depend on third-party solutions or application middleware to provide access control.
- Ad-hoc queries pose an additional access-control challenge. In a relational database, for example, an end user could submit arbitrary but legitimate-looking SQL queries that reach data the application never intended to expose.
- Access control is often disabled by default.
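The "application middleware" layer mentioned above can be sketched as a role-to-permission map checked before every operation. The role names, collection names and exception type here are illustrative assumptions, not any particular NoSQL product's API.

```python
# Hypothetical role-based access-control middleware for a NoSQL-style store.
# Each role maps to the (collection, operation) pairs it is allowed.
ROLE_PERMISSIONS = {
    "analyst": {("transactions", "read")},
    "admin":   {("transactions", "read"), ("transactions", "write"),
                ("customers", "read"), ("customers", "write")},
}

class AccessDenied(Exception):
    pass

def authorize(role: str, collection: str, operation: str) -> None:
    """Raise AccessDenied unless the role holds the requested permission."""
    if (collection, operation) not in ROLE_PERMISSIONS.get(role, set()):
        raise AccessDenied(f"role {role!r} may not {operation} {collection!r}")

authorize("analyst", "transactions", "read")   # allowed, returns silently
try:
    authorize("analyst", "customers", "read")  # not granted
except AccessDenied as exc:
    print(exc)
```

Because this check lives in middleware rather than the database, any client that connects to the store directly bypasses it entirely, which is precisely the weakness the bullet list describes.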
- Insecure data storage and Communication
There are multiple challenges related to data storage and communication in Big Data:
- Data is stored at various distributed data nodes. Authentication, authorization and encryption of data are a challenge at each node.
- Auto-tiering: automatic partitioning and movement of data can place sensitive data on a lower-cost, less secure storage tier.
- Real-time analytics and continuous computation require low query latency, so encryption and decryption may add overhead in terms of performance.
- Secure communication among nodes, middleware and end users is another area of concern.
- The transactional logs of a big data system constitute big data themselves and should be protected in the same way as the data.
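One way to protect transactional logs, as the last point suggests, is to make tampering detectable by attaching a message authentication code to each log line. This is a minimal sketch using the standard library's HMAC; the hard-coded key is an illustrative assumption, and in practice the key would come from a key-management service.

```python
import hashlib
import hmac

# Illustrative key: a real deployment would fetch this from a KMS, never
# embed it in source code.
LOG_KEY = b"example-key-from-kms"

def sign_log_line(line: str) -> str:
    """Append an HMAC-SHA256 tag so later tampering is detectable."""
    tag = hmac.new(LOG_KEY, line.encode(), hashlib.sha256).hexdigest()
    return f"{line}|{tag}"

def verify_log_line(signed: str) -> bool:
    """Recompute the tag and compare in constant time."""
    line, _, tag = signed.rpartition("|")
    expected = hmac.new(LOG_KEY, line.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)

entry = sign_log_line("2024-01-01T00:00:00Z node-3 WRITE customers/42")
print(verify_log_line(entry))                             # True
print(verify_log_line(entry.replace("WRITE", "DELETE")))  # tampered -> False
```

HMAC gives integrity but not confidentiality; if the log lines themselves contain sensitive values, they would additionally need the encryption or tokenization measures discussed elsewhere in this paper.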
- Privacy Preserving Data Mining and Analytics
Monetization of Big data generally involves doing data mining and analytics. However, there are many security concerns pertaining to monetizing and sharing big data analytics in terms of invasion of privacy, invasive marketing, and unintentional disclosure of sensitive information, which must be addressed.
For example, AOL released anonymized search logs for academic purposes, but users were easily identified by their searches. Netflix faced a similar problem when users of its anonymized data set were identified by correlating their Netflix movie scores with IMDB scores.
Security & Privacy:
In recent times we have seen many security hacks and breaches; for example, a few months ago SBI ATM cards were blocked because of a security breach. Similarly, many bank databases, employer databases and other external databases hold our private details such as PAN numbers, AADHAR numbers, bank account numbers, debit/credit card details, etc.
IMG-2: Data breach news.
So what could be the reason for such an incident? If we pay attention, we hear or see many similar incidents in our daily routine. There are two possibilities: either the customer gave out the details in reply to scam emails or text messages, or the bank's servers or databases were compromised or hacked.
As long as databases store data "as is", security and privacy are at risk; in other words, the data is vulnerable. To stop this we must take preventive measures. Below are three ways to secure our data.
- Remove the data if it is no longer needed.
- Replace the sensitive data if it can be replaced with some other data.
- Protect the sensitive data if it is required for processing.
Let's take the example of a bank database holding both sensitive and non-sensitive data. Below are a few scenarios showing how the three solutions stated above can be applied.
- If a customer has closed his account and no longer transacts with the bank, his data can be removed from the database.
- Most banks keep sensitive customer data as a unique key to identify a person swiftly and pull up records for any transaction. If a bank can identify a non-sensitive field that is unique for every customer, it can replace the sensitive field with this newly identified one.
- If no such field can be identified, the bank can store the sensitive data with some form of protection.
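The three scenarios above can be sketched against a toy in-memory "database". All field names, customer IDs and the tokenization vault here are illustrative assumptions; the point is only to show remove, replace and protect side by side.

```python
import secrets

vault = {}  # token -> plaintext; would live in a secured store in practice

# Toy bank database: PAN is sensitive, account_no is unique but non-sensitive.
db = {
    "C001": {"pan": "ABCDE1234F", "account_no": "9001", "card": "4111111111111111"},
    "C002": {"pan": "PQRST5678Z", "account_no": "9002", "card": "5500000000000004"},
}

# 1. Remove: customer C002 closed the account, so the record is deleted outright.
db.pop("C002", None)

# 2. Replace: account_no is unique and non-sensitive, so it serves as the
#    lookup key and the PAN is dropped from the stored record.
for record in db.values():
    record.pop("pan", None)

# 3. Protect: the card number is still needed for processing, so tokenize it.
def tokenize(value: str) -> str:
    token = secrets.token_hex(8)
    vault[token] = value
    return token

db["C001"]["card"] = tokenize(db["C001"]["card"])
print(db)  # no C002, no pan field, card replaced by a random token
```

Each measure trades off differently: removal is final, replacement loses the sensitive value for this system, and protection keeps it recoverable but adds key- or vault-management overhead.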
For security, several vendors provide Encryption/Decryption, Tokenization/Detokenization, Masking/Unmasking and Authentication/Authorization as solutions.
Below is an image from the vendor Vormetric displaying its various security solutions.
IMG-3: Solutions for security of data by Vormetric.
Scope of security testing:
If the data is stored in databases or in the cloud and the techniques described above are used to secure it, what is the scope for testing in this scenario? From the testing point of view:
- We can validate the data by retrieving it from the databases. When we read the data directly from the database, we should see the encrypted/masked/tokenized form as the result.
- Once the data is retrieved, we can try to decrypt/unmask/detokenize it and check whether the result matches the original plaintext.
- We can try logging in to the database and checking access rights using various credentials. Each role has either limited or full access, so by checking access we can ensure the data is not mishandled or, if it is mishandled, easily track the person responsible for that particular data breach or leak.
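The first two checks above can be automated as a small test sketch. Here an in-memory SQLite table stands in for the application database, and the tokenize/vault pair is an illustrative assumption standing in for whatever protection mechanism the system under test uses.

```python
import secrets
import sqlite3

vault = {}  # token -> plaintext; stands in for the real detokenization service

def tokenize(value: str) -> str:
    token = secrets.token_hex(8)
    vault[token] = value
    return token

# Set up a stand-in for the application database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, card TEXT)")
plaintext = "4111111111111111"
conn.execute("INSERT INTO customers VALUES (1, ?)", (tokenize(plaintext),))

# Check 1: data read straight from the database must not be the plaintext.
stored = conn.execute("SELECT card FROM customers WHERE id = 1").fetchone()[0]
assert stored != plaintext, "sensitive value stored in the clear!"

# Check 2: detokenizing the stored value must round-trip to the plaintext.
assert vault[stored] == plaintext, "detokenization does not match original"

print("both security checks passed")
```

The third check (access rights per role) would repeat the same queries under different credentials and assert that restricted roles receive an authorization error rather than data.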
References & Appendix
- Infosys Labs Briefings, Vol. 11, No. 1, 2013
- International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 5, Issue 3, March 2015, ISSN 2277-128X
Author Biography
Bhanu Prakash Meher Regulagedda is working with Capgemini, Hyderabad as a test analyst. He has one year of experience in functional, regression and exploratory testing and is part of the security testing team. Test automation is his passion. He is ISTQB certified.
Rupa Reddy K is working with Capgemini, Hyderabad as a test analyst. She has one year of experience in functional, regression and exploratory testing. She is ISTQB certified.