BIG DATA

https://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?_r=1&adxnnl=1&pagewanted=all&adxnnlx=1338379219-eNhY7MdpE5y3Or1QDNd3aQ

In a NY Times article in the Sunday Review last February, the following appeared:

“GOOD with numbers? Fascinated by data? The sound you hear is opportunity knocking.

Mo Zhou was snapped up by I.B.M. last summer, as a freshly minted Yale M.B.A., to join the technology company’s fast-growing ranks of data consultants. They help businesses make sense of an explosion of data — Web traffic and social network comments, as well as software and sensors that monitor shipments, suppliers and customers — to guide decisions, trim costs and lift sales. “I’ve always had a love of numbers,” says Ms. Zhou, whose job as a data analyst suits her skills.

To exploit the data flood, America will need many more like her. A report last year by the McKinsey Global Institute, the research arm of the consulting firm, projected that the United States needs 140,000 to 190,000 more workers with “deep analytical” expertise and 1.5 million more data-literate managers, whether retrained or hired.

The impact of data abundance extends well beyond business. Justin Grimmer, for example, is one of the new breed of political scientists. A 28-year-old assistant professor at Stanford, he combined math with political science in his undergraduate and graduate studies, seeing “an opportunity because the discipline is becoming increasingly data-intensive.” His research involves the computer-automated analysis of blog postings, Congressional speeches and press releases, and news articles, looking for insights into how political ideas spread.

The story is similar in fields as varied as science and sports, advertising and public health — a drift toward data-driven discovery and decision-making. “It’s a revolution,” says Gary King, director of Harvard’s Institute for Quantitative Social Science. “We’re really just getting under way. But the march of quantification, made possible by enormous new sources of data, will sweep through academia, business and government. There is no area that is going to be untouched.”

Welcome to the Age of Big Data. The new megarich of Silicon Valley, first at Google and now Facebook, are masters at harnessing the data of the Web — online searches, posts and messages — with Internet advertising. At the World Economic Forum last month in Davos, Switzerland, Big Data was a marquee topic. A report by the forum, “Big Data, Big Impact,” declared data a new class of economic asset, like currency or gold.

Rick Smolan, creator of the “Day in the Life” photography series, is planning a project later this year, “The Human Face of Big Data,” documenting the collection and uses of data. Mr. Smolan is an enthusiast, saying that Big Data has the potential to be “humanity’s dashboard,” an intelligent tool that can help combat poverty, crime and pollution. Privacy advocates take a dim view, warning that Big Data is Big Brother, in corporate clothing.

What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions. There is a lot more data, all the time, growing at 50 percent a year, or more than doubling every two years, estimates IDC, a technology research firm. It’s not just more streams of data, but entirely new ones. For example, there are now countless digital sensors worldwide in industrial equipment, automobiles, electrical meters and shipping crates. They can measure and communicate location, movement, vibration, temperature, humidity, even chemical changes in the air.”

Definition

"Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.

(A petabyte (derived from the SI prefix peta-) is a unit of information equal to one quadrillion (short scale) bytes, or 1,000 terabytes. The unit symbol for the petabyte is PB. The prefix peta (P) indicates the fifth power of 1000:

·  1 PB = 1,000,000,000,000,000 B = 1000^5 B = 10^15 B = 1 million gigabytes = 1 thousand terabytes)

For the enterprise CTO, speaking of "big data" implies the need for a strategy for dealing with large quantities of data.

MIKE2.0, an open approach to Information Management, defines big data in terms of useful permutations, complexity, and the difficulty of deleting individual records.

In a 2001 research report[16] and related conference presentations, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner continues to use this model for describing big data.[17]

Examples

Examples include web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

Technologies

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report[18] suggests suitable technologies include A/B testing, association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, predictive modelling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis and visualisation. Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.[citation needed]
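
To make one of the listed techniques concrete, here is a minimal, illustrative A/B-testing sketch in Python, written as a two-proportion z-test. The function name and the conversion counts are invented for the example, and SciPy is assumed to be available; a production big data platform would run many such tests over far larger event streams.

# Hedged sketch: a two-proportion z-test for A/B testing, one of the
# techniques listed above. The conversion counts are made-up numbers.
from math import sqrt
from scipy.stats import norm

def ab_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))                       # two-sided p-value
    return z, p_value

z, p = ab_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")   # a small p suggests variant B converts better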

Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.[19][20]

Practitioners of big data analytics are generally hostile to slower shared storage[citation needed], preferring direct-attached storage (DAS) in its various forms, from solid-state disk (SSD) to high-capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures—SAN and NAS—is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems, which thrive on system performance, commodity infrastructure, and low cost.

Real-time or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is much higher than that of other storage techniques.

There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.[21]

Impact

When the Sloan Digital Sky Survey (SDSS) began collecting data in 2000, it amassed more data in its first few weeks than had been collected in the entire history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016, it is anticipated to acquire that amount of data every five days.[22] In total, the four main detectors at the Large Hadron Collider (LHC) produced 13 petabytes of data (13,000 terabytes) in 2010.[23]

More Big Data impacts:

·  Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data - the equivalent of 167 times the information contained in all the books in the US Library of Congress.

·  Facebook handles 40 billion photos from its user base.

·  FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts worldwide.[24]

·  The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.[25]

·  Decoding the human genome originally took 10 years to process; now it can be achieved in one week.[22]


The impact of “big data” has increased the demand for information management specialists: Oracle, IBM, Microsoft, and SAP have spent more than $15 billion on software firms specializing in data management and analytics. This industry is worth more than $100 billion on its own and is growing at almost 10 percent a year, roughly twice as fast as the software business as a whole.[22]

Big data has emerged because we live in a society that makes increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the internet. Simply put, more people are interacting with data or information than ever before.[22] Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people are becoming literate, which in turn drives information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007,[12] and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013.[22]

Critique

Danah Boyd has raised concerns that the use of big data in science can neglect principles such as choosing a representative sample because researchers are too concerned with actually handling the huge amounts of data.[26] This approach may lead to results that are biased in one way or another. Integration across heterogeneous data resources - some that might be considered “big data” and others not - presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science.[27] Broader critiques have also been leveled at Chris Anderson's assertion that big data will spell the end of theory, focusing in particular on the notion that big data will always need to be contextualized in its social, economic and political contexts.[28] Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, fewer than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, “big data,” no matter how comprehensive or well analyzed, needs to be complemented by “big judgment.”[29]

According to SAS, one of the big players in the Big Data arena:

http://www.sas.com/big-data/index.html?gclid=CNPvvojtp7ACFQlN4AodxBuCXA

·  Volume. Many factors contribute to the increase in data volume – transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, etc. In the past, excessive data volume created a storage issue. But with today's decreasing storage costs, other issues emerge, including how to determine relevance amidst the large volumes of data and how to create value from data that is relevant.

·  Variety. Data today comes in all types of formats – from traditional databases to hierarchical data stores created by end users and OLAP systems, to text documents, email, meter-collected data, video, audio, stock ticker data and financial transactions. By some estimates, 80 percent of an organization's data is not numeric! But it still must be included in analyses and decision making.

·  Velocity. According to Gartner, velocity "means both how fast data is being produced and how fast the data must be processed to meet demand." RFID tags and smart metering are driving an increasing need to deal with torrents of data in near-real time. Reacting quickly enough to deal with velocity is a challenge for most organizations (a minimal streaming sketch follows this list).
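
As a concrete illustration of the velocity point above, here is a minimal Python sketch that keeps a sliding time window over a stream of readings. The window length, the simulated meter values, and the function name are all invented for the example.

# Hedged sketch of the "velocity" challenge: summarising events that arrive in
# a stream while keeping only a sliding time window of recent data.
from collections import deque
import random
import time

WINDOW_SECONDS = 10
events = deque()                       # (timestamp, reading) pairs in the window

def ingest(reading):
    """Add a new reading and drop anything older than the window."""
    now = time.time()
    events.append((now, reading))
    while events and events[0][0] < now - WINDOW_SECONDS:
        events.popleft()

# Simulated smart-meter readings arriving faster than a person could inspect them
for _ in range(1000):
    ingest(random.gauss(230.0, 5.0))   # e.g. voltage readings
mean = sum(r for _, r in events) / len(events)
print(f"{len(events)} readings in the last {WINDOW_SECONDS}s, mean = {mean:.1f}")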

Big data according to SAS

At SAS, we consider two other dimensions when thinking about big data:

·  Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something big trending in social media? Perhaps there is a high-profile IPO looming. Maybe swimming with pigs in the Bahamas is suddenly the must-do vacation activity. Daily, seasonal and event-triggered peak data loads can be challenging to manage – especially with social media involved.

·  Complexity. When you deal with huge volumes of data, it comes from multiple sources. It is quite an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control. Data governance can help you determine how disparate data relates to common definitions and how to systematically integrate structured and unstructured data assets to produce high-quality information that is useful, appropriate and up-to-date.
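
As a rough illustration of the complexity point above, the following pandas sketch links, cleanses and aggregates records from two hypothetical systems. The table names, columns and values are invented, and pandas is assumed to be available; real data governance spans many more sources and rules.

# Hedged sketch: linking and cleansing records drawn from two separate systems.
import pandas as pd

crm = pd.DataFrame({"cust_id": [" 001", "002 ", "003"],
                    "name": ["Ada", "Bo", "Cy"]})
orders = pd.DataFrame({"customer": ["001", "002", "002"],
                       "amount": [120.0, 35.5, 99.9]})

# Cleanse: normalise the join key so the two systems agree on its format
crm["cust_id"] = crm["cust_id"].str.strip()

# Link and transform: join the systems, then roll up to one row per customer
merged = crm.merge(orders, left_on="cust_id", right_on="customer", how="left")
summary = merged.groupby(["cust_id", "name"], as_index=False)["amount"].sum()
print(summary)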

Ultimately, regardless of the factors involved, we believe that the term big data is relative; it applies (per Gartner’s assessment) whenever the data an organization must handle, store and analyze exceeds its current capacity.

Uses for big data

So the real issue is not that you are acquiring large amounts of data (because we are clearly already in the era of big data). It's what you do with your big data that matters. The hopeful vision for big data is that organizations will be able to harness relevant data and use it to make the best decisions.

Technologies today not only support the collection and storage of large amounts of data, they provide the ability to understand and take advantage of its full value, which helps organizations run more efficiently and profitably. For instance, with big data and big data analytics, it is possible to:

·  Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.

·  Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk (a simulation sketch follows this list).

·  Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.

·  Quickly identify customers who matter the most.

·  Generate retail coupons at the point of sale based on the customer's current and past purchases, ensuring a higher redemption rate.
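
To illustrate the risk-portfolio item above, here is a minimal Monte Carlo value-at-risk sketch in NumPy. The positions, volatilities and scenario count are invented for the example; a real system would revalue far larger portfolios against full market-data histories.

# Hedged sketch: recalculating a tiny portfolio's one-day value at risk (VaR)
# by Monte Carlo simulation. All figures are illustrative, not real data.
import numpy as np

rng = np.random.default_rng(0)
positions = np.array([1.0e6, 2.5e6, 0.8e6])      # holdings in three assets ($)
daily_vol = np.array([0.012, 0.020, 0.015])      # assumed daily volatilities

# Simulate 100,000 one-day return scenarios and revalue the whole portfolio
returns = rng.normal(0.0, daily_vol, size=(100_000, 3))
pnl = returns @ positions                        # profit/loss per scenario

var_99 = -np.percentile(pnl, 1)                  # 99% one-day value at risk
print(f"99% one-day VaR: about ${var_99:,.0f}")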