Smart Data and Wicked Problems
Paul L. Borrill
Founder, REPLICUS Software
Abstract
Doug Lenat says we are plagued by the problem of our data not being smart enough. In this presentation, we first explore why we want smarter data and what that means. We look behind the scenes at the frustrations knowledge warriors experience in their digital lives: problems that are easy to fall victim to, such as information overload, the constant unfolding of additional tasks that get in the way of getting real work done (Shaving the Yak), and the seemingly endless toll on our time & attention to manage our digital lives. We illuminate these problems with insights gained from design considerations for a 100PB distributed repository, and peel the onion on these problems to find that they go much, much deeper than we imagined, connecting to “wicked problems” in mathematics, physics, and philosophy: What is persistence? Why are time and space not real? Why is the notion of causality so profoundly puzzling? And why is it impossible to solve certain problems with a God’s Eye View? Finally, we propose a prime directive comprising three laws and six principles for design, so that if our data becomes smart, it does so in ways that truly serve us: simple, secure, resilient, accessible, and quietly obedient.
1. Introduction
“The ultimate goal of machine production – from which, it is true, we are as yet far removed – is a system in which everything uninteresting is done by machines and human beings are reserved for the work involving variety and initiative”
~ Bertrand Russell
As our commercial operations, intellectual assets, and professional and personal contexts progressively migrate to the digital realm, the need to simply, reliably and securely manage our data becomes paramount. However, managing data in enterprises, businesses, communities and even in our homes has become intolerably complex. This complexity has the potential to become the single most pervasive destroyer of productivity in our post-industrialized society. Addressing the mechanisms driving this trend, and developing systems & infrastructures that solve these issues, creates an unprecedented opportunity for scientists, engineers, investors and entrepreneurs to make a difference.
In enterprises, massive budgets are consumed by the people hired to grapple with this complexity, yet the battle is still being lost. In small & medium businesses, it is so difficult to hire personnel with the necessary expertise to manage these chores that many functions essential to the continuation of the business, such as disaster recovery, simply go unimplemented. Most consumers don’t even back up their data, and even for those who should know better, the answer to “why not” is that it’s just too cumbersome, difficult and error-prone.
1.1 Why make data smart?
“What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it”
~ Herbert Simon
Human attention is, at least to our species, the ultimate scarce resource. Attention represents both a quantitative and a qualitative aspect of life, which is all we have. The more moments we are robbed of by the voracious appetites of our systems demanding our tending, the less life we have available to live, for whatever form our particular pursuit of happiness may take: serving others, designing products, creating works of art, scientific discovery, intellectual achievement, saving the earth, building valuable enterprises, or simply making a living.
Why we want to make data smart is clear: so that we can, as far as possible, find and freely use our data without having to constantly tend to its needs. Our systems should quietly manage themselves and become our slaves, instead of us becoming slaves to them.
What this problem needs is a cure, not more fractured and fragmented products or an endless overlay of palliatives that mask the baggage of the storage industry’s failed architectural theories, which in turn rob human beings of their time and attention in managing the current mess of fragility and incompatibility called data storage systems.
1.2 Three Laws of Smart Data
“Men have become tools of their tools”
~Henry David Thoreau
Now that we recognize we are living inside an attention economy, we might ask what other resources we can bring to bear on this problem. It doesn’t take much to realize that there are rich technological resources at our disposal that are rather more abundant: CPU cycles, Memory, Network Bandwidth and Storage Capacity.
We propose the following laws for Smart Data:
- Smart Data shall not consume the attention of a human being, or through inaction, allow a human being’s attention to be consumed, without that human being’s freely given concurrence that the cause is just and fair.
- Smart Data shall obey and faithfully execute all requests of a human being, except where such requests would conflict with the first law.
- Smart Data shall protect its own existence as long as such protection does not conflict with the first or second law.
1.3 Wicked problems
Rittel and Webber[1] suggest the following criteria for recognizing a wicked problem:
- There is no definitive formulation of a wicked problem.
- Wicked problems have no stopping rule.
- Solutions to wicked problems are not true-or-false, but good-or-bad.
- There is no immediate and no ultimate test of a solution to a wicked problem.
- Every solution to a wicked problem is a "one-shot operation"; because there is no opportunity to learn by trial-and-error, every attempt counts significantly.
- Wicked problems do not have an enumerable (or exhaustively describable) set of solutions.
- Every wicked problem is essentially unique.
- Every wicked problem can be considered to be a symptom of another problem.
- Discrepancies in representing a wicked problem can be explained in numerous ways. The choice of explanation determines the nature of the problem's resolution.
- The designer has no right to be wrong.
Note that a wicked problem[2] is not the same as an intractable problem.
Related concepts include:
- Yak shaving: any seemingly pointless activity which is actually necessary to solve a problem which solves a problem which, several levels of recursion later, solves the real problem you're working on. The term originated at MIT’s AI Lab and has been popularized by Seth Godin and others.
- Gordian Knots: Some problems only appear wicked until someone solves them. The Gordian Knot is a legend associated with Alexander the Great, used frequently as a metaphor for an intractable problem, solved by a bold action ("cutting the Gordian knot").
Wicked problems can be divergent or convergent, depending upon whether they get worse or better as we recursively explore the next level of problem to be solved.
1.4 Knowledge Warriors
If we apply our intelligence and creativity, we can conserve scarce resources by leveraging more abundant ones. Many of us devise personal strategies to counter this trend toward incessant Yak shaving, to keep our data systems clean and to conserve our productivity. This is the zone of the Knowledge Warrior.
We begin each section with the daily activities and reasonable expectations of knowledge warriors as they interact with their data, and go on to explore the connection to the deep issues related to the design of smart data. While we hope to extract useful principles for designers of smart systems to follow, we cannot hope in such a small space to provide sufficient evidence or proofs for these assertions. Therefore, connections to key references in the literature are sprinkled throughout the document, and those reading it on their computers are encouraged to explore the hyperlinks.
I make no apology for a sometimes-controversial tone, the breadth of different disciplines brought into play, the cognitive dislocations between each section, or the variability in depth and quality of the references. It is my belief that the necessary insights for making progress on this problem of data management complexity cannot be obtained by looking through the lens of a single discipline, and that the technology already exists for us to do radically better than the systems currently available on the market today.
Section 5 contains this paper’s central contribution.
2. 100PB Distributed Repository
The arithmetic for a 100PB distributed repository is rather straightforward: 12 disks per vertical sled[3], 8 sleds per panel, and 6 panels per 19” rack yields 576 (>500) disks per rack. At current 2008 capacities this yields >0.5PB per rack, so five datacenters containing 40 racks each are required for ~100PB raw capacity.
Alternatively, mobile data centers built from 20-foot shipping containers[4] (8 racks per container) yield ~5PB per container, or ~10PB per 40-foot container. Thus 10 x 40-foot or 20 x 20-foot containers are required for a 2008 100PB deployment. It is not difficult to imagine a government-scale project contemplating a 100-container deployment yielding 1EB, even in 2008. Half this many containers will be needed in 2012, and a quarter (25 x 40-foot containers = 1EB) in 2016, just 8 years from now.
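To make the multiplication explicit, here is a minimal back-of-the-envelope sketch of the rack and container arithmetic above; all inputs are the figures quoted in the text, and the rounding follows the text’s own (~5PB per 20-foot and ~10PB per 40-foot container).

    # Back-of-the-envelope arithmetic for the 2008 deployment described above.
    # All input figures come from the text; only the rounding is made explicit.

    disks_per_rack = 12 * 8 * 6                 # 12 disks/sled x 8 sleds/panel x 6 panels/rack = 576 (>500)
    pb_per_rack    = disks_per_rack * 1 / 1000  # 1TB drives in 2008 -> ~0.58 PB (">0.5PB per rack")

    pb_per_20ft_container = 8 * pb_per_rack     # ~4.6 PB, quoted as ~5 PB
    pb_per_40ft_container = 16 * pb_per_rack    # ~9.2 PB, quoted as ~10 PB

    containers_20ft = round(100 / 5)            # 20 x 20-foot containers for 100 PB
    containers_40ft = round(100 / 10)           # 10 x 40-foot containers for 100 PB
    racks_100pb     = 5 * 40                    # five datacenters of 40 racks each

    print(disks_per_rack, round(pb_per_rack, 2), containers_20ft, containers_40ft, racks_100pb)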
Table 1 : Anticipated Disk Drive Capacities
Drive class    RPM    2008    2012    2016
Capacity       7200   1TB     4TB     8TB
Performance    10K    400GB   800GB   1.6TB
High-perf.     15K    300GB   600GB   1.2TB
What may surprise us is the cost of the disk drives alone in this arithmetic exercise: normalizing for a mix of performance and capacity 3-1/2” drives, and assuming an average of $200/disk – constant over time – yields, for a 100PB deployment, approximately $26M in 2008, $13M in 2012, and $6.5M in 2016. Table 2 below projects costs for Government (100PB), Enterprise (10PB), Small & Medium Businesses (SMB) (100TB), and Personal/Home (10TB) deployments from 2008 to 2016; a short sketch of this arithmetic follows the table.
Table 2 : Anticipated Disk Drive Cost
Segment    Capacity   2008    2012    2016
Lg Gov     1EB        $260M   $130M   $65M
Sm Gov     100PB      $26M    $13M    $6.5M
Lg Ent.    10PB       $2.6M   $1.3M   $650K
Sm Ent.    1PB        $260K   $130K   $65K
SMB        100TB      $26K    $13K    $6.5K
Personal   10TB       $2K     $1K     $500
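The sketch below reproduces Table 2 from a single assumption: the $200-per-disk figure, normalized for a capacity/performance mix, works out to an effective cost per TB (about $260/TB in 2008, read off the 100PB row) that halves every four years. The per-TB figures and the halving are inferences from the table, not vendor quotes, and the smallest entries in the table are rounded (e.g. Personal 2008 appears as $2K).

    # Sketch of the scaling behind Table 2. The effective $/TB values are
    # inferred from the 100PB row of the table ($26M / 100PB in 2008) and
    # halve every four years; actual procurement prices will vary.

    EFFECTIVE_USD_PER_TB = {2008: 260, 2012: 130, 2016: 65}

    DEPLOYMENTS_TB = {
        "Lg Gov (1EB)":    1_000_000,
        "Sm Gov (100PB)":    100_000,
        "Lg Ent. (10PB)":     10_000,
        "Sm Ent. (1PB)":       1_000,
        "SMB (100TB)":           100,
        "Personal (10TB)":        10,
    }

    for segment, tb in DEPLOYMENTS_TB.items():
        row = ", ".join(f"{year}: ${tb * usd:,.0f}"
                        for year, usd in EFFECTIVE_USD_PER_TB.items())
        print(f"{segment:17s} {row}")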
Given the history of data growth and the voracious appetite of governments, industry and consumers for data storage, it is reasonable to assume that scenarios such as the above are not just possible, but inevitable in the years to come.
But this is not an accurate picture for stored data.
2.1 True Costs
The quantitative picture above may be accurate in disk drive costs, but anyone with experience in the procurement and operational management of digital storage will recognize it as a fantasy.
While disk procurement costs are in the 20c/GB range, the costs of fully configured, fully protected and disaster-recoverable data can be a staggering two or more orders of magnitude higher than this. For example, one unnamed but very large Internet company considers its class 1 (RAID 10, fully protected) storage costs to be in the range of $35-$45/GB per year. In such a scenario, if the disk drive manufacturers gave their disks away for free (disk costs = $0), it would make hardly a dent in the total cost of managing storage.
Some of this cost comes, understandably, from the packaging (racks), power supplies and computers associated with the disks to manage access to the data: a simple example of which would be Network Attached Storage (NAS) controllers, which range from one per disk to one per 48 disks. Another factor is the redundancy overhead of parity and mirroring. At the volume level, RAID represents a 30% to 60% overhead on the usable capacity. This is doubled for systems with a single remote replication site. Disk space for D2D backups of the primary data consumes 2-30 times the size of the RAID set (daily/weekly backups done by block rather than by file[5]), and with volume utilizations as low as 25% on average, we must multiply the size of the RAID set by a factor of 4 to get to the true ratio of single-instance data to raw installed capacity.
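A rough way to see how these overheads compound is sketched below; the factors are taken from the low ends of the ranges just quoted, and the way they compose is one plausible reading of the paragraph above rather than a measurement.

    # Back-of-the-envelope composition of the overheads quoted above,
    # per TB of single-instance data. All factors are from the text
    # (low/typical ends of the stated ranges); real deployments vary widely.

    single_instance_tb = 1.0
    utilization        = 0.25      # volumes average ~25% full
    raid_overhead      = 1.3       # 30%-60% parity/mirroring overhead
    remote_sites       = 2         # primary plus one remote replication site
    backup_multiple    = 2         # D2D backups: 2x-30x the primary RAID set

    volume_allocated = single_instance_tb / utilization        # 4.0 TB provisioned
    primary_raid_set = volume_allocated * raid_overhead        # ~5.2 TB raw
    replicated_raid  = primary_raid_set * remote_sites         # ~10.4 TB raw
    backup_space     = primary_raid_set * backup_multiple      # ~10.4 TB raw
    raw_installed    = replicated_raid + backup_space          # ~20.8 TB raw

    print(f"~{raw_installed:.0f} TB of raw capacity per TB of single-instance data")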
All of this can be calculated and tradeoffs in reliability, performance and power dissipation can be an art form. However, even with a worst-case scenario, the cost of all the hardware (and software to manage it) still leaves us a factor of five or more away from the actual total cost of storage. If all hardware and software vendors were to give their products away for free, it might reduce the CIO’s data storage budget by about 20%; and as bad as this ratio is, it continues to get worse, year after year, with no end in sight[6].
In order to satisfy Wall Street’s obsession with monotonically increasing quarterly returns, digital storage vendors are forced to ignore (and try to hide) the externalities their systems create: primarily, the cost of human capital, in the form of administrative effort to install and manage data storage. This is not even counting the wasted attention costs for knowledge warriors using those systems.
3. Identity & Individuality
"Those great principles of sufficient reason and of the identity of indiscernibles change the state of metaphysics. That science becomes real and demonstrative by means of these principles, whereas before it did generally consist in empty words."
~ Gottfried Leibniz
How smart data “appears” to us is affected by how easy it is to identify and manipulate it. But what is “it”? And how do we get a handle on “it” without affecting “that”? Here the wickedness begins to reveal itself.
3.1 Getting a handle on “it”
The first question is how we identify what “it” is, where its boundaries are, and what its handles might be. Knowledge warriors prefer to identify data by namespace and filename. Administrators prefer to identify data by volume(s), paired with the array(s) and backup sets they manage. Storage system architects prefer not to identify data at all, but to aggregate block containers into pools that can be sliced, diced and allocated as fungible commodities.
Each optimizes the problem to make their own life easier. Unfortunately, the knowledge warrior has the least power in this hierarchy, and ends up with the leftover problems that the designers and administrators sweep under the rug.
When it comes to managing “changes” to the data (discussed in detail in the next section), the situation begins to degenerate: knowledge warriors prefer to conceptualize change as versioning triggered by file closes. This creates several problems:
- Administrators have difficulty distinguishing changes to a volume by individual users, and have no “event” (other than a periodic schedule) to trigger backups, so whole volumes must be replicated at once.
- As the ratio between the size of the volume and the size of the changed data grows, increasing quantities of static data are copied needlessly, until the whole thing becomes intolerably inefficient (see the sketch after this list).
- As data sets grow, so does the time to do the backup – this forces the administrators to favor more efficient streaming backups on a block basis to other disks, or worse still, tapes.
- Users experience vulnerability windows, where changes are lost between the recovery time objective (RTO) and recovery point objective (RPO) imposed on them by the system administrators.
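The toy calculation below illustrates the inefficiency in the second and third items above; the volume size, change rate and throughput are made-up figures chosen only to show how little of a whole-volume backup is actually changed data.

    # Toy illustration of whole-volume backups versus the data that changed.
    # The numbers are hypothetical, chosen only to illustrate the ratio.

    volume_size_gb  = 2000     # size of the volume being backed up
    changed_gb      = 5        # data actually modified since the last backup
    throughput_mb_s = 100      # streaming backup throughput

    full_backup_hours = volume_size_gb * 1024 / throughput_mb_s / 3600
    useful_fraction   = changed_gb / volume_size_gb

    print(f"Full-volume backup: {full_backup_hours:.1f} h, "
          f"of which only {useful_fraction:.2%} is changed data")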
Palliatives are available for each of these problems: diffs instead of full backups, more frequent replication points, continuous data protection (CDP), de-duplication, etc. Each of these imposes complexity that translates into increased time and expertise required of already over-burdened administrators.
The traditional method of managing change is to identify a master copy of the data, of which all others are derivatives. Complexity creeps in when we consider what must be done when we lose the master, or when multiple users wish to share the data from different places. Trying to solve this problem in a distributed (purely peer-to-peer) fashion has its share of wickedness. But trying to solve it by extending the concept of a master copy, while seductively easier in the beginning, leads rapidly to problems which are not merely wicked but truly intractable: entangled failure models, bottlenecks and single points of failure, which lead to overall brittleness of the system.
Storage designers find the disk to be a familiar and comforting concept: it defines an entity that you can look at and hold in your hand. Its internal structure, and the operations we can perform on it, are as simple as a linear sequence of blocks from 0 to N, with “block” sizes of 512 bytes to 4KB, where N gets larger with each generation of disk technology and the operations are defined by a simple set of rules called SCSI commands.
Disk drive designers, storage area network (SAN) engineers and computer interface programmers work hard to make this abstraction reliable. They almost succeed…
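For readers who prefer code to prose, the sketch below captures the abstraction just described: a linear array of fixed-size blocks addressed 0..N-1, read and written one block at a time. It is an illustration in Python, not the SCSI command set itself, and the class and method names are ours.

    # Minimal sketch of the block-device abstraction described above:
    # a fixed number of fixed-size blocks, addressed 0..N-1, read and
    # written one block at a time. Illustrative only.

    class BlockDevice:
        def __init__(self, num_blocks: int, block_size: int = 512):
            self.num_blocks = num_blocks
            self.block_size = block_size
            self._blocks = {}                       # sparse backing store

        def read(self, lba: int) -> bytes:
            """Return the block at logical block address lba (zeros if never written)."""
            self._check(lba)
            return self._blocks.get(lba, bytes(self.block_size))

        def write(self, lba: int, data: bytes) -> None:
            """Overwrite exactly one block at lba."""
            self._check(lba)
            if len(data) != self.block_size:
                raise ValueError("writes must be exactly one block")
            self._blocks[lba] = data

        def _check(self, lba: int) -> None:
            if not 0 <= lba < self.num_blocks:      # the "abrupt" fixed-size constraint
                raise IOError("LBA out of range: the disk has a fixed size")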
Give the disk abstraction to applications programmers and we soon see its warts: no matter how big disks get, they have a fixed size[7] and limited performance, and they fail (often unpredictably). These “abrupt” constraints make systems brittle: disks fail miserably, and often at the most inopportune time. We get “file system full” messages that stop us in our tracks, massive slowdowns as the number of users accessing a particular disk goes beyond some threshold, or our data becomes lost or corrupted by bitrot[8].
Fear not: our trusty storage system designers go one level higher in their bottom-up design process and invent “volumes” for us. In principle, volumes are resilient to individual disk failures (RAID), as fast as we would like (striping), and as large as we want (concatenation). We can even make them “growable” by extending their size while they are on-line (although taking advantage of this requires a compatible file system technology).
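Continuing the illustrative sketch above (and reusing the hypothetical BlockDevice class), the toy volume below shows how those properties are layered on: mirrored pairs for resilience, striping across pairs for speed, and concatenation of further pairs for size. It is a caricature of RAID 10, not any vendor's implementation.

    # Toy volume layered over the BlockDevice sketch above: data is striped
    # across mirrored pairs of disks. Mirroring gives resilience, striping
    # gives speed; concatenation (adding pairs, with restriping) would grow
    # the volume. Purely illustrative.

    class StripedMirroredVolume:
        def __init__(self, mirror_pairs):
            self.pairs = list(mirror_pairs)           # [(primary, mirror), ...] BlockDevices

        def write(self, lba: int, data: bytes) -> None:
            primary, mirror = self.pairs[lba % len(self.pairs)]   # striping spreads the load
            local_lba = lba // len(self.pairs)
            primary.write(local_lba, data)            # mirroring writes both copies
            mirror.write(local_lba, data)

        def read(self, lba: int) -> bytes:
            primary, mirror = self.pairs[lba % len(self.pairs)]
            local_lba = lba // len(self.pairs)
            try:
                return primary.read(local_lba)
            except IOError:                           # fall back to the surviving mirror
                return mirror.read(local_lba)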