Big and Bigger Data

Published in IEEE Spectrum Magazine, May 2017

Until recently, the word ‘data’ didn’t require a modifying adjective. But we must have passed a watershed moment when it started to be referred to as big data. Apparently that wasn’t quite a sufficient description, because people grasped for such bolder terms as humongous data. Sadly, it now appears that we have run out of appropriate adjectives. And yet data keeps getting bigger and bigger. So instead of mentioning data, people have begun waving their hands and talking vaguely about the cloud. This seems to be the perfect metaphor: a mystical vapor hanging over the earth, occasionally raining information on the parched recipients below. It is both unknowable and all-knowing. It has the answer to all questions, if only we knew how to ask them and how to interpret the answers.

This evolution brings two apocalyptic images to my mind. The first is of the current hypothesis in physics that all the information in a black hole resides on the surrounding event horizon, which is like the idea of the cloud. The second is of the server farms proliferating on the earth below, which remind me of Douglas Adams’s earth-sized computer in the classic novel The Hitchhiker’s Guide to the Galaxy (the source of the infamous answer ‘42’).

With these imaginary end-states in mind, I wonder: where is all this headed? Will data increase indefinitely, or is there some point of diminishing returns? Is there such a thing as enough data, or possibly too much data?

There is a popular saying going around that ‘data is the new oil.’ While I think this is an imperfect metaphor, it is true that both oil and data require refining to be useful. I’m mindful of the information pyramid described in T. S. Eliot’s poem ‘The Rock’:

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

For the purposes of discussion, let’s say that data is the ones and zeros, information is the words and images, and knowledge is what we glean or learn from that information. The critical refining step is the one between information and knowledge. In refining oil, the ratio of useful product to crude is probably not a function of the amount of crude. Not so with information: the more crude we have to deal with, the smaller the percentage of useful knowledge. What we want is the small knowledge that we obtain from the big information. As the data gets bigger, the job gets harder. The catch, however, is that unless the big information is big enough, it may not contain the small knowledge that we seek.

Knowledge inevitably increases, and data probably has to increase even faster. Fortunately, storage technology seems capable of keeping up without turning the earth into a computer, but the crunch will be on the AI and algorithms that turn data into knowledge. We’ve come a long way since Claude Shannon wrote in his classic 1948 paper on information theory: ‘Frequently the messages have meaning ... these semantic aspects of communication are irrelevant to the engineering problem.’

I’m also mindful of the propensity of drawers, closets, and hard drives eventually to become filled with useless junk.  I sometimes blame this on the second law of thermodynamics -- that entropy, i.e., disorder, always increases.  Perhaps this will ultimately be true of the cloud.  Old, useless information accumulates and is too much work to purge.  Moreover, who is to say what is useless and what is not?  Everything is in there, but everything is too much.  Entropy is maximized, full of sound and fury, signifying nothing.