I got my first explanations about Big Data from experts who were my colleagues for a time. These passionate IT guys, surely very knowledgeable about their trade, were not always good at conveying somewhat complex concepts in a simple manner to non-specialists. Yet they did well enough to make me want to know a bit more.
I then did what I usually do: search and learn on my own. That’s how I came to buy “Big Data: A Revolution That Will Transform How We Live, Work, and Think” by Viktor Mayer-Schönberger and Kenneth Cukier.
Without turning myself into an expert, I got further in understanding what is behind big data, and gained a better appreciation of its potential and of the way it surely will “Transform How We Live, Work and Think”, as the book cover claims.
Coping with mass and mess
Big data as a computing technique can cope not only with huge amounts of data, but also with data from various sources and in various formats. It can reveal order in an incredible mess that traditional approaches could not even begin to exploit.
Big data can link together comments on Facebook, Twitter, blogs and websites with companies’ databases about a product, for example, even though the data formats are very different.
In contrast, with traditional database software, data need to be neat and comply with a predetermined format. It also requires discipline in how data are entered into a field, as the software would be unable to understand that a mistyped “honey moon” meant “honeymoon” and should be considered, computed and counted as such.
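To make the “honey moon” problem concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular big data product works: the entries, the list of known terms and the 0.8 similarity cutoff are all made-up assumptions, and the fuzzy matching comes from the standard library’s difflib module.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical free-text entries, as people might type them into a field.
entries = ["honeymoon", "honey moon", "Honeymoon", "honeymon", "business trip"]

# Hypothetical list of the categories we actually want to count.
known_terms = ["honeymoon", "business trip"]


def strict_count(values):
    """Count entries the way a rigid exact-match field would."""
    return Counter(values)


def tolerant_count(values, vocabulary, cutoff=0.8):
    """Map each entry to its closest known term, then count.

    The cutoff (0..1) is an assumed similarity threshold; entries with
    no close match are kept under their own cleaned-up spelling.
    """
    counts = Counter()
    for value in values:
        cleaned = value.strip().lower()
        match = get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
        counts[match[0] if match else cleaned] += 1
    return counts


print(strict_count(entries))
print(tolerant_count(entries, known_terms))
```

With these entries, the strict count finds a single “honeymoon” and treats the three other spellings as different things, while the tolerant count groups all four under “honeymoon”.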
Switch from causation to correlation
With big data, the obsession for the “why” (causation) will give way to the “what” (correlation) for both understanding something and making decisions.
Big data can be defined as being about what, not why
This is somewhat puzzling, as we have long been used to searching for causation. It is especially odd with predictive analytics: the system will tell us that a problem exists, but not what caused it or why it happens.
But for decision-making, knowing the “what” is often good enough; knowing the “why” is not always mandatory.
Correlation was known and used before big data. But with big data and today’s computing power, analysis is no longer limited to linear correlations: more complex non-linear correlations can be surfaced, allowing a new point of view and even a bigger picture to look at.
I like to imagine it as a huge data cube I can turn at will and look at from any perspective.
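A small sketch may help show what “linear” versus “non-linear” means here. It compares the classic Pearson coefficient, which only measures straight-line relationships, with a rank-based (Spearman) coefficient, which is one simple way among many to catch a relationship that always rises but not in a straight line. The data are invented for the illustration, and the choice of Spearman is my assumption, not a method taken from the book.

```python
import math


def pearson(xs, ys):
    """Pearson coefficient: strength of a straight-line relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Replace each value by its rank 1..n (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(xs, ys):
    """Spearman coefficient: Pearson computed on ranks instead of values."""
    return pearson(ranks(xs), ranks(ys))


# Invented data: y grows with x, but exponentially rather than linearly.
xs = [float(x) for x in range(1, 11)]
exponential = [math.exp(x) for x in xs]

# Invented data: a U-shaped relationship, symmetric around x = 5.5.
u_shape = [(x - 5.5) ** 2 for x in xs]

print("exponential, Pearson: ", pearson(xs, exponential))
print("exponential, Spearman:", spearman(xs, exponential))
print("U-shape, Pearson:     ", pearson(xs, u_shape))
```

On the exponential data, Pearson stays well below 1 while Spearman reports a perfect relationship. On the U-shaped data, Pearson comes out at zero even though y depends entirely on x, which is exactly the kind of pattern a purely linear view misses.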
Latent, inexhaustible value
Correlation frees the latent value of data; therefore, the more data, the better.
What does it mean?
Prior to big data, the limitations of data capture, storage and analysis meant that efforts tended to concentrate on data useful for answering the “why”. Now it is possible to ask a huge mass of data many different questions and find patterns, giving answers to (almost any?) “what”.
The future usage of data is not known at the moment it is collected, but with low-cost storage this is no longer a concern. Value can be generated over and over in the future, just by going through the mass of data with a new question, another piece of research… Data retain latent value until they are used, and can be used again and again without being depleted.
That is why big data is considered the new ore, one that is not even exhausted when used: it allows a kind of infinite usage. That’s why so many companies are eager to collect data, any data, lots of data.
Do not give up exactitude, but the devotion to it
For making decisions, “good enough” information is… good enough.
With massive data, inaccuracies increase, but have little influence on the big picture.
The metaphor of the telescope vs. the microscope is often used in the book: when exploring the cosmos, a big picture is good enough, even though many stars will be depicted by only a few pixels.
When looking at the big picture, we don’t need accuracy in every detail.
What the authors try to make clear is that we should not give up exactitude, but the devotion to it. There are cases where exactitude is not required and “good enough” is simply good enough.
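The idea that many imprecise data points still give a sharp big picture can be tried out with a short simulation. This is only a sketch under simple assumptions: the “true” value, the size of the errors and the number of readings are invented, and the errors are assumed to be random, with no systematic bias.

```python
import random
from statistics import fmean

TRUE_VALUE = 20.0  # hypothetical quantity being measured
NOISE_SD = 5.0     # hypothetical spread of each sloppy reading


def noisy_readings(count, rng):
    """Simulate `count` inexact readings scattered around the true value."""
    return [rng.gauss(TRUE_VALUE, NOISE_SD) for _ in range(count)]


rng = random.Random(42)  # fixed seed so the run can be repeated

few = noisy_readings(10, rng)
many = noisy_readings(100_000, rng)

# How far is the average of the readings from the true value?
error_few = abs(fmean(few) - TRUE_VALUE)
error_many = abs(fmean(many) - TRUE_VALUE)

print("error of the average, 10 readings:     ", error_few)
print("error of the average, 100,000 readings:", error_many)
```

Each individual reading can be off by several units, yet the average of the large set lands very close to the true value. The caveat is the assumption above: this works for random inaccuracies, not for errors that all lean the same way.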
Big versus little
Statistics were developed to understand what could be learned when little data and/or computing power was available. Statistics basically extrapolate the big picture from (very) few samples. “One aim of statistics is to confirm the richest findings using the smallest amount of data”.
Computing power and data techniques are nowadays so powerful that it is no longer necessary to work on samples only; the analysis can be done on the whole population (N=all).
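Here is a minimal sketch of the difference between working on a sample and working on N=all. The population, its size and the number of rare cases are invented for the example; the point is only that a small sample will often miss rare cases entirely, while a pass over the whole data set counts them exactly.

```python
import random

POPULATION_SIZE = 200_000  # hypothetical number of records
RARE_CASES = 50            # hypothetical number of rare records among them

# Build an invented population: a handful of rare records, the rest common.
population = ["rare"] * RARE_CASES + ["common"] * (POPULATION_SIZE - RARE_CASES)

rng = random.Random(7)  # fixed seed so the run can be repeated
sample = rng.sample(population, 500)

# Classic approach: count in the sample, then extrapolate to the population.
rare_in_sample = sample.count("rare")
estimated_rare = rare_in_sample / len(sample) * POPULATION_SIZE

# N=all approach: simply count over every record.
rare_in_all = population.count("rare")

print("rare cases seen in the sample:   ", rare_in_sample)
print("extrapolated from the sample:    ", estimated_rare)
print("counted over the whole data set: ", rare_in_all)
```

With 50 rare records in 200,000, a sample of 500 is expected to contain far fewer than one of them on average, so the extrapolation is either zero or a large overestimate. Counting over everything removes that guesswork.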
I was really drawn into reading “Big Data”, a well-written book for non-IT specialists. Besides giving me insight into the changes and potential of real big data, it really changed my approach to smaller data: the way I collect and analyse them, how I build my spreadsheets and how I present my findings.
My takeaways are biased, as I consider big data for “industrial”, technical data and not personal data. The book also shares insights about the risks of the usage already made of personal data, and about what could come next in terms of reduced or threatened privacy.