Because we can now review data in its native state, we can see patterns
and relationships that are not limited by prior suppositions, biases, or
assumptions.
By David J. Walton
The amount of data we are creating as a society has exploded over the
last decade. Consider this fact: Each day, we create more than 70 times
the amount of information in the Library of Congress. Or this one:
Approximately 2.5 billion Internet users generate 2.5 quintillion
(2,500,000,000,000,000,000) bytes of data every day. Why are we
producing so much data? Because we can.
Bandwidth, computer memory, and computer-processing capabilities have
improved exponentially over the last decade. By 2016, it is estimated
that the gigabyte equivalent of all movies ever made will cross global
IP networks every 3 minutes. The average smartphone now has more
computing power than NASA did when Neil Armstrong landed on the moon. At
the same time, each one of us is a walking content generator. Our use
of the Internet, social media, and mobile devices is creating a tsunami
of electronic data. And, as mobile devices get smaller, faster, and more
powerful, they will enable us to generate even more bytes of “likes”
every year.
While the mere existence of so much data is interesting on a
phenomenological level, it is not, in economic terms, worth much. The
key development of big data analytics is our growing ability to turn
this data into valuable information. In order to understand these
analytics, it is helpful to have a little background in the history of
data management and analysis.
When companies moved from paper records to electronic records, they
needed a system to store, manage, and analyze data — and thus, structured data
was born. Structured data is digital information that has been
organized into a common and intentionally designed structure or scheme.
Examples of this include stock-trading data, customer relations
management systems, customer-sales orders, supply-chain documentation,
and inventory data.
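To make the idea concrete, here is a minimal sketch in Python of a structured customer-sales-order table; the table and field names are invented for illustration and are not drawn from any particular company's system.

```python
# A minimal sketch: structured data is records that all follow the same
# predesigned scheme of fields and types. Names here are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """CREATE TABLE sales_orders (
           order_id   INTEGER PRIMARY KEY,
           customer   TEXT,
           sku        TEXT,
           quantity   INTEGER,
           unit_price REAL,
           ordered_on TEXT
       )"""
)
conn.execute(
    "INSERT INTO sales_orders VALUES (?, ?, ?, ?, ?, ?)",
    (1001, "Acme Corp", "WIDGET-42", 250, 3.75, "2014-06-01"),
)
print(conn.execute("SELECT customer, quantity FROM sales_orders").fetchall())
# -> [('Acme Corp', 250)]
```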
All digital data that could not be put in a form that was easily manipulated or analyzed became known as unstructured data.
Both humans and machines generate unstructured data. Many of today’s
software applications and electronic devices create machine data that
users do not even know about. This machine-generated data typically
contains information regarding applications’ or devices’ status and
activity. For example, smart meters automatically send data regarding
electricity usage by a household to a server located at the electric
company. Other examples of machine-generated data include search data,
network data, and health monitor or medical device data. This
machine-to-machine communication is becoming known as “the Internet of
Things.”
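As a rough sketch, a machine-generated smart-meter reading might look something like the following; the field names and values are hypothetical rather than any real meter's format or protocol.

```python
# Hypothetical smart-meter reading: data the device generates and transmits
# on its own, without the household doing anything. Field names are invented.
import json
from datetime import datetime, timezone

reading = {
    "meter_id": "SM-00417",
    "timestamp": datetime(2014, 6, 1, 14, 30, tzinfo=timezone.utc).isoformat(),
    "kwh_interval": 0.42,   # electricity used in the reporting interval
    "voltage": 121.8,
}
payload = json.dumps(reading)   # what would be sent to the utility's server
print(payload)
```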
Compared to machine-generated data, human-generated data is
far more difficult to manage and organize. It varies widely in
its structure, format, nomenclature, and style. It is also more context
dependent than any other data source. Often, it is necessary to
understand something about the data’s context in order to understand the
data itself. Examples of unstructured human-generated data include
emails, text messages, video files, and social media feeds.
As long as there has been digital data, businesses have been trying
to analyze it. In the 1970s, IBM and later Oracle developed relational database
software. A relational database is simply a table with rows and
columns that allows the user to categorize, compare, and analyze data.
Relational databases can also be linked together to form several layers
of related data. These databases are still the primary means for
businesses to analyze data. An Excel spreadsheet, for example, works much
like a single table in a basic relational database.
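A short sketch of that layering, again with invented names: two tables linked by a shared customer key and analyzed together in a single query.

```python
# Two related tables linked by a key; one query categorizes, compares, and
# sums data across both layers. Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        total       REAL
    );
    INSERT INTO customers VALUES (1, 'Acme Corp'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 937.50), (11, 2, 120.00), (12, 1, 45.25);
    """
)
rows = conn.execute(
    """SELECT c.name, COUNT(*) AS n_orders, SUM(o.total) AS revenue
       FROM orders o JOIN customers c ON c.customer_id = o.customer_id
       GROUP BY c.name"""
).fetchall()
print(rows)  # [('Acme Corp', 2, 982.75), ('Globex', 1, 120.0)]
```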
One of the key characteristics of a relational database is that it
does not analyze data in its native or original form. In other words, a
relational database requires that data be processed and structured by the
user before it is entered into the database.
The problem is that processing data in order to log it in a database
can take a long time and be very expensive. If a company wants to
analyze its customer relations data using a relational database, it
has to make an investment in IT infrastructure and staffing to build the
database, link it to other data systems within the company, and build
the interface. Because this process is so resource intensive, companies
got in the habit of saving only data that was immediately valuable, was
cost effective to analyze, or had to be kept for risk-management purposes.
It simply wasn’t worth saving data where the potential cost of
preserving, collecting, and structuring the data outweighed its
immediate analytical value.
Another limitation of having to process data before it goes into a
relational database is that decisions about the design of the database
have to be made before any data has been analyzed. In essence, you have
to ask the questions first and review the data later. Thus, the value of
a relational database depends on whether you’ve asked or anticipated
the right questions from the beginning.
What has changed over the last several years is our ability to
analyze unstructured data. Based on advances in computer technology and
computer science, we can now analyze massive amounts of data in real
time using analytics that are not limited to relational interfaces. In
other words, big data analytics lets us analyze data in its native state
without having to force it into the columns and rows of a
relational database. In a macro sense, this greatly enhances our ability
to analyze more kinds of data much faster. But even more important,
because we can review the data in its native state, we can see patterns
and relationships that are not limited by prior suppositions, biases, or
assumptions. Instead of defining what data is relevant before seeing
the data itself, as occurs with a relational database, we can now let
the data speak for itself.
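To picture the difference, here is a small sketch using invented records: nothing is loaded into a predesigned table, and the question is asked of the raw data after the fact.

```python
# A sketch of analyzing data in its native, semi-structured form: the raw
# events are never forced into predefined columns; each record keeps whatever
# fields it has, and the question is applied afterward. Sample data is invented.
import json
from collections import Counter

raw_events = [
    '{"user": "a", "action": "like", "item": "post-17"}',
    '{"user": "b", "action": "like", "item": "post-17", "via": "mobile"}',
    '{"user": "a", "action": "comment", "item": "post-9", "text": "nice"}',
]

actions = Counter(json.loads(line)["action"] for line in raw_events)
print(actions.most_common())  # [('like', 2), ('comment', 1)]
```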