What is Big Data?

“Thanks to the proliferation of highly interactive websites, social networks, online financial transactions, and sensor-equipped devices, we are awash in data,” said Sam Madden, an associate professor in the Department of Electrical Engineering and Computer Science at MIT and leader of the “bigdata@CSAIL” initiative. “With the right tools, we can begin to make sense of the data and use it to solve any number of pressing societal problems -- but our existing tools are outdated and rooted in computer systems and technologies developed in the 1970s.”

The three Vs

Pick up any newspaper, business magazine or scientific journal and you’ll find a discussion of “Big Data.”  Organizations are using big data to create new products and generate insights into a wide range of phenomena.  Applications are widespread and include fraud detection, customer sentiment analysis, ad personalization, stock trading, drug discovery, health care delivery, energy efficiency, and management of computer and telecommunication networks.

While the precise etymology is unclear, the phrase “Big Data” appears to have been coined in the mid-1990s by researchers at Silicon Graphics, Inc. (SGI) to describe the rapidly increasing amount of data that organizations were handling.[1]  Since then, the amount of data being collected, stored and processed has grown exponentially, driven in part by an explosion in web-based transactions, social media and sensor use.

“Big Data” is neither a single technology nor a single industry; it is a term that applies to data that cannot be processed or analyzed using traditional techniques in a timely or cost-effective manner.  Typically, Big Data is defined in terms of three characteristics of data streams:

1)  High Volume.  Roughly 800,000 petabytes[2] of data were stored around the world in 2000; some observers expect that this figure will reach 35 zettabytes by 2020.[3]  While the overall scale of data being collected and stored is certainly impressive, the real issue is the amount of data handled by individual organizations.  A few statistics illustrate the point.  Facebook has more than one billion active users with 150 billion friend connections.[4]  Every bit of new content – news feeds, messages, events, photos and ads – is stored and tracked along with the massive amount of data contained in weblogs.  More than 500 terabytes of new data are loaded into the company’s databases every day, and its largest Hadoop cluster is capable of storing more than 100 petabytes.[5]  The need to store and process massive amounts of data is not limited to commercial concerns.  For example, the Large Hadron Collider generates ~15 petabytes of data per year – equivalent to a CD stack roughly 20 km high.[6]  Similarly, the planned Large Synoptic Survey Telescope will produce ~20 terabytes of data per night, resulting in 60 petabytes of raw data and a catalog database of 15 petabytes over ten years of operations.  The total volume of data after processing will be on the order of several hundred petabytes.[7]

2)  High Variety.  The increase in volume has been accompanied by an increase in the types of data that organizations store and process.  Until recently, attention was focused on structured data, i.e., data that are neatly formatted according to a pre-defined formal schema (e.g., a relational database).  However, most data do not fit this description.  A great deal of data is unstructured, including text, image, video, audio and sensor data.  Semi-structured data, as the name implies, is a mix of structured and unstructured elements.  This includes, for example, XML and other markup languages.

3)  High Velocity.  There are two aspects of the need for speed.  The first centers on the ability to handle data as they arrive.  While some data are generated periodically, others such as machine data are delivered in a constant stream.  Taking the Large Hadron Collider as an example again, the 150 million sensors in the facility deliver data 40 million times per second.  The second aspect relates to how fast data need to be processed.  While processing historical data for business intelligence reporting or more in-depth analysis might need to be completed within minutes or hours, other tasks are more time-sensitive.  Certain types of transactions, such as processing a trade or placing a targeted ad, require the ability to process data in milliseconds.
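
The three characteristics above can be made a little more concrete with a short Python sketch. It is illustrative only and rests on several assumptions that are not in the source: the order records, the review sentence and all variable and function names are hypothetical; the cluster fill-time estimate ignores replication, compression and deletion; and the sensor figure assumes every sensor reports on every tick, so it is best read as an upper bound. The running mean simply stands in for any computation that handles values as they arrive rather than after they are stored.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import Iterable, Iterator

TERABYTE = 10**12   # bytes, decimal units
PETABYTE = 10**15
ZETTABYTE = 10**21

# --- Volume: arithmetic on the figures quoted in item 1 ---
stored_2000_zb = (800_000 * PETABYTE) / ZETTABYTE          # 2000 total, in zettabytes
growth_factor = (35 * ZETTABYTE) / (800_000 * PETABYTE)    # 2020 projection / 2000 total
# Assumption: no replication, compression or deletion.
days_to_fill = (100 * PETABYTE) // (500 * TERABYTE)        # days of 500 TB/day to reach 100 PB

# --- Variety: the same kind of information in three shapes (hypothetical data) ---
# Structured: every row must match a schema declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'alice', 19.99)")
structured_total = conn.execute("SELECT total FROM orders WHERE id = 1").fetchone()[0]
conn.close()

# Semi-structured: tags give some structure, but fields may differ per record.
xml_doc = """<orders>
  <order id="1"><customer>alice</customer><total>19.99</total></order>
  <order id="2"><customer>bob</customer><note>Leave at the front door.</note></order>
</orders>"""
orders = [{child.tag: child.text for child in order}
          for order in ET.fromstring(xml_doc).findall("order")]

# Unstructured: free text with no schema; any structure has to be inferred.
review = "Great service, but the package arrived two days late."
word_count = len(review.split())

# --- Velocity: arrival rate and processing values as they arrive ---
# Assumption: all 150 million sensors report on each of the 40 million ticks.
readings_per_second = 150_000_000 * 40_000_000


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, without storing the stream."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count
        yield mean


means = list(running_mean([2.0, 4.0, 6.0]))

print(stored_2000_zb, growth_factor, days_to_fill)
print(structured_total, orders, word_count)
print(readings_per_second, means)
```

Under these assumptions, the 2000 total works out to 0.8 zettabytes, so the 2020 projection implies growth of roughly 44 times, and a 100-petabyte cluster would absorb 200 days of 500-terabyte daily loads. The second XML record omits the total and adds a free-text note, which is exactly the kind of irregularity a fixed relational schema would not accept without modification.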


[2] A petabyte is 10^15 or 1,000,000,000,000,000 bytes. A zettabyte is equal to 10^21 bytes.

[3] Zikopoulos, Paul C. et al.  Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data.  McGraw Hill. 2012.

[4] Annual Report, 2012.