Packt+ | Advance your knowledge in tech

You're reading from Practical Big Data Analytics

Product type Book

Published in Jan 2018

Publisher Packt

ISBN-13 9781783554393

Pages 412 pages

Edition 1st Edition

Languages

Java

Concepts

Big Data

Author (1):

Nataraj Dasgupta

Table of Contents (16) Chapters

Title Page

Packt Upsell

Contributors

Preface

Too Big or Not Too Big

Big Data Mining for the Masses

The Analytics Toolkit

Big Data With Hadoop

Big Data Mining with NoSQL

Spark for Big Data Analytics

An Introduction to Machine Learning Concepts

Machine Learning Deep Dive

Enterprise Data Science

Closing Thoughts on Big Data

External Data Science Resources

Other Books You May Enjoy

Leave a review - let other readers know what you think

What is big data?

The term big is relative and can often take on different meanings, both in terms of magnitude and applications for different situations. A simple, although naïve, definition of big data is a large collection of information, whether it is data stored in your personal laptop or a large corporate server that is non-trivial to analyze using existing or traditional tools.

Today, the industry generally treats data in the order of terabytes or petabytes and beyond as big data. In this chapter, we will discuss what led to the emergence of the big data paradigm and its broad characteristics. Later on, we will delve into the distinct areas in detail.

A brief history of data

The history of computing is a fascinating tale of how, starting with Charles Babbage’s Analytical Engine in the mid 1830s to the present-day supercomputers, computing technologies have led global transformations. Due to space limitations, it would be infeasible to cover all the areas, but a high-level introduction to data and storage of data is provided for historical background.

Dawn of the information age

Big data has always existed. The US Library of Congress, the largest library in the world, houses 164 million items in its collection, including 24 million books and 125 million items in its non-classified collection. [Source: https://www.loc.gov/about/general-information/].

Mechanical data storage arguably first started with punch cards, invented by Herman Hollerith in 1880. Based loosely on prior work by Basile Bouchon, who, in 1725 invented punch bands to control looms, Hollerith's punch cards provided an interface to perform tabulations and even printing of aggregates.

IBM pioneered the industrialization of punch cards and it soon became the de facto choice for storing information.

Dr. Alan Turing and modern computing

Punch cards established a formidable presence but there was still a missing element--these machines, although complex in design, could not be considered computational devices. A formal general-purpose machine that could be versatile enough to solve a diverse set of problems was yet to be invented.

In 1936, after graduating from King’s College, Cambridge, Turing published a seminal paper titled On Computable Numbers, with an Application to the Entscheidungsproblem, where he built on Kurt Gödel's Incompleteness Theorem to formalize the notion of our present-day digital computing.

The advent of the stored-program computer

The first implementation of a stored-program computer, a device that can hold programs in memory, was the Manchester Small-Scale Experimental Machine (SSEM), developed at the Victoria University of Manchester in 1948 [Source: https://en.wikipedia.org/wiki/Manchester_Small-Scale_Experimental_Machine]. This introduced the concept of RAM, Random Access Memory (or more generally, memory) in computers today. Prior to the SSEM, computers had fixed-storage; namely, all functions had to be prewired into the system. The ability to store data dynamically in a temporary storage device such as RAM meant that machines were no longer bound by the capacity of the storage device, but could hold an arbitrary volume of information.

From magnetic devices to SSDs

In the early 1950’s, IBM introduced magnetic tape that essentially used magnetization on a metallic tape to store data. This was followed in quick succession by hard-disk drives in 1956, which, instead of tapes, used magnetic disk platters to store data.

The first models of hard drives had a capacity of less than 4 MB, which occupied the space of approximately two medium-sized refrigerators and cost in excess of $36,000--a factor of 300 million times more expensive related to today’s hard drives. Magnetized surfaces soon became the standard in secondary storage and to date, variations of them have been implemented across various removable devices such as floppy disks in the late 90s, CDs, and DVDs.

Solid-state drives (SSD), the successor to hard drives, were first invented in the mid-1950’s by IBM. In contrast to hard drives, SSD disks stored data using non-volatile memory, which stores data using a charged silicon substrate. As there are no mechanical moving parts, the time to retrieve data stored in an SSD (seek time) is an order of magnitude faster relative to devices such as hard drives.