Practical Big Data Analytics

Product type Book
Published in Jan 2018
Publisher Packt
ISBN-13 9781783554393
Pages 412 pages
Edition 1st Edition
Author: Nataraj Dasgupta

Table of Contents (16 chapters)

Title Page
Packt Upsell
Contributors
Preface
Too Big or Not Too Big
Big Data Mining for the Masses
The Analytics Toolkit
Big Data With Hadoop
Big Data Mining with NoSQL
Spark for Big Data Analytics
An Introduction to Machine Learning Concepts
Machine Learning Deep Dive
Enterprise Data Science
Closing Thoughts on Big Data
External Data Science Resources
Other Books You May Enjoy

What is big data?


The term big is relative and can take on different meanings, in both magnitude and application, depending on the situation. A simple, although naïve, definition of big data is a large collection of information, whether stored on a personal laptop or on a large corporate server, that is non-trivial to analyze using existing or traditional tools.

Today, the industry generally treats data on the order of terabytes, petabytes, and beyond as big data. In this chapter, we will discuss what led to the emergence of the big data paradigm and its broad characteristics. Later on, we will delve into the distinct areas in detail.

A brief history of data

The history of computing is a fascinating tale of how computing technologies, from Charles Babbage’s Analytical Engine in the mid-1830s to present-day supercomputers, have led global transformations. Space limitations make it infeasible to cover every area, so this section gives only a high-level introduction to data and data storage as historical background.

Dawn of the information age

Big data has always existed. The US Library of Congress, the largest library in the world, houses 164 million items in its collection, including 24 million books and 125 million items in its non-classified collection [Source: https://www.loc.gov/about/general-information/].

Mechanical data storage arguably first started with punch cards, developed by Herman Hollerith in the 1880s. Based loosely on prior work by Basile Bouchon, who in 1725 invented punch bands to control looms, Hollerith's punch cards provided an interface for performing tabulations and even printing aggregates.

IBM pioneered the industrialization of punch cards, and they soon became the de facto choice for storing information.

Dr. Alan Turing and modern computing

Punch cards established a formidable presence, but there was still a missing element: these machines, although complex in design, could not be considered computational devices. A formal general-purpose machine versatile enough to solve a diverse set of problems was yet to be invented.

In 1936, after graduating from King’s College, Cambridge, Alan Turing published a seminal paper titled On Computable Numbers, with an Application to the Entscheidungsproblem, in which he built on Kurt Gödel's Incompleteness Theorem to formalize the notion of computation that underlies present-day digital computing.

The advent of the stored-program computer

The first implementation of a stored-program computer, a device that can hold programs in memory, was the Manchester Small-Scale Experimental Machine (SSEM), developed at the Victoria University of Manchester in 1948 [Source: https://en.wikipedia.org/wiki/Manchester_Small-Scale_Experimental_Machine]. It introduced the concept of Random Access Memory (RAM), or more generally memory, found in computers today. Prior to the SSEM, computers were fixed-program machines; that is, all functions had to be prewired into the system. The ability to store programs and data dynamically in a temporary storage device such as RAM meant that machines were no longer bound by their prewired functions, but could load and run different programs as needed.

From magnetic devices to SSDs

In the early 1950s, IBM introduced magnetic tape, which stored data by magnetizing a metallic tape. This was followed in quick succession by hard-disk drives in 1956, which used magnetic disk platters instead of tape to store data.

The first models of hard drives had a capacity of less than 4 MB, occupied the space of approximately two medium-sized refrigerators, and cost in excess of $36,000. Per megabyte, that is roughly 300 million times more expensive than today’s hard drives. Magnetized surfaces soon became the standard in secondary storage, and variations of them were used in removable media such as floppy disks, which remained common into the late 90s. Optical media such as CDs and DVDs followed.
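The cost comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a figure from the book: the capacity of 3.75 MB stands in for "less than 4 MB", and the modern price of $0.03 per GB is an illustrative assumption that varies by drive and year. The function name `cost_per_mb` is hypothetical.

```python
def cost_per_mb(price_usd: float, capacity_mb: float) -> float:
    """Return the storage cost in US dollars per megabyte."""
    return price_usd / capacity_mb


# Figures for the first hard drives, taken from the text above.
EARLY_PRICE_USD = 36_000.0
EARLY_CAPACITY_MB = 3.75  # assumption: a concrete value for "less than 4 MB"

# Assumption: an illustrative present-day price, not a quoted figure.
MODERN_PRICE_USD_PER_GB = 0.03
MB_PER_GB = 1000

early_cost = cost_per_mb(EARLY_PRICE_USD, EARLY_CAPACITY_MB)
modern_cost = MODERN_PRICE_USD_PER_GB / MB_PER_GB
ratio = early_cost / modern_cost

print(f"Early drive:  ${early_cost:,.0f} per MB")
print(f"Modern drive: ${modern_cost:.5f} per MB")
print(f"Ratio:        about {ratio / 1e6:,.0f} million times")
```

Under these assumptions the early drive works out to about $9,600 per MB, and the ratio lands in the low hundreds of millions, the same order of magnitude as the factor quoted in the text. A different assumed modern price shifts the ratio proportionally.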

Solid-state drives (SSDs), the successor to hard drives, were first invented in the mid-1950s by IBM. In contrast to hard drives, SSDs store data using non-volatile memory, which holds data in a charged silicon substrate. As there are no mechanical moving parts, the time to retrieve data stored in an SSD (seek time) is an order of magnitude shorter than for devices such as hard drives.
