Big data analytics encompasses a wide range of functions related to mining, analysis, and predictive modeling on large-scale datasets. The rapid growth of information, together with technological developments, has provided a unique opportunity for individuals and enterprises across the world to derive profit and develop new capabilities, redefining traditional business models using large-scale analytics. This chapter aims to provide a gentle overview of the salient characteristics of big data, forming a foundation for subsequent chapters that will delve deeper into the various aspects of big data analytics.
In general, this book will provide both theoretical and practical hands-on experience with big data analytics systems used across the industry. The book begins with a discussion of Big Data and related platforms such as Hadoop, Spark, and NoSQL systems, moves on to Machine Learning, where both practical and theoretical topics are covered, and concludes with a thorough analysis of the use of Big Data and, more generally, Data Science in the industry. The book will cover the following topics:
- Big data platforms:
  - The Hadoop ecosystem and Spark
  - NoSQL databases such as Cassandra
  - Advanced platforms such as KDB+
- Machine learning:
  - Basic algorithms and concepts
  - Using R and scikit-learn in Python
  - Advanced tools in C/C++ and Unix
  - Real-world machine learning with neural networks
- Big data infrastructure:
  - Enterprise cloud architecture with AWS (Amazon Web Services)
  - On-premises enterprise architectures
  - High-performance computing for advanced analytics
- Business and enterprise use cases for big data analytics and machine learning
- Building a world-class big data analytics solution
To take the discussion forward, we will cover the following concepts in this chapter:
- Definition of Big Data
- Why are we talking about Big Data now if data has always existed?
- A brief history of Big Data
- Types of Big Data
- Where should you start your search for the Big Data solution?
The term big is relative and can take on different meanings, in terms of both magnitude and application, in different situations. A simple, though naïve, definition of big data is a large collection of information, whether stored on a personal laptop or a large corporate server, that is non-trivial to analyze using existing or traditional tools.
Today, the industry generally treats data in the order of terabytes or petabytes and beyond as big data. In this chapter, we will discuss what led to the emergence of the big data paradigm and its broad characteristics. Later on, we will delve into the distinct areas in detail.
The history of computing is a fascinating tale of how, starting with Charles Babbage's Analytical Engine in the mid-1830s and continuing to the present-day supercomputers, computing technologies have driven global transformations. Due to space limitations, it would be infeasible to cover all the areas, but a high-level introduction to data and the storage of data is provided for historical background.
Big data has always existed. The US Library of Congress, the largest library in the world, houses 164 million items in its collection, including 24 million books and 125 million items in its non-classified collection. [Source: https://www.loc.gov/about/general-information/].
Mechanical data storage arguably first started with punch cards, developed by Herman Hollerith in the 1880s. Based loosely on prior work by Basile Bouchon, who in 1725 invented a perforated paper tape to control looms, Hollerith's punch cards provided an interface to perform tabulations and even print aggregates.
IBM pioneered the industrialization of punch cards, and they soon became the de facto choice for storing information.
Punch cards established a formidable presence but there was still a missing element--these machines, although complex in design, could not be considered computational devices. A formal general-purpose machine that could be versatile enough to solve a diverse set of problems was yet to be invented.
In 1936, after graduating from King's College, Cambridge, Alan Turing published a seminal paper titled On Computable Numbers, with an Application to the Entscheidungsproblem, in which he built on Kurt Gödel's incompleteness theorems to formalize the notion underlying our present-day digital computing.
The first implementation of a stored-program computer, a device that can hold programs in memory, was the Manchester Small-Scale Experimental Machine (SSEM), developed at the Victoria University of Manchester in 1948 [Source: https://en.wikipedia.org/wiki/Manchester_Small-Scale_Experimental_Machine]. This introduced the concept behind RAM, Random Access Memory (or more generally, memory) in computers today. Prior to the SSEM, computers had fixed storage: all functions had to be prewired into the system. The ability to hold programs and data dynamically in a temporary storage device such as RAM meant that machines were no longer bound by what was wired into them, but could take on an arbitrary variety of programs and information.
In the early 1950s, IBM introduced magnetic tape, which used the magnetization of a coated tape to store data. This was followed in quick succession by hard-disk drives in 1956, which used magnetic disk platters instead of tape to store data.
The first models of hard drives had a capacity of less than 4 MB, occupied the space of approximately two medium-sized refrigerators, and cost in excess of $36,000--a factor of 300 million times more expensive relative to today's hard drives. Magnetized surfaces soon became the standard in secondary storage, and variations of them were implemented across removable devices such as floppy disks; optical formats such as CDs and DVDs followed later.
Solid-state drives (SSD), the successor to hard drives, were first invented in the mid-1950s by IBM. In contrast to hard drives, SSDs store data in non-volatile memory, on a charged silicon substrate. As there are no mechanical moving parts, the time to retrieve data stored on an SSD (the seek time) is an order of magnitude shorter relative to devices such as hard drives.
By the early 2000s, rapid advances in computing and storage technologies allowed users to collect and store data with unprecedented levels of efficiency. The internet added further impetus by providing a platform with virtually unlimited capacity to exchange information at a global scale. Technology advanced at a breathtaking pace, and tools such as social media, connected devices such as smartphones, and the availability of broadband connections led to major paradigm shifts and, by extension, to user participation even in remote parts of the world.
By and large, the majority of this data consists of information generated by web-based sources, such as social networks like Facebook and video sharing sites like YouTube. In big data parlance, this is also known as unstructured data; namely, data that is not in a fixed format such as a spreadsheet or the kind that can be easily stored in a traditional database system.
The simultaneous advances in computing capabilities meant that although the rate at which data was being generated was very high, it was still computationally feasible to analyze it. Machine learning algorithms that were once considered intractable, due to both data volume and algorithmic complexity, could now be run using new paradigms such as cluster or multinode processing--tasks that would earlier have necessitated special-purpose machines.
Chart of data generated per minute. Credit: DOMO Inc.
Collectively, the volume of data being generated has come to be termed big data, and analytics on it, encompassing a wide range of faculties from basic data mining to advanced machine learning, is known as big data analytics. There is, as such, no exact definition, owing to the relative nature of quantifying what is large enough for a specific use case to qualify as big data analytics. Rather, in a generic sense, performing analysis on large-scale datasets, in the order of tens or hundreds of gigabytes to petabytes, can be termed big data analytics. This can range from something as simple as finding the number of rows in a large dataset to applying a machine learning algorithm to it.
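To make the simple end of that spectrum concrete, the following minimal sketch counts the rows of a file too large to load into memory in one go by streaming it in fixed-size chunks. The filename is a hypothetical placeholder; any large delimited file would do.

```python
def count_rows(path, chunk_size=1024 * 1024):
    """Count newline-terminated rows by streaming the file in 1 MB chunks."""
    rows = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            rows += chunk.count(b"\n")  # count line endings, not bytes
    return rows

# Hypothetical dataset; memory usage stays flat regardless of file size.
print(count_rows("sales_records.csv"))
```

Because the file is read in bounded chunks, the approach works identically on a 1 MB file and a 100 GB file--keeping resource usage bounded as data grows is the defining concern of big data tooling.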
At a fundamental level, big data systems can be considered to have four major layers, each of which is indispensable. Various textbooks and articles outline different sets of layers, so the terminology can be ambiguous. Nevertheless, at a high level, the layers defined here are both intuitive and simple:
Big Data Analytics Layers
The levels are broken down as follows:
- Hardware: Servers that provide the computing backbone, storage devices that store the data, and network connectivity across different server components are some of the elements that define the hardware stack. In essence, the systems that provide the computational and storage capabilities, together with the systems that support the interoperability of these devices, form the foundational layer.
- Software: Software resources that facilitate analytics on the datasets hosted in the hardware layer, such as Hadoop and NoSQL systems, represent the next level in the big data stack. Analytics software can be classified into various subdivisions. Two of the primary high-level classifications are:
- Data mining: Software that provides facilities for aggregations, joins across datasets, and pivot tables on large datasets falls into this category. Standard NoSQL platforms such as Cassandra, Redis, and others are high-level data mining tools for big data analytics.
- Statistical analytics: Platforms that provide analytics capabilities beyond simple data mining, such as running algorithms that can range from simple regressions to advanced neural networks such as Google TensorFlow or R, fall into this category.
- Data management: Data encryption, governance, access, compliance, and other features salient to any enterprise and production environment to manage and, in some ways, reduce operational complexity form the next basic layer. Although they are less tangible than hardware or software, data management tools provide a defined framework, using which organizations can fulfill their obligations such as security and compliance.
- End user: The end user of the analytics software forms the final aspect of a big data analytics engagement. A data platform, after all, is only as good as the extent to which it can be leveraged efficiently and addresses business-specific use cases. This is where the role of the practitioner who makes use of the analytics platform to derive value comes into play. The term data scientist is often used to denote individuals who implement the underlying big data analytics capabilities while business users reap the benefits of faster access and analytics capabilities not available in traditional systems.
Data can be broadly classified as being structured, unstructured, or semi-structured. Although these distinctions have always existed, the classification of data into these categories has become more prominent with the advent of big data.
Structured data, as the name implies, indicates datasets that have a defined organizational structure, such as Microsoft Excel or CSV files. In pure database terms, the data should be representable using a schema. As an example, the following table, representing the top five happiest countries in the world as published by the United Nations in its 2017 World Happiness Index ranking, would be a typical representation of structured data.
We can clearly define the data types of the columns--Rank, Score, GDP per capita, Social support, Healthy life expectancy, Trust, Generosity, and Dystopia are numerical columns, whereas Country is represented using letters, or more specifically, strings.
Refer to the following table for a little more clarity:
[Table: the top five countries from the 2017 World Happiness Report, with columns for Country, Rank, Score, GDP per capita, Social support, Healthy life expectancy, Trust, Generosity, and Dystopia]
World Happiness Report, 2017 [Source: https://en.wikipedia.org/wiki/World_Happiness_Report#cite_note-4]
Commercial databases such as Teradata and Greenplum, as well as Redis, Cassandra, and Hive in the open source domain, are examples of technologies that provide the ability to manage and query structured data.
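To make the idea of a schema concrete, the following minimal sketch loads a table like the one above with explicitly declared column types using pandas. The filename and the subset of columns used are assumptions for illustration, not part of the published dataset.

```python
import pandas as pd

# Declare the schema up front: every column has a known, fixed type.
schema = {
    "Country": "string",
    "Rank": "int64",
    "Score": "float64",
    "GDP per capita": "float64",
    "Healthy life expectancy": "float64",
}

# Hypothetical CSV export of the happiness table above.
df = pd.read_csv("happiness_2017.csv", usecols=list(schema), dtype=schema)

# Because the types are defined, queries such as sorting are unambiguous.
print(df.sort_values("Score", ascending=False).head(5))
```

The ability to state such a schema up front is precisely what separates structured data from the unstructured kind discussed next.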
Unstructured data consists of any dataset that does not have a predefined organizational schema as in the table in the prior section. Spoken words, music, videos, and even books, including this one, would be considered unstructured. This by no means implies that the content doesn’t have organization. Indeed, a book has a table of contents, chapters, subchapters, and an index--in that sense, it follows a definite organization.
However, it would be futile to represent every word and sentence as being part of a strict set of rules. A sentence can consist of words, numbers, punctuation marks, and so on and does not have a predefined data type as spreadsheets do. To be structured, the book would need to have an exact set of characteristics in every sentence, which would be both unreasonable and impractical.
Data from social media, such as posts on Twitter, messages from friends on Facebook, and photos on Instagram, are all examples of unstructured data.
Unstructured data can be stored in various formats. It can be held as blobs or, in the case of textual data, as freeform text in a data storage medium. For textual data, technologies such as Lucene/Solr, Elasticsearch, and others are generally used for querying, indexing, and other operations.
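As a minimal illustration of why freeform text resists a fixed schema, the following sketch breaks a sentence into tokens and counts word frequencies using only the standard library. Full-text engines such as Lucene build inverted indexes over tokens extracted in a broadly similar (though far more sophisticated) way; the sample sentence here is purely illustrative.

```python
import re
from collections import Counter

# Freeform text: no predefined columns or data types, just characters.
text = ("A sentence can consist of words, numbers, punctuation marks, "
        "and so on, without any predefined data type.")

# Tokenize: lowercase the text and extract alphanumeric runs.
tokens = re.findall(r"[a-z0-9]+", text.lower())

# Word frequencies--the kind of derived structure a search index builds on.
print(Counter(tokens).most_common(5))
```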
Semi-structured data refers to data that has both the elements of an organizational schema as well as aspects that are arbitrary. A personal phone diary (increasingly rare these days!) with columns for name, address, phone number, and notes could be considered a semi-structured dataset. The user might not be aware of the addresses of all individuals and hence some of the entries may have just a phone number and vice versa.
Similarly, the column for notes may contain additional descriptive information (such as a facsimile number, name of a relative associated with the individual, and so on). It is an arbitrary field that allows the user to add complementary information. The columns for name, address, and phone number can thus be considered structured in the sense that they can be presented in a tabular format, whereas the notes section is unstructured in the sense that it may contain an arbitrary set of descriptive information that cannot be represented in the other columns in the diary.
In computing, semi-structured data is usually represented by formats, such as JSON, that can encapsulate both structured as well as schemaless or arbitrary associations, generally using key-value pairs. A more common example could be email messages, which have both a structured part, such as name of the sender, time when the message was received, and so on, that is common to all email messages and an unstructured portion represented by the body or content of the email.
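A minimal sketch of such a record, expressed as JSON via Python's standard library, is shown below. The field names are illustrative rather than any standard: the header-like fields are common to every message (the structured part), while the body is freeform text and the notes object may hold arbitrary key-value pairs.

```python
import json

message = {
    # Structured part: fields present in every message, with predictable types.
    "sender": "alice@example.com",
    "received": "2017-06-01T09:30:00Z",
    "subject": "Quarterly report",
    # Unstructured part: freeform content with no fixed internal schema.
    "body": "Hi Bob, attached is the draft we discussed last week...",
    # Arbitrary associations: keys may vary from message to message.
    "notes": {"thread_id": 42, "flagged": True},
}

print(json.dumps(message, indent=2))
```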
Platforms such as MongoDB and CouchDB are generally used to store and query semi-structured datasets.
Technology today allows us to collect data at an astounding rate--both in terms of volume and variety. There are various sources that generate data, but in the context of big data, the primary sources are as follows:
- Social networks: Arguably, the primary source of all the big data that we know of today is the social networks that have proliferated over the past 5-10 years. This is by and large unstructured data, represented by millions of social media postings and other data generated on a second-by-second basis through user interactions on the web across the world. Increasing access to the internet across the world has been a self-reinforcing driver of the growth of data in social networks.
- Media: Largely a result of the growth of social networks, media represents the millions, if not billions, of audio and visual uploads that take place on a daily basis. Videos uploaded on YouTube, music recordings on SoundCloud, and pictures posted on Instagram are prime examples of media, whose volume continues to grow in an unrestrained manner.
- Data warehouses: Companies have long invested in specialized data storage facilities commonly known as data warehouses. A data warehouse (DW) is essentially a collection of historical data that companies wish to maintain and catalog for easy retrieval, whether for internal use or regulatory purposes. As industries gradually shift toward the practice of storing data in platforms such as Hadoop and NoSQL, more and more companies are moving data from their pre-existing data warehouses to some of the newer technologies. Company emails, accounting records, databases, and internal documents are some examples of DW data that is now being offloaded onto Hadoop or Hadoop-like platforms that leverage multiple nodes to provide a highly available and fault-tolerant platform.
- Sensors: A more recent phenomenon in the space of big data has been the collection of data from sensor devices. While sensors have always existed, and industries such as oil and gas have used drilling sensors for measurements at oil rigs for many decades, the advent of wearable devices, part of the Internet of Things (IoT), such as Fitbit trackers and the Apple Watch, means that each individual can now stream data at the same rate at which a few oil rigs did just 10 years ago.
Wearable devices can collect hundreds of measurements from an individual at any given point in time. While not yet a big data problem as such, as the industry keeps evolving, sensor-related data is likely to become more akin to the kind of spontaneous data that is generated on the web through social network activities.
The topic of the 4Vs has become overused in the context of big data, to the point where it has started to lose some of its initial charm. Nevertheless, it helps to bear in mind what these Vs indicate, as they provide useful background context for any conversation about big data.
Broadly, the 4Vs indicate the following:
- Volume: The amount of data that is being generated
- Variety: The different types of data, such as textual, media, and sensor or streaming data
- Velocity: The speed at which data is being generated, such as millions of messages being exchanged at any given time across social networks
- Veracity: A more recent addition to the original 3Vs, this indicates the noise inherent in data, such as inconsistencies in recorded information that require additional validation
Finally, big data analytics refers to the practice of putting the data to work--in other words, the process of extracting useful information from large volumes of data through the use of appropriate technologies. There is no exact definition for many of the terms used to denote different types of analytics, as they can be interpreted in different ways and their meaning can hence be subjective.
Nevertheless, some are provided here to act as references or starting points to help you in forming an initial impression:
- Data mining: Data mining refers to the process of extracting information from datasets through running queries or basic summarization methods such as aggregations. Finding the top 10 products by the number of sales from a dataset containing all the sales records of one million products at an online website would be an act of mining: that is, extracting useful information from a dataset (a sketch of this query appears after this discussion). NoSQL databases such as Cassandra, Redis, and MongoDB are prime examples of tools that have strong data mining capabilities.
- Business intelligence: Business intelligence refers to tools such as Tableau, Spotfire, QlikView, and others that provide frontend dashboards to enable users to query data using a graphical interface. Dashboard products have gained in prominence in step with the growth of data as users seek to extract information. Easy-to-use interfaces with querying and visualization features that could be used universally by both technical and non-technical users set the groundwork to democratize analytical access to data.
- Statistical analytics: Statistical analytics refers to tools or platforms that allow end users to run statistical operations on datasets. These tools have traditionally existed for many years, but have gained traction with the advent of big data and the challenges that large volumes of data pose in terms of performing efficient statistical operations. Languages such as R and products such as SAS are prime examples of tools that are common names in the area of computational statistics.
- Machine learning: Machine learning, which is often referred to by various names such as predictive analytics, predictive modeling, and others, is in essence the process of applying advanced algorithms that go beyond the realm of traditional statistics. These algorithms inevitably involve running hundreds or thousands of iterations. Such algorithms are not only inherently complex, but also very computationally intensive.
The advancement in technology has been a key driver in the growth of machine learning in analytics, to the point where it has now become a commonly used term across the industry. Innovations such as self-driving cars, traffic data on maps that adjust based on traffic patterns, and digital assistants such as Siri and Cortana are examples of the commercialization of machine learning in physical products.
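To ground the data mining example from the list above, the following minimal sketch computes the top 10 products by number of sales with pandas. The filename and column name are hypothetical placeholders for whatever the sales records actually contain.

```python
import pandas as pd

# Load only the column we need from the (hypothetical) sales records.
sales = pd.read_csv("sales_records.csv", usecols=["product_id"])

# value_counts() returns per-product sale counts in descending order,
# so the first ten entries are the ten best-selling products.
top10 = sales["product_id"].value_counts().head(10)
print(top10)
```

At larger scales, the same aggregation would typically be expressed as a query on a distributed platform such as Spark or a NoSQL store, but the logic--group, count, rank--remains the same.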
Big data is undoubtedly a vast subject that can seem overly complex at first sight. Practice makes perfect, and so it is with the study of big data--the more you get involved, the more familiar the topics and terminology get, and the more comfortable the subject becomes.
A keen study of the various dimensions of the topic of big data analytics will help you develop an intuitive sense of the subject. This book aims to provide a holistic overview of the topic and will cover a broad range of areas, such as Hadoop, Spark, and NoSQL databases, as well as topics based on hardware design and cloud infrastructures. In the next chapter, we will introduce the concept of Big Data Mining and discuss the technical elements as well as the selection criteria for Big Data technologies.