Hands-On Big Data Modeling

By James Lee , Tao Wei , Suresh Kumar Mukhiya
About this book

Modeling and managing data is a central focus of all big data projects. In fact, a database is only as effective as the logical, well-thought-out data model behind it. This book will help you develop practical skills in modeling your own big data projects and improve the performance of analytical queries for your specific business requirements.

To start with, you’ll get a quick introduction to big data and understand the different data modeling and data management platforms for big data. Then you’ll work with structured and semi-structured data with the help of real-life examples. Once you’ve got to grips with the basics, you’ll use the SQL Developer Data Modeler to create your own data models containing different file types such as CSV, XML, and JSON. You’ll also learn to create graph data models and explore data modeling with streaming data using real-world datasets.

By the end of this book, you'll be able to design and develop efficient data models for varying data sizes with ease.

Publication date:
November 2018


Introduction to Big Data and Data Management

This chapter addresses the concept of big data, its sources, and its types. In addition to this, the chapter lays a theoretical foundation for data modeling and data management. Readers will also get their hands dirty by setting up a platform on which they can work with big data. The major topics discussed in this chapter are summarized as follows:

  • Discover the concept of big data and its origins
  • Learn about the various characteristics of big data
  • Discuss and explore various challenges in big data mining
  • Get familiar with big data modeling and its uses
  • Understand what big data management is and its importance and implications
  • Set up a big data platform on a local machine

The concept of big data

Digital systems are progressively intertwined with real-world activities. As a consequence, multitudes of data are recorded and reported by information systems. During the last 50 years, the growth in information systems and their capabilities to capture, curate, store, share, transfer, analyze, and visualize data has been exponential. Alongside these incredible technological advances, people and organizations depend more and more on computerized devices and information sources on the internet. The IDC Digital Universe Study of May 2010 illustrates this spectacular growth of data. The study estimated that the amount of digital information stored (on personal computers, digital cameras, servers, and sensors) already exceeded 1 zettabyte, and predicted that the digital universe would grow to 35 zettabytes by 2020. The IDC study characterizes 35 zettabytes as a stack of DVDs reaching halfway to Mars. This is what we refer to as the data explosion.

Most of the data stored in the digital universe is unstructured, and organizations face challenges in capturing, curating, and analyzing it. One of the most challenging tasks for today's organizations is extracting information and value from the data stored in their information systems. This data, which is too complex and too voluminous to be handled by a traditional DBMS, is called big data.

Big data is a term for a group of datasets so massive and complex that they become troublesome to process using on-hand database-management tools or conventional processing applications. In the current market, big data tends to refer to the use of user-behavior analytics, predictive analytics, or other advanced data-analysis methods that extract value from this new data ecosystem.

Whether it is day-to-day operational data, business data, or baseline data, if it represents a massive volume, either structured or unstructured, it is relevant to the organization. However, it is not only the size of the data that matters; it is how the organization uses it to extract the deeper insights that drive better business and strategic decisions. This voluminous data can be used to improve the quality of research, enhance process flows in an organization, prevent a particular disease, link legal citations, or combat crime. Big data is everywhere, and with the right tools it can be harnessed effectively for business analytics.

Interesting insights regarding big data

Some interesting facts related to big data and its management and analysis are listed here; the figures are taken from the sources listed in the Further reading section.

  • Almost 91% of the world's marketing leaders use customer data as big data to make business decisions.
  • Interestingly, 90% of the world's total data has been generated within the last two years.
  • 87% of people agree that capturing and distributing the right data is important for effectively measuring Return on Investment (ROI) in their own company.
  • 86% of people are willing to pay more for a great customer experience with a brand.
  • 75% of companies claim they will expand investments in big data within the next year.
  • About 70% of big data is created by individuals, but enterprises are responsible for storing and managing 80% of it.
  • 70% of businesses accept that their marketing efforts are under higher scrutiny.

Characteristics of big data

We explored the popularity of big data in the preceding section. But it is important to know what types of data can be categorized or labeled as big data. In this section, we are going to explore the various features of big data. Most of the books available on the market claim there are six such characteristics, discussed as follows:

  • Volume: Big data implies massive amounts of data. The size of the data plays a significant role in determining its value, and it is also a key factor in deciding whether a given collection of data can be judged to be big at all. Hence, volume is one of the defining attributes of big data.
Every minute, 204,000,000 emails are sent, 200,000 photos are uploaded, and 1,800,000 likes are generated on Facebook; on YouTube, 1,300,000 videos are viewed and 72 hours of video are uploaded.

The point of aggregating such massive volumes of data is that businesses and organizations collect and leverage them to improve their products and services, whether in safety, reliability, healthcare, or governance. In brief, the idea is to turn this abundant, voluminous data into some form of business advantage.

  • Velocity: This relates to the increasing speed at which big data is created, and the increasing speed at which it is stored and analyzed. Processing data in real time, matching its production rate as it is generated, is a remarkable goal of big data analytics. The term velocity generally applies to how fast data is produced and processed to satisfy demand; this is where the real potential of the data is discovered. The flow of data is massive and continuous, and the data can be stored and processed in different ways, including batch processing, near-time processing, real-time processing, and streaming:

    • Real-time processing refers to the ability to capture, store, and process the data in real time and trigger immediate action, potentially saving lives.
    • Batch processing refers to feeding a large amount of data into large machines and processing for days at a time. It is still very common today.
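The contrast between batch and real-time processing just described can be sketched in a few lines of Python. This is only an illustration with invented event values: a batch job sees the whole dataset at once, while a streaming consumer updates its result as each event arrives:

```python
def batch_average(events):
    """Batch processing: the entire dataset is available up front."""
    return sum(events) / len(events)

def streaming_average(event_stream):
    """Streaming processing: maintain a running result as each event
    arrives, without ever holding the full dataset in memory."""
    total, count = 0.0, 0
    for value in event_stream:
        total += value
        count += 1
        yield total / count  # an up-to-date result after every event

events = [4, 8, 6, 2]
print(batch_average(events))            # 5.0
print(list(streaming_average(events)))  # [4.0, 6.0, 6.0, 5.0]
```

Both arrive at the same final answer; the difference is that the streaming version can trigger action after every event, which is what makes real-time processing valuable.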
  • Variety: This refers to the many sources and types of data, whether structured, semi-structured, or unstructured. We will discuss these types of big data further in Chapter 5, Structures of Data Models. When we think of data variety, we think of the additional complexity that results from the many kinds of data we need to store, process, and combine. Data is more heterogeneous these days: BLOB image data, enterprise data, network data, video data, text data, geographic maps, computer-generated or simulated data, and social media data. We can categorize the variety of data along several dimensions, some of which are explained as follows:

    • Structural variety: This refers to the representation of the data; for example, a satellite image of wildfires from NASA is completely different from tweets sent out by people who are seeing the fire spread.
    • Media variety: This refers to the medium through which data is delivered, such as text, audio, or video.
    • Semantic variety: This comes from different assumptions about, or interpretations of, the data. For example, a person's age can be recorded qualitatively (infant, juvenile, or adult) or quantitatively (as a number).
  • Veracity: This refers to the quality of the data, and is sometimes also discussed in terms of validity or volatility. Big data can be noisy and uncertain, full of biases and abnormalities, and it can be imprecise. Data is of no value if it is not accurate: the results of a big data analysis are only as good as the data being analyzed. This creates challenges in keeping track of data quality: what has been captured, where the data came from, and how it was analyzed prior to its use.

  • Valence: This refers to connectedness. The more connected the data is, the higher its valence. A high-valence dataset is denser, which makes many regular analytical techniques very inefficient.
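Valence can be made concrete by treating records as nodes and their connections as edges: valence then corresponds to graph density. A minimal sketch in Python, using a made-up five-record dataset:

```python
def graph_density(nodes, edges):
    """Valence as graph density: observed connections divided by the
    maximum possible number of connections between the nodes."""
    n = len(nodes)
    if n < 2:
        return 0.0
    max_edges = n * (n - 1) / 2  # every pair of nodes connected
    return len(edges) / max_edges

# A toy dataset: five records and the links observed between them.
nodes = ["a", "b", "c", "d", "e"]
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")}

print(graph_density(nodes, edges))  # 4 links out of 10 possible -> 0.4
```

As the density approaches 1.0, pairwise analyses over the connections approach quadratic cost in the number of records, which is why high-valence datasets strain regular analytical techniques.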

  • Value: The term, in general, refers to the valuable insights gained from the ability to investigate and identify new patterns and trends from high-volume and cross-platform systems. The idea behind processing all this big data in the first place is to bring value to the query at hand. The final output of all the tasks is the value.

Here's a summed-up representation of the preceding content:


Sources and types of big data

We learned that big data is omnipresent and that it can benefit enterprises in one or many ways. Despite the high prevalence of big data, enterprises still struggle to process, store, analyze, and manage it using traditional data-mining tools and techniques on existing hardware and software. In this section, we are going to explore the sources of this complex and dynamic data and how we can consume it.

We can separate the sources of the data into three major categories. The following diagram shows the three major sources of big data:

Let's look into the three major sources one by one:

  • Logs generated by a machine: A lot of big data is generated by machines: real-time sensors in industrial machinery or vehicles, logs that track user behavior, environmental sensors, personal health trackers, and other sensor data. Most of this machine-generated data can be grouped into the following subcategories:
    • Click-log stream data: This is the data that is captured every time a user clicks any link on a website. A detailed analysis of this data can reveal information related to customer behavior and deep interactions of the users with the current website, as well as customers' buying patterns.
    • Gaming events log data: A user performs a set of tasks when playing any online game, and each and every move the online user makes in a game can be stored. This data can be analyzed, and the results can be helpful in understanding how end users progress through a gaming portfolio.
    • Sensor log data: This includes data from radio-frequency ID (RFID) tags, smart meters, smartwatch sensors, medical sensor devices such as heart-rate monitors, and Global Positioning System (GPS) devices. These sensor logs can be recorded and then used to analyze the actual status of the monitored subject.
    • Weblog event data: Servers, cloud infrastructures, applications, networks, and so on are in extensive use, and they record all kinds of data about their events and operation. When stored, this data can amount to massive volumes, and it can be useful in monitoring service-level agreements or predicting security breaches.
    • Point-of-sale event-log data: Almost every product these days has a unique barcode. A cashier in a retail shop or department swipes the barcode of any product when selling, and all the data associated with the product is generated and can be captured. This data can be analyzed to understand the selling pattern of a retailer.
  • Person: People generate a lot of big data through social media: status updates, tweets, photos, and media uploads. Most of these logs come from a user's interactions with a network such as the internet, and the data reveals how the user communicates with that network. These interaction logs can expose deep content-interaction models that are useful in understanding user behavior, and such analysis can be used to train a model that presents personalized recommendations of web items, such as the next news article to read or the products a user is most likely to buy. Research along these lines, including sentiment analysis and topic analysis, is very active in today's industry. Most of this data is unstructured, as it has no proper format or well-defined structure; it typically arrives as plain text, portable document format (PDF), comma-separated values (CSV), or JSON files.
  • Organization: We get a massive amount of data from organizations, in the form of transaction information in databases and structured data stored in data warehouses. This is a highly structured form of data. Organizations store their data in some type of RDBMS, such as SQL Server, Oracle, or MS Access, where the data resides in a fixed format inside the fields of a table. This organization-generated data is consumed and processed by ICT systems to support business intelligence and market analysis.
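To make the contrast between structured and semi-structured sources concrete, here is a small sketch that reads the same hypothetical user event from CSV (fixed schema, every row identical in shape) and from JSON (nested, with optional fields). The field names are invented for illustration:

```python
import csv
import io
import json

# The same user event expressed as structured CSV and semi-structured JSON.
csv_text = "user,action,item\nalice,click,product-42\n"
json_text = '{"user": "alice", "action": "click", "extra": {"referrer": "ad"}}'

# CSV: a fixed schema, so every row has the same columns.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["item"])  # product-42

# JSON: nested and optional fields, so we must probe for what is present.
event = json.loads(json_text)
print(event.get("item", "<missing>"))  # <missing>
print(event["extra"]["referrer"])      # ad
```

The CSV reader can assume the schema up front; the JSON consumer has to defend against absent or nested fields. That defensive handling is exactly the extra work that variety imposes on big data pipelines.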

Challenges of big data

There are certain key aspects that make big data very challenging to work with. In this section, we'll discuss some of them:

  • Heterogeneity: There is a great deal of diversity in the information consumed by human beings, and humans tolerate that diversity well; in fact, the nuance and richness of natural language provide valuable depth. However, machine-analysis algorithms expect consistent data and cannot understand nuance. As a consequence, data must be carefully structured as a first step in (or prior to) analysis. Computer systems work most efficiently when they can store many items that are all identical in size and structure, so the economical representation, access, and analysis of semi-structured data require further work.
  • Personal privacy: A lot of personal information is captured, stored, analyzed, and processed by internet service providers (ISPs), mobile network operators, supermarkets, local transportation, educational institutions, and medical and financial service organizations, including hospitals, banks, insurance companies, and credit card agencies. A great deal of information is also stored on social networks such as Facebook, YouTube, and Google. This illustrates that privacy is an issue whose importance, particularly to the customer, is growing as the value of big data becomes more apparent. This personal data is used by mining algorithms to personalize news content, manage ads, and provide other e-commerce advantages, which is clearly a violation of personal privacy.
  • Scale: As the name suggests, big data is massive. As data grows in size, underlying issues accompany it in terms of storage, retrieval, processing, transformation, and analysis. As mentioned in the introduction, data volume is scaling much faster than compute resources and CPU speeds, which are comparatively static.
  • Timeliness: This is concerned with speed: the larger the size of the data to be processed, the longer it will take to analyze. There are many scenarios wherein the results of the analysis are required in real time or immediately, which creates an extra challenge when building a system that can process big data in a timely manner.
  • Securing big data: Security is also a big concern for both enterprises and individuals. Big data stores can be attractive targets for hackers or advanced persistent threats. Security is therefore an essential attribute of a big data architecture, which must define how information is stored and accessed securely.

Introduction to big data modeling

Having a good idea of what big data and its characteristics are, let's now dig into what big data modeling is. Say we have a dataset that we classify as big data; before doing any analysis on it, we need to have an idea of how the data looks. The goal of data modeling is to formally explore the nature of the data, so that you can figure out what kind of storage you need and what kind of processing you can do on it.

Data modeling is a technique that helps to give meaningful insight into data by defining and categorizing it, and establishing official definitions and descriptors so that the data can be utilized by all information systems in a company.

There are at least two primary reasons for performing data modeling:

  • Strategic data modeling facilitates the overall information systems development strategy
  • Data modeling can help in the development of new databases

Data modeling for strategic planning means defining what kind of data your company processes will need, while modeling in the context of analysis focuses on representing the data that already exists and finding ways to classify it. In the case of big data, that process probably requires finding similarities between data from disparate sources and confirming that they do, in fact, describe the same thing. In either case, the end goal is to generate a representation of your data that can be replicated in your database architecture.

Uses of models

In this section, we are going to discuss why we need data models, and the main benefits we can get by studying current data models. A high-level data model illustrates the core concepts and principles of any company in a very simplistic way, employing short descriptions. One of the biggest advantages of developing the high-level model is that it helps us to arrive at common terminology and definitions of the ideas and principles.

A high-level data model utilizes simplistic graphical images to illustrate the core concepts and principles of an organization and what they mean. A database model shows the logical structure of a database, including the relationships and constraints that determine how data can be stored and accessed.

Let's consider a simple student score-recording system. A student has a First name, a Last name, and a unique identifier. Each student is associated with an institution. Each student has a Start date and other data associated with them. This is better represented with a model than with a paragraph, which is harder to follow.

Let's convert it into a model:

Model 1.1
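In case the diagram is hard to read here, the same model can be sketched as relational tables. This is only an illustration: the column names follow the attributes mentioned in the text, and sqlite3 stands in for whichever RDBMS you use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE institution (
    institution_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL
);
CREATE TABLE student (
    student_id     INTEGER PRIMARY KEY,  -- the unique identifier
    first_name     TEXT NOT NULL,
    last_name      TEXT NOT NULL,
    start_date     TEXT NOT NULL,
    institution_id INTEGER NOT NULL REFERENCES institution(institution_id)
);
""")
conn.execute("INSERT INTO institution VALUES (1, 'Example Institute')")
conn.execute("INSERT INTO student VALUES (10, 'Ada', 'Lovelace', '2018-09-01', 1)")

# Join the two tables to recover the student-institution association.
row = conn.execute("""
    SELECT s.first_name, i.name
    FROM student s JOIN institution i USING (institution_id)
""").fetchone()
print(row)  # ('Ada', 'Example Institute')
```

The foreign key from student to institution captures the "each student is associated with an institution" rule from the paragraph above in a form the database can enforce.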

Now, let's consider the preceding model. It clearly shows the relationship between students and the provider institution, and how they are stored in separate tables. It's easier to understand than a paragraph. Now let's analyze this model and see what benefits it gives us compared with textual representations:

  • Gaining insight: A detailed model shows the process from various angles. As in the preceding model, we can see how students are associated with provider institutions, the different types of plans, and when a course starts. Before starting with data modeling, it is important to do the following:
    • Understanding how the business works in order to understand data flow inside the organization.
    • Understanding what type of data is gathered and stored in the organization.
    • Understanding business processes and relationships. This knowledge guides us in building data and relationships in a data model.
  • Discussion: The detailed data model can be used for discussions with the stakeholders.
  • Knowledge transfer: This can be used as a source of documentation for instructing people or developers. Data modeling is a form of documentation, both for business stakeholders and technical experts. By providing a common vocabulary that different job roles can share, and by giving newcomers a well-thought-out business glossary, your ability to document and convey information about your business is greatly enhanced. In addition to this, the model can be used as a training aid.
  • Verification: Process models are analyzed to find errors in systems or procedures. If your requirements gathering was complete, and included the merging of data from multiple sources as well as query and reporting obligations, you would have business intelligence opportunities that were nonexistent when your data existed in silos or in haphazardly designed databases.
  • Performance analysis: A detailed model built from the data can be used to analyze the performance of the system, employing techniques such as simulation and dry runs on the model.
  • Specification: A relevant model generated from an organization's data can be used to create a Software Requirement Specification (SRS) document, which serves as a roadmap between developers and end-user stakeholders.
  • Configuration: Models constructed from data can be applied to configure a system. A detailed, precisely constructed model shows the relationships between modules and how one module can communicate with another. Any organization can use this information to set module configuration parameters, enforce interoperability among the modules, and reduce redundancy.

Introduction to managing big data

The intent of big data management is to figure out what kind of infrastructure support you will require for the data. For example, does your environment need to retain multiple replicas of the data? Do you need to do statistical computation with the data? Once these operational requirements have been determined, you can choose the right system to perform those operations. Big data management answers the following questions:

  • How do we ingest or consume the data?
  • Where and how do we store it?
  • How can we ensure as well as enforce data quality?
  • What operations do we perform on the data?
  • How can these operations be efficient?
  • How do we manage data scalability, variety, velocity, and access?
  • How can we enforce security and privacy at each stage of data modeling?

Big data management is a comprehensive concept that embraces the policies, procedures, and techniques practiced in the collection, storage, governance, organization, administration, and delivery of large repositories of data.

We will go into the details of big data management, as well as its vendors, in the next chapter, Chapter 2, Data Modeling and Management Platforms.


Importance and implications of big data modeling and management

We have seen that big data is of economic and scientific significance. It is a common scientific belief that the more data utilized in research, the greater the accuracy. Data is generated every second in real life, which means the volume of available data will never diminish, but will continue to grow. It is also important to recognize that much of this data explosion is the result of an explosion in devices located at the periphery of the network, including embedded sensors, smartphones, and tablet computers. All of this data creates new opportunities for data analysts in human genomics, healthcare, oil and gas, search, surveillance, finance, and many other areas. In this section, we are going to explore the various benefits of big data management, and in the next section we will examine the various challenges of big data management in today's market.

Benefits of big data management

As mentioned, big data is a powerful tool. Thoughtful management of big data gives substantial breakthroughs and leads to more solid business decisions. In this section, we are going to discuss several benefits of big data management:

  • Accelerates revenue: When data is managed correctly and efficiently, it yields value, and that value helps accelerate revenue for small and enterprise businesses alike.
  • Improved customer service: Several studies show that enterprises that use historical data to gain business intelligence have improved their customer service, as the mined models guide the business in overcoming bottlenecks in the current system.
  • Improves marketing: Big data analysis provides a deeper view of the business from past and current data, and indicates how to run the business in the future. This gives a guided path for delivering critical and innovative marketing solutions.
  • Increased efficiency: Identifying a new source of data has been made considerably easier with the introduction of tools such as Hadoop. These tools help businesses analyze data quickly and accelerate decision making.
  • Cost savings: Cloud-based services are getting attention these days and have been used successfully in a lot of enterprise data management. Tools such as Hadoop can be run on cloud infrastructure and are easier to operate there. Such systems help reduce costs by providing simpler interfaces on which to store, analyze, and visualize big data.
  • Improved accuracy of analytics: The accuracy and reliability of big data analytics have been improved by data-management practices. Data-management services provide a better and cheaper way to turn data into business intelligence, thus increasing the accuracy and precision of analytics.

Challenges in big data management

With a huge explosion of data in several organizations, businesses have a keen interest in exploring solutions that provide opportunities and insights to increase profits in the business. However, it is still difficult to manage and maintain big data. Some of the major challenges in the big data management process are stated as follows:

  • Expanding data stores: The enormous volume of data involved, and the fact that it keeps growing over time, makes data management very complex and challenging. Any operation performed on such a dataset is critical, because mistakes can hinder the quality and performance of the analysis, and the continuous expansion of data stores and data silos makes it very complex to move a database into an analytical solution.
  • Data and structural complexity: Enterprises typically hold both structured and unstructured data, and that data resides in a very wide range of formats, including JSON, CSV, document files, text files, and BLOB data. An enterprise generally has several thousand applications in its systems, and every one of these applications might read from and write to several distinct databases. As a result, simply cataloging what types of data an organization has in its storage systems is often extraordinarily tough.
  • Assuring data quality: Ensuring data reliability and accuracy is essential for enterprises. As mentioned, the lack of synchronization across data silos and data warehouses can make it complicated for managers to determine which parts of the data are accurate and complete. If a user enters the wrong data, the generated output is also incorrect; this is referred to as garbage in, garbage out (GIGO), and this type of error is a human error.
  • Low staffing: It is difficult to find qualified staff with decent knowledge of the problem domain. A shortage of data scientists, database administrators (DBAs), data analysts, data modelers, and other big data professionals makes the job of data management very challenging.
  • Lack of executive support: Senior managers generally do not appreciate the importance and value of good data management. It is very difficult to convince them, and to show the roadmap of how these management techniques would benefit the organization. In other words, most executive managers are satisfied with their existing solutions for the problem domain.
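The garbage in, garbage out problem mentioned above is usually tackled with validation at ingestion time, so that bad records never reach the analysis. A minimal sketch, with an invented record schema:

```python
def validate(record):
    """Return a list of quality problems for one record (empty = clean)."""
    problems = []
    if not record.get("user_id"):
        problems.append("missing user_id")
    age = record.get("age")
    if age is not None and not (0 <= age <= 120):
        problems.append(f"implausible age: {age}")
    return problems

records = [
    {"user_id": "u1", "age": 34},
    {"user_id": "",   "age": 29},   # garbage in...
    {"user_id": "u3", "age": 430},  # ...garbage out, unless rejected here
]

clean = [r for r in records if not validate(r)]
rejects = [(r, validate(r)) for r in records if validate(r)]
print(len(clean), len(rejects))  # 1 2
```

Keeping the rejects alongside the reasons they failed, rather than silently dropping them, also gives managers the provenance trail that the veracity discussion earlier in the chapter asks for.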

Setting up big data modeling platforms

In this section, we are going to set up Cloudera VM on both Windows and macOS. We are going to use this VM for most of the exercises in this book.

Getting started on Windows

We are going to install the Cloudera virtual machine (VM) on our system to get started with big data modeling. Follow these instructions to download and install the Cloudera Quickstart VM with VirtualBox on Windows:

  1. Download VirtualBox from https://www.virtualbox.org/wiki/Downloads. Once the download is complete, install VirtualBox on your computer.
  2. Download the Cloudera VM from https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-virtualbox.zip. The VM is over 4 GB, so it will take some time to download.
  3. Right-click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip and choose Extract All….
  4. Start VirtualBox.
  5. Import the VM by going to File | Import Appliance...:
  6. Click the folder icon, select cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf from the folder where you unzipped the Cloudera VM, and click Open. The following screenshot is provided for assistance:
  7. Click Next to proceed, and then click Import, as shown in the following screenshot:
  8. The VM image will be imported. As this is a big file, it can take some time:
  9. When the import is finished, launch the Cloudera VM. The cloudera-quickstart-vm-5.4.2-0 VM will appear on the left in the VirtualBox window. Select the machine and click the Start button to initiate the VM:
  10. The first boot of the Cloudera VM takes several minutes, since many Hadoop tools are loaded and started during the booting process:
  11. Once the booting process is finished, you will see the Cloudera VM desktop on the screen:
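The unzip-and-locate part of the setup (extracting the archive and finding the .ovf appliance file to feed to VirtualBox) can also be scripted. A small helper, assuming the archive name used in the text:

```python
import zipfile
from pathlib import Path

def extract_vm(archive="cloudera-quickstart-vm-5.4.2-0-virtualbox.zip",
               dest="cloudera-vm"):
    """Extract the Cloudera Quickstart archive and return the path of the
    .ovf appliance file that VirtualBox's Import Appliance dialog expects."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    ovf_files = list(Path(dest).rglob("*.ovf"))
    if not ovf_files:
        raise FileNotFoundError("no .ovf appliance found in the archive")
    return ovf_files[0]
```

This only replaces the extraction step; the actual import still happens through the VirtualBox File | Import Appliance dialog described above.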

Getting started on macOS

Setting up the Cloudera VM on a Mac is very similar to setting it up on Windows. If you have a Mac, you can set up the Cloudera Quickstart VM with VirtualBox on macOS by performing the following steps:

  1. Go to https://www.virtualbox.org/wiki/Downloads and download VirtualBox. Once downloaded, install VirtualBox on macOS.
  2. Download the Cloudera VM from https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-virtualbox.zip. The VM is over 4 GB, so it will take some time to download.
  3. Unzip the Cloudera VM. You can do this by double-clicking the cloudera-quickstart-vm-5.4.2-0-virtualbox.zip file.
  4. Start VirtualBox and begin importing. We can import the VM by going to File | Import Appliance.
  5. Click the folder icon, select cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf from the folder where you unzipped the Cloudera VM, and click Open:
  6. Click Continue to proceed:
  7. Click Import:
  8. The VM image will be imported. As this is a big file, it can take some time:
  9. When the import is finished, launch the Cloudera VM. The cloudera-quickstart-vm-5.4.2-0 VM will appear on the left in the VirtualBox window. Select the machine and click the Start button to initiate the VM:
  10. The first boot of the Cloudera VM takes several minutes, since many Hadoop tools are loaded and started during the booting process:
  11. Once the booting process is finished, you will see the Cloudera VM desktop on the screen:


Summary

Big data is ubiquitous; it can be found everywhere, from small businesses to enterprise applications. It is vitally important that this data is captured, stored, retrieved, and analyzed in the best possible way in order to obtain deeper analysis and business intelligence. In this chapter, we learned about the concept of big data and where it originates. We also discussed the different characteristics of big data and the theoretical foundations of data modeling and data management.

In addition to this, readers got the chance to install a big data platform on their local machine, on both macOS and Windows. We will be using this platform in upcoming chapters to create database models. In the next chapter, we will be discussing data-modeling and data-management platforms.


Further reading

  • The Digital Universe Decade—Are You Ready? International Data Corporation, Framingham, MA, 2010, by John Gantz and David Reinsel
  • The Human Face of Big Data by Rick Smolan, Jennifer Erwitt
  • Facts about big data, http://barnraisersllc.com/2012/12/38-big-facts-big-data-companies/
  • Big Data Concepts, Challenges and Solution in Hadoop Ecosystem, 10 October 2017, Dr. Ujjwal Agarwal
  • Big Data: A Revolution that Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger, Kenneth Cukier
  • Big Data: Principles and Best Practices of Scalable Real-time Data Systems by Nathan Marz, James Warren
  • Healthcare and Big Data: Digital Specters and Phantom Objects by Mary F.E. Ebeling

About the Authors

  • James Lee

    James Lee is a passionate software wizard working at one of the top Silicon Valley-based start-ups specializing in big data analysis. He has also worked at Google and Amazon. In his day job, he works with big data technologies, including Cassandra and Elasticsearch, and is an absolute Docker geek and IntelliJ IDEA lover. Apart from his career as a software engineer, he is keen on sharing his knowledge with others and guiding them, especially in relation to start-ups and programming. He has been teaching courses and conducting workshops on Java programming / IntelliJ IDEA since he was 21. James also enjoys skiing and swimming, and is a passionate traveler.

  • Tao Wei

    Tao Wei is a passionate software engineer who works in a leading Silicon Valley-based big data analysis company. Previously, Tao worked in big IT companies, including IBM and Cisco. He has intensive experience in designing and building distributed, large-scale systems with proven high availability and reliability. Tao has an MS degree in computer science from McGill University and many years' experience as a teaching assistant in a variety of computer science classes. In his spare time, he enjoys reading and swimming, and is a passionate photographer.

  • Suresh Kumar Mukhiya

    Suresh Kumar Mukhiya is a PhD candidate, currently affiliated with the Western Norway University of Applied Sciences (HVL). He is a big data enthusiast, specializing in information systems, model-driven software engineering, big data analysis, and artificial intelligence. He completed a master's in information systems at the Norwegian University of Science and Technology, with a thesis on process mining. He also holds a bachelor's degree in computer science and information technology (BSc.CSIT) from Tribhuvan University, Nepal, where he was decorated with the Vice-Chancellor's Award for obtaining the highest score. He is a passionate photographer and enjoys traveling.
