You're reading from  Practical Big Data Analytics

Product type: Book
Published in: Jan 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783554393
Edition: 1st Edition
Author: Nataraj Dasgupta

Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.

Chapter 5. Big Data Mining with NoSQL

The term NoSQL was first used by Carlo Strozzi, who, in 1998, released the Strozzi NoSQL open-source relational database. In the late 2000s, new paradigms in database architecture emerged, many of which did not adhere to the strict constraints required of relational database systems. Because they did not conform to standard database conventions such as ACID compliance, these databases were soon grouped under a broad category known as NoSQL.

Each NoSQL database claims to be optimal for certain use cases. Although few of them would meet the requirements of a general-purpose database management system, they all share a small number of common themes.

In this chapter, we will visit some of the broad categories of NoSQL database management systems. We will discuss the primary drivers that initiated the migration to NoSQL database systems and how such databases solved specific business needs that led to their widespread adoption...

Why NoSQL?


The term NoSQL is generally taken to mean Not Only SQL; that is, the underlying database has properties that differ from those of common, traditional database systems. There is no clear criterion that qualifies a database as NoSQL, other than that it does not provide the characteristics of ACID compliance. It is therefore helpful to understand the ACID properties that have been the mainstay of database systems for many decades and to discuss, in brief, the significance of BASE and CAP, two other terms central to databases today.

The ACID, BASE, and CAP properties

Let's first proceed with ACID and SQL.

ACID and SQL

ACID stands for atomicity, consistency, isolation, and durability:

  • Atomicity: This indicates that a database transaction either executes in full or does not execute at all. In other words, either all the operations in a transaction are committed, that is, persisted in their entirety, or none of them are. There is no scope for a partial execution...
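The all-or-nothing behavior described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not an example from the book: the accounts table, the transfer function, and the 100-unit limit that triggers the failure are all hypothetical.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back
        # if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            # Hypothetical business rule, used to simulate a failure
            # between the debit and the credit.
            if amount > 100:
                raise ValueError("transfer limit exceeded")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 200), ("bob", 50)]
)
conn.commit()
```

Calling transfer(conn, "alice", "bob", 150) fails after the debit has been issued, yet balances(conn) still reports the original amounts, because the partial update is rolled back rather than persisted.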

NoSQL databases


In our discussion of NoSQL types and databases, we will primarily focus on the following categories of NoSQL databases:

  • In-memory databases
  • Columnar databases
  • Document-oriented databases
  • Key-value databases
  • Graph databases
  • Other NoSQL types and summary

Most types of NoSQL used in the industry today fall into one or more of these categories. The next few sections will discuss the high-level properties of each of these NoSQL offerings, their main advantages, and products in the market that fall into the respective categories.
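As a rough orientation before the individual sections, the sketch below lays out the same small dataset the way several of these data models tend to organize it. It uses plain Python structures only; the rows, field names, and the to_columns helper are hypothetical and do not correspond to any particular product's API.

```python
import json

# Hypothetical rows, used only for illustration.
rows = [
    {"id": "s1", "name": "Jack", "subject": "Physics"},
    {"id": "s2", "name": "Jill", "subject": "Physics"},
]

# Key-value: each value is an opaque blob, reachable only through its key.
key_value = {row["id"]: json.dumps(row) for row in rows}

# Document: each record stays a nested, self-describing structure.
documents = [
    {"_id": row["id"], "name": row["name"], "info": {"subject": row["subject"]}}
    for row in rows
]

def to_columns(records):
    """Columnar: pivot row-wise records into one list per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

columns = to_columns(rows)

# Graph: entities become nodes and relationships become edges.
edges = [(row["name"], "STUDIES", row["subject"]) for row in rows]
```

The columnar layout makes it cheap to scan a single field across all records, whereas the key-value layout is geared toward fetching one whole record by its key.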

In-memory databases

In-memory databases, as the name implies, leverage the computer memory; that is, the RAM, to store datasets. Before we look into how in-memory databases work, it would be worthwhile to recollect how data transfer happens in a typical computer:

Simple Data Flow Computer Hierarchy

As shown in the preceding image, data traverses from disk to memory to the CPU. This is a very high-level generalization of the exact process as there are conditions...
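One practical consequence of keeping data in RAM rather than on disk is volatility. The sketch below uses SQLite's ":memory:" mode purely as a convenient stand-in (SQLite is not a NoSQL system, and the table name and helper function are hypothetical) to contrast a RAM-only store with a file-backed one.

```python
import os
import sqlite3
import tempfile

def write_and_reopen(path):
    """Write one row, close the connection, then reopen the database
    and report how many tables named 't' are still present."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()
    conn.close()

    conn = sqlite3.connect(path)
    count = conn.execute(
        "SELECT count(*) FROM sqlite_master WHERE type = 'table' AND name = 't'"
    ).fetchone()[0]
    conn.close()
    return count

# RAM-only database: its contents disappear when the connection closes.
in_memory_tables = write_and_reopen(":memory:")

# File-backed database: its contents survive a close and reopen.
with tempfile.TemporaryDirectory() as tmp:
    on_disk_tables = write_and_reopen(os.path.join(tmp, "demo.db"))
```

Here in_memory_tables is 0 and on_disk_tables is 1, which is the trade-off in miniature: memory-resident data avoids disk access but needs some additional mechanism if it must outlive the process.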

Analyzing Nobel Laureates data with MongoDB


In the first exercise, we will use MongoDB, one of the leading document-oriented databases, to analyze Nobel Laureates from 1902-present. MongoDB provides a simple and intuitive interface to work with JSON files. As discussed earlier, JSON is a flexible format that allows representing data using a structured approach.

JSON format

Consider the following table:

Firstname | Age | Information
John      | 15  | Subject: History, Grade B
Jack      | 18  | Subject: Physics, Grade A
Jill      | 17  | Subject: Physics, Grade A+

The Information column contains multiple values, categorized under Subject and Grade. Columns that hold multiple values in this way are also known as columns with nested data.
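The nested column above maps naturally onto JSON. The following sketch builds one document per table row using Python's json module; the key names, including treating the numeric second column as an age, are assumptions made for illustration.

```python
import json

# The three rows of the table: first name, numeric column, subject, grade.
rows = [
    ("John", 15, "History", "B"),
    ("Jack", 18, "Physics", "A"),
    ("Jill", 17, "Physics", "A+"),
]

def to_document(firstname, age, subject, grade):
    """Represent one table row as a document with a nested field."""
    return {
        "firstname": firstname,
        "age": age,  # assumed meaning of the numeric column
        "information": {"subject": subject, "grade": grade},
    }

documents = [to_document(*row) for row in rows]

# Serialize the first document to JSON text.
print(json.dumps(documents[0], indent=2))

# Nested values remain individually addressable.
physics_students = [
    doc["firstname"]
    for doc in documents
    if doc["information"]["subject"] == "Physics"
]
```

Because Subject and Grade are separate keys inside information, a query can filter on either one without parsing a combined string such as "Subject: Physics, Grade A".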

Portability has been an important aspect of transferring data from one system to another. In general, ODBC connectors are used to transfer data between database systems. Another common format is CSV files with the data represented as comma-separated values. CSV files are optimal for structured...

Tracking physician payments with real-world data


Physicians and hospitals alike receive payments from various external organizations, such as pharmaceutical companies that engage sales representatives not only to educate practitioners on their products, but also to provide gifts or payments in kind or otherwise. In theory, gifts or payments made to physicians are not intended to influence their prescribing behavior, and pharmaceutical companies adopt careful measures to maintain checks and balances on payments being made to healthcare providers.

In 2010, President Obama's signature Affordable Care Act (ACA), popularly known as Obamacare, was signed into law. Enacted as part of the ACA, the Sunshine Act made the reporting of items of monetary value (provided directly or indirectly) mandatory for pharmaceutical companies and other organizations. While such rules existed in the past, the resulting reports were rarely available in the public domain. By making detailed payment records made...

The CMS Open Payments Portal


In this section, we will begin developing our application for CMS Open Payments.

The Packt Data Science VM contains all the necessary software for this tutorial. To download the VM, please refer to Chapter 3, The Analytics Toolkit.

Downloading the CMS Open Payments data

The CMS Open Payments data is available directly as a web-based download from the CMS website. We'll download the data using the Unix wget utility, but first we have to register with the CMS website to get our own API key:

  1. Go to https://openpaymentsdata.cms.gov and click on the Sign In link at the top-right of the page:

     Homepage of CMS OpenPayments

  2. Click on Sign Up:

     Sign-Up Page on CMS OpenPayments

  3. Enter your information and click on the Create My Account button:

     Sign-Up Form for CMS OpenPayments

  4. Sign in to your account:

     Signing into CMS OpenPayments

  5. Click on Manage under Packt Developer's Applications. Note that Applications here refers to apps that you may create that will query data available on the...

R Shiny platform for developers


R Shiny introduced a platform for R developers to create JavaScript-based web applications without having to write JavaScript themselves or, for that matter, even be proficient in it.

In order to build our application, we will leverage R Shiny and create an interface to connect to the CMS Open Payments data we set up in the prior section.

If you are using your own local R installation, you'll need to install a few R packages. If you are on a Linux workstation, you may also need to install some additional Linux packages. On Ubuntu Linux, for example, you'll need to install the following. If some of the packages are already present, you'll receive a message indicating that no further changes were needed for those packages:

sudo apt-get install software-properties-common libssl-dev libcurl4-openssl-dev gdebi-core rlwrap

Note

If you are using the Packt Data Science VM, you can proceed directly to developing the application as these Linux packages...

Summary


This chapter introduced the concept of NoSQL. The term has gained popularity in recent years, especially due to its relevance and direct application to big data analytics. We discussed the core terminology of NoSQL, the various types of NoSQL databases, and popular software used in the industry for such capabilities. We concluded with a couple of tutorials using MongoDB and kdb+.

We also built an application using R and R Shiny to create a dynamic web interface to interact with the data loaded in kdb+.

The next chapter will introduce another common technology in data science today, known as Spark. It is yet another toolkit that empowers data scientists across the world today.
